Intermediate Topics: JSON Schema and Validation
As you start working with JSON in AI applications, especially when relying on LLMs to generate structured data, you’ll quickly encounter the need for data consistency and reliability. How do you ensure that the JSON an LLM outputs, or the JSON you feed into it, always adheres to a specific structure and contains the right types of data? The answer lies in JSON Schema.
This chapter will introduce you to JSON Schema, a powerful tool for defining, validating, and documenting the structure of JSON data. Mastering JSON Schema is crucial for building robust and predictable AI systems.
4.1 What is JSON Schema?
JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. Think of it as a blueprint or a contract for your JSON data. It’s written in JSON itself, making it easy to integrate into existing JSON workflows.
Why is JSON Schema particularly important in AI?
- Reliable LLM Output: When you ask an LLM for structured output (e.g., “Extract product details in JSON format”), you can provide it with a JSON Schema. The schema gives the model an explicit target format, which makes correctly structured output more likely but does not guarantee it, so the response should still be validated.
- Input Validation: Before sending data to an LLM or any downstream system, you can validate your JSON inputs against a schema to catch errors early.
- Data Quality: Enforce data types, required fields, and value constraints, ensuring the quality and consistency of your AI’s data pipelines.
- Documentation: A JSON Schema serves as excellent, machine-readable documentation for your data structures, helping developers understand the expected format.
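To make the first two points concrete, here is a minimal sketch of checking a model's reply before using it. It assumes the Python jsonschema library covered in Section 4.5; the schema, the reply strings, and the helper name parse_llm_reply are invented for illustration and do not come from any particular LLM API.

```python
import json

from jsonschema import Draft7Validator

# Hypothetical schema for the data we asked the model to extract.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
    },
    "required": ["name", "price"],
}
validator = Draft7Validator(product_schema)


def parse_llm_reply(raw_reply):
    """Return (data, problems); data is None when the reply is unusable."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        # The reply was not JSON at all (for example, it contained prose).
        return None, [f"not valid JSON: {exc.msg}"]
    problems = [error.message for error in validator.iter_errors(data)]
    return (data if not problems else None), problems


# Invented replies standing in for real model output.
good_reply = '{"name": "Desk Lamp", "price": 24.5}'
bad_reply = '{"name": "Desk Lamp", "price": "cheap"}'
not_json = "Sure! Here is the product: Desk Lamp"

for reply in (good_reply, bad_reply, not_json):
    data, problems = parse_llm_reply(reply)
    print(data, problems)
```

The sketch separates two failure modes: a reply that is not JSON at all, and a reply that parses but breaks the contract. Which of the two occurred can be useful when deciding whether to retry the request.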
4.2 Basic JSON Schema Structure
A JSON Schema is itself a JSON document. It typically starts with some metadata:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/product.schema.json",
  "title": "Product Schema",
  "description": "Schema for a product object",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Unique product identifier"
    },
    "name": {
      "type": "string",
      "description": "Name of the product"
    },
    "price": {
      "type": "number",
      "minimum": 0
    },
    "inStock": {
      "type": "boolean"
    }
  },
  "required": ["id", "name", "price"]
}
Let’s break down the key parts:
- $schema: Specifies which version of the JSON Schema standard the schema is using. This is important for validators.
- $id: A unique URI for this schema. Good practice for referencing schemas.
- title, description: Human-readable metadata. Excellent for documentation.
- type: Defines the basic data type of the JSON instance this schema applies to (e.g., "object", "array", "string", "number", "boolean", "null").
- properties: For objects, this defines the schema for each of its properties (keys). Each property’s value is itself a JSON Schema.
- required: An array of strings, listing the names of properties that must be present in a valid JSON object.
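As a rough illustration of how these keywords behave together, the sketch below loads the product schema as a Python dict and checks three sample products with the jsonschema library (introduced in Section 4.5). The sample products are invented for this example.

```python
from jsonschema import Draft7Validator

# The product schema from this section, written as a Python dict.
product_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://example.com/product.schema.json",
    "title": "Product Schema",
    "description": "Schema for a product object",
    "type": "object",
    "properties": {
        "id": {"type": "string", "description": "Unique product identifier"},
        "name": {"type": "string", "description": "Name of the product"},
        "price": {"type": "number", "minimum": 0},
        "inStock": {"type": "boolean"},
    },
    "required": ["id", "name", "price"],
}

# Raises jsonschema.SchemaError if the schema itself is malformed.
Draft7Validator.check_schema(product_schema)
validator = Draft7Validator(product_schema)

# Invented sample data.
complete = {"id": "P-1", "name": "Kettle", "price": 19.99}
missing_price = {"id": "P-2", "name": "Toaster", "inStock": True}
negative_price = {"id": "P-3", "name": "Mug", "price": -1}

for product in (complete, missing_price, negative_price):
    print(product["id"], validator.is_valid(product))
```

Note that inStock is optional here: the first product omits it and is still valid, because only the names listed under required must be present.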
4.3 Data Types and Basic Keywords
JSON Schema supports keywords for defining common data type constraints:
4.3.1 type
The type keyword defines the expected data type of a JSON value.
- "string": For text.
- "number": For integers or floating-point numbers.
- "integer": For whole numbers.
- "boolean": For true or false.
- "array": For ordered lists of values.
- "object": For unordered collections of key-value pairs.
- "null": For the null value.
You can also specify multiple types using an array: "type": ["string", "null"] means the value can be either a string or null.
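When validating from Python, the type checks run on values that have already been parsed, so it can help to see which Python type each JSON type becomes. The following sketch uses only the standard json module; the sample document is invented.

```python
import json

# One value of each JSON type, parsed with the standard library.
document = json.loads(
    '{"s": "text", "i": 3, "f": 3.5, "b": true, "a": [1, 2], "o": {}, "n": null}'
)

for key, value in document.items():
    print(key, type(value).__name__)

# A Python pitfall when writing type checks by hand: bool is a subclass of
# int, so True would pass a naive isinstance(value, int) test even though
# JSON Schema treats booleans and numbers as distinct types.
naive_integer_check = isinstance(document["b"], int)
print("naive check accepts a boolean:", naive_integer_check)
```

This is one reason to rely on a schema validator rather than hand-written isinstance checks for anything beyond trivial cases.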
4.3.2 String Keywords
- minLength, maxLength: Define the minimum and maximum length of a string.
- pattern: A regular expression that the string must match.
- format: Suggests a semantic meaning for the string (e.g., "email", "date-time", "uri"). Format checking is optional in the specification, and many validators treat format as a hint unless checking is explicitly enabled.
Example (an email-like string; the pattern is chosen so that it does not contradict the format):
{
  "type": "string",
  "minLength": 5,
  "maxLength": 50,
  "pattern": "^\\S+@\\S+$",
  "format": "email"
}
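One detail of pattern is easy to miss: JSON Schema does not anchor the regular expression implicitly, so a pattern matches if it is found anywhere in the string. The sketch below imitates that behaviour with Python's re.search. JSON Schema regular expressions follow the ECMA-262 dialect, which agrees with Python's for simple patterns like these; the sample strings are invented.

```python
import re

unanchored = r"[a-zA-Z0-9_]+"
anchored = r"^[a-zA-Z0-9_]+$"

candidates = ["user_42", "user 42", "!!!a!!!"]

for text in candidates:
    # re.search finds a match anywhere in the string, which is how an
    # unanchored JSON Schema pattern behaves.
    loose = re.search(unanchored, text) is not None
    strict = re.search(anchored, text) is not None
    print(f"{text!r}: unanchored={loose}, anchored={strict}")
```

If the intent is that the whole string consists only of the allowed characters, the ^ and $ anchors need to be written explicitly in the schema.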
4.3.3 Number Keywords
- minimum, maximum: Define the inclusive lower and upper bounds.
- exclusiveMinimum, exclusiveMaximum: Define exclusive bounds.
- multipleOf: The number must be a multiple of this value.
Example:
{
  "type": "number",
  "minimum": 0,
  "exclusiveMaximum": 100,
  "multipleOf": 0.5
}
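A fractional multipleOf deserves some care, because a validator that works on binary floating-point values can be affected by rounding. The sketch below uses plain Python arithmetic, not any validator's actual implementation, to suggest why a step of 0.5 is safe while a step of 0.1 may not be. The helper names is_multiple_float and is_multiple_decimal are hypothetical.

```python
from decimal import Decimal


def is_multiple_float(value, step):
    """Naive check using binary floating-point division."""
    quotient = value / step
    return quotient == int(quotient)


def is_multiple_decimal(value, step):
    """Exact check on the decimal text of the numbers."""
    return Decimal(str(value)) % Decimal(str(step)) == 0


# 0.5 is exactly representable in binary, so both checks agree.
print(is_multiple_float(2.5, 0.5), is_multiple_decimal(2.5, 0.5))

# 0.1 is not, so 0.3 / 0.1 evaluates to 2.9999999999999996.
print(is_multiple_float(0.3, 0.1), is_multiple_decimal(0.3, 0.1))
```

How a given validator handles this case is implementation-specific, so it is worth testing a few boundary values before relying on steps such as 0.1 or 0.01.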
4.3.4 Object Keywords
- properties: Already seen. Defines schemas for explicit properties.
- required: Already seen. Lists properties that must be present.
- additionalProperties: If false, only properties listed in properties are allowed. If true (default), other properties are allowed. Can also be a schema for additional properties.
- minProperties, maxProperties: Minimum/maximum number of properties an object can have.
Example:
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name"],
  "additionalProperties": false,
  "minProperties": 1
}
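The following sketch runs a few invented objects through this schema with the jsonschema library (Section 4.5) and reports which keyword rejected each one. It is only meant to show how the object keywords interact.

```python
from jsonschema import Draft7Validator

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
    "additionalProperties": False,
    "minProperties": 1,
}
validator = Draft7Validator(person_schema)

# Invented sample objects.
samples = {
    "only name": {"name": "Ada"},
    "name and age": {"name": "Ada", "age": 36},
    "unknown key": {"name": "Ada", "nickname": "Countess"},
    "empty object": {},
}

for label, instance in samples.items():
    # error.validator is the name of the schema keyword that failed.
    failed = sorted(error.validator for error in validator.iter_errors(instance))
    print(f"{label}: {failed or 'valid'}")
```

The empty object fails twice, once for required and once for minProperties, which shows that each keyword is evaluated independently.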
4.3.5 Array Keywords
- items: Defines the schema for all items in the array (if all items are of the same type) OR an array of schemas for items at specific positions.
- minItems, maxItems: Minimum/maximum number of items in the array.
- uniqueItems: If true, all items in the array must be unique.
- contains: A schema that at least one item in the array must match.
Example:
{
  "type": "array",
  "items": {
    "type": "string"
  },
  "minItems": 1,
  "maxItems": 5,
  "uniqueItems": true
}
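As a quick check of the array keywords, the sketch below validates a handful of invented lists against this schema using the jsonschema library (Section 4.5). For the item-level failure it also prints the position of the offending element.

```python
from jsonschema import Draft7Validator

tags_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "maxItems": 5,
    "uniqueItems": True,
}
validator = Draft7Validator(tags_schema)

# Invented sample arrays.
samples = [
    ["ai", "json"],  # valid
    [],              # violates minItems
    ["ai", "ai"],    # violates uniqueItems
    ["ai", 7],       # second item is not a string
]

for instance in samples:
    errors = list(validator.iter_errors(instance))
    if not errors:
        print(instance, "valid")
    for error in errors:
        # error.path holds the index of the failing item, if any.
        print(instance, error.validator, list(error.path))
```

For the last sample the failing keyword is reported as type rather than items, because the error comes from the subschema that items applies to each element.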
Exercise 4.3.1: Create a User Profile Schema
Create a JSON Schema for a UserProfile object. It should have the following properties and constraints:
- userId: Required, string, minLength 3, maxLength 10, pattern ^[a-zA-Z0-9]+$
- email: Required, string, format “email”
- age: Optional, integer, minimum 18, maximum 99
- interests: Optional, array of strings, minItems 1, maxItems 5, uniqueItems true
- isActive: Required, boolean
- lastLogin: Optional, string, format “date-time”
- No additional properties should be allowed.
4.4 Advanced Schema Features
4.4.1 Combining Schemas: allOf, anyOf, oneOf, not
These keywords allow for complex logical combinations of schemas.
- allOf: The data must be valid against all of the subschemas.
- anyOf: The data must be valid against at least one of the subschemas.
- oneOf: The data must be valid against exactly one of the subschemas.
- not: The data must not be valid against the given subschema.
Example (oneOf for polymorphic data):
Imagine an event log where an event can be either a "login" event or a "purchase" event, each with different properties.
{
  "title": "Event Schema",
  "oneOf": [
    {
      "title": "Login Event",
      "type": "object",
      "properties": {
        "eventType": { "type": "string", "enum": ["login"] },
        "userId": { "type": "string" },
        "timestamp": { "type": "string", "format": "date-time" },
        "ipAddress": { "type": "string", "format": "ipv4" }
      },
      "required": ["eventType", "userId", "timestamp", "ipAddress"]
    },
    {
      "title": "Purchase Event",
      "type": "object",
      "properties": {
        "eventType": { "type": "string", "enum": ["purchase"] },
        "orderId": { "type": "string" },
        "userId": { "type": "string" },
        "timestamp": { "type": "string", "format": "date-time" },
        "amount": { "type": "number", "minimum": 0 },
        "currency": { "type": "string", "enum": ["USD", "EUR"] }
      },
      "required": ["eventType", "orderId", "userId", "timestamp", "amount", "currency"]
    }
  ]
}
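The sketch below exercises this event schema with the jsonschema library (Section 4.5). The sample events are invented, and format checking is left at the library default, so the date-time and ipv4 formats are not enforced here.

```python
from jsonschema import Draft7Validator

login_event = {
    "title": "Login Event",
    "type": "object",
    "properties": {
        "eventType": {"type": "string", "enum": ["login"]},
        "userId": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "ipAddress": {"type": "string", "format": "ipv4"},
    },
    "required": ["eventType", "userId", "timestamp", "ipAddress"],
}
purchase_event = {
    "title": "Purchase Event",
    "type": "object",
    "properties": {
        "eventType": {"type": "string", "enum": ["purchase"]},
        "orderId": {"type": "string"},
        "userId": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
    },
    "required": ["eventType", "orderId", "userId", "timestamp", "amount", "currency"],
}
event_schema = {"title": "Event Schema", "oneOf": [login_event, purchase_event]}
validator = Draft7Validator(event_schema)

# Invented sample events.
login = {"eventType": "login", "userId": "u1",
         "timestamp": "2024-05-01T12:00:00Z", "ipAddress": "203.0.113.7"}
purchase = {"eventType": "purchase", "orderId": "o9", "userId": "u1",
            "timestamp": "2024-05-01T12:05:00Z", "amount": 42.0, "currency": "USD"}
# Login fields, but labelled as a purchase: matches neither branch.
mislabelled = dict(login, eventType="purchase")

for event in (login, purchase, mislabelled):
    errors = list(validator.iter_errors(event))
    print(event["eventType"], "valid" if not errors else errors[0].validator)
```

The eventType property acts as a discriminator: because each branch requires a different constant value, no event can match both branches, which is what oneOf needs in order to accept it.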
4.4.2 if/then/else
For conditional validation based on the value of a property.
Example: If country is “USA”, then state is required and must be a 2-letter code.
{
  "type": "object",
  "properties": {
    "country": { "type": "string" },
    "state": { "type": "string" }
  },
  "if": {
    "properties": { "country": { "const": "USA" } },
    "required": ["country"]
  },
  "then": {
    "properties": { "state": { "type": "string", "pattern": "^[A-Z]{2}$" } },
    "required": ["state"]
  }
}
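To see which inputs trigger the condition, the sketch below checks several invented addresses against this schema with the jsonschema library (Section 4.5).

```python
from jsonschema import Draft7Validator

address_schema = {
    "type": "object",
    "properties": {
        "country": {"type": "string"},
        "state": {"type": "string"},
    },
    "if": {
        "properties": {"country": {"const": "USA"}},
        "required": ["country"],
    },
    "then": {
        "properties": {"state": {"type": "string", "pattern": "^[A-Z]{2}$"}},
        "required": ["state"],
    },
}
validator = Draft7Validator(address_schema)

# Invented sample addresses.
samples = [
    {"country": "USA", "state": "CA"},          # condition met, state ok
    {"country": "USA"},                         # condition met, state missing
    {"country": "USA", "state": "California"},  # condition met, bad pattern
    {"country": "France"},                      # condition not met
    {"state": "ca"},                            # no country, "then" is skipped
]

results = [validator.is_valid(instance) for instance in samples]
for instance, ok in zip(samples, results):
    print(instance, "valid" if ok else "invalid")
```

The required entry inside if matters: without it, an object with no country property would satisfy the condition vacuously, and state would then be demanded for it as well.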
Exercise 4.4.1: Conditional Product Schema
Extend the Product Schema from Section 4.2. Add a new property category (string).
- If category is “electronics”, then a warrantyYears property (integer, minimum 1, maximum 5) is required.
- If category is “books”, then an isbn property (string, pattern ^(?=(?:\D*\d){10}(?:(?:\D*\d){3})?$)[\d-]+$) is required.
- For any other category, no specific additional properties are required.
4.5 JSON Schema Validation Tools
Writing a schema is only half the battle; you need tools to validate your JSON data against it.
4.5.1 Python: jsonschema library
The jsonschema library is a robust and widely used validator in Python.
Install:
pip install jsonschema
Usage:
from jsonschema import validate, ValidationError

# Your JSON Schema
user_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Simple User",
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 2},
        "age": {"type": "integer", "minimum": 0}
    },
    "required": ["name", "age"]
}

# Valid data
valid_user_data = {"name": "Charlie", "age": 30}

# Invalid data (missing 'age')
invalid_user_data_1 = {"name": "Bob"}

# Invalid data (name too short)
invalid_user_data_2 = {"name": "A", "age": 25}

print("--- Validating Data ---")

# validate() raises ValidationError when the instance breaks the schema
for user_data in (valid_user_data, invalid_user_data_1, invalid_user_data_2):
    try:
        validate(instance=user_data, schema=user_schema)
        print(f"'{user_data}' is valid.")
    except ValidationError as e:
        print(f"Validation Error for '{user_data}': {e.message}")
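The validate function reports a single error by raising an exception. When data can break several rules at once, it may be more useful to collect every problem in one pass. The sketch below does that with Draft7Validator.iter_errors from the same library; the schema mirrors the one above, and the sample user is invented.

```python
from jsonschema import Draft7Validator

user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 2},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}

validator = Draft7Validator(user_schema)

# Invented sample that breaks two rules at once.
bad_user = {"name": "A", "age": -5}

# Sort by path so the report order is stable.
errors = sorted(validator.iter_errors(bad_user), key=lambda e: list(e.path))
for error in errors:
    # error.path is the location of the offending value inside the instance.
    location = "/".join(str(part) for part in error.path) or "(root)"
    print(f"{location}: {error.message}")
```

Reporting all errors together tends to work well when the messages are fed back to an LLM for a corrected attempt, since the model sees every problem in one round.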
4.5.2 Node.js: ajv library
AJV (Another JSON Schema Validator) is a high-performance JSON Schema validator for JavaScript.
Install:
npm install ajv
Usage:
import Ajv from 'ajv';

const ajv = new Ajv();

// Your JSON Schema
const taskSchema = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Task Item",
  "type": "object",
  "properties": {
    "taskId": { "type": "string", "pattern": "^T\\d{3}$" },
    "description": { "type": "string", "minLength": 10 },
    "isComplete": { "type": "boolean" }
  },
  "required": ["taskId", "description"]
};

// Compile the schema
const validateTask = ajv.compile(taskSchema);

// Valid data
const validTask = {
  taskId: "T001",
  description: "Implement user authentication module.",
  isComplete: false
};

// Invalid data (missing 'description')
const invalidTask1 = { taskId: "T002", isComplete: true };

// Invalid data (taskId wrong pattern)
const invalidTask2 = {
  taskId: "task-003",
  description: "Fix critical bug in payment gateway.",
  isComplete: false
};

console.log("--- Validating Data ---");

// The compiled function returns a boolean and stores details in .errors
for (const task of [validTask, invalidTask1, invalidTask2]) {
  if (validateTask(task)) {
    console.log(`'${JSON.stringify(task)}' is valid.`);
  } else {
    console.log(`Validation Error for '${JSON.stringify(task)}':`);
    console.log(validateTask.errors);
  }
}
Exercise 4.5.1: Implement Validation for Your Schema
- Take your final Product Schema from Exercise 4.4.1 (including conditional logic).
- Choose either Python (jsonschema) or Node.js (ajv).
- Write a script that attempts to validate the following JSON data against your schema:
  - Valid Product (Electronics):
    { "id": "E1001", "name": "Smartphone", "price": 799.99, "category": "electronics", "warrantyYears": 2, "inStock": true }
  - Invalid Product (Electronics - missing warranty):
    { "id": "E1002", "name": "Laptop", "price": 1200.00, "category": "electronics", "inStock": false }
  - Valid Product (Books):
    { "id": "B2001", "name": "The Great Novel", "price": 15.00, "category": "books", "isbn": "978-3-16-148410-0", "inStock": true }
  - Invalid Product (Books - wrong ISBN pattern):
    { "id": "B2002", "name": "Tech Guide", "price": 25.00, "category": "books", "isbn": "ABC-123", "inStock": true }
  - Valid Product (General):
    { "id": "G3001", "name": "Coffee Mug", "price": 8.50, "category": "homeware", "inStock": true }
- For each validation attempt, print whether the data is valid or, if invalid, display the validation errors.
This exercise will give you practical experience in defining complex data structures with JSON Schema and then programmatically validating against them, a crucial skill for building reliable AI applications.