Intermediate Topics: JSON Schema and Validation

As you start working with JSON in AI applications, especially when relying on LLMs to generate structured data, you’ll quickly encounter the need for data consistency and reliability. How do you ensure that the JSON an LLM outputs, or the JSON you feed into it, always adheres to a specific structure and contains the right types of data? The answer lies in JSON Schema.

This chapter will introduce you to JSON Schema, a powerful tool for defining, validating, and documenting the structure of JSON data. Mastering JSON Schema is crucial for building robust and predictable AI systems.

4.1 What is JSON Schema?

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. Think of it as a blueprint or a contract for your JSON data. It’s written in JSON itself, making it easy to integrate into existing JSON workflows.

Why is JSON Schema particularly important in AI?

  • Reliable LLM Output: When you ask an LLM for structured output (e.g., “Extract product details in JSON format”), you can provide it with a JSON Schema. The schema guides the model toward correctly formatted data, but it does not guarantee conformance, so the output should still be validated.
  • Input Validation: Before sending data to an LLM or any downstream system, you can validate your JSON inputs against a schema to catch errors early.
  • Data Quality: Enforce data types, required fields, and value constraints, ensuring the quality and consistency of your AI’s data pipelines.
  • Documentation: A JSON Schema serves as excellent, machine-readable documentation for your data structures, helping developers understand the expected format.
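
One way to provide a schema to an LLM is to serialize it into the prompt text. The sketch below assumes a hypothetical helper, build_extraction_prompt, and a made-up product schema; the model call itself is left out because it depends on your provider.

```python
import json

# Hypothetical schema, for illustration only.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
    },
    "required": ["name", "price"],
}


def build_extraction_prompt(text: str, schema: dict) -> str:
    """Return a prompt asking for a JSON object that conforms to `schema`."""
    schema_text = json.dumps(schema, indent=2)
    return (
        "Extract the product details from the text below.\n"
        "Respond with a single JSON object that conforms to this JSON Schema:\n"
        f"{schema_text}\n\n"
        f"Text: {text}"
    )


prompt = build_extraction_prompt("The Acme kettle costs $24.99.", PRODUCT_SCHEMA)
print(prompt)
```

The prompt only steers the model. The reply is still untrusted input and should be validated against the same schema.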

4.2 Basic JSON Schema Structure

A JSON Schema is itself a JSON document. It typically starts with some metadata:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/product.schema.json",
  "title": "Product Schema",
  "description": "Schema for a product object",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Unique product identifier"
    },
    "name": {
      "type": "string",
      "description": "Name of the product"
    },
    "price": {
      "type": "number",
      "minimum": 0
    },
    "inStock": {
      "type": "boolean"
    }
  },
  "required": ["id", "name", "price"]
}

Let’s break down the key parts:

  • $schema: Specifies which version of the JSON Schema standard the schema is using. This is important for validators.
  • $id: A unique URI for this schema. Good practice for referencing schemas.
  • title, description: Human-readable metadata. Excellent for documentation.
  • type: Defines the basic data type of the JSON instance this schema applies to (e.g., "object", "array", "string", "number", "boolean", "null").
  • properties: For objects, this defines the schema for each of its properties (keys). Each property’s value is itself a JSON Schema.
  • required: An array of strings, listing the names of properties that must be present in a valid JSON object.
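
To make the role of required concrete, here is a minimal hand-rolled sketch that checks only that keyword against a trimmed copy of the product schema. The function name missing_required is an assumption made for illustration; a real validator handles every keyword, not just this one.

```python
# Illustration only: this checks the "required" keyword and nothing else.
product_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "inStock": {"type": "boolean"},
    },
    "required": ["id", "name", "price"],
}


def missing_required(instance: object, schema: dict) -> list:
    """Return the required property names that are absent from `instance`."""
    if not isinstance(instance, dict):
        raise TypeError("this schema expects a JSON object")
    return [key for key in schema.get("required", []) if key not in instance]


print(missing_required({"id": "P1", "name": "Mug", "price": 8.5}, product_schema))  # []
print(missing_required({"id": "P2", "inStock": True}, product_schema))  # ['name', 'price']
```

Note that inStock is declared under properties but not listed in required, so leaving it out is valid.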

4.3 Data Types and Basic Keywords

JSON Schema supports keywords for defining common data type constraints:

4.3.1 type

The type keyword defines the expected data type of a JSON value.

  • "string": For text.
  • "number": For integers or floating-point numbers.
  • "integer": For whole numbers.
  • "boolean": For true or false.
  • "array": For ordered lists of values.
  • "object": For unordered collections of key-value pairs.
  • "null": For the null value.

You can also specify multiple types using an array: "type": ["string", "null"] means the value can be either a string or null.
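
When JSON is parsed into Python, these type names correspond roughly to Python types, with two traps worth knowing. The sketch below is a hand-written approximation of the type keyword (the names TYPE_CHECKS and matches_type are hypothetical), assuming draft-06 or later semantics for "integer".

```python
from typing import Union


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, but a JSON boolean is not a number.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_integer(value: object) -> bool:
    # From draft-06 onward, a number with a zero fractional part (1.0) is an integer.
    return _is_number(value) and (isinstance(value, int) or value.is_integer())


TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": _is_number,
    "integer": _is_integer,
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
    "null": lambda v: v is None,
}


def matches_type(value: object, type_spec: Union[str, list]) -> bool:
    """Return True if `value` matches a `type` given as one name or a list of names."""
    names = [type_spec] if isinstance(type_spec, str) else type_spec
    return any(TYPE_CHECKS[name](value) for name in names)


print(matches_type(None, ["string", "null"]))  # True
print(matches_type(True, "integer"))           # False
print(matches_type(2.0, "integer"))            # True
print(matches_type(2.5, "integer"))            # False
```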

4.3.2 String Keywords

  • minLength, maxLength: Define the minimum and maximum length of a string.
  • pattern: A regular expression that the string must match.
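  • format: Suggests a semantic meaning for the string (e.g., "email", "date-time", "uri"). Many validators treat format as an annotation only and enforce it only when explicitly configured to, so do not rely on it without checking your validator's settings.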

Example (a string that must look like an email address):

{
  "type": "string",
  "minLength": 5,
  "maxLength": 50,
  "pattern": "^\\S+@\\S+$",
  "format": "email"
}
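
The length and pattern keywords translate fairly directly into Python. The following is a rough sketch using a hypothetical username schema and a hypothetical helper, string_errors. It ignores format, and it assumes the pattern behaves the same under Python's re module as under the ECMA 262 dialect that JSON Schema recommends, which holds for simple patterns but not for every regex.

```python
import re


def string_errors(value: str, schema: dict) -> list:
    """Check minLength, maxLength, and pattern; return a list of messages."""
    errors = []
    if "minLength" in schema and len(value) < schema["minLength"]:
        errors.append(f"shorter than {schema['minLength']} characters")
    if "maxLength" in schema and len(value) > schema["maxLength"]:
        errors.append(f"longer than {schema['maxLength']} characters")
    # `pattern` is not implicitly anchored, hence search() rather than fullmatch().
    if "pattern" in schema and re.search(schema["pattern"], value) is None:
        errors.append(f"does not match {schema['pattern']}")
    return errors


# Hypothetical schema for illustration.
username_schema = {
    "type": "string",
    "minLength": 5,
    "maxLength": 50,
    "pattern": "^[a-zA-Z0-9_]+$",
}

print(string_errors("ada_lovelace", username_schema))  # []
print(string_errors("ada!", username_schema))          # two errors: length and pattern
```

Because patterns are unanchored, a schema that omits ^ and $ accepts any string containing a match, which is a common source of overly permissive schemas.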

4.3.3 Number Keywords

  • minimum, maximum: Define the inclusive lower and upper bounds.
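  • exclusiveMinimum, exclusiveMaximum: Define exclusive bounds. From draft-06 onward these take a number; draft-04 used booleans that modified minimum and maximum.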
  • multipleOf: The number must be a multiple of this value.

Example:

{
  "type": "number",
  "minimum": 0,
  "exclusiveMaximum": 100,
  "multipleOf": 0.5
}
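
The difference between inclusive and exclusive bounds, and a floating-point pitfall with multipleOf, are easier to see in code. This is a minimal sketch with a hypothetical helper, number_errors; routing the multipleOf check through Decimal is one possible way to avoid binary rounding surprises, not necessarily what any particular validator does.

```python
from decimal import Decimal


def number_errors(value, schema: dict) -> list:
    """Check the numeric keywords described above; return a list of messages."""
    errors = []
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"must be >= {schema['minimum']}")
    if "maximum" in schema and value > schema["maximum"]:
        errors.append(f"must be <= {schema['maximum']}")
    if "exclusiveMinimum" in schema and value <= schema["exclusiveMinimum"]:
        errors.append(f"must be > {schema['exclusiveMinimum']}")
    if "exclusiveMaximum" in schema and value >= schema["exclusiveMaximum"]:
        errors.append(f"must be < {schema['exclusiveMaximum']}")
    if "multipleOf" in schema:
        # With plain floats, 0.3 % 0.1 is not exactly 0; Decimal avoids that.
        remainder = Decimal(str(value)) % Decimal(str(schema["multipleOf"]))
        if remainder != 0:
            errors.append(f"must be a multiple of {schema['multipleOf']}")
    return errors


schema = {"type": "number", "minimum": 0, "exclusiveMaximum": 100, "multipleOf": 0.5}

print(number_errors(99.5, schema))   # []
print(number_errors(100, schema))    # ['must be < 100']
print(number_errors(7.25, schema))   # ['must be a multiple of 0.5']
```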

4.3.4 Object Keywords

  • properties: Already seen. Defines schemas for explicit properties.
  • required: Already seen. Lists properties that must be present.
  • additionalProperties: If false, only properties listed in properties are allowed. If true (default), other properties are allowed. Can also be a schema for additional properties.
  • minProperties, maxProperties: Minimum/maximum number of properties an object can have.

Example:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name"],
  "additionalProperties": false,
  "minProperties": 1
}
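
Setting additionalProperties to false is especially useful with LLM output, where a model may invent extra keys. The sketch below shows the idea with a hypothetical helper, unexpected_properties. It deliberately ignores patternProperties and the schema-valued form of additionalProperties.

```python
def unexpected_properties(instance: dict, schema: dict) -> list:
    """Return the keys that `additionalProperties: false` would reject.

    Simplification: only the boolean form is handled, and patternProperties
    (which would also exempt matching keys) is not considered.
    """
    if schema.get("additionalProperties", True) is not False:
        return []
    declared = schema.get("properties", {})
    return sorted(key for key in instance if key not in declared)


person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
    "additionalProperties": False,
    "minProperties": 1,
}

print(unexpected_properties({"name": "Ada", "age": 36}, person_schema))        # []
print(unexpected_properties({"name": "Ada", "nickname": "A"}, person_schema))  # ['nickname']
```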

4.3.5 Array Keywords

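  • items: Defines the schema for all items in the array (if all items are of the same type) or, in draft-07 and earlier, an array of schemas for items at specific positions. Draft 2020-12 moves the positional form to a separate keyword, prefixItems.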
  • minItems, maxItems: Minimum/maximum number of items in the array.
  • uniqueItems: If true, all items in the array must be unique.
  • contains: A schema that at least one item in the array must match.

Example:

{
  "type": "array",
  "items": {
    "type": "string"
  },
  "minItems": 1,
  "maxItems": 5,
  "uniqueItems": true
}
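
The size and uniqueness keywords can be approximated in a few lines. This sketch uses a hypothetical helper, array_errors, and does not check the items schema. Its uniqueness test compares serialized items, which is a simplification: it would treat 1 and 1.0 as different values, whereas JSON Schema considers them equal.

```python
import json


def array_errors(items: list, schema: dict) -> list:
    """Check minItems, maxItems, and uniqueItems; return a list of messages."""
    errors = []
    if len(items) < schema.get("minItems", 0):
        errors.append(f"needs at least {schema['minItems']} item(s)")
    if "maxItems" in schema and len(items) > schema["maxItems"]:
        errors.append(f"allows at most {schema['maxItems']} item(s)")
    if schema.get("uniqueItems"):
        # Serialize each item so unhashable values (lists, dicts) can be compared.
        seen = {json.dumps(item, sort_keys=True) for item in items}
        if len(seen) != len(items):
            errors.append("items must be unique")
    return errors


tags_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "maxItems": 5,
    "uniqueItems": True,
}

print(array_errors(["ai", "json"], tags_schema))  # []
print(array_errors(["ai", "ai"], tags_schema))    # ['items must be unique']
print(array_errors([], tags_schema))              # ['needs at least 1 item(s)']
```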

Exercise 4.3.1: Create a User Profile Schema

Create a JSON Schema for a UserProfile object. It should have the following properties and constraints:

  • userId: Required, string, minLength 3, maxLength 10, pattern ^[a-zA-Z0-9]+$
  • email: Required, string, format “email”
  • age: Optional, integer, minimum 18, maximum 99
  • interests: Optional, array of strings, minItems 1, maxItems 5, uniqueItems true
  • isActive: Required, boolean
  • lastLogin: Optional, string, format “date-time”
  • No additional properties should be allowed.

4.4 Advanced Schema Features

4.4.1 Combining Schemas: allOf, anyOf, oneOf, not

These keywords allow for complex logical combinations of schemas.

  • allOf: The data must be valid against all of the subschemas.
  • anyOf: The data must be valid against at least one of the subschemas.
  • oneOf: The data must be valid against exactly one of the subschemas.
  • not: The data must not be valid against the given subschema.

Example (oneOf for polymorphic data):

Imagine an event log where an event can be either a "login" event or a "purchase" event, each with different properties.

{
  "title": "Event Schema",
  "oneOf": [
    {
      "title": "Login Event",
      "type": "object",
      "properties": {
        "eventType": { "type": "string", "enum": ["login"] },
        "userId": { "type": "string" },
        "timestamp": { "type": "string", "format": "date-time" },
        "ipAddress": { "type": "string", "format": "ipv4" }
      },
      "required": ["eventType", "userId", "timestamp", "ipAddress"]
    },
    {
      "title": "Purchase Event",
      "type": "object",
      "properties": {
        "eventType": { "type": "string", "enum": ["purchase"] },
        "orderId": { "type": "string" },
        "userId": { "type": "string" },
        "timestamp": { "type": "string", "format": "date-time" },
        "amount": { "type": "number", "minimum": 0 },
        "currency": { "type": "string", "enum": ["USD", "EUR"] }
      },
      "required": ["eventType", "orderId", "userId", "timestamp", "amount", "currency"]
    }
  ]
}
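
The four combinators reduce to familiar boolean logic once each subschema is thought of as a yes/no check. The sketch below models subschemas as plain Python predicates (is_login, is_purchase, and has_user_id are hypothetical stand-ins that test only eventType and required keys) to show how oneOf differs from anyOf.

```python
from typing import Any, Callable, List

Check = Callable[[Any], bool]


def all_of(value: Any, checks: List[Check]) -> bool:
    return all(check(value) for check in checks)


def any_of(value: Any, checks: List[Check]) -> bool:
    return any(check(value) for check in checks)


def one_of(value: Any, checks: List[Check]) -> bool:
    # Exactly one check may pass; two or more matches make the value invalid.
    return sum(1 for check in checks if check(value)) == 1


def is_login(event: dict) -> bool:
    required = {"eventType", "userId", "timestamp", "ipAddress"}
    return event.get("eventType") == "login" and required.issubset(event)


def is_purchase(event: dict) -> bool:
    required = {"eventType", "orderId", "userId", "timestamp", "amount", "currency"}
    return event.get("eventType") == "purchase" and required.issubset(event)


def has_user_id(event: dict) -> bool:
    return "userId" in event


login = {
    "eventType": "login",
    "userId": "u1",
    "timestamp": "2024-01-01T00:00:00Z",
    "ipAddress": "192.0.2.1",
}

print(one_of(login, [is_login, is_purchase]))  # True: only the login check passes
print(one_of(login, [is_login, has_user_id]))  # False: both checks pass
print(any_of(login, [is_login, has_user_id]))  # True
```

The second call shows why oneOf subschemas should be mutually exclusive. In the event schema, the distinct eventType values guarantee that.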

4.4.2 if/then/else

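The if, then, and else keywords apply conditional validation. When the instance is valid against the if subschema, it must also be valid against then; otherwise it must be valid against else, if one is given.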

Example: If country is “USA”, then state is required and must be a 2-letter code.

{
  "type": "object",
  "properties": {
    "country": { "type": "string" },
    "state": { "type": "string" }
  },
  "if": {
    "properties": { "country": { "const": "USA" } },
    "required": ["country"]
  },
  "then": {
    "properties": { "state": { "type": "string", "pattern": "^[A-Z]{2}$" } },
    "required": ["state"]
  }
}
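
Read as ordinary control flow, the conditional part of this schema looks roughly like the sketch below. The helper address_errors is hypothetical and covers only the if/then logic, not the base type checks on country and state.

```python
import re


def address_errors(address: dict) -> list:
    """Mirror the if/then logic: USA addresses need a two-letter state."""
    errors = []
    # "if": country is present and equals "USA".
    if address.get("country") == "USA":
        # "then": state is required and must be two uppercase letters.
        if "state" not in address:
            errors.append("state is required when country is USA")
        else:
            state = address["state"]
            # The schema pattern is anchored at both ends, so a full match is equivalent.
            if not isinstance(state, str) or re.fullmatch(r"[A-Z]{2}", state) is None:
                errors.append("state must be a 2-letter uppercase code")
    # There is no "else": other countries add no extra constraints.
    return errors


print(address_errors({"country": "USA", "state": "CA"}))      # []
print(address_errors({"country": "USA"}))                      # state is required
print(address_errors({"country": "USA", "state": "Calif"}))   # bad state code
print(address_errors({"country": "Canada"}))                   # []
print(address_errors({"state": "California"}))                 # []
```

The last call returns no errors because the if subschema lists country under required. Without that, an object with no country property would satisfy if vacuously and then would apply to it.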

Exercise 4.4.1: Conditional Product Schema

Extend the Product Schema from Section 4.2. Add a new property category (string).

  • If category is “electronics”, then a warrantyYears property (integer, minimum 1, maximum 5) is required.
  • If category is “books”, then an isbn property (string, pattern ^(?=(?:\D*\d){10}(?:(?:\D*\d){3})?$)[\d-]+$) is required. When you write this pattern inside a JSON string, escape each backslash as \\.
  • For any other category, no specific additional properties are required.

4.5 JSON Schema Validation Tools

Writing a schema is only half the battle; you need tools to validate your JSON data against it.

4.5.1 Python: jsonschema library

The jsonschema library is a robust and widely used validator in Python.

  1. Install: pip install jsonschema

  2. Usage:

    from jsonschema import validate, ValidationError

    # Your JSON Schema
    user_schema = {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "title": "Simple User",
      "type": "object",
      "properties": {
        "name": {"type": "string", "minLength": 2},
        "age": {"type": "integer", "minimum": 0}
      },
      "required": ["name", "age"]
    }

    # Valid data
    valid_user_data = {"name": "Charlie", "age": 30}

    # Invalid data (missing 'age')
    invalid_user_data_1 = {"name": "Bob"}

    # Invalid data (name too short)
    invalid_user_data_2 = {"name": "A", "age": 25}

    print("--- Validating Data ---")

    # validate() returns None on success and raises ValidationError
    # when the instance does not conform to the schema.
    for user_data in (valid_user_data, invalid_user_data_1, invalid_user_data_2):
        try:
            validate(instance=user_data, schema=user_schema)
            print(f"'{user_data}' is valid.")
        except ValidationError as e:
            print(f"Validation Error for '{user_data}': {e.message}")
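
Note that validate() works on already-parsed Python objects, not on raw JSON text. When the data arrives as a string, for example an LLM reply, it has to be parsed first, and that step can fail on its own. A minimal sketch using only the standard library follows; the helper name parse_json_reply is an assumption for illustration.

```python
import json
from typing import Any, Optional, Tuple


def parse_json_reply(reply: str) -> Tuple[Any, Optional[str]]:
    """Parse JSON text; return (data, None) on success or (None, message) on failure."""
    try:
        return json.loads(reply), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"


data, error = parse_json_reply('{"name": "Charlie", "age": 30}')
print(data, error)

data, error = parse_json_reply('{"name": "Charlie", "age": }')
print(data, error)
```

Only when error is None should the parsed value go on to schema validation. Keeping the two failure modes separate makes it clear whether a reply was malformed JSON or well-formed JSON with the wrong shape.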
    

4.5.2 Node.js: ajv library

AJV (Another JSON Schema Validator) is a high-performance JSON Schema validator for JavaScript.

  1. Install: npm install ajv

  2. Usage:

    import Ajv from 'ajv';
    
    const ajv = new Ajv();
    
    // Your JSON Schema
    const taskSchema = {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "title": "Task Item",
      "type": "object",
      "properties": {
        "taskId": { "type": "string", "pattern": "^T\\d{3}$" },
        "description": { "type": "string", "minLength": 10 },
        "isComplete": { "type": "boolean" }
      },
      "required": ["taskId", "description"]
    };
    
    // Compile the schema
    const validateTask = ajv.compile(taskSchema);
    
    // Valid data
    const validTask = {
      taskId: "T001",
      description: "Implement user authentication module.",
      isComplete: false
    };
    
    // Invalid data (missing 'description')
    const invalidTask1 = {
      taskId: "T002",
      isComplete: true
    };
    
    // Invalid data (taskId wrong pattern)
    const invalidTask2 = {
      taskId: "task-003",
      description: "Fix critical bug in payment gateway.",
      isComplete: false
    };
    
    console.log("--- Validating Data ---");
    
    if (validateTask(validTask)) {
      console.log(`'${JSON.stringify(validTask)}' is valid.`);
    } else {
      console.log(`Validation Error for '${JSON.stringify(validTask)}':`);
      console.log(validateTask.errors);
    }
    
    if (validateTask(invalidTask1)) {
      console.log(`'${JSON.stringify(invalidTask1)}' is valid.`);
    } else {
      console.log(`Validation Error for '${JSON.stringify(invalidTask1)}':`);
      console.log(validateTask.errors);
    }
    
    if (validateTask(invalidTask2)) {
      console.log(`'${JSON.stringify(invalidTask2)}' is valid.`);
    } else {
      console.log(`Validation Error for '${JSON.stringify(invalidTask2)}':`);
      console.log(validateTask.errors);
    }
    

Exercise 4.5.1: Implement Validation for Your Schema

  1. Take your final Product Schema from Exercise 4.4.1 (including conditional logic).
  2. Choose either Python (jsonschema) or Node.js (ajv).
  3. Write a script that attempts to validate the following JSON data against your schema:
    • Valid Product (Electronics):
      {
        "id": "E1001",
        "name": "Smartphone",
        "price": 799.99,
        "category": "electronics",
        "warrantyYears": 2,
        "inStock": true
      }
      
    • Invalid Product (Electronics - missing warranty):
      {
        "id": "E1002",
        "name": "Laptop",
        "price": 1200.00,
        "category": "electronics",
        "inStock": false
      }
      
    • Valid Product (Books):
      {
        "id": "B2001",
        "name": "The Great Novel",
        "price": 15.00,
        "category": "books",
        "isbn": "978-3-16-148410-0",
        "inStock": true
      }
      
    • Invalid Product (Books - wrong ISBN pattern):
      {
        "id": "B2002",
        "name": "Tech Guide",
        "price": 25.00,
        "category": "books",
        "isbn": "ABC-123",
        "inStock": true
      }
      
    • Valid Product (General):
      {
        "id": "G3001",
        "name": "Coffee Mug",
        "price": 8.50,
        "category": "homeware",
        "inStock": true
      }
      
  4. For each validation attempt, print whether the data is valid or, if invalid, display the validation errors.

This exercise will give you practical experience in defining complex data structures with JSON Schema and then programmatically validating against them, a crucial skill for building reliable AI applications.