Guided Project 1: Building a Structured Data Extraction Agent

This project guides you through building a simple AI agent that extracts structured information from product reviews. You’ll use JSON Schema to define the exact output format the LLM must follow, optionally encode structured inputs as TOON, and parse and validate the LLM’s JSON output in a Python or Node.js application.

Project Objective: Create an agent that processes product review text and extracts key details like the product mentioned, sentiment, rating, and identified pros/cons.

Technologies Used:

  • Python or Node.js
  • JSON
  • JSON Schema
  • TOON (for potential input optimization)
  • An LLM API (e.g., OpenAI, Anthropic, Gemini - you’ll need an API key for your chosen LLM)

Project Setup

Make sure you have your environment set up as described in Chapter 1. Specifically, you’ll need:

  • Python with pip install openai jsonschema python-toon
  • OR Node.js with npm install openai ajv @toon-format/toon
  • Your LLM API key set as an environment variable (e.g., OPENAI_API_KEY).

Create a new directory for this project:

mkdir structured-extraction-agent
cd structured-extraction-agent
# For Python: touch main.py product_schema.json
# For Node.js: touch index.js product_schema.json

Step 1: Define the Output Structure with JSON Schema

First, let’s create a JSON Schema that dictates the exact structure of the information we want to extract from each review. Save this as product_schema.json.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/product_review_analysis.schema.json",
  "title": "Product Review Analysis",
  "description": "Schema for extracting structured information from a product review.",
  "type": "object",
  "properties": {
    "productName": {
      "type": "string",
      "description": "The name of the product being reviewed.",
      "minLength": 3
    },
    "sentiment": {
      "type": "string",
      "description": "Overall sentiment of the review (positive, neutral, negative).",
      "enum": ["positive", "neutral", "negative"]
    },
    "rating": {
      "type": "integer",
      "description": "Numeric rating given by the user, if available (e.g., 1-5 stars).",
      "minimum": 1,
      "maximum": 5
    },
    "pros": {
      "type": "array",
      "description": "A list of positive aspects mentioned in the review.",
      "items": { "type": "string" },
      "uniqueItems": true
    },
    "cons": {
      "type": "array",
      "description": "A list of negative aspects or issues mentioned in the review.",
      "items": { "type": "string" },
      "uniqueItems": true
    },
    "summary": {
      "type": "string",
      "description": "A brief, concise summary of the review's main points.",
      "minLength": 10,
      "maxLength": 150
    }
  },
  "required": ["productName", "sentiment", "summary"],
  "additionalProperties": false
}

Explanation:

  • We define productName, sentiment, and summary as required fields.
  • sentiment is restricted to an enum (positive, neutral, negative).
  • rating is an optional integer between 1 and 5.
  • pros and cons are optional arrays of unique strings.
  • additionalProperties: false makes validation reject any extra, unexpected fields the LLM generates.
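
To see these constraints in action before involving an LLM, the sketch below validates two hand-written instances. It assumes the jsonschema package from Project Setup and uses an abbreviated inline copy of the schema so it runs on its own; failed_keywords is a hypothetical helper name, and both instances are invented for illustration.

```python
from jsonschema import Draft7Validator

# Abbreviated copy of product_schema.json, keeping only the constraints discussed above.
schema = {
    "type": "object",
    "properties": {
        "productName": {"type": "string", "minLength": 3},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "pros": {"type": "array", "items": {"type": "string"}, "uniqueItems": True},
        "cons": {"type": "array", "items": {"type": "string"}, "uniqueItems": True},
        "summary": {"type": "string", "minLength": 10, "maxLength": 150},
    },
    "required": ["productName", "sentiment", "summary"],
    "additionalProperties": False,
}

validator = Draft7Validator(schema)


def failed_keywords(instance):
    """Return the sorted schema keywords that the instance violates (hypothetical helper)."""
    return sorted(error.validator for error in validator.iter_errors(instance))


# Conforms: all required fields are present and the optional ones are simply omitted.
valid = {
    "productName": "Spectra 4K Monitor",
    "sentiment": "positive",
    "summary": "Stunning display; built-in speakers are weak.",
}

# Violates three rules: sentiment outside the enum, rating above 5, and an extra field.
invalid = {
    "productName": "Spectra 4K Monitor",
    "sentiment": "mixed",
    "rating": 6,
    "summary": "Stunning display; built-in speakers are weak.",
    "verdict": "buy",
}

print(failed_keywords(valid))
print(failed_keywords(invalid))
```

Unlike validate(), which raises on the first problem it finds, iter_errors() reports every violation, which is handy when you want to feed all of them back to the model at once.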

Step 2: Write the AI Agent Logic

Now, let’s write the code to:

  1. Load the JSON Schema.
  2. Define sample product reviews.
  3. Construct a prompt for the LLM to extract data based on the schema.
  4. Call the LLM API.
  5. Validate the LLM’s response against the schema.
  6. (Optional) Encode structured input as TOON before sending it to the LLM, to reduce token usage.

Choose your language:

Python (main.py)

import os
import json
from openai import OpenAI
from jsonschema import validate, ValidationError
from toon import encode as toon_encode, decode as toon_decode # Import TOON functions

# 1. Load the JSON Schema
with open('product_schema.json', 'r') as f:
    review_schema = json.load(f)

# Initialize OpenAI client (make sure OPENAI_API_KEY is set as an environment variable)
client = OpenAI()

def extract_review_data(review_text: str, schema: dict) -> dict:
    """
    Extracts structured data from a review using an LLM and validates it against a schema.
    Returns the validated dict, or None if the call, JSON parsing, or validation fails.
    """
    schema_str = json.dumps(schema, indent=2)

    # TOON pays off for structured input (e.g., a list of review records), not for a
    # single free-text string, so the review is passed as plain text in the prompt.
    # If you had structured input to send, you could encode it first, for example:
    # toon_review_data = toon_encode({"review": review_text}, indent=2)

    prompt = f"""
    You are an expert product review analyst. Your task is to extract structured information
    from the following product review.
    The output MUST STRICTLY conform to the provided JSON Schema.
    Do NOT include any extra text, comments, or explanations outside the JSON object.

    JSON Schema for output:
    ```json
    {schema_str}
    ```

    Product Review to Analyze:
    ---
    {review_text}
    ---

    Your JSON output:
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini", # Or another suitable model like "gpt-3.5-turbo"
            messages=[
                {"role": "system", "content": "You are a helpful assistant that provides structured JSON output."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}, # JSON mode: asks for syntactically valid JSON; schema conformance is still checked below
            temperature=0.0 # Keep temperature low for structured extraction
        )

        llm_output_str = response.choices[0].message.content
        extracted_data = json.loads(llm_output_str)

        # Validate the extracted data
        validate(instance=extracted_data, schema=schema)
        print("Validation successful!")
        return extracted_data

    except json.JSONDecodeError as e:
        print(f"Error decoding JSON from LLM: {e}")
        print(f"LLM Raw Output:\n{llm_output_str}")
        return None
    except ValidationError as e:
        print(f"Validation failed for LLM output: {e.message}")
        print(f"Path: {' -> '.join(map(str, e.path))}")
        print(f"LLM Raw Output:\n{llm_output_str}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# Sample Reviews
sample_reviews = [
    """
    I recently bought the new 'Spectra 4K Monitor'. The display is absolutely stunning with vibrant colors.
    Setup was a breeze. Only minor con is that the built-in speakers are a bit weak, but I use external ones anyway.
    Overall, a fantastic monitor, worth every penny! I'd give it 5 stars.
    """,
    """
    The 'ErgoFit Office Chair' looked good online, but assembly was a nightmare.
    Some parts didn't align, and the instructions were unclear.
    It's reasonably comfortable once built, but the armrests feel a bit flimsy.
    I'm giving it 3 stars due to the frustrating assembly.
    """,
    """
    Got the 'SonicBlast Headphones'. Sound quality is amazing for the price.
    Comfortable for long periods. Battery life is decent, around 8 hours.
    A good purchase for casual listening. I'd say 4/5.
    """
]

# Process each review
for i, review in enumerate(sample_reviews):
    print(f"\n--- Processing Review {i+1} ---")
    print(f"Review Text:\n{review}")
    extracted = extract_review_data(review, review_schema)
    if extracted:
        print("\nExtracted Data (Validated):")
        print(json.dumps(extracted, indent=2))
    else:
        print("Failed to extract or validate data.")

Node.js (index.js)

import fs from 'fs';
import OpenAI from 'openai';
import Ajv from 'ajv';
import { encode as toonEncode, decode as toonDecode } from '@toon-format/toon'; // Import TOON functions

// 1. Load the JSON Schema
const reviewSchema = JSON.parse(fs.readFileSync('product_schema.json', 'utf8'));

// Initialize AJV validator
const ajv = new Ajv();
const validateReview = ajv.compile(reviewSchema);

// Initialize OpenAI client (make sure OPENAI_API_KEY is set as an environment variable)
const openai = new OpenAI();

async function extractReviewData(reviewText, schema) {
    const schemaStr = JSON.stringify(schema, null, 2);

    // TOON pays off for structured input (e.g., a list of review records), not for a
    // single free-text string, so the review is passed as plain text in the prompt.
    // If you had structured input to send, you could encode it first, for example:
    // const toonReviewData = toonEncode({ review: reviewText }, { indent: 2 });

    const prompt = `
    You are an expert product review analyst. Your task is to extract structured information
    from the following product review.
    The output MUST STRICTLY conform to the provided JSON Schema.
    Do NOT include any extra text, comments, or explanations outside the JSON object.

    JSON Schema for output:
    \`\`\`json
    ${schemaStr}
    \`\`\`

    Product Review to Analyze:
    ---
    ${reviewText}
    ---

    Your JSON output:
    `;

    let llmOutputStr = '';
    try {
        const response = await openai.chat.completions.create({
            model: "gpt-4o-mini", // Or another suitable model
            messages: [
                { role: "system", content: "You are a helpful assistant that provides structured JSON output." },
                { role: "user", content: prompt }
            ],
            response_format: { type: "json_object" }, // JSON mode: asks for syntactically valid JSON; schema conformance is still checked below
            temperature: 0.0 // Keep temperature low for structured extraction
        });

        llmOutputStr = response.choices[0].message.content;
        const extractedData = JSON.parse(llmOutputStr);

        // Validate the extracted data
        const isValid = validateReview(extractedData);
        if (isValid) {
            console.log("Validation successful!");
            return extractedData;
        } else {
            console.log(`Validation failed for LLM output:`);
            console.log(validateReview.errors);
            console.log(`LLM Raw Output:\n${llmOutputStr}`);
            return null;
        }

    } catch (e) {
        console.error(`Error processing review: ${e.message}`);
        if (e instanceof SyntaxError) { // JSON parsing error
            console.error(`Error decoding JSON from LLM. Raw output:\n${llmOutputStr}`);
        }
        return null;
    }
}

// Sample Reviews
const sampleReviews = [
    `
    I recently bought the new 'Spectra 4K Monitor'. The display is absolutely stunning with vibrant colors.
    Setup was a breeze. Only minor con is that the built-in speakers are a bit weak, but I use external ones anyway.
    Overall, a fantastic monitor, worth every penny! I'd give it 5 stars.
    `,
    `
    The 'ErgoFit Office Chair' looked good online, but assembly was a nightmare.
    Some parts didn't align, and the instructions were unclear.
    It's reasonably comfortable once built, but the armrests feel a bit flimsy.
    I'm giving it 3 stars due to the frustrating assembly.
    `,
    `
    Got the 'SonicBlast Headphones'. Sound quality is amazing for the price.
    Comfortable for long periods. Battery life is decent, around 8 hours.
    A good purchase for casual listening. I'd say 4/5.
    `
];

// Process each review
async function runExtraction() {
    for (let i = 0; i < sampleReviews.length; i++) {
        console.log(`\n--- Processing Review ${i + 1} ---`);
        const review = sampleReviews[i];
        console.log(`Review Text:\n${review}`);
        const extracted = await extractReviewData(review, reviewSchema);
        if (extracted) {
            console.log("\nExtracted Data (Validated):");
            console.log(JSON.stringify(extracted, null, 2));
        } else {
            console.log("Failed to extract or validate data.");
        }
    }
}

runExtraction();

Step 3: Run and Test the Agent

  1. Set your API Key: Ensure your OPENAI_API_KEY (or equivalent for other LLMs) is set in your environment variables.
  2. Run the script:
    • Python: python main.py
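    • Node.js: node index.js (index.js uses ES module import syntax, so set "type": "module" in package.json or name the file index.mjs)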

Observe the output. You should see:

  • The review text being processed.
  • The LLM’s raw output, printed only when JSON parsing or validation fails.
  • Confirmation of successful validation or detailed validation errors if the LLM output deviates from the schema.
  • The extracted and validated structured data.

Step 4: Mini-Challenge and Refinement

Challenge 1: Handle Validation Errors Gracefully

Modify your extract_review_data function. If validation fails, instead of just printing the error, try to ask the LLM to correct its output.

  • You’ll need to send a follow-up prompt to the LLM.
  • The prompt could include the original review, the LLM’s invalid output, and the validation error message, and ask the LLM to regenerate valid JSON.
  • Implement a retry mechanism (e.g., 1 or 2 retries).
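
One possible shape for this retry loop is sketched below. It is not the only way to solve the challenge: extract_with_retries, call_llm, and find_error are hypothetical names, and the LLM call and the validator are passed in as plain functions so the loop can be exercised without an API key.

```python
import json
from typing import Callable, Optional


def extract_with_retries(
    review_text: str,
    call_llm: Callable[[str], str],
    find_error: Callable[[dict], Optional[str]],
    max_retries: int = 2,
) -> Optional[dict]:
    """Ask for JSON; on failure, feed the error back to the model and try again."""
    prompt = f"Extract the review details as JSON.\n\nReview:\n{review_text}"
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            error = find_error(data)  # None means the data passed validation
        except json.JSONDecodeError as exc:
            error = f"Output was not valid JSON: {exc}"
        if error is None:
            return data
        # Corrective prompt: original review, the invalid output, and the error message.
        prompt = (
            "Your previous output was invalid.\n\n"
            f"Review:\n{review_text}\n\n"
            f"Previous output:\n{raw}\n\n"
            f"Validation error:\n{error}\n\n"
            "Return a corrected JSON object only."
        )
    return None  # every attempt failed


# Stand-ins for the real LLM call and schema validator, used only to exercise the loop.
replies = iter(['{"sentiment": "mixed"}', '{"sentiment": "positive"}'])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


def fake_check(data: dict) -> Optional[str]:
    allowed = {"positive", "neutral", "negative"}
    if data.get("sentiment") in allowed:
        return None
    return "sentiment must be one of positive, neutral, negative"


result = extract_with_retries("Great monitor.", fake_llm, fake_check)
print(result)
```

In main.py, call_llm would wrap client.chat.completions.create and find_error would wrap validate(), returning e.message when a ValidationError is raised and None otherwise.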

Challenge 2: Integrate TOON for Batch Processing (Advanced)

Imagine you have hundreds of reviews to process. Instead of sending one review at a time, you want to send them in batches of 5 reviews per prompt.

  1. Define a new JSON Schema for an array of review analysis objects.
  2. Format your input reviews as a TOON tabular array if every review has the same flat fields, or as a list of objects if the records are more complex.
  3. Modify the prompt to instruct the LLM to process the batch of TOON reviews and output an array of review analysis objects, strictly conforming to your new schema.
  4. Implement the TOON encoding for the input reviews (e.g., structured as reviews[N]{id,text}: with an ID and the review text), plus JSON parsing and validation for the LLM’s output array.
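
As a starting point for step 1, the sketch below derives a batch schema from the single-review schema instead of writing a second one by hand. The helper name make_batch_schema, the added id field, and the analyses wrapper key are all assumptions made for illustration. The wrapper is there because the calls above request json_object mode, which is meant to return a JSON object, so a bare top-level array may not come back. TOON encoding of the input batch is left to the TOON library and is not shown.

```python
import copy


def make_batch_schema(item_schema: dict, batch_size: int) -> dict:
    """Wrap a single-review schema in an object holding exactly batch_size analyses."""
    item = copy.deepcopy(item_schema)  # leave the caller's schema untouched
    item.pop("$schema", None)          # these keywords belong at the document root
    item.pop("$id", None)
    # Hypothetical id field so each analysis can be matched to its input review.
    item["properties"]["id"] = {
        "type": "integer",
        "minimum": 1,
        "description": "ID of the input review this analysis belongs to.",
    }
    item["required"] = sorted(set(item.get("required", [])) | {"id"})
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "Product Review Analysis Batch",
        "type": "object",
        "properties": {
            "analyses": {
                "type": "array",
                "items": item,
                "minItems": batch_size,
                "maxItems": batch_size,
            }
        },
        "required": ["analyses"],
        "additionalProperties": False,
    }


# Tiny stand-in for product_schema.json, used only to demonstrate the helper.
single_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"productName": {"type": "string"}},
    "required": ["productName"],
    "additionalProperties": False,
}

batch_schema = make_batch_schema(single_schema, batch_size=5)
print(batch_schema["properties"]["analyses"]["items"]["required"])
```

Pinning minItems and maxItems to the batch size lets validation catch a model that silently drops or duplicates a review, and the id field makes it possible to report which review an invalid entry belongs to.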

This project demonstrates how JSON Schema brings reliability to LLM-generated structured data and how to integrate these concepts into a practical AI agent.