Core Concepts: Understanding TOON

Now that we have a solid grasp of JSON, it’s time to explore its token-efficient cousin, TOON (Token-Oriented Object Notation). While JSON is a general-purpose data format, TOON is purpose-built for AI, specifically to optimize data exchange with Large Language Models (LLMs). This chapter will break down TOON’s unique syntax and its core principles.

3.1 The Philosophy Behind TOON

The primary motivation for TOON is to reduce token consumption when interacting with LLMs. All text in a prompt or response is split into tokens, and the token count determines both computational cost and context window usage. JSON, with its repeated keys, quotes, and structural punctuation (braces, brackets, commas), can be verbose and therefore expensive in an LLM context.

TOON tackles this verbosity by:

  • Minimizing Syntactic Overhead: Reducing the need for repetitive characters.
  • Leveraging Tabular Data: Optimizing the representation of uniform arrays of objects, which are very common in AI workflows (e.g., lists of users, products, log entries).
  • Enhancing LLM Parsing: Providing explicit structural cues (like array lengths and field headers) that help LLMs parse and validate data more reliably, which may in turn improve accuracy.

It’s important to remember: TOON is intended mainly as an input format for LLMs, not as a replacement for JSON throughout your application. You would typically convert your programmatic JSON data into TOON before sending it to an LLM, and convert the LLM’s response (if it is TOON-formatted) back to JSON for your application logic.

3.2 TOON’s Core Structures

TOON adopts a blend of YAML’s indentation-based nesting and CSV’s tabular efficiency, all while adding LLM-specific optimizations.

3.2.1 Simple Key-Value Pairs (Objects)

For simple objects, TOON removes the curly braces and quotes around keys, similar to YAML. Values can be strings, numbers, booleans, or null.

JSON:

{
  "name": "Alice",
  "age": 30,
  "isActive": true
}

TOON:

name: Alice
age: 30
isActive: true

Notice that neither the keys (name, age, isActive) nor the string value Alice is quoted. TOON only quotes strings when necessary (e.g., if they contain delimiters like commas or colons, or have leading/trailing whitespace).
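The mapping from a flat object to TOON lines can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any TOON library: format_value and encode_flat are hypothetical names, and string quoting (covered in 3.4) is deliberately left out.

```python
import json


def format_value(value):
    """Render one primitive as it appears in the TOON example above."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return json.dumps(value)
    return value  # strings pass through unquoted in this simplified sketch


def encode_flat(obj):
    """Encode a flat dict as one 'key: value' line per entry."""
    return "\n".join(f"{key}: {format_value(value)}" for key, value in obj.items())


flat_text = encode_flat({"name": "Alice", "age": 30, "isActive": True})
print(flat_text)
```

Running it on the Alice object reproduces the three TOON lines shown above.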

Exercise 3.2.1: Simple TOON Object

Convert the following JSON object to its TOON equivalent:

JSON:

{
  "deviceName": "Smart Thermostat",
  "location": "Living Room",
  "temperatureCelsius": 22.5,
  "online": false,
  "firmwareVersion": "2.1.3"
}

3.2.2 Nested Objects

Nesting in TOON is achieved through indentation, typically 2 spaces, just like in YAML. This eliminates the need for opening and closing curly braces, saving tokens.

JSON:

{
  "user": {
    "id": 101,
    "profile": {
      "email": "user@example.com",
      "preferences": {
        "theme": "dark",
        "notifications": true
      }
    }
  }
}

TOON:

user:
  id: 101
  profile:
    email: user@example.com
    preferences:
      theme: dark
      notifications: true

Each level of indentation represents a nested object.
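Indentation-based nesting lends itself to a short recursive function. The sketch below assumes values are either nested dicts or simple scalars; encode_nested is a hypothetical helper written for illustration, and it ignores arrays and quoting.

```python
def encode_nested(obj, depth=0, indent=2):
    """Return TOON lines for a dict whose values are dicts or scalars."""
    pad = " " * (indent * depth)
    lines = []
    for key, value in obj.items():
        if isinstance(value, dict):
            # A nested object: emit the key alone, then recurse one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(encode_nested(value, depth + 1, indent))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        elif value is None:
            lines.append(f"{pad}{key}: null")
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


user_doc = {
    "user": {
        "id": 101,
        "profile": {
            "email": "user@example.com",
            "preferences": {"theme": "dark", "notifications": True},
        },
    }
}
nested_text = "\n".join(encode_nested(user_doc))
print(nested_text)
```

Each recursive call adds one indentation step, which is all that replaces the braces of the JSON version.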

Exercise 3.2.2: Nested TOON Object

Convert the following JSON into TOON, ensuring correct indentation for nested objects:

JSON:

{
  "config": {
    "database": {
      "host": "localhost",
      "port": 5432,
      "user": "admin"
    },
    "logging": {
      "level": "info",
      "filePath": "/var/log/app.log"
    }
  }
}

3.2.3 Primitive Arrays (Inline)

Arrays of primitive values (strings, numbers, booleans) are often represented inline, with the array length declared upfront in square brackets []. This helps the LLM understand the expected number of items without having to parse all delimiters.

JSON:

{
  "tags": ["AI", "LLM", "Data"],
  "scores": [95, 88, 72, 91]
}

TOON:

tags[3]: AI,LLM,Data
scores[4]: 95,88,72,91

Here, tags[3] tells the LLM there are 3 tags, and scores[4] indicates 4 scores.
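As a rough sketch, the inline form and its length check could look like this in Python. Both function names are hypothetical, and the check splits naively on commas, so it assumes no value is quoted or contains a comma.

```python
import re


def encode_primitive_array(key, values):
    """Encode a list of primitives as 'key[N]: v1,v2,...'."""
    def fmt(value):
        if value is None:
            return "null"
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return f"{key}[{len(values)}]: " + ",".join(fmt(v) for v in values)


def declared_length_matches(line):
    """Check that the declared [N] equals the number of inline items."""
    match = re.fullmatch(r"(\w+)\[(\d+)\]: (.*)", line)
    if match is None:
        return False
    declared = int(match.group(2))
    items = match.group(3).split(",") if match.group(3) else []
    return declared == len(items)


tags_line = encode_primitive_array("tags", ["AI", "LLM", "Data"])
scores_line = encode_primitive_array("scores", [95, 88, 72, 91])
print(tags_line)
print(scores_line)
print(declared_length_matches(tags_line))
```

The second function shows why the declared length is useful: a truncated or padded array can be detected without knowing anything else about the data.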

Exercise 3.2.3: TOON Primitive Arrays

Convert the following JSON arrays into their TOON representation:

JSON:

{
  "permissions": ["read", "write", "delete"],
  "sensorReadings": [10.2, 11.5, 9.8, 12.1]
}

3.2.4 Tabular Arrays (TOON’s Superpower)

This is where TOON truly shines for LLM token efficiency. When an array consists of objects that all have the same keys and only primitive values, TOON converts them into a highly compact, tabular (CSV-like) format. The field names are declared once in a header, and then only the values follow.

Requirements for Tabular Format:

  1. All elements in the array must be objects.
  2. All objects must have the exact same keys in the exact same order.
  3. All values within these objects must be primitive (no nested objects or arrays).
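The three requirements can be checked mechanically. The following Python sketch is one way to express them; is_tabular is a hypothetical helper, and treating an empty array as non-tabular is an assumption made here for simplicity.

```python
PRIMITIVES = (str, int, float, bool, type(None))


def is_tabular(items):
    """Return True if a list meets the three tabular requirements above."""
    # Requirement 1: every element is an object (dict).
    if not items or not all(isinstance(item, dict) for item in items):
        return False
    header = list(items[0].keys())
    for item in items:
        # Requirement 2: same keys in the same order as the first object.
        if list(item.keys()) != header:
            return False
        # Requirement 3: only primitive values.
        if not all(isinstance(value, PRIMITIVES) for value in item.values()):
            return False
    return True


uniform = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
extra_key = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob", "age": 40}]
nested = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]

print(is_tabular(uniform), is_tabular(extra_key), is_tabular(nested))
```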

JSON:

{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" },
    { "id": 3, "name": "Charlie", "role": "editor" }
  ]
}

TOON:

users[3]{id,name,role}:
  1,Alice,admin
  2,Bob,user
  3,Charlie,editor

Let’s break down the header line users[3]{id,name,role}: piece by piece:

  • users: The key for this array.
  • [3]: The explicit length of the array (3 items).
  • {id,name,role}: The field header, declaring all keys once.
  • The trailing colon (:): Separator indicating that the header ends and the data rows begin.

Subsequent lines contain the data, separated by commas, corresponding to the declared fields. This dramatically reduces token count by removing repeated key names, quotes, and object braces.
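To make the saving concrete, here is a minimal Python sketch that builds the tabular form and compares its length with the equivalent compact JSON. encode_tabular is a hypothetical helper, not a library function; it assumes the rows already meet the tabular requirements, skips quoting, and indents each row one level under the header, as the TOON specification lays rows out. Character count is only a rough proxy for token count.

```python
import json


def encode_tabular(key, rows, indent=2):
    """Encode uniform dict rows as a TOON table: one header, then value rows."""
    def fmt(value):
        if value is None:
            return "null"
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    fields = list(rows[0].keys())
    field_list = ",".join(fields)
    lines = [f"{key}[{len(rows)}]{{{field_list}}}:"]
    pad = " " * indent
    for row in rows:
        lines.append(pad + ",".join(fmt(row[field]) for field in fields))
    return "\n".join(lines)


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Charlie", "role": "editor"},
]
table_text = encode_tabular("users", users)
json_text = json.dumps({"users": users}, separators=(",", ":"))

print(table_text)
print(f"JSON characters: {len(json_text)}, TOON characters: {len(table_text)}")
```

Even against JSON with all optional whitespace removed, the table is shorter because the key names appear once instead of once per row.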

Exercise 3.2.4: Tabular TOON Array

Convert the following JSON data representing a list of products into its TOON tabular array format:

JSON:

{
  "products": [
    { "sku": "P001", "name": "Laptop", "price": 1200.00, "inStock": true },
    { "sku": "P002", "name": "Mouse", "price": 25.50, "inStock": true },
    { "sku": "P003", "name": "Keyboard", "price": 75.99, "inStock": false }
  ]
}

Exercise 3.2.5: Identify Non-Tabular Data

Which of the following JSON arrays cannot be represented in TOON’s tabular format, and why?

  1. {
      "events": [
        { "id": "e1", "type": "click", "timestamp": "..." },
        { "id": "e2", "type": "scroll", "timestamp": "..." }
      ]
    }
    
  2. {
      "configOptions": [
        { "name": "theme", "value": "dark" },
        { "name": "debugMode", "value": true, "level": "verbose" }
      ]
    }
    
  3. {
      "dataPoints": [
        { "x": 10, "y": 20 },
        { "x": 15, "y": 25, "z": 5 }
      ]
    }
    
  4. {
      "usersWithDetails": [
        { "id": 1, "name": "Alice", "address": { "street": "Main St" } },
        { "id": 2, "name": "Bob", "address": { "street": "Elm St" } }
      ]
    }
    

3.2.5 List Arrays (Non-Uniform Data)

When an array does not meet the strict requirements for a tabular array (e.g., objects have different keys, or values are nested objects/arrays), TOON falls back to a list-like format using hyphens -, similar to YAML lists. Each item is typically on its own line, prefixed by -.

JSON (non-uniform objects):

{
  "items": [
    { "type": "book", "title": "1984" },
    { "type": "movie", "title": "Inception", "year": 2010 }
  ]
}

TOON:

items[2]:
  - type: book
    title: "1984"
  - type: movie
    title: Inception
    year: 2010

Notice that the second item has an extra year field, which prevents the array from being tabular. Each item still uses indentation for its internal object structure. The title "1984" is quoted because it is a string that would otherwise be read as a number.

JSON (array with mixed primitive and object values):

{
  "mixedContent": [
    "simple string",
    { "id": 1, "status": "active" },
    123,
    null
  ]
}

TOON:

mixedContent[4]:
  - simple string
  - id: 1
    status: active
  - 123
  - null
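The list format can be sketched as follows. encode_list_array is a hypothetical helper that handles only primitives and flat, non-empty objects, and it does no quoting; a real encoder would also need to cover nested values.

```python
def encode_list_array(key, items, indent=2):
    """Encode a mixed list using the hyphen-prefixed list format."""
    def fmt(value):
        if value is None:
            return "null"
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    pad = " " * indent
    lines = [f"{key}[{len(items)}]:"]
    for item in items:
        if isinstance(item, dict):
            # The first field sits on the hyphen line; later fields align under it.
            for position, (field, value) in enumerate(item.items()):
                prefix = "- " if position == 0 else "  "
                lines.append(f"{pad}{prefix}{field}: {fmt(value)}")
        else:
            lines.append(f"{pad}- {fmt(item)}")
    return "\n".join(lines)


mixed = ["simple string", {"id": 1, "status": "active"}, 123, None]
list_text = encode_list_array("mixedContent", mixed)
print(list_text)
```

The output matches the mixedContent example above, including the declared length of 4 in the header line.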

Exercise 3.2.6: Non-Tabular TOON Array

Convert the following JSON array into its TOON representation. This array contains non-uniform objects.

JSON:

{
  "dashboardWidgets": [
    { "type": "chart", "title": "Sales Trend", "dataUrl": "/api/sales" },
    { "type": "metrics", "title": "Current Users", "value": 1250, "unit": "users" }
  ]
}

3.3 Delimiters in TOON

TOON supports alternative delimiters for arrays, allowing you to choose the most token-efficient or data-compatible option. The default is a comma ,. The other options are tab \t and pipe |. A non-default delimiter is declared inside the length brackets of the header, so a parser knows how to split the rows.

Example with pipe delimiter:

users[3|]{id|name|role}:
  1|Alice|admin
  2|Bob|user
  3|Charlie|editor

Choosing the right delimiter can sometimes further optimize token count for specific LLM tokenizers or handle data that naturally contains commas (e.g., addresses).
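One simple heuristic for the data-compatibility side of this choice is to pick the first delimiter that never occurs in the values, so that no row needs quoting. The sketch below illustrates that idea only; pick_delimiter is a hypothetical helper, not part of any TOON library, and it says nothing about which delimiter tokenizes most cheaply.

```python
def pick_delimiter(rows, candidates=(",", "\t", "|")):
    """Return the first candidate delimiter that appears in no value."""
    for delimiter in candidates:
        collides = any(
            delimiter in str(value) for row in rows for value in row.values()
        )
        if not collides:
            return delimiter
    # Every candidate collides with the data: fall back to the default
    # and rely on quoting for the affected values.
    return candidates[0]


plain_rows = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]
address_rows = [{"id": 1, "address": "12 Main St, Springfield"}]

print(repr(pick_delimiter(plain_rows)))
print(repr(pick_delimiter(address_rows)))
```

For the address data the comma is rejected and the tab is chosen, which keeps every row free of quotes.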

3.4 Smart Quoting in TOON

Unlike JSON, where all string values must be quoted, TOON uses “smart quoting.” It only adds quotes to string values if they:

  • Are empty ("").
  • Have leading or trailing whitespace (" value ").
  • Contain the active delimiter (e.g., "value with, comma" when using the comma delimiter).
  • Contain a colon (:).
  • Contain quotes or backslashes.
  • Look like a boolean ("true", "false").
  • Look like a number ("123", "-4.5").
  • Look like null ("null").
  • Start with list syntax ("- item").
  • Look like structural syntax ("[5]", "{key}").

This minimizes quote characters, saving tokens.

Example:

name: Alice           # No quotes needed
description: "Hello, World!" # Quotes needed because of comma
status: "true"       # Quotes needed to distinguish from boolean true
productCode: "456"   # Quotes needed to distinguish from number 456
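The rules listed above translate fairly directly into a predicate. The following Python sketch is an approximation written for illustration: needs_quotes and quote_if_needed are hypothetical names, the number pattern is a simplification, and control characters such as newlines are not handled.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")
STRUCTURE_RE = re.compile(r"\[.*\]|\{.*\}")


def needs_quotes(text, delimiter=","):
    """Apply the smart-quoting rules listed above to one string value."""
    if text == "" or text != text.strip():
        return True  # empty, or leading/trailing whitespace
    if delimiter in text or ":" in text:
        return True  # active delimiter or colon
    if '"' in text or "\\" in text:
        return True  # quotes or backslashes
    if text in ("true", "false", "null"):
        return True  # would be read as a boolean or null
    if NUMBER_RE.fullmatch(text):
        return True  # would be read as a number
    if text.startswith("- ") or STRUCTURE_RE.fullmatch(text):
        return True  # list or structural syntax
    return False


def quote_if_needed(text, delimiter=","):
    """Return text unchanged, or wrapped in quotes with escapes applied."""
    if not needs_quotes(text, delimiter):
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


for sample in ["Alice", "Hello, World!", "true", "456", "2.1.3"]:
    print(quote_if_needed(sample))
```

Note that the active delimiter is a parameter: a value such as a|b needs quotes only when the pipe is the delimiter in use.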

3.5 Working with TOON in Python and Node.js

The python-toon and @toon-format/toon libraries make it straightforward to convert between JSON-like Python/JavaScript objects and TOON strings.

Python Example:

import json
from toon import encode, decode

# Example JSON data (Python dictionary)
product_data = {
    "products": [
        {"sku": "LPT-1234", "name": "ProLaptop", "price": 1299.99, "stock": 150},
        {"sku": "MSC-5678", "name": "Wireless Mouse", "price": 25.50, "stock": 80}
    ]
}

# Encode Python dictionary to TOON string
toon_string = encode(product_data)
print("--- Encoded TOON String ---")
print(toon_string)

# Decode TOON string back to Python dictionary
decoded_data = decode(toon_string)
print("\n--- Decoded Python Dictionary ---")
print(decoded_data)
print(f"First product name: {decoded_data['products'][0]['name']}")

# You can also specify delimiter or indentation during encoding
# toon_string_tab = encode(product_data, delimiter='\t', indent=2)
# print("\n--- TOON with Tab Delimiter ---")
# print(toon_string_tab)

Node.js (JavaScript) Example:

import { encode, decode } from "@toon-format/toon";

// Example JSON data (JavaScript object)
const userData = {
  users: [
    { id: 1, name: "Alice", email: "alice@example.com", isActive: true },
    { id: 2, name: "Bob", email: "bob@example.com", isActive: false },
  ],
};

// Encode JavaScript object to TOON string
const toonString = encode(userData);
console.log("--- Encoded TOON String ---");
console.log(toonString);

// Decode TOON string back to JavaScript object
const decodedData = decode(toonString);
console.log("\n--- Decoded JavaScript Object ---");
console.log(decodedData);
console.log(`Second user email: ${decodedData.users[1].email}`);

// You can also specify delimiter or indentation during encoding
// const toonStringTab = encode(userData, { delimiter: '\t', indent: 2 });
// console.log("\n--- TOON with Tab Delimiter ---");
// console.log(toonStringTab);

Exercise 3.5.1: TOON Encoding and Decoding Roundtrip

  1. Take the dashboardWidgets JSON data from Exercise 3.2.6 (non-uniform array).
  2. In a Python script, represent this data as a Python dictionary.
  3. Encode this dictionary into a TOON string. Print the TOON string.
  4. Decode the generated TOON string back into a Python dictionary. Print the decoded dictionary.
  5. Verify that the original and decoded data structures are identical (e.g., by comparing specific values).

Repeat the same process in a Node.js script using JavaScript objects.

Exercise 3.5.2: Exploring Delimiter Impact

  1. Use the products JSON data from Exercise 3.2.4 (tabular array).
  2. In your chosen language (Python or Node.js), encode this data into TOON using:
    • The default comma delimiter.
    • The tab \t delimiter.
    • The pipe | delimiter.
  3. Print all three generated TOON strings. Observe the differences.
  4. (Optional, Advanced): If you have tiktoken installed, try using the count_tokens function from the Introduction to compare the token count for each of these three TOON strings. tiktoken tokenizes locally, so no OpenAI API key is required. Do you see any differences based on the delimiter?

Solution Hint for token counting in Python:

import tiktoken

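def count_tokens(text: str, model_name: str = "gpt-4o-mini") -> int:
    """Count the tokens in text using the tokenizer tiktoken maps to model_name."""
    try:
        enc = tiktoken.encoding_for_model(model_name)
    except KeyError:
        # Model name unknown to this tiktoken version: fall back to the
        # o200k_base encoding used by the GPT-4o model family.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))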

# Then, after encoding:
# tokens_comma = count_tokens(toon_string_comma)
# tokens_tab = count_tokens(toon_string_tab)
# tokens_pipe = count_tokens(toon_string_pipe)

By completing these exercises, you should have a firm understanding of TOON’s syntax, its various structures, and how to programmatically work with it in both Python and Node.js. This foundational knowledge is key to leveraging TOON for more efficient AI interactions.