Intermediate Topics: TOON’s Advanced Features and Best Practices

Having covered the foundational elements of TOON, we’ll now delve into its more advanced features and explore best practices for maximizing its benefits in AI workflows. Understanding these nuances will enable you to squeeze even more token efficiency out of your LLM prompts and ensure your data is robustly interpreted.

5.1 Key Folding (Dotted Paths)

TOON offers an optional feature called “key folding” or “dotted paths.” This is particularly useful when you have objects that contain single-key wrapper chains, allowing you to flatten them into a more compact format, reducing indentation and token count.

JSON Example with a wrapper chain:

{
  "data": {
    "metadata": {
      "items": {
        "count": 5
      }
    }
  }
}

In standard TOON, this would still involve multiple lines of indentation:

data:
  metadata:
    items:
      count: 5

With key folding, this can be collapsed:

data.metadata.items.count: 5

This feature is best used selectively for paths where the intermediate objects don’t carry any other properties. It sacrifices some immediate visual hierarchy for greater compactness.
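To make the mechanics concrete, here is a minimal illustrative sketch of the folding transformation. The `fold_keys` helper is hypothetical, written for this example rather than taken from any TOON library, and it only handles the single-key-chain case discussed above:

```python
def fold_keys(obj, prefix=""):
    """Collapse chains of single-key dicts into dotted-path keys.

    Illustrative sketch only; a real TOON encoder also handles
    arrays, quoting, and delimiter rules.
    """
    folded = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        # Keep folding while the value is a dict with exactly one key.
        if isinstance(value, dict) and len(value) == 1:
            folded.update(fold_keys(value, path))
        else:
            folded[path] = value
    return folded

data = {"data": {"metadata": {"items": {"count": 5}}}}
for path, value in fold_keys(data).items():
    print(f"{path}: {value}")
# Prints: data.metadata.items.count: 5
```

Note that the sketch stops folding as soon as an object has more than one key, which matches the guidance above: intermediate objects carrying other properties keep their conventional nesting.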

When to use: For deeply nested, single-key paths where the intermediate keys are purely structural and add little semantic value beyond forming the path.

When to be cautious: If the intermediate objects might have other properties, or if their individual existence is semantically important, conventional nesting is clearer.

Exercise 5.1.1: Apply Key Folding

Convert the following JSON into TOON, applying key folding where appropriate to minimize tokens:

JSON:

{
  "system": {
    "config": {
      "network": {
        "interface": "eth0",
        "ipAddress": "192.168.1.1"
      }
    },
    "status": {
      "health": "ok"
    }
  }
}

5.2 Custom Delimiters and Their Impact

As briefly mentioned, TOON allows specifying custom delimiters for tabular arrays. While the comma (,) is the default and generally safe, alternatives like the tab (\t) or pipe (|) can sometimes offer further token savings or handle specific data challenges.

Why custom delimiters?

  1. Tokenizer Behavior: Different LLM tokenizers might treat tabs or pipes as a single token, whereas a comma followed by a space might be two tokens. Benchmarking with your specific LLM’s tokenizer can reveal optimal delimiters.
  2. Data Content: If your string data frequently contains commas (e.g., addresses, descriptions with lists), using a pipe | as a delimiter prevents the need for quoting those strings, which saves tokens.

Example with pipe delimiter (from a previous chapter):

users[3]{id|name|role}:
1|Alice|admin
2|Bob|user
3|Charlie|editor

In the TOON libraries, you’d typically pass a delimiter option to the encode function.

from toon import encode

data = {"items": [{"a": "one,two", "b": 3}, {"a": "three", "b": 4}]}
print(encode(data, delimiter=','))
# Output:
# items[2]{a,b}:
# "one,two",3
# three,4

print(encode(data, delimiter='|'))
# Output:
# items[2]{a|b}:
# one,two|3
# three|4

Notice how with the comma delimiter, "one,two" had to be quoted, using two extra tokens for the quotes. With the pipe delimiter, it did not.
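The quoting trade-off can be sketched with a toy cell encoder. This is an illustrative stand-in for the library’s logic, not its actual implementation, and it applies only the delimiter-collision rule (real TOON has more quoting triggers, covered in section 5.3):

```python
def encode_cell(value, delimiter):
    """Quote a cell only when it contains the active delimiter.

    Sketch: real TOON quoting has additional triggers
    (colons, literals that look like numbers, etc.).
    """
    text = str(value)
    if delimiter in text:
        # Escape backslashes and quotes, then wrap in quotes.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return text

row = ["one,two", 3]
print(",".join(encode_cell(v, ",") for v in row))  # "one,two",3
print("|".join(encode_cell(v, "|") for v in row))  # one,two|3
```

Switching the delimiter away from characters that appear in the data is exactly what eliminates the quoting overhead shown above.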

Exercise 5.2.1: Delimiter Optimization Scenario

You are sending a list of customer feedback entries to an LLM. Each entry includes a feedbackText which can contain commas, and a rating (number).

JSON Data:

{
  "feedback": [
    { "id": 1, "feedbackText": "Great product, but slow delivery.", "rating": 4 },
    { "id": 2, "feedbackText": "Very happy with the quality!", "rating": 5 },
    { "id": 3, "feedbackText": "Minor bug in UI, needs fixing, otherwise okay.", "rating": 3 }
  ]
}

  1. Encode this JSON into TOON using the default comma delimiter.
  2. Encode it again using the pipe | delimiter.
  3. Compare the output. Explain why the pipe delimiter might be more token-efficient in this specific scenario.
  4. (Optional, advanced): If using Python, use tiktoken to count tokens for both versions.

5.3 In-depth Quoting Rules and Escape Sequences

While TOON aims for minimal quoting, it’s crucial to understand when and why quotes are applied deterministically. This helps in predicting token count and troubleshooting parsing issues.

Strings are quoted if they:

  • Are empty: ""
  • Have leading or trailing whitespace: " text "
  • Contain the active delimiter: "item, with, commas" (if comma is delimiter)
  • Contain a colon : (can conflict with key-value syntax)
  • Contain quotes (") or backslashes (\) themselves (requiring escaping \", \\).
  • Look like a boolean: "true", "false"
  • Look like a number: "42", "-3.14"
  • Look like null: "null"
  • Start with a list prefix: "- item"
  • Look like structural syntax: "[5]", "{key}"

Escape Sequences within quoted strings:

  • \" for a double quote
  • \\ for a backslash
  • \n for a newline
  • \r for a carriage return
  • \t for a tab

Example:

// JSON
{
  "description": "Product: \"New & Improved\" with a comma, price: $10.00",
  "status": "true",
  "count": "15"
}

// TOON (assuming comma delimiter)
description: "Product: \"New & Improved\" with a comma, price: $10.00"
status: "true"
count: "15"

Here, description is quoted due to the colon, internal quotes (escaped), and comma. status and count are quoted because their string values could be misinterpreted as boolean and number types, respectively, if unquoted.
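The rules above can be expressed as a small predicate. The following `needs_quotes` function is an illustrative sketch of the decision procedure for this section’s rule list; the TOON specification remains authoritative for edge cases:

```python
import re

def needs_quotes(s, delimiter=","):
    """Return True if a TOON string value would require quoting.

    Sketch of the quoting rules listed in section 5.3.
    """
    if s == "":
        return True                      # empty string
    if s != s.strip():
        return True                      # leading/trailing whitespace
    if delimiter in s or ":" in s:
        return True                      # conflicts with delimiter or key-value syntax
    if '"' in s or "\\" in s:
        return True                      # contains characters that need escaping
    if s in ("true", "false", "null"):
        return True                      # looks like a boolean or null literal
    if re.fullmatch(r"-?\d+(\.\d+)?([eE][+-]?\d+)?", s):
        return True                      # looks like a number
    if s.startswith("- "):
        return True                      # looks like a list prefix
    if re.fullmatch(r"\[.*\]|\{.*\}", s):
        return True                      # looks like structural syntax
    return False

print(needs_quotes("Hello World"))   # False
print(needs_quotes("true"))          # True
print(needs_quotes("123.45"))        # True
```

Running the strings from the example above through this predicate reproduces the quoting decisions shown in the TOON output.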

Exercise 5.3.1: Predict Quoting Behavior

For each of the following string values, state whether it would be quoted in TOON (assuming comma delimiter), and explain why:

  1. "Hello World"
  2. " Trimmed Text "
  3. "Value, with comma"
  4. "true"
  5. "123.45"
  6. "Field:Value"
  7. "It said \"Okay\""
  8. "" (empty string)
  9. "null"

5.4 Handling Non-ASCII Characters

TOON, like JSON, fully supports Unicode characters. When encoding data containing non-ASCII characters, they are typically represented directly. There’s no specific token optimization for Unicode characters themselves beyond the general principles of TOON. However, different LLM tokenizers handle multi-byte Unicode characters differently, sometimes splitting them into multiple tokens.

Example:

{
  "message": "こんにちは世界! (Hello World!)",
  "author": "김민준"
}

TOON:

message: こんにちは世界! (Hello World!)
author: 김민준
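One way to see why tokenizers may split such characters is to compare character counts with UTF-8 byte counts; byte-level tokenizers see the longer byte sequence. A quick illustrative check:

```python
# Compare character count vs. UTF-8 byte count for mixed-script text.
for text in ["Hello", "こんにちは世界", "김민준"]:
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {chars} chars, {utf8_bytes} UTF-8 bytes")
# "こんにちは世界" is 7 characters but 21 bytes; byte-level BPE
# tokenizers often spend multiple tokens per character on such text.
```

Exact token counts depend on the specific tokenizer, so benchmark with your model’s tokenizer rather than relying on byte counts alone.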

5.5 Best Practices for TOON in LLM Prompts

Here are strategic recommendations for effectively using TOON with LLMs:

  1. Prioritize Tabular Data: TOON’s biggest win is with uniform arrays of objects. Always identify opportunities to convert such JSON structures into TOON’s tabular format. This yields the largest token savings.
  2. Hybrid Approach is Key:
    • In your application code: Continue using JSON (or native data structures) for internal logic, APIs, and data persistence. JSON’s tooling, schema validation (JSON Schema), and universal compatibility are unparalleled here.
    • At the LLM boundary: Convert your JSON-like data to TOON just before sending it to the LLM. Convert the LLM’s TOON response back to JSON as soon as you receive it. This gives you the best of both worlds: development clarity and LLM efficiency.
  3. Provide Clear Instructions to the LLM:
    • Input Format: Clearly state that you are providing data in TOON format. Include a small “TOON Spec” in your system prompt or a few-shot example.
    • Output Format: Explicitly ask the LLM to respond in TOON format. Provide an example or a minimal TOON schema for the expected output structure. Example:
      Your response must be STRICTLY in TOON format.
      Example output structure:
      analysis[1]{summary,sentiment}:
      "Concise summary of findings","positive"
      
    • Use Delimiters/Markers: Consider adding ###START_TOON### and ###END_TOON### markers around your TOON data in the prompt to help the LLM identify the boundaries of the structured data.
  4. Benchmark Token Usage: Always measure the actual token count of your JSON and TOON data using your target LLM’s tokenizer (tiktoken for OpenAI models, or equivalent for others). Don’t guess; verify. This helps you understand real savings and identify where TOON might not be beneficial (e.g., deeply nested, non-uniform data where JSON might be more compact).
  5. Test for Accuracy: Token efficiency is great, but not at the expense of accuracy. Run tests to ensure the LLM correctly parses and reasons over your TOON-formatted input and produces accurate TOON output. TOON’s explicit structure can improve accuracy, but always validate.
  6. Cautious with Key Folding: While it saves tokens, it can reduce immediate human readability. Use it judiciously for very common, obvious structural paths.
  7. Choose Delimiters Wisely: Default to comma. If your data has many commas, try pipe. If you observe better tokenizer efficiency with tabs, use tabs. Always benchmark.

5.6 When Not to Use TOON (Revisiting)

While TOON is powerful, it’s not a silver bullet. It’s worth reconfirming the scenarios where JSON or other formats might be better:

  • Deeply Nested or Highly Non-Uniform Structures: For data with many levels of nesting or where objects within an array have vastly different keys and types (low “tabular eligibility”), TOON’s indentation overhead or fall-back to list format might offer minimal or even negative token savings compared to compact JSON.
  • Purely Tabular Data without Headers: For simple CSV-like data where you don’t even need explicit headers or array lengths, CSV might be even more compact. TOON adds a slight overhead for its structural guardrails.
  • Latency-Critical Local Deployments: Some local or quantized LLMs might have optimized parsing for JSON, meaning a compact JSON payload could be processed faster end-to-end even if it has more tokens. Always benchmark your specific setup for latency.
  • External APIs and Existing Tooling: If you’re interacting with external APIs that require JSON, or your existing data pipelines and tools are heavily invested in JSON Schema for validation and processing, stick with JSON. TOON is for the LLM boundary.

By understanding these advanced features and adhering to best practices, you can effectively leverage TOON to significantly optimize your AI applications for both cost and performance.