Understanding Rollouts and Rewards

In the Agentic Lightning framework, rollouts and rewards are the two fundamental concepts that directly drive the learning process. A clear understanding of both is essential for training and optimizing your AI agents effectively. This chapter explains what a rollout entails and, more importantly, equips you with the knowledge to design impactful reward functions.

What is a Rollout?

A rollout in Agentic Lightning refers to a single, complete execution of your LitAgent on a given AgentLightningTask. It represents an interaction sequence in which the agent processes an input, potentially takes multiple internal steps (e.g., calling tools, querying an LLM, performing reasoning), and ultimately produces an output or reaches a terminal state.

Think of it as an “episode” in reinforcement learning terms. During a rollout, Agentic Lightning (specifically, the LitAgent and any associated tracers) captures all relevant information about the agent’s behavior. This includes:

  • task: The AgentLightningTask that was provided to the agent.
  • rollout_id: A unique identifier for this specific execution trace.
  • resources: The set of shared resources (like current prompt templates, tool definitions, or model configurations) that the agent used for this particular rollout. These can change between rollouts as the Trainer updates them.
  • Agent Actions/Observations: What the agent “did” (e.g., LLM calls, tool invocations, internal thoughts).
  • Intermediate States: Any relevant data or states observed during the agent’s execution.
  • Final Output: The ultimate response or result produced by the agent.
  • reward: The scalar value indicating the agent’s performance, returned by the training_rollout method.

The training_rollout method of your LitAgent is precisely where this rollout happens. The method receives a task, rollout_id, and resources, and it is expected to return a float representing the reward for that specific execution.

The Role of LitRollout

Whether you invoke trainer.dev() or run the full training loop, each completed rollout is returned as a LitRollout object that encapsulates all of the collected information.

# rollout_capture_demo.py
import asyncio

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource, LitRollout

class ExampleAgent(LitAgent):
    async def training_rollout(
        self, task: AgentLightningTask, rollout_id: str, resources: dict[str, AgentResource]
    ) -> float:
        # Simulate some agent logic
        print(f"Agent {rollout_id} processing task: {task.context}")
        await asyncio.sleep(0.1) # Simulate work
        
        # Simple reward: 1.0 if 'success' is in context, else 0.0
        reward = 1.0 if "success" in task.context.lower() else 0.0
        print(f"Agent {rollout_id} finished with reward: {reward}")
        return reward

async def demo_rollout_capture():
    trainer = Trainer(n_workers=1)
    agent = ExampleAgent()

    task_success = AgentLightningTask(name="Task 1", context="Please achieve success in this task.")
    task_fail = AgentLightningTask(name="Task 2", context="Please handle this failure scenario.")

    print("\n--- Running a successful rollout ---")
    rollout_s = await trainer.dev(agent=agent, task=task_success, resources={})
    print("\n--- Captured LitRollout for success ---")
    print(f"Rollout ID: {rollout_s.rollout_id}")
    print(f"Final Reward: {rollout_s.final_reward}")
    print(f"Start Time: {rollout_s.start_time}")
    print(f"End Time: {rollout_s.end_time}")
    print(f"Duration: {rollout_s.end_time - rollout_s.start_time:.4f} seconds")
    # print(f"Traces: {rollout_s.traces}") # More on this later, when we configure tracing

    print("\n--- Running a failed rollout ---")
    rollout_f = await trainer.dev(agent=agent, task=task_fail, resources={})
    print("\n--- Captured LitRollout for failure ---")
    print(f"Rollout ID: {rollout_f.rollout_id}")
    print(f"Final Reward: {rollout_f.final_reward}")

if __name__ == "__main__":
    asyncio.run(demo_rollout_capture())

Expected Output (similar to):

--- Running a successful rollout ---
Agent iter_0_task_0 processing task: Please achieve success in this task.
Agent iter_0_task_0 finished with reward: 1.0

--- Captured LitRollout for success ---
Rollout ID: iter_0_task_0
Final Reward: 1.0
Start Time: 2025-11-06 17:30:00.123456
End Time: 2025-11-06 17:30:00.234567
Duration: 0.1111 seconds

--- Running a failed rollout ---
Agent iter_0_task_0 processing task: Please handle this failure scenario.
Agent iter_0_task_0 finished with reward: 0.0

--- Captured LitRollout for failure ---
Rollout ID: iter_0_task_0
Final Reward: 0.0

The Art of Reward Design

The reward is the single most critical feedback signal that drives the optimization process. It’s how the Trainer (and the underlying algorithms like Reinforcement Learning or Prompt Optimization) understands whether the agent’s behavior was “good” or “bad” for a given task. A well-designed reward function is essential for successful agent training; a poorly designed one can lead to agents learning unintended or suboptimal behaviors (reward hacking).

Principles of Good Reward Design:

  1. Alignment with Task Objective: The reward must accurately reflect the true goal of the task. If the goal is to summarize accurately, the reward should penalize inaccuracies and reward precision.
  2. Granularity (Sparse vs. Dense):
    • Sparse Rewards: Only given at the very end of a task (e.g., 1 for success, 0 for failure). This can be challenging for agents to learn from, especially in long, multi-step tasks.
    • Dense Rewards: Provided at intermediate steps or offering partial credit. These are generally preferred because they give more frequent feedback, guiding the agent’s learning more effectively (see the sketch after this list). Agentic Lightning’s support for tracers (covered in the next chapter) can facilitate dense reward calculation by capturing intermediate steps.
  3. Verifiability/Objectivity: Rewards should ideally be objectively verifiable. Avoid subjective rewards that are hard to consistently quantify. Can you programmatically determine if a reward should be given?
  4. Resistance to Reward Hacking: Design rewards so the agent cannot exploit flaws in the reward function to earn high scores without actually achieving the desired behavior (e.g., an agent that just outputs “Yes” repeatedly earns a high reward if the reward simply checks for the word “Yes”).
  5. Robustness to Imperfections: The reward function should be somewhat robust to minor variations in agent output or environmental noise.
  6. Simplicity (initially): Start with simpler reward functions and gradually increase complexity as your agent’s capabilities grow and you understand the learning dynamics.
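
To make the sparse-versus-dense distinction concrete, here is a small, self-contained sketch (not part of the Agentic Lightning API; the helper names and step counts are hypothetical) that scores the same multi-step task two ways:

# reward_granularity_sketch.py -- illustrative only; helper names are hypothetical

def sparse_reward(final_answer: str, expected: str) -> float:
    """All-or-nothing: feedback arrives only at the very end of the episode."""
    return 1.0 if final_answer.strip() == expected.strip() else 0.0

def dense_reward(steps_completed: int, total_steps: int,
                 final_answer: str, expected: str) -> float:
    """Partial credit for intermediate progress plus credit for the final answer."""
    progress = steps_completed / max(total_steps, 1)   # fraction of steps finished
    correct = 1.0 if final_answer.strip() == expected.strip() else 0.0
    return 0.5 * progress + 0.5 * correct              # still bounded to [0, 1]

# An agent that completes 3 of 4 steps but misses the final answer earns 0.375
# under the dense scheme; the sparse scheme returns 0.0 and gives it nothing to learn from.
print(sparse_reward("41", "42"))       # 0.0
print(dense_reward(3, 4, "41", "42"))  # 0.375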

Types of Rewards:

  • Binary (0 or 1): Simplest, for clear success/failure tasks.
    • Example: If agent correctly answers a math question, reward = 1.0; else 0.0.
  • Scalar (continuous range): For tasks with degrees of success.
    • Example: For a summarization task, reward = ROUGE score (0 to 1). For a factual QA task, reward = 1.0 if correct, 0.5 if partially correct, 0.0 if wrong.
  • Negative Rewards/Penalties: To discourage undesirable actions.
    • Example: -0.1 for each unnecessary tool call. -1.0 for generating offensive content.
  • Combination Rewards: Summing different reward components.
    • Example: Reward = (0.7 * Correctness) + (0.2 * Conciseness) - (0.1 * Latency), as sketched below.
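
A minimal sketch of that combination reward appears below. The weights, the 10-second latency budget, and the component names are illustrative assumptions; in practice each component would come from your own scoring logic (an exact-match check, a length heuristic, a measured latency, and so on):

# combination_reward_sketch.py -- weights and normalization are illustrative assumptions

def combination_reward(correctness: float, conciseness: float, latency_seconds: float) -> float:
    """Blend several signals into a single scalar reward.

    correctness and conciseness are expected in [0, 1]; latency is normalized
    against a 10-second budget and capped at 1.0 before being penalized.
    """
    latency_penalty = min(latency_seconds / 10.0, 1.0)
    return 0.7 * correctness + 0.2 * conciseness - 0.1 * latency_penalty

# A correct, concise, fast answer scores near the 0.9 maximum...
print(f"{combination_reward(1.0, 0.9, 1.2):.3f}")   # 0.868
# ...while a correct but verbose, slow answer lands noticeably lower.
print(f"{combination_reward(1.0, 0.2, 12.0):.3f}")  # 0.640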

Designing Reward Functions: Practical Examples

Let’s refine our SimpleMathAgent and create a more sophisticated reward function.

Scenario: Math Problem Solver Agent

Our agent needs to solve arithmetic problems.

  • Input: A math problem as a string (e.g., “What is 15 + 7?”).
  • Output: The numerical answer.
  • Goal: Provide the correct answer.

Initial Reward (Binary):

# simple_math_agent_rewards.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource

# Mock LLM for consistency
async def mock_llm_calculate(question: str) -> str:
    """A mock LLM that extracts and calculates basic arithmetic."""
    question_lower = question.lower()
    # Simple addition/subtraction
    if "add" in question_lower or "+" in question_lower:
        numbers = re.findall(r'\d+', question_lower)
        if len(numbers) >= 2:
            try:
                result = sum(int(n) for n in numbers)
                return f"The answer is {result}"
            except ValueError:
                pass
    if "subtract" in question_lower or "-" in question_lower:
        numbers = re.findall(r'\d+', question_lower)
        if len(numbers) >= 2:
            try:
                result = int(numbers[0]) - int(numbers[1])
                return f"The answer is {result}"
            except ValueError:
                pass
    return "I cannot calculate that."


class MathSolverAgent(LitAgent):
    """
    A LitAgent designed to solve math problems with a binary reward.
    """
    async def training_rollout(
        self,
        task: AgentLightningTask,
        rollout_id: str,
        resources: dict[str, AgentResource],
    ) -> float:
        print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")

        # Parse question and expected answer from task context
        # Format: "Question: <math_problem> (Expected: <answer>)"
        match = re.search(r"Question: (.*?)\s*\(Expected:\s*(\-?\d+)\)", task.context, re.IGNORECASE)
        if not match:
            print(f"[{rollout_id}] Error: Task context format invalid. Expected '(Expected: <number>)'")
            return 0.0

        question = match.group(1).strip()
        expected_answer = int(match.group(2))

        # Agent uses mock LLM to get answer
        llm_response = await mock_llm_calculate(question)
        print(f"[{rollout_id}] LLM response: {llm_response}")

        # Extract actual answer from LLM response
        actual_answer_match = re.search(r"the answer is (-?\d+)", llm_response, re.IGNORECASE)
        if actual_answer_match:
            actual_answer = int(actual_answer_match.group(1))
            if actual_answer == expected_answer:
                print(f"[{rollout_id}] CORRECT! Expected: {expected_answer}, Actual: {actual_answer}")
                return 1.0
            else:
                print(f"[{rollout_id}] INCORRECT! Expected: {expected_answer}, Actual: {actual_answer}")
                return 0.0
        else:
            print(f"[{rollout_id}] Could not parse numerical answer from LLM response.")
            return 0.0 # No numerical answer found

async def main_math_solver():
    trainer = Trainer(n_workers=1)
    agent = MathSolverAgent()

    tasks = [
        AgentLightningTask(name="Addition Test 1", context="Question: What is 5 + 3? (Expected: 8)"),
        AgentLightningTask(name="Subtraction Test 1", context="Question: Calculate 10 - 4. (Expected: 6)"),
        AgentLightningTask(name="Addition Test 2", context="Question: Add 12 and 18. (Expected: 30)"),
        AgentLightningTask(name="Incorrect Test 1", context="Question: What is 7 + 2? (Expected: 10)"), # Incorrect expectation
        AgentLightningTask(name="Parsing Fail", context="Question: Tell me a joke. (Expected: 0)"), # Expect 0 due to parsing
    ]

    print("\n--- Testing MathSolverAgent with Binary Rewards ---")
    for task in tasks:
        rollout_result = await trainer.dev(agent=agent, task=task, resources={})
        print(f"Task '{task.name}' Reward: {rollout_result.final_reward}\n")

if __name__ == "__main__":
    asyncio.run(main_math_solver())

To run this example:

  1. Save the code as simple_math_agent_rewards.py.
  2. Run: python simple_math_agent_rewards.py

Exercise 1: Adding Partial Credit

Modify MathSolverAgent’s training_rollout to incorporate partial credit:

  1. If the agent’s response cannot be parsed for a number, but the LLM response explicitly states “I cannot calculate that,” and the expected_answer is 0 (indicating it’s an unanswerable question), give a reward of 0.5. This encourages the agent to recognize its limitations.
  2. Keep 1.0 for correct numeric answers and 0.0 for incorrect numeric answers.
  3. Update the tasks list to include a task that tests this new 0.5 reward scenario (e.g., “Question: What is the meaning of life? (Expected: 0)”).
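
One possible shape for that grading logic, shown only as a hedged, standalone sketch (your solution can differ; the function name is hypothetical, and the surrounding MathSolverAgent code stays as-is):

# partial_credit_sketch.py -- one possible approach, not the only valid solution
import re

def grade_response(llm_response: str, expected_answer: int) -> float:
    """1.0 for a correct number, 0.5 for a recognized 'cannot calculate', else 0.0."""
    match = re.search(r"the answer is (-?\d+)", llm_response, re.IGNORECASE)
    if match:
        return 1.0 if int(match.group(1)) == expected_answer else 0.0
    if "cannot calculate" in llm_response.lower() and expected_answer == 0:
        return 0.5  # the agent correctly recognized an unanswerable question
    return 0.0

print(grade_response("The answer is 8", 8))            # 1.0
print(grade_response("I cannot calculate that.", 0))   # 0.5
print(grade_response("I cannot calculate that.", 7))   # 0.0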

Scenario: Information Retrieval Agent

Our agent retrieves information from a knowledge base (simulated).

  • Input: A query (e.g., “Who invented the light bulb?”).
  • Output: A factual answer.
  • Goal: Provide the correct factual answer.

Reward with Keyword Matching and Length Penalty:

For factual questions, we might want not just correctness but also conciseness.

# info_retrieval_agent_rewards.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource

# Mock Knowledge Base / LLM for factual questions
async def mock_kb_query(query: str) -> str:
    """A mock knowledge base lookup."""
    query_lower = query.lower()
    if "light bulb" in query_lower and ("invented" in query_lower or "who" in query_lower):
        return "Thomas Edison patented a commercially viable incandescent light bulb."
    if "internet" in query_lower and ("invented" in query_lower or "who" in query_lower):
        return "The internet was developed by many pioneers, including Vinton Cerf and Robert Kahn, who designed the TCP/IP protocols."
    return "I don't have enough information to answer that."


class InfoRetrievalAgent(LitAgent):
    """
    A LitAgent for information retrieval with a reward that balances correctness and conciseness.
    """
    async def training_rollout(
        self,
        task: AgentLightningTask,
        rollout_id: str,
        resources: dict[str, AgentResource],
    ) -> float:
        print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")

        # Parse question, expected answer, and ideal length from task context
        # Format: "Question: <query> (Expected: <answer_keywords>) (Ideal Length: <num_words>)"
        match = re.search(r"Question: (.*?)\s*\(Expected:\s*(.*?)\)\s*(?: \(Ideal Length:\s*(\d+)\))?", task.context, re.IGNORECASE)
        if not match:
            print(f"[{rollout_id}] Error: Task context format invalid.")
            return 0.0

        question = match.group(1).strip()
        expected_keywords = [kw.strip().lower() for kw in match.group(2).split(',') if kw.strip()]
        ideal_length_str = match.group(3)
        ideal_length = int(ideal_length_str) if ideal_length_str else 50 # Default ideal length

        # Agent queries mock KB
        agent_answer = await mock_kb_query(question)
        print(f"[{rollout_id}] Agent's answer: {agent_answer}")

        # --- Reward Calculation ---
        correctness_reward = 0.0
        output_lower = agent_answer.lower()
        
        # Check for presence of all expected keywords
        all_keywords_present = True
        for kw in expected_keywords:
            if kw not in output_lower:
                all_keywords_present = False
                break
        
        if all_keywords_present and len(expected_keywords) > 0:
            correctness_reward = 1.0
            print(f"[{rollout_id}] All keywords present: {expected_keywords}")
        elif len(expected_keywords) == 0 and "don't have enough" in output_lower:
            # No keywords expected and the agent admitted it couldn't answer; that's the desired behavior.
            correctness_reward = 0.8
            print(f"[{rollout_id}] Agent correctly identified inability to answer. (No keywords expected)")
        else:
            # Partial credit for some keywords, if not all are present
            present_keywords = [kw for kw in expected_keywords if kw in output_lower]
            if len(expected_keywords) > 0:
                correctness_reward = len(present_keywords) / len(expected_keywords) * 0.7 # Max 0.7 for partial
                print(f"[{rollout_id}] Partial keywords present. Found: {present_keywords}. Correctness: {correctness_reward:.2f}")

        # Length penalty/bonus
        num_words = len(agent_answer.split())
        length_factor = 1.0
        if num_words > ideal_length * 1.5: # Too long
            length_factor = 0.5
            print(f"[{rollout_id}] Output too long ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")
        elif num_words < ideal_length * 0.5: # Too short
            length_factor = 0.7
            print(f"[{rollout_id}] Output too short ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")
        else: # Just right
            length_factor = 1.2 # Small bonus for ideal length
            print(f"[{rollout_id}] Output length is good ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")

        # Combine rewards
        final_reward = correctness_reward * 0.7 + (length_factor * 0.3) # Weights
        print(f"[{rollout_id}] Final Reward: {final_reward:.2f} (Correctness: {correctness_reward:.2f}, Length Factor: {length_factor:.2f})")
        
        return final_reward

async def main_info_retrieval():
    trainer = Trainer(n_workers=1)
    agent = InfoRetrievalAgent()

    tasks = [
        AgentLightningTask(name="Light Bulb Inventor", context="Question: Who invented the light bulb? (Expected: Edison, patented, incandescent) (Ideal Length: 10)"),
        AgentLightningTask(name="Internet Origin", context="Question: Who developed the internet? (Expected: Vinton Cerf, Robert Kahn, TCP/IP) (Ideal Length: 15)"),
        AgentLightningTask(name="Unanswerable Query", context="Question: What color is happiness? (Expected: ) (Ideal Length: 5)"), # No keywords expected, testing 'cannot answer'
        AgentLightningTask(name="Long Answer", context="Question: Who invented the light bulb? (Expected: Edison, patented) (Ideal Length: 2)"), # Too short ideal length
    ]

    print("\n--- Testing InfoRetrievalAgent with Combined Rewards ---")
    for task in tasks:
        rollout_result = await trainer.dev(agent=agent, task=task, resources={})
        print(f"Task '{task.name}' Final Reward: {rollout_result.final_reward:.2f}\n")

if __name__ == "__main__":
    asyncio.run(main_info_retrieval())

To run this example:

  1. Save the code as info_retrieval_agent_rewards.py.
  2. Run: python info_retrieval_agent_rewards.py

Exercise 2: Incorporating Negative Reward for Specific Keywords

Modify InfoRetrievalAgent’s training_rollout to:

  1. Introduce a negative penalty (-0.2) if the agent’s answer includes specific “forbidden” keywords (e.g., if the topic is “AI Ethics” but the answer mentions “Skynet” or “robot rebellion”).
  2. Update the tasks list to include a task where a forbidden keyword might appear.
  3. Ensure the final reward can go below zero if significant penalties are incurred.

Leveraging AgentResource for Reward Optimization

The resources dictionary passed to training_rollout is not just for the agent; it can also be used to pass information to your reward function itself! This is particularly powerful for Automatic Prompt Optimization (APO) or scenarios where reward criteria might change.

Imagine a reward function that needs a dynamically updated list of “negative keywords” to penalize. This list could be an AgentResource.

# dynamic_reward_agent.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource

# Mock LLM (simple echo for this example)
async def mock_llm_echo(prompt: str) -> str:
    """A mock LLM that just echoes a response based on the prompt."""
    if "negative_word_test" in prompt:
        return "This response contains a forbidden word like 'bad'."
    return f"Processed: {prompt}"

class DynamicRewardAgent(LitAgent):
    """
    An agent with a reward function that can be influenced by AgentResources.
    """
    async def training_rollout(
        self,
        task: AgentLightningTask,
        rollout_id: str,
        resources: dict[str, AgentResource],
    ) -> float:
        print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")

        # Example: An optimizable prompt for the LLM could be in resources
        system_prompt = "You are a helpful assistant."
        if "system_prompt" in resources:
            system_prompt = resources["system_prompt"].value
            print(f"[{rollout_id}] Using dynamic system prompt: {system_prompt}")
        
        agent_input = f"{system_prompt}\nUser: {task.context}"
        agent_output = await mock_llm_echo(agent_input)
        print(f"[{rollout_id}] Agent output: {agent_output}")

        # --- Dynamic Reward Calculation ---
        reward = 1.0 # Base reward

        # Check for forbidden keywords from resources
        forbidden_keywords = []
        if "forbidden_keywords" in resources and isinstance(resources["forbidden_keywords"].value, list):
            forbidden_keywords = [kw.lower() for kw in resources["forbidden_keywords"].value]
            
        output_lower = agent_output.lower()
        for keyword in forbidden_keywords:
            if keyword in output_lower:
                reward -= 0.5 # Penalty for forbidden keyword
                print(f"[{rollout_id}] Penalty: Found forbidden keyword '{keyword}' in output.")

        # Check for desired keywords from task context
        desired_match = re.search(r"Desired Keywords: (.*?)(?:$|\))", task.context)
        if desired_match:
            desired_keywords = [kw.strip().lower() for kw in desired_match.group(1).split(',') if kw.strip()]
            for keyword in desired_keywords:
                if keyword not in output_lower:
                    reward -= 0.2 # Penalty for missing desired keyword
                    print(f"[{rollout_id}] Penalty: Missing desired keyword '{keyword}' in output.")
        
        # Ensure reward doesn't go below a certain threshold
        final_reward = max(0.0, reward)
        print(f"[{rollout_id}] Final Reward: {final_reward:.2f}")
        return final_reward

async def main_dynamic_reward():
    trainer = Trainer(n_workers=1)
    agent = DynamicRewardAgent()

    # Define tasks
    tasks = [
        AgentLightningTask(name="Good Response", context="Respond positively. Desired Keywords: positive, great"),
        AgentLightningTask(name="Forbidden Test", context="Produce a negative_word_test. Desired Keywords: test"), # Will trigger forbidden word in LLM
    ]

    # Define resources (this would typically be managed by the Trainer/Optimizer)
    # Resource 1: A system prompt
    system_prompt_resource = AgentResource(
        name="system_prompt",
        value="You are an overly positive assistant."
    )
    # Resource 2: A list of forbidden keywords
    forbidden_keywords_resource = AgentResource(
        name="forbidden_keywords",
        value=["bad", "negative", "fail"]
    )

    all_resources = {
        system_prompt_resource.name: system_prompt_resource,
        forbidden_keywords_resource.name: forbidden_keywords_resource,
    }

    print("\n--- Testing DynamicRewardAgent ---")
    for task in tasks:
        rollout_result = await trainer.dev(agent=agent, task=task, resources=all_resources)
        print(f"Task '{task.name}' Final Reward: {rollout_result.final_reward:.2f}\n")

if __name__ == "__main__":
    asyncio.run(main_dynamic_reward())

To run this example:

  1. Save the code as dynamic_reward_agent.py.
  2. Run: python dynamic_reward_agent.py

Exercise 3: Optimizing Reward Thresholds

In the DynamicRewardAgent:

  1. Introduce another AgentResource called "min_output_length" with an integer value.
  2. Add a penalty to the reward if the agent’s output word count is less than min_output_length.
  3. Design a task and a corresponding resource value to test this new penalty.
  4. Discuss how a Trainer leveraging Automatic Prompt Optimization might update this min_output_length resource to find an optimal balance between conciseness and content.

Conclusion

Designing effective reward functions is an iterative process that often requires experimentation and a deep understanding of your agent’s task and desired behavior. Agentic Lightning provides the framework to systematically test and optimize agents using these reward signals. As you become more comfortable, you’ll find that crafting precise and informative rewards is the key to unlocking truly intelligent and performant AI agents.

In the next chapter, we will delve into the advanced optimization algorithms that Agentic Lightning supports, which leverage these rollouts and rewards to iteratively improve your agent’s performance.