Understanding Rollouts and Rewards
In the Agentic Lightening framework, rollouts and rewards are two of the most fundamental concepts that directly drive the learning process. Without a clear understanding of these, you cannot effectively train and optimize your AI agents. This chapter will demystify what a rollout entails and, more importantly, equip you with the knowledge to design impactful reward functions.
What is a Rollout?
A rollout in Agentic Lightening refers to a single, complete execution of your LitAgent on a given AgentLightningTask. It represents an interaction sequence where the agent processes an input, potentially takes multiple internal steps (e.g., calling tools, querying an LLM, performing reasoning), and ultimately produces an output or reaches a terminal state.
Think of it as an “episode” in reinforcement learning terms. During a rollout, Agentic Lightening (specifically, the LitAgent and any associated tracers) captures all relevant information about the agent’s behavior. This includes:
- task: The AgentLightningTask that was provided to the agent.
- rollout_id: A unique identifier for this specific execution trace.
- resources: The set of shared resources (like current prompt templates, tool definitions, or model configurations) that the agent used for this particular rollout. These can change between rollouts as the Trainer updates them.
- Agent Actions/Observations: What the agent “did” (e.g., LLM calls, tool invocations, internal thoughts).
- Intermediate States: Any relevant data or states observed during the agent’s execution.
- Final Output: The ultimate response or result produced by the agent.
- reward: The scalar value indicating the agent’s performance, returned by the training_rollout method.
The training_rollout method of your LitAgent is precisely where this rollout happens. The method receives a task, rollout_id, and resources, and it is expected to return a float representing the reward for that specific execution.
The Role of LitRollout
When trainer.dev() or the actual training loop completes a rollout, it returns a LitRollout object. This object encapsulates all the collected information.
# Simplified example: capturing a rollout with trainer.dev()
import asyncio

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource, LitRollout
class ExampleAgent(LitAgent):
async def training_rollout(
self, task: AgentLightningTask, rollout_id: str, resources: dict[str, AgentResource]
) -> float:
# Simulate some agent logic
print(f"Agent {rollout_id} processing task: {task.context}")
await asyncio.sleep(0.1) # Simulate work
# Simple reward: 1.0 if 'success' is in context, else 0.0
reward = 1.0 if "success" in task.context.lower() else 0.0
print(f"Agent {rollout_id} finished with reward: {reward}")
return reward
async def demo_rollout_capture():
trainer = Trainer(n_workers=1)
agent = ExampleAgent()
task_success = AgentLightningTask(name="Task 1", context="Please achieve success in this task.")
task_fail = AgentLightningTask(name="Task 2", context="Please handle this failure scenario.")
print("\n--- Running a successful rollout ---")
rollout_s = await trainer.dev(agent=agent, task=task_success, resources={})
print("\n--- Captured LitRollout for success ---")
print(f"Rollout ID: {rollout_s.rollout_id}")
print(f"Final Reward: {rollout_s.final_reward}")
print(f"Start Time: {rollout_s.start_time}")
print(f"End Time: {rollout_s.end_time}")
print(f"Duration: {rollout_s.end_time - rollout_s.start_time:.4f} seconds")
# print(f"Traces: {rollout_s.traces}") # More on this later, when we configure tracing
print("\n--- Running a failed rollout ---")
rollout_f = await trainer.dev(agent=agent, task=task_fail, resources={})
print("\n--- Captured LitRollout for failure ---")
print(f"Rollout ID: {rollout_f.rollout_id}")
print(f"Final Reward: {rollout_f.final_reward}")
if __name__ == "__main__":
    asyncio.run(demo_rollout_capture())
Expected Output (similar to):
--- Running a successful rollout ---
Agent iter_0_task_0 processing task: Please achieve success in this task.
Agent iter_0_task_0 finished with reward: 1.0
--- Captured LitRollout for success ---
Rollout ID: iter_0_task_0
Final Reward: 1.0
Start Time: 2025-11-06 17:30:00.123456
End Time: 2025-11-06 17:30:00.234567
Duration: 0.1111 seconds
--- Running a failed rollout ---
Agent iter_0_task_0 processing task: Please handle this failure scenario.
Agent iter_0_task_0 finished with reward: 0.0
--- Captured LitRollout for failure ---
Rollout ID: iter_0_task_0
Final Reward: 0.0
The Art of Reward Design
The reward is the single most critical feedback signal that drives the optimization process. It’s how the Trainer (and the underlying algorithms like Reinforcement Learning or Prompt Optimization) understands whether the agent’s behavior was “good” or “bad” for a given task. A well-designed reward function is essential for successful agent training; a poorly designed one can lead to agents learning unintended or suboptimal behaviors (reward hacking).
Principles of Good Reward Design:
- Alignment with Task Objective: The reward must accurately reflect the true goal of the task. If the goal is to summarize accurately, the reward should penalize inaccuracies and reward precision.
- Sparsity and Granularity:
- Sparse Rewards: Only given at the very end of a task (e.g., 1 for success, 0 for failure). This can be challenging for agents to learn from, especially in long, multi-step tasks.
- Dense Rewards: Provided at intermediate steps or offer partial credit. These are generally preferred as they offer more frequent feedback, guiding the agent’s learning more effectively. Agentic Lightening’s support for tracers (covered in the next chapter) can facilitate dense reward calculation by capturing intermediate steps.
- Verifiability/Objectivity: Rewards should ideally be objectively verifiable. Avoid subjective rewards that are hard to consistently quantify. Can you programmatically determine if a reward should be given?
- No Reward Hacking: Design rewards so the agent cannot exploit flaws in the reward function to score highly without actually achieving the desired behavior (e.g., an agent that just outputs “Yes” repeatedly gets a high reward if the reward simply checks for the word “Yes”). A short sketch of this pitfall follows this list.
- Robustness to Imperfections: The reward function should be somewhat robust to minor variations in agent output or environmental noise.
- Simplicity (initially): Start with simpler reward functions and gradually increase complexity as your agent’s capabilities grow and you understand the learning dynamics.
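As a quick illustration of the reward-hacking pitfall above, here is a minimal, hedged sketch contrasting a naive check (which an agent can game by always answering “Yes”) with a stricter exact-match check. The function names, strings, and thresholds are illustrative assumptions, not part of the Agentic Lightening API.

```python
# Naive reward: hackable, because any output containing "yes" scores 1.0.
def naive_reward(output: str) -> float:
    return 1.0 if "yes" in output.lower() else 0.0

# Stricter reward: requires the answer to match the expected label exactly.
def stricter_reward(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

print(naive_reward("Yes yes yes"))            # 1.0: exploits the naive check
print(stricter_reward("Yes yes yes", "Yes"))  # 0.0: gaming no longer pays
```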
Types of Rewards:
- Binary (0 or 1): Simplest, for clear success/failure tasks.
- Example: If agent correctly answers a math question, reward = 1.0; else 0.0.
- Scalar (continuous range): For tasks with degrees of success.
- Example: For a summarization task, reward = ROUGE score (0 to 1). For a factual QA, reward = 1.0 if correct, 0.5 if partially correct, 0.0 if wrong.
- Negative Rewards/Penalties: To discourage undesirable actions.
- Example: -0.1 for each unnecessary tool call. -1.0 for generating offensive content.
- Combination Rewards: Summing different reward components.
- Example: Reward = (0.7 * Correctness) + (0.2 * Conciseness) - (0.1 * Latency). A minimal sketch of this pattern follows this list.
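To make the combination pattern concrete, here is a hedged sketch of a weighted reward helper. The component scores, weights, and the latency cap are illustrative assumptions; in practice you would compute correctness and conciseness from the agent’s actual output.

```python
def combined_reward(correctness: float, conciseness: float, latency_s: float) -> float:
    """Blend several signals into one scalar reward.

    Assumes correctness and conciseness are normalized to [0, 1];
    latency_s is the rollout's wall-clock time in seconds.
    """
    # Weighted sum: reward quality most, reward brevity a little, penalize slowness.
    return 0.7 * correctness + 0.2 * conciseness - 0.1 * min(latency_s, 1.0)

# Example: a correct, fairly concise answer that took 0.4 seconds.
print(round(combined_reward(correctness=1.0, conciseness=0.8, latency_s=0.4), 2))  # 0.82
```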
Designing Reward Functions: Practical Examples
Let’s refine our SimpleMathAgent and create a more sophisticated reward function.
Scenario: Math Problem Solver Agent
Our agent needs to solve arithmetic problems.
- Input: A math problem as a string (e.g., “What is 15 + 7?”).
- Output: The numerical answer.
- Goal: Provide the correct answer.
Initial Reward (Binary):
# simple_math_agent_rewards.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource
# Mock LLM for consistency
async def mock_llm_calculate(question: str) -> str:
"""A mock LLM that extracts and calculates basic arithmetic."""
question_lower = question.lower()
# Simple addition/subtraction
if "add" in question_lower or "+" in question_lower:
numbers = re.findall(r'\d+', question_lower)
if len(numbers) >= 2:
try:
result = sum(int(n) for n in numbers)
return f"The answer is {result}"
except ValueError:
pass
if "subtract" in question_lower or "-" in question_lower:
numbers = re.findall(r'\d+', question_lower)
if len(numbers) >= 2:
try:
result = int(numbers[0]) - int(numbers[1])
return f"The answer is {result}"
except ValueError:
pass
return "I cannot calculate that."
class MathSolverAgent(LitAgent):
"""
A LitAgent designed to solve math problems with a binary reward.
"""
async def training_rollout(
self,
task: AgentLightningTask,
rollout_id: str,
resources: dict[str, AgentResource],
) -> float:
print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")
# Parse question and expected answer from task context
# Format: "Question: <math_problem> (Expected: <answer>)"
match = re.search(r"Question: (.*?)\s*\(Expected:\s*(\-?\d+)\)", task.context, re.IGNORECASE)
if not match:
print(f"[{rollout_id}] Error: Task context format invalid. Expected '(Expected: <number>)'")
return 0.0
question = match.group(1).strip()
expected_answer = int(match.group(2))
# Agent uses mock LLM to get answer
llm_response = await mock_llm_calculate(question)
print(f"[{rollout_id}] LLM response: {llm_response}")
# Extract actual answer from LLM response
actual_answer_match = re.search(r"the answer is (-?\d+)", llm_response, re.IGNORECASE)
if actual_answer_match:
actual_answer = int(actual_answer_match.group(1))
if actual_answer == expected_answer:
print(f"[{rollout_id}] CORRECT! Expected: {expected_answer}, Actual: {actual_answer}")
return 1.0
else:
print(f"[{rollout_id}] INCORRECT! Expected: {expected_answer}, Actual: {actual_answer}")
return 0.0
else:
print(f"[{rollout_id}] Could not parse numerical answer from LLM response.")
return 0.0 # No numerical answer found
async def main_math_solver():
trainer = Trainer(n_workers=1)
agent = MathSolverAgent()
tasks = [
AgentLightningTask(name="Addition Test 1", context="Question: What is 5 + 3? (Expected: 8)"),
AgentLightningTask(name="Subtraction Test 1", context="Question: Calculate 10 - 4. (Expected: 6)"),
AgentLightningTask(name="Addition Test 2", context="Question: Add 12 and 18. (Expected: 30)"),
AgentLightningTask(name="Incorrect Test 1", context="Question: What is 7 + 2? (Expected: 10)"), # Incorrect expectation
AgentLightningTask(name="Parsing Fail", context="Question: Tell me a joke. (Expected: 0)"), # Expect 0 due to parsing
]
print("\n--- Testing MathSolverAgent with Binary Rewards ---")
for i, task in enumerate(tasks):
rollout_result = await trainer.dev(agent=agent, task=task, resources={})
print(f"Task '{task.name}' Reward: {rollout_result.final_reward}\n")
if __name__ == "__main__":
    asyncio.run(main_math_solver())
To run this example:
- Save the code as simple_math_agent_rewards.py.
- Run: python simple_math_agent_rewards.py
Exercise 1: Adding Partial Credit
Modify MathSolverAgent’s training_rollout to incorporate partial credit:
- If the agent’s response cannot be parsed for a number, but the LLM response explicitly states "I cannot calculate that," and the expected_answer is 0 (indicating it’s an unanswerable question), give a reward of 0.5. This encourages the agent to recognize its limitations.
- Keep 1.0 for correct numeric answers and 0.0 for incorrect numeric answers.
- Update the tasks list to include a task that tests this new 0.5 reward scenario (e.g., "Question: What is the meaning of life? (Expected: 0)"). One possible shape of this logic is sketched after this list.
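The sketch below shows one way the Exercise 1 reward logic could look, pulled out into a standalone helper so it can be tested in isolation. The helper name and the exact strings it matches ("the answer is", "cannot calculate") are assumptions based on the mock LLM above, not a prescribed solution.

```python
import re

def partial_credit_reward(llm_response: str, expected_answer: int) -> float:
    """1.0 for a correct numeric answer, 0.0 for an incorrect one, and 0.5
    when the agent admits it cannot answer an unanswerable question
    (signalled here by expected_answer == 0)."""
    match = re.search(r"the answer is (-?\d+)", llm_response, re.IGNORECASE)
    if match:
        return 1.0 if int(match.group(1)) == expected_answer else 0.0
    if "cannot calculate" in llm_response.lower() and expected_answer == 0:
        return 0.5  # partial credit for recognizing its limitation
    return 0.0

print(partial_credit_reward("The answer is 8", 8))           # 1.0
print(partial_credit_reward("The answer is 7", 8))           # 0.0
print(partial_credit_reward("I cannot calculate that.", 0))  # 0.5
```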
Scenario: Information Retrieval Agent
Our agent retrieves information from a knowledge base (simulated).
- Input: A query (e.g., “Who invented the light bulb?”).
- Output: A factual answer.
- Goal: Provide the correct factual answer.
Reward with Keyword Matching and Length Penalty:
For factual questions, we might want not just correctness but also conciseness.
# info_retrieval_agent_rewards.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource
# Mock Knowledge Base / LLM for factual questions
async def mock_kb_query(query: str) -> str:
"""A mock knowledge base lookup."""
query_lower = query.lower()
if "light bulb" in query_lower and ("invented" in query_lower or "who" in query_lower):
return "Thomas Edison patented a commercially viable incandescent light bulb."
if "internet" in query_lower and ("invented" in query_lower or "who" in query_lower):
return "The internet was developed by many pioneers, including Vinton Cerf and Robert Kahn, who designed the TCP/IP protocols."
return "I don't have enough information to answer that."
class InfoRetrievalAgent(LitAgent):
"""
A LitAgent for information retrieval with a reward that balances correctness and conciseness.
"""
async def training_rollout(
self,
task: AgentLightningTask,
rollout_id: str,
resources: dict[str, AgentResource],
) -> float:
print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")
# Parse question, expected answer, and ideal length from task context
# Format: "Question: <query> (Expected: <answer_keywords>) (Ideal Length: <num_words>)"
match = re.search(r"Question: (.*?)\s*\(Expected:\s*(.*?)\)\s*(?: \(Ideal Length:\s*(\d+)\))?", task.context, re.IGNORECASE)
if not match:
print(f"[{rollout_id}] Error: Task context format invalid.")
return 0.0
question = match.group(1).strip()
expected_keywords = [kw.strip().lower() for kw in match.group(2).split(',') if kw.strip()]
ideal_length_str = match.group(3)
ideal_length = int(ideal_length_str) if ideal_length_str else 50 # Default ideal length
# Agent queries mock KB
agent_answer = await mock_kb_query(question)
print(f"[{rollout_id}] Agent's answer: {agent_answer}")
# --- Reward Calculation ---
correctness_reward = 0.0
output_lower = agent_answer.lower()
# Check for presence of all expected keywords
all_keywords_present = True
for kw in expected_keywords:
if kw not in output_lower:
all_keywords_present = False
break
if all_keywords_present and len(expected_keywords) > 0:
correctness_reward = 1.0
print(f"[{rollout_id}] All keywords present: {expected_keywords}")
elif len(expected_keywords) == 0 and "don't have enough" in output_lower: # If no keywords expected and it says it can't answer, that's good.
correctness_reward = 0.8
print(f"[{rollout_id}] Agent correctly identified inability to answer. (No keywords expected)")
else:
# Partial credit for some keywords, if not all are present
present_keywords = [kw for kw in expected_keywords if kw in output_lower]
if len(expected_keywords) > 0:
correctness_reward = len(present_keywords) / len(expected_keywords) * 0.7 # Max 0.7 for partial
print(f"[{rollout_id}] Partial keywords present. Found: {present_keywords}. Correctness: {correctness_reward:.2f}")
# Length penalty/bonus
num_words = len(agent_answer.split())
length_factor = 1.0
if num_words > ideal_length * 1.5: # Too long
length_factor = 0.5
print(f"[{rollout_id}] Output too long ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")
elif num_words < ideal_length * 0.5: # Too short
length_factor = 0.7
print(f"[{rollout_id}] Output too short ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")
else: # Just right
length_factor = 1.2 # Small bonus for ideal length
print(f"[{rollout_id}] Output length is good ({num_words} words), ideal: {ideal_length}. Length factor: {length_factor:.2f}")
# Combine rewards
final_reward = correctness_reward * 0.7 + (length_factor * 0.3) # Weights
print(f"[{rollout_id}] Final Reward: {final_reward:.2f} (Correctness: {correctness_reward:.2f}, Length Factor: {length_factor:.2f})")
return final_reward
async def main_info_retrieval():
trainer = Trainer(n_workers=1)
agent = InfoRetrievalAgent()
tasks = [
AgentLightningTask(name="Light Bulb Inventor", context="Question: Who invented the light bulb? (Expected: Edison, patented, incandescent) (Ideal Length: 10)"),
AgentLightningTask(name="Internet Origin", context="Question: Who developed the internet? (Expected: Vinton Cerf, Robert Kahn, TCP/IP) (Ideal Length: 15)"),
AgentLightningTask(name="Unanswerable Query", context="Question: What color is happiness? (Expected: ) (Ideal Length: 5)"), # No keywords expected, testing 'cannot answer'
AgentLightningTask(name="Long Answer", context="Question: Who invented the light bulb? (Expected: Edison, patented) (Ideal Length: 2)"), # Too short ideal length
]
print("\n--- Testing InfoRetrievalAgent with Combined Rewards ---")
for i, task in enumerate(tasks):
rollout_result = await trainer.dev(agent=agent, task=task, resources={})
print(f"Task '{task.name}' Final Reward: {rollout_result.final_reward:.2f}\n")
if __name__ == "__main__":
    asyncio.run(main_info_retrieval())
To run this example:
- Save the code as info_retrieval_agent_rewards.py.
- Run: python info_retrieval_agent_rewards.py
Exercise 2: Incorporating Negative Reward for Specific Keywords
Modify InfoRetrievalAgent’s training_rollout to:
- Introduce a negative penalty (-0.2) if the agent’s answer includes specific "forbidden" keywords (e.g., if the topic is "AI Ethics" but the answer mentions "Skynet" or "robot rebellion").
- Update the tasks list to include a task where a forbidden keyword might appear.
- Ensure the final reward can go below zero if significant penalties are incurred. A sketch of this penalty logic follows this list.
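As a hedged sketch of the penalty described in Exercise 2, the helper below subtracts 0.2 per forbidden keyword and deliberately does not clamp at zero. The helper name, penalty value, and keyword list are illustrative assumptions.

```python
def penalized_reward(base_reward: float, answer: str, forbidden: list[str]) -> float:
    """Subtract 0.2 for each forbidden keyword found in the answer.

    Not clamped at zero, so heavy violations can drive the reward negative,
    as the exercise requires.
    """
    answer_lower = answer.lower()
    hits = [kw for kw in forbidden if kw.lower() in answer_lower]
    return base_reward - 0.2 * len(hits)

forbidden = ["skynet", "robot rebellion"]
print(penalized_reward(1.0, "AI ethics is about accountability.", forbidden))    # 1.0
print(penalized_reward(0.2, "Fear Skynet and the robot rebellion!", forbidden))  # -0.2
```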
Leveraging AgentResource for Reward Optimization
The resources dictionary passed to training_rollout is not just for the agent; it can also be used to pass information to your reward function itself! This is particularly powerful for Automatic Prompt Optimization (APO) or scenarios where reward criteria might change.
Imagine a reward function that needs a dynamically updated list of “negative keywords” to penalize. This list could be an AgentResource.
# dynamic_reward_agent.py
import asyncio
import re

from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.types import AgentLightningTask, AgentResource
# Mock LLM (simple echo for this example)
async def mock_llm_echo(prompt: str) -> str:
"""A mock LLM that just echoes a response based on the prompt."""
if "negative_word_test" in prompt:
return "This response contains a forbidden word like 'bad'."
return f"Processed: {prompt}"
class DynamicRewardAgent(LitAgent):
"""
An agent with a reward function that can be influenced by AgentResources.
"""
async def training_rollout(
self,
task: AgentLightningTask,
rollout_id: str,
resources: dict[str, AgentResource],
) -> float:
print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")
# Example: An optimizable prompt for the LLM could be in resources
system_prompt = "You are a helpful assistant."
if "system_prompt" in resources:
system_prompt = resources["system_prompt"].value
print(f"[{rollout_id}] Using dynamic system prompt: {system_prompt}")
agent_input = f"{system_prompt}\nUser: {task.context}"
agent_output = await mock_llm_echo(agent_input)
print(f"[{rollout_id}] Agent output: {agent_output}")
# --- Dynamic Reward Calculation ---
reward = 1.0 # Base reward
# Check for forbidden keywords from resources
forbidden_keywords = []
if "forbidden_keywords" in resources and isinstance(resources["forbidden_keywords"].value, list):
forbidden_keywords = [kw.lower() for kw in resources["forbidden_keywords"].value]
output_lower = agent_output.lower()
for keyword in forbidden_keywords:
if keyword in output_lower:
reward -= 0.5 # Penalty for forbidden keyword
print(f"[{rollout_id}] Penalty: Found forbidden keyword '{keyword}' in output.")
# Check for desired keywords from task context
desired_match = re.search(r"Desired Keywords: (.*?)(?:$|\))", task.context)
if desired_match:
desired_keywords = [kw.strip().lower() for kw in desired_match.group(1).split(',') if kw.strip()]
for keyword in desired_keywords:
if keyword not in output_lower:
reward -= 0.2 # Penalty for missing desired keyword
print(f"[{rollout_id}] Penalty: Missing desired keyword '{keyword}' in output.")
# Ensure reward doesn't go below a certain threshold
final_reward = max(0.0, reward)
print(f"[{rollout_id}] Final Reward: {final_reward:.2f}")
return final_reward
async def main_dynamic_reward():
trainer = Trainer(n_workers=1)
agent = DynamicRewardAgent()
# Define tasks
tasks = [
AgentLightningTask(name="Good Response", context="Respond positively. Desired Keywords: positive, great"),
AgentLightningTask(name="Forbidden Test", context="Produce a negative_word_test. Desired Keywords: test"), # Will trigger forbidden word in LLM
]
# Define resources (this would typically be managed by the Trainer/Optimizer)
# Resource 1: A system prompt
system_prompt_resource = AgentResource(
name="system_prompt",
value="You are an overly positive assistant."
)
# Resource 2: A list of forbidden keywords
forbidden_keywords_resource = AgentResource(
name="forbidden_keywords",
value=["bad", "negative", "fail"]
)
all_resources = {
system_prompt_resource.name: system_prompt_resource,
forbidden_keywords_resource.name: forbidden_keywords_resource,
}
print("\n--- Testing DynamicRewardAgent ---")
for i, task in enumerate(tasks):
rollout_result = await trainer.dev(agent=agent, task=task, resources=all_resources)
print(f"Task '{task.name}' Final Reward: {rollout_result.final_reward:.2f}\n")
if __name__ == "__main__":
    asyncio.run(main_dynamic_reward())
To run this example:
- Save the code as dynamic_reward_agent.py.
- Run: python dynamic_reward_agent.py
Exercise 3: Optimizing Reward Thresholds
In the DynamicRewardAgent:
- Introduce another AgentResource called "min_output_length" with an integer value.
- Add a penalty to the reward if the agent’s output word count is less than min_output_length.
- Design a task and a corresponding resource value to test this new penalty.
- Discuss how a Trainer leveraging Automatic Prompt Optimization might update this min_output_length resource to find an optimal balance between conciseness and content. A sketch of the length check follows this list.
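A hedged sketch of the length check is shown below as a standalone helper; the helper name and the 0.3 penalty value are assumptions, not framework defaults. Inside training_rollout you would read the minimum from resources["min_output_length"].value (the same access pattern used for the other resources above) and subtract the returned penalty from the reward.

```python
def length_penalty(agent_output: str, min_output_length: int, penalty: float = 0.3) -> float:
    """Return a penalty to subtract from the reward when the output has
    fewer words than the minimum carried by the resource."""
    num_words = len(agent_output.split())
    return penalty if num_words < min_output_length else 0.0

# With min_output_length = 8, a five-word answer is penalized; a longer one is not.
print(length_penalty("Thomas Edison patented the bulb.", min_output_length=8))  # 0.3
print(length_penalty("Thomas Edison patented a commercially viable incandescent light bulb.", 8))  # 0.0
```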
Conclusion
Designing effective reward functions is an iterative process that often requires experimentation and a deep understanding of your agent’s task and desired behavior. Agentic Lightening provides the framework to systematically test and optimize agents using these reward signals. As you become more comfortable, you’ll find that crafting precise and informative rewards is the key to unlocking truly intelligent and performant AI agents.
In the next chapter, we will delve into the various advanced optimization algorithms that Agentic Lightening supports, which leverage these rollouts and rewards to iteratively improve your agent’s performance.