Project 1: Optimizing a Basic QA Agent with Prompt Tuning
This project will guide you through building a simple Question-Answering (QA) agent and then using Agentic Lightening to optimize its performance through Automatic Prompt Optimization (APO). This is a classic example of how Agentic Lightening can iteratively refine an agent’s behavior by adjusting its interaction with an LLM, without needing to fine-tune the LLM itself.
Clear Objective: To create a QA agent that can accurately answer factual questions and optimize its performance by dynamically tuning its system prompt.
Problem Statement: Our initial QA agent uses a generic prompt, leading to inconsistent or sometimes incorrect answers. We want to use Agentic Lightening to discover a better system prompt that improves answer accuracy for a given set of questions.
Project Structure
We’ll break this project into manageable steps:
- Define the Base QA Agent: Implement the core logic of our LitAgent that interacts with a (mock) LLM.
- Define the Task Dataset: Create a set of AgentLightningTask objects with factual questions and their ground truth answers.
- Implement the Reward Function: Design a reward function that evaluates the agent's answer against the ground truth.
- Set Up the APO Optimizer: Create a basic optimizer that proposes new prompt variations.
- Run the Training Loop: Execute the Trainer to orchestrate the optimization process.
Step 1: Define the Base QA Agent
Our QA agent will take a question, send it to a mock LLM with a system prompt, and return the LLM’s answer. The system prompt will be our optimizable resource.
Create a new directory for this project, e.g., agentic_qa_project. Inside, create a file named qa_agent.py:
# qa_agent.py
import asyncio
import re
from agentlightning.litagent import LitAgent
from agentlightning.types import AgentLightningTask, AgentResource
# --- Mock LLM for Factual QA ---
# This mock LLM simulates varying performance based on the prompt.
# A good prompt will yield better answers.
async def mock_factual_qa_llm(system_prompt: str, question: str) -> str:
await asyncio.sleep(0.1) # Simulate LLM call latency
# Simulate prompt effectiveness
if "fact checker" in system_prompt.lower() and "precise" in system_prompt.lower():
# High-quality prompt leads to better answers
if "capital of france" in question.lower():
return "The capital of France is Paris."
if "largest ocean" in question.lower():
return "The Pacific Ocean is the largest ocean on Earth."
if "invented telephone" in question.lower():
return "Alexander Graham Bell is widely credited with inventing the telephone."
if "highest mountain" in question.lower():
return "Mount Everest is the highest mountain in the world."
# Default or less effective prompt
if "capital of france" in question.lower():
return "Paris is known as the capital of France." # Slightly less precise
if "largest ocean" in question.lower():
return "I think the Pacific is the biggest ocean." # Less confident
return "I'm not sure about that specific fact." # Default fallback
class FactualQAAgent(LitAgent):
"""
A LitAgent that answers factual questions using a system prompt,
which will be optimized by Agentic Lightening.
"""
async def training_rollout(
self,
task: AgentLightningTask,
rollout_id: str,
resources: dict[str, AgentResource],
) -> float:
print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")
# Extract the question from the task context
question_match = re.search(r"Question: (.*?)(?:\s*\(Expected:\s*(.*?)\))?$", task.context, re.IGNORECASE)
if not question_match:
print(f"[{rollout_id}] Error: Could not parse question from task context.")
return 0.0
question = question_match.group(1).strip()
expected_keywords_str = question_match.group(2)
expected_keywords = [kw.strip().lower() for kw in expected_keywords_str.split(',') if kw.strip()] if expected_keywords_str else []
# Get the current system prompt from resources, or use a default if none provided
current_system_prompt = "You are a helpful assistant."
if "qa_system_prompt" in resources:
current_system_prompt = resources["qa_system_prompt"].value
print(f"[{rollout_id}] Using System Prompt: '{current_system_prompt}'")
# Call the mock LLM
agent_answer = await mock_factual_qa_llm(current_system_prompt, question)
print(f"[{rollout_id}] Agent's Answer: '{agent_answer}'")
# --- Reward Calculation (Step 3) ---
reward = self._calculate_reward(agent_answer, expected_keywords)
print(f"[{rollout_id}] Final Reward: {reward:.2f}")
return reward
def _calculate_reward(self, agent_answer: str, expected_keywords: list[str]) -> float:
"""
Calculates a reward based on keyword presence in the agent's answer.
"""
if not expected_keywords:
return 0.0 # Cannot evaluate without expected keywords
agent_answer_lower = agent_answer.lower()
score = 0.0
for keyword in expected_keywords:
if keyword in agent_answer_lower:
score += 1.0
# Full reward: 1.0 if every expected keyword is present
if score == len(expected_keywords):
return 1.0
# Give partial credit for missing some keywords
if len(expected_keywords) > 0:
return score / len(expected_keywords) * 0.5 # Max 0.5 for partial matches
return 0.0 # Fallback
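Before moving on, it can be useful to exercise the agent once outside of any trainer. The following optional snippet is a minimal sketch that assumes only the classes shown above; with an empty resources dict the agent falls back to its default system prompt.

```python
# smoke_test.py -- optional sanity check, not part of the project files.
import asyncio

from agentlightning.types import AgentLightningTask
from qa_agent import FactualQAAgent


async def main():
    agent = FactualQAAgent()
    task = AgentLightningTask(
        name="Capital Question",
        context="Question: What is the capital of France? (Expected: Paris, capital, France)",
    )
    # No resources are passed, so the agent uses its default system prompt.
    reward = await agent.training_rollout(task, rollout_id="smoke-test", resources={})
    print(f"Smoke-test reward: {reward:.2f}")


if __name__ == "__main__":
    asyncio.run(main())
```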
Step 2: Define the Task Dataset
We need a set of questions with their corresponding expected answer keywords to evaluate our agent. We will store these in a list of AgentLightningTask objects.
Add the following task list to a new file named qa_tasks.py in the same directory:
# qa_tasks.py
from agentlightning.types import AgentLightningTask
qa_tasks = [
AgentLightningTask(name="Capital Question", context="Question: What is the capital of France? (Expected: Paris, capital, France)"),
AgentLightningTask(name="Ocean Question", context="Question: What is the largest ocean on Earth? (Expected: Pacific, largest, ocean)"),
AgentLightningTask(name="Telephone Question", context="Question: Who invented the telephone? (Expected: Alexander Graham Bell, telephone, invented)"),
AgentLightningTask(name="Mountain Question", context="Question: What is the highest mountain in the world? (Expected: Mount Everest, highest, mountain)"),
]
# A task where the LLM might struggle with a basic prompt
qa_tasks_challenging = [
AgentLightningTask(name="Challenging Q1", context="Question: Which country is known as the 'Land of the Rising Sun'? (Expected: Japan)"), # Our mock LLM doesn't know this one
]
all_qa_tasks = qa_tasks + qa_tasks_challenging
Step 3: Implement the Reward Function
The _calculate_reward method is already defined in FactualQAAgent in qa_agent.py. It scores the agent's answer by how many of the expected_keywords appear in it: a reward of 1.0 is given when every keyword is present, and partial credit capped at 0.5 when only some are found.
- Self-Correction/Verification: Review the _calculate_reward method. Does it accurately reflect what "good performance" means for your QA agent? Is the partial credit fair? A short worked example follows below.
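To see what those numbers look like in practice, here is a small illustrative check (not one of the project files) that calls _calculate_reward directly on two answers produced by the mock LLM:

```python
# reward_check.py -- illustrative only; reuses _calculate_reward from qa_agent.py.
from qa_agent import FactualQAAgent

agent = FactualQAAgent()
keywords = ["pacific", "largest", "ocean"]

# All three keywords are present -> full reward of 1.0.
print(agent._calculate_reward("The Pacific Ocean is the largest ocean on Earth.", keywords))

# "largest" is missing -> 2 of 3 keywords -> 2/3 * 0.5 ≈ 0.33 partial credit.
print(agent._calculate_reward("I think the Pacific is the biggest ocean.", keywords))
```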
Step 4: Set Up the APO Optimizer
We’ll create a simple mock APO optimizer. In a real scenario, this would involve using an LLM to generate prompt variations, or a more sophisticated search algorithm. For this project, our mock optimizer will cycle through a predefined list of prompt candidates.
Create a new file named apo_optimizer.py in the same directory:
# apo_optimizer.py
from agentlightning.types import AgentResource, LitRollout
class SimplePromptOptimizer:
"""
A mock APO optimizer that cycles through a list of predefined system prompts
and keeps track of the best one based on average reward.
"""
def __init__(self, prompt_candidates: list[str]):
self.prompt_candidates = prompt_candidates
self.current_candidate_index = 0
self.best_prompt = prompt_candidates[0]
self.best_avg_reward = -1.0
self.iteration_rewards = {} # Store rewards for each prompt candidate
print(f"APO Optimizer initialized with {len(self.prompt_candidates)} candidates.")
async def optimize_step(self, rollout_results: list[LitRollout], resources_version: str) -> dict:
"""
Evaluates the current prompt's performance and proposes the next one.
"""
if not rollout_results:
print("APO Optimizer: No rollouts received for this step.")
return {"version": resources_version, "resources": {}}
current_prompt_used = "Default" # Get the prompt that was used for these rollouts
if "qa_system_prompt" in rollout_results[0].resources:
current_prompt_used = rollout_results[0].resources["qa_system_prompt"].value
avg_reward = sum(r.final_reward for r in rollout_results) / len(rollout_results)
print(f"\nAPO Optimizer: Prompt '{current_prompt_used}' (Index: {self.current_candidate_index}) resulted in Avg Reward: {avg_reward:.2f}")
# Store the reward for the current prompt candidate
self.iteration_rewards[current_prompt_used] = avg_reward
# Update best prompt if current one is better
if avg_reward > self.best_avg_reward:
self.best_avg_reward = avg_reward
self.best_prompt = current_prompt_used
print(f"APO Optimizer: New best prompt: '{self.best_prompt}' (Avg Reward: {self.best_avg_reward:.2f})")
self.current_candidate_index += 1
# Propose the next prompt candidate, or the best one if all are evaluated
if self.current_candidate_index < len(self.prompt_candidates):
next_prompt = self.prompt_candidates[self.current_candidate_index]
print(f"APO Optimizer: Proposing next prompt: '{next_prompt}'")
return {
"version": f"v_prompt_{self.current_candidate_index}",
"resources": {
"qa_system_prompt": AgentResource(name="qa_system_prompt", value=next_prompt),
}
}
else:
print(f"APO Optimizer: All candidates evaluated. Sticking with best prompt: '{self.best_prompt}'")
return {
"version": f"v_prompt_final",
"resources": {
"qa_system_prompt": AgentResource(name="qa_system_prompt", value=self.best_prompt),
}
}
Prompt Candidates:
Here are some prompt candidates for our optimizer to try. The optimizer will cycle through these, providing them as AgentResource objects to our FactualQAAgent.
# Define in your main training script or import as a constant
prompt_candidates = [
"You are a helpful assistant.",
"Answer factual questions precisely.",
"You are a highly accurate fact checker. Provide concise and precise answers.",
"Given a question, extract key facts and provide a definitive answer.",
"You are an expert encyclopedia. Answer all questions with utmost accuracy."
]
Step 5: Run the Training Loop
Now, we’ll combine all the pieces: the agent, tasks, and optimizer within a main training script. We’ll simulate the Trainer.fit loop using trainer.dev for simplicity, showing how the optimizer updates the prompt resources.
Create a file named run_qa_optimization.py in the same directory:
# run_qa_optimization.py
import asyncio
from agentlightning.trainer import Trainer
from agentlightning.types import AgentResource
from qa_agent import FactualQAAgent
from qa_tasks import all_qa_tasks
from apo_optimizer import SimplePromptOptimizer
# --- Define Prompt Candidates for Optimization ---
prompt_candidates = [
"You are a helpful assistant.",
"Answer factual questions precisely.",
"You are a highly accurate fact checker. Provide concise and precise answers.",
"Given a question, extract key facts and provide a definitive answer.",
"You are an expert encyclopedia. Answer all questions with utmost accuracy."
]
async def main_qa_optimization():
# Make sure you have the AgentLightningServer running in a separate terminal:
# agentlightning server start --host 0.0.0.0 --port 8000
backend_url = "http://localhost:8000" # Not used directly by trainer.dev(), but relevant for a full server setup
num_workers = 1 # For trainer.dev, N workers means N rollouts sequentially
trainer = Trainer(n_workers=num_workers)
qa_agent = FactualQAAgent()
optimizer = SimplePromptOptimizer(prompt_candidates)
current_resources = {} # Resources dictionary to pass to the agent
print("--- Starting QA Agent Optimization with APO ---")
# Simulate epochs where the optimizer proposes new prompts
# We will run for len(prompt_candidates) + 1 epochs to evaluate all candidates and the final best.
for epoch in range(len(prompt_candidates) + 1):
print(f"\n========== Optimization Epoch {epoch + 1} ==========")
# Current prompt resource for this epoch's rollouts
current_prompt_resource = current_resources.get("qa_system_prompt")
prompt_value_for_epoch = current_prompt_resource.value if current_prompt_resource else "Default"
print(f"Agent will use prompt: '{prompt_value_for_epoch}'")
collected_rollouts = []
# For each epoch, run the agent against all defined tasks
for i, task in enumerate(all_qa_tasks):
print(f"\n Running task {i+1}/{len(all_qa_tasks)}: {task.name}")
# Each trainer.dev call simulates one rollout, passing current resources
rollout_result = await trainer.dev(
agent=qa_agent,
task=task,
resources=current_resources # Pass the resources to the agent
)
collected_rollouts.append(rollout_result)
# After all rollouts for the current prompt, the optimizer evaluates and proposes next
updated_trainer_state = await optimizer.optimize_step(collected_rollouts, f"v_epoch_{epoch}")
# Update current_resources with the new prompt proposed by the optimizer
new_resources_from_optimizer = updated_trainer_state.get("resources", {})
if "qa_system_prompt" in new_resources_from_optimizer:
current_resources["qa_system_prompt"] = new_resources_from_optimizer["qa_system_prompt"]
else:
# If optimizer is done, ensure the agent uses the final best prompt
current_resources["qa_system_prompt"] = AgentResource(
name="qa_system_prompt",
value=optimizer.best_prompt # Directly use the best_prompt from optimizer
)
# Check if optimizer is finished proposing new prompts
if optimizer.current_candidate_index > len(prompt_candidates) and epoch >= len(prompt_candidates):
print("\nAll prompt candidates evaluated.")
break # Exit the loop after evaluating all candidates and the final best
print("\n--- QA Agent Optimization Completed ---")
print(f"Final Optimal Prompt: '{optimizer.best_prompt}'")
print("\nReward history per prompt:")
for prompt, reward in optimizer.iteration_rewards.items():
print(f" '{prompt}': {reward:.2f}")
if __name__ == "__main__":
# Optional: Start AgentLightningServer in a separate terminal.
# While trainer.dev() doesn't strictly require it, for a full setup
# and to prepare for future projects using actual worker dispatch,
# it's good practice to have it running.
# agentlightning server start --host 0.0.0.0 --port 8000
asyncio.run(main_qa_optimization())
To Run Project 1:
- Create Project Directory: Create a folder named agentic_qa_project.
- Save Files: Save qa_agent.py, qa_tasks.py, apo_optimizer.py, and run_qa_optimization.py into this directory.
- Activate Environment: Ensure your Agentic Lightening virtual environment is active.
- (Optional but Recommended) Start Server: In a separate terminal, start the AgentLightningServer: agentlightning server start --host 0.0.0.0 --port 8000
- Run Optimization: In your primary terminal (in the agentic_qa_project directory, with the environment active): python run_qa_optimization.py
Expected Output:
You will see output for each epoch, showing:
- Which prompt the agent is currently using.
- The agent’s execution for each task, including its answer and reward.
- The SimplePromptOptimizer evaluating the average reward for the current prompt.
- The optimizer proposing the next prompt candidate.
- Finally, a summary of the best prompt found and its reward.
You should observe that the prompt "You are a highly accurate fact checker. Provide concise and precise answers." yields the highest average reward: it is the only candidate containing both "fact checker" and "precise", the phrases our mock LLM checks for before returning its high-quality answers. The other candidates, including the initial generic prompt, fall through to the less precise responses, demonstrating the power of APO.
Exercises/Mini-Challenges for Project 1:
Enhance Reward Function:
- Modify the _calculate_reward method in FactualQAAgent to give a small penalty (-0.1) for answers that are excessively long (e.g., more than 20 words), since the ideal answer is usually short.
- Consider how to introduce more sophisticated string matching (e.g., fuzzy matching or embedding similarity) for partial credit, rather than just keyword presence. A sketch of both ideas follows this list.
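As a starting point, here is a minimal sketch of both ideas using only the standard library (difflib for fuzzy matching). The 0.85 similarity threshold and the 20-word cutoff are arbitrary choices for illustration:

```python
# Sketch of an enhanced reward: length penalty + fuzzy keyword matching.
# Could replace _calculate_reward in FactualQAAgent; thresholds are illustrative.
from difflib import SequenceMatcher


def calculate_reward_v2(agent_answer: str, expected_keywords: list[str]) -> float:
    if not expected_keywords:
        return 0.0
    answer = agent_answer.lower()
    words = answer.split()

    def fuzzy_hit(keyword: str) -> bool:
        # Exact substring match, or any answer word that is "close enough".
        if keyword in answer:
            return True
        return any(SequenceMatcher(None, keyword, w).ratio() >= 0.85 for w in words)

    hits = sum(1 for kw in expected_keywords if fuzzy_hit(kw))
    reward = 1.0 if hits == len(expected_keywords) else hits / len(expected_keywords) * 0.5

    # Small penalty for excessively long answers (ideal answers are short).
    if len(words) > 20:
        reward -= 0.1
    return max(reward, 0.0)
```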
More Sophisticated Prompt Generation (Conceptual):
- (Advanced) Instead of a predefined list of prompt_candidates, imagine SimplePromptOptimizer uses an LLM (e.g., GPT-3.5) to generate new prompt variations based on the performance of previous prompts. For example, if a prompt performs poorly, the LLM could be prompted: "The previous prompt resulted in low accuracy. Suggest 3 improvements to this prompt to make it more effective for factual QA: [previous_prompt]". A minimal sketch of this idea appears after this list.
- This would make your APO truly "automatic" in its generation of candidates.
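A minimal sketch of that idea is shown below. llm_complete is a hypothetical placeholder for whatever LLM client you use (it is not part of Agentic Lightening), and the parsing of the response is an assumption about its format:

```python
# Sketch of LLM-driven prompt mutation. `llm_complete` is a hypothetical
# stand-in for your own LLM client; replace it with a real call.
async def llm_complete(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here.")


async def propose_improved_prompts(previous_prompt: str, avg_reward: float) -> list[str]:
    """Ask an LLM for improved prompt candidates when the last one scored poorly."""
    meta_prompt = (
        f"The previous prompt resulted in low accuracy (avg reward {avg_reward:.2f}). "
        "Suggest 3 improvements to this prompt to make it more effective for factual QA: "
        f"{previous_prompt}"
    )
    response = await llm_complete(meta_prompt)
    # Assume the LLM returns one suggestion per line; keep non-empty lines.
    return [line.strip("-. ").strip() for line in response.splitlines() if line.strip()]
```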
Dynamic Task Selection:
- Modify run_qa_optimization.py to only run tasks that the current best prompt performed poorly on, or tasks that are deemed "harder" based on an initial evaluation. This focuses the training effort where it is most needed. (You would need to track a "difficulty" or "failure_count" per task, either on your AgentLightningTask objects or alongside them.) A sketch of one way to do this follows.
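One way to sketch this, without changing AgentLightningTask at all, is to track rewards in a separate dict keyed by task name (an assumption of this sketch) and filter the task list each epoch:

```python
# Sketch of dynamic task selection for run_qa_optimization.py.
last_rewards: dict[str, float] = {}  # task.name -> most recent reward


def select_tasks(tasks, threshold: float = 0.99):
    """Return only the tasks the agent has not yet solved (reward below threshold)."""
    return [t for t in tasks if last_rewards.get(t.name, 0.0) < threshold]


# Inside the epoch loop, replace `for i, task in enumerate(all_qa_tasks):` with:
#     tasks_this_epoch = select_tasks(all_qa_tasks) or all_qa_tasks
#     for i, task in enumerate(tasks_this_epoch):
#         ...
#         last_rewards[task.name] = rollout_result.final_reward
```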
This project provides a foundational understanding of how to use Agentic Lightening for practical agent optimization. In the next project, we’ll explore integrating an existing LangChain agent and leveraging Reinforcement Learning.