Project 1: Optimizing a Basic QA Agent with Prompt Tuning

This project will guide you through building a simple Question-Answering (QA) agent and then using Agent Lightning to optimize its performance through Automatic Prompt Optimization (APO). This is a classic example of how Agent Lightning can iteratively refine an agent’s behavior by adjusting its interaction with an LLM, without needing to fine-tune the LLM itself.

Clear Objective: To create a QA agent that can accurately answer factual questions and optimize its performance by dynamically tuning its system prompt.

Problem Statement: Our initial QA agent uses a generic prompt, leading to inconsistent or sometimes incorrect answers. We want to use Agent Lightning to discover a better system prompt that improves answer accuracy for a given set of questions.

Project Structure

We’ll break this project into manageable steps:

  1. Define the Base QA Agent: Implement the core logic of our LitAgent that interacts with a (mock) LLM.
  2. Define the Task Dataset: Create a set of AgentLightningTask objects with factual questions and their ground truth answers.
  3. Implement the Reward Function: Design a reward function that evaluates the agent’s answer against the ground truth.
  4. Set Up the APO Optimizer: Create a basic optimizer that proposes new prompt variations.
  5. Run the Training Loop: Execute the Trainer to orchestrate the optimization process.

Step 1: Define the Base QA Agent

Our QA agent will take a question, send it to a mock LLM with a system prompt, and return the LLM’s answer. The system prompt will be our optimizable resource.

Create a new directory for this project, e.g., agentic_qa_project. Inside, create a file named qa_agent.py:

# qa_agent.py
import asyncio
import re
from agentlightning.litagent import LitAgent
from agentlightning.types import AgentLightningTask, AgentResource

# --- Mock LLM for Factual QA ---
# This mock LLM simulates varying performance based on the prompt.
# A good prompt will yield better answers.
async def mock_factual_qa_llm(system_prompt: str, question: str) -> str:
    await asyncio.sleep(0.1) # Simulate LLM call latency
    
    # Simulate prompt effectiveness
    if "fact checker" in system_prompt.lower() and "precise" in system_prompt.lower():
        # High-quality prompt leads to better answers
        if "capital of france" in question.lower():
            return "The capital of France is Paris."
        if "largest ocean" in question.lower():
            return "The Pacific Ocean is the largest ocean on Earth."
        if "invented telephone" in question.lower():
            return "Alexander Graham Bell is widely credited with inventing the telephone."
        if "highest mountain" in question.lower():
            return "Mount Everest is the highest mountain in the world."
    
    # Default or less effective prompt
    if "capital of france" in question.lower():
        return "Paris is known as the capital of France." # Slightly less precise
    if "largest ocean" in question.lower():
        return "I think the Pacific is the biggest ocean." # Less confident
    return "I'm not sure about that specific fact." # Default fallback

class FactualQAAgent(LitAgent):
    """
    A LitAgent that answers factual questions using a system prompt,
    which will be optimized by Agent Lightning.
    """
    async def training_rollout(
        self,
        task: AgentLightningTask,
        rollout_id: str,
        resources: dict[str, AgentResource],
    ) -> float:
        print(f"[{rollout_id}] Agent received task: {task.name} - '{task.context}'")

        # Extract the question from the task context
        question_match = re.search(r"Question: (.*?)(?:\s*\(Expected:\s*(.*?)\))?$", task.context, re.IGNORECASE)
        if not question_match:
            print(f"[{rollout_id}] Error: Could not parse question from task context.")
            return 0.0
        
        question = question_match.group(1).strip()
        expected_keywords_str = question_match.group(2)
        expected_keywords = [kw.strip().lower() for kw in expected_keywords_str.split(',') if kw.strip()] if expected_keywords_str else []

        # Get the current system prompt from resources, or use a default if none provided
        current_system_prompt = "You are a helpful assistant."
        if "qa_system_prompt" in resources:
            current_system_prompt = resources["qa_system_prompt"].value
        
        print(f"[{rollout_id}] Using System Prompt: '{current_system_prompt}'")

        # Call the mock LLM
        agent_answer = await mock_factual_qa_llm(current_system_prompt, question)
        print(f"[{rollout_id}] Agent's Answer: '{agent_answer}'")

        # --- Reward Calculation (Step 3) ---
        reward = self._calculate_reward(agent_answer, expected_keywords)
        
        print(f"[{rollout_id}] Final Reward: {reward:.2f}")
        return reward

    def _calculate_reward(self, agent_answer: str, expected_keywords: list[str]) -> float:
        """
        Calculates a reward based on keyword presence in the agent's answer.
        """
        if not expected_keywords:
            return 0.0 # Cannot evaluate without expected keywords

        agent_answer_lower = agent_answer.lower()
        score = 0.0
        for keyword in expected_keywords:
            if keyword in agent_answer_lower:
                score += 1.0
        
        # Full reward if every expected keyword is present
        if score == len(expected_keywords):
            return 1.0

        # Otherwise, partial credit proportional to keyword coverage, capped at 0.5
        return score / len(expected_keywords) * 0.5
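
Before wiring in the optimizer, you can sanity-check the agent in isolation by running a single rollout directly. The snippet below is a minimal sketch that reuses the task format and resource name defined above; the file name and rollout_id value are arbitrary:

# sanity_check.py (optional helper, not one of the project files)
import asyncio
from agentlightning.types import AgentLightningTask, AgentResource
from qa_agent import FactualQAAgent

async def main():
    agent = FactualQAAgent()
    task = AgentLightningTask(
        name="Capital Question",
        context="Question: What is the capital of France? (Expected: Paris, capital, France)",
    )
    resources = {
        "qa_system_prompt": AgentResource(
            name="qa_system_prompt",
            value="You are a highly accurate fact checker. Provide concise and precise answers.",
        )
    }
    reward = await agent.training_rollout(task, rollout_id="dev-0", resources=resources)
    print(f"Standalone rollout reward: {reward:.2f}")

if __name__ == "__main__":
    asyncio.run(main())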

Step 2: Define the Task Dataset

We need a set of questions with their corresponding expected answer keywords to evaluate our agent. We will store these in a list of AgentLightningTask objects.

Add the following task list to a new file named qa_tasks.py in the same directory:

# qa_tasks.py
from agentlightning.types import AgentLightningTask

qa_tasks = [
    AgentLightningTask(name="Capital Question", context="Question: What is the capital of France? (Expected: Paris, capital, France)"),
    AgentLightningTask(name="Ocean Question", context="Question: What is the largest ocean on Earth? (Expected: Pacific, largest, ocean)"),
    AgentLightningTask(name="Telephone Question", context="Question: Who invented the telephone? (Expected: Alexander Graham Bell, telephone, invented)"),
    AgentLightningTask(name="Mountain Question", context="Question: What is the highest mountain in the world? (Expected: Mount Everest, highest, mountain)"),
]

# A task where the LLM might struggle with a basic prompt
qa_tasks_challenging = [
    AgentLightningTask(name="Challenging Q1", context="Question: Which country is known as the 'Land of the Rising Sun'? (Expected: Japan)"), # Our mock LLM doesn't know this one
]

all_qa_tasks = qa_tasks + qa_tasks_challenging
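
The "Question: … (Expected: …)" convention in each context string is exactly what the regex in training_rollout parses. If you add many tasks, a small helper (a sketch using the same AgentLightningTask type already imported above) keeps the format consistent:

# Optional helper for qa_tasks.py: builds a task in the expected context format.
def make_qa_task(name: str, question: str, keywords: list[str]) -> AgentLightningTask:
    context = f"Question: {question} (Expected: {', '.join(keywords)})"
    return AgentLightningTask(name=name, context=context)

# Example:
# make_qa_task("Capital Question", "What is the capital of France?", ["Paris", "capital", "France"])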

Step 3: Implement the Reward Function

The _calculate_reward method is already defined in FactualQAAgent in qa_agent.py. It scores the agent’s answer by how many of the expected_keywords it contains: a reward of 1.0 when every keyword is present, and partial credit capped at 0.5 when only some are.
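
As a quick illustration of this scoring (assuming qa_agent.py is importable from your working directory), a fully matching answer earns 1.0, while matching one keyword out of three earns roughly 0.17:

# Quick, throwaway check of the reward arithmetic
from qa_agent import FactualQAAgent

agent = FactualQAAgent()
print(agent._calculate_reward("The capital of France is Paris.", ["paris", "capital", "france"]))  # 1.0
print(agent._calculate_reward("Paris, I believe.", ["paris", "capital", "france"]))                # ~0.17 (1/3 * 0.5)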

  • Self-Correction/Verification: Review the _calculate_reward method. Does it accurately reflect what “good performance” means for your QA agent? Is the partial credit fair?

Step 4: Set Up the APO Optimizer

We’ll create a simple mock APO optimizer. In a real scenario, this would involve using an LLM to generate prompt variations, or a more sophisticated search algorithm. For this project, our mock optimizer will cycle through a predefined list of prompt candidates.

Create a new file named apo_optimizer.py in the same directory:

# apo_optimizer.py
from agentlightning.types import AgentResource, LitRollout

class SimplePromptOptimizer:
    """
    A mock APO optimizer that cycles through a list of predefined system prompts
    and keeps track of the best one based on average reward.
    """
    def __init__(self, prompt_candidates: list[str]):
        self.prompt_candidates = prompt_candidates
        self.current_candidate_index = 0
        self.best_prompt = prompt_candidates[0]
        self.best_avg_reward = -1.0
        self.iteration_rewards = {} # Store rewards for each prompt candidate
        print(f"APO Optimizer initialized with {len(self.prompt_candidates)} candidates.")

    async def optimize_step(self, rollout_results: list[LitRollout], resources_version: str) -> dict:
        """
        Evaluates the current prompt's performance and proposes the next one.
        """
        if not rollout_results:
            print("APO Optimizer: No rollouts received for this step.")
            return {"version": resources_version, "resources": {}}

        # Recover the prompt that was used for these rollouts (falls back to "Default")
        current_prompt_used = "Default"
        if "qa_system_prompt" in rollout_results[0].resources:
            current_prompt_used = rollout_results[0].resources["qa_system_prompt"].value
        
        avg_reward = sum(r.final_reward for r in rollout_results) / len(rollout_results)
        print(f"\nAPO Optimizer: Prompt '{current_prompt_used}' (Index: {self.current_candidate_index}) resulted in Avg Reward: {avg_reward:.2f}")

        # Store the reward for the current prompt candidate
        self.iteration_rewards[current_prompt_used] = avg_reward

        # Update best prompt if current one is better
        if avg_reward > self.best_avg_reward:
            self.best_avg_reward = avg_reward
            self.best_prompt = current_prompt_used
            print(f"APO Optimizer: New best prompt: '{self.best_prompt}' (Avg Reward: {self.best_avg_reward:.2f})")

        self.current_candidate_index += 1

        # Propose the next prompt candidate, or the best one if all are evaluated
        if self.current_candidate_index < len(self.prompt_candidates):
            next_prompt = self.prompt_candidates[self.current_candidate_index]
            print(f"APO Optimizer: Proposing next prompt: '{next_prompt}'")
            return {
                "version": f"v_prompt_{self.current_candidate_index}",
                "resources": {
                    "qa_system_prompt": AgentResource(name="qa_system_prompt", value=next_prompt),
                }
            }
        else:
            print(f"APO Optimizer: All candidates evaluated. Sticking with best prompt: '{self.best_prompt}'")
            return {
                "version": f"v_prompt_final",
                "resources": {
                    "qa_system_prompt": AgentResource(name="qa_system_prompt", value=self.best_prompt),
                }
            }

Prompt Candidates: Here are some prompt candidates for our optimizer to try. The optimizer will cycle through these, providing them as AgentResource objects to our FactualQAAgent.

# Define in your main training script or import as a constant
prompt_candidates = [
    "You are a helpful assistant.",
    "Answer factual questions precisely.",
    "You are a highly accurate fact checker. Provide concise and precise answers.",
    "Given a question, extract key facts and provide a definitive answer.",
    "You are an expert encyclopedia. Answer all questions with utmost accuracy."
]
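
As noted at the start of this step, a real APO loop would generate new candidates rather than cycle through a fixed list. The function below is a minimal, hypothetical sketch of that generation step; it assumes you supply your own async LLM completion function (this is not an Agent Lightning API), and Exercise 2 at the end of this project revisits the idea:

# Hypothetical candidate-generation step for a fully automatic APO loop.
from typing import Awaitable, Callable

async def propose_prompt_variations(
    llm: Callable[[str], Awaitable[str]],  # any async text-completion function you provide
    previous_prompt: str,
    avg_reward: float,
    n: int = 3,
) -> list[str]:
    """Ask an LLM to suggest improved system prompts, one per line."""
    meta_prompt = (
        f"The following system prompt scored an average reward of {avg_reward:.2f} "
        f"on a factual QA benchmark:\n\n{previous_prompt}\n\n"
        f"Suggest {n} improved system prompts for factual QA, one per line."
    )
    response = await llm(meta_prompt)
    return [line.strip() for line in response.splitlines() if line.strip()][:n]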

Step 5: Run the Training Loop

Now, we’ll combine all the pieces: the agent, tasks, and optimizer within a main training script. We’ll simulate the Trainer.fit loop using trainer.dev for simplicity, showing how the optimizer updates the prompt resources.

Create a file named run_qa_optimization.py in the same directory:

# run_qa_optimization.py
import asyncio
from agentlightning.trainer import Trainer
from agentlightning.types import AgentResource
from qa_agent import FactualQAAgent
from qa_tasks import all_qa_tasks
from apo_optimizer import SimplePromptOptimizer

# --- Define Prompt Candidates for Optimization ---
prompt_candidates = [
    "You are a helpful assistant.",
    "Answer factual questions precisely.",
    "You are a highly accurate fact checker. Provide concise and precise answers.",
    "Given a question, extract key facts and provide a definitive answer.",
    "You are an expert encyclopedia. Answer all questions with utmost accuracy."
]

async def main_qa_optimization():
    # Optional: start the AgentLightningServer in a separate terminal
    # (trainer.dev does not strictly require it):
    # agentlightning server start --host 0.0.0.0 --port 8000
    
    backend_url = "http://localhost:8000"  # Not used directly by trainer.dev, but kept for a full server setup
    num_workers = 1  # With trainer.dev, rollouts run sequentially

    trainer = Trainer(n_workers=num_workers)
    qa_agent = FactualQAAgent()
    optimizer = SimplePromptOptimizer(prompt_candidates)

    current_resources = {} # Resources dictionary to pass to the agent

    print("--- Starting QA Agent Optimization with APO ---")
    
    # Simulate epochs where the optimizer proposes new prompts.
    # We run len(prompt_candidates) + 1 epochs: the first epoch uses the agent's
    # default prompt (identical to the first candidate), the optimizer then proposes
    # each remaining candidate, and a final epoch re-runs the best prompt found.
    for epoch in range(len(prompt_candidates) + 1):
        print(f"\n========== Optimization Epoch {epoch + 1} ==========")
        
        # Current prompt resource for this epoch's rollouts
        current_prompt_resource = current_resources.get("qa_system_prompt")
        prompt_value_for_epoch = current_prompt_resource.value if current_prompt_resource else "Default"
        print(f"Agent will use prompt: '{prompt_value_for_epoch}'")

        collected_rollouts = []
        # For each epoch, run the agent against all defined tasks
        for i, task in enumerate(all_qa_tasks):
            print(f"\n  Running task {i+1}/{len(all_qa_tasks)}: {task.name}")
            # Each trainer.dev call simulates one rollout, passing current resources
            rollout_result = await trainer.dev(
                agent=qa_agent,
                task=task,
                resources=current_resources # Pass the resources to the agent
            )
            collected_rollouts.append(rollout_result)
        
        # After all rollouts for the current prompt, the optimizer evaluates and proposes next
        updated_trainer_state = await optimizer.optimize_step(collected_rollouts, f"v_epoch_{epoch}")
        
        # Update current_resources with the new prompt proposed by the optimizer
        new_resources_from_optimizer = updated_trainer_state.get("resources", {})
        if "qa_system_prompt" in new_resources_from_optimizer:
            current_resources["qa_system_prompt"] = new_resources_from_optimizer["qa_system_prompt"]
        else:
            # If optimizer is done, ensure the agent uses the final best prompt
            current_resources["qa_system_prompt"] = AgentLightningResource(
                name="qa_system_prompt", 
                value=optimizer.best_prompt # Directly use the best_prompt from optimizer
            )

        # Check if optimizer is finished proposing new prompts
        if optimizer.current_candidate_index > len(prompt_candidates) and epoch >= len(prompt_candidates):
            print("\nAll prompt candidates evaluated.")
            break # Exit the loop after evaluating all candidates and the final best

    print("\n--- QA Agent Optimization Completed ---")
    print(f"Final Optimal Prompt: '{optimizer.best_prompt}'")
    print("\nReward history per prompt:")
    for prompt, reward in optimizer.iteration_rewards.items():
        print(f"  '{prompt}': {reward:.2f}")

if __name__ == "__main__":
    # Optional: Start AgentLightningServer in a separate terminal.
    # While trainer.dev() doesn't strictly require it, for a full setup
    # and to prepare for future projects using actual worker dispatch,
    # it's good practice to have it running.
    # agentlightning server start --host 0.0.0.0 --port 8000
    
    asyncio.run(main_qa_optimization())

To Run Project 1:

  1. Create Project Directory: Create a folder named agentic_qa_project.
  2. Save Files: Save qa_agent.py, qa_tasks.py, apo_optimizer.py, and run_qa_optimization.py into this directory.
  3. Activate Environment: Ensure your Agent Lightning virtual environment is active.
  4. (Optional but Recommended) Start Server: In a separate terminal, start the AgentLightningServer:
    agentlightning server start --host 0.0.0.0 --port 8000
    
  5. Run Optimization: In your primary terminal (in agentic_qa_project directory with env active):
    python run_qa_optimization.py
    

Expected Output:

You will see output for each epoch, showing:

  • Which prompt the agent is currently using.
  • The agent’s execution for each task, including its answer and reward.
  • The SimplePromptOptimizer evaluating the average reward for the current prompt.
  • The optimizer proposing the next prompt candidate.
  • Finally, a summary of the best prompt found and its reward.

You should observe that the prompt "You are a highly accurate fact checker. Provide concise and precise answers." leads to a noticeably higher average reward than the other candidates: the mock LLM only returns its best answers when the system prompt contains both "fact checker" and "precise", so the remaining candidates (including the encyclopedia-style prompt) score no better than the initial generic prompt. This contrast demonstrates the power of APO.

Exercises/Mini-Challenges for Project 1:

  1. Enhance Reward Function:

    • Modify the _calculate_reward method in FactualQAAgent to give a small penalty (-0.1) for answers that are excessively long (e.g., more than 20 words) when the ideal length is usually short (a minimal sketch of this penalty appears after this exercise list).
    • Consider how to introduce more sophisticated string matching (e.g., using fuzzy matching or embedding similarity) for partial credit, rather than just keyword presence.
  2. More Sophisticated Prompt Generation (Conceptual):

    • (Advanced) Instead of a predefined list of prompt_candidates, imagine SimplePromptOptimizer uses an LLM (e.g., GPT-3.5) to generate new prompt variations based on the performance of previous prompts. For example, if a prompt performs poorly, the LLM could be prompted: “The previous prompt resulted in low accuracy. Suggest 3 improvements to this prompt to make it more effective for factual QA: [previous_prompt]”.
    • This would make your APO truly “automatic” in its generation of candidates.
  3. Dynamic Task Selection:

    • Modify run_qa_optimization.py to only run tasks that the current best prompt performed poorly on, or tasks that are deemed “harder” based on initial evaluation. This focuses the training effort where it’s most needed. (You’d need to add a “difficulty” or “failure_count” attribute to your AgentLightningTask objects).
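
For the first challenge above, one possible shape of the length penalty is sketched below. The 20-word threshold and -0.1 penalty come straight from the exercise statement; the rest mirrors _calculate_reward from Step 1, and you could drop this logic into the agent as a method:

# Sketch for Exercise 1: keyword reward with a small penalty for overly long answers.
def calculate_reward_with_length_penalty(agent_answer: str, expected_keywords: list[str]) -> float:
    if not expected_keywords:
        return 0.0

    answer_lower = agent_answer.lower()
    score = sum(1.0 for kw in expected_keywords if kw in answer_lower)
    reward = 1.0 if score == len(expected_keywords) else score / len(expected_keywords) * 0.5

    # Penalize answers longer than roughly 20 words, but never drop below zero.
    if len(agent_answer.split()) > 20:
        reward -= 0.1
    return max(reward, 0.0)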

This project provides a foundational understanding of how to use Agent Lightning for practical agent optimization. In the next project, we’ll explore integrating an existing LangChain agent and leveraging Reinforcement Learning.