Retrieval-Augmented Generation (RAG): Enhancing LLMs with External Knowledge - A Practical Guide
Introduction to Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) have revolutionized the way we interact with information, demonstrating remarkable abilities in generating human-like text, answering questions, and summarizing content. However, they come with inherent limitations:
- Hallucinations: LLMs can sometimes generate factually incorrect or nonsensical information, presenting it confidently as truth. This is a significant hurdle in applications requiring high accuracy.
- Lack of Up-to-Date Information: The knowledge of LLMs is static, frozen at the time of their last training data cutoff. They cannot access real-time information or specific proprietary data sources.
- Limited Context Window: While LLMs have growing context windows, there’s still a limit to how much information they can process in a single prompt. For complex queries requiring extensive background, fitting all relevant data into the prompt becomes challenging.
Retrieval-Augmented Generation (RAG) emerges as a powerful paradigm to address these limitations. RAG combines the generative power of LLMs with external, dynamic, and authoritative knowledge bases. Instead of relying solely on its internal, pre-trained knowledge, a RAG system first retrieves relevant information from an external source and then uses this retrieved context to augment the LLM’s response generation.
Why RAG is Crucial for Modern LLM Applications
RAG offers several compelling advantages:
- Reduced Hallucinations: By providing factual, external evidence, RAG grounds the LLM’s responses, making them more reliable and less prone to generating incorrect information.
- Access to Up-to-Date Information: RAG enables LLMs to query databases, web pages, or documents that are continuously updated, ensuring the responses reflect the latest information.
- Incorporation of Proprietary Data: Businesses can leverage RAG to build LLM applications that access their internal documents, customer data, or specialized knowledge bases, keeping sensitive information private and relevant.
- Attribution and Explainability: RAG systems can often cite the sources from which information was retrieved, improving the trustworthiness and verifiability of the LLM’s output.
- Cost-Effectiveness: Instead of continuously retraining LLMs with new data (a costly and resource-intensive process), RAG allows for easy updates to the external knowledge base.
- Enhanced Specificity and Detail: By retrieving precise snippets, RAG can help LLMs generate more detailed and contextually rich answers than they might otherwise.
The Basic RAG Flow: Retrieve then Generate
At its core, RAG follows a two-stage process:
- Retrieval: Given a user query, the system searches an external knowledge base to find relevant documents, passages, or data points. This is typically done by converting the query and the documents into numerical representations (embeddings) and then finding documents whose embeddings are most similar to the query’s embedding.
- Generation: The retrieved information is then provided to the LLM as additional context alongside the original user query. The LLM then generates a response, conditioning its output on both the query and the provided context.
Let’s illustrate this with a simple example:
Scenario: A user asks, “When was the last fiscal year earnings report for Google published?”
Without RAG (Traditional LLM): The LLM might try to guess based on its training data, potentially giving an outdated or incorrect answer, or stating it doesn’t know.
With RAG:
- Retrieval: The RAG system would take the query, convert it into an embedding, and then search a financial news database or Google’s investor relations website. It would retrieve the latest earnings report release date, perhaps a snippet like: “Google’s Q2 2025 earnings report was published on July 25, 2025.”
- Generation: The LLM receives the prompt: “Based on the following context, answer the question: ‘When was the last fiscal year earnings report for Google published?’ Context: ‘Google’s Q2 2025 earnings report was published on July 25, 2025.’” The LLM then generates a precise answer: “The last fiscal year earnings report for Google was published on July 25, 2025, covering Q2 2025.”
Practical Example: A Simple RAG System (Conceptual)
Before diving into code, let’s understand the high-level components with a diagram and a pseudo-code representation.
graph TD
A[User Query] --> B{Retrieve Relevant Documents};
B --> C[Vector Database / Document Store];
C --> D[Embeddings Model];
B --> E[Retrieved Context];
E --> F["LLM (Generative Model)"];
A --> F;
F --> G[Augmented Response];
Pseudo-code:
function build_rag_system(knowledge_base_documents):
# Step 1: Prepare the knowledge base (offline process)
document_chunks = chunk_documents(knowledge_base_documents)
document_embeddings = create_embeddings(document_chunks)
vector_database = store_embeddings(document_embeddings, document_chunks)
return vector_database
function query_rag_system(user_query, vector_database, llm_model):
# Step 2: Process a user query (online process)
query_embedding = create_embedding(user_query)
retrieved_chunks = vector_database.search(query_embedding, top_k=5) # Find top 5 similar chunks
context = combine_chunks_into_context(retrieved_chunks)
prompt = f"Given the following context, answer the question accurately and concisely.\n\nContext:\n{context}\n\nQuestion: {user_query}"
response = llm_model.generate(prompt)
return response
This guide will systematically break down each step of this process, providing concrete examples and code to build and deploy your own RAG systems.
Part 1: Foundations of RAG - Building Your Knowledge Base
This section focuses on the initial steps of preparing your external knowledge base for retrieval. This is a crucial offline process that determines the quality and relevance of information your RAG system can access.
1.1 Document Loading: Getting Your Data into RAG
The first step in any RAG pipeline is to ingest your data. This data can come from various sources: PDFs, Markdown files, web pages, databases, APIs, etc. Libraries like LangChain and LlamaIndex provide robust DocumentLoaders to handle this.
Core Concept: Document Object
In most RAG frameworks, raw data is loaded into a standardized Document object, which typically contains:
- page_content: The textual content of the document.
- metadata: A dictionary of key-value pairs providing additional information about the document (e.g., source file, URL, page number, author, date). This metadata is crucial for advanced retrieval and filtering.
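To make this concrete, here is a minimal sketch of constructing a Document by hand (the field values are purely illustrative):
from langchain_core.documents import Document
# A Document pairs the raw text with metadata describing where it came from.
doc = Document(
    page_content="RAG combines retrieval with generation.",
    metadata={"source": "notes/rag_overview.txt", "page": 1},  # illustrative values
)
print(doc.page_content)  # The text itself
print(doc.metadata)      # {'source': 'notes/rag_overview.txt', 'page': 1}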
Practical Example: Loading Various Document Types
Let’s start by installing necessary libraries.
pip install langchain langchain_community pypdf beautifulsoup4
Mini-Project 1.1.1: Loading Documents from Files and Web
We’ll load a PDF, a text file, and a web page.
from langchain_community.document_loaders import PyPDFLoader, TextLoader, WebBaseLoader
from langchain_core.documents import Document
import os
# Create dummy files for demonstration
with open("example.txt", "w") as f:
f.write("This is a simple text document. It contains some basic information.\n")
f.write("For instance, the capital of France is Paris. The highest mountain is Everest.")
# Note: For PDF, you'd need an actual PDF file.
# For this example, we'll simulate loading a PDF and then deleting it.
# In a real scenario, you would have your PDF file ready.
# To make this runnable without an actual PDF, we'll skip the real PDF loading here,
# but demonstrate the loader.
# Create a dummy PDF placeholder for the example (you'd replace this with a real path)
dummy_pdf_path = "dummy_document.pdf"
# If you have an actual PDF, uncomment the following line and replace "path/to/your/document.pdf"
# loader = PyPDFLoader("path/to/your/document.pdf")
# docs = loader.load()
print("--- Loading Text File ---")
try:
text_loader = TextLoader("example.txt")
text_docs: list[Document] = text_loader.load()
for doc in text_docs:
print(f"Content (first 100 chars): {doc.page_content[:100]}...")
print(f"Metadata: {doc.metadata}")
except FileNotFoundError:
print("example.txt not found. Please create it.")
print("\n--- Loading Web Page ---")
try:
# We'll use a well-known page for demonstration.
# Replace with your desired URL.
web_loader = WebBaseLoader(
web_path=("https://www.paulgraham.com/greatwork.html"),
bs_kwargs={"features": "html.parser"} # Optional: Specify parser
)
web_docs: list[Document] = web_loader.load()
for doc in web_docs:
print(f"Content (first 100 chars): {doc.page_content[:100]}...")
print(f"Metadata: {doc.metadata}")
except Exception as e:
print(f"Error loading web page: {e}")
# Clean up dummy file
os.remove("example.txt")
Explanation:
- TextLoader: Reads content from a .txt file.
- PyPDFLoader: Designed for PDF files. It extracts text from each page.
- WebBaseLoader: Fetches content from a URL. bs_kwargs can be used to pass arguments to BeautifulSoup for more controlled parsing.
Exercise 1.1.1:
Modify the WebBaseLoader example to load a different news article or a specific documentation page. Experiment with bs_kwargs to see if you can filter out specific HTML elements (e.g., footers, sidebars) by passing in BeautifulSoup selectors. (Hint: Look up BeautifulSoup’s select method for ideas on how to target specific elements if you were to post-process the page_content).
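One possible approach is sketched below: WebBaseLoader forwards bs_kwargs to BeautifulSoup, so you can pass a SoupStrainer via parse_only to keep only the elements you care about. The class name "post-content" is hypothetical — inspect your target page to find the right selector, and fall back to post-processing page_content if the page has no convenient classes.
from bs4 import SoupStrainer
from langchain_community.document_loaders import WebBaseLoader
# Keep only elements with a matching class; "post-content" is a hypothetical class name.
only_main_content = SoupStrainer(class_="post-content")
loader = WebBaseLoader(
    web_path="https://www.paulgraham.com/greatwork.html",
    bs_kwargs={"parse_only": only_main_content},  # forwarded to BeautifulSoup
)
docs = loader.load()
if docs and docs[0].page_content.strip():
    print(docs[0].page_content[:200])
else:
    print("No elements matched the strainer; adjust the selector for your page.")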
1.2 Text Splitting (Chunking): Managing Context Limits
LLMs have a limited context window. Feeding an entire document, especially a long one, into the LLM prompt is often impractical or too expensive. Moreover, the LLM might struggle to identify the most relevant parts if the context is too broad.
Chunking (also known as text splitting) is the process of breaking down large documents into smaller, manageable pieces called “chunks.” The goal is to create chunks that are:
- Cohesive: Each chunk should ideally contain a complete thought or idea.
- Sufficiently Small: To fit within the LLM’s context window.
- Sufficiently Large: To retain enough context for the LLM to understand and generate meaningful responses.
Core Concepts: Chunk Size and Overlap
- Chunk Size: The maximum number of tokens or characters in a single chunk. Choosing an optimal chunk size is critical and often depends on the type of data and the LLM being used. Too small, and context is lost; too large, and it might exceed the LLM’s capacity or dilute relevant information.
- Chunk Overlap: To maintain continuity between chunks and avoid losing context at the boundaries, chunks often overlap by a certain number of tokens/characters. This ensures that information spanning across two chunk boundaries is still captured in at least one chunk.
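To build intuition for how these two parameters interact, here is a simplified sliding-window sketch in plain Python. Real splitters (covered next) prefer natural boundaries such as paragraphs and sentences rather than fixed character offsets, so treat this only as an illustration of the arithmetic:
text = "RAG systems retrieve relevant chunks and pass them to the LLM as context. " * 5
chunk_size = 80
chunk_overlap = 20
step = chunk_size - chunk_overlap  # each new chunk starts 60 characters after the previous one
chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1} ({len(chunk)} chars): {chunk[:40]}...")
# The last 20 characters of each chunk reappear at the start of the next one,
# so text that falls near a boundary is present in both neighboring chunks.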
Practical Example: Different Text Splitters
LangChain and LlamaIndex offer various TextSplitters. Let’s explore some common ones.
pip install tiktoken # For token-based splitting
Mini-Project 1.2.1: Experimenting with Text Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_core.documents import Document
long_text = """
Retrieval-Augmented Generation (RAG) is an architectural pattern that combines an information retrieval system with a text generator.
The core idea is to retrieve relevant documents or data snippets from a vast knowledge base based on a user's query, and then feed these snippets as context to a large language model (LLM).
This allows the LLM to generate more accurate, up-to-date, and grounded responses, significantly reducing the problem of "hallucinations" often observed in standalone LLMs.
There are several key components in a RAG system. First, there's the document loading phase, where raw data from various sources (PDFs, websites, databases) is ingested and converted into a standardized format.
Next, text splitting or "chunking" breaks down these larger documents into smaller, manageable segments. This is crucial because LLMs have context window limitations.
The choice of chunk size and overlap is a critical design decision. Too small, and you might lose context; too large, and you might exceed the LLM's input limit or dilute the relevance of individual chunks.
After splitting, these chunks are then transformed into numerical representations called embeddings using an embedding model.
These embeddings capture the semantic meaning of the text. They are then stored in a vector database, which is optimized for fast similarity search.
When a user submits a query, it is also embedded, and the vector database is queried to find the most semantically similar document chunks.
These retrieved chunks serve as additional context for the LLM to formulate its answer.
Finally, the LLM processes the user query along with the retrieved context to generate a coherent and informed response.
"""
# Convert to a Document object for consistency, though TextSplitters can also take strings directly
document_to_split = Document(page_content=long_text, metadata={"source": "example_rag_intro"})
print("--- RecursiveCharacterTextSplitter (default) ---")
# This splitter attempts to split by paragraphs, then sentences, then words, etc.
# It tries to keep chunks semantically coherent.
recursive_splitter = RecursiveCharacterTextSplitter(
chunk_size=200, # Max characters per chunk
chunk_overlap=20, # Overlap between chunks
length_function=len # Function to measure length (len for characters, token_len for tokens)
)
recursive_chunks = recursive_splitter.split_documents([document_to_split])
for i, chunk in enumerate(recursive_chunks):
print(f"Chunk {i+1} (len: {len(chunk.page_content)}):")
print(f"'{chunk.page_content}'\n---")
print("\n--- CharacterTextSplitter (by specific separator) ---")
# This splitter splits only on the specified separator (e.g., "\n\n") and merges the pieces up to chunk_size.
# Pieces that are still larger than chunk_size are kept as-is (with a warning) rather than split further.
character_splitter = CharacterTextSplitter(
separator="\n\n", # Primary separator
chunk_size=200,
chunk_overlap=20,
length_function=len
)
character_chunks = character_splitter.split_documents([document_to_split])
for i, chunk in enumerate(character_chunks):
print(f"Chunk {i+1} (len: {len(chunk.page_content)}):")
print(f"'{chunk.page_content}'\n---")
# Using a token-based splitter (requires an LLM or tokenizer for length_function)
# For demonstration, we'll use a simple character-based approach for now.
# Real-world tokenizers like `tiktoken` provide more accurate token counts.
# from langchain.text_splitter import CharacterTextSplitter
# from langchain.schema import Document
#
# # This is a simplified representation. In a real scenario, you'd use a specific tokenizer.
# # For example, for OpenAI models, you might use:
# # import tiktoken
# # enc = tiktoken.encoding_for_model("gpt-4")
# # token_len = lambda text: len(enc.encode(text))
#
# # For now, let's stick to character count as length function for simplicity.
# # For true token counting, you'd integrate a specific tokenizer's encode method.
Explanation:
- RecursiveCharacterTextSplitter: This is often the go-to splitter. It tries a list of separators (["\n\n", "\n", " ", ""]) and splits by the first one that results in chunks smaller than chunk_size. This strategy aims to keep semantically related text together.
- CharacterTextSplitter: A more basic splitter that splits only on the specified separator and then merges the resulting pieces up to chunk_size. Pieces that are still larger than chunk_size are kept as-is (with a warning) rather than split further.
Choosing a length_function for chunk_size:
- len: Counts characters. Simpler, but less accurate for LLMs, since they process tokens, not characters.
- tiktoken.encoding_for_model("gpt-4").encode: Counts tokens according to OpenAI's tokenizer. This is highly recommended when working with OpenAI LLMs for more precise chunk_size management.
- Other tokenizers (e.g., Hugging Face transformers): Can be used for open-source LLMs.
Exercise 1.2.1: Token-based Splitting with tiktoken
Integrate tiktoken to use a token-based length function for RecursiveCharacterTextSplitter. Choose a chunk_size in tokens (e.g., 250 tokens) and observe how the chunks are generated. Compare the character count of these token-based chunks with the character-based chunks from the previous example.
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
long_text = """
Retrieval-Augmented Generation (RAG) is an architectural pattern that combines an information retrieval system with a text generator.
The core idea is to retrieve relevant documents or data snippets from a vast knowledge base based on a user's query, and then feed these snippets as context to a large language model (LLM).
This allows the LLM to generate more accurate, up-to-date, and grounded responses, significantly reducing the problem of "hallucinations" often observed in standalone LLMs.
There are several key components in a RAG system. First, there's the document loading phase, where raw data from various sources (PDFs, websites, databases) is ingested and converted into a standardized format.
Next, text splitting or "chunking" breaks down these larger documents into smaller, manageable segments. This is crucial because LLMs have context window limitations.
The choice of chunk size and overlap is a critical design decision. Too small, and you might lose context; too large, and you might exceed the LLM's input limit or dilute the relevance of individual chunks.
After splitting, these chunks are then transformed into numerical representations called embeddings using an embedding model.
These embeddings capture the semantic meaning of the text. They are then stored in a vector database, which is optimized for fast similarity search.
When a user submits a query, it is also embedded, and the vector database is queried to find the most semantically similar document chunks.
These retrieved chunks serve as additional context for the LLM to formulate its answer.
Finally, the LLM processes the user query along with the retrieved context to generate a coherent and informed response.
"""
document_to_split = Document(page_content=long_text, metadata={"source": "example_rag_intro_tokens"})
# Define a token length function using tiktoken
try:
encoding = tiktoken.get_encoding("cl100k_base") # For gpt-4, gpt-3.5-turbo models
def tiktoken_len(text: str) -> int:
return len(encoding.encode(text))
except Exception as e:
print(f"Could not load tiktoken encoding. Falling back to character length. Error: {e}")
tiktoken_len = len # Fallback if tiktoken fails
print("--- RecursiveCharacterTextSplitter (token-based) ---")
token_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, # Max tokens per chunk
chunk_overlap=10, # Overlap between chunks (in tokens)
length_function=tiktoken_len, # Use our token length function
separators=["\n\n", "\n", " ", ""] # Same default separators
)
token_chunks = token_splitter.split_documents([document_to_split])
for i, chunk in enumerate(token_chunks):
print(f"Chunk {i+1} (token_len: {tiktoken_len(chunk.page_content)}, char_len: {len(chunk.page_content)}):")
print(f"'{chunk.page_content}'\n---")
Troubleshooting Chunking:
- Chunks are too short/long: Adjust chunk_size.
- Information split across chunks: Increase chunk_overlap, try a different separator in CharacterTextSplitter, or experiment with RecursiveCharacterTextSplitter's separators argument.
- Contextual relevance loss: Consider "parent-child" chunking strategies (an advanced topic for later), where small chunks are retrieved but a larger surrounding context is passed to the LLM.
1.3 Embeddings: Giving Text Meaning to Machines
Once documents are chunked, the next step is to convert these textual chunks into a numerical format that computers can understand and process for similarity. This is where embedding models come into play.
Core Concept: Vector Embeddings
An embedding is a dense vector representation of text (words, sentences, paragraphs, or entire documents) in a high-dimensional space. The magic of embeddings is that texts with similar meanings are mapped to vectors that are close to each other in this space, while texts with different meanings are far apart. This allows for semantic search: instead of keyword matching, we search for meaning.
Practical Example: Generating Embeddings
We'll use an open-source embedding model via LangChain's HuggingFaceEmbeddings wrapper, which downloads models from the Hugging Face Hub. For production, you might use hosted models like OpenAI's text-embedding-ada-002 or Google's text-embedding-004.
pip install sentence-transformers # Dependency for HuggingFaceEmbeddings
Mini-Project 1.3.1: Creating Embeddings from Text Chunks
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import numpy as np
# Sample text for embedding
text_data = [
"The quick brown fox jumps over the lazy dog.",
"A fast brown fox leaps over a lethargic canine.",
"Machine learning is a subfield of artificial intelligence.",
"Artificial intelligence involves building intelligent machines."
]
# Simulate chunks from document loading and splitting
sample_documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(text_data)]
print("--- Initializing HuggingFace Embeddings Model ---")
# Using a common and relatively small Sentence-Transformer model
# You can explore other models on the Hugging Face Hub, e.g., 'BAAI/bge-small-en-v1.5'
# Ensure you have 'sentence-transformers' installed.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
print("\n--- Generating Embeddings ---")
# The embedding_model can take a list of strings directly
document_embeddings = embedding_model.embed_documents([doc.page_content for doc in sample_documents])
# Embed a single query for comparison
query_text = "fox and dog story"
query_embedding = embedding_model.embed_query(query_text)
print(f"Number of document embeddings generated: {len(document_embeddings)}")
print(f"Dimension of embeddings (e.g., for 'all-MiniLM-L6-v2' it's 384): {len(document_embeddings[0])}")
print("\n--- Comparing Embeddings (Cosine Similarity) ---")
def cosine_similarity(vec1, vec2):
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# Convert to numpy arrays for easier calculation
doc_embeddings_np = np.array(document_embeddings)
query_embedding_np = np.array(query_embedding)
similarities = []
for i, doc_embed in enumerate(doc_embeddings_np):
sim = cosine_similarity(query_embedding_np, doc_embed)
similarities.append((sim, i, sample_documents[i].page_content))
# Sort by similarity in descending order
similarities.sort(key=lambda x: x[0], reverse=True)
print(f"Query: '{query_text}'")
for sim, idx, content in similarities:
print(f"Similarity: {sim:.4f}, Document {idx}: '{content}'")
Explanation:
- HuggingFaceEmbeddings: This class allows you to use various models from the Hugging Face sentence-transformers library. The model_name argument specifies which pre-trained model to download and use.
- embed_documents: Takes a list of strings and returns a list of embedding vectors.
- embed_query: Takes a single string and returns its embedding vector.
- Cosine Similarity: A common metric to measure the similarity between two non-zero vectors. A value close to 1 indicates high similarity, 0 indicates no similarity (orthogonality), and -1 indicates complete dissimilarity.
Choosing an Embedding Model:
- OpenAI/Google Embeddings: Often high-performing, but proprietary; they typically require API keys and incur costs (OpenAIEmbeddings, GoogleGenerativeAIEmbeddings).
- Hugging Face Embeddings (Sentence Transformers): Excellent for open-source and local deployment. Many models are available, varying in size, performance, and language support. all-MiniLM-L6-v2 is a good general-purpose choice.
- Domain-Specific Embeddings: For highly specialized domains (e.g., legal, medical), fine-tuned or domain-specific models might outperform general-purpose ones.
Exercise 1.3.1: Exploring Different Embedding Models
Change the model_name in HuggingFaceEmbeddings to BAAI/bge-small-en-v1.5 (a well-regarded model). Re-run the similarity comparison and observe if the similarity scores or rankings change. You might need to install transformers if you encounter issues.
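A sketch of the change (the rest of Mini-Project 1.3.1 stays the same). BGE models are commonly used with normalized embeddings, which HuggingFaceEmbeddings supports through encode_kwargs:
from langchain_community.embeddings import HuggingFaceEmbeddings
# Swap in the BGE model; normalizing embeddings is commonly recommended for BGE models.
bge_embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
query_embedding = bge_embedding_model.embed_query("fox and dog story")
print(f"Embedding dimension: {len(query_embedding)}")  # 384 for bge-small-en-v1.5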
Troubleshooting Embeddings:
- Poor Relevance: The embedding model might not be well-suited for your data or queries. Consider a different model or fine-tuning.
- Performance: Generating embeddings for very large datasets can be slow. Batch processing and using optimized hardware (GPUs) can help.
- Cost (for API-based models): Be mindful of API call costs.
1.4 Vector Databases: Storing and Searching Embeddings
After generating embeddings, you need an efficient way to store them and perform rapid similarity searches. This is the role of a vector database (also known as a vector store or vector index).
Core Concept: Approximate Nearest Neighbor (ANN) Search
When you have millions or billions of vectors, performing an exact nearest neighbor search (comparing a query vector to every other vector) becomes computationally infeasible. Vector databases employ Approximate Nearest Neighbor (ANN) algorithms to quickly find vectors that are approximately the closest to a given query vector. These algorithms sacrifice a small amount of accuracy for significant speed improvements.
Examples of ANN techniques include HNSW (Hierarchical Navigable Small World) graphs and IVF (Inverted File) indexes; libraries such as FAISS implement several of these.
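To see the exact-versus-approximate trade-off in isolation, here is a minimal sketch using the faiss library (pip install faiss-cpu) on random vectors. IndexFlatL2 scans every vector exactly, while IndexHNSWFlat builds an HNSW graph and answers queries approximately but much faster at scale:
import faiss
import numpy as np
dim = 384  # matches the embedding size of models like all-MiniLM-L6-v2
vectors = np.random.random((10_000, dim)).astype("float32")
queries = np.random.random((3, dim)).astype("float32")
# Exact (brute-force) nearest neighbor search
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(vectors)
exact_dist, exact_ids = flat_index.search(queries, 5)
# Approximate search with an HNSW graph (32 links per node)
hnsw_index = faiss.IndexHNSWFlat(dim, 32)
hnsw_index.add(vectors)
approx_dist, approx_ids = hnsw_index.search(queries, 5)
print("Exact top-5 ids:      ", exact_ids[0])
print("Approximate top-5 ids:", approx_ids[0])  # usually overlaps heavily with the exact result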
Practical Example: Using a Vector Database (ChromaDB)
ChromaDB is an open-source, lightweight vector database that’s easy to set up and use locally. Other popular choices include Pinecone, Weaviate, Milvus, Qdrant, and FAISS (a library, not a full-fledged database).
pip install chromadb
Mini-Project 1.4.1: Ingesting Data into ChromaDB and Performing Search
We’ll combine document loading, chunking, embedding, and finally storing and searching in ChromaDB.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
# 1. Prepare some example documents
# Create a dummy text file
data_content = """
The capital of France is Paris. Paris is also known as the "City of Love" and is famous for the Eiffel Tower.
The capital of Germany is Berlin. Berlin has a rich history, including the Brandenburg Gate.
The capital of Spain is Madrid. Madrid is known for its vibrant nightlife and beautiful architecture.
The capital of Italy is Rome. Rome is famous for its ancient ruins, like the Colosseum and Roman Forum.
Quantum physics studies matter and energy at the most fundamental level. It explores phenomena like superposition and entanglement.
Classical mechanics describes the motion of macroscopic objects, from projectiles to parts of machinery.
"""
with open("capitals_and_physics.txt", "w") as f:
f.write(data_content)
# 2. Load documents
loader = TextLoader("capitals_and_physics.txt")
documents = loader.load()
# 3. Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
print(f"Number of chunks created: {len(chunks)}")
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1} (len: {len(chunk.page_content)}): '{chunk.page_content}'")
# 4. Choose an embedding model
# We'll use the same HuggingFace model for consistency
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# 5. Initialize and persist ChromaDB
# This will create a local directory 'chroma_db' to store the database
persist_directory = "./chroma_db"
if os.path.exists(persist_directory):
import shutil
shutil.rmtree(persist_directory) # Clear previous data for a fresh start
print(f"\n--- Creating ChromaDB with {len(chunks)} chunks ---")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=persist_directory
)
print("ChromaDB created and persisted.")
# Make sure to call persist() if you're not using from_documents directly and want to save
# vector_db.persist()
# 6. Perform similarity search
query = "What is known about the main city of France?"
print(f"\n--- Performing similarity search for query: '{query}' ---")
# k=2 means retrieve the top 2 most similar documents
retrieved_docs: list[Document] = vector_db.similarity_search(query, k=2)
print(f"\nTop {len(retrieved_docs)} retrieved documents:")
for i, doc in enumerate(retrieved_docs):
print(f"Document {i+1} (Source: {doc.metadata.get('source', 'N/A')}, Length: {len(doc.page_content)}):")
print(f"'{doc.page_content}'")
print("---")
# Clean up dummy file and ChromaDB directory
os.remove("capitals_and_physics.txt")
if os.path.exists(persist_directory):
import shutil
shutil.rmtree(persist_directory)
Explanation:
- Chroma.from_documents(): A convenient method to load chunks, embed them, and add them to ChromaDB in one go.
- embedding: You pass the initialized embedding model to Chroma so it knows how to convert text into vectors.
- persist_directory: Specifies a local folder where ChromaDB will store its data, allowing you to reload it later without re-ingesting.
- similarity_search(query, k): Takes a query string, embeds it using the same embedding model, and then searches the vector database for the k most similar document chunks. It returns these as Document objects.
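When retrieval results look off, it can help to inspect the raw scores as well. LangChain's Chroma wrapper also provides similarity_search_with_score, which returns (Document, score) pairs; for Chroma the score is a distance, so lower values mean closer matches. A small sketch, run against the vector_db above before the cleanup step:
# Inspect retrieval scores (run this before deleting the ChromaDB directory).
results = vector_db.similarity_search_with_score("What is known about the main city of France?", k=2)
for doc, score in results:
    # For Chroma, the score is a distance: lower means more similar.
    print(f"score={score:.4f} | {doc.page_content[:80]}...")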
Exercise 1.4.1: Reloading and Querying a Persisted ChromaDB
Modify the previous example. After vector_db.persist(), delete the vector_db object (e.g., del vector_db). Then, re-initialize ChromaDB from the persist_directory without providing documents (as they are already persisted) and perform a new query.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
import shutil
# --- Setup: Identical to Mini-Project 1.4.1 for initial data loading ---
data_content = """
The capital of France is Paris. Paris is also known as the "City of Love" and is famous for the Eiffel Tower.
The capital of Germany is Berlin. Berlin has a rich history, including the Brandenburg Gate.
The capital of Spain is Madrid. Madrid is known for its vibrant nightlife and beautiful architecture.
The capital of Italy is Rome. Rome is famous for its ancient ruins, like the Colosseum and Roman Forum.
Quantum physics studies matter and energy at the most fundamental level. It explores phenomena like superposition and entanglement.
Classical mechanics describes the motion of macroscopic objects, from projectiles to parts of machinery.
"""
with open("capitals_and_physics.txt", "w") as f:
f.write(data_content)
loader = TextLoader("capitals_and_physics.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
persist_directory = "./chroma_db_exercise"
if os.path.exists(persist_directory):
shutil.rmtree(persist_directory)
print(f"--- Creating and persisting ChromaDB with {len(chunks)} chunks ---")
initial_vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=persist_directory
)
# Ensure it's persisted (from_documents handles this, but good to be explicit for learning)
initial_vector_db.persist()
print("Initial ChromaDB created and persisted.")
del initial_vector_db # Simulate closing the database
# --- Exercise Part: Reloading and Querying ---
print(f"\n--- Reloading ChromaDB from '{persist_directory}' ---")
# To reload, we need to pass the same embedding function that was used to create it
reloaded_vector_db = Chroma(
persist_directory=persist_directory,
embedding_function=embedding_model # Critical: Use the same embedding function
)
print("ChromaDB reloaded.")
query_reloaded = "Tell me about ancient Italian cities."
print(f"\n--- Performing similarity search on reloaded DB for query: '{query_reloaded}' ---")
retrieved_docs_reloaded = reloaded_vector_db.similarity_search(query_reloaded, k=1)
print(f"\nTop {len(retrieved_docs_reloaded)} retrieved documents from reloaded DB:")
for i, doc in enumerate(retrieved_docs_reloaded):
print(f"Document {i+1} (Source: {doc.metadata.get('source', 'N/A')}, Length: {len(doc.page_content)}):")
print(f"'{doc.page_content}'")
print("---")
# --- Cleanup ---
os.remove("capitals_and_physics.txt")
if os.path.exists(persist_directory):
shutil.rmtree(persist_directory)
Troubleshooting Vector Databases:
- Performance Issues: For very large datasets, consider cloud-based vector databases (Pinecone, Weaviate) or more performant local solutions like FAISS (with appropriate indexing strategies). Ensure your embedding model isn’t too large for your resources.
- Out-of-Memory Errors: If your chunks are too large or you’re ingesting millions of documents locally, you might hit memory limits. Adjust chunking or use persistent/cloud databases.
- Relevance: If search results are not relevant, re-evaluate:
- Chunking strategy: Are your chunks semantically coherent?
- Embedding model: Is it suitable for your domain?
- Query formulation: Is the user query well-phrased for semantic search?
Mini-Project 1: Building a Simple Document RAG Index
Let’s consolidate everything learned so far into a mini-project where you build a RAG index for a set of documents.
Goal: Create a script that takes a directory of text files, loads them, chunks them, embeds them, and stores them in a ChromaDB vector store. Then, allow a user to query this index and see the retrieved chunks.
Instructions:
- Create a directory named docs and put a few .txt files inside it with various topics (e.g., one about history, one about science, one about literature).
- Write Python code to:
  - Load all .txt files from the docs directory.
  - Split the loaded documents into chunks.
  - Generate embeddings for these chunks using HuggingFaceEmbeddings.
  - Store the chunks and their embeddings in a ChromaDB instance, persisting it to a directory.
  - Implement a loop that prompts the user for a query, performs a similarity search on the ChromaDB, and prints the top 3 retrieved chunks.
  - Include cleanup for the docs directory and the ChromaDB persistence directory.
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
# 1. Setup: Create dummy documents and directories
DOCS_DIR = "rag_docs_mini_project"
PERSIST_DIR = "./rag_chroma_db_mini_project"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR)
os.makedirs(DOCS_DIR)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR)
# Create some dummy text files
with open(os.path.join(DOCS_DIR, "history.txt"), "w") as f:
f.write("""
The Roman Empire was founded in 27 BC when Augustus became the first emperor.
It reached its peak under Emperor Trajan and eventually fell in 476 AD in the West.
Key aspects of Roman society included law, engineering (aqueducts, roads), and military might.
""")
with open(os.path.join(DOCS_DIR, "science.txt"), "w") as f:
f.write("""
Photosynthesis is the process by which green plants and some other organisms
use sunlight to synthesize foods with the help of chlorophyll.
This process converts light energy into chemical energy, releasing oxygen as a byproduct.
The formula is 6CO2 + 6H2O + Light Energy -> C6H12O6 + 6O2.
""")
with open(os.path.join(DOCS_DIR, "literature.txt"), "w") as f:
f.write("""
"Romeo and Juliet" is a tragedy written by William Shakespeare early in his career.
It tells the story of two young star-crossed lovers whose deaths ultimately reconcile
their feuding families. It is among Shakespeare's most popular and frequently performed plays.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
# 2. Load documents
print(f"Loading documents from '{DOCS_DIR}'...")
# Use DirectoryLoader to load all .txt files
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents: list[Document] = loader.load()
print(f"Loaded {len(documents)} raw documents.")
# 3. Chunk documents
print("Splitting documents into chunks...")
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=200, # Max characters per chunk
chunk_overlap=30, # Overlap between chunks
length_function=len
)
chunks: list[Document] = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks.")
# 4. Choose an embedding model
print("Initializing embedding model...")
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# 5. Initialize and persist ChromaDB
print(f"Creating and persisting ChromaDB to '{PERSIST_DIR}'...")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
print("ChromaDB created and persisted.")
print("You can now close this script and the database will remain.")
# Reload the DB (optional, but demonstrates persistence)
print(f"\nReloading ChromaDB from '{PERSIST_DIR}' for querying...")
reloaded_vector_db = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embedding_model
)
# 6. Implement query loop
print("\n--- RAG Index Ready! Enter your queries below. Type 'exit' to quit. ---")
while True:
query = input("\nEnter your query: ")
if query.lower() == 'exit':
break
print(f"Searching for relevant documents for query: '{query}'")
retrieved_docs: list[Document] = reloaded_vector_db.similarity_search(query, k=3)
print(f"\nTop {len(retrieved_docs)} retrieved documents:")
for i, doc in enumerate(retrieved_docs):
# Extract filename from metadata.source for better context
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Document {i+1} (Source: {source_file}, Length: {len(doc.page_content)}):")
print(f"'{doc.page_content}'")
print("---")
print("\nExiting RAG query tool.")
# 7. Cleanup
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR)
shutil.rmtree(PERSIST_DIR)
print("Cleanup complete.")
This mini-project provides a complete, runnable example of building the foundational components of a RAG system. The next parts will focus on integrating this retrieval mechanism with LLMs for actual generation and exploring advanced techniques.
Part 2: Integrating RAG with Large Language Models
With our RAG index (vector database) ready, the next step is to integrate it with an LLM to generate informed responses. This section covers setting up LLM access and constructing effective prompts.
2.1 Setting Up LLM Access
To use LLMs, you’ll typically interact with them via an API (e.g., OpenAI, Google Gemini) or run a local open-source model.
Practical Example: Using an LLM (OpenAI)
We’ll use OpenAI’s GPT models for this example due to their widespread adoption. Make sure you have an OpenAI API key.
pip install openai
Mini-Project 2.1.1: Basic LLM Interaction
from openai import OpenAI
import os
# Set your OpenAI API key from environment variable
# It's recommended to set it as an environment variable: export OPENAI_API_KEY="your_key_here"
# If not set, you can uncomment and set it directly, but this is less secure for production.
# os.environ["OPENAI_API_KEY"] = "sk-..."
try:
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def get_llm_response(prompt_text: str) -> str:
response = client.chat.completions.create(
model="gpt-3.5-turbo", # Or "gpt-4", "gpt-4o" for better performance/cost
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt_text}
],
max_tokens=150
)
return response.choices[0].message.content
print("--- Basic LLM Interaction ---")
question = "What is the capital of Japan?"
llm_answer = get_llm_response(question)
print(f"Question: {question}")
print(f"LLM Answer: {llm_answer}")
question_unknown = "Who won the World Series in 2025?" # Information beyond training data
llm_answer_unknown = get_llm_response(question_unknown)
print(f"\nQuestion: {question_unknown}")
print(f"LLM Answer: {llm_answer_unknown}") # Expect a disclaimer or general knowledge
except Exception as e:
print(f"Error interacting with OpenAI API. Make sure your API key is set and valid: {e}")
print("If you don't have an OpenAI key, you can substitute with a local LLM or another provider.")
Explanation:
- We use the openai Python client.
- client.chat.completions.create is the standard way to interact with chat-based models.
- model: Specifies the LLM model to use (e.g., gpt-3.5-turbo, gpt-4).
- messages: A list of message dictionaries, each with a role (system, user, assistant) and content. The "system" role sets the overall behavior/persona of the assistant.
- max_tokens: Limits the length of the generated response.
Exercise 2.1.1: Experiment with Google Gemini
If you have a Google Gemini API key, modify the example to use google.generativeai (or LangChain’s ChatGoogleGenerativeAI) to interact with a Gemini model (e.g., gemini-pro). Observe any differences in setting up the client and making the call.
pip install -q -U google-generativeai
import google.generativeai as genai
import os
# Set your Google API key from environment variable
# export GOOGLE_API_KEY="your_key_here"
# If not set, you can uncomment and set it directly.
# os.environ["GOOGLE_API_KEY"] = "AIza..."
try:
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
def get_gemini_response(prompt_text: str) -> str:
# For text-only prompts, use "gemini-pro"
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content(prompt_text)
return response.text
print("--- Basic Gemini LLM Interaction ---")
question = "What is the capital of Japan?"
gemini_answer = get_gemini_response(question)
print(f"Question: {question}")
print(f"Gemini Answer: {gemini_answer}")
question_unknown = "Who won the World Series in 2025?"
gemini_answer_unknown = get_gemini_response(question_unknown)
print(f"\nQuestion: {question_unknown}")
print(f"Gemini Answer: {gemini_answer_unknown}")
except Exception as e:
print(f"Error interacting with Google Gemini API. Make sure your API key is set and valid: {e}")
print("If you don't have a Gemini key, you can substitute with a local LLM or another provider.")
2.2 Prompt Engineering for RAG
The quality of an LLM’s response is heavily dependent on the prompt it receives. In RAG, we don’t just ask a question; we provide context. Effective prompt engineering is crucial to guide the LLM to use the retrieved information properly.
Core Concepts: System Prompts and Context Integration
- System Prompt: This sets the overall tone, persona, and instructions for the LLM. For RAG, it often includes instructions to “use the provided context” and “avoid making up information.”
- Context Integration: The retrieved chunks are inserted directly into the prompt. It’s important to format them clearly so the LLM can easily distinguish between the user’s query and the external context.
Practical Example: Constructing RAG Prompts
Let’s combine our ChromaDB setup (from Mini-Project 1) with an LLM. For this example, we’ll continue with OpenAI’s API.
Mini-Project 2.2.1: Simple RAG System (Retrieve + Generate)
This mini-project will take the vector database you built in Part 1 and use it to augment an LLM’s answers.
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from openai import OpenAI
from typing import List
# --- Setup: Identical to Mini-Project 1 for data loading and indexing ---
DOCS_DIR = "rag_docs_full_system"
PERSIST_DIR = "./rag_chroma_db_full_system"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
# Create some dummy text files
with open(os.path.join(DOCS_DIR, "company_info.txt"), "w") as f:
f.write("""
Our company, InnovateCorp, was founded in 2010 by Dr. Anya Sharma.
Our mission is to develop cutting-edge AI solutions for sustainable urban development.
We recently launched our flagship product, EcoBuild AI, in Q1 2025.
EcoBuild AI helps cities optimize energy consumption and waste management through predictive analytics.
Our main office is located in San Francisco, CA.
""")
with open(os.path.join(DOCS_DIR, "product_faq.txt"), "w") as f:
f.write("""
**EcoBuild AI Frequently Asked Questions**
Q: What problem does EcoBuild AI solve?
A: It addresses energy inefficiency and waste management challenges in urban environments.
Q: When was it launched?
A: EcoBuild AI was launched in Q1 2025.
Q: What technologies does it use?
A: It leverages machine learning, IoT data, and cloud computing.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
# Load, chunk, embed, and store in ChromaDB
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
print(f"ChromaDB created and persisted to '{PERSIST_DIR}'.")
# Reload for robust demonstration
reloaded_vector_db = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embedding_model
)
# 2. Setup LLM access (OpenAI)
try:
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
if not openai_client.api_key:
raise ValueError("OPENAI_API_KEY environment variable not set.")
print("\nOpenAI LLM client initialized.")
except Exception as e:
print(f"Failed to initialize OpenAI client: {e}. Please set OPENAI_API_KEY.")
openai_client = None # Set to None if initialization fails
def get_llm_response_rag(query: str, retrieved_context: List[Document]) -> str:
if not openai_client:
return "LLM service is not available. Please check API key setup."
context_str = "\n".join([doc.page_content for doc in retrieved_context])
# Construct the RAG prompt
system_prompt = """
You are a helpful assistant specialized in providing information based on the given context.
Answer the question truthfully and concisely, strictly using only the provided context.
If the answer cannot be found in the context, clearly state that you don't know or that the information is not available in the provided documents.
Do not make up information.
"""
user_prompt = f"""
Context:
{context_str}
Question: {query}
Answer:
"""
try:
response = openai_client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.0, # Keep temperature low for factual, grounded answers
max_tokens=200
)
return response.choices[0].message.content
except Exception as e:
return f"Error generating LLM response: {e}"
# 3. Combine Retrieval and Generation
print("\n--- RAG System Ready! Enter your queries below. Type 'exit' to quit. ---")
while True:
user_query = input("\nEnter your query: ")
if user_query.lower() == 'exit':
break
# Retrieval step
retrieved_chunks = reloaded_vector_db.similarity_search(user_query, k=3)
print(f"\nRetrieved {len(retrieved_chunks)} relevant chunks.")
# for i, chunk in enumerate(retrieved_chunks):
# print(f"Chunk {i+1}: '{chunk.page_content[:100]}...'")
# Generation step
rag_answer = get_llm_response_rag(user_query, retrieved_chunks)
print(f"\nUser Query: {user_query}")
print(f"RAG Answer: {rag_answer}")
# Example of a query where context might not be sufficient
if openai_client:
no_context_query = "What is the square root of 144?"
no_context_answer = get_llm_response_rag(no_context_query, []) # No context provided
print(f"\nUser Query (no context test): {no_context_query}")
print(f"RAG Answer (no context test): {no_context_answer}")
print("\nExiting RAG system.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Key elements of the RAG prompt:
- Clear Instructions: “strictly using only the provided context”, “If the answer cannot be found in the context, clearly state that you don’t know.” These instructions are vital to prevent hallucinations.
- Context Section: A dedicated section labeled “Context:” where retrieved documents are clearly presented.
- Question Section: The user’s original query.
- Answer Section: Guides the LLM to start its response here.
- Temperature: Setting temperature=0.0 (or a very low value) encourages the LLM to be less creative and more deterministic, which is generally desired for factual RAG applications.
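If you also want the attribution benefits mentioned in the introduction, one option is to label each retrieved chunk with its source inside the context block and instruct the model to cite those labels. A sketch of how the context string and system prompt could be adapted (building on get_llm_response_rag; the labels come from each chunk's metadata):
from typing import List
from langchain_core.documents import Document
def build_context_with_sources(retrieved_context: List[Document]) -> str:
    # Number each chunk and record its source so the LLM can cite it as [1], [2], ...
    labeled = []
    for i, doc in enumerate(retrieved_context, start=1):
        source = doc.metadata.get("source", "unknown")
        labeled.append(f"[{i}] (source: {source})\n{doc.page_content}")
    return "\n\n".join(labeled)
citation_system_prompt = """
You are a helpful assistant. Answer strictly from the numbered context passages below.
After each factual statement, cite the passage it came from by its number, e.g. [1].
If the answer is not in the context, say so instead of guessing.
"""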
Exercise 2.2.1: Prompt Tuning for Summarization
Modify the system_prompt in get_llm_response_rag to encourage the LLM to summarize the retrieved context relevant to the query, rather than just answering directly. Test with a query that might require synthesis from multiple chunks (e.g., “Tell me about InnovateCorp’s main product and its benefits.”).
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from openai import OpenAI
from typing import List
# --- Setup: Identical to Mini-Project 1 for data loading and indexing ---
DOCS_DIR = "rag_docs_full_system_summarize"
PERSIST_DIR = "./rag_chroma_db_full_system_summarize"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
# Create some dummy text files
with open(os.path.join(DOCS_DIR, "company_info.txt"), "w") as f:
f.write("""
Our company, InnovateCorp, was founded in 2010 by Dr. Anya Sharma.
Our mission is to develop cutting-edge AI solutions for sustainable urban development.
We recently launched our flagship product, EcoBuild AI, in Q1 2025.
EcoBuild AI helps cities optimize energy consumption and waste management through predictive analytics.
Our main office is located in San Francisco, CA.
""")
with open(os.path.join(DOCS_DIR, "product_faq.txt"), "w") as f:
f.write("""
**EcoBuild AI Frequently Asked Questions**
Q: What problem does EcoBuild AI solve?
A: It addresses energy inefficiency and waste management challenges in urban environments.
Q: When was it launched?
A: EcoBuild AI was launched in Q1 2025.
Q: What technologies does it use?
A: It leverages machine learning, IoT data, and cloud computing.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
# Load, chunk, embed, and store in ChromaDB
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
print(f"ChromaDB created and persisted to '{PERSIST_DIR}'.")
# Reload for robust demonstration
reloaded_vector_db = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embedding_model
)
# 2. Setup LLM access (OpenAI)
try:
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
if not openai_client.api_key:
raise ValueError("OPENAI_API_KEY environment variable not set.")
print("\nOpenAI LLM client initialized.")
except Exception as e:
print(f"Failed to initialize OpenAI client: {e}. Please set OPENAI_API_KEY.")
openai_client = None # Set to None if initialization fails
def get_llm_response_rag_summarize(query: str, retrieved_context: List[Document]) -> str:
if not openai_client:
return "LLM service is not available. Please check API key setup."
context_str = "\n".join([doc.page_content for doc in retrieved_context])
# MODIFIED SYSTEM PROMPT for summarization
system_prompt = """
You are a helpful assistant specialized in providing concise summaries of information based on the given context.
Summarize the key information from the provided context that is relevant to the user's question.
If the answer or relevant information cannot be found in the context, clearly state that you don't know or that the information is not available in the provided documents.
Do not make up information.
"""
user_prompt = f"""
Context:
{context_str}
Question: {query}
Summarized Answer:
"""
try:
response = openai_client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.0,
max_tokens=200
)
return response.choices[0].message.content
except Exception as e:
return f"Error generating LLM response: {e}"
# 3. Combine Retrieval and Generation
print("\n--- RAG System (Summarization Mode) Ready! Enter your queries below. Type 'exit' to quit. ---")
while True:
user_query = input("\nEnter your query: ")
if user_query.lower() == 'exit':
break
# Retrieval step
retrieved_chunks = reloaded_vector_db.similarity_search(user_query, k=3)
print(f"\nRetrieved {len(retrieved_chunks)} relevant chunks.")
# Generation step using the new summarization function
rag_answer = get_llm_response_rag_summarize(user_query, retrieved_chunks)
print(f"\nUser Query: {user_query}")
print(f"RAG Summarized Answer: {rag_answer}")
# Example query for synthesis
if openai_client:
synthesis_query = "Tell me about InnovateCorp's main product and its benefits."
retrieved_for_synthesis = reloaded_vector_db.similarity_search(synthesis_query, k=3)
synthesis_answer = get_llm_response_rag_summarize(synthesis_query, retrieved_for_synthesis)
print(f"\nUser Query (synthesis test): {synthesis_query}")
print(f"RAG Summarized Answer (synthesis test): {synthesis_answer}")
print("\nExiting RAG system.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Troubleshooting Prompt Engineering:
- LLM ignores context: Make sure your system prompt clearly emphasizes using only the provided context and penalizes making up information. Using terms like “strictly,” “solely,” “do not make up information” is helpful.
- LLM still hallucinates: Your context might be insufficient or ambiguous. Revisit your chunking strategy or retrieval parameters (k). Increase max_tokens if the LLM is cutting off its response.
- Answers are too generic: Refine your prompt to ask for specific types of information or a particular format (e.g., "Provide a bulleted list…", "Summarize in 3 sentences…").
- Context length exceeded: If your retrieved_chunks are too numerous or too long, they might exceed the LLM's context window. Reduce k in similarity_search or decrease chunk_size in your text splitter.
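A sketch of one way to guard against this: measure the retrieved chunks with tiktoken and drop the lowest-ranked ones until the context fits a chosen token budget (the budget value below is an arbitrary example):
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
def trim_chunks_to_budget(retrieved_chunks, max_context_tokens=1500):
    """Keep the highest-ranked chunks whose combined size stays under the token budget."""
    kept, used = [], 0
    for doc in retrieved_chunks:  # similarity_search returns chunks ranked most relevant first
        tokens = len(encoding.encode(doc.page_content))
        if used + tokens > max_context_tokens:
            break
        kept.append(doc)
        used += tokens
    return kept
# Usage: context_str = "\n\n".join(d.page_content for d in trim_chunks_to_budget(retrieved_chunks))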
Mini-Project 2: Building a RAG-Powered Chatbot
Goal: Extend Mini-Project 1 by integrating the LLM to create a basic RAG chatbot that can answer questions based on the ingested documents.
Instructions:
- Reuse the document loading, chunking, embedding, and ChromaDB persistence from Mini-Project 1.
- Integrate a function answer_with_rag(query: str) -> str that:
  - Takes a user query.
  - Performs a similarity search on the ChromaDB to get relevant chunks.
  - Constructs a RAG prompt using the retrieved chunks and the query.
  - Sends the prompt to an LLM (e.g., OpenAI's GPT-3.5-turbo or Google's Gemini-pro).
  - Returns the LLM's generated response.
- Create a simple interactive loop where the user can ask questions, and the chatbot provides RAG-augmented answers. Handle cases where the LLM might not find the answer in the provided context.
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from openai import OpenAI # Or google.generativeai for Gemini
from typing import List
# --- Setup: Document loading and indexing ---
DOCS_DIR = "rag_chatbot_docs"
PERSIST_DIR = "./rag_chatbot_chroma_db"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
# Create some dummy text files
with open(os.path.join(DOCS_DIR, "company_report.txt"), "w") as f:
f.write("""
Acme Innovations Inc. released its annual report for 2024.
Revenues increased by 15% to $120 million, primarily driven by strong sales of their new AI-powered analytics suite.
The R&D department invested $30 million in quantum computing research and sustainable energy solutions.
CEO, Jane Doe, highlighted plans for international expansion into European markets in late 2025.
Employee count grew to 500 across all departments.
""")
with open(os.path.join(DOCS_DIR, "tech_blog.txt"), "w") as f:
f.write("""
Our latest blog post details the advancements in our AI analytics suite, version 2.0.
It now includes real-time anomaly detection and predictive maintenance features for industrial applications.
We are excited about the new partnership with 'GreenTech Solutions' to pilot our sustainable energy AI.
The blog post also mentions an upcoming webinar on "AI in Manufacturing" scheduled for September 15, 2025.
""")
with open(os.path.join(DOCS_DIR, "hr_policy.txt"), "w") as f:
f.write("""
Acme Innovations Inc. promotes a diverse and inclusive workplace.
Our new remote work policy allows employees to work from home two days a week, effective October 1, 2025.
Employee benefits include comprehensive health insurance, a 401k matching program, and professional development courses.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
# Load documents
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} raw documents.")
# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks.")
# Choose an embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Initialize and persist ChromaDB
if os.path.exists(PERSIST_DIR): # Ensure clean start for demo
shutil.rmtree(PERSIST_DIR)
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
print(f"ChromaDB created and persisted to '{PERSIST_DIR}'.")
# Reload the DB
reloaded_vector_db = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embedding_model
)
# --- LLM Setup ---
try:
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
if not openai_client.api_key:
raise ValueError("OPENAI_API_KEY environment variable not set.")
print("\nOpenAI LLM client initialized.")
except Exception as e:
print(f"Failed to initialize OpenAI client: {e}. Please set OPENAI_API_KEY.")
openai_client = None
def answer_with_rag(query: str) -> str:
if not openai_client:
return "Error: LLM service not available. Please check API key."
# Retrieval step
retrieved_chunks = reloaded_vector_db.similarity_search(query, k=4) # Retrieve top 4 chunks
context_str = "\n\n".join([doc.page_content for doc in retrieved_chunks])
# Construct RAG prompt
system_prompt = """
You are an intelligent assistant designed to answer questions based *only* on the provided context.
If the answer is not explicitly stated in the context, say "I don't have enough information in my knowledge base to answer that."
Do not invent information or provide external knowledge.
Keep your answers concise and directly to the point.
"""
user_prompt = f"""
Context:
{context_str}
Question: {query}
Answer:
"""
try:
response = openai_client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.0, # Prioritize factual accuracy over creativity
max_tokens=250 # Limit response length
)
return response.choices[0].message.content
except Exception as e:
return f"Error during LLM generation: {e}"
# --- Chatbot Loop ---
print("\n--- RAG Chatbot Activated! ---")
print("Ask questions about Acme Innovations Inc. (type 'exit' to quit).")
while True:
user_input = input("\nYou: ")
if user_input.lower() == 'exit':
break
if not openai_client:
print("Bot: Cannot respond as LLM service is not available.")
continue
bot_response = answer_with_rag(user_input)
print(f"Bot: {bot_response}")
print("\nChatbot session ended.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
This comprehensive example demonstrates how to create a full RAG pipeline, from data ingestion to interactive querying with an LLM. The next parts will delve into advanced topics to further enhance your RAG systems.
Part 3: Advanced RAG Techniques and Optimization
Building on the foundational RAG system, this section explores advanced strategies to improve retrieval accuracy, generation quality, and system performance.
3.1 Advanced Retrieval Strategies
Simple similarity search (k nearest neighbors) is a good starting point, but it often misses nuances. Advanced retrieval aims to fetch more relevant, diverse, or contextually richer information.
3.1.1 Re-ranking Retrieved Documents
Even the top k documents from a vector search might contain some less relevant ones. Re-ranking involves using a more sophisticated model (often a smaller, specialized language model) to score the relevance of the retrieved documents against the query.
Core Concept: Cross-Encoder Models Unlike bi-encoder embedding models (which embed query and document independently), cross-encoder models take both the query and a document (or document pair) as input and output a single relevance score. They are more computationally expensive but offer higher accuracy.
Practical Example: Using a Re-ranker with sentence-transformers
pip install sentence-transformers
Mini-Project 3.1.1.1: Re-ranking Retrieved Chunks
We’ll use a pre-trained cross-encoder model for re-ranking.
from sentence_transformers import CrossEncoder
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
import shutil
# --- Setup: Reuse document loading and indexing from Part 1/2 ---
DOCS_DIR = "rerank_docs"
PERSIST_DIR = "./rerank_chroma_db"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "tech_overview.txt"), "w") as f:
f.write("""
InnovateTech's latest product, the 'Quantum Leap', is a revolutionary AI processor.
It utilizes superconducting qubits to achieve unparalleled computational speed for complex simulations.
The Quantum Leap is designed for scientific research and advanced data analytics, not for everyday consumer use.
It was announced at the AI World Summit in April 2025.
""")
with open(os.path.join(DOCS_DIR, "company_news.txt"), "w") as f:
f.write("""
InnovateTech announced a new partnership with Global Research Labs today.
This collaboration aims to accelerate quantum computing breakthroughs.
The CEO stated that the 'Quantum Leap' processor would be central to this partnership.
They also mentioned new hires in the quantum engineering division.
""")
with open(os.path.join(DOCS_DIR, "random_info.txt"), "w") as f:
f.write("""
The cat sat on the mat. The dog barked at the moon.
Green apples are often tart. Blue is a primary color.
The stock market closed higher today due to unexpected positive economic data.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# 1. Load a Cross-Encoder for re-ranking
print("\n--- Initializing Cross-Encoder Re-ranker ---")
# Using a good general-purpose cross-encoder for semantic textual similarity
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# 2. Perform initial retrieval (e.g., retrieve more than you need, then re-rank)
query = "What is the new AI processor from InnovateTech?"
print(f"\nOriginal Query: '{query}'")
initial_retrieval_k = 5 # Retrieve more documents initially
retrieved_docs: list[Document] = reloaded_vector_db.similarity_search(query, k=initial_retrieval_k)
print(f"\nInitially retrieved {len(retrieved_docs)} documents:")
for i, doc in enumerate(retrieved_docs):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Doc {i+1} (Source: {source_file}): '{doc.page_content[:100]}...'")
# 3. Re-rank the retrieved documents
print("\n--- Re-ranking retrieved documents ---")
# Prepare the input for the cross-encoder: a list of (query, document_text) pairs
rerank_pairs = [[query, doc.page_content] for doc in retrieved_docs]
rerank_scores = reranker.predict(rerank_pairs)
# Combine original documents with their re-rank scores
reranked_results = sorted(
zip(retrieved_docs, rerank_scores),
key=lambda x: x[1], # Sort by score (second element of tuple)
reverse=True
)
# Select the top N after re-ranking
top_n_after_rerank = 2
final_retrieved_docs = [doc for doc, score in reranked_results[:top_n_after_rerank]]
print(f"\nTop {len(final_retrieved_docs)} documents after re-ranking:")
for i, (doc, score) in enumerate(reranked_results[:top_n_after_rerank]):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Rank {i+1} (Score: {score:.4f}, Source: {source_file}): '{doc.page_content[:100]}...'")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- We first perform a standard `similarity_search` to get a superset of potentially relevant documents (`initial_retrieval_k`).
- The `CrossEncoder` model then takes each `(query, document_text)` pair and assigns a relevance score.
- We sort the documents by these scores and select the truly top `N` documents to pass to the LLM.
Benefits of Re-ranking:
- Improved Precision: Helps filter out false positives from initial retrieval.
- Handles Long-Tail Queries: Can sometimes better understand complex or nuanced queries than simple embedding similarity.
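In practice, the two steps above are usually wrapped in a single helper so the rest of the pipeline only ever sees the final top-N documents. A minimal sketch under that assumption, reusing the reloaded_vector_db and reranker objects from the mini-project (before its cleanup step); the function name and defaults are illustrative:
from langchain_core.documents import Document

def retrieve_and_rerank(query: str, initial_k: int = 10, final_k: int = 3) -> list[Document]:
    """Fetch a superset of candidates via vector search, then keep the top cross-encoder hits."""
    candidates = reloaded_vector_db.similarity_search(query, k=initial_k)
    scores = reranker.predict([[query, doc.page_content] for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

# The re-ranked documents can be passed straight into the RAG prompt builder from Part 2.
top_docs = retrieve_and_rerank("What is the new AI processor from InnovateTech?")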
3.1.2 Hybrid Search (Keywords + Semantic)
Pure semantic search can sometimes miss exact keyword matches, especially for highly specific terms or proper nouns. Hybrid search combines semantic (vector) search with traditional keyword-based search (e.g., BM25, TF-IDF).
Core Concept: Reciprocal Rank Fusion (RRF) RRF is a common algorithm used to combine the results from multiple ranking methods (like vector search and keyword search) into a single, robust ranked list.
Practical Example: Conceptual Hybrid Search (with mock keyword search)
LangChain offers integrations with more advanced hybrid search tools. Here, we’ll demonstrate the concept with a mock keyword search.
Mini-Project 3.1.2.1: Conceptual Hybrid Search
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
import shutil
from collections import defaultdict
# --- Setup: Reuse document loading and indexing ---
DOCS_DIR = "hybrid_docs"
PERSIST_DIR = "./hybrid_chroma_db"
if os.path.exists(DOCS_DIR):
shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR):
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "report_2023.txt"), "w") as f:
f.write("""
Our 2023 annual report details robust growth in the renewable energy sector.
The solar panel division saw a 25% increase in revenue.
We invested heavily in research for advanced battery storage solutions.
The report highlights key achievements including a patent for a new type of wind turbine.
""")
with open(os.path.join(DOCS_DIR, "press_release_wind.txt"), "w") as f:
f.write("""
Press Release: InnovatePower announces breakthrough in wind turbine efficiency.
The new 'AeroGen' turbine model achieves 15% higher energy yield than previous models.
This innovation is set to revolutionize the wind power industry.
""")
with open(os.path.join(DOCS_DIR, "news_q1_2024.txt"), "w") as f:
f.write("""
Q1 2024 earnings show continued strong performance.
Expansion into offshore wind farms is progressing ahead of schedule.
Challenges include fluctuating raw material costs.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# --- Mock Keyword Search Function (for demonstration) ---
def mock_keyword_search(query: str, all_chunks: list[Document], k: int = 5) -> list[Document]:
query_words = set(query.lower().split())
ranked_chunks = []
for chunk in all_chunks:
chunk_words = set(chunk.page_content.lower().split())
common_words = query_words.intersection(chunk_words)
score = len(common_words) # Simple score: number of common words
if score > 0:
ranked_chunks.append((score, chunk))
ranked_chunks.sort(key=lambda x: x[0], reverse=True)
return [chunk for score, chunk in ranked_chunks[:k]]
# --- Reciprocal Rank Fusion (RRF) ---
def reciprocal_rank_fusion(ranked_lists: list[list[Document]], k=60) -> list[Document]:
fused_scores = defaultdict(float)
document_map = {} # Map unique doc content to Document object
for ranked_list in ranked_lists:
for rank, doc in enumerate(ranked_list):
# Use a unique identifier for the document content
doc_id = doc.page_content # Simple unique identifier for this demo
document_map[doc_id] = doc # Store the full Document object
fused_scores[doc_id] += 1 / (k + rank + 1)
# Sort documents by fused scores in descending order
sorted_doc_ids = sorted(fused_scores.keys(), key=lambda doc_id: fused_scores[doc_id], reverse=True)
# Reconstruct Document objects
fused_results = [document_map[doc_id] for doc_id in sorted_doc_ids]
return fused_results
# 1. Perform Semantic Search
query = "New developments in wind energy and our patents"
print(f"\nOriginal Query: '{query}'")
semantic_results = reloaded_vector_db.similarity_search(query, k=5)
print("\n--- Semantic Search Results ---")
for i, doc in enumerate(semantic_results):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Semantic Rank {i+1} (Source: {source_file}): '{doc.page_content[:100]}...'")
# 2. Perform Keyword Search (using our mock function)
# In a real scenario, this would be a dedicated search engine or another retriever
all_chunks_list = chunks # Get all original chunks
keyword_results = mock_keyword_search(query, all_chunks_list, k=5)
print("\n--- Keyword Search Results ---")
for i, doc in enumerate(keyword_results):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Keyword Rank {i+1} (Source: {source_file}): '{doc.page_content[:100]}...'")
# 3. Combine results using RRF
fused_results = reciprocal_rank_fusion([semantic_results, keyword_results])
print("\n--- Hybrid Search Results (after RRF) ---")
# Take top 3 for the LLM
final_hybrid_docs = fused_results[:3]
for i, doc in enumerate(final_hybrid_docs):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Hybrid Rank {i+1} (Source: {source_file}): '{doc.page_content[:100]}...'")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- We simulate keyword search with `mock_keyword_search`. In a real system, you’d use a dedicated text search library or database.
- `reciprocal_rank_fusion` combines the ranked lists from both search methods, giving higher scores to documents that appear high in multiple lists.
Benefits of Hybrid Search:
- Robustness: Captures both semantic meaning and exact term matches.
- Improved Recall for Specifics: Especially useful for queries involving names, codes, or precise terminology.
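In a production LangChain pipeline you would typically not hand-roll the keyword side. A common pattern is to fuse a BM25 retriever with the vector retriever via LangChain's EnsembleRetriever, which applies weighted Reciprocal Rank Fusion internally. A minimal sketch, assuming the rank_bm25 package is installed and reusing the chunks and reloaded_vector_db objects from the mini-project above:
# pip install rank_bm25  (required by BM25Retriever; an assumption for this sketch)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword side: BM25 over the same chunks we indexed in Chroma.
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Semantic side: the Chroma vector store exposed as a LangChain retriever.
vector_retriever = reloaded_vector_db.as_retriever(search_kwargs={"k": 5})

# Fuse both ranked lists; equal weights mirror the plain RRF used above.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)
hybrid_docs = hybrid_retriever.get_relevant_documents("New developments in wind energy and our patents")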
3.1.3 Contextual Compression and Parent Document Retriever
Sometimes a small, highly relevant chunk is retrieved, but it lacks sufficient surrounding context for the LLM to generate a comprehensive answer. Conversely, large chunks might dilute the relevance.
Core Concept: Parent Document Retriever This strategy involves:
- Chunking into smaller “child” chunks for the purpose of embedding and retrieval.
- Maintaining larger “parent” documents (or larger chunks encompassing multiple child chunks).
- When a query matches a “child” chunk, the system retrieves the entire “parent” document (or a larger, context-rich chunk) that the child belongs to. This provides richer context to the LLM.
Practical Example: Parent Document Retriever (Conceptual with LangChain’s helper)
LangChain provides a ParentDocumentRetriever to simplify this. The example below keeps the small child chunks in Chroma and the parent documents in an in-memory store, so no additional packages are needed beyond those already installed.
Mini-Project 3.1.3.1: Parent Document Retriever
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import os
import shutil
# --- Setup: Define paths and cleanup ---
DOCS_DIR = "parent_docs"
PERSIST_DIR_CHILD = "./parent_chroma_child_db"
PERSIST_DIR_PARENT = "./parent_faiss_parent_db" # Not used below; parents live in an InMemoryStore in this demo
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR_CHILD): shutil.rmtree(PERSIST_DIR_CHILD, ignore_errors=True)
if os.path.exists(PERSIST_DIR_PARENT): shutil.rmtree(PERSIST_DIR_PARENT, ignore_errors=True)
# Create a single long document that benefits from parent retrieval
long_doc_content = """
Introduction to Advanced Materials:
Advanced materials are at the forefront of technological innovation, enabling breakthroughs across various industries.
These materials often possess superior properties compared to traditional ones, such as enhanced strength,
lightweight characteristics, and improved thermal or electrical conductivity.
Section 1: Nanomaterials
Nanomaterials are materials with at least one dimension in the nanoscale (1-100 nanometers).
Their unique properties, like high surface area-to-volume ratio, lead to novel applications.
Examples include carbon nanotubes for electronics and silver nanoparticles for antibacterial coatings.
Their quantum mechanical properties become significant at this scale.
Section 2: Smart Materials
Smart materials, also known as intelligent or responsive materials, react to external stimuli.
This reaction can be a change in shape, size, color, or electrical properties.
Shape memory alloys (SMAs) and piezoelectric materials are prime examples.
SMAs are used in aerospace and biomedical devices, regaining their original shape upon heating.
Section 3: Biocompatible Materials
These materials are designed to interact safely with biological systems.
They are critical in medical implants, prosthetics, and drug delivery systems.
Polymers like silicone and metals like titanium are common biocompatible materials.
The body's immune response to these materials is a key consideration.
Conclusion:
The development of advanced materials continues to push the boundaries of what's possible,
offering solutions to complex challenges in engineering, medicine, and environmental science.
"""
with open(os.path.join(DOCS_DIR, "advanced_materials.txt"), "w") as f:
f.write(long_doc_content)
print(f"Created dummy document in '{DOCS_DIR}'")
# Load the document
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} raw documents.")
# 1. Define parent and child text splitters
# Child splitter for storing in vector database (small chunks for precise retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
# Parent splitter for what to pass to the LLM (larger, contextual chunks)
# Or, if you want the full document, you'd skip splitting the parent
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# 2. Set up the vectorstore for child documents and a document store for parent documents
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Vectorstore for the smaller (child) chunks, used for retrieval
vectorstore = Chroma(
collection_name="parent_document_retrieval_child_chunks",
embedding_function=embedding_model,
persist_directory=PERSIST_DIR_CHILD
)
# Document store for the larger (parent) documents, used for fetching full context
# InMemoryStore is good for small-scale, but for persistence, use a key-value store or another DB
store = InMemoryStore()
# 3. Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter, # Optional: if you want parents to be larger chunks, not full docs
search_kwargs={"k": 2} # How many child chunks to retrieve initially
)
# 4. Add documents to the retriever (this automatically handles splitting and storing)
print("\n--- Adding documents to ParentDocumentRetriever ---")
retriever.add_documents(documents)
print("Documents added to retriever.")
# 5. Perform a query and observe retrieved parent documents
query = "Tell me about materials that change shape when heated."
print(f"\n--- Performing query: '{query}' ---")
# The retriever's get_relevant_documents method will use child chunks for search,
# but return the corresponding parent chunks.
retrieved_parent_docs: list[Document] = retriever.get_relevant_documents(query)
print(f"\nRetrieved {len(retrieved_parent_docs)} parent documents:")
for i, doc in enumerate(retrieved_parent_docs):
source_file = os.path.basename(doc.metadata.get('source', 'N/A'))
print(f"Parent Document {i+1} (Source: {source_file}, Length: {len(doc.page_content)}):")
print(f"'{doc.page_content}'")
print("---")
# Let's inspect the underlying child chunks to confirm the process
# For this, we'd need to manually query the vectorstore.
print("\n--- Inspecting raw child chunk retrieval (for verification) ---")
raw_child_chunks = vectorstore.similarity_search(query, k=2)
for i, chunk in enumerate(raw_child_chunks):
source_file = os.path.basename(chunk.metadata.get('source', 'N/A'))
print(f"Raw Child Chunk {i+1} (Source: {source_file}, Length: {len(chunk.page_content)}):")
print(f"'{chunk.page_content}'")
print("---")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}', '{PERSIST_DIR_CHILD}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR_CHILD, ignore_errors=True)
# InMemoryStore doesn't need explicit cleanup, but if using FAISS for parent docs, you'd clean that too.
print("Cleanup complete.")
Explanation:
- We define two `TextSplitter`s: `child_splitter` for small chunks in the vector store and `parent_splitter` for larger chunks that get sent to the LLM.
- `vectorstore`: Stores the embeddings of the small child chunks.
- `docstore`: A key-value store that maps chunk IDs to the larger parent documents.
- `ParentDocumentRetriever`: Coordinates the process. It uses `vectorstore` for search and `docstore` to retrieve the full parent content once a child is found.
Benefits of Parent Document Retriever:
- Optimal Context: Ensures the LLM receives enough context even if the most relevant keyword is in a small part of a larger document.
- Reduced Noise: Still uses small, precise chunks for retrieval, reducing the chance of bringing in irrelevant large documents.
Exercise 3.1.3.1: Full Document Parents
Modify the ParentDocumentRetriever example so that instead of parent_splitter, it always retrieves the full original document when any of its child chunks are found. (Hint: you might need to adjust how add_documents is called or set parent_splitter=None if the retriever supports it, and ensure original documents are stored in docstore).
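As a hint for this exercise: LangChain's ParentDocumentRetriever stores each original Document whole as the parent whenever no parent_splitter is supplied, so the change mostly amounts to leaving that argument out. A minimal sketch under that assumption, reusing embedding_model, child_splitter, and documents from the mini-project, with fresh stores so the earlier parent-chunk index is not mixed in:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma

# Fresh, in-memory stores for the exercise.
full_doc_vectorstore = Chroma(
    collection_name="full_doc_parents_child_chunks",
    embedding_function=embedding_model,
)
full_doc_store = InMemoryStore()

# No parent_splitter: each original Document becomes its own parent.
full_doc_retriever = ParentDocumentRetriever(
    vectorstore=full_doc_vectorstore,
    docstore=full_doc_store,
    child_splitter=child_splitter,
    search_kwargs={"k": 2},
)
full_doc_retriever.add_documents(documents)

docs = full_doc_retriever.get_relevant_documents("materials that change shape when heated")
print(len(docs[0].page_content))  # roughly the length of the full advanced_materials.txt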
3.2 Advanced Chunking Methodologies
Beyond simple character or token splitting, intelligent chunking can significantly impact retrieval quality.
3.2.1 Semantic Chunking
Instead of splitting by arbitrary character counts or separators, semantic chunking aims to split documents at semantically meaningful boundaries. This ensures that each chunk represents a coherent topic or idea.
Core Concept: Embedding-based Chunking This often involves:
- Breaking a document into very small, overlapping sentences or paragraphs.
- Generating embeddings for these small segments.
- Calculating the similarity between adjacent segment embeddings.
- Identifying “dips” in similarity (where the topic changes) as chunk boundaries.
Practical Example: Conceptual Semantic Chunking
This is often more involved to implement from scratch. Here’s a conceptual outline and a hint towards libraries that offer it.
Mini-Project 3.2.1.1: Conceptual Semantic Chunking (with text_splitter hint)
LangChain’s SemanticChunker (requires torch, transformers) and other libraries are emerging to handle this. For this exercise, we’ll outline the logic and use standard tools to approximate the concept.
# Semantic chunking often involves more advanced processing,
# such as sentence embedding and boundary detection based on similarity scores.
# Libraries like 'langchain-experimental' or specialized NLP tools might offer this.
# For simplicity, and to keep within core LangChain for now, we'll demonstrate the concept
# through a basic recursive splitter, but emphasize the *goal* of semantic coherence.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
text_for_semantic_chunking = """
Chapter 1: The Dawn of AI
Artificial intelligence, a field that has captivated scientists for decades, began with early symbolic systems.
These systems attempted to encode human knowledge into rules that computers could follow.
Famous early examples include ELIZA, a chatbot, and the General Problem Solver.
This era, roughly from the 1950s to 1980s, laid theoretical groundwork.
Chapter 2: The Rise of Machine Learning
The 1990s and early 2000s saw a shift towards machine learning.
Instead of explicit rules, systems learned from data.
Support Vector Machines and decision trees gained prominence.
The availability of larger datasets and increased computational power fueled this paradigm.
Chapter 3: Deep Learning and Beyond
The 2010s marked the explosion of deep learning. Neural networks, particularly convolutional and recurrent ones,
achieved state-of-the-art results in image recognition and natural language processing.
Today, transformer architectures power large language models like GPT and BERT.
This continuous evolution points towards ever more sophisticated intelligent systems.
"""
doc_to_split = Document(page_content=text_for_semantic_chunking)
print("--- Aiming for Semantic Coherence with RecursiveCharacterTextSplitter ---")
# While not strictly "semantic" in the embedding-based sense,
# a well-configured RecursiveCharacterTextSplitter with appropriate separators
# *aims* to keep semantically related parts together by prioritizing larger structural breaks.
semantic_aware_splitter = RecursiveCharacterTextSplitter(
chunk_size=250,
chunk_overlap=30,
separators=["\n\n", "\n", ". ", "; ", ", ", " "] # Prioritize larger breaks first
)
sem_chunks = semantic_aware_splitter.split_documents([doc_to_split])
for i, chunk in enumerate(sem_chunks):
print(f"Chunk {i+1} (len: {len(chunk.page_content)}):")
print(f"'{chunk.page_content}'\n---")
print("\n**Note:** True semantic chunking often involves embedding adjacent sentences/paragraphs and looking for large drops in similarity. The above uses a rule-based approach to *try* to create semantically coherent chunks by splitting at natural paragraph/sentence breaks first.")
# For actual semantic chunking based on embeddings, you'd typically look into
# `langchain_experimental.text_splitter.SemanticChunker` or implement the logic yourself:
# 1. Split text into sentences/small paragraphs.
# 2. Embed each small unit.
# 3. Calculate cosine similarity between adjacent embeddings.
# 4. Identify where similarity drops significantly (a "valley") as potential chunk boundaries.
# 5. Combine units between valleys into final chunks.
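For reference, here is a minimal, self-contained sketch of the embedding-based approach described in the comments above, using sentence-transformers directly (langchain_experimental's SemanticChunker wraps similar logic). The threshold rule is illustrative, and the sketch reuses the text_for_semantic_chunking string defined earlier:
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, drop_threshold: float = 0.1) -> list[str]:
    """Split text where cosine similarity between adjacent sentences drops sharply."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences, normalize_embeddings=True)
    # Cosine similarity between each sentence and the next (embeddings are unit-normalized).
    similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
    cutoff = similarities.mean() - drop_threshold  # a "valley" is anything well below average
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], similarities):
        if sim < cutoff:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

for i, chunk in enumerate(semantic_chunks(text_for_semantic_chunking)):
    print(f"Semantic chunk {i + 1}: '{chunk[:80]}...'")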
Benefits of Semantic Chunking:
- Improved Retrieval Accuracy: Chunks are more likely to contain complete ideas, leading to better matches with user queries.
- Reduced LLM Confusion: LLMs receive more coherent context, making it easier for them to synthesize information.
3.2.2 Using Metadata for Chunking and Filtering
Metadata attached to documents and chunks can be leveraged for more intelligent chunking and highly precise retrieval.
Core Concept: Metadata-driven Chunking
Instead of just text content, chunks can incorporate relevant metadata (e.g., section title, author, date, document type) directly into their page_content before embedding, or use metadata for filtering in the vector database.
Practical Example: Metadata-Aware Chunking and Filtering
Mini-Project 3.2.2.1: Metadata-Enhanced RAG
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
import shutil
# --- Setup: Define paths and cleanup ---
DOCS_DIR = "metadata_docs"
PERSIST_DIR = "./metadata_chroma_db"
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
# Create documents with specific metadata to demonstrate filtering
doc1_content = """
Article: "The Future of Renewable Energy"
Authored by Dr. Elena Petrova on 2025-03-10.
This article discusses advancements in fusion power and grid-scale battery storage.
Fusion energy promises limitless clean power, while batteries are key for grid stability.
"""
doc2_content = """
Press Release: "Innovate Solutions Q2 2025 Earnings"
Released by John Smith on 2025-07-25.
Innovate Solutions reported a 10% increase in profits, largely due to our AI division.
Our new product, 'QuantumFlow', contributed significantly.
"""
doc3_content = """
Whitepaper: "Understanding Quantum Machine Learning"
Published by Dr. Elena Petrova on 2024-11-15.
This whitepaper explores the theoretical underpinnings and practical applications of QML.
It focuses on quantum algorithms for classification and optimization.
"""
# Create Document objects with rich metadata.
# Note: Chroma only accepts scalar metadata values (str, int, float, bool), so keyword lists are
# stored as comma-separated strings, and dates also get a numeric form for range filters.
doc1 = Document(page_content=doc1_content, metadata={"source": "report", "author": "Dr. Elena Petrova", "date": "2025-03-10", "date_int": 20250310, "keywords": "renewable energy, fusion, batteries"})
doc2 = Document(page_content=doc2_content, metadata={"source": "press_release", "author": "John Smith", "date": "2025-07-25", "date_int": 20250725, "company": "Innovate Solutions"})
doc3 = Document(page_content=doc3_content, metadata={"source": "whitepaper", "author": "Dr. Elena Petrova", "date": "2024-11-15", "date_int": 20241115, "keywords": "quantum computing, machine learning"})
documents = [doc1, doc2, doc3]
print(f"Created {len(documents)} documents with metadata.")
# 1. Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks.")
# 2. Embeddings and Vector Store
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents and metadata.")
# 3. Perform queries with metadata filtering
print("\n--- Performing queries with metadata filters ---")
# Query 1: Find information about "fusion power" authored by "Dr. Elena Petrova"
query_1 = "What are the latest findings on fusion power?"
print(f"\nQuery 1: '{query_1}' with filter: author='Dr. Elena Petrova'")
retrieved_1 = reloaded_vector_db.similarity_search(
query=query_1,
k=2,
filter={"author": "Dr. Elena Petrova"} # Apply metadata filter
)
for i, doc in enumerate(retrieved_1):
print(f"Doc {i+1} (Author: {doc.metadata.get('author')}, Source: {doc.metadata.get('source')}): '{doc.page_content[:100]}...'")
# Query 2: Find information about "AI products" released after 2025-01-01 (using $gt for "greater than")
# Note: ChromaDB supports operators like $eq, $ne, $gt, $gte, $lt, $lte, but range comparisons are
# intended for numeric values, which is why we also stored the date as an integer (date_int).
query_2 = "Tell me about new AI products."
print(f"\nQuery 2: '{query_2}' with filter: date_int > 20250101")
retrieved_2 = reloaded_vector_db.similarity_search(
    query=query_2,
    k=2,
    filter={"date_int": {"$gt": 20250101}} # Filter by numeric date greater than 2025-01-01
)
for i, doc in enumerate(retrieved_2):
print(f"Doc {i+1} (Date: {doc.metadata.get('date')}, Source: {doc.metadata.get('source')}): '{doc.page_content[:100]}...'")
# Query 3: Find any document tagged with the "quantum computing" keyword.
query_3 = "Any details on quantum algorithms?"
# Chroma metadata filters match whole values and do not perform substring matching, so filtering
# on the comma-separated 'keywords' string is unreliable. A cleaner approach is to store the main
# topic in its own metadata field and filter on that, as shown below with a modified doc3.
doc3_mod = Document(page_content=doc3_content, metadata={"source": "whitepaper", "author": "Dr. Elena Petrova", "date": "2024-11-15", "topic": "quantum computing"})
documents_mod = [doc1, doc2, doc3_mod]
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
vector_db_mod = Chroma.from_documents(
documents=text_splitter.split_documents(documents_mod),
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db_mod.persist()
reloaded_vector_db_mod = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print(f"\nQuery 3 (Revised): '{query_3}' with filter: topic='quantum computing'")
retrieved_3 = reloaded_vector_db_mod.similarity_search(
query=query_3,
k=2,
filter={"topic": "quantum computing"}
)
for i, doc in enumerate(retrieved_3):
print(f"Doc {i+1} (Topic: {doc.metadata.get('topic')}, Source: {doc.metadata.get('source')}): '{doc.page_content[:100]}...'")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- `Document` objects are created with a rich `metadata` dictionary.
- When performing `similarity_search`, the `filter` argument allows you to specify conditions on the metadata. ChromaDB supports various operators for filtering (e.g., `$eq`, `$gt`, `$lt`, `$in`, `$ne`).
Benefits of Metadata Filtering:
- Precision: Drastically reduces the search space, ensuring only truly relevant documents (based on structured criteria) are considered.
- Structured Search: Combines the flexibility of semantic search with the rigidity of structured data queries.
- Facet Search: Enables users to narrow down results by categories, dates, authors, etc.
Exercise 3.2.2.1: Advanced Metadata Filtering
Add another document to the metadata_docs with category: "legal" and region: "EU". Then, write a query that retrieves documents related to “regulations” filtered by category="legal" AND region="EU". (You might need to combine filters or adjust how ChromaDB handles multiple conditions.)
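One way to approach the combined filter: Chroma's where syntax supports joining conditions with $and (and $or). A short sketch under that assumption; the legal document below is a placeholder, and the vector store must be re-indexed after it is added:
# Hypothetical document for the exercise; content and metadata values are placeholders.
legal_doc = Document(
    page_content="EU AI regulations require transparency reports for high-risk systems.",
    metadata={"source": "legal_brief", "category": "legal", "region": "EU", "date": "2025-05-01"},
)
# After adding legal_doc to documents_mod and rebuilding the Chroma index, combine conditions:
results = reloaded_vector_db_mod.similarity_search(
    query="What regulations apply to our AI products?",
    k=2,
    filter={"$and": [{"category": {"$eq": "legal"}}, {"region": {"$eq": "EU"}}]},
)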
3.3 Fine-tuning for RAG
While using off-the-shelf embedding models and LLMs is a great starting point, fine-tuning can significantly boost performance for specific domains or tasks.
3.3.1 Fine-tuning Embedding Models
If your knowledge base contains highly specialized jargon or domain-specific language, a general-purpose embedding model might not capture the semantic nuances effectively. Fine-tuning an embedding model on your own data can improve the relevance of retrieval.
Core Concept: Contrastive Learning (e.g., with Triplet Loss) Fine-tuning embedding models often involves contrastive learning, where the model is trained to push embeddings of similar texts closer together and embeddings of dissimilar texts farther apart. This typically requires pairs of similar sentences or triplets of (anchor, positive, negative) sentences.
Practical Example: Conceptual Fine-tuning of Embeddings
Full fine-tuning requires a significant dataset and computational resources. This example outlines the concept and points to resources.
Mini-Project 3.3.1.1: Conceptual Embedding Model Fine-tuning
# Conceptual Outline for Fine-tuning an Embedding Model
# This is not a runnable code snippet due to the complexity of data preparation
# and training infrastructure required, but it outlines the steps.
print("--- Conceptual Fine-tuning of Embedding Models ---")
# Step 1: Data Preparation
print("1. Data Preparation: Create a dataset of (query, positive_document, negative_document) triplets.")
print(" - Positive document: A document that is highly relevant to the query.")
print(" - Negative document: A document that is not relevant to the query.")
print(" Example triplets for a medical RAG system:")
print(" Query: 'Symptoms of Type 2 Diabetes'")
print(" Positive: 'Common symptoms include increased thirst, frequent urination, and blurred vision.'")
print(" Negative: 'Treatment for Type 1 Diabetes involves insulin injections.'")
print(" This data often needs to be manually labeled or synthetically generated.")
# Step 2: Choose a base Sentence-Transformer model
print("\n2. Choose a Base Model: Start with a pre-trained Sentence-Transformer model (e.g., 'all-MiniLM-L6-v2').")
print(" `from sentence_transformers import SentenceTransformer`")
# model = SentenceTransformer('all-MiniLM-L6-v2')
# Step 3: Define Loss Function (e.g., TripletLoss)
print("\n3. Define Loss Function: Use a contrastive loss, like TripletLoss, to push positives closer and negatives further.")
# from sentence_transformers import losses
# from torch.utils.data import DataLoader
# train_loss = losses.TripletLoss(model=model)
# Step 4: Create DataLoader for training
print("\n4. Create DataLoader: Prepare your triplets for batch training.")
# from sentence_transformers.readers import InputExample
# train_examples = [InputExample(texts=[query, pos, neg]) for query, pos, neg in your_triplets]
# train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Step 5: Train the model
print("\n5. Train the Model: Iterate over epochs to fine-tune the embeddings.")
# model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
# Step 6: Evaluate
print("\n6. Evaluate: Measure retrieval performance (e.g., Recall@K, Mean Reciprocal Rank) on a validation set.")
print(" The goal is to see if the fine-tuned embeddings better capture domain-specific relevance.")
print("\n**Note:** Fine-tuning embedding models requires a solid understanding of PyTorch/TensorFlow, data preparation, and GPU resources. It's an advanced optimization technique.")
Benefits of Fine-tuning Embeddings:
- Higher Relevance: Embeddings become more specialized and accurate for your specific domain, leading to better retrieval.
- Reduced Data Size: More precise embeddings mean you might need to retrieve fewer documents, saving on LLM context window space and costs.
3.3.2 Fine-tuning the LLM for RAG
While RAG primarily relies on the LLM’s in-context learning capabilities, fine-tuning the LLM itself can further improve its ability to leverage retrieved context, summarize, and answer questions in a desired style.
Core Concept: Instruction-Following Fine-tuning This involves training the LLM on a dataset of (instruction, context, desired response) triplets, where the instruction explicitly tells the LLM to use the context and generate a specific type of answer.
Practical Example: Conceptual LLM Fine-tuning for RAG
Similar to embedding fine-tuning, this is a complex process.
Mini-Project 3.3.2.1: Conceptual LLM Fine-tuning for RAG
# Conceptual Outline for Fine-tuning an LLM for RAG
# This is not a runnable code snippet due to the complexity and resource requirements.
print("--- Conceptual Fine-tuning of LLMs for RAG ---")
# Step 1: Data Preparation
print("1. Data Preparation: Create a dataset of (question, context, desired_answer) pairs.")
print(" - Question: A user query.")
print(" - Context: The retrieved documents/chunks that would be provided by your RAG system.")
print(" - Desired Answer: A high-quality, concise, and grounded answer derived *only* from the context.")
print(" Example for a customer support RAG LLM:")
print(" Prompt: 'Question: How do I reset my password? Context: To reset your password, visit the login page and click 'Forgot Password'. Follow the instructions to receive a reset link.'")
print(" Completion: 'To reset your password, go to the login page, click 'Forgot Password', and follow the instructions to get a reset link.'")
print(" This dataset should closely mimic the actual prompts your RAG system will send to the LLM.")
# Step 2: Choose a Base LLM
print("\n2. Choose a Base LLM: Select a base model (e.g., a smaller open-source LLM like Llama 2 7B, or use OpenAI/Google's fine-tuning APIs).")
print(" For open-source, this involves using libraries like HuggingFace `transformers` and `peft`.")
# Step 3: Fine-tuning Method (e.g., Full Fine-tuning, LoRA, QLoRA)
print("\n3. Fine-tuning Method: Depending on resources, choose full fine-tuning or parameter-efficient methods like LoRA/QLoRA.")
print(" - Full fine-tuning: Updates all model parameters (resource intensive).")
print(" - LoRA/QLoRA: Updates only a small number of adapter parameters, much more efficient.")
# Step 4: Training
print("\n4. Training: Train the LLM on your prepared dataset.")
print(" This typically involves setting up a training loop, defining optimizer, learning rate, etc.")
# Step 5: Evaluation
print("\n5. Evaluation: Evaluate the fine-tuned LLM on a held-out test set.")
print(" Metrics: ROUGE scores for summarization, factual correctness, helpfulness, adherence to context.")
print("\n**Note:** LLM fine-tuning is significantly more resource-intensive than embedding model fine-tuning. It often requires GPUs and cloud computing platforms. However, it can yield highly specialized and superior performance for your RAG application.")
Benefits of Fine-tuning LLMs for RAG:
- Improved Context Utilization: LLM learns to better identify and synthesize information from retrieved chunks.
- Custom Style and Tone: Tailors the LLM’s output to your desired brand voice, conciseness, or verbosity.
- Reduced Instruction Dependence: Can follow RAG instructions more reliably even with less explicit prompting.
- Potentially Smaller Models: A fine-tuned smaller LLM might perform as well as a larger, general-purpose LLM on your specific RAG task.
Part 4: Building Robust RAG Pipelines and Agentic Systems
This section moves beyond basic retrieval and generation to cover integrating RAG into more complex applications, focusing on best practices for development, deployment, and integration with agentic frameworks.
4.1 Orchestration with LangChain and LlamaIndex
While you can build RAG systems from scratch, frameworks like LangChain and LlamaIndex provide abstractions and tools to simplify the process, offering modular components for each stage of the RAG pipeline.
Core Concepts: Chains, Agents, and Tools
- Chains (LangChain): Sequential or complex combinations of LLM calls, retrievers, document transformers, etc., designed to accomplish a specific task.
- Agents (LangChain/LlamaIndex): LLMs that use a
Toolto decide what actions to take, observe the outcome, and repeat until the task is complete. RAG is a prime example of a tool for an agent. - Tools: Functions or APIs that an agent can use (e.g., a RAG retriever, a calculator, a web search tool).
Practical Example: Building a RAG Chain with LangChain
Mini-Project 4.1.1: LangChain RAG Chain
We’ll use LangChain to streamline the RAG process using create_stuff_documents_chain and create_retrieval_chain.
pip install langchain langchain-community langchain-openai chromadb sentence-transformers
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI # Use specific OpenAI chat integration
from langchain_core.prompts import ChatPromptTemplate
# --- Setup: Document loading and indexing (reusing previous logic) ---
DOCS_DIR = "langchain_rag_docs"
PERSIST_DIR = "./langchain_rag_chroma_db"
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "report_2025.txt"), "w") as f:
f.write("""
Our 2025 annual report shows a 20% increase in renewable energy investments.
The new geothermal power project in Iceland is expected to start operations by Q3 2025.
Customer satisfaction scores reached an all-time high of 92%.
""")
with open(os.path.join(DOCS_DIR, "hr_updates.txt"), "w") as f:
f.write("""
HR Department announced new parental leave policies effective July 1, 2025.
Employees can now take up to 16 weeks of paid leave.
A new employee wellness program including free gym memberships will launch in Q4.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# --- LangChain RAG Integration ---
# 1. Initialize LLM (Ensure OPENAI_API_KEY is set in your environment)
try:
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
print("\nLangChain LLM (ChatOpenAI) initialized.")
except Exception as e:
print(f"Failed to initialize ChatOpenAI: {e}. Ensure OPENAI_API_KEY is set.")
llm = None
if llm:
# 2. Create a prompt template for combining retrieved documents
# The `stuff_documents_chain` expects a prompt with a `context` and `input` variable.
# The `context` will be filled by the retrieved documents, `input` by the user's question.
prompt = ChatPromptTemplate.from_template("""
Answer the user's question based on the provided context only.
If you cannot find the answer in the context, explicitly state that the information is not available.
Do not invent information.
Context:
{context}
Question: {input}
""")
# 3. Create a chain to combine documents and generate a response
# This chain takes documents and a user question, formats them into the prompt,
# and sends it to the LLM.
document_combiner_chain = create_stuff_documents_chain(llm, prompt)
# 4. Create a retriever from our vector database
retriever = reloaded_vector_db.as_retriever(search_kwargs={"k": 3})
# 5. Create the full RAG retrieval chain
# This chain first uses the retriever to get documents, then passes them to the document_combiner_chain.
retrieval_chain = create_retrieval_chain(retriever, document_combiner_chain)
# 6. Invoke the RAG chain
print("\n--- LangChain RAG Chain Ready! ---")
query1 = "What are the latest customer satisfaction scores?"
response1 = retrieval_chain.invoke({"input": query1})
print(f"\nUser Query: {query1}")
print(f"LangChain RAG Response: {response1['answer']}")
# print(f"Retrieved documents: {[doc.page_content for doc in response1['context']]}") # For debugging
query2 = "Tell me about the new HR policies."
response2 = retrieval_chain.invoke({"input": query2})
print(f"\nUser Query: {query2}")
print(f"LangChain RAG Response: {response2['answer']}")
query3 = "What is the capital of France?" # Outside the document context
response3 = retrieval_chain.invoke({"input": query3})
print(f"\nUser Query: {query3}")
print(f"LangChain RAG Response: {response3['answer']}")
else:
print("Skipping LangChain RAG demo due to LLM initialization failure.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- `ChatOpenAI`: LangChain’s wrapper for OpenAI’s chat models.
- `ChatPromptTemplate`: Used to define the structure of the RAG prompt with placeholders for `context` and `input`.
- `create_stuff_documents_chain`: A utility to create a chain that “stuffs” (inserts) multiple documents into a single prompt for the LLM.
- `reloaded_vector_db.as_retriever()`: Converts our ChromaDB instance into a LangChain `Retriever` object.
- `create_retrieval_chain`: Combines the `retriever` and `document_combiner_chain` into a single, cohesive RAG pipeline.
- `invoke({"input": query})`: Runs the RAG chain with the user’s query. The output `response['answer']` contains the LLM’s generated text, and `response['context']` contains the retrieved documents.
Benefits of Frameworks (LangChain/LlamaIndex):
- Modularity: Easy to swap components (different LLMs, vector stores, retrievers, text splitters).
- Abstraction: Simplifies complex interactions.
- Community and Ecosystem: Access to a vast array of integrations and pre-built components.
4.2 Building RAG-Enhanced Agentic Systems
RAG becomes even more powerful when integrated into agentic AI systems. An agent can intelligently decide when and how to use the RAG system as a tool to answer questions that require external knowledge.
Core Concept: Agents with Tools
An agent operates in a loop:
- Perceive: Receives a user input.
- Reason: Uses an LLM (the “brain”) to decide on the next action based on the input and available `Tools`.
- Act: Executes the chosen `Tool` (e.g., call a RAG retriever, perform a web search, use a calculator).
- Observe: Gets the result from the `Tool`.
- Loop: Continues reasoning and acting until the task is complete or it determines it cannot answer.
Practical Example: LangChain Agent with RAG Tool
Mini-Project 4.2.1: LangChain Agent with RAG Tool
We’ll build a simple agent that has two tools: our RAG retriever and a calculator. The agent will choose which tool to use based on the user’s query.
pip install langchain langchain-openai langchain-community chromadb sentence-transformers numexpr # numexpr is needed by the llm-math calculator tool
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
# --- Setup: Document loading and indexing (reusing previous logic) ---
DOCS_DIR = "langchain_agent_docs"
PERSIST_DIR = "./langchain_agent_chroma_db"
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "company_benefits.txt"), "w") as f:
f.write("""
Our employee benefits include comprehensive health, dental, and vision insurance.
We also offer a generous 401(k) matching program, up to 5% of your salary.
Employees receive 20 days of paid time off (PTO) annually.
""")
with open(os.path.join(DOCS_DIR, "company_events.txt"), "w") as f:
f.write("""
The annual company picnic will be held on August 23, 2025, at City Park.
Our holiday party is scheduled for December 15, 2025.
We host quarterly hackathons, with the next one on September 10, 2025.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# --- LangChain Agent with RAG Tool ---
# 1. Initialize LLM for Agent (can be a different model than the RAG LLM)
try:
llm_agent = ChatOpenAI(model="gpt-4o", temperature=0) # gpt-4o or gpt-4 for better reasoning
print("\nLangChain Agent LLM (ChatOpenAI) initialized.")
except Exception as e:
print(f"Failed to initialize ChatOpenAI for agent: {e}. Ensure OPENAI_API_KEY is set.")
llm_agent = None
if llm_agent:
# 2. Define the RAG Tool
# The retriever needs a way to be exposed as a tool
def rag_query_tool(query: str) -> str:
"""
Searches the company knowledge base for information about company policies, benefits, and events.
Input should be a concise question or keywords.
"""
# Directly use our reloaded vector DB for similarity search
retrieved_docs = reloaded_vector_db.similarity_search(query, k=3)
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
# Use an LLM to generate an answer from the retrieved context
# This is a sub-LLM call within the tool's action
llm_for_tool = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt_for_tool = ChatPromptTemplate.from_template("""
Based ONLY on the following context, answer the user's question concisely.
If the information is not in the context, state that it's not available.
Context:
{context}
Question: {input}
""")
chain_for_tool = prompt_for_tool | llm_for_tool
response = chain_for_tool.invoke({"context": context, "input": query})
return response.content
rag_tool = Tool(
name="CompanyKnowledgeBase",
func=rag_query_tool,
description="Useful for answering questions about company policies, employee benefits, and upcoming events."
)
# 3. Define other tools (e.g., a simple calculator)
# LangChain's built-in "llm-math" tool (named "Calculator", backed by numexpr) serves as the calculator.
from langchain.agents import load_tools
calculator_tool = load_tools(["llm-math"], llm=llm_agent)[0]
# 4. List all available tools for the agent
tools = [rag_tool, calculator_tool]
# 5. Define the Agent's Prompt
# This prompt guides the LLM (agent) on how to use the tools and respond.
agent_prompt = PromptTemplate.from_template("""
You are a helpful and intelligent assistant.
You have access to the following tools:
{tools}
Use the tools to answer the user's question.
If a question requires information from the company knowledge base, use the 'CompanyKnowledgeBase' tool.
If a question requires calculation, use the 'Calculator' tool.
If you cannot answer using the tools, state that you cannot answer the question.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
""")
# 6. Create the AgentExecutor
agent = create_react_agent(llm_agent, tools, agent_prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)
# 7. Interact with the Agent
print("\n--- LangChain RAG-Enabled Agent Activated! ---")
print("Ask questions about company info or perform calculations. Type 'exit' to quit.")
while True:
user_input = input("\nYou: ")
if user_input.lower() == 'exit':
break
try:
response = agent_executor.invoke({"input": user_input})
print(f"Agent Final Answer: {response['output']}")
except Exception as e:
print(f"Agent Error: {e}")
else:
print("Skipping LangChain Agent demo due to LLM initialization failure.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- rag_query_tool function: This custom Python function encapsulates the RAG logic (retrieval + LLM generation from context). It is specifically designed to be callable by the agent.
- Tool object: LangChain's way of wrapping functions or API calls so the agent can use them. The description is crucial, as the agent's LLM uses this to decide which tool is appropriate.
- Calculator tool: An example of another tool an agent might have.
- create_react_agent: Creates an agent that uses the ReAct (Reasoning and Acting) framework, where the LLM thinks (Thought), takes an Action, observes the Observation, and repeats.
- AgentExecutor: Runs the agent, managing the Thought/Action/Observation loop. verbose=True is very helpful for debugging agent behavior.
Benefits of RAG-Enhanced Agents:
- Intelligent Tool Use: Agents can dynamically choose the best tool (including RAG) for a given query, improving efficiency and accuracy.
- Complex Workflows: Can break down complex queries into sub-tasks, using different tools as needed.
- Flexibility: Easily extendable with new tools (e.g., API callers, code interpreters, web search).
4.2.1 Exercise: Adding a Web Search Tool to the Agent
Goal: Enhance our LangChain agent by giving it the ability to perform web searches for information outside its internal RAG knowledge base.
Instructions:
- Install the necessary library for web search (e.g., duckduckgo_search).
- Define a new Tool for web searching using DuckDuckGoSearchRun.
- Add this new tool to the agent's list of tools.
- Modify the agent's agent_prompt to instruct it on when to use the Web Search tool (e.g., for general knowledge or real-time information not found in the company knowledge base).
- Test the agent with questions that require web search and questions that require RAG.
pip install langchain langchain_openai chromadb sentence-transformers numexpr duckduckgo-search
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_community.tools import DuckDuckGoSearchRun # For web search
# --- Setup: Document loading and indexing (reusing previous logic) ---
DOCS_DIR = "langchain_agent_docs_web"
PERSIST_DIR = "./langchain_agent_chroma_db_web"
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "company_benefits.txt"), "w") as f:
f.write("""
Our employee benefits include comprehensive health, dental, and vision insurance.
We also offer a generous 401(k) matching program, up to 5% of your salary.
Employees receive 20 days of paid time off (PTO) annually.
""")
with open(os.path.join(DOCS_DIR, "company_events.txt"), "w") as f:
f.write("""
The annual company picnic will be held on August 23, 2025, at City Park.
Our holiday party is scheduled for December 15, 2025.
We host quarterly hackathons, with the next one on September 10, 2025.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# --- LangChain Agent with RAG and Web Search Tools ---
# 1. Initialize LLM for Agent
try:
# Using gpt-4o for better reasoning and tool-use capabilities
llm_agent = ChatOpenAI(model="gpt-4o", temperature=0)
print("\nLangChain Agent LLM (ChatOpenAI) initialized.")
except Exception as e:
print(f"Failed to initialize ChatOpenAI for agent: {e}. Ensure OPENAI_API_KEY is set.")
llm_agent = None
if llm_agent:
# 2. Define the RAG Tool (same as before)
def rag_query_tool(query: str) -> str:
"""
Searches the company knowledge base for information about company policies, benefits, and events.
Input should be a concise question or keywords.
"""
retrieved_docs = reloaded_vector_db.similarity_search(query, k=3)
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
llm_for_tool = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt_for_tool = ChatPromptTemplate.from_template("""
Based ONLY on the following context, answer the user's question concisely.
If the information is not in the context, state that it's not available.
Context:
{context}
Question: {input}
""")
chain_for_tool = prompt_for_tool | llm_for_tool
response = chain_for_tool.invoke({"context": context, "input": query})
return response.content
rag_tool = Tool(
name="CompanyKnowledgeBase",
func=rag_query_tool,
description="Useful for answering questions about company policies, employee benefits, and upcoming events."
)
# 3. Define the Web Search Tool
# DuckDuckGoSearchRun is a simple web search tool from LangChain Community
web_search_tool = DuckDuckGoSearchRun(name="Web_Search", description="Useful for general knowledge questions or real-time information, such as current events, general facts, or things outside the company knowledge base.")
# 4. Define other tools (e.g., a simple calculator)
    # LangChain's built-in "llm-math" tool (requires numexpr); this tool is named "Calculator".
    from langchain.agents import load_tools
    calculator_tool = load_tools(["llm-math"], llm=llm_agent)[0]
# 5. List all available tools for the agent
tools = [rag_tool, calculator_tool, web_search_tool] # Add the new web search tool
# 6. Define the Agent's Prompt (modified to include the new tool)
agent_prompt = PromptTemplate.from_template("""
You are a helpful and intelligent assistant.
You have access to the following tools:
{tools}
Use the tools to answer the user's question.
- If a question requires information from the company knowledge base, use the 'CompanyKnowledgeBase' tool.
- If a question requires calculation, use the 'Calculator' tool.
- If a question requires general or real-time information not present in the company knowledge base, use the 'Web_Search' tool.
If you cannot answer using the tools, state that you cannot answer the question.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
""")
# 7. Create the AgentExecutor
agent = create_react_agent(llm_agent, tools, agent_prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)
# 8. Interact with the Agent
print("\n--- LangChain RAG-Enabled Agent with Web Search Activated! ---")
print("Ask questions about company info, perform calculations, or ask general knowledge questions. Type 'exit' to quit.")
while True:
user_input = input("\nYou: ")
if user_input.lower() == 'exit':
break
try:
response = agent_executor.invoke({"input": user_input})
print(f"\nAgent Final Answer: {response['output']}")
except Exception as e:
print(f"Agent Error: {e}")
else:
print("Skipping LangChain Agent demo due to LLM initialization failure.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Testing the Enhanced Agent:
- Company-specific query: "How many days of PTO do employees get?" (should use CompanyKnowledgeBase)
- Calculation query: "What is 123 multiplied by 456?" (should use Calculator)
- General knowledge query: "What is the capital of Australia?" (should use Web_Search)
- Real-time query: "What is the current date?" (should use Web_Search)
This exercise demonstrates the power of combining RAG with other tools in an agentic system, allowing the LLM to intelligently adapt its approach based on the nature of the query.
Part 5: Optimizing RAG for Performance, Relevance, and Scalability
Once your RAG system is functional, the next challenge is to optimize it for real-world scenarios. This involves considerations for speed, accuracy, and handling large volumes of data and users.
5.1 Evaluating RAG System Performance
Measuring the effectiveness of your RAG pipeline is crucial for iterative improvement. Evaluation metrics can focus on both the retrieval phase and the generation phase.
Core Concepts: Retrieval and Generation Metrics
Retrieval Metrics:
- Recall@k: The proportion of relevant documents that are found among the top k retrieved documents.
- Precision@k: The proportion of retrieved documents in the top k that are relevant.
- MRR (Mean Reciprocal Rank): Measures how high the first relevant document is ranked (a minimal sketch follows this list).
- NDCG (Normalized Discounted Cumulative Gain): Accounts for the position of relevant documents and their relevance scores.
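As noted in the MRR item above, reciprocal rank is simple to compute by hand. Below is a minimal sketch; the function name and data shape are illustrative, matching the dict-with-"id" convention used in the mini-project later in this section.
import typing
def calculate_mrr(retrieved_docs: typing.List[dict], relevant_doc_ids: typing.List[str]) -> float:
    """Reciprocal rank of the first relevant retrieved document (0.0 if none is relevant)."""
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc["id"] in relevant_doc_ids:
            return 1.0 / rank
    return 0.0
# Example: the first relevant document appears at rank 2, so the reciprocal rank is 0.5.
retrieved = [{"id": "doc_a"}, {"id": "doc_b"}, {"id": "doc_c"}]
print(calculate_mrr(retrieved, ["doc_b", "doc_c"]))  # 0.5
# Averaging this value over all evaluation queries gives the Mean Reciprocal Rank.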
Generation Metrics (RAG-specific):
- Faithfulness/Grounding: How well the generated answer is supported by the retrieved context. (Crucial for RAG to fight hallucinations).
- Relevance: How relevant the generated answer is to the user’s query.
- Answer Correctness/Accuracy: Is the answer factually correct (requires human or gold-standard evaluation)?
- Conciseness, Fluency, Coherence: General LLM quality metrics.
Practical Example: Manual Evaluation and Tools
Automated RAG evaluation is a developing field. Often, a combination of automated metrics (for retrieval) and human evaluation (for generation quality) is used. Libraries like RAGAS and LlamaIndex have built-in evaluation capabilities.
Mini-Project 5.1.1: Conceptual RAG Evaluation with Mock Data
We’ll simulate a small dataset and demonstrate how metrics would be calculated.
pip install -q datasets # For example data loading later
import random
from typing import List, Dict, Any
# Mock data for demonstration
# In a real scenario, this would come from a test set with human labels.
eval_dataset = [
{
"query": "Who founded InnovateCorp?",
"ground_truth_answer": "InnovateCorp was founded by Dr. Anya Sharma.",
"relevant_doc_ids": ["doc_innovatecorp_founder", "doc_company_history_p1"] # IDs of docs that contain the answer
},
{
"query": "What is EcoBuild AI?",
"ground_truth_answer": "EcoBuild AI is InnovateCorp's flagship product that helps cities optimize energy consumption and waste management through predictive analytics.",
"relevant_doc_ids": ["doc_ecobuild_product_desc", "doc_company_info_q1"]
},
{
"query": "When was the last hackathon?",
"ground_truth_answer": "The next hackathon is on September 10, 2025. Information on the 'last' one is not specified.",
"relevant_doc_ids": ["doc_company_events"]
}
]
# Simulate retrieved documents (e.g., from your ChromaDB)
# For simplicity, we'll just use string content for "retrieved_doc_content"
mock_retrieved_data_for_eval = {
"query_innovatecorp": [
{"id": "doc_innovatecorp_founder", "content": "InnovateCorp was founded in 2010 by Dr. Anya Sharma."},
{"id": "doc_company_history_p1", "content": "Dr. Anya Sharma's vision led to InnovateCorp's inception."},
{"id": "doc_product_vision", "content": "Our product vision centers on sustainability."}, # Less relevant
{"id": "doc_company_events", "content": "Annual company picnic on Aug 23, 2025."}, # Irrelevant
],
"query_ecobuild": [
{"id": "doc_ecobuild_product_desc", "content": "EcoBuild AI helps cities optimize energy consumption and waste management through predictive analytics."},
{"id": "doc_company_info_q1", "content": "We recently launched our flagship product, EcoBuild AI, in Q1 2025."},
{"id": "doc_ai_research", "content": "Our AI research focuses on deep learning models."}, # Partially relevant
],
"query_hackathon": [
{"id": "doc_company_events", "content": "We host quarterly hackathons, with the next one on September 10, 2025."},
{"id": "doc_hr_policy", "content": "New remote work policy effective Oct 1, 2025."}, # Irrelevant
]
}
# Simulate LLM generated answers
mock_llm_answers_for_eval = {
"query_innovatecorp": "InnovateCorp was founded by Dr. Anya Sharma.",
"query_ecobuild": "EcoBuild AI is InnovateCorp's flagship product, launched in Q1 2025, which optimizes energy consumption and waste management for cities using predictive analytics.",
"query_hackathon": "The next hackathon is scheduled for September 10, 2025. Information on previous hackathons is not available in the provided context."
}
print("--- Conceptual RAG Evaluation ---")
# --- Retrieval Metrics ---
def calculate_recall_at_k(retrieved_docs: List[Dict], relevant_doc_ids: List[str], k: int) -> float:
retrieved_ids = {doc["id"] for doc in retrieved_docs[:k]}
hits = len(retrieved_ids.intersection(set(relevant_doc_ids)))
return min(1.0, hits / len(relevant_doc_ids)) if relevant_doc_ids else 0.0 # Handle case where no relevant docs are specified
def calculate_precision_at_k(retrieved_docs: List[Dict], relevant_doc_ids: List[str], k: int) -> float:
retrieved_ids = {doc["id"] for doc in retrieved_docs[:k]}
hits = len(retrieved_ids.intersection(set(relevant_doc_ids)))
return hits / k if k > 0 else 0.0
print("\n**Retrieval Evaluation (Manual Simulation)**")
for i, eval_item in enumerate(eval_dataset):
query_key = f"query_{eval_item['query'].split()[2].lower()}" # Simple key generation
retrieved = mock_retrieved_data_for_eval.get(query_key, [])
relevant_ids = eval_item["relevant_doc_ids"]
recall_at_3 = calculate_recall_at_k(retrieved, relevant_ids, k=3)
precision_at_3 = calculate_precision_at_k(retrieved, relevant_ids, k=3)
print(f"\nQuery: '{eval_item['query']}'")
print(f" Relevant IDs: {relevant_ids}")
print(f" Retrieved IDs (top 3): {[doc['id'] for doc in retrieved[:3]]}")
print(f" Recall@3: {recall_at_3:.2f}")
print(f" Precision@3: {precision_at_3:.2f}")
# --- Generation Metrics (Conceptual / Human Evaluation) ---
print("\n**Generation Evaluation (Conceptual)**")
print("For generation metrics like Faithfulness, Relevance, and Correctness, human evaluation is often the gold standard.")
print("Automated tools like RAGAS attempt to proxy these with LLMs.")
for i, eval_item in enumerate(eval_dataset):
query_key = f"query_{eval_item['query'].split()[2].lower()}"
generated_answer = mock_llm_answers_for_eval.get(query_key, "N/A")
ground_truth = eval_item["ground_truth_answer"]
retrieved_context_for_answer = "\n".join([doc["content"] for doc in mock_retrieved_data_for_eval.get(query_key, [])])
print(f"\nQuery: '{eval_item['query']}'")
print(f" Ground Truth: '{ground_truth}'")
print(f" Generated Answer: '{generated_answer}'")
print(f" Retrieved Context Used: '{retrieved_context_for_answer[:150]}...'")
print(" -> Human Evaluation needed for: Faithfulness, Relevance, Correctness, Conciseness.")
# Example: A human would rate:
# Faithfulness: Yes/No (Is answer solely based on context?)
# Relevance: High/Medium/Low
# Correctness: Correct/Partially Correct/Incorrect
print("\nFor more sophisticated automated RAG evaluation, consider tools like `RAGAS` or `LlamaIndex` built-in evaluation modules.")
Explanation:
- Evaluation Dataset: A set of (query, ground_truth_answer, relevant_doc_ids) tuples. Creating this dataset is often the most labor-intensive part of RAG evaluation.
- Mock Retrieval/Generation: We simulate the output of our RAG system. In reality, you would run your RAG pipeline on the eval_dataset.
- Metric Functions: Simple implementations of Recall@k and Precision@k.
- Human-in-the-Loop: Emphasized for judging generation quality, as LLMs struggle to accurately self-assess hallucination or deep factual correctness (a crude word-overlap proxy is sketched after this list).
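As flagged in the last bullet, fully automated faithfulness scoring is hard. Still, a crude screening proxy is to measure how much of an answer's wording actually appears in the retrieved context. The sketch below reuses the mock data from the mini-project above; the word-overlap measure and the 0.6 threshold are illustrative heuristics only, not a substitute for RAGAS-style LLM judging or human review.
import re

def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of non-trivial answer words that also occur in the retrieved context."""
    def tokenize(text: str) -> set:
        return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}
    answer_words = tokenize(answer)
    context_words = tokenize(context)
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

for key, answer in mock_llm_answers_for_eval.items():
    context = "\n".join(doc["content"] for doc in mock_retrieved_data_for_eval[key])
    score = faithfulness_proxy(answer, context)
    flag = "OK" if score >= 0.6 else "REVIEW"  # 0.6 is an arbitrary, illustrative threshold
    print(f"{key}: word-overlap faithfulness proxy = {score:.2f} ({flag})")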
Tools for Automated RAG Evaluation:
- RAGAS: A framework designed specifically for RAG evaluation. It uses an LLM to judge faithfulness, answer relevance, context relevance, and context recall, reducing the need for extensive human labeling.
- LlamaIndex Evaluation: Provides modules for generating evaluation datasets and running standard retrieval and generation metrics.
Exercise 5.1.1: Explore RAGAS
Research RAGAS and set up a basic evaluation pipeline for a simple RAG system (you can use your Mini-Project 2 RAG chatbot). Generate a small synthetic dataset or use a provided example from RAGAS documentation, and run its core metrics (faithfulness, answer relevance, context relevance, context recall). This will involve installing ragas and potentially setting up an LLM for evaluation.
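As a starting point for this exercise, the sketch below shows roughly what a RAGAS run can look like. It assumes the ragas 0.1.x API (the evaluate function and the metric objects shown), a tiny hand-written dataset, and an OPENAI_API_KEY for the judge LLM; column names and metric imports have changed between RAGAS releases, so verify against the current documentation.
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# A tiny hand-written sample; in practice, build these rows by running your RAG pipeline over a test set.
eval_rows = {
    "question": ["How many days of PTO do employees get?"],
    "answer": ["Employees receive 20 days of paid time off annually."],
    "contexts": [["Employees receive 20 days of paid time off (PTO) annually."]],
    "ground_truth": ["Employees receive 20 days of PTO per year."],
}
dataset = Dataset.from_dict(eval_rows)

# evaluate() calls an LLM judge per metric and returns aggregate scores.
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)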
5.2 Optimizing Latency and Throughput
A RAG system needs to be fast and handle many requests concurrently, especially for real-time applications.
Core Concepts: Caching, Batching, Asynchronous Processing, Hardware
- Caching: Store results of expensive operations (e.g., embedding lookups, common LLM responses) to avoid recomputing; a minimal embedding-cache sketch follows this list.
- Batching: Process multiple queries or embedding requests together to leverage parallel processing capabilities (especially for GPUs).
- Asynchronous Processing: Use asyncio in Python to handle multiple requests concurrently without blocking the main thread.
- Hardware Acceleration: Utilize GPUs for embedding generation and vector database operations where possible.
- Distributed Systems: For extreme scale, distribute your vector database and LLM inference across multiple servers.
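As referenced in the caching item above, even a simple in-process cache for query embeddings pays off when users ask repeated or near-identical questions. This is a minimal sketch using functools.lru_cache; in production you would more likely use an external cache (e.g., Redis) or LangChain's built-in caching utilities.
from functools import lru_cache
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def cached_query_embedding(query: str) -> tuple:
    """Embed a query once and reuse the result for repeated, identical queries."""
    # Return an immutable tuple so callers cannot accidentally mutate the cached value.
    return tuple(embedding_model.embed_query(query))

# First call computes the embedding; the second identical call is served from the cache.
_ = cached_query_embedding("How many days of PTO do employees get?")
_ = cached_query_embedding("How many days of PTO do employees get?")
print(cached_query_embedding.cache_info())  # hits=1, misses=1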
Practical Example: Batching Embeddings
Mini-Project 5.2.1: Batching Embedding Generation
This shows a simple way to batch, which can be extended to parallel processing.
import time
from langchain_community.embeddings import HuggingFaceEmbeddings
# Create a list of text snippets to embed
texts_to_embed = [
f"This is document number {i}. It contains some random text to simulate real data for embedding."
for i in range(100)
]
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
print("--- Demonstrating Embedding Batching ---")
# Non-batched approach
start_time = time.time()
single_embeddings = [embedding_model.embed_query(text) for text in texts_to_embed]
end_time = time.time()
print(f"Non-batched embedding for {len(texts_to_embed)} texts took: {end_time - start_time:.4f} seconds")
# Batched approach
# HuggingFaceEmbeddings' embed_documents method is inherently batched/optimized
start_time = time.time()
batched_embeddings = embedding_model.embed_documents(texts_to_embed)
end_time = time.time()
print(f"Batched embedding for {len(texts_to_embed)} texts took: {end_time - start_time:.4f} seconds")
# Verify output dimensions
print(f"Dimension of single embedding: {len(single_embeddings[0])}")
print(f"Dimension of batched embedding: {len(batched_embeddings[0])}")
print(f"Number of batched embeddings: {len(batched_embeddings)}")
# Note: The actual speedup depends heavily on the model, hardware, and underlying implementation
# of the embedding provider. For local models, larger batches usually mean better GPU utilization.
Explanation:
- The embed_documents method of HuggingFaceEmbeddings (and similarly for the OpenAI/Google APIs) is designed to process lists of texts efficiently, often leveraging internal batching.
- For very large datasets, you would manage your own batches and potentially parallelize their submission.
Exercise 5.2.1: Asynchronous RAG Calls (Conceptual)
Outline how you would modify a RAG system to handle multiple incoming user queries asynchronously using Python’s asyncio. Focus on making the retrieval and LLM generation steps non-blocking. (No runnable code expected, just a conceptual design).
# Conceptual Outline: Asynchronous RAG Calls
import asyncio
import time
from typing import List, Dict
# Assume these are your existing synchronous RAG components
# In a real async system, these would ideally be async functions themselves.
def sync_retrieve_documents(query: str) -> List[Dict]:
"""Simulates a synchronous document retrieval."""
print(f" [Sync] Retrieving for: {query}")
# Placeholder for actual vector DB call
    time.sleep(0.5)  # Simulate a blocking I/O-bound call (e.g., a vector DB query)
return [{"content": f"Context for {query}"}]
def sync_generate_response(query: str, context: List[Dict]) -> str:
"""Simulates a synchronous LLM generation."""
print(f" [Sync] Generating for: {query}")
# Placeholder for actual LLM API call
    time.sleep(1.0)  # Simulate a blocking LLM API call
return f"Answer for '{query}' based on context '{context[0]['content']}'"
# --- Async Wrappers for Synchronous RAG Components ---
async def async_retrieve_documents(query: str) -> List[Dict]:
"""Asynchronous wrapper for retrieval."""
    # asyncio.to_thread runs the synchronous call in a worker thread
    # (via a thread-pool executor), preventing it from blocking the event loop.
return await asyncio.to_thread(sync_retrieve_documents, query)
async def async_generate_response(query: str, context: List[Dict]) -> str:
"""Asynchronous wrapper for generation."""
return await asyncio.to_thread(sync_generate_response, query, context)
# --- Full Asynchronous RAG Pipeline ---
async def async_rag_pipeline(query: str) -> str:
"""Runs the full RAG pipeline asynchronously."""
retrieved_context = await async_retrieve_documents(query)
response = await async_generate_response(query, retrieved_context)
return response
async def main():
queries = [
"What is the company's vacation policy?",
"Latest Q1 earnings report?",
"When is the next team building event?",
"What are the benefits of quantum computing?"
]
print("--- Running multiple RAG queries concurrently ---")
start_time = time.time()
tasks = [async_rag_pipeline(q) for q in queries]
results = await asyncio.gather(*tasks) # Run all tasks concurrently
end_time = time.time()
for i, (query, result) in enumerate(zip(queries, results)):
print(f"\nQuery {i+1}: {query}")
print(f"Result: {result}")
print(f"\nTotal time for {len(queries)} concurrent queries: {end_time - start_time:.4f} seconds")
print(f"Expected theoretical sync time: {len(queries) * (0.5 + 1.0):.4f} seconds")
print("Actual async time should be closer to the longest single pipeline execution time if truly parallel (e.g., ~1.5 seconds)")
if __name__ == "__main__":
    asyncio.run(main())
print("\n**Conceptual Design Notes:**")
print("- `asyncio.to_thread` (Python 3.9+) is used to safely run synchronous (blocking) I/O or CPU-bound code without blocking the event loop.")
print("- If your LLM/VectorDB clients have native async interfaces (e.g., `openai.AsyncClient`), use those directly instead of `asyncio.to_thread` for better performance.")
print("- `asyncio.gather` efficiently runs multiple async tasks concurrently.")
5.3 Scaling RAG for Production
Deploying RAG in production involves managing infrastructure, monitoring, and continuous improvement.
Core Concepts: Cloud Services, CI/CD, Observability
- Managed Vector Databases: Use cloud-managed vector stores (Pinecone, Weaviate, Qdrant Cloud, Milvus Cloud, Azure AI Search, AWS OpenSearch) for scalability, reliability, and ease of operations.
- Cloud LLM APIs: Leverage highly scalable LLM providers (OpenAI, Google Gemini, Anthropic, Cohere) with robust APIs.
- Containerization (Docker) and Orchestration (Kubernetes): Package your RAG application for consistent deployment across environments and manage scaling.
- CI/CD Pipelines: Automate testing, building, and deployment of your RAG system.
- Observability (Logging, Monitoring, Tracing): Implement comprehensive logging for all RAG stages, monitor key metrics (latency, error rates, token usage), and use tracing to debug complex multi-step interactions; a minimal tracing setup is sketched after this list.
- A/B Testing: Experiment with different RAG configurations (chunking, retrievers, prompts) and measure their impact on user experience.
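As referenced in the observability item above, much of the tracing story can be bootstrapped with a few environment variables plus standard logging. The sketch below enables LangSmith tracing for LangChain runs; it assumes you have a LangSmith account and API key, and the project name is just an example.
import logging
import os

# Enable LangSmith tracing for all LangChain chain/agent invocations in this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ.get("LANGSMITH_API_KEY", "<your-langsmith-api-key>")
os.environ["LANGCHAIN_PROJECT"] = "rag-production"  # example project name

# Basic application logging for the RAG service itself.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("rag_service")
logger.info("RAG service starting: tracing=%s project=%s",
            os.environ["LANGCHAIN_TRACING_V2"], os.environ["LANGCHAIN_PROJECT"])
# Any retrieval_chain.invoke(...) or agent_executor.invoke(...) call made after this point
# is traced to LangSmith automatically.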
Practical Considerations for Deployment
Mini-Project 5.3.1: Conceptual Production RAG Architecture
This section focuses on architectural patterns rather than runnable code, as production deployments are infrastructure-heavy.
graph TD
    A[User Request] --> B{"API Gateway / Load Balancer"};
    B --> C["RAG Service (Python FastAPI / Flask)"];
    C --> D["Retrieve Context (Vector DB Client)"];
    D --> E["Managed Vector Database (e.g., Pinecone, Weaviate)"];
    C --> F["Generate Response (LLM Client)"];
    F --> G["Cloud LLM API (e.g., OpenAI, Gemini)"];
    C --> H["Logging / Monitoring (e.g., Prometheus, Grafana, ELK)"];
    C --> I["Tracing (e.g., OpenTelemetry, LangSmith)"];
    H --> J[Alerting];
    G --> K[LLM Provider Infrastructure];
    E --> L[Vector DB Infrastructure];
    M[Data Ingestion Pipeline] --> N[Document Loaders];
    N --> O[Text Splitters];
    O --> P[Embedding Models];
    P --> E;
    Q["Scheduler (e.g., Airflow, Prefect)"] --> M;
Key Architectural Components:
- API Gateway/Load Balancer: Entry point for user requests, handles traffic distribution.
- RAG Service: Your application logic (e.g., a FastAPI or Flask app) that orchestrates retrieval and generation. This would be containerized; a minimal skeleton is sketched after this list.
- Vector DB Client: Interacts with your chosen vector database.
- Managed Vector Database: Handles vector storage and ANN search at scale.
- LLM Client: Communicates with the LLM API.
- Cloud LLM API: Provides the generative capabilities.
- Data Ingestion Pipeline: An offline process for continuously updating your knowledge base.
- Scheduler: Automates the ingestion (e.g., daily, hourly).
- Document Loaders, Text Splitters, Embedding Models: The components you built in Part 1.
- Observability Stack:
- Logging: Centralized logging of application events, errors, and LLM interactions.
- Monitoring: Track metrics like request latency, error rates, token usage, and vector database health.
- Tracing: End-to-end visibility of requests across different RAG components. Tools like LangChain's LangSmith are specifically designed for this.
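To make the 'RAG Service' component concrete (as referenced in that bullet), here is a minimal, hypothetical FastAPI skeleton. The endpoint path and request model are illustrative, and retrieval_chain stands in for a chain built as shown earlier in this guide.
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="RAG Service (sketch)")

class QueryRequest(BaseModel):
    question: str

# Assume a retrieval chain is built at startup, e.g. with create_retrieval_chain(...)
# as shown earlier; here it is left as a placeholder.
retrieval_chain = None  # replace with your real chain

@app.post("/query")
def query_rag(request: QueryRequest) -> dict:
    """Run the RAG pipeline for a single question and return the answer."""
    if retrieval_chain is None:
        return {"error": "retrieval_chain not configured"}
    result = retrieval_chain.invoke({"input": request.question})
    return {"answer": result["answer"]}

# Run locally with: uvicorn rag_service:app --host 0.0.0.0 --port 8000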
Exercise 5.3.1: Choosing a Production Vector Database
Research three different cloud-managed vector databases (e.g., Pinecone, Weaviate, Qdrant Cloud). Compare them based on criteria like:
- Pricing model
- Scalability features
- Supported data types/filtering capabilities
- Ease of integration with Python/LangChain
- Advanced search features (e.g., hybrid search support, re-ranking integrations)
Present your findings as a brief summary for each, highlighting their strengths and weaknesses for a hypothetical RAG application (e.g., a customer support chatbot for a large e-commerce company).
Part 6: Ethical Considerations, Limitations, and Future Directions of RAG
As RAG becomes more sophisticated, it’s vital to consider its societal impact, inherent limitations, and the exciting research avenues ahead.
6.1 Ethical Considerations in RAG
RAG systems, by their nature of retrieving and synthesizing information, carry significant ethical implications.
Core Concepts: Bias, Misinformation, Transparency, Data Privacy
- Bias Amplification: If the underlying knowledge base contains biases (historical, societal, demographic), the RAG system will retrieve and potentially amplify them. The embedding model can also encode biases.
- Misinformation and “Hallucination” by Retrieval: While RAG reduces LLM hallucination, it can still propagate misinformation if the retrieved documents themselves are inaccurate, outdated, or malicious.
- Transparency and Explainability: Users need to understand where the information comes from. RAG systems can provide source attribution, but users may not always review it.
- Data Privacy and Security: Handling sensitive or proprietary data in a RAG system requires robust security measures to prevent unauthorized access or leakage.
- Copyright and IP: Using copyrighted material in a RAG knowledge base for commercial applications raises legal questions, especially concerning the transformation and synthesis of content.
Practical Example: Source Attribution
Mini-Project 6.1.1: Enhancing RAG Output with Source Citation
We’ve implicitly done this by printing doc.metadata.get('source'). Let’s make it an explicit part of the LLM’s answer generation.
import os
import shutil
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
from typing import List
# --- Setup: Document loading and indexing ---
DOCS_DIR = "rag_citation_docs"
PERSIST_DIR = "./rag_citation_chroma_db"
if os.path.exists(DOCS_DIR): shutil.rmtree(DOCS_DIR, ignore_errors=True)
os.makedirs(DOCS_DIR, exist_ok=True)
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
with open(os.path.join(DOCS_DIR, "report_finance.txt"), "w") as f:
f.write("""
Source: Acme Corp 2024 Annual Financial Report.
Acme Corp's revenue for 2024 was $500 million, a 10% increase from the previous year.
Net profit stood at $50 million. The company diversified its investments into AI and cloud services.
""")
with open(os.path.join(DOCS_DIR, "press_release_product.txt"), "w") as f:
f.write("""
Source: Acme Corp Press Release, 2025-06-01.
Acme Corp announces 'DataGenius', a new AI-powered data analytics platform.
DataGenius helps businesses uncover insights from large datasets with intuitive dashboards.
""")
print(f"Created dummy documents in '{DOCS_DIR}'")
loader = DirectoryLoader(DOCS_DIR, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)
# Assign unique source to each chunk based on original document's source
for chunk in chunks:
if "source" in chunk.metadata:
# Example: if doc.metadata['source'] is 'rag_citation_docs/report_finance.txt'
# we want to extract 'report_finance.txt'
chunk.metadata['filename'] = os.path.basename(chunk.metadata['source'])
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db.persist()
reloaded_vector_db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
print("ChromaDB set up with documents.")
# --- LangChain RAG with Source Citation ---
try:
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
print("\nLangChain LLM (ChatOpenAI) initialized.")
except Exception as e:
print(f"Failed to initialize ChatOpenAI: {e}. Ensure OPENAI_API_KEY is set.")
llm = None
if llm:
# Modified prompt to ask the LLM to include sources
prompt_with_citation = ChatPromptTemplate.from_template("""
Answer the user's question based on the provided context only.
If you cannot find the answer in the context, explicitly state that the information is not available.
For each piece of information in your answer, always cite the source filename from which it was derived.
Use the format: "[Source: <filename>]" immediately after the relevant statement.
Do not invent information or sources.
Context:
{context}
Question: {input}
""")
# We need a custom way to format documents for the prompt, including their metadata
# The default create_stuff_documents_chain just uses page_content
# Here's a custom function to format documents for context
def format_docs_with_sources(docs: List[Document]) -> str:
formatted_string = ""
for i, doc in enumerate(docs):
filename = doc.metadata.get('filename', 'Unknown Source')
formatted_string += f"--- Document {i+1} [Source: {filename}] ---\n"
formatted_string += doc.page_content + "\n\n"
return formatted_string
# To pass formatted docs to the prompt, we need to adapt `create_stuff_documents_chain`
# or create a custom chain that does this. For simplicity with existing LangChain,
# we'll slightly adjust the prompt, hoping the LLM will pick up the 'Source:' within content.
# A more robust solution involves a custom Runnable or a different chain structure.
# Let's adjust the chunk content itself to include the source
# This happens during the initial data ingestion, but for demo:
chunks_with_embedded_sources = []
for chunk in chunks:
filename = chunk.metadata.get('filename', 'Unknown Source')
chunk.page_content = f"[[Source: {filename}]]\n" + chunk.page_content
chunks_with_embedded_sources.append(chunk)
# Re-create ChromaDB with the modified chunks
if os.path.exists(PERSIST_DIR): shutil.rmtree(PERSIST_DIR, ignore_errors=True)
vector_db_citations = Chroma.from_documents(
documents=chunks_with_embedded_sources,
embedding=embedding_model,
persist_directory=PERSIST_DIR
)
vector_db_citations.persist()
reloaded_vector_db_citations = Chroma(persist_directory=PERSIST_DIR, embedding_function=embedding_model)
document_combiner_chain_citations = create_stuff_documents_chain(llm, prompt_with_citation)
retriever_citations = reloaded_vector_db_citations.as_retriever(search_kwargs={"k": 3})
retrieval_chain_citations = create_retrieval_chain(retriever_citations, document_combiner_chain_citations)
print("\n--- RAG Chain with Source Citation Ready! ---")
query1 = "What was Acme Corp's revenue in 2024 and what is DataGenius?"
response1 = retrieval_chain_citations.invoke({"input": query1})
print(f"\nUser Query: {query1}")
print(f"RAG Response with Citation: {response1['answer']}")
query2 = "Tell me about the net profit and any new product announcements."
response2 = retrieval_chain_citations.invoke({"input": query2})
print(f"\nUser Query: {query2}")
print(f"RAG Response with Citation: {response2['answer']}")
else:
print("Skipping RAG with Citation demo due to LLM initialization failure.")
# --- Cleanup ---
print(f"\nCleaning up '{DOCS_DIR}' and '{PERSIST_DIR}'...")
shutil.rmtree(DOCS_DIR, ignore_errors=True)
shutil.rmtree(PERSIST_DIR, ignore_errors=True)
print("Cleanup complete.")
Explanation:
- We’ve modified the document chunks themselves to explicitly include [[Source: <filename>]] at the beginning of their page_content. This makes the source information part of what the embedding model sees and what the LLM receives as context.
- The prompt_with_citation now explicitly instructs the LLM to include this source information in its answer. This is a form of “in-context learning” for source attribution.
Benefits of Source Attribution:
- Trust and Transparency: Users can verify the information and understand its origin.
- Reduced Liability: Helps mitigate risks associated with misinformation.
- Debugging: Easier to trace back incorrect answers to faulty source documents or retrieval issues.
Ethical Considerations Exercise: Discuss how you would address potential bias in a RAG system designed for:
- Hiring recommendations: If the knowledge base contains historical performance reviews that reflect gender or racial biases.
- Medical diagnosis support: If the knowledge base disproportionately covers certain demographics or omits information relevant to rare diseases.
6.2 Limitations of RAG
Despite its power, RAG is not a silver bullet and has its own set of limitations.
- Reliance on Retrieval Quality: “Garbage in, garbage out.” If the retrieval system fails to find relevant information (e.g., poor embeddings, inadequate chunking, missing documents), the LLM cannot produce a good answer.
- Context Window Limits (Still): While RAG provides external context, the amount of retrieved context that can fit into the LLM’s prompt is still finite. Very complex queries might require synthesizing information from many documents that collectively exceed this limit.
- Information Overload (to LLM): Too many or too long retrieved documents, even if relevant, can sometimes “distract” the LLM, making it harder to extract the precise answer.
- Synthesis and Reasoning Gaps: LLMs are good at summarizing and rephrasing, but deep, multi-hop reasoning across disparate retrieved documents can still be challenging. They might struggle to connect dots that aren’t explicitly stated.
- Maintenance Overhead: Keeping the knowledge base up-to-date and managing the entire RAG pipeline (document loading, chunking, embedding, vector database) requires ongoing effort.
- Cost: Running high-quality embedding models and LLMs (especially powerful ones like GPT-4) can be expensive at scale.
6.3 Future Directions in RAG
The field of RAG is rapidly evolving, with ongoing research pushing its boundaries.
- Advanced Multi-Modal RAG: Extending RAG beyond text to retrieve and generate from images, audio, video, and structured data. Imagine querying with an image and retrieving related descriptive text and other images.
- Recursive and Iterative RAG: Systems that perform multiple rounds of retrieval and generation, or refine their queries based on initial retrieval results. An agent might retrieve some context, formulate a sub-query, retrieve more, and then synthesize.
- Generative Retrieval: Instead of storing embeddings of original documents, an LLM could generate “hypothetical documents” or summaries that are then embedded and stored. This could help with highly abstract queries.
- Personalized RAG: Tailoring retrieved content and generated responses based on individual user profiles, preferences, or interaction history.
- Optimized Indexing and Compression: Smarter ways to store and compress information in the vector database to improve search speed and reduce memory footprint.
- Self-Correcting RAG: Systems that can detect when their generated answers are likely incorrect or unsubstantiated and attempt to self-correct by re-retrieving or querying the LLM differently.
- RAG for Code and Structured Data: Applying RAG principles to codebases, databases, and other structured information to answer questions, generate code, or analyze data.
Conclusion
Retrieval-Augmented Generation has emerged as a transformative technology, enabling LLMs to move beyond their static training data and interact with the dynamic, vast, and proprietary knowledge of the real world. By grounding LLM responses in verifiable external information, RAG significantly enhances their accuracy, reduces hallucinations, and broadens their applicability across countless domains.
This practical guide has walked you through the journey from understanding the core components of RAG—document loading, chunking, embeddings, and vector databases—to building robust RAG pipelines with frameworks like LangChain, integrating RAG into intelligent agentic systems, and considering the critical aspects of evaluation, optimization, and ethical deployment.
The field of RAG is still in its early stages, promising even more sophisticated and intelligent applications in the future. Armed with the knowledge and hands-on experience gained from this document, you are well-prepared to contribute to this exciting frontier, building LLM-powered applications that are not just intelligent, but also informed, reliable, and trustworthy. The journey of continuous learning and experimentation is key to mastering RAG and unlocking the full potential of augmented AI.
(End of Document)