RAG: Retrieval-Augmented Generation - START HERE
What is RAG?
RAG = Retrieval-Augmented Generation
A technique that enhances LLMs by giving them access to external knowledge!
The Problem
LLMs have limitations:
Knowledge cutoff date
Can't access private/proprietary data
May hallucinate facts
Can't update without retraining
The Solution: RAG
Store documents in a vector database
When user asks a question, retrieve relevant docs
Send docs + question to LLM
LLM generates answer using retrieved context
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import json
import os
from pathlib import Path
RAG Architecture
User Query
    ↓
[Embed Query] ──→ Vector DB
    ↓                  ↓
    ↓          [Retrieve Top-K]
    ↓                  ↓
    └──── [Combine with Context]
               ↓
        [Send to LLM]
               ↓
           Response
Key Components:
Document Store (Vector DB)
Embedding Model
Retriever
LLM (Generator)
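To make the four roles concrete, here is a minimal sketch of how they could be wired together. The names (`RAGPipeline`, `embed`, `search`, `generate`) are illustrative, not from any particular library, and the stub components only stand in for a real embedding model, vector database, and LLM:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    embed: Callable[[str], List[float]]              # Embedding Model
    search: Callable[[List[float], int], List[str]]  # Document Store + Retriever
    generate: Callable[[str], str]                   # LLM (Generator)

    def answer(self, query: str, top_k: int = 2) -> str:
        query_vec = self.embed(query)          # 1. embed the query
        docs = self.search(query_vec, top_k)   # 2. retrieve top-k documents
        prompt = f"Context: {' '.join(docs)}\nQuestion: {query}"
        return self.generate(prompt)           # 3. generate a grounded answer

# Stub components, just to show the data flow end to end
pipeline = RAGPipeline(
    embed=lambda text: [float(len(text))],    # toy 1-d "embedding"
    search=lambda vec, k: ["RAG combines retrieval and generation."][:k],
    generate=lambda prompt: f"(LLM answer grounded in: {prompt!r})",
)
print(pipeline.answer("What is RAG?"))
```

Each component is a plain callable here; in the rest of this series, real implementations (sentence-transformer embeddings, a vector DB, an LLM API) slot into the same three-step shape.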
Quick Demo: Minimal RAG
Seeing the Core Idea in Action
Before diving into production-grade implementations, it helps to see RAG reduced to its essentials. The demo below creates a tiny knowledge base of five sentences, assigns each one a (simulated) embedding vector, and wires up a retrieval function that finds the most relevant documents for a query. In a real system, embeddings would come from a model like all-MiniLM-L6-v2 and the generator would be a large language model like GPT-4 or Claude, but the three-step pattern (embed, retrieve, generate) remains identical at any scale.
# Simple document store
documents = [
    "Python is a high-level programming language created by Guido van Rossum.",
    "Machine learning is a subset of AI that learns from data.",
    "RAG combines retrieval and generation for better LLM responses.",
    "Vector databases store embeddings for similarity search.",
    "LangChain is a framework for building LLM applications."
]

# Simulate embeddings (normally use OpenAI/sentence-transformers)
import numpy as np
np.random.seed(42)
doc_embeddings = np.random.randn(len(documents), 384)

print(f"Loaded {len(documents)} documents")
print(f"Embedding dimension: {doc_embeddings.shape[1]}")
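One practical detail worth noting (an addition to the demo above, not part of it): real vector stores typically L2-normalize embeddings once at index time, so that cosine similarity reduces to a plain dot product at query time. A small sketch:

```python
import numpy as np

np.random.seed(42)
doc_embeddings = np.random.randn(5, 384)

# L2-normalize each row so every embedding has unit length
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
normalized = doc_embeddings / norms

# For unit vectors, the dot product *is* the cosine similarity
a, b = normalized[0], normalized[1]
dot = np.dot(a, b)
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert abs(dot - cos) < 1e-9
print("dot product equals cosine similarity for unit vectors")
```

This is why many vector databases expose "inner product" as a distance metric: with normalized embeddings it is equivalent to cosine similarity but cheaper to compute.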
Retrieval Function
Finding the Most Relevant Documents
The retriever is the heart of any RAG system. Given a query embedding, it scores every document in the store by cosine similarity (the cosine of the angle between the query vector and each document vector) and returns the top-\(k\) highest-scoring documents. Cosine similarity ranges from \(-1\) to \(1\), where values near \(1\) indicate strong semantic overlap. In production, approximate nearest-neighbor indexes (HNSW, IVF) make this step sub-millisecond even over millions of documents.
def cosine_similarity(a, b):
    """Compute cosine similarity between vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_embedding, top_k=2):
    """Retrieve top-k most similar documents."""
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        sim = cosine_similarity(query_embedding, doc_emb)
        similarities.append((i, sim))
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    # Return top-k
    results = []
    for i, sim in similarities[:top_k]:
        results.append({
            'document': documents[i],
            'similarity': sim
        })
    return results

# Test retrieval with a random query embedding
query_emb = np.random.randn(384)
results = retrieve(query_emb, top_k=2)

print("Retrieved Documents:")
for i, result in enumerate(results, 1):
    print(f"\n{i}. Similarity: {result['similarity']:.4f}")
    print(f"   {result['document']}")
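The Python loop above is perfectly fine for five documents, but the same exact search can be done in one vectorized pass, which is roughly how in-memory stores implement it before reaching for ANN indexes. A self-contained sketch (with its own toy documents and embeddings, mirroring the demo's setup):

```python
import numpy as np

np.random.seed(42)
documents = ["doc A", "doc B", "doc C", "doc D", "doc E"]
doc_embeddings = np.random.randn(len(documents), 384)

def retrieve_vectorized(query_embedding, top_k=2):
    """Score every document at once instead of looping."""
    # One matrix-vector product gives all dot products
    sims = doc_embeddings @ query_embedding
    # Divide by the norms to turn dot products into cosine similarities
    sims = sims / (np.linalg.norm(doc_embeddings, axis=1)
                   * np.linalg.norm(query_embedding))
    # Indices of the top-k scores, highest first
    top_idx = np.argsort(sims)[::-1][:top_k]
    return [{'document': documents[i], 'similarity': float(sims[i])}
            for i in top_idx]

query_emb = np.random.randn(384)
for r in retrieve_vectorized(query_emb):
    print(f"{r['similarity']:.4f}  {r['document']}")
```

The behavior matches the loop version; only the implementation changes, trading readability for speed as the document count grows.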
Generation with Context
Constructing the LLM Prompt
The final RAG step stitches the retrieved documents into a structured prompt and sends it to a large language model. The prompt template explicitly instructs the model to answer based on the provided context, which grounds the response in factual source material and reduces hallucination. The LLM's job is now much simpler: synthesize and summarize the retrieved passages rather than recall facts from its training data. Source attribution comes naturally because each context passage can be labeled with its origin.
def create_prompt(query: str, context_docs: List[str]) -> str:
    """Create prompt with retrieved context."""
    context = "\n\n".join([f"Document {i+1}: {doc}"
                           for i, doc in enumerate(context_docs)])
    prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {query}

Answer:"""
    return prompt

# Example
query = "What is RAG?"
retrieved_docs = [r['document'] for r in results]
prompt = create_prompt(query, retrieved_docs)

print(prompt)
print("\n" + "="*70)
print("This prompt would be sent to an LLM (GPT-4, Claude, etc.)")
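Putting the three steps together, the full loop could look like the sketch below. The `call_llm` and `embed` functions are placeholders for real API calls (an embedding model and an LLM provider); everything else is the same embed-retrieve-prompt flow as above, condensed into one `rag_answer` function:

```python
import numpy as np
from typing import List

np.random.seed(42)
documents = [
    "RAG combines retrieval and generation for better LLM responses.",
    "Vector databases store embeddings for similarity search.",
]
doc_embeddings = np.random.randn(len(documents), 384)

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here
    return np.random.randn(384)

def retrieve(query_vec: np.ndarray, top_k: int = 2) -> List[str]:
    sims = doc_embeddings @ query_vec
    return [documents[i] for i in np.argsort(sims)[::-1][:top_k]]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call
    return f"[LLM would answer here, given {prompt.count('Document')} context docs]"

def rag_answer(query: str) -> str:
    docs = retrieve(embed(query))                      # 1-2. embed + retrieve
    context = "\n\n".join(f"Document {i+1}: {d}" for i, d in enumerate(docs))
    prompt = (f"Answer the question based on the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")
    return call_llm(prompt)                            # 3. generate

print(rag_answer("What is RAG?"))
```

Swapping the two placeholders for real model calls is all that separates this sketch from a working (if minimal) RAG system.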
Why RAG Works
Benefits
✅ Up-to-date information - Add new docs anytime
✅ Domain-specific knowledge - Use your own data
✅ Reduced hallucinations - LLM has facts to reference
✅ Source attribution - Know where answers come from
✅ Cost effective - No need to fine-tune the LLM
Use Cases
Customer support chatbots
Document Q&A systems
Research assistants
Code documentation search
Legal/medical information retrieval
Next Steps
This series covers:
Basic RAG - Build from scratch
Document Processing - Chunking strategies
LangChain RAG - Production framework
LlamaIndex RAG - Alternative framework
Advanced Retrieval - Hybrid search, re-ranking
Conversation RAG - Memory and context
Evaluation - Measuring RAG performance
Let's dive in!