RAG: Retrieval-Augmented Generation - START HERE¶

What is RAG?¶

RAG = Retrieval-Augmented Generation

A technique that enhances LLMs by giving them access to external knowledge!

The Problem¶

LLMs have limitations:

  • Knowledge cutoff date

  • Can't access private/proprietary data

  • May hallucinate facts

  • Can't update without retraining

The Solution: RAG¶

  1. Store documents in a vector database

  2. When user asks a question, retrieve relevant docs

  3. Send docs + question to LLM

  4. LLM generates answer using retrieved context

import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import json
import os
from pathlib import Path

RAG Architecture¶

User Query
    ↓
[Embed Query] → Vector DB
    ↓             ↓
    ↓      [Retrieve Top-K]
    ↓             ↓
    └─→ [Combine with Context]
              ↓
          [Send to LLM]
              ↓
          Response

Key Components (wired together in the sketch after this list):

  1. Document Store (Vector DB)

  2. Embedding Model

  3. Retriever

  4. LLM (Generator)
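
The four components map almost one-to-one onto code. The sketch below is a hypothetical outline of the flow in the diagram above: embed, retrieve, build_prompt, and call_llm are placeholder names rather than any specific library's API, and the demo in the next section implements concrete versions of the retrieval and prompt-building steps.

def rag_answer(query: str, top_k: int = 3) -> str:
    """Hypothetical end-to-end RAG flow; each helper is a placeholder."""
    query_vector = embed(query)                     # 2. embedding model
    context_docs = retrieve(query_vector, top_k)    # 1 + 3. vector DB + retriever
    prompt = build_prompt(query, context_docs)      # combine query with retrieved context
    return call_llm(prompt)                         # 4. LLM generates the answer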

Quick Demo: Minimal RAG¶

Seeing the Core Idea in Action¶

Before diving into production-grade implementations, it helps to see RAG reduced to its essentials. The demo below creates a tiny knowledge base of five sentences, assigns each one a (simulated) embedding vector, and wires up a retrieval function that finds the most relevant documents for a query. In a real system, embeddings would come from a model like all-MiniLM-L6-v2 and the generator would be a large language model like GPT-4 or Claude – but the three-step pattern (embed, retrieve, generate) remains identical at any scale.

# Simple document store
documents = [
    "Python is a high-level programming language created by Guido van Rossum.",
    "Machine learning is a subset of AI that learns from data.",
    "RAG combines retrieval and generation for better LLM responses.",
    "Vector databases store embeddings for similarity search.",
    "LangChain is a framework for building LLM applications."
]

# Simulate embeddings (normally use OpenAI/sentence-transformers)
import numpy as np
np.random.seed(42)
doc_embeddings = np.random.randn(len(documents), 384)

print(f"Loaded {len(documents)} documents")
print(f"Embedding dimension: {doc_embeddings.shape[1]}")

Retrieval Function¶

Finding the Most Relevant Documents¶

The retriever is the heart of any RAG system. Given a query embedding, it scores every document in the store by cosine similarity – the cosine of the angle between the query vector and each document vector – and returns the top-\(k\) highest-scoring documents. Cosine similarity ranges from \(-1\) to \(1\), where values near \(1\) indicate strong semantic overlap. In production, approximate nearest-neighbor indexes (HNSW, IVF) make this step sub-millisecond even over millions of documents.

def cosine_similarity(a, b):
    """Compute cosine similarity between vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_embedding, top_k=2):
    """Retrieve top-k most similar documents."""
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        sim = cosine_similarity(query_embedding, doc_emb)
        similarities.append((i, sim))
    
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Return top-k
    results = []
    for i, sim in similarities[:top_k]:
        results.append({
            'document': documents[i],
            'similarity': sim
        })
    return results

# Test retrieval with a random query vector (a real system would embed the
# query using the same model as the documents)
query_emb = np.random.randn(384)
results = retrieve(query_emb, top_k=2)

print("Retrieved Documents:")
for i, result in enumerate(results, 1):
    print(f"\n{i}. Similarity: {result['similarity']:.4f}")
    print(f"   {result['document']}")

Generation with Context¶

Constructing the LLM Prompt¶

The final RAG step stitches the retrieved documents into a structured prompt and sends it to a large language model. The prompt template explicitly instructs the model to answer based on the provided context, which grounds the response in factual source material and reduces hallucination. The LLM's job is now much simpler: synthesize and summarize the retrieved passages rather than recall facts from its training data. Source attribution comes naturally because each context passage can be labeled with its origin.

def create_prompt(query: str, context_docs: List[str]) -> str:
    """Create prompt with retrieved context."""
    context = "\n\n".join([f"Document {i+1}: {doc}" 
                           for i, doc in enumerate(context_docs)])
    
    prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {query}

Answer:"""
    return prompt

# Example
query = "What is RAG?"
retrieved_docs = [r['document'] for r in results]
prompt = create_prompt(query, retrieved_docs)

print(prompt)
print("\n" + "="*70)
print("This prompt would be sent to an LLM (GPT-4, Claude, etc.)")

Why RAG Works¶

Benefits¶

✅ Up-to-date information - Add new docs anytime
✅ Domain-specific knowledge - Use your own data
✅ Reduced hallucinations - LLM has facts to reference
✅ Source attribution - Know where answers come from
✅ Cost effective - No need to fine-tune LLM

Use Cases¶

  • Customer support chatbots

  • Document Q&A systems

  • Research assistants

  • Code documentation search

  • Legal/medical information retrieval

Next Steps¶

This series covers:

  1. Basic RAG - Build from scratch

  2. Document Processing - Chunking strategies

  3. LangChain RAG - Production framework

  4. LlamaIndex RAG - Alternative framework

  5. Advanced Retrieval - Hybrid search, re-ranking

  6. Conversation RAG - Memory and context

  7. Evaluation - Measuring RAG performance

Let's dive in! 🚀