# Phase 8: RAG

## 🎯 Overview

Combine your skills from previous phases to build production-grade RAG (Retrieval-Augmented Generation) systems!

**Prerequisites:**

- ✅ Tokenization (Phase 3)
- ✅ Embeddings (Phase 4)
- ✅ Neural Networks (Phase 5)
- ✅ Vector Databases (Phase 6)

**Time:** 3-4 weeks | 60-80 hours
**Outcome:** Build AI applications that can query your knowledge base
## 📚 What You'll Learn

### Core RAG Concepts

- RAG architecture and pipeline
- Document processing and chunking strategies
- Retrieval methods (dense, sparse, hybrid)
- Context management and prompt construction
- Re-ranking and result filtering
- LLM integration (OpenAI, Anthropic, local models)

### Advanced RAG Techniques

- Hybrid search (vector + keyword)
- Query transformation and expansion
- Multi-query retrieval
- Parent-document retrieval
- Self-query and metadata filtering
- Conversation memory and context
## 🏗️ Module Structure

```
08-rag/
├── 00_START_HERE.ipynb           # RAG overview and quick demo
├── 01_basic_rag.ipynb            # Simple RAG from scratch
├── 02_document_processing.ipynb  # Chunking strategies
├── 03_langchain_rag.ipynb        # Using LangChain framework
├── 04_llamaindex_rag.ipynb       # Using LlamaIndex framework
├── 05_advanced_retrieval.ipynb   # Hybrid search, re-ranking
├── 06_conversation_rag.ipynb     # Chat with memory
├── 07_evaluation.ipynb           # RAG evaluation metrics
├── 09_advanced_retrieval.ipynb   # Parent-child retrieval, ensemble
├── 10_graphrag_visual_rag.ipynb  # GraphRAG and multimodal RAG
├── assignment.md                 # Phase assignment
├── challenges.md                 # Hands-on challenges
└── README.md                     # This file
```
## 🚀 Quick Start

### 1. Basic RAG Pipeline

```python
# The fundamental RAG flow:
# 1. Index documents → embeddings → vector DB
# 2. User query → embedding → similarity search
# 3. Retrieved docs + query → LLM → answer

from sentence_transformers import SentenceTransformer
from your_vector_db import VectorDB  # Chroma, Qdrant, etc.
from openai import OpenAI

# 1. Index your documents
# Use any embedding model - see 05-embeddings/embedding_comparison.md for options
#   API: Gemini Embedding (cheapest + best), Voyage 3.5, or OpenAI
#   Local: Qwen3-Embedding, BGE-M3, or all-MiniLM-L6-v2
model = SentenceTransformer('all-MiniLM-L6-v2')  # local, fast
db = VectorDB()  # instantiate whichever vector DB you chose in Phase 6

docs = ["Your documents here..."]
embeddings = model.encode(docs)
db.add(documents=docs, embeddings=embeddings)

# 2. Retrieve relevant context
query = "What is RAG?"
query_embedding = model.encode(query)
results = db.search(query_embedding, top_k=3)

# 3. Generate answer with LLM (Claude, GPT, Gemini, or local)
client = OpenAI()
context = "\n".join(results)
prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content
```
## 📖 Learning Path

### Week 1: RAG Fundamentals

1. Complete 00_START_HERE.ipynb
2. Build basic RAG in 01_basic_rag.ipynb
3. Learn chunking strategies in 02_document_processing.ipynb

**Project:** Simple Q&A on your documents

### Week 2: RAG Frameworks

1. Learn LangChain in 03_langchain_rag.ipynb
2. Explore LlamaIndex in 04_llamaindex_rag.ipynb
3. Compare frameworks and choose your favorite

**Project:** Build a research paper assistant

### Week 3: Advanced Techniques

1. Implement hybrid search in 05_advanced_retrieval.ipynb
2. Add conversation memory in 06_conversation_rag.ipynb
3. Learn evaluation in 07_evaluation.ipynb

**Project:** Code search system for your repos

### Week 4: Production Project

1. Build an end-to-end RAG application
2. Add proper error handling
3. Implement caching and optimization
4. Deploy as an API (preview of the MLOps phase)

**Capstone:** Personal knowledge assistant
## 🛠️ Technologies You'll Use

**LLM Frameworks:**

- LangChain - most popular, extensive ecosystem
- LlamaIndex - best for document indexing
- Haystack - production-focused

**LLM Providers:**

- OpenAI (GPT-5.4, GPT-4.1, GPT-4.1-mini)
- Anthropic (Claude Sonnet 4.6, Haiku 4.5)
- Google (Gemini 3.1 Pro, Flash)
- Local models (Qwen 3, Llama 4, DeepSeek R1 via Ollama)

**Vector Databases:**

- Use what you learned in Phase 6!
- Chroma, Qdrant, Weaviate, Milvus

**Embeddings:**

- OpenAI embeddings (text-embedding-3-small/large)
- Sentence Transformers (all-MiniLM-L6-v2, all-mpnet-base-v2)
- Cohere embeddings
## 📝 Key Concepts Explained

### 1. RAG Pipeline

```
Documents → Split → Embed → Store in Vector DB
                                      ↓
User Query → Embed → Search → Retrieve Top-K
                                      ↓
Retrieved Docs + Query → LLM Prompt → Answer
```
### 2. Chunking Strategies

**Fixed-size chunks:**

```python
chunk_size = 512  # tokens or characters
overlap = 50      # overlap between chunks
```

**Semantic chunks:**

- Split by paragraphs or sentences
- Preserve document structure
- Maintain context boundaries

**Recursive splitting:**

- Try separators in order (\n\n, \n, ".", space)
- Preserve hierarchy
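As a concrete reference, the fixed-size strategy can be sketched in a few lines of plain Python. This is character-based for simplicity (a token-based version would count tokens instead), and `chunk_text` is an illustrative helper, not part of any framework:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with the given overlap."""
    step = chunk_size - overlap  # how far the window advances each time
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]
```

Each chunk repeats the last `overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.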
### 3. Retrieval Methods

**Dense (vector search):**

- Semantic similarity
- Works for paraphrased queries
- Requires embeddings

**Sparse (keyword search):**

- BM25, TF-IDF
- Exact keyword matching
- Fast and interpretable

**Hybrid:**

- Combine both approaches
- Re-rank with RRF (Reciprocal Rank Fusion)
- Best of both worlds
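RRF itself is tiny: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in, so documents ranked highly by *both* dense and sparse retrieval float to the top. A minimal sketch (doc IDs are hypothetical; `k=60` is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # vector-search ranking
sparse = ["d1", "d4", "d3"]  # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Here `d1` wins: it is near the top of both lists, while `d3` leads only the dense list.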
## 🎯 Projects

### Project 1: Personal Documentation Q&A

Build a chatbot that answers questions about your personal notes, docs, and PDFs.

**Features:**

- Upload PDF, TXT, and Markdown files
- Chunk and embed documents
- Conversational interface
- Source citation

### Project 2: Code Search Engine

Semantic search across your GitHub repositories.

**Features:**

- Index code files (Python, JavaScript, etc.)
- Search by intent ("how to connect to a database?")
- Show relevant code snippets
- Explain code functionality

### Project 3: Research Assistant

Query academic papers and scientific literature.

**Features:**

- Process research papers (PDFs)
- Extract citations and references
- Summarize papers
- Compare multiple papers

### Project 4: Customer Support Bot

A RAG-powered FAQ system.

**Features:**

- Index support documentation
- Handle common questions
- Escalate to a human when needed
- Track conversation context
## 📊 Evaluation Metrics

### Retrieval Quality

- **Precision@K:** fraction of the top K results that are relevant
- **Recall@K:** fraction of all relevant docs that appear in the top K
- **MRR (Mean Reciprocal Rank):** average of 1/rank of the first relevant result
- **NDCG:** Normalized Discounted Cumulative Gain
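The first three metrics are simple enough to implement yourself before reaching for an evaluation library. A minimal sketch (function names are our own; `retrieved` is a ranked list of doc IDs, `relevant` the ground-truth set for that query):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average 1/rank of the first relevant doc over a batch of queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)
```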
### Generation Quality

- **Faithfulness:** answer is grounded in the retrieved context
- **Relevance:** answer addresses the question
- **Correctness:** answer is factually accurate
- **Human evaluation:** user satisfaction

### System Metrics

- **Latency:** response time
- **Cost:** API cost per query
- **Cache hit rate:** caching efficiency
## 💡 Best Practices

### Document Processing

- ✅ Chunk size: 256-1024 tokens (experiment!)
- ✅ Overlap: 10-20% of chunk size
- ✅ Preserve metadata (source, date, author)
- ✅ Clean text (remove headers, footers)

### Retrieval

- ✅ Retrieve 3-10 documents (balance context vs. noise)
- ✅ Use hybrid search when possible
- ✅ Re-rank results for better quality
- ✅ Filter by metadata when relevant
### Prompting

- ✅ Provide clear instructions
- ✅ Include relevant context only
- ✅ Ask the LLM to cite sources
- ✅ Handle "I don't know" cases
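These prompting practices can all live in one small prompt builder. A sketch under our own conventions (the `[n]` citation format and wording are illustrative choices, not a standard):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: numbered sources, a citation request,
    and an explicit escape hatch for unanswerable questions."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the sources below, and cite them "
        "as [n]. If the sources do not contain the answer, say "
        '"I don\'t know."\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks lets you map each `[n]` in the answer back to a source document for display.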
### Production

- ✅ Cache embeddings and results
- ✅ Monitor LLM costs
- ✅ Implement rate limiting
- ✅ Add error handling and retries
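Caching and retries are easy to prototype with the standard library before adopting a production cache or a library like tenacity. A minimal sketch (class and function names are our own; real systems would add TTLs, persistence, and narrower exception handling):

```python
import hashlib
import time

class CachingEmbedder:
    """Wrap any embedding function with an in-memory cache keyed by text hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1          # repeated text: skip the API call
        else:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)
```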
## 📚 Resources

### Documentation

### Papers

### Courses

### Tools
## ✅ Completion Checklist

Before moving on to the MLOps phase, you should be able to:

- [ ] Explain RAG architecture and benefits
- [ ] Process and chunk documents effectively
- [ ] Build a basic RAG pipeline from scratch
- [ ] Use LangChain or LlamaIndex
- [ ] Implement hybrid search (dense + sparse)
- [ ] Add conversation memory to chatbots
- [ ] Evaluate RAG system quality
- [ ] Deploy a working RAG application
- [ ] Understand cost/latency tradeoffs
- [ ] Handle edge cases and errors
## 🚀 What's Next?

**MLOps & Production →**

- Deploy RAG as a scalable API
- Monitor performance and costs
- CI/CD for ML systems
- Cloud deployment (AWS, Azure, GCP)

**Specializations →**

- Multimodal RAG (images + text)
- Agent systems with RAG
- Advanced prompt engineering

Ready to build your first RAG system? → Start with 00_START_HERE.ipynb

Questions? → Check assignment.md and challenges.md for practice exercises.

🚀 Let's build intelligent systems that can learn from your data!