Challenges: RAG Systems
Hands-on challenges to master Retrieval-Augmented Generation
Challenge 1: The Chunking Optimization Game
Difficulty: ★★ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries
The Problem
Chunking is critical for RAG: bad chunks lead to bad retrieval, which leads to bad answers. Find the optimal chunking strategy!
Your Task
Take a long technical document (e.g., Python documentation, research paper)
Create 10 test questions that require specific passages
Try 5 different chunking strategies:
Fixed size (256, 512, 1024 tokens)
Sentence-based
Paragraph-based
Semantic (embeddings-based)
Hierarchical (sections → paragraphs → sentences)
Measure which strategy retrieves the right passages most often
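As a starting point, the fixed-size strategy can be sketched with a simple word-based chunker (a real system would count tokens with a tokenizer; the `size` and `overlap` parameters here are illustrative):

```python
def chunk_fixed(text, size=512, overlap=64):
    """Split text into fixed-size word chunks with overlap between neighbors."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break  # last chunk already covers the end of the text
    return chunks
```

The overlap prevents a relevant passage from being split exactly at a chunk boundary, at the cost of some index redundancy.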
Evaluation Metrics
# For each question, check whether the correct passage appears in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions
# Mean reciprocal rank: 1/rank of the correct chunk, averaged over questions
mrr = mean([1 / rank for rank in chunk_positions])
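The two metrics above can be computed together in one runnable helper (the input format, ranked chunk-id lists plus one gold id per question, is an assumption for illustration):

```python
def evaluate_retrieval(results, gold, k=3):
    """Compute hit rate@k and MRR for a retrieval run.

    results: list of ranked chunk-id lists, one per question
    gold: list with the correct chunk id for each question
    """
    hits, reciprocal_ranks = 0, []
    for ranked, correct in zip(results, gold):
        if correct in ranked[:k]:
            hits += 1
        if correct in ranked:
            # rank is 1-based, so reciprocal rank of position 0 is 1.0
            reciprocal_ranks.append(1 / (ranked.index(correct) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(gold)
    mrr = sum(reciprocal_ranks) / len(gold)
    return hit_rate, mrr
```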
Success Criteria
Test all 5 chunking methods
Create visualization comparing methods
Identify when each method works best
Provide recommendations
Hint
Different content types need different strategies:
Code documentation: semantic chunking works well
Narrative text: paragraph-based is often good
Q&A: sentence-based can work
Challenge 2: Query Expansion Techniques
Difficulty: ★★★ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE
The Problem
User queries are often vague or poorly worded. Expand them to improve retrieval!
Your Task
Implement 3 query expansion techniques:
Technique 1: Multi-Query Generation
# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
Technique 2: Hypothetical Document Embeddings (HyDE)
# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)
Technique 3: Query Decomposition
# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"
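Whichever expansion technique you use, the per-subquery result lists must be merged back into one ranking. A common choice (an assumption here, not prescribed by the challenge) is reciprocal rank fusion:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document ids with RRF scoring.

    A document's score is the sum of 1/(k + rank) over every list it
    appears in, so documents ranked highly by multiple subqueries win.
    """
    scores = {}
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` dampens the influence of any single top-ranked hit; 60 is a conventional default.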
Comparison Task
Test on 20 diverse questions
Compare retrieval accuracy for each method
Analyze latency and cost tradeoffs
Identify best use cases
Hint
Multi-query generation can be parallelized for speed. HyDE works well when you can anticipate the answer format. Query decomposition is powerful for complex, multi-part questions.
Challenge 3: The Hallucination Hunter
Difficulty: ★★★★ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection
The Problem
LLMs sometimes "hallucinate": they generate plausible-sounding but incorrect information. Catch them!
Your Task
Build a hallucination detection system:
Faithfulness Scoring
Check if answer is supported by retrieved context
Use entailment model or LLM-as-judge
Score 0-1 for how well grounded the answer is
Citation Verification
Extract claims from answer
Verify each claim against source documents
Flag unsupported claims
Confidence Calibration
Estimate answer confidence
Compare with actual correctness
Calibrate model to be more honest
Implementation
class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well the answer is supported by the context."""
        # TODO: Implement
        pass

    def verify_citations(self, answer, sources):
        """Verify each claim in the answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
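As a naive baseline for the claim-verification step (pure lexical overlap; a real system would use an entailment model as the hint suggests), you might start with something like this hypothetical helper:

```python
def lexical_support(claim, sources, threshold=0.6):
    """Naive check: fraction of claim words found in the best-matching source.

    Returns (supported, score). This catches only verbatim grounding and
    misses paraphrases, so treat it as a baseline, not a detector.
    """
    claim_words = set(claim.lower().split())
    best = 0.0
    for source in sources:
        source_words = set(source.lower().split())
        overlap = len(claim_words & source_words) / max(len(claim_words), 1)
        best = max(best, overlap)
    return best >= threshold, best
```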
Test Dataset
Create 30 questions with known hallucination triggers:
Questions outside knowledge base
Ambiguous questions
Questions with conflicting information
Questions requiring calculation/reasoning
Hint
Use an entailment model such as "microsoft/deberta-v3-large". Compare multiple answer generations: consistent answers are more likely to be correct. Prompt engineering helps too: "Only answer if you're certain. Otherwise say 'I don't know.'"
Challenge 4: Conversational RAG
Difficulty: ★★★★ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory
The Problem
Most RAG systems handle single questions. Build one that handles multi-turn conversations!
Your Task
Handle conversation like this:
User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]
User: "What about performance?" # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]
User: "Compare it to Java" # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]
Requirements
Track conversation history
Rewrite queries with context (coreference resolution)
Maintain entity tracking
Handle follow-up questions
Know when to retrieve vs use previous context
Manage token budget (conversation history grows!)
Conversation Management
class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}

    def rewrite_query_with_context(self, current_query, history):
        """Rewrite the query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass

    def should_retrieve(self, query, history):
        """Decide whether we need a new retrieval or can answer from context."""
        # Avoid unnecessary retrievals for clarification questions
        pass

    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass
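For the token-budget requirement, a minimal sliding-window truncation might look like this sketch (word counts stand in for real tokenizer counts, and the `{"text": ...}` turn format is an assumption):

```python
def truncate_history(history, max_words=200):
    """Keep the most recent conversation turns that fit in the word budget.

    Walks the history backwards so the newest turns are kept first,
    then restores chronological order before returning.
    """
    kept, total = [], 0
    for turn in reversed(history):
        n = len(turn["text"].split())
        if total + n > max_words:
            break
        kept.append(turn)
        total += n
    return list(reversed(kept))
```

A refinement worth trying: always keep the first turn as well, since it often names the topic that later pronouns refer to.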
Hint
Use an LLM to rewrite queries: "Given the conversation history, rewrite this query to be standalone." Keep a sliding window of the last N turns to manage tokens. Detect whether a query is a clarification or a new topic.
Challenge 5: Multi-Modal RAG
Difficulty: ★★★★★ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval
The Problem
Real documents contain images, tables, and charts, not just text. Build a RAG system that handles it all!
Your Task
Build a system that processes:
Text: Standard RAG
Images: Visual search with CLIP
Tables: Structured data retrieval
Diagrams: Caption extraction + visual search
Code: Syntax-aware chunking
Example Use Case: Technical Documentation
User: "Show me the architecture diagram and explain the components"
System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer
Implementation Components
Multi-Modal Embeddings:
Text: sentence-transformers
Images: CLIP
Tables: Table-specific embedders
Hybrid Retrieval:
Combine results from different modalities
Weight by relevance and modality type
Multi-Modal Generation:
GPT-4 Vision for image understanding
Generate answers referencing both text and images
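The hybrid-retrieval step above can be sketched as a weighted score merge across per-modality result lists (the modality weights and `(doc_id, score)` format are illustrative assumptions):

```python
def merge_modalities(results_by_modality, weights=None):
    """Merge (doc_id, score) hit lists from separate per-modality indexes.

    results_by_modality: dict like {"text": [("t1", 0.9)], "image": [...]}
    weights: optional per-modality multipliers; unlisted modalities get 1.0
    """
    weights = weights or {}
    merged = {}
    for modality, hits in results_by_modality.items():
        w = weights.get(modality, 1.0)
        for doc_id, score in hits:
            merged[doc_id] = merged.get(doc_id, 0.0) + w * score
    # Highest combined score first
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Note that raw similarity scores from different embedding models are not directly comparable, so in practice you would normalize scores per modality (or use rank-based fusion) before weighting.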
Success Criteria
Process PDFs with images/tables
Retrieve relevant visuals for queries
Generate answers combining modalities
Handle queries like "show me", "diagram of", "table showing"
Hint
Use GPT-4 Vision or LLaVA for image understanding and CLIP for image-text similarity. Keep separate vector stores per modality, then merge the results.
Meta Challenge: RAG Optimization Competition
Difficulty: ★★★★★ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation
The Ultimate Challenge
Build the best RAG system for a specific domain and prove it!
Competition Format
Choose Domain: Medical, legal, technical docs, customer support, etc.
Build System: Full RAG pipeline
Create Benchmark: 100+ test questions with ground truth
Optimize Everything:
Chunking strategy
Embedding model
Retrieval method
Re-ranking
Generation prompts
Cost/latency tradeoffs
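A systematic sweep over these knobs can be organized as a simple configuration grid; the option values below are placeholders for whatever your pipeline actually exposes:

```python
from itertools import product

def config_grid(options):
    """Yield every combination of pipeline options as a config dict."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))

# Hypothetical search space; substitute your real chunkers, embedders, etc.
search_space = {
    "chunk_size": [256, 512, 1024],
    "embedder": ["model_a", "model_b"],
    "rerank": [True, False],
}
configs = list(config_grid(search_space))  # 3 * 2 * 2 = 12 combinations
```

Evaluating every config against your benchmark and logging accuracy, latency, and cost per run gives you the data the leaderboard metrics below are asking for.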
Leaderboard Metrics
Accuracy: % of correct answers
Faithfulness: % of answers supported by context
Latency: Average response time
Cost: $ per 1000 queries
User Satisfaction: Human evaluation (1-5)
Deliverables
Complete RAG system (code)
Benchmark dataset (questions + answers)
Evaluation results (metrics + analysis)
Technical report (methodology + findings)
Demo (Gradio/Streamlit app)
Bonus Points
Open-source your solution
Deploy publicly
Write blog post about optimizations
Beat baseline by >20% accuracy
Challenge Progress Tracker
Challenge 1: Chunking Optimization
Challenge 2: Query Expansion
Challenge 3: Hallucination Hunter
Challenge 4: Conversational RAG
Challenge 5: Multi-Modal RAG
Meta Challenge: RAG Optimization Competition
Tips for Success
Start Simple: Get basic version working first
Measure Everything: Metrics guide optimization
Error Analysis: Study failures to improve
Read Papers: Many techniques have research backing
Use Tools: LangChain, LlamaIndex can speed things up
Iterate: First version won't be perfect
Helpful Resources
Happy building!
Remember: RAG is about the journey of optimization, not just the destination!