Challenges: RAG Systems
Hands-on challenges to master Retrieval-Augmented Generation
Challenge 1: The Chunking Optimization Game
Difficulty: ★★ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries
The Problem
Chunking is critical for RAG: bad chunks lead to bad retrieval, which leads to bad answers. Find the optimal chunking strategy!
Your Task
Take a long technical document (e.g., Python documentation, research paper)
Create 10 test questions that require specific passages
Try 5 different chunking strategies:
Fixed size (256, 512, 1024 tokens)
Sentence-based
Paragraph-based
Semantic (embeddings-based)
Hierarchical (sections → paragraphs → sentences)
Measure which strategy retrieves the right passages most often
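As a starting point, the fixed-size strategy can be sketched with a simple word-based chunker (a real system would count tokens with a tokenizer; the `size` and `overlap` parameters here are illustrative):

```python
def chunk_fixed(text, size=512, overlap=64):
    """Split text into fixed-size word chunks with overlap between neighbors."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break  # last chunk already covers the end of the text
    return chunks
```

The overlap prevents a relevant passage from being split exactly at a chunk boundary, at the cost of some index redundancy.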
Evaluation Metrics
# For each question, check whether the correct passage appears in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions
# Mean reciprocal rank: 1/rank of the correct chunk, averaged over questions
mrr = mean([1 / rank for rank in chunk_positions])
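The two metrics above can be computed together in one runnable helper (the input format, ranked chunk-id lists plus one gold id per question, is an assumption for illustration):

```python
def evaluate_retrieval(results, gold, k=3):
    """Compute hit rate@k and MRR for a retrieval run.

    results: list of ranked chunk-id lists, one per question
    gold: list with the correct chunk id for each question
    """
    hits, reciprocal_ranks = 0, []
    for ranked, correct in zip(results, gold):
        if correct in ranked[:k]:
            hits += 1
        if correct in ranked:
            # rank is 1-based, so reciprocal rank of position 0 is 1.0
            reciprocal_ranks.append(1 / (ranked.index(correct) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(gold)
    mrr = sum(reciprocal_ranks) / len(gold)
    return hit_rate, mrr
```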
Success Criteria
Test all 5 chunking methods
Create visualization comparing methods
Identify when each method works best
Provide recommendations
Hint
Different content types need different strategies:
Code documentation: semantic chunking works well
Narrative text: paragraph-based is often good
Q&A: sentence-based can work
Challenge 2: Query Expansion Techniques
Difficulty: ★★★ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE
The Problem
User queries are often vague or poorly worded. Expand them to improve retrieval!
Your Task
Implement 3 query expansion techniques:
Technique 1: Multi-Query Generation
# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
Technique 2: Hypothetical Document Embeddings (HyDE)
# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)
Technique 3: Query Decomposition
# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"
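Whichever expansion technique you use, the per-subquery result lists must be merged back into one ranking. A common choice (an assumption here, not prescribed by the challenge) is reciprocal rank fusion:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document ids with RRF scoring.

    A document's score is the sum of 1/(k + rank) over every list it
    appears in, so documents ranked highly by multiple subqueries win.
    """
    scores = {}
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` dampens the influence of any single top-ranked hit; 60 is a conventional default.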
Comparison Task
Test on 20 diverse questions
Compare retrieval accuracy for each method
Analyze latency and cost tradeoffs
Identify best use cases
Hint
Multi-query generation can be parallelized for speed. HyDE works well when you can anticipate the answer format. Query decomposition is powerful for complex, multi-part questions.
Challenge 3: The Hallucination Hunter
Difficulty: ★★★★ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection
The Problem
LLMs sometimes "hallucinate": they generate plausible-sounding but incorrect information. Catch them!
Your Task
Build a hallucination detection system:
Faithfulness Scoring
Check if answer is supported by retrieved context
Use entailment model or LLM-as-judge
Score 0-1 for how well grounded the answer is
Citation Verification
Extract claims from answer
Verify each claim against source documents
Flag unsupported claims
Confidence Calibration
Estimate answer confidence
Compare with actual correctness
Calibrate model to be more honest
Implementation
class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well the answer is supported by the context."""
        # TODO: Implement
        pass

    def verify_citations(self, answer, sources):
        """Verify each claim in the answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
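As a naive baseline for the claim-verification step (pure lexical overlap; a real system would use an entailment model as the hint suggests), you might start with something like this hypothetical helper:

```python
def lexical_support(claim, sources, threshold=0.6):
    """Naive check: fraction of claim words found in the best-matching source.

    Returns (supported, score). This catches only verbatim grounding and
    misses paraphrases, so treat it as a baseline, not a detector.
    """
    claim_words = set(claim.lower().split())
    best = 0.0
    for source in sources:
        source_words = set(source.lower().split())
        overlap = len(claim_words & source_words) / max(len(claim_words), 1)
        best = max(best, overlap)
    return best >= threshold, best
```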
Test Dataset
Create 30 questions with known hallucination triggers:
Questions outside knowledge base
Ambiguous questions
Questions with conflicting information
Questions requiring calculation/reasoning
Hint
Use an entailment model such as "microsoft/deberta-v3-large". Compare multiple answer generations: consistent answers are more likely to be correct. Prompt engineering helps too: "Only answer if you're certain. Otherwise say 'I don't know.'"
Challenge 4: Conversational RAG
Difficulty: ★★★★ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory
The Problem
Most RAG systems handle single questions. Build one that handles multi-turn conversations!
Your Task
Handle conversation like this:
User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]
User: "What about performance?" # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]
User: "Compare it to Java" # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]
Requirements
Track conversation history
Rewrite queries with context (coreference resolution)
Maintain entity tracking
Handle follow-up questions
Know when to retrieve vs use previous context
Manage token budget (conversation history grows!)
Conversation Management
class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}

    def rewrite_query_with_context(self, current_query, history):
        """Rewrite the query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass

    def should_retrieve(self, query, history):
        """Decide whether we need a new retrieval or can answer from context."""
        # Avoid unnecessary retrievals for clarification questions
        pass

    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass
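For the token-budget requirement, a minimal sliding-window truncation might look like this sketch (word counts stand in for real tokenizer counts, and the `{"text": ...}` turn format is an assumption):

```python
def truncate_history(history, max_words=200):
    """Keep the most recent conversation turns that fit in the word budget.

    Walks the history backwards so the newest turns are kept first,
    then restores chronological order before returning.
    """
    kept, total = [], 0
    for turn in reversed(history):
        n = len(turn["text"].split())
        if total + n > max_words:
            break
        kept.append(turn)
        total += n
    return list(reversed(kept))
```

A refinement worth trying: always keep the first turn as well, since it often names the topic that later pronouns refer to.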
Hint
Use an LLM to rewrite queries: "Given the conversation history, rewrite this query to be standalone." Keep a sliding window of the last N turns to manage tokens. Detect whether a query is a clarification or a new topic.
Challenge 5: Multi-Modal RAG
Difficulty: ★★★★★ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval
The Problem
Real documents contain images, tables, and charts, not just text. Build a RAG system that handles it all!
Your Task
Build a system that processes:
Text: Standard RAG
Images: Visual search with CLIP
Tables: Structured data retrieval
Diagrams: Caption extraction + visual search
Code: Syntax-aware chunking
Example Use Case: Technical Documentation
User: "Show me the architecture diagram and explain the components"
System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer
Implementation Components
Multi-Modal Embeddings:
Text: sentence-transformers
Images: CLIP
Tables: Table-specific embedders
Hybrid Retrieval:
Combine results from different modalities
Weight by relevance and modality type
Multi-Modal Generation:
GPT-4 Vision for image understanding
Generate answers referencing both text and images
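The hybrid-retrieval step above can be sketched as a weighted score merge across per-modality result lists (the modality weights and `(doc_id, score)` format are illustrative assumptions):

```python
def merge_modalities(results_by_modality, weights=None):
    """Merge (doc_id, score) hit lists from separate per-modality indexes.

    results_by_modality: dict like {"text": [("t1", 0.9)], "image": [...]}
    weights: optional per-modality multipliers; unlisted modalities get 1.0
    """
    weights = weights or {}
    merged = {}
    for modality, hits in results_by_modality.items():
        w = weights.get(modality, 1.0)
        for doc_id, score in hits:
            merged[doc_id] = merged.get(doc_id, 0.0) + w * score
    # Highest combined score first
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Note that raw similarity scores from different embedding models are not directly comparable, so in practice you would normalize scores per modality (or use rank-based fusion) before weighting.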
Success Criteria
Process PDFs with images/tables
Retrieve relevant visuals for queries
Generate answers combining modalities
Handle queries like "show me", "diagram of", "table showing"
Hint
Use GPT-4 Vision or LLaVA for image understanding and CLIP for image-text similarity. Keep separate vector stores per modality, then merge the results.
Meta Challenge: RAG Optimization Competition
Difficulty: ★★★★★ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation
The Ultimate Challenge
Build the best RAG system for a specific domain and prove it!
Competition Format
Choose Domain: Medical, legal, technical docs, customer support, etc.
Build System: Full RAG pipeline
Create Benchmark: 100+ test questions with ground truth
Optimize Everything:
Chunking strategy
Embedding model
Retrieval method
Re-ranking
Generation prompts
Cost/latency tradeoffs
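A systematic sweep over these knobs can be organized as a simple configuration grid; the option values below are placeholders for whatever your pipeline actually exposes:

```python
from itertools import product

def config_grid(options):
    """Yield every combination of pipeline options as a config dict."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))

# Hypothetical search space; substitute your real chunkers, embedders, etc.
search_space = {
    "chunk_size": [256, 512, 1024],
    "embedder": ["model_a", "model_b"],
    "rerank": [True, False],
}
configs = list(config_grid(search_space))  # 3 * 2 * 2 = 12 combinations
```

Evaluating every config against your benchmark and logging accuracy, latency, and cost per run gives you the data the leaderboard metrics below are asking for.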
Leaderboard Metrics
Accuracy: % of correct answers
Faithfulness: % of answers supported by context
Latency: Average response time
Cost: $ per 1000 queries
User Satisfaction: Human evaluation (1-5)
Deliverables
Complete RAG system (code)
Benchmark dataset (questions + answers)
Evaluation results (metrics + analysis)
Technical report (methodology + findings)
Demo (Gradio/Streamlit app)
Bonus Points
Open-source your solution
Deploy publicly
Write blog post about optimizations
Beat baseline by >20% accuracy
Challenge Progress Tracker
Challenge 1: Chunking Optimization
Challenge 2: Query Expansion
Challenge 3: Hallucination Hunter
Challenge 4: Conversational RAG
Challenge 5: Multi-Modal RAG
Meta Challenge: RAG Optimization Competition
Tips for Success
Start Simple: Get basic version working first
Measure Everything: Metrics guide optimization
Error Analysis: Study failures to improve
Read Papers: Many techniques have research backing
Use Tools: LangChain, LlamaIndex can speed things up
Iterate: First version won't be perfect
Helpful Resources
Happy building!
Remember: RAG is about the journey of optimization, not just the destination!