Assignment: Build a Production-Ready RAG SystemΒΆ

🎯 Objective¢

Build a complete Retrieval-Augmented Generation (RAG) system for a real-world use case. Your system should handle document ingestion, intelligent retrieval, and high-quality answer generation with proper evaluation metrics.

Estimated Time: 8-10 hours
Difficulty: ⭐⭐⭐⭐ Advanced
Due Date: 2 weeks from assignment

πŸ“‹ RequirementsΒΆ

Part 1: Document Processing Pipeline (20 points)ΒΆ

Build a robust document ingestion system:

  • Multi-format support: PDF, DOCX, TXT, Markdown, HTML

  • Intelligent chunking:

    • Semantic chunking (keep related content together)

    • Overlapping chunks for context preservation

    • Metadata extraction (title, author, date, section)

  • Text cleaning: Remove headers, footers, page numbers

  • Deduplication: Detect and remove duplicate chunks

class DocumentProcessor:
    def __init__(self, chunk_size=512, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def process_document(self, file_path):
        """Process document and return structured chunks."""
        # TODO: Implement
        pass
    
    def chunk_text(self, text, metadata=None):
        """Split text into semantic chunks."""
        # TODO: Implement semantic chunking
        pass
    
    def extract_metadata(self, document):
        """Extract document metadata."""
        # TODO: Implement
        pass

Part 2: Vector Database & Retrieval (25 points)ΒΆ

Implement advanced retrieval strategies:

  • Vector store setup: Use ChromaDB, Pinecone, or Weaviate

  • Embedding generation: Use OpenAI embeddings or open-source alternatives

  • Hybrid search: Combine dense (semantic) + sparse (keyword) retrieval

  • Re-ranking: Implement cross-encoder re-ranking for top results

  • Metadata filtering: Filter by date, author, document type

class RAGRetriever:
    def __init__(self, vector_store, embedding_model):
        self.vector_store = vector_store
        self.embedding_model = embedding_model
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    def hybrid_search(self, query, top_k=10, alpha=0.5):
        """
        Combine semantic and keyword search.
        
        Args:
            query: User question
            top_k: Number of results
            alpha: Weight for semantic vs keyword (0-1)
        """
        # TODO: Implement hybrid search
        pass
    
    def rerank_results(self, query, candidates):
        """Re-rank retrieved documents using cross-encoder."""
        # TODO: Implement re-ranking
        pass
    
    def filter_by_metadata(self, results, filters):
        """Apply metadata filters to results."""
        # TODO: Implement filtering
        pass

Part 3: Answer Generation (25 points)ΒΆ

Create an intelligent answer generator:

  • Context assembly: Select and order most relevant chunks

  • Prompt engineering: Design effective RAG prompts

  • Citation tracking: Include source references in answers

  • Confidence scoring: Estimate answer confidence

  • Fallback handling: Graceful handling when no good answer exists

class AnswerGenerator:
    def __init__(self, llm_client, model="gpt-4-turbo"):
        self.client = llm_client
        self.model = model
    
    def generate_answer(self, query, context_chunks, include_citations=True):
        """
        Generate answer from retrieved context.
        
        Returns:
            {
                "answer": str,
                "citations": List[dict],
                "confidence": float,
                "sources_used": List[str]
            }
        """
        # TODO: Implement answer generation
        pass
    
    def build_prompt(self, query, context):
        """Build RAG prompt with query and context."""
        prompt = f"""Answer the question based on the context below. 
        If the answer is not in the context, say "I don't have enough information."
        
        Context:
        {context}
        
        Question: {query}
        
        Answer with citations:"""
        return prompt
    
    def estimate_confidence(self, answer, context):
        """Estimate answer confidence (0-1)."""
        # TODO: Implement confidence scoring
        pass

Part 4: Evaluation & Testing (30 points)ΒΆ

Comprehensively evaluate your RAG system:

  • Create test dataset: 50+ questions with ground truth answers

  • Retrieval metrics:

    • Precision@K, Recall@K

    • Mean Reciprocal Rank (MRR)

    • Normalized Discounted Cumulative Gain (NDCG)

  • Generation metrics:

    • ROUGE scores

    • BERTScore

    • Semantic similarity

    • Faithfulness (no hallucinations)

  • End-to-end metrics:

    • Answer accuracy

    • Latency (< 3 seconds)

    • Cost per query

class RAGEvaluator:
    def __init__(self, rag_system):
        self.rag_system = rag_system
    
    def evaluate_retrieval(self, test_set):
        """
        Evaluate retrieval quality.
        
        Returns metrics: precision, recall, MRR, NDCG
        """
        pass
    
    def evaluate_generation(self, test_set):
        """
        Evaluate answer quality.
        
        Returns metrics: ROUGE, BERTScore, faithfulness
        """
        pass
    
    def evaluate_end_to_end(self, test_set):
        """
        Full system evaluation.
        
        Returns: accuracy, latency, cost
        """
        pass
    
    def create_evaluation_report(self):
        """Generate comprehensive evaluation report."""
        pass

πŸ“Š Grading RubricΒΆ

Criteria

Exemplary (A: 90-100%)

Proficient (B: 80-89%)

Adequate (C: 70-79%)

Needs Work (D/F: <70%)

Document Processing (20pts)

Multi-format, semantic chunking, metadata

Good chunking, basic metadata

Simple chunking only

Broken or incomplete

Retrieval (25pts)

Hybrid search + reranking, excellent relevance

Good retrieval, some reranking

Basic semantic search

Poor retrieval quality

Generation (25pts)

Citations, confidence, high quality answers

Good answers, some citations

Basic answers generated

Poor answer quality

Evaluation (30pts)

Comprehensive metrics, >50 test cases, deep analysis

Good metrics, 30-50 tests

Basic eval, 20-30 tests

Incomplete evaluation

🎯 Use Case Options (Choose One)¢

Option 1: Technical Documentation AssistantΒΆ

Dataset: Python/React/AWS documentation
Challenge: Handle code examples, API references
Special requirement: Syntax highlighting in answers

Option 2: Research Paper Q&AΒΆ

Dataset: ArXiv papers in your field of interest
Challenge: Mathematical notation, citations
Special requirement: LaTeX rendering

Option 3: Company Knowledge BaseΒΆ

Dataset: Internal docs, wikis, Slack conversations
Challenge: Privacy, access control
Special requirement: User permissions

🌟 Bonus Challenges (+10 points each, max +40)¢

Bonus 1: Conversational RAG (+10)ΒΆ

  • Multi-turn conversations with context

  • Follow-up question handling

  • Conversation memory management

  • Context window optimization

Bonus 2: Advanced Retrieval (+10)ΒΆ

  • Query expansion with LLMs

  • Multi-query retrieval

  • Parent document retrieval

  • Hypothetical Document Embeddings (HyDE)

Bonus 3: Deployment (+10)ΒΆ

  • FastAPI backend

  • Gradio/Streamlit frontend

  • Docker containerization

  • Deploy to cloud (Hugging Face Spaces/Railway)

Bonus 4: Monitoring & Analytics (+10)ΒΆ

  • Query analytics dashboard

  • User feedback collection

  • A/B testing framework

  • Cost tracking per user/query

πŸ“¦ Submission RequirementsΒΆ

Repository StructureΒΆ

your-name-rag-system/
β”œβ”€β”€ README.md                          # Setup and usage guide
β”œβ”€β”€ requirements.txt                   # Dependencies
β”œβ”€β”€ .env.example                       # Environment variables template
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ document_processor.py          # Part 1
β”‚   β”œβ”€β”€ retriever.py                   # Part 2
β”‚   β”œβ”€β”€ generator.py                   # Part 3
β”‚   β”œβ”€β”€ evaluator.py                   # Part 4
β”‚   └── rag_system.py                  # Main system
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ documents/                     # Source documents
β”‚   β”œβ”€β”€ test_set.json                  # Evaluation questions
β”‚   └── ground_truth.json              # Expected answers
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_data_preparation.ipynb
β”‚   β”œβ”€β”€ 02_retrieval_experiments.ipynb
β”‚   β”œβ”€β”€ 03_generation_tuning.ipynb
β”‚   └── 04_evaluation_analysis.ipynb
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_processor.py
β”‚   β”œβ”€β”€ test_retriever.py
β”‚   β”œβ”€β”€ test_generator.py
β”‚   └── test_integration.py
β”œβ”€β”€ results/
β”‚   β”œβ”€β”€ metrics.json
β”‚   β”œβ”€β”€ error_analysis.md
β”‚   └── charts/
└── EVALUATION_REPORT.md               # Detailed analysis

DeliverablesΒΆ

  1. Working RAG System:

    • All 4 parts implemented

    • Passes all tests

    • CLI or API interface

    • Demo notebook

  2. Evaluation Report:

    • Methodology description

    • Metrics tables and charts

    • Error analysis

    • Optimization attempts

    • Conclusions

  3. Test Dataset:

    • 50+ diverse questions

    • Ground truth answers

    • Difficulty levels

    • Coverage of edge cases

  4. Demo:

    • 5-minute video OR

    • Live Gradio/Streamlit app

    • Show: ingestion β†’ retrieval β†’ generation β†’ evaluation

πŸ’‘ Advanced TipsΒΆ

Tip 1: Semantic Chunking Strategy
def semantic_chunking(text, max_chunk_size=512):
    """
    Chunk text at semantic boundaries.
    Priority: paragraphs > sentences > words
    """
    # Try paragraph-level first
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks
Tip 2: Hybrid Search Implementation
def hybrid_search(self, query, top_k=10, alpha=0.7):
    """
    Combine dense (embeddings) + sparse (BM25) search.
    
    alpha: weight for dense search (1-alpha for sparse)
    """
    # Dense retrieval
    query_embedding = self.embed(query)
    dense_results = self.vector_store.similarity_search(
        query_embedding, k=top_k*2
    )
    
    # Sparse retrieval (BM25)
    sparse_results = self.bm25.get_top_n(query, self.documents, n=top_k*2)
    
    # Combine with weighted scoring
    combined_scores = {}
    for doc, score in dense_results:
        combined_scores[doc.id] = alpha * score
    
    for doc, score in sparse_results:
        combined_scores[doc.id] = (
            combined_scores.get(doc.id, 0) + (1-alpha) * score
        )
    
    # Sort and return top k
    sorted_docs = sorted(
        combined_scores.items(), 
        key=lambda x: x[1], 
        reverse=True
    )[:top_k]
    
    return [self.get_doc(doc_id) for doc_id, _ in sorted_docs]
Tip 3: Citation Extraction
def generate_with_citations(self, query, chunks):
    """Generate answer with inline citations."""
    # Number each chunk
    context = "\n\n".join([
        f"[{i+1}] {chunk.text}"
        for i, chunk in enumerate(chunks)
    ])
    
    prompt = f"""Answer using the numbered sources below. 
    Include citations like [1], [2] in your answer.
    
    {context}
    
    Question: {query}
    Answer:"""
    
    answer = self.llm(prompt)
    
    # Extract citations and map to sources
    import re
    citations = re.findall(r'\[(\d+)\]', answer)
    sources = [chunks[int(c)-1].metadata for c in set(citations)]
    
    return {
        "answer": answer,
        "sources": sources,
        "citation_count": len(set(citations))
    }

πŸ“š ResourcesΒΆ

Essential ReadingΒΆ

Tools & LibrariesΒΆ

  • Vector Stores: ChromaDB, Pinecone, Weaviate, Qdrant

  • Embeddings: OpenAI, Cohere, Sentence-Transformers

  • Frameworks: LangChain, LlamaIndex, Haystack

  • Evaluation: RAGAS, DeepEval

PapersΒΆ

πŸŽ“ Learning ObjectivesΒΆ

After completing this assignment, you will:

  • βœ… Build end-to-end RAG systems

  • βœ… Implement advanced retrieval techniques

  • βœ… Optimize for quality, speed, and cost

  • βœ… Evaluate RAG systems comprehensively

  • βœ… Deploy production-ready AI applications

πŸ’¬ SupportΒΆ

Good luck building your RAG system! πŸš€