RAG Evaluation¶

Measuring RAG Performance¶

Key Metrics¶

  1. Retrieval Metrics

    • Precision@K

    • Recall@K

    • MRR (Mean Reciprocal Rank)

  2. Generation Metrics

    • Faithfulness (to context)

    • Answer relevancy

    • Context precision

  3. End-to-End

    • User satisfaction

    • Response time
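The retrieval metrics above can be computed directly from ranked result lists, with no LLM involved. A minimal sketch (document IDs are made up for illustration):

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(ranked_lists: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.5 (2 of top 4 are relevant)
print(recall_at_k(retrieved, relevant, k=2))     # 0.5 (1 of 2 relevant docs found)
print(mrr([retrieved], [relevant]))              # 0.5 (first hit at rank 2)
```

These functions assume you have ground-truth relevance labels per query; building that labeled set is usually the hard part.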

import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import json
import os
from pathlib import Path
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

Evaluation Framework¶

Measuring What Matters¶

The ragas library provides a structured evaluation framework for RAG systems. It computes metrics like faithfulness (does the answer stay true to the retrieved context?), answer relevancy (does the answer address the question?), and context precision (are the retrieved documents actually useful?). Each metric is computed by prompting an LLM to judge the relationship between the question, retrieved context, generated answer, and ground-truth answer. Running these evaluations on a curated test set lets you compare chunking strategies, embedding models, retrieval algorithms, and prompt templates with quantitative rigor rather than subjective impression.

# Define test cases — ragas expects a datasets.Dataset with
# question / answer / contexts / ground_truth columns, so each case
# needs the generated answer and retrieved contexts, not just the question
from datasets import Dataset

test_data = {
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI..."],       # the RAG system's generated answer
    "contexts": [["ML is a subset of AI..."]],   # the retrieved passages (list per question)
    "ground_truth": ["ML is a subset of AI..."],
}
dataset = Dataset.from_dict(test_data)

# Evaluate — each metric prompts an LLM judge, so LLM API
# credentials must be configured in the environment
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

print(f"Faithfulness: {results['faithfulness']}")
print(f"Relevancy: {results['answer_relevancy']}")
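Once each candidate configuration (chunk size, embedding model, retriever, prompt) has been scored on the same test set, choosing between them is simple aggregation. A sketch with illustrative scores (the configuration names and numbers below are hypothetical, not real results):

```python
# Hypothetical per-configuration metric scores from separate evaluation runs
runs = {
    "chunk_256_overlap_0":  {"faithfulness": 0.81, "answer_relevancy": 0.74},
    "chunk_512_overlap_64": {"faithfulness": 0.88, "answer_relevancy": 0.79},
}

def mean_score(metrics: dict) -> float:
    """Unweighted average across metrics; weight them if one matters more."""
    return sum(metrics.values()) / len(metrics)

best = max(runs, key=lambda name: mean_score(runs[name]))
print(f"Best configuration: {best}")  # chunk_512_overlap_64
```

In practice you would also check score variance across questions before declaring a winner, since LLM-judged metrics are noisy on small test sets.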

Best Practices¶

✅ Create evaluation datasets
✅ Monitor retrieval quality
✅ Track latency and costs
✅ A/B test different approaches
✅ Collect user feedback
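For the latency point, a lightweight per-stage timer is often enough to spot whether retrieval or generation dominates response time. A minimal sketch (the `timed` helper and the `time.sleep` stand-ins for real pipeline calls are illustrative):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record wall-clock latency for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

timings = {}
with timed("retrieval", timings):
    time.sleep(0.01)  # stand-in for the real vector-search call
with timed("generation", timings):
    time.sleep(0.02)  # stand-in for the real LLM call

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
```

Logging these per-request timings alongside token counts gives you both the latency and the cost side of the checklist.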

Congratulations! You now understand RAG systems! 🎉