RAG Evaluation¶

Measuring RAG Performance¶

Key Metrics¶

  1. Retrieval Metrics

    • Precision@K

    • Recall@K

    • MRR (Mean Reciprocal Rank)

  2. Generation Metrics

    • Faithfulness (to context)

    • Answer relevancy

    • Context precision

  3. End-to-End

    • User satisfaction

    • Response time
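The retrieval metrics above can be computed directly from ranked result lists, with no LLM involved. A minimal sketch (document IDs are made up for illustration):

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(ranked_lists: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.5 (2 of top 4 are relevant)
print(recall_at_k(retrieved, relevant, k=2))     # 0.5 (1 of 2 relevant docs found)
print(mrr([retrieved], [relevant]))              # 0.5 (first hit at rank 2)
```

These functions assume you have ground-truth relevance labels per query; building that labeled set is usually the hard part.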

import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import json
import os
from pathlib import Path
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

Evaluation Framework¶

Measuring What Matters¶

The ragas library provides a structured evaluation framework for RAG systems. It computes metrics like faithfulness (does the answer stay true to the retrieved context?), answer relevancy (does the answer address the question?), and context precision (are the retrieved documents actually useful?). Each metric is computed by prompting an LLM to judge the relationship between the question, retrieved context, generated answer, and ground-truth answer. Running these evaluations on a curated test set lets you compare chunking strategies, embedding models, retrieval algorithms, and prompt templates with quantitative rigor rather than subjective impression.

# Define test cases — ragas expects a datasets.Dataset with
# question / answer / contexts / ground_truth columns, so each case
# needs the generated answer and retrieved contexts, not just the question
from datasets import Dataset

test_data = {
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI..."],       # the RAG system's generated answer
    "contexts": [["ML is a subset of AI..."]],   # the retrieved passages (list per question)
    "ground_truth": ["ML is a subset of AI..."],
}
dataset = Dataset.from_dict(test_data)

# Evaluate — each metric prompts an LLM judge, so LLM API
# credentials must be configured in the environment
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

print(f"Faithfulness: {results['faithfulness']}")
print(f"Relevancy: {results['answer_relevancy']}")
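Once each candidate configuration (chunk size, embedding model, retriever, prompt) has been scored on the same test set, choosing between them is simple aggregation. A sketch with illustrative scores (the configuration names and numbers below are hypothetical, not real results):

```python
# Hypothetical per-configuration metric scores from separate evaluation runs
runs = {
    "chunk_256_overlap_0":  {"faithfulness": 0.81, "answer_relevancy": 0.74},
    "chunk_512_overlap_64": {"faithfulness": 0.88, "answer_relevancy": 0.79},
}

def mean_score(metrics: dict) -> float:
    """Unweighted average across metrics; weight them if one matters more."""
    return sum(metrics.values()) / len(metrics)

best = max(runs, key=lambda name: mean_score(runs[name]))
print(f"Best configuration: {best}")  # chunk_512_overlap_64
```

In practice you would also check score variance across questions before declaring a winner, since LLM-judged metrics are noisy on small test sets.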

Best Practices¶

✅ Create evaluation datasets
✅ Monitor retrieval quality
✅ Track latency and costs
✅ A/B test different approaches
✅ Collect user feedback
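For the latency point, a lightweight per-stage timer is often enough to spot whether retrieval or generation dominates response time. A minimal sketch (the `timed` helper and the `time.sleep` stand-ins for real pipeline calls are illustrative):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record wall-clock latency for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

timings = {}
with timed("retrieval", timings):
    time.sleep(0.01)  # stand-in for the real vector-search call
with timed("generation", timings):
    time.sleep(0.02)  # stand-in for the real LLM call

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
```

Logging these per-request timings alongside token counts gives you both the latency and the cost side of the checklist.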

Congratulations! You now understand RAG systems! 🎉