RAG Evaluation
Measuring RAG Performance
Key Metrics
Retrieval Metrics
- Precision@K
- Recall@K
- MRR (Mean Reciprocal Rank)
Generation Metrics
- Faithfulness (to context)
- Answer relevancy
- Context precision
End-to-End
- User satisfaction
- Response time
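The retrieval metrics above are simple enough to compute directly. Here is a minimal sketch using toy document IDs (no external libraries; the helper names are illustrative, not from any standard package):

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def mrr(results: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=2))  # 0.5 (one of top 2 is relevant)
print(recall_at_k(retrieved, relevant, k=4))     # 1.0 (both relevant docs found)
print(mrr([retrieved], [relevant]))              # 0.5 (first relevant hit at rank 2)
```

Note that Precision@K rewards ranking relevant documents early, while Recall@K measures coverage; MRR captures how quickly the first useful document appears, which matters when the generator mostly attends to the top result.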
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import json
import os
from pathlib import Path
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
Evaluation Framework
Measuring What Matters
The ragas library provides a structured evaluation framework for RAG systems. It computes metrics like faithfulness (does the answer stay true to the retrieved context?), answer relevancy (does the answer address the question?), and context precision (are the retrieved documents actually useful?). Each metric is computed by prompting an LLM to judge the relationship between the question, retrieved context, generated answer, and ground-truth answer. Running these evaluations on a curated test set lets you compare chunking strategies, embedding models, retrieval algorithms, and prompt templates with quantitative rigor rather than subjective impression.
# Define test cases: ragas expects a datasets.Dataset with
# question / answer / contexts / ground_truth columns
from datasets import Dataset

test_cases = Dataset.from_dict({
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI that learns patterns from data."],
    "contexts": [["Machine learning (ML) is a subset of AI..."]],
    "ground_truth": ["ML is a subset of AI..."],
})

# Evaluate
results = evaluate(
    test_cases,
    metrics=[faithfulness, answer_relevancy],
)
print(f"Faithfulness: {results['faithfulness']}")
print(f"Relevancy: {results['answer_relevancy']}")
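For intuition about what the LLM judge is approximating, a crude lexical proxy for faithfulness can be sketched with no model at all. This is illustrative only; `token_overlap_faithfulness` is a hypothetical helper, not part of ragas, and real faithfulness metrics use an LLM judge rather than token overlap:

```python
def token_overlap_faithfulness(answer: str, context: str) -> float:
    """Crude lexical proxy: fraction of answer tokens that also appear
    in the retrieved context. Misses paraphrase and negation, which is
    exactly why production metrics use an LLM judge instead."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

print(token_overlap_faithfulness(
    "ML is a subset of AI",
    "Machine learning ML is a subset of AI that learns from data",
))  # 1.0: every answer token is grounded in the context
```

A score near 1.0 suggests the answer stays within the context's vocabulary; a low score flags possible hallucination, though only a semantic judge can confirm it.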
Best Practices
✓ Create evaluation datasets
✓ Monitor retrieval quality
✓ Track latency and costs
✓ A/B test different approaches
✓ Collect user feedback
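Two of these practices, A/B testing and latency tracking, can be sketched together. This is a minimal illustration under stated assumptions: `rag_fn` is a hypothetical stand-in for a real RAG pipeline call, and bucketing is done with a stable hash so each user always sees the same variant:

```python
import hashlib
import time

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Deterministic A/B bucketing: hash the user id so assignment
    is stable across sessions and processes."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return variants[digest % len(variants)]

def timed_call(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def rag_fn(question: str) -> str:
    """Hypothetical stand-in for a real retrieve-then-generate call."""
    return f"Answer to: {question}"

# Collect per-variant latencies and report p50/p95.
latencies = {"A": [], "B": []}
for i in range(100):
    variant = assign_variant(f"user-{i}")
    _, elapsed = timed_call(rag_fn, "What is machine learning?")
    latencies[variant].append(elapsed)

for variant, samples in sorted(latencies.items()):
    if not samples:
        continue
    samples.sort()
    p50 = samples[len(samples) // 2]
    p95 = samples[min(int(len(samples) * 0.95), len(samples) - 1)]
    print(f"variant {variant}: n={len(samples)} p50={p50:.6f}s p95={p95:.6f}s")
```

Hashing the user id (rather than assigning randomly per request) keeps each user's experience consistent, which avoids contaminating both the quality comparison and any user-feedback signals collected per variant.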
Congratulations! You now understand RAG systems! 🎉