Phase 16: Model Evaluation — Start Here

Measure whether your AI models actually work β€” the science of choosing the right metrics, detecting bias, and comparing models objectively.

Why Evaluation Matters

A model that scores 99% accuracy but fails on the 1% of critical cases is dangerous in production. Proper evaluation reveals where models actually fail.
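The accuracy paradox is easy to demonstrate. The sketch below uses invented labels (not data from the notebooks): a model that always predicts the majority class scores 99% accuracy yet catches none of the critical cases.

```python
# Illustrative imbalanced dataset: 1% of cases are critical (label 1).
y_true = [0] * 99 + [1]
y_pred = [0] * 100               # model always predicts "not critical"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0 -- every critical case is missed
```

High accuracy here says nothing about the failure mode that matters, which is why the cheat sheet below recommends F1 or PR-AUC for imbalanced classification.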

Notebooks in This Phase

| Notebook | Topic |
| --- | --- |
| 01_classification_metrics.ipynb | Accuracy, precision, recall, F1, ROC-AUC |
| 02_regression_metrics.ipynb | MAE, MSE, RMSE, R², MAPE |
| 03_llm_evaluation.ipynb | BLEU, ROUGE, BERTScore, LLM-as-judge |
| 04_bias_fairness.ipynb | Detect and measure model bias across groups |
| 05_model_comparison.ipynb | A/B testing, statistical significance, benchmarks |
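One of the text-generation metrics listed above, ROUGE-L, can be sketched in a few lines: it scores a candidate against a reference by the length of their longest common subsequence (LCS) of tokens. The sentences below are invented for illustration; production code would use a library such as `rouge-score` rather than this hand-rolled version.

```python
def lcs_len(a, b):
    # Dynamic-programming LCS length over two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    # ROUGE-L F1: harmonic mean of LCS-based recall and precision.
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))  # 5/6 ≈ 0.83
```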

Metric Cheat Sheet

| Task | Primary Metric | When to Use Secondary |
| --- | --- | --- |
| Classification (balanced) | Accuracy | — |
| Classification (imbalanced) | F1 / PR-AUC | Always check recall |
| Binary with costs | ROC-AUC | Precision at recall threshold |
| Regression | RMSE | R² for explained variance |
| Text generation | ROUGE-L | BERTScore for semantics |
| LLM quality | LLM-as-judge | Human eval for high stakes |
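As a quick illustration of the classification and regression rows above, the sketch below computes precision, recall, F1, RMSE, and R² by hand on invented toy data; in practice scikit-learn's `metrics` module provides all of these.

```python
import math

# --- Classification metrics on toy predictions ---
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)                           # 0.75
recall = tp / (tp + fn)                              # 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75

# --- Regression metrics on toy predictions ---
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 8.0]

mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
rmse = math.sqrt(mse)

mean_y = sum(y) / len(y)
ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # residual sum of squares
ss_tot = sum((a - mean_y) ** 2 for a in y)             # total sum of squares
r2 = 1 - ss_res / ss_tot                               # 0.925
```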

Prerequisites

  • Machine learning basics

  • Python/NumPy/scikit-learn

Learning Path

01_classification_metrics.ipynb  ← Start here
02_regression_metrics.ipynb
03_llm_evaluation.ipynb
04_bias_fairness.ipynb
05_model_comparison.ipynb
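The statistical-significance idea behind 05_model_comparison.ipynb can be sketched with a paired bootstrap test: resample the test set with replacement many times and count how often model B's accuracy beats model A's on the same resample. The labels and model outputs below are randomly generated for illustration only.

```python
import random

random.seed(0)

# Invented test set and two simulated models: A is ~80% accurate, B ~85%.
y_true = [random.randint(0, 1) for _ in range(200)]
pred_a = [t if random.random() < 0.80 else 1 - t for t in y_true]
pred_b = [t if random.random() < 0.85 else 1 - t for t in y_true]

def paired_bootstrap(y, a, b, n_boot=2000):
    # Resample example indices with replacement; both models are scored on
    # the SAME resample each round, which is what makes the test "paired".
    n = len(y)
    wins = 0
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        acc_a = sum(a[i] == y[i] for i in idx) / n
        acc_b = sum(b[i] == y[i] for i in idx) / n
        wins += acc_b > acc_a
    return wins / n_boot  # fraction of resamples where B beats A

p_b_better = paired_bootstrap(y_true, pred_a, pred_b)
print(p_b_better)
```

If this fraction is close to 1.0 (or the complementary fraction is below a chosen significance level), the improvement is unlikely to be a resampling artifact.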