# Phase 16: Model Evaluation – Start Here
Measure whether your AI models actually work: the science of choosing the right metrics, detecting bias, and comparing models objectively.
## Why Evaluation Matters
A model that scores 99% accuracy but fails on the 1% of critical cases is dangerous in production. Proper evaluation reveals where models actually fail.
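The accuracy paradox above can be demonstrated in a few lines. This is a minimal sketch with synthetic data (the class ratio and labels are illustrative, not from any notebook in this phase): a classifier that always predicts the majority class scores 99% accuracy while catching zero critical cases.

```python
# Sketch: a "99% accurate" classifier that misses every critical case.
# Synthetic data: 1% of samples belong to the positive (critical) class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)  # 10 critical cases out of 1000
y_pred = np.zeros(1000, dtype=int)       # model always predicts "not critical"

print(accuracy_score(y_true, y_pred))    # 0.99, looks great on paper
print(recall_score(y_true, y_pred))      # 0.0, catches no critical case at all
```

This is why the cheat sheet below recommends recall-aware metrics (F1, PR-AUC) whenever classes are imbalanced.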
## Notebooks in This Phase
| Notebook | Topic |
|---|---|
| 01_classification_metrics.ipynb | Accuracy, precision, recall, F1, ROC-AUC |
| 02_regression_metrics.ipynb | MAE, MSE, RMSE, R², MAPE |
| 03_llm_evaluation.ipynb | BLEU, ROUGE, BERTScore, LLM-as-judge |
| 04_bias_fairness.ipynb | Detect and measure model bias across groups |
| 05_model_comparison.ipynb | A/B testing, statistical significance, benchmarks |
## Metric Cheat Sheet
| Task | Primary Metric | When to Use Secondary |
|---|---|---|
| Classification (balanced) | Accuracy | – |
| Classification (imbalanced) | F1 / PR-AUC | Always check recall |
| Binary with costs | ROC-AUC | Precision at recall threshold |
| Regression | RMSE | R² for explained variance |
| Text generation | ROUGE-L | BERTScore for semantics |
| LLM quality | LLM-as-judge | Human eval for high stakes |
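The classification and regression rows of the cheat sheet map directly onto scikit-learn functions. A minimal sketch on toy data (the arrays here are illustrative, not from the notebooks):

```python
# Sketch: computing cheat-sheet metrics with scikit-learn on toy data.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, mean_squared_error, r2_score

# Classification (imbalanced): F1 needs hard labels, ROC-AUC needs scores.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.7, 0.9])
y_pred = (y_score >= 0.5).astype(int)
print(f1_score(y_true, y_pred))          # 0.8
print(roc_auc_score(y_true, y_score))    # 1.0 (scores rank perfectly here)

# Regression: RMSE in the target's units, R^2 for explained variance.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 7.5])
rmse = np.sqrt(mean_squared_error(y, y_hat))
print(rmse, r2_score(y, y_hat))
```

Note that ROC-AUC is threshold-free (it ranks scores), while F1 depends on the chosen decision threshold, which is why the imbalanced row recommends checking recall at your operating point.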
## Prerequisites

- Machine learning basics
- Python/NumPy/scikit-learn
## Learning Path

1. 01_classification_metrics.ipynb – Start here
2. 02_regression_metrics.ipynb
3. 03_llm_evaluation.ipynb
4. 04_bias_fairness.ipynb
5. 05_model_comparison.ipynb