Phase 15 Assignment: Complete Model Evaluation Pipeline

📋 Overview

Build a comprehensive model evaluation pipeline that compares multiple models across various metrics, tests for statistical significance, and provides deployment recommendations.

Due: End of Phase 15
Points: 120 (100 across Parts 1-5, plus 10 each for code quality and report quality) + 20 bonus points
Submission: Jupyter Notebook + Report

🎯 Objectives

You will:

  • Implement end-to-end model evaluation

  • Compare 3+ models with cross-validation

  • Evaluate fairness and bias

  • Perform statistical significance testing

  • Make data-driven deployment decisions

πŸ“ RequirementsΒΆ

Part 1: Dataset Selection (10 points)

Choose ONE dataset:

Option A: Classification (Recommended)

  • UCI Adult Income dataset

  • Credit approval dataset

  • Customer churn dataset

Option B: Regression

  • Boston Housing dataset (note: removed from scikit-learn 1.2 over ethical concerns; California Housing is the safer default)

  • California Housing dataset

  • Your own regression dataset

Requirements:

  • At least 1000 samples

  • Include a protected attribute (e.g., gender, age, race) for fairness analysis

  • Document data source and description

  • Perform EDA (5+ visualizations); a minimal loading/EDA sketch follows this list
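
If you choose Option A, a minimal loading/EDA starting point might look like the sketch below. It assumes the OpenML copy of the UCI Adult dataset, where the target column is named `class` and `sex` is available as a protected attribute:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_openml

# Fetch the UCI Adult income dataset from OpenML (~48k rows, well above the
# 1000-sample minimum; includes sex, race, and age as candidate protected attributes).
adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame

print(df.shape)
print(df["class"].value_counts(normalize=True))  # class balance

# One of the 5+ required visualizations: income class by sex.
pd.crosstab(df["sex"], df["class"], normalize="index").plot(kind="bar", stacked=True)
plt.ylabel("Proportion")
plt.title("Income class by sex")
plt.tight_layout()
plt.show()
```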

Part 2: Model Training & Comparison (25 points)

Train at least 3 different models:

Classification options:

  • Logistic Regression

  • Random Forest

  • Gradient Boosting

  • SVM

  • Neural Network

Regression options:

  • Linear Regression

  • Ridge/Lasso

  • Random Forest Regressor

  • Gradient Boosting Regressor

  • Neural Network

Requirements:

  • Use 5-fold or 10-fold cross-validation (see the sketch after the deliverables list)

  • Track multiple metrics (accuracy, precision, recall, F1, AUC for classification OR MSE, RMSE, MAE, R² for regression)

  • Measure training and prediction time

  • Create comparison table

  • Visualize results (3+ plots)

Deliverables:

  • Comparison DataFrame with all metrics

  • Bar plots comparing models

  • Training/prediction time comparison

  • Best performing model identification
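
A minimal sketch of the comparison loop, assuming a binary classification dataset already prepared as a feature matrix `X` and a label vector `y` encoded to {0, 1} (hypothetical names). `cross_validate` collects all metrics plus per-fold fit and score times in one pass:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

rows = []
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    row = {"model": name}
    for metric in scoring:
        row[metric] = cv[f"test_{metric}"].mean()
    row["fit_time_s"] = cv["fit_time"].mean()      # training time per fold
    row["score_time_s"] = cv["score_time"].mean()  # prediction time per fold
    rows.append(row)

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))

comparison[scoring].plot(kind="bar")  # one of the 3+ required plots
plt.tight_layout()
plt.show()
```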

Part 3: Statistical Testing (20 points)

Compare your top 2 models statistically (a sketch of these tests follows the list):

  • Paired t-test on cross-validation scores

  • McNemar's test (classification) or correlation test (regression)

  • Calculate confidence intervals

  • Report p-values and interpret

  • Determine if difference is statistically significant
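
A minimal sketch of these tests, assuming `scores_a` and `scores_b` hold per-fold CV scores for the top two models on the same folds, and `y_test`, `pred_a`, `pred_b` are numpy arrays of held-out labels and predictions (all hypothetical names):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Paired t-test on cross-validation scores (folds are the paired observations).
t_stat, p_val = stats.ttest_rel(scores_a, scores_b)
print(f"paired t-test: t={t_stat:.3f}, p={p_val:.4f}")

# 95% confidence interval for the mean score difference.
diff = np.asarray(scores_a) - np.asarray(scores_b)
ci = stats.t.interval(0.95, df=len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))
print(f"95% CI for mean difference: ({ci[0]:.4f}, {ci[1]:.4f})")

# McNemar's test on the 2x2 disagreement table (classification only).
a_right = pred_a == y_test
b_right = pred_b == y_test
table = [[np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
         [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]]
result = mcnemar(table, exact=True)
print(f"McNemar: statistic={result.statistic}, p={result.pvalue:.4f}")
```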

Deliverables:

  • Statistical test results

  • Interpretation write-up (200+ words)

  • Decision: Is one model significantly better?

Part 4: Fairness & Bias Evaluation (25 points)

Analyze bias across the protected attribute (a fairlearn-based sketch follows the requirements list):

Classification:

  • Calculate demographic parity difference

  • Calculate demographic parity ratio

  • Calculate equalized odds difference

  • Check 80% rule compliance

  • Performance breakdown by group (confusion matrices)

Regression:

  • Calculate metrics by group (MAE, RMSE, R²)

  • Test for statistical difference in errors

  • Analyze residual distributions by group

Requirements:

  • Test for bias in at least 1 protected attribute

  • Create visualizations (2+ plots)

  • Document findings

  • Recommend mitigation if bias detected
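
For the classification track, the fairlearn package implements these metrics directly. A sketch, assuming numpy arrays `y_test`, `y_pred`, and a protected-attribute array `sex` aligned with the test set (hypothetical names):

```python
import numpy as np
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    demographic_parity_ratio,
    equalized_odds_difference,
)
from sklearn.metrics import confusion_matrix, recall_score

dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=sex)
dpr = demographic_parity_ratio(y_test, y_pred, sensitive_features=sex)
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=sex)
print(f"demographic parity difference: {dpd:.3f}")
print(f"demographic parity ratio:      {dpr:.3f}  (80% rule: ratio >= 0.8)")
print(f"equalized odds difference:     {eod:.3f}")

# Group-wise performance breakdown, including per-group confusion matrices.
mf = MetricFrame(metrics={"recall": recall_score},
                 y_true=y_test, y_pred=y_pred, sensitive_features=sex)
print(mf.by_group)
for group in np.unique(sex):
    mask = sex == group
    print(group, "\n", confusion_matrix(y_test[mask], y_pred[mask]))
```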

Deliverables:

  • Fairness metrics table

  • Group-wise performance comparison

  • Bias detection report (300+ words)

  • Mitigation recommendations (if applicable)

Part 5: Model Selection & Recommendation (20 points)

Make a final deployment recommendation:

Consider:

  • Performance metrics

  • Statistical significance

  • Fairness metrics

  • Speed/latency

  • Interpretability needs

  • Resource constraints

Requirements:

  • Multi-objective scoring (create a weighted composite; see the sketch after this list)

  • Sensitivity analysis (test different weights)

  • Trade-off visualization

  • Final recommendation with justification
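
One way to build the composite, assuming `comparison` is the Part 2 DataFrame extended with hypothetical `fairness_gap` and `score_time_s` columns. The weights are deliberately placeholders; varying them is exactly what the sensitivity analysis asks for:

```python
import pandas as pd

def composite(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Min-max normalize each column to [0, 1], then take a weighted sum.
    Columns where smaller is better (latency, fairness gap) are inverted."""
    score = pd.Series(0.0, index=df.index)
    for col, w in weights.items():
        normed = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        if col in ("score_time_s", "fairness_gap"):  # lower is better
            normed = 1 - normed
        score += w * normed
    return score

weights = {"f1": 0.5, "fairness_gap": 0.3, "score_time_s": 0.2}
comparison["composite"] = composite(comparison, weights)
print(comparison.sort_values("composite", ascending=False))

# Sensitivity analysis: does the winner change as the fairness weight grows?
for w_fair in (0.1, 0.3, 0.5):
    w = {"f1": 0.8 - w_fair, "fairness_gap": w_fair, "score_time_s": 0.2}
    print(f"fairness weight {w_fair}: best model = {composite(comparison, w).idxmax()}")
```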

Deliverables:

  • Composite scoring table

  • Trade-off plots (accuracy vs speed, accuracy vs fairness)

  • Final recommendation report (500+ words)

  • Deployment plan

🎁 Bonus Challenges (20 extra points)

Bonus 1: Advanced Metrics (5 points)

  • Implement text-generation metrics (BLEU, ROUGE, BERTScore) if using text data; see the sketch below

  • Or implement custom domain-specific metrics
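
A minimal sketch of BLEU and ROUGE on toy strings, assuming the `nltk` and `rouge-score` packages are installed:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "the model was deployed after evaluation"
candidate = "the model was deployed following evaluation"

# Sentence-level BLEU with smoothing (avoids zero scores on short texts).
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# ROUGE-1 and ROUGE-L F1 on the raw strings.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```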

Bonus 2: A/B Test Simulation (5 points)

  • Simulate A/B test between top 2 models

  • Calculate the required sample size (sketched below)

  • Perform power analysis

  • Create monitoring dashboard mockup
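
The sample-size calculation can lean on statsmodels' power tools. A sketch, assuming an illustrative 85% baseline accuracy and a 2-point minimum detectable lift:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_baseline, p_candidate = 0.85, 0.87  # illustrative accuracies, not measured values
effect = proportion_effectsize(p_candidate, p_baseline)  # Cohen's h

# Solve for the per-arm sample size at the conventional alpha=0.05, power=0.80.
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.80, alternative="two-sided")
print(f"required samples per arm: {n_per_arm:.0f}")
```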

Bonus 3: Bias Mitigation (5 points)

  • Implement 2+ bias mitigation techniques:

    • Reweighting

    • Threshold optimization (see the ThresholdOptimizer sketch below)

    • Fairlearn constraints

  • Compare fairness before/after mitigation
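
A sketch of the threshold-optimization route via fairlearn, assuming a fitted classifier `model` plus the same (hypothetical) train/test splits and protected-attribute arrays as Part 4:

```python
from fairlearn.metrics import demographic_parity_difference
from fairlearn.postprocessing import ThresholdOptimizer

# Post-process a prefit model by choosing group-specific decision thresholds.
mitigator = ThresholdOptimizer(estimator=model,
                               constraints="demographic_parity",
                               predict_method="predict_proba",
                               prefit=True)
mitigator.fit(X_train, y_train, sensitive_features=sex_train)
y_pred_mitigated = mitigator.predict(X_test, sensitive_features=sex_test)

# Compare the fairness gap before and after mitigation.
before = demographic_parity_difference(y_test, y_pred, sensitive_features=sex_test)
after = demographic_parity_difference(y_test, y_pred_mitigated, sensitive_features=sex_test)
print(f"demographic parity difference: before={before:.3f}, after={after:.3f}")
```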

Bonus 4: Production Monitoring Plan (5 points)

  • Design monitoring dashboard

  • Define alerting thresholds

  • Create model card documentation

  • Plan an A/B test for deployment

📊 Grading Rubric

| Component | Points | Criteria |
|---|---|---|
| Dataset & EDA | 10 | Appropriate dataset, thorough EDA, clear documentation |
| Model Comparison | 25 | 3+ models, proper cross-validation, comprehensive metrics, good visualizations |
| Statistical Testing | 20 | Correct tests applied, proper interpretation, clear conclusions |
| Fairness Analysis | 25 | Multiple fairness metrics, group analysis, actionable insights |
| Final Recommendation | 20 | Well-reasoned decision, multi-objective analysis, clear documentation |
| Code Quality | 10 | Clean, documented, reproducible code |
| Report Quality | 10 | Clear writing, professional formatting, complete analysis |
| Bonus | +20 | Extra challenges completed |
| TOTAL | 120 (+20 bonus) | |

📦 Submission Requirements

Files to Submit:

  1. Jupyter Notebook (evaluation_pipeline.ipynb)

    • All code with outputs

    • Markdown explanations

    • Visualizations embedded

  2. PDF Report (evaluation_report.pdf)

    • Executive summary

    • Methods description

    • Results and findings

    • Final recommendation

    • Appendix with all plots

  3. README (README.md)

    • How to run code

    • Dependencies list

    • Dataset download instructions

    • Expected runtime

Submission Format:

phase15_assignment_[your_name]/
├── evaluation_pipeline.ipynb
├── evaluation_report.pdf
├── README.md
├── data/
│   └── dataset.csv (if small)
└── figures/
    └── *.png (exported plots)

💡 Tips for Success

  1. Start Early

    • This assignment requires significant analysis

    • Leave time for statistical testing and report writing

  2. Document Everything

    • Explain your choices

    • Interpret results in context

    • Justify your recommendation

  3. Focus on Insights

    • Don't just report numbers

    • Explain what they mean

    • Connect to real-world impact

  4. Be Thorough

    • Check all requirements

    • Include all visualizations

    • Verify your code runs end-to-end

  5. Professional Presentation

    • Clean code with comments

    • Well-formatted plots

    • Polished report

πŸ” Example StructureΒΆ

# 1. Data Loading & EDA
# 2. Data Preprocessing
# 3. Model Training
# 4. Cross-Validation Evaluation
# 5. Statistical Comparison
# 6. Fairness Analysis
# 7. Multi-Objective Selection
# 8. Final Recommendation

❓ FAQ

Q: Can I use my own dataset?
A: Yes, if it meets the requirements (1,000+ samples and a protected attribute).

Q: What if my data doesn't have protected attributes?
A: Use UCI Adult, Credit Approval, or similar public dataset with demographic info.

Q: How many models minimum?
A: At least 3 different model types are required. More is better!

Q: Can I work in a group?
A: Check with your instructor; individual work is usually preferred.

Q: What if models perform very similarly?
A: Great! Discuss why statistical testing is important. Recommend based on other factors (speed, interpretability, cost).

🎓 Learning Outcomes

After completing this assignment, you will be able to:

  • ✅ Design comprehensive model evaluation pipelines

  • ✅ Select appropriate metrics for different scenarios

  • ✅ Perform rigorous statistical comparisons

  • ✅ Identify and measure model bias

  • ✅ Make data-driven deployment decisions

  • ✅ Communicate technical findings effectively

Good luck! This is your chance to demonstrate mastery of model evaluation! 🚀