# Phase 15 Assignment: Complete Model Evaluation Pipeline
## Overview

Build a comprehensive model evaluation pipeline that compares multiple models across multiple metrics, tests for statistical significance, and provides deployment recommendations.

**Due:** End of Phase 15
**Points:** 100 points (+ 20 bonus points)
**Submission:** Jupyter Notebook + Report
## Objectives

You will:

- Implement an end-to-end model evaluation pipeline
- Compare 3+ models with cross-validation
- Evaluate fairness and bias
- Perform statistical significance testing
- Make data-driven deployment decisions
## Requirements

### Part 1: Dataset Selection (10 points)

Choose ONE dataset:

**Option A: Classification (recommended)**

- UCI Adult Income dataset
- Credit approval dataset
- Customer churn dataset

**Option B: Regression**

- Boston Housing dataset
- California Housing dataset
- Your own regression dataset

**Requirements:**

- At least 1,000 samples
- Includes a protected attribute (e.g., gender, age, race) for fairness analysis
- Document the data source and provide a description
- Perform EDA (5+ visualizations)
### Part 2: Model Training & Comparison (25 points)

Train at least 3 different models.

**Classification options:**

- Logistic Regression
- Random Forest
- Gradient Boosting
- SVM
- Neural Network

**Regression options:**

- Linear Regression
- Ridge/Lasso
- Random Forest Regressor
- Gradient Boosting Regressor
- Neural Network

**Requirements:**

- Use 5-fold or 10-fold cross-validation
- Track multiple metrics (accuracy, precision, recall, F1, and AUC for classification; MSE, RMSE, MAE, and R² for regression)
- Measure training and prediction time
- Create a comparison table
- Visualize results (3+ plots)

**Deliverables:**

- Comparison DataFrame with all metrics
- Bar plots comparing models
- Training/prediction time comparison
- Identification of the best-performing model
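The cross-validated comparison described above can be sketched as follows. The synthetic dataset and the particular model choices are illustrative placeholders; substitute your own data and candidates. Note that `cross_validate` already reports per-fold fit and score times, which covers the timing requirement.

```python
# Sketch of a cross-validated model comparison (classification case).
# Dataset and model list are illustrative stand-ins for your own choices.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradBoost": GradientBoostingClassifier(random_state=42),
}

rows = []
for name, model in models.items():
    cv = cross_validate(
        model, X, y, cv=5,
        scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
    )
    rows.append({
        "model": name,
        "accuracy": cv["test_accuracy"].mean(),
        "precision": cv["test_precision"].mean(),
        "recall": cv["test_recall"].mean(),
        "f1": cv["test_f1"].mean(),
        "auc": cv["test_roc_auc"].mean(),
        "fit_time_s": cv["fit_time"].mean(),      # mean training time per fold
        "score_time_s": cv["score_time"].mean(),  # mean prediction time per fold
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

The resulting DataFrame doubles as the required comparison table and feeds directly into the bar plots.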
### Part 3: Statistical Testing (20 points)

Compare your top 2 models statistically:

- Paired t-test on cross-validation scores
- McNemar's test (classification) or a correlation test (regression)
- Calculate confidence intervals
- Report p-values and interpret them
- Determine whether the difference is statistically significant

**Deliverables:**

- Statistical test results
- Interpretation write-up (200+ words)
- Decision: Is one model significantly better?
</gr-replace>
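A minimal sketch of the paired t-test and confidence interval on cross-validation scores. The score arrays are illustrative; in the assignment they must come from the same CV folds for both models (identical splits are what make the pairing valid).

```python
# Paired t-test on per-fold CV scores for the top two models.
# Scores below are illustrative; use your own per-fold results.
import numpy as np
from scipy import stats

scores_a = np.array([0.85, 0.87, 0.84, 0.86, 0.88])  # model A, 5 folds
scores_b = np.array([0.83, 0.85, 0.84, 0.82, 0.86])  # model B, same folds

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

# 95% confidence interval for the mean per-fold difference
diff = scores_a - scores_b
ci = stats.t.interval(0.95, df=len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"mean difference = {diff.mean():.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print("significant" if p_value < 0.05 else "not significant at alpha = 0.05")
```

With only 5 folds the test has low power, which is worth acknowledging in the interpretation write-up.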
### Part 4: Fairness & Bias Evaluation (25 points)

Analyze bias across a protected attribute.

**Classification:**

- Calculate the demographic parity difference
- Calculate the demographic parity ratio
- Calculate the equalized odds difference
- Check 80% rule compliance
- Break down performance by group (confusion matrices)

**Regression:**

- Calculate metrics by group (MAE, RMSE, R²)
- Test for statistically significant differences in errors
- Analyze residual distributions by group

**Requirements:**

- Test for bias in at least 1 protected attribute
- Create visualizations (2+ plots)
- Document findings
- Recommend mitigation if bias is detected

**Deliverables:**

- Fairness metrics table
- Group-wise performance comparison
- Bias detection report (300+ words)
- Mitigation recommendations (if applicable)
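The demographic parity metrics and the 80% rule check can be computed by hand in a few lines; the arrays below are synthetic stand-ins for your model's predictions and protected attribute. Fairlearn provides the same metrics as library functions if you prefer not to roll your own.

```python
# Hand-rolled demographic parity metrics on synthetic, deliberately
# group-dependent predictions (illustrative assumption, not real data).
import numpy as np

rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=1000)                      # protected attribute
y_pred = rng.random(1000) < np.where(group == "A", 0.6, 0.45)  # biased predictions

# Selection rate (positive-prediction rate) per group
rates = {g: y_pred[group == g].mean() for g in ("A", "B")}

dp_difference = abs(rates["A"] - rates["B"])          # demographic parity difference
dp_ratio = min(rates.values()) / max(rates.values())  # demographic parity ratio

print(f"selection rates: {rates}")
print(f"DP difference = {dp_difference:.3f}, DP ratio = {dp_ratio:.3f}")
print("passes 80% rule" if dp_ratio >= 0.8 else "FAILS 80% rule")
```

The 80% rule is simply the check that the demographic parity ratio is at least 0.8.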
### Part 5: Model Selection & Recommendation (20 points)

Make a final deployment recommendation.

**Consider:**

- Performance metrics
- Statistical significance
- Fairness metrics
- Speed/latency
- Interpretability needs
- Resource constraints

**Requirements:**

- Multi-objective scoring (create a weighted composite)
- Sensitivity analysis (test different weights)
- Trade-off visualization
- Final recommendation with justification

**Deliverables:**

- Composite scoring table
- Trade-off plots (accuracy vs. speed, accuracy vs. fairness)
- Final recommendation report (500+ words)
- Deployment plan
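One way to build the weighted composite is to min-max normalize each criterion so higher is better, then take a weighted sum. The metric values, model names, and weights below are illustrative assumptions; the sensitivity analysis then amounts to re-running the scoring with different weight vectors.

```python
# Multi-objective composite scoring sketch with illustrative metric values.
import pandas as pd

metrics = pd.DataFrame({
    "accuracy": [0.86, 0.89, 0.84],
    "fairness": [0.92, 0.78, 0.95],    # e.g., demographic parity ratio
    "speed":    [0.005, 0.120, 0.002],  # prediction time in seconds (lower is better)
}, index=["LogReg", "GradBoost", "Tree"])

# Min-max normalize each metric to [0, 1]; invert speed so higher is better
norm = (metrics - metrics.min()) / (metrics.max() - metrics.min())
norm["speed"] = 1 - norm["speed"]

weights = {"accuracy": 0.5, "fairness": 0.3, "speed": 0.2}
norm["composite"] = sum(norm[m] * w for m, w in weights.items())

print(norm.round(3).sort_values("composite", ascending=False))
```

Varying `weights` (e.g., emphasizing fairness over accuracy) and observing whether the ranking flips is exactly the sensitivity analysis the requirements ask for.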
## Bonus Challenges (20 extra points)

### Bonus 1: Advanced Metrics (5 points)

- Implement LLM-specific metrics (BLEU, ROUGE, BERTScore) if using text data
- Or implement custom domain-specific metrics

### Bonus 2: A/B Test Simulation (5 points)

- Simulate an A/B test between your top 2 models
- Calculate the required sample size
- Perform a power analysis
- Create a monitoring dashboard mockup
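The required sample size per arm can be estimated with the standard normal-approximation formula for comparing two proportions. The baseline rate and minimum detectable effect below are illustrative assumptions; `statsmodels` offers equivalent power routines if you prefer a library.

```python
# Sample-size estimate per arm for a two-proportion A/B test
# (normal approximation; rates below are illustrative assumptions).
from scipy.stats import norm


def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """n per arm to detect a shift from p1 to p2 with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1


# e.g., detect an improvement from an 85% to an 87% success rate
n = sample_size_per_arm(0.85, 0.87)
print(f"required samples per arm: {n}")
```

Small effects near high baseline rates require thousands of samples per arm, which is a useful point to make in the power-analysis discussion.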
### Bonus 3: Bias Mitigation (5 points)

Implement 2+ bias mitigation techniques:

- Reweighting
- Threshold optimization
- Fairlearn constraints

Compare fairness before and after mitigation.
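As a sketch of the reweighting option, the Kamiran-Calders scheme assigns each sample the weight P(group) * P(label) / P(group, label), which equalizes the weighted base rate across groups before training. The synthetic `group` and `y` arrays are illustrative assumptions.

```python
# Kamiran-Calders reweighting sketch on synthetic, group-skewed labels.
import numpy as np

rng = np.random.default_rng(1)
group = rng.choice(["A", "B"], size=1000, p=[0.7, 0.3])
y = (rng.random(1000) < np.where(group == "A", 0.5, 0.3)).astype(int)

# w(g, l) = P(g) * P(l) / P(g, l) for each (group, label) cell
weights = np.empty(len(y))
for g in ("A", "B"):
    for label in (0, 1):
        mask = (group == g) & (y == label)
        weights[mask] = (group == g).mean() * (y == label).mean() / mask.mean()

# After reweighting, the weighted base rate is identical across groups
rates = {g: np.average(y[group == g], weights=weights[group == g])
         for g in ("A", "B")}
print(rates)
```

The resulting `weights` can be passed as `sample_weight` to most scikit-learn estimators' `fit` methods, and the before/after fairness metrics then make the required comparison.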
### Bonus 4: Production Monitoring Plan (5 points)

- Design a monitoring dashboard
- Define alerting thresholds
- Create model card documentation
- Plan an A/B test for deployment
## Grading Rubric

| Component | Points | Criteria |
|---|---|---|
| Dataset & EDA | 10 | Appropriate dataset, thorough EDA, clear documentation |
| Model Comparison | 25 | 3+ models, proper cross-validation, comprehensive metrics, good visualizations |
| Statistical Testing | 20 | Correct tests applied, proper interpretation, clear conclusions |
| Fairness Analysis | 25 | Multiple fairness metrics, group analysis, actionable insights |
| Final Recommendation | 20 | Well-reasoned decision, multi-objective analysis, clear documentation |
| Code Quality | 10 | Clean, documented, reproducible code |
| Report Quality | 10 | Clear writing, professional formatting, complete analysis |
| Bonus | +20 | Extra challenges completed |
| **TOTAL** | **120** | |
## Submission Requirements

### Files to Submit

1. **Jupyter Notebook** (`evaluation_pipeline.ipynb`)
   - All code with outputs
   - Markdown explanations
   - Visualizations embedded
2. **PDF Report** (`evaluation_report.pdf`)
   - Executive summary
   - Methods description
   - Results and findings
   - Final recommendation
   - Appendix with all plots
3. **README** (`README.md`)
   - How to run the code
   - Dependencies list
   - Dataset download instructions
   - Expected runtime

### Submission Format

```
phase15_assignment_[your_name]/
├── evaluation_pipeline.ipynb
├── evaluation_report.pdf
├── README.md
├── data/
│   └── dataset.csv (if small)
└── figures/
    └── *.png (exported plots)
```
## Tips for Success

1. **Start early**
   - This assignment requires significant analysis
   - Leave time for statistical testing and report writing
2. **Document everything**
   - Explain your choices
   - Interpret results in context
   - Justify your recommendation
3. **Focus on insights**
   - Don't just report numbers
   - Explain what they mean
   - Connect them to real-world impact
4. **Be thorough**
   - Check all requirements
   - Include all visualizations
   - Verify your code runs end-to-end
5. **Professional presentation**
   - Clean code with comments
   - Well-formatted plots
   - Polished report
## Example Structure

```python
# 1. Data Loading & EDA
# 2. Data Preprocessing
# 3. Model Training
# 4. Cross-Validation Evaluation
# 5. Statistical Comparison
# 6. Fairness Analysis
# 7. Multi-Objective Selection
# 8. Final Recommendation
```
## Resources

- Scikit-learn documentation: https://scikit-learn.org/stable/
- Fairlearn: https://fairlearn.org/
- Statistical testing guide: `scipy.stats`
- All Phase 15 notebooks
## FAQ

**Q: Can I use my own dataset?**
A: Yes, if it meets the requirements (1,000+ samples, has a protected attribute).

**Q: What if my data doesn't have protected attributes?**
A: Use UCI Adult, Credit Approval, or a similar public dataset with demographic information.

**Q: How many models minimum?**
A: 3 different model types are required. More is better!

**Q: Can I work in a group?**
A: Check with the instructor. Individual work is usually preferred.

**Q: What if my models perform very similarly?**
A: Great! Discuss why statistical testing is important, and recommend based on other factors (speed, interpretability, cost).
## Learning Outcomes

After completing this assignment, you will be able to:

- Design comprehensive model evaluation pipelines
- Select appropriate metrics for different scenarios
- Perform rigorous statistical comparisons
- Identify and measure model bias
- Make data-driven deployment decisions
- Communicate technical findings effectively

Good luck! This is your chance to demonstrate mastery of model evaluation!