Chapter 1: Introduction (Additional Problems)¶
Problem 1.1: Dataset Analysis¶
Load the fetch_openml('titanic', version=1) dataset
Identify all variable types (numerical, categorical)
Determine if this is regression or classification
Calculate survival rate by passenger class
Create 3 visualizations showing key patterns
# Your implementation here
Problem 1.2: Model Comparison¶
Use California Housing dataset
Compare Linear Regression, Decision Tree, Random Forest
Calculate R², RMSE for each on test set
Plot predictions vs actual for best model
Explain which model is best and why
# Your implementation here
Problem 1.3: Overfitting Deep Dive¶
Generate data: y = 3x² - 2x + 1 + noise
Fit polynomials of degree 1-20
Plot both training and validation error curves
Identify optimal degree
Explain bias-variance trade-off observed
# Your implementation here
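A minimal sketch of the degree sweep, assuming scikit-learn; degrees are trimmed to 1-10 here to keep the example fast, and the error curves are stored rather than plotted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 3 * x**2 - 2 * x + 1 + rng.normal(0, 1, 200)  # true quadratic + noise
X_tr, X_val, y_tr, y_val = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0)

errors = {}  # degree -> (train MSE, validation MSE)
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                      mean_squared_error(y_val, model.predict(X_val)))

# Optimal degree = smallest validation error (should land near 2)
best_degree = min(errors, key=lambda d: errors[d][1])
```

Training error falls monotonically with degree while validation error is U-shaped; the gap between the two curves is the variance side of the trade-off.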
Chapter 2: Statistical Learning (Additional Problems)¶
Problem 2.1: Bias-Variance Decomposition¶
Simulate data with known function
For polynomial degrees 1, 5, 15:
Estimate bias
Estimate variance
Calculate MSE
Visualize the bias² + variance = MSE relationship
Identify optimal complexity
# Your implementation here
Problem 2.2: Irreducible Error¶
Generate: y = sin(x) + noise with σ = 0.5
Fit the perfect model (the true function is known)
Show that error cannot go below σ²
Demonstrate with multiple noise levels
Discuss practical implications
# Your implementation here
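The noise-floor demonstration needs nothing beyond NumPy: predict with the true function sin(x) and check that the MSE still sits at σ² for each noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 100_000)

mses = {}  # sigma -> MSE of the "perfect" model
for sigma in (0.1, 0.5, 1.0):
    y = np.sin(x) + rng.normal(0, sigma, x.size)
    # The perfect model predicts sin(x); residuals are pure noise,
    # so the MSE converges to sigma^2 as n grows.
    mses[sigma] = np.mean((y - np.sin(x)) ** 2)
```

No model, however flexible, can push test error below this irreducible σ² floor, which is why reported errors should always be read relative to the noise level of the problem.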
Problem 2.3: Curse of Dimensionality¶
Generate uniform data in 1D, 2D, 5D, 10D
For each dimension, calculate fraction of data within unit ball
Visualize sparsity as dimension increases
Explain impact on k-NN
Suggest solutions
# Your implementation here
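The sparsity calculation can be sketched with a Monte Carlo estimate: sample uniformly from the hypercube [-1, 1]^d and measure the fraction falling inside the inscribed unit ball.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
fractions = {}  # dimension -> fraction of points inside the unit ball
for d in (1, 2, 5, 10):
    pts = rng.uniform(-1, 1, (n, d))  # uniform in the hypercube [-1, 1]^d
    fractions[d] = np.mean(np.linalg.norm(pts, axis=1) <= 1.0)
```

In 1D every point is inside; in 2D about π/4 ≈ 0.785 are; by 10D the fraction is a fraction of a percent, so a k-NN query's "nearest" neighbors are no longer local.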
Chapter 3: Linear Regression (Additional Problems)¶
Problem 3.1: Multiple Collinearity¶
# Create correlated features
import numpy as np

np.random.seed(0)  # fix seed for reproducibility
X1 = np.random.randn(100)
X2 = X1 + np.random.randn(100) * 0.1  # highly correlated with X1
X3 = np.random.randn(100)
y = 2 * X1 + 3 * X3 + np.random.randn(100)  # X2 adds no independent signal
# Tasks:
# - Fit regression with all features
# - Calculate VIF (Variance Inflation Factor)
# - Compare with Ridge regression
# - Explain coefficient instability
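One way to compute VIF without statsmodels is directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A sketch using the same generator as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X1 = rng.standard_normal(100)
X2 = X1 + rng.standard_normal(100) * 0.1  # highly correlated with X1
X3 = rng.standard_normal(100)
X = np.column_stack([X1, X2, X3])

vifs = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    # R^2 of feature j regressed on all other features
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
```

A common rule of thumb flags VIF above 5-10; here X1 and X2 should come out far above that while X3 stays near 1.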
Problem 3.2: Residual Diagnostics¶
Load the Boston Housing dataset (removed from scikit-learn 1.2; use an archived copy or substitute California Housing)
Fit linear regression
Create 4-panel diagnostic plot:
Residuals vs Fitted
Q-Q plot
Scale-Location plot
Residuals vs Leverage
Identify violations of assumptions
Suggest transformations
# Your implementation here
Problem 3.3: Interaction Terms¶
# Advertising data with interaction
# Sales = β0 + β1*TV + β2*Radio + β3*TV*Radio + ε
# Tasks:
# - Fit model with and without interaction
# - Compare R² and adjusted R²
# - Test interaction significance
# - Interpret interaction coefficient
# - Visualize interaction effect
Chapter 4: Classification (Additional Problems)¶
Problem 4.1: Imbalanced Classes¶
Create dataset: 95% class 0, 5% class 1
Train classifier on imbalanced data
Calculate accuracy, precision, recall
Apply SMOTE or class weights
Compare performance before/after
# Your implementation here
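A minimal sketch of the class-weights remedy (SMOTE would need the separate imbalanced-learn package); the synthetic dataset and logistic-regression base model are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 95% / 5% class imbalance
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

# Minority-class recall before and after reweighting
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

Accuracy alone looks fine in both cases (predicting all-zero already scores 95%), which is exactly why the exercise asks for precision and recall on the minority class.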
Problem 4.2: ROC Curves¶
Use breast cancer dataset
Train Logistic Regression, SVM, Random Forest
Plot ROC curves for all three
Calculate AUC for each
Identify optimal threshold for each
Compare at different operating points
# Your implementation here
Problem 4.3: Multi-class Classification¶
Load digits dataset (10 classes)
Implement OvR (One vs Rest)
Implement OvO (One vs One)
Create confusion matrix heatmap
Calculate per-class precision/recall
Identify most confused class pairs
# Your implementation here
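A sketch of the two decomposition strategies using scikit-learn's meta-estimators (logistic regression as the base learner is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=2000)
ovr = OneVsRestClassifier(base).fit(X_tr, y_tr)  # 10 binary classifiers
ovo = OneVsOneClassifier(base).fit(X_tr, y_tr)   # 45 pairwise classifiers

acc_ovr = ovr.score(X_te, y_te)
acc_ovo = ovo.score(X_te, y_te)
cm = confusion_matrix(y_te, ovr.predict(X_te))  # 10x10; heatmap this
```

The off-diagonal maxima of `cm` identify the most confused digit pairs for the per-class analysis.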
Chapter 5: Resampling Methods (Additional Problems)¶
Problem 5.1: Cross-Validation Comparison¶
Use same dataset
Compare: LOOCV, 5-fold, 10-fold, 20-fold
For each: calculate mean and std of error
Compare computational time
Plot error vs number of folds
Recommend best approach
# Your implementation here
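A sketch of the comparison, using the diabetes dataset as a stand-in for "same dataset"; LOOCV is just n-fold CV, handled by `LeaveOneOut`:

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

results = {}  # name -> (mean MSE, std of fold MSEs, wall time in s)
for name, cv in [("5-fold", KFold(5, shuffle=True, random_state=0)),
                 ("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("20-fold", KFold(20, shuffle=True, random_state=0)),
                 ("LOOCV", LeaveOneOut())]:
    t0 = time.perf_counter()
    scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error")
    results[name] = (scores.mean(), scores.std(), time.perf_counter() - t0)
```

Expect the mean errors to agree closely while the fold-to-fold std and the wall time diverge sharply: LOOCV fits the model n times versus 5 for 5-fold.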
Problem 5.2: Bootstrap Confidence Intervals¶
# Estimate CI for correlation coefficient
def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]
# Tasks:
# - Generate bivariate normal data
# - Bootstrap 10,000 times
# - Calculate 95% CI (percentile method)
# - Compare with analytical CI
# - Visualize bootstrap distribution
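The full loop can be sketched end to end with NumPy only (true correlation set to 0.6 as an illustrative choice; percentile CI, no analytical comparison):

```python
import numpy as np

def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]

rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]  # true correlation 0.6
data = rng.multivariate_normal([0, 0], cov, size=200)

# Resample row indices with replacement 10,000 times
n = len(data)
boots = np.array([boot_corr(data, rng.integers(0, n, n))
                  for _ in range(10_000)])

# 95% percentile-method confidence interval
lo, hi = np.percentile(boots, [2.5, 97.5])
```

A histogram of `boots` with `lo` and `hi` marked gives the requested visualization; the Fisher-z interval is the natural analytical comparison.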
Problem 5.3: Nested Cross-Validation¶
Outer loop: 5-fold CV for performance
Inner loop: 3-fold CV for hyperparameter tuning
Apply to SVM with C and gamma parameters
Report unbiased performance estimate
Compare with simple CV
# Your implementation here
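The nesting itself is just one estimator inside another: a `GridSearchCV` (inner loop) passed to `cross_val_score` (outer loop). A sketch with the fold counts above, using breast cancer as a small stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

inner = KFold(3, shuffle=True, random_state=0)   # tunes C and gamma
outer = KFold(5, shuffle=True, random_state=0)   # estimates performance

search = GridSearchCV(pipe, param_grid, cv=inner)
nested_scores = cross_val_score(search, X, y, cv=outer)
```

Because the outer test folds never touch hyperparameter selection, `nested_scores.mean()` is the unbiased estimate; comparing it against `search.fit(X, y).best_score_` shows the optimism of simple CV.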
Chapter 6: Regularization (Additional Problems)¶
Problem 6.1: Regularization Path¶
Load diabetes dataset
For Ridge: try λ from 10⁻⁴ to 10⁴
For Lasso: try λ from 10⁻⁴ to 10⁴
Plot coefficient paths
Identify when coefficients hit zero (Lasso)
Select λ via CV
# Your implementation here
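For the Lasso half, scikit-learn's `lasso_path` computes the whole coefficient path at once; Ridge works analogously by looping over alphas. A sketch on diabetes with the λ range above:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # paths only comparable on one scale

alphas = np.logspace(-4, 4, 100)
# coefs has shape (n_features, n_alphas); plot each row vs alphas_out
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas)

n_zero_at_largest = int((coefs[:, alphas_out.argmax()] == 0).sum())
n_zero_at_smallest = int((coefs[:, alphas_out.argmin()] == 0).sum())
```

At the largest λ every coefficient has been driven exactly to zero; scanning the path for the λ at which each row first hits zero answers the "when do coefficients hit zero" task.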
Problem 6.2: Elastic Net Tuning¶
Grid search over α ∈ [0, 1] and λ
α=0 (Ridge), α=1 (Lasso), α=0.5 (middle)
Use 10-fold CV
Create heatmap of CV error
Identify optimal (α, λ) pair
Compare with pure Ridge/Lasso
# Your implementation here
Problem 6.3: Feature Selection Comparison¶
methods = ['Forward Selection', 'Backward Selection', 'Lasso']
# Tasks:
# - Apply all three to high-dimensional data
# - Compare selected features
# - Compare prediction performance
# - Analyze computational cost
# - Discuss stability of selections
Chapter 7: Non-Linearity (Additional Problems)¶
Problem 7.1: Spline Knot Selection¶
Generate non-linear data
Fit B-splines with 3, 5, 10, 20 knots
Use CV to select optimal knots
Plot fitted curves
Compare with polynomial approach
# Your implementation here
Problem 7.2: GAM Implementation¶
from pygam import GAM, s
# Fit: y ~ s(x1) + s(x2) + s(x3)
# Tasks:
# - Load auto dataset
# - Fit GAM with smoothing splines
# - Extract partial dependence plots
# - Compare with linear model
# - Interpret smooth functions
Problem 7.3: Local Regression (LOESS)¶
Implement or use LOESS
Try span values: 0.1, 0.3, 0.5, 0.75
Visualize fits
Calculate CV error for each
Compare with polynomial regression
# Your implementation here
Chapter 8: Tree Methods (Additional Problems)¶
Problem 8.1: Tree Pruning¶
Grow deep tree (max_depth=20)
Extract cost-complexity path
Plot CV error vs tree size
Select optimal tree size
Compare full vs pruned tree
# Your implementation here
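A sketch of cost-complexity pruning via scikit-learn's built-in path (breast cancer as a stand-in dataset; the clipping guards against tiny negative alphas from floating-point error):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees are pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)
alphas = np.clip(path.ccp_alphas[:-1], 0, None)  # drop root-only alpha

cv_means = [cross_val_score(
                DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                X_tr, y_tr, cv=5).mean()
            for a in alphas]
best_alpha = alphas[int(np.argmax(cv_means))]
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=best_alpha).fit(X_tr, y_tr)
test_acc = pruned.score(X_te, y_te)
```

Plotting `cv_means` against the number of leaves of each candidate tree gives the requested CV-error-vs-size curve.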
Problem 8.2: Feature Importance Analysis¶
Train Random Forest on iris
Extract feature importances
Create bar plot
Permutation importance
Compare Gini vs permutation importance
Test stability with different seeds
# Your implementation here
Problem 8.3: Boosting Hyperparameters¶
Grid search over:
n_estimators: [50, 100, 200, 500]
learning_rate: [0.01, 0.05, 0.1, 0.5]
max_depth: [1, 3, 5, 7]
Tasks:
Use early stopping
Visualize learning curves
Report best combination
# Your implementation here
Chapter 9: SVM (Additional Problems)¶
Problem 9.1: Kernel Comparison¶
Kernels to try:
Linear
Polynomial (degree 2, 3, 5)
RBF (γ = 0.001, 0.01, 0.1, 1)
Sigmoid
Tasks:
Plot decision boundaries (2D)
Calculate accuracy for each
Visualize support vectors
Identify best kernel
# Your implementation here
Problem 9.2: Soft Margin Analysis¶
Try C values: [0.01, 0.1, 1, 10, 100]
For each C, count support vectors
Plot: # SVs vs C
Plot: test error vs C
Plot: margin width vs C
Explain trade-offs
# Your implementation here
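The support-vector count is available directly from a fitted `SVC`. A sketch on synthetic overlapping 2-D blobs (an illustrative dataset choice):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters so the soft margin matters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

n_svs = {}  # C -> total number of support vectors
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_svs[C] = int(clf.n_support_.sum())
```

Small C buys a wide margin at the cost of many margin violations (many support vectors); large C narrows the margin and the support set shrinks toward the points nearest the boundary.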
Problem 9.3: SVM for Regression (SVR)¶
from sklearn.svm import SVR
# Tasks:
# - Use California housing
# - Try different ε values
# - Compare with Ridge regression
# - Identify when SVR is beneficial
# - Visualize ε-tube
Chapter 10: Deep Learning (Additional Problems)¶
Problem 10.1: Architecture Search¶
Try architectures:
[16]
[32, 16]
[64, 32, 16]
[128, 64, 32, 16]
[256, 128, 64, 32, 16]
For each:
Train on MNIST
Plot learning curves
Calculate test accuracy
Compare training time
Identify best architecture
# Your implementation here
Problem 10.2: Regularization Comparison¶
On same architecture, compare:
No regularization
L2 (α = 0.0001, 0.001, 0.01)
Dropout (p = 0.2, 0.5, 0.8)
Early stopping
L2 + Dropout
Plot validation curves
Identify best combination
# Your implementation here
Problem 10.3: Activation Functions¶
Compare: ReLU, Leaky ReLU, ELU, tanh, sigmoid
Train on same dataset
Plot learning curves
Compare convergence speed
Analyze gradient flow
Recommend best choice
# Your implementation here
Chapter 11: Survival Analysis (Additional Problems)¶
Problem 11.1: Kaplan-Meier Comparison¶
Load lung cancer dataset
Compare survival by sex
Perform log-rank test
Calculate median survival for each
Plot survival curves with CI
Interpret results
# Your implementation here
Problem 11.2: Cox PH Interpretation¶
from lifelines import CoxPHFitter
# Tasks:
# - Fit Cox model with multiple covariates
# - Extract hazard ratios
# - Calculate 95% CI for each
# - Test proportional hazards assumption
# - Interpret coefficients
# - Make predictions
Problem 11.3: Time-Dependent Effects¶
Check PH assumption
If violated, use stratification
Or fit time-varying coefficient model
Compare models with AIC
Visualize time-varying effect
# Your implementation here
Chapter 12: Unsupervised Learning (Additional Problems)¶
Problem 12.1: PCA Deep Dive¶
Load handwritten digits
Apply PCA
Plot scree plot
Determine components for 90%, 95%, 99% variance
Visualize first 4 PCs
Reconstruct images with different # PCs
Show reconstruction error vs # components
# Your implementation here
Problem 12.2: Clustering Comparison¶
On same dataset, compare:
K-Means (k=2 to 10)
Hierarchical (4 linkages)
DBSCAN (vary eps and min_samples)
Gaussian Mixture Models
Metrics:
Silhouette score
Davies-Bouldin index
Calinski-Harabasz index
# Your implementation here
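The three metrics can be sketched for K-Means alone (iris as a stand-in dataset; the other algorithms slot into the same loop):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = load_iris(return_X_y=True)

scores = {}  # k -> metric values
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {"silhouette": silhouette_score(X, labels),
                 "davies_bouldin": davies_bouldin_score(X, labels),
                 "calinski_harabasz": calinski_harabasz_score(X, labels)}
```

Note the conventions differ: silhouette and Calinski-Harabasz are higher-is-better, Davies-Bouldin is lower-is-better. On iris the silhouette favors k=2 even though there are three species, a useful talking point for the comparison.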
Problem 12.3: Hierarchical Clustering Analysis¶
Try all 4 linkage methods
Create dendrograms
Cut at different heights
Compare resulting clusters
Calculate cophenetic correlation
Identify best linkage
# Your implementation here
Chapter 13: Multiple Testing (Additional Problems)¶
Problem 13.1: Simulation Study¶
Design:
m = 1000 tests
m₀ = 800 true nulls
Effect size = 0.5
n = 30 per group
Tasks:
Run 1000 simulations
For each: apply Bonferroni, Holm, BH, BY
Calculate FWER, FDR, power for each
Verify theoretical guarantees
Plot distributions
# Your implementation here
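A single simulation round of the design above, with Bonferroni and a hand-rolled Benjamini-Hochberg step-up to make the rule explicit (SciPy for the t-tests; Holm and BY follow the same pattern):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, m0, effect, n = 1000, 800, 0.5, 30

pvals = np.empty(m)
for i in range(m):
    mu = 0.0 if i < m0 else effect  # first 800 tests are true nulls
    a = rng.normal(0, 1, n)
    b = rng.normal(mu, 1, n)
    pvals[i] = stats.ttest_ind(a, b).pvalue

alpha = 0.05
bonf_reject = pvals < alpha / m

# Benjamini-Hochberg step-up: reject all tests up to the largest k with
# p_(k) <= alpha * k / m
order = np.argsort(pvals)
thresh = alpha * np.arange(1, m + 1) / m
passed = np.nonzero(pvals[order] <= thresh)[0]
bh_reject = np.zeros(m, bool)
if passed.size:
    bh_reject[order[:passed.max() + 1]] = True

# Power = fraction of the 200 true effects detected
power_bonf = bonf_reject[m0:].mean()
power_bh = bh_reject[m0:].mean()
```

Wrapping this in an outer loop over 1000 seeds and averaging the realized FWER, FDR, and power gives the full study; BH always rejects a superset of Bonferroni's rejections, which the power comparison should reflect.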
Problem 13.2: Genomics Application¶
Simulate gene expression:
10,000 genes
20 samples (10 per group)
100 truly differentially expressed
Apply multiple testing corrections
Create volcano plot
Manhattan plot
Report discoveries
# Your implementation here
Problem 13.3: Power Analysis¶
For fixed m = 100:
Vary n: [10, 20, 30, 50, 100]
Vary effect size: [0.2, 0.5, 0.8]
For each combination:
Calculate power under BH
Calculate power under Bonferroni
Create heatmaps
Determine required n for 80% power
# Your implementation here
Project Ideas¶
Project 1: Complete ML Pipeline¶
Dataset: Titanic survival
EDA with 10+ visualizations
Feature engineering
Try 5 different models
Hyperparameter tuning
Ensemble methods
Final evaluation
Professional report
# Project 1 implementation
Project 2: Time Series Prediction¶
Dataset: Stock prices or weather
Explore temporal patterns
Create lag features
Train-test split (temporal)
Compare: ARIMA, Random Forest, LSTM
Evaluate predictions
Discuss limitations
# Project 2 implementation
Project 3: Image Classification¶
Dataset: CIFAR-10 or Fashion-MNIST
Build CNN from scratch
Try transfer learning
Data augmentation
Compare architectures
Visualize learned features
Report final accuracy
# Project 3 implementation
Challenge Problems¶
Challenge 1: Build from Scratch¶
Implement without sklearn:
Linear regression (gradient descent)
Logistic regression
k-NN
Decision tree
k-Means
PCA
Compare performance with sklearn
# Challenge 1 implementation
Challenge 2: Kaggle Competition¶
Choose active competition
Form team or solo
Apply techniques from all chapters
Document approach
Submit predictions
Aim for top 25%
# Challenge 2 implementation
Challenge 3: Research Paper Implementation¶
Choose recent ML paper
Understand methodology
Implement algorithm
Reproduce results
Apply to new dataset
Write summary report
# Challenge 3 implementation
Solutions and Hints¶
General Tips¶
For All Exercises:
Always set random seed for reproducibility
Split data before any analysis
Scale features when necessary
Use cross-validation for model selection
Visualize results
Document your code
Interpret findings
Common Mistakes to Avoid:
Data leakage (scaling before split)
Not validating assumptions
Overfitting to test set
Ignoring class imbalance
Not checking for multicollinearity
Forgetting to standardize
Debugging Checklist:
Data loaded correctly?
Train-test split done?
Features scaled?
No data leakage?
Sensible hyperparameters?
Converged properly?
Results reproducible?
Progress Tracker¶
Beginner Level (Complete 5):
Chapter 1: Exercises 1.1-1.3
Chapter 2: Exercises 2.1-2.3
Chapter 3: Exercises 3.1-3.3
Chapter 4: Exercises 4.1-4.3
Chapter 5: Exercises 5.1-5.3
Intermediate Level (Complete 5):
Chapter 6: Exercises 6.1-6.3
Chapter 7: Exercises 7.1-7.3
Chapter 8: Exercises 8.1-8.3
Chapter 9: Exercises 9.1-9.3
Chapter 10: Exercises 10.1-10.3
Advanced Level (Complete 5):
Chapter 11: Exercises 11.1-11.3
Chapter 12: Exercises 12.1-12.3
Chapter 13: Exercises 13.1-13.3
2 Projects from list
1 Challenge problem
Master Level (Complete all):
All chapter exercises
All 3 projects
3+ challenge problems
Kaggle competition participation
Research paper implementation
Certificate of Completion¶
Upon completing all exercises, you will have:
✅ Mastered statistical learning fundamentals
✅ Implemented 50+ algorithms
✅ Completed 3+ real-world projects
✅ Built comprehensive ML portfolio
✅ Prepared for industry or research positions
Document your journey and share your work!
Good luck with your statistical learning journey! 🚀