Chapter 1: Introduction (Additional Problems)¶
Problem 1.1: Dataset Analysis¶
Load the fetch_openml('titanic', version=1) dataset
Identify all variable types (numerical, categorical)
Determine if this is regression or classification
Calculate survival rate by passenger class
Create 3 visualizations showing key patterns
# Your implementation here
Problem 1.2: Model Comparison¶
Use California Housing dataset
Compare Linear Regression, Decision Tree, Random Forest
Calculate R², RMSE for each on test set
Plot predictions vs actual for best model
Explain which model is best and why
# Your implementation here
Problem 1.3: Overfitting Deep Dive¶
Generate data: y = 3x² - 2x + 1 + noise
Fit polynomials of degree 1-20
Plot both training and validation error curves
Identify optimal degree
Explain bias-variance trade-off observed
# Your implementation here
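A minimal sketch of the degree sweep, assuming scikit-learn; degrees are trimmed to 1-10 here to keep the example fast, and the error curves are stored rather than plotted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 3 * x**2 - 2 * x + 1 + rng.normal(0, 1, 200)  # true quadratic + noise
X_tr, X_val, y_tr, y_val = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0)

errors = {}  # degree -> (train MSE, validation MSE)
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                      mean_squared_error(y_val, model.predict(X_val)))

# Optimal degree = smallest validation error (should land near 2)
best_degree = min(errors, key=lambda d: errors[d][1])
```

Training error falls monotonically with degree while validation error is U-shaped; the gap between the two curves is the variance side of the trade-off.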
Chapter 2: Statistical Learning (Additional Problems)¶
Problem 2.1: Bias-Variance Decomposition¶
Simulate data with known function
For polynomial degrees 1, 5, 15:
Estimate bias
Estimate variance
Calculate MSE
Visualize the bias² + variance = MSE relationship
Identify optimal complexity
# Your implementation here
Problem 2.2: Irreducible Error¶
Generate: y = sin(x) + noise with σ = 0.5
Fit the perfect model (the true function is known)
Show that error cannot go below σ²
Demonstrate with multiple noise levels
Discuss practical implications
# Your implementation here
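The noise-floor demonstration needs nothing beyond NumPy: predict with the true function sin(x) and check that the MSE still sits at σ² for each noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 100_000)

mses = {}  # sigma -> MSE of the "perfect" model
for sigma in (0.1, 0.5, 1.0):
    y = np.sin(x) + rng.normal(0, sigma, x.size)
    # The perfect model predicts sin(x); residuals are pure noise,
    # so the MSE converges to sigma^2 as n grows.
    mses[sigma] = np.mean((y - np.sin(x)) ** 2)
```

No model, however flexible, can push test error below this irreducible σ² floor, which is why reported errors should always be read relative to the noise level of the problem.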
Problem 2.3: Curse of Dimensionality¶
Generate uniform data in 1D, 2D, 5D, 10D
For each dimension, calculate fraction of data within unit ball
Visualize sparsity as dimension increases
Explain impact on k-NN
Suggest solutions
# Your implementation here
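The sparsity calculation can be sketched with a Monte Carlo estimate: sample uniformly from the hypercube [-1, 1]^d and measure the fraction falling inside the inscribed unit ball.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
fractions = {}  # dimension -> fraction of points inside the unit ball
for d in (1, 2, 5, 10):
    pts = rng.uniform(-1, 1, (n, d))  # uniform in the hypercube [-1, 1]^d
    fractions[d] = np.mean(np.linalg.norm(pts, axis=1) <= 1.0)
```

In 1D every point is inside; in 2D about π/4 ≈ 0.785 are; by 10D the fraction is a fraction of a percent, so a k-NN query's "nearest" neighbors are no longer local.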
Chapter 3: Linear Regression (Additional Problems)¶
Problem 3.1: Multiple Collinearity¶
# Create correlated features
import numpy as np

np.random.seed(0)  # fix seed for reproducibility
X1 = np.random.randn(100)
X2 = X1 + np.random.randn(100) * 0.1  # highly correlated with X1
X3 = np.random.randn(100)
y = 2 * X1 + 3 * X3 + np.random.randn(100)  # X2 adds no independent signal
# Tasks:
# - Fit regression with all features
# - Calculate VIF (Variance Inflation Factor)
# - Compare with Ridge regression
# - Explain coefficient instability
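One way to compute VIF without statsmodels is directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A sketch using the same generator as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X1 = rng.standard_normal(100)
X2 = X1 + rng.standard_normal(100) * 0.1  # highly correlated with X1
X3 = rng.standard_normal(100)
X = np.column_stack([X1, X2, X3])

vifs = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    # R^2 of feature j regressed on all other features
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
```

A common rule of thumb flags VIF above 5-10; here X1 and X2 should come out far above that while X3 stays near 1.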
Problem 3.2: Residual Diagnostics¶
Load the Boston Housing dataset (removed from scikit-learn 1.2; use an archived copy or substitute California Housing)
Fit linear regression
Create 4-panel diagnostic plot:
Residuals vs Fitted
Q-Q plot
Scale-Location plot
Residuals vs Leverage
Identify violations of assumptions
Suggest transformations
# Your implementation here
Problem 3.3: Interaction Terms¶
# Advertising data with interaction
# Sales = β0 + β1*TV + β2*Radio + β3*TV*Radio + ε
# Tasks:
# - Fit model with and without interaction
# - Compare R² and adjusted R²
# - Test interaction significance
# - Interpret interaction coefficient
# - Visualize interaction effect
Chapter 4: Classification (Additional Problems)¶
Problem 4.1: Imbalanced Classes¶
Create dataset: 95% class 0, 5% class 1
Train classifier on imbalanced data
Calculate accuracy, precision, recall
Apply SMOTE or class weights
Compare performance before/after
# Your implementation here
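A minimal sketch of the class-weights remedy (SMOTE would need the separate imbalanced-learn package); the synthetic dataset and logistic-regression base model are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 95% / 5% class imbalance
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

# Minority-class recall before and after reweighting
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

Accuracy alone looks fine in both cases (predicting all-zero already scores 95%), which is exactly why the exercise asks for precision and recall on the minority class.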
Problem 4.2: ROC Curves¶
Use breast cancer dataset
Train Logistic Regression, SVM, Random Forest
Plot ROC curves for all three
Calculate AUC for each
Identify optimal threshold for each
Compare at different operating points
# Your implementation here
Problem 4.3: Multi-class Classification¶
Load digits dataset (10 classes)
Implement OvR (One vs Rest)
Implement OvO (One vs One)
Create confusion matrix heatmap
Calculate per-class precision/recall
Identify most confused class pairs
# Your implementation here
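A sketch of the two decomposition strategies using scikit-learn's meta-estimators (logistic regression as the base learner is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=2000)
ovr = OneVsRestClassifier(base).fit(X_tr, y_tr)  # 10 binary classifiers
ovo = OneVsOneClassifier(base).fit(X_tr, y_tr)   # 45 pairwise classifiers

acc_ovr = ovr.score(X_te, y_te)
acc_ovo = ovo.score(X_te, y_te)
cm = confusion_matrix(y_te, ovr.predict(X_te))  # 10x10; heatmap this
```

The off-diagonal maxima of `cm` identify the most confused digit pairs for the per-class analysis.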
Chapter 5: Resampling Methods (Additional Problems)¶
Problem 5.1: Cross-Validation Comparison¶
Use same dataset
Compare: LOOCV, 5-fold, 10-fold, 20-fold
For each: calculate mean and std of error
Compare computational time
Plot error vs number of folds
Recommend best approach
# Your implementation here
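A sketch of the comparison, using the diabetes dataset as a stand-in for "same dataset"; LOOCV is just n-fold CV, handled by `LeaveOneOut`:

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

results = {}  # name -> (mean MSE, std of fold MSEs, wall time in s)
for name, cv in [("5-fold", KFold(5, shuffle=True, random_state=0)),
                 ("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("20-fold", KFold(20, shuffle=True, random_state=0)),
                 ("LOOCV", LeaveOneOut())]:
    t0 = time.perf_counter()
    scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error")
    results[name] = (scores.mean(), scores.std(), time.perf_counter() - t0)
```

Expect the mean errors to agree closely while the fold-to-fold std and the wall time diverge sharply: LOOCV fits the model n times versus 5 for 5-fold.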
Problem 5.2: Bootstrap Confidence Intervals¶
# Estimate CI for correlation coefficient
def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]
# Tasks:
# - Generate bivariate normal data
# - Bootstrap 10,000 times
# - Calculate 95% CI (percentile method)
# - Compare with analytical CI
# - Visualize bootstrap distribution
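The full loop can be sketched end to end with NumPy only (true correlation set to 0.6 as an illustrative choice; percentile CI, no analytical comparison):

```python
import numpy as np

def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]

rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]  # true correlation 0.6
data = rng.multivariate_normal([0, 0], cov, size=200)

# Resample row indices with replacement 10,000 times
n = len(data)
boots = np.array([boot_corr(data, rng.integers(0, n, n))
                  for _ in range(10_000)])

# 95% percentile-method confidence interval
lo, hi = np.percentile(boots, [2.5, 97.5])
```

A histogram of `boots` with `lo` and `hi` marked gives the requested visualization; the Fisher-z interval is the natural analytical comparison.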
Problem 5.3: Nested Cross-Validation¶
Outer loop: 5-fold CV for performance
Inner loop: 3-fold CV for hyperparameter tuning
Apply to SVM with C and gamma parameters
Report unbiased performance estimate
Compare with simple CV
# Your implementation here
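The nesting itself is just one estimator inside another: a `GridSearchCV` (inner loop) passed to `cross_val_score` (outer loop). A sketch with the fold counts above, using breast cancer as a small stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

inner = KFold(3, shuffle=True, random_state=0)   # tunes C and gamma
outer = KFold(5, shuffle=True, random_state=0)   # estimates performance

search = GridSearchCV(pipe, param_grid, cv=inner)
nested_scores = cross_val_score(search, X, y, cv=outer)
```

Because the outer test folds never touch hyperparameter selection, `nested_scores.mean()` is the unbiased estimate; comparing it against `search.fit(X, y).best_score_` shows the optimism of simple CV.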
Chapter 6: Regularization (Additional Problems)¶
Problem 6.1: Regularization Path¶
Load diabetes dataset
For Ridge: try λ from 10⁻⁴ to 10⁴
For Lasso: try λ from 10⁻⁴ to 10⁴
Plot coefficient paths
Identify when coefficients hit zero (Lasso)
Select λ via CV
# Your implementation here
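For the Lasso half, scikit-learn's `lasso_path` computes the whole coefficient path at once; Ridge works analogously by looping over alphas. A sketch on diabetes with the λ range above:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # paths only comparable on one scale

alphas = np.logspace(-4, 4, 100)
# coefs has shape (n_features, n_alphas); plot each row vs alphas_out
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas)

n_zero_at_largest = int((coefs[:, alphas_out.argmax()] == 0).sum())
n_zero_at_smallest = int((coefs[:, alphas_out.argmin()] == 0).sum())
```

At the largest λ every coefficient has been driven exactly to zero; scanning the path for the λ at which each row first hits zero answers the "when do coefficients hit zero" task.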
Problem 6.2: Elastic Net Tuning¶
Grid search over α ∈ [0, 1] and λ
α=0 (Ridge), α=1 (Lasso), α=0.5 (middle)
Use 10-fold CV
Create heatmap of CV error
Identify optimal (α, λ) pair
Compare with pure Ridge/Lasso
# Your implementation here
Problem 6.3: Feature Selection Comparison¶
methods = ['Forward Selection', 'Backward Selection', 'Lasso']
# Tasks:
# - Apply all three to high-dimensional data
# - Compare selected features
# - Compare prediction performance
# - Analyze computational cost
# - Discuss stability of selections
Chapter 7: Non-Linearity (Additional Problems)¶
Problem 7.1: Spline Knot Selection¶
Generate non-linear data
Fit B-splines with 3, 5, 10, 20 knots
Use CV to select optimal knots
Plot fitted curves
Compare with polynomial approach
# Your implementation here
Problem 7.2: GAM Implementation¶
from pygam import GAM, s
# Fit: y ~ s(x1) + s(x2) + s(x3)
# Tasks:
# - Load auto dataset
# - Fit GAM with smoothing splines
# - Extract partial dependence plots
# - Compare with linear model
# - Interpret smooth functions
Problem 7.3: Local Regression (LOESS)¶
Implement or use LOESS
Try span values: 0.1, 0.3, 0.5, 0.75
Visualize fits
Calculate CV error for each
Compare with polynomial regression
# Your implementation here
Chapter 8: Tree Methods (Additional Problems)¶
Problem 8.1: Tree Pruning¶
Grow deep tree (max_depth=20)
Extract cost-complexity path
Plot CV error vs tree size
Select optimal tree size
Compare full vs pruned tree
# Your implementation here
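A sketch of cost-complexity pruning via scikit-learn's built-in path (breast cancer as a stand-in dataset; the clipping guards against tiny negative alphas from floating-point error):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees are pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)
alphas = np.clip(path.ccp_alphas[:-1], 0, None)  # drop root-only alpha

cv_means = [cross_val_score(
                DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                X_tr, y_tr, cv=5).mean()
            for a in alphas]
best_alpha = alphas[int(np.argmax(cv_means))]
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=best_alpha).fit(X_tr, y_tr)
test_acc = pruned.score(X_te, y_te)
```

Plotting `cv_means` against the number of leaves of each candidate tree gives the requested CV-error-vs-size curve.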
Problem 8.2: Feature Importance Analysis¶
Train Random Forest on iris
Extract feature importances
Create bar plot
Permutation importance
Compare Gini vs permutation importance
Test stability with different seeds
# Your implementation here
Problem 8.3: Boosting Hyperparameters¶
Grid search over:
n_estimators: [50, 100, 200, 500]
learning_rate: [0.01, 0.05, 0.1, 0.5]
max_depth: [1, 3, 5, 7]
Tasks:
Use early stopping
Visualize learning curves
Report best combination
# Your implementation here
Chapter 9: SVM (Additional Problems)¶
Problem 9.1: Kernel Comparison¶
Kernels to try:
Linear
Polynomial (degree 2, 3, 5)
RBF (γ = 0.001, 0.01, 0.1, 1)
Sigmoid
Tasks:
Plot decision boundaries (2D)
Calculate accuracy for each
Visualize support vectors
Identify best kernel
# Your implementation here
Problem 9.2: Soft Margin Analysis¶
Try C values: [0.01, 0.1, 1, 10, 100]
For each C, count support vectors
Plot: # SVs vs C
Plot: test error vs C
Plot: margin width vs C
Explain trade-offs
# Your implementation here
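The support-vector count is available directly from a fitted `SVC`. A sketch on synthetic overlapping 2-D blobs (an illustrative dataset choice):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters so the soft margin matters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

n_svs = {}  # C -> total number of support vectors
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_svs[C] = int(clf.n_support_.sum())
```

Small C buys a wide margin at the cost of many margin violations (many support vectors); large C narrows the margin and the support set shrinks toward the points nearest the boundary.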
Problem 9.3: SVM for Regression (SVR)¶
from sklearn.svm import SVR
# Tasks:
# - Use California housing
# - Try different ε values
# - Compare with Ridge regression
# - Identify when SVR is beneficial
# - Visualize ε-tube
Chapter 10: Deep Learning (Additional Problems)¶
Problem 10.1: Architecture Search¶
Try architectures:
[16]
[32, 16]
[64, 32, 16]
[128, 64, 32, 16]
[256, 128, 64, 32, 16]
For each:
Train on MNIST
Plot learning curves
Calculate test accuracy
Compare training time
Identify best architecture
# Your implementation here
Problem 10.2: Regularization Comparison¶
On same architecture, compare:
No regularization
L2 (α = 0.0001, 0.001, 0.01)
Dropout (p = 0.2, 0.5, 0.8)
Early stopping
L2 + Dropout
Plot validation curves
Identify best combination
# Your implementation here
Problem 10.3: Activation Functions¶
Compare: ReLU, Leaky ReLU, ELU, tanh, sigmoid
Train on same dataset
Plot learning curves
Compare convergence speed
Analyze gradient flow
Recommend best choice
# Your implementation here
Chapter 11: Survival Analysis (Additional Problems)¶
Problem 11.1: Kaplan-Meier Comparison¶
Load lung cancer dataset
Compare survival by sex
Perform log-rank test
Calculate median survival for each
Plot survival curves with CI
Interpret results
# Your implementation here
Problem 11.2: Cox PH Interpretation¶
from lifelines import CoxPHFitter
# Tasks:
# - Fit Cox model with multiple covariates
# - Extract hazard ratios
# - Calculate 95% CI for each
# - Test proportional hazards assumption
# - Interpret coefficients
# - Make predictions
Problem 11.3: Time-Dependent Effects¶
Check PH assumption
If violated, use stratification
Or fit time-varying coefficient model
Compare models with AIC
Visualize time-varying effect
# Your implementation here
Chapter 12: Unsupervised Learning (Additional Problems)¶
Problem 12.1: PCA Deep Dive¶
Load handwritten digits
Apply PCA
Plot scree plot
Determine components for 90%, 95%, 99% variance
Visualize first 4 PCs
Reconstruct images with different # PCs
Show reconstruction error vs # components
# Your implementation here
Problem 12.2: Clustering Comparison¶
On same dataset, compare:
K-Means (k=2 to 10)
Hierarchical (4 linkages)
DBSCAN (vary eps and min_samples)
Gaussian Mixture Models
Metrics:
Silhouette score
Davies-Bouldin index
Calinski-Harabasz index
# Your implementation here
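The three metrics can be sketched for K-Means alone (iris as a stand-in dataset; the other algorithms slot into the same loop):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = load_iris(return_X_y=True)

scores = {}  # k -> metric values
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {"silhouette": silhouette_score(X, labels),
                 "davies_bouldin": davies_bouldin_score(X, labels),
                 "calinski_harabasz": calinski_harabasz_score(X, labels)}
```

Note the conventions differ: silhouette and Calinski-Harabasz are higher-is-better, Davies-Bouldin is lower-is-better. On iris the silhouette favors k=2 even though there are three species, a useful talking point for the comparison.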
Problem 12.3: Hierarchical Clustering Analysis¶
Try all 4 linkage methods
Create dendrograms
Cut at different heights
Compare resulting clusters
Calculate cophenetic correlation
Identify best linkage
# Your implementation here
Chapter 13: Multiple Testing (Additional Problems)¶
Problem 13.1: Simulation Study¶
Design:
m = 1000 tests
m₀ = 800 true nulls
Effect size = 0.5
n = 30 per group
Tasks:
Run 1000 simulations
For each: apply Bonferroni, Holm, BH, BY
Calculate FWER, FDR, power for each
Verify theoretical guarantees
Plot distributions
# Your implementation here
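A single simulation round of the design above, with Bonferroni and a hand-rolled Benjamini-Hochberg step-up to make the rule explicit (SciPy for the t-tests; Holm and BY follow the same pattern):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, m0, effect, n = 1000, 800, 0.5, 30

pvals = np.empty(m)
for i in range(m):
    mu = 0.0 if i < m0 else effect  # first 800 tests are true nulls
    a = rng.normal(0, 1, n)
    b = rng.normal(mu, 1, n)
    pvals[i] = stats.ttest_ind(a, b).pvalue

alpha = 0.05
bonf_reject = pvals < alpha / m

# Benjamini-Hochberg step-up: reject all tests up to the largest k with
# p_(k) <= alpha * k / m
order = np.argsort(pvals)
thresh = alpha * np.arange(1, m + 1) / m
passed = np.nonzero(pvals[order] <= thresh)[0]
bh_reject = np.zeros(m, bool)
if passed.size:
    bh_reject[order[:passed.max() + 1]] = True

# Power = fraction of the 200 true effects detected
power_bonf = bonf_reject[m0:].mean()
power_bh = bh_reject[m0:].mean()
```

Wrapping this in an outer loop over 1000 seeds and averaging the realized FWER, FDR, and power gives the full study; BH always rejects a superset of Bonferroni's rejections, which the power comparison should reflect.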
Problem 13.2: Genomics Application¶
Simulate gene expression:
10,000 genes
20 samples (10 per group)
100 truly differentially expressed
Apply multiple testing corrections
Create volcano plot
Manhattan plot
Report discoveries
# Your implementation here
Problem 13.3: Power Analysis¶
For fixed m = 100:
Vary n: [10, 20, 30, 50, 100]
Vary effect size: [0.2, 0.5, 0.8]
For each combination:
Calculate power under BH
Calculate power under Bonferroni
Create heatmaps
Determine required n for 80% power
# Your implementation here
Project Ideas¶
Project 1: Complete ML Pipeline¶
Dataset: Titanic survival
EDA with 10+ visualizations
Feature engineering
Try 5 different models
Hyperparameter tuning
Ensemble methods
Final evaluation
Professional report
# Project 1 implementation
Project 2: Time Series Prediction¶
Dataset: Stock prices or weather
Explore temporal patterns
Create lag features
Train-test split (temporal)
Compare: ARIMA, Random Forest, LSTM
Evaluate predictions
Discuss limitations
# Project 2 implementation
Project 3: Image Classification¶
Dataset: CIFAR-10 or Fashion-MNIST
Build CNN from scratch
Try transfer learning
Data augmentation
Compare architectures
Visualize learned features
Report final accuracy
# Project 3 implementation
Challenge Problems¶
Challenge 1: Build from Scratch¶
Implement without sklearn:
Linear regression (gradient descent)
Logistic regression
k-NN
Decision tree
k-Means
PCA
Compare performance with sklearn
# Challenge 1 implementation
Challenge 2: Kaggle Competition¶
Choose active competition
Form team or solo
Apply techniques from all chapters
Document approach
Submit predictions
Aim for top 25%
# Challenge 2 implementation
Challenge 3: Research Paper Implementation¶
Choose recent ML paper
Understand methodology
Implement algorithm
Reproduce results
Apply to new dataset
Write summary report
# Challenge 3 implementation
Solutions and Hints¶
General Tips¶
For All Exercises:
Always set random seed for reproducibility
Split data before any analysis
Scale features when necessary
Use cross-validation for model selection
Visualize results
Document your code
Interpret findings
Common Mistakes to Avoid:
Data leakage (scaling before split)
Not validating assumptions
Overfitting to test set
Ignoring class imbalance
Not checking for multicollinearity
Forgetting to standardize
Debugging Checklist:
Data loaded correctly?
Train-test split done?
Features scaled?
No data leakage?
Sensible hyperparameters?
Converged properly?
Results reproducible?
Progress Tracker¶
Beginner Level (Complete 5):
Chapter 1: Exercises 1.1-1.3
Chapter 2: Exercises 2.1-2.3
Chapter 3: Exercises 3.1-3.3
Chapter 4: Exercises 4.1-4.3
Chapter 5: Exercises 5.1-5.3
Intermediate Level (Complete 5):
Chapter 6: Exercises 6.1-6.3
Chapter 7: Exercises 7.1-7.3
Chapter 8: Exercises 8.1-8.3
Chapter 9: Exercises 9.1-9.3
Chapter 10: Exercises 10.1-10.3
Advanced Level (Complete 5):
Chapter 11: Exercises 11.1-11.3
Chapter 12: Exercises 12.1-12.3
Chapter 13: Exercises 13.1-13.3
2 Projects from list
1 Challenge problem
Master Level (Complete all):
All chapter exercises
All 3 projects
3+ challenge problems
Kaggle competition participation
Research paper implementation
Certificate of Completion¶
Upon completing all exercises, you will have:
✅ Mastered statistical learning fundamentals
✅ Implemented 50+ algorithms
✅ Completed 3+ real-world projects
✅ Built comprehensive ML portfolio
✅ Prepared for industry or research positions
Document your journey and share your work!
Good luck with your statistical learning journey! 🚀