Chapter 1: Introduction (Additional Problems)

Problem 1.1: Dataset Analysis

  • Load the Titanic dataset via fetch_openml('titanic', version=1)

  • Identify all variable types (numerical, categorical)

  • Determine if this is regression or classification

  • Calculate survival rate by passenger class

  • Create 3 visualizations showing key patterns

# Your implementation here

Problem 1.2: Model Comparison

  • Use California Housing dataset

  • Compare Linear Regression, Decision Tree, Random Forest

  • Calculate R², RMSE for each on test set

  • Plot predictions vs actual for best model

  • Explain which model is best and why

# Your implementation here

Problem 1.3: Overfitting Deep Dive

  • Generate data: y = 3x² - 2x + 1 + noise

  • Fit polynomials degree 1-20

  • Plot both training and validation error curves

  • Identify optimal degree

  • Explain bias-variance trade-off observed

# Your implementation here
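One way to start (a sketch, not the only approach — the split, noise level, and seed are arbitrary choices; add matplotlib for the plotting step):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 3 * x**2 - 2 * x + 1 + rng.normal(0, 1, 200)

# Simple train/validation split
x_tr, x_val = x[:150], x[150:]
y_tr, y_val = y[:150], y[150:]

train_err, val_err = [], []
degrees = range(1, 21)
for d in degrees:
    coefs = np.polyfit(x_tr, y_tr, d)
    train_err.append(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    val_err.append(np.mean((np.polyval(coefs, x_val) - y_val) ** 2))

best_degree = degrees[int(np.argmin(val_err))]
```

Training error keeps falling with degree while validation error turns back up — the gap between the two curves is the overfitting you are asked to explain.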

Chapter 2: Statistical Learning (Additional Problems)

Problem 2.1: Bias-Variance Decomposition

  • Simulate data with known function

  • For polynomial degrees 1, 5, 15:

    • Estimate bias

    • Estimate variance

    • Calculate MSE

  • Visualize the decomposition MSE = bias² + variance (+ irreducible error)

  • Identify optimal complexity

# Your implementation here

Problem 2.2: Irreducible Error

  • Generate: y = sin(x) + noise with σ=0.5

  • Fit perfect model (know true function)

  • Show that error cannot go below σ²

  • Demonstrate with multiple noise levels

  • Discuss practical implications

# Your implementation here
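A starter sketch for the key demonstration (sample size and noise levels are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
floors = {}
for sigma in (0.1, 0.5, 1.0):
    x = rng.uniform(0, 2 * np.pi, n)
    y = np.sin(x) + rng.normal(0, sigma, n)
    # The "perfect" model predicts the true function exactly,
    # yet its MSE still floors out near sigma**2: the irreducible error
    floors[sigma] = np.mean((y - np.sin(x)) ** 2)
```

Even with the true function in hand, the MSE never drops below σ² — no model can do better than the noise allows.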

Problem 2.3: Curse of Dimensionality

  • Generate uniform data in 1D, 2D, 5D, 10D

  • For each dimension, calculate fraction of data within unit ball

  • Visualize sparsity as dimension increases

  • Explain impact on k-NN

  • Suggest solutions

# Your implementation here
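A minimal sketch of the sparsity calculation (sampling in the hypercube [-1, 1]^d and counting points inside the inscribed unit ball; the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
fractions = {}
for d in (1, 2, 5, 10):
    # Uniform points in the hypercube [-1, 1]^d
    pts = rng.uniform(-1, 1, size=(n, d))
    # Fraction falling inside the inscribed unit ball
    fractions[d] = (np.linalg.norm(pts, axis=1) <= 1.0).mean()
```

In 1D every point is inside; by 10D almost none are — nearly all the volume sits in the corners, which is why k-NN distances become uninformative in high dimensions.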

Chapter 3: Linear Regression (Additional Problems)

Problem 3.1: Multiple Collinearity

# Create correlated features
import numpy as np
np.random.seed(0)  # reproducibility
X1 = np.random.randn(100)
X2 = X1 + np.random.randn(100) * 0.1  # Highly correlated with X1
X3 = np.random.randn(100)
y = 2 * X1 + 3 * X3 + np.random.randn(100)  # noise

# Tasks:
# - Fit regression with all features
# - Calculate VIF (Variance Inflation Factor)
# - Compare with Ridge regression
# - Explain coefficient instability
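A self-contained sketch of the VIF task using only numpy (it recreates the data with its own seed; VIF_j = 1/(1 − R²_j), where R²_j comes from regressing feature j on the remaining features):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(3)]
```

Expect very large VIFs for the two collinear features and a value near 1 for the independent one; a VIF above 5-10 is the usual warning sign.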

Problem 3.2: Residual Diagnostics

  • Load the Boston Housing dataset (removed from scikit-learn 1.2; load it from OpenML or substitute the California Housing dataset)

  • Fit linear regression

  • Create 4-panel diagnostic plot:

    • Residuals vs Fitted

    • Q-Q plot

    • Scale-Location plot

    • Residuals vs Leverage

  • Identify violations of assumptions

  • Suggest transformations

# Your implementation here

Problem 3.3: Interaction Terms

# Advertising data with interaction
# Sales = β0 + β1*TV + β2*Radio + β3*TV*Radio + ε

# Tasks:
# - Fit model with and without interaction
# - Compare R² and adjusted R²
# - Test interaction significance
# - Interpret interaction coefficient
# - Visualize interaction effect

Chapter 4: Classification (Additional Problems)

Problem 4.1: Imbalanced Classes

  • Create dataset: 95% class 0, 5% class 1

  • Train classifier on imbalanced data

  • Calculate accuracy, precision, recall

  • Apply SMOTE or class weights

  • Compare performance before/after

# Your implementation here
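A sketch of the before/after comparison using class weights (one of the two remedies listed; the synthetic dataset and its parameters are arbitrary choices — swap in SMOTE from imbalanced-learn if you prefer resampling):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# 95% / 5% class split
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05],
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
prec_weighted = precision_score(y_te, weighted.predict(X_te))
```

Class weighting typically trades precision for recall on the minority class — quantifying that trade-off is the point of the exercise.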

Problem 4.2: ROC Curves

  • Use breast cancer dataset

  • Train Logistic Regression, SVM, Random Forest

  • Plot ROC curves for all three

  • Calculate AUC for each

  • Identify optimal threshold for each

  • Compare at different operating points

# Your implementation here
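A sketch for one of the three models (logistic regression here; repeat for SVM and Random Forest). Youden's J statistic is one common way to pick an operating threshold — an assumption, not the only choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
# Youden's J: threshold maximizing tpr - fpr
best_threshold = thresholds[(tpr - fpr).argmax()]
```

Plot (fpr, tpr) for each model on the same axes to compare operating points.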

Problem 4.3: Multi-class Classification

  • Load digits dataset (10 classes)

  • Implement OvR (One vs Rest)

  • Implement OvO (One vs One)

  • Create confusion matrix heatmap

  • Calculate per-class precision/recall

  • Identify most confused class pairs

# Your implementation here
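A starter sketch using sklearn's OvR/OvO wrappers (the scaled logistic-regression base estimator is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
ovr = OneVsRestClassifier(base).fit(X_tr, y_tr)  # 10 binary classifiers
ovo = OneVsOneClassifier(base).fit(X_tr, y_tr)   # 45 pairwise classifiers

cm = confusion_matrix(y_te, ovr.predict(X_te))
ovr_acc = ovr.score(X_te, y_te)
ovo_acc = ovo.score(X_te, y_te)
```

The off-diagonal entries of `cm` point you to the most confused class pairs.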

Chapter 5: Resampling Methods (Additional Problems)

Problem 5.1: Cross-Validation Comparison

  • Use same dataset

  • Compare: LOOCV, 5-fold, 10-fold, 20-fold

  • For each: calculate mean and std of error

  • Compare computational time

  • Plot error vs number of folds

  • Recommend best approach

# Your implementation here
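A sketch of the fold comparison (the synthetic regression dataset is an arbitrary choice; LOOCV is just k-fold with k = n, and `time.perf_counter` covers the timing task):

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

results = {}
for k in (5, 10, 20, len(X)):  # len(X) folds == LOOCV
    t0 = time.perf_counter()
    scores = -cross_val_score(LinearRegression(), X, y,
                              cv=k, scoring="neg_mean_squared_error")
    results[k] = {"mean": scores.mean(), "std": scores.std(),
                  "seconds": time.perf_counter() - t0}
```

Compare the mean/std of the error and the runtime as k grows; LOOCV has low bias but is the most expensive and its estimate tends to have high variance.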

Problem 5.2: Bootstrap Confidence Intervals

# Estimate CI for correlation coefficient
import numpy as np

def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]

# Tasks:
# - Generate bivariate normal data
# - Bootstrap 10,000 times
# - Calculate 95% CI (percentile method)
# - Compare with analytical CI
# - Visualize bootstrap distribution
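A self-contained sketch of the bootstrap loop (the true correlation 0.7, sample size, and seed are arbitrary choices):

```python
import numpy as np

def boot_corr(data, indices):
    return np.corrcoef(data[indices, 0], data[indices, 1])[0, 1]

rng = np.random.default_rng(4)
rho = 0.7
data = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200)

n = len(data)
# Resample rows with replacement 10,000 times
boot = np.array([boot_corr(data, rng.integers(0, n, n))
                 for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])      # 95% percentile CI
point = boot_corr(data, np.arange(n))          # sample estimate
```

Compare (lo, hi) against the Fisher z-transform CI for the analytical check, and histogram `boot` for the visualization step.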

Problem 5.3: Nested Cross-Validation

  • Outer loop: 5-fold CV for performance

  • Inner loop: 3-fold CV for hyperparameter tuning

  • Apply to SVM with C and gamma parameters

  • Report unbiased performance estimate

  • Compare with simple CV

# Your implementation here
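A compact sketch of the nested loop: a GridSearchCV (inner, 3-fold) passed as the estimator to cross_val_score (outer, 5-fold). The breast cancer dataset and the C/gamma grid are assumptions, since the problem leaves them open:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                     param_grid, cv=3)

# Outer loop scores the *whole tuning procedure*, giving an
# (approximately) unbiased performance estimate
scores = cross_val_score(inner, X, y, cv=5)
nested_score = scores.mean()
```

Contrast `nested_score` with the (optimistic) best CV score of a single non-nested GridSearchCV on all the data.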

Chapter 6: Regularization (Additional Problems)

Problem 6.1: Regularization Path

  • Load diabetes dataset

  • For Ridge: try λ from 10⁻⁴ to 10⁴

  • For Lasso: try λ from 10⁻⁴ to 10⁴

  • Plot coefficient paths

  • Identify when coefficients hit zero (Lasso)

  • Select λ via CV

# Your implementation here
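A sketch of the Lasso half using sklearn's `lasso_path` (standardizing X and centering y by hand, since `lasso_path` fits no intercept; the α grid mirrors the 10⁻⁴-10⁴ range in the prompt):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = y - y.mean()                     # lasso_path fits no intercept

alphas = np.logspace(-4, 4, 100)
# lasso_path expects alphas in decreasing order
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas[::-1])

# Number of active (nonzero) coefficients at each alpha
n_active = (np.abs(coefs) > 1e-10).sum(axis=0)
```

Plot each row of `coefs` against `np.log10(alphas_out)` for the coefficient paths; `n_active` shows where coefficients hit zero. Repeat with `ridge` (e.g. a loop of `Ridge(alpha=...)` fits) for the Ridge paths, where coefficients shrink but never reach exactly zero.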

Problem 6.2: Elastic Net Tuning

  • Grid search over α ∈ [0, 1] and λ

  • α=0 (Ridge), α=1 (Lasso), α=0.5 (middle)

  • Use 10-fold CV

  • Create heatmap of CV error

  • Identify optimal (α, λ) pair

  • Compare with pure Ridge/Lasso

# Your implementation here

Problem 6.3: Feature Selection Comparison

methods = ['Forward Selection', 'Backward Selection', 'Lasso']

# Tasks:
# - Apply all three to high-dimensional data
# - Compare selected features
# - Compare prediction performance
# - Analyze computational cost
# - Discuss stability of selections

Chapter 7: Non-Linearity (Additional Problems)

Problem 7.1: Spline Knot Selection

  • Generate non-linear data

  • Fit B-splines with 3, 5, 10, 20 knots

  • Use CV to select optimal knots

  • Plot fitted curves

  • Compare with polynomial approach

# Your implementation here

Problem 7.2: GAM Implementation

from pygam import LinearGAM, s
# Fit: y ~ s(x1) + s(x2) + s(x3), i.e. LinearGAM(s(0) + s(1) + s(2))

# Tasks:
# - Load auto dataset
# - Fit GAM with smoothing splines
# - Extract partial dependence plots
# - Compare with linear model
# - Interpret smooth functions

Problem 7.3: Local Regression (LOESS)

  • Implement or use LOESS

  • Try span values: 0.1, 0.3, 0.5, 0.75

  • Visualize fits

  • Calculate CV error for each

  • Compare with polynomial regression

# Your implementation here

Chapter 8: Tree Methods (Additional Problems)

Problem 8.1: Tree Pruning

  • Grow deep tree (max_depth=20)

  • Extract cost-complexity path

  • Plot CV error vs tree size

  • Select optimal tree size

  • Compare full vs pruned tree

# Your implementation here
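A sketch of cost-complexity pruning via sklearn's `cost_complexity_pruning_path` (the breast cancer dataset is an assumption; the problem leaves it open):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cost-complexity path of a fully grown tree
path = DecisionTreeClassifier(random_state=0) \
    .cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas

# CV accuracy for the pruned tree at each alpha
cv_scores = [cross_val_score(
    DecisionTreeClassifier(random_state=0, ccp_alpha=a),
    X_tr, y_tr, cv=5).mean() for a in ccp_alphas]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
```

Plot `cv_scores` against tree size (refit at each alpha and read `tree_.node_count`) to see where pruning stops helping.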

Problem 8.2: Feature Importance Analysis

  • Train Random Forest on iris

  • Extract feature importances

  • Create bar plot

  • Compute permutation importance

  • Compare Gini vs permutation importance

  • Test stability with different seeds

# Your implementation here

Problem 8.3: Boosting Hyperparameters

Grid search over:

  • n_estimators: [50, 100, 200, 500]

  • learning_rate: [0.01, 0.05, 0.1, 0.5]

  • max_depth: [1, 3, 5, 7]

Tasks:

  • Use early stopping

  • Visualize learning curves

  • Report best combination

# Your implementation here

Chapter 9: SVM (Additional Problems)

Problem 9.1: Kernel Comparison

Kernels to try:

  • Linear

  • Polynomial (degree 2, 3, 5)

  • RBF (γ = 0.001, 0.01, 0.1, 1)

  • Sigmoid

Tasks:

  • Plot decision boundaries (2D)

  • Calculate accuracy for each

  • Visualize support vectors

  • Identify best kernel

# Your implementation here

Problem 9.2: Soft Margin Analysis

  • Try C values: [0.01, 0.1, 1, 10, 100]

  • For each C, count support vectors

  • Plot: # SVs vs C

  • Plot: test error vs C

  • Plot: margin width vs C

  • Explain trade-offs

# Your implementation here

Problem 9.3: SVM for Regression (SVR)

from sklearn.svm import SVR

# Tasks:
# - Use California housing
# - Try different ε values
# - Compare with Ridge regression
# - Identify when SVR is beneficial
# - Visualize ε-tube

Chapter 10: Deep Learning (Additional Problems)

Problem 10.2: Regularization Comparison

On same architecture, compare:

  • No regularization

  • L2 (α = 0.0001, 0.001, 0.01)

  • Dropout (p = 0.2, 0.5, 0.8)

  • Early stopping

  • L2 + Dropout

  • Plot validation curves

  • Identify best combination

# Your implementation here

Problem 10.3: Activation Functions

Compare: ReLU, Leaky ReLU, ELU, tanh, sigmoid

  • Train on same dataset

  • Plot learning curves

  • Compare convergence speed

  • Analyze gradient flow

  • Recommend best choice

# Your implementation here

Chapter 11: Survival Analysis (Additional Problems)

Problem 11.1: Kaplan-Meier Comparison

  • Load lung cancer dataset

  • Compare survival by sex

  • Perform log-rank test

  • Calculate median survival for each

  • Plot survival curves with CI

  • Interpret results

# Your implementation here

Problem 11.2: Cox PH Interpretation

from lifelines import CoxPHFitter

# Tasks:
# - Fit Cox model with multiple covariates
# - Extract hazard ratios
# - Calculate 95% CI for each
# - Test proportional hazards assumption
# - Interpret coefficients
# - Make predictions

Problem 11.3: Time-Dependent Effects

  • Check PH assumption

  • If violated, use stratification

  • Or fit time-varying coefficient model

  • Compare models with AIC

  • Visualize time-varying effect

# Your implementation here

Chapter 12: Unsupervised Learning (Additional Problems)

Problem 12.1: PCA Deep Dive

  • Load handwritten digits

  • Apply PCA

  • Plot scree plot

  • Determine components for 90%, 95%, 99% variance

  • Visualize first 4 PCs

  • Reconstruct images with different # PCs

  • Show reconstruction error vs # components

# Your implementation here
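A sketch of the variance-threshold step (the cumulative explained-variance ratio gives the component counts directly):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 pixel features

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching each variance target
n_components = {t: int(np.searchsorted(cum_var, t)) + 1
                for t in (0.90, 0.95, 0.99)}
```

Plot `cum_var` for the scree plot; for reconstruction, use `pca.inverse_transform` on truncated scores and compare against the originals.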

Problem 12.2: Clustering Comparison

On same dataset, compare:

  • K-Means (k=2 to 10)

  • Hierarchical (4 linkages)

  • DBSCAN (vary eps and min_samples)

  • Gaussian Mixture Models

Metrics:

  • Silhouette score

  • Davies-Bouldin index

  • Calinski-Harabasz index

# Your implementation here

Problem 12.3: Hierarchical Clustering Analysis

  • Try all 4 linkage methods

  • Create dendrograms

  • Cut at different heights

  • Compare resulting clusters

  • Calculate cophenetic correlation

  • Identify best linkage

# Your implementation here

Chapter 13: Multiple Testing (Additional Problems)

Problem 13.1: Simulation Study

Design:

  • m = 1000 tests

  • m₀ = 800 true nulls

  • Effect size = 0.5

  • n = 30 per group

Tasks:

  • Run 1000 simulations

  • For each: apply Bonferroni, Holm, BH, BY

  • Calculate FWER, FDR, power for each

  • Verify theoretical guarantees

  • Plot distributions

# Your implementation here
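A sketch of one simulation round with Bonferroni and a hand-rolled Benjamini-Hochberg step-up (wrap it in a loop for the 1000 repetitions; the α = 0.05 level is assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
m, m0, n, effect = 1000, 800, 30, 0.5
alpha = 0.05

# Group means: zero for the m0 true nulls, `effect` for the rest
shift = np.where(np.arange(m) < m0, 0.0, effect)
a = rng.normal(0.0, 1.0, (m, n))
b = rng.normal(shift[:, None], 1.0, (m, n))
p = stats.ttest_ind(a, b, axis=1).pvalue
is_null = shift == 0.0

# Bonferroni: reject when p < alpha / m
bonf_reject = p < alpha / m

# Benjamini-Hochberg step-up: largest k with p_(k) <= alpha * k / m
order = np.argsort(p)
below = p[order] <= alpha * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True
```

Per round, FWER events are `bonf_reject[is_null].any()` and the false discovery proportion is false rejections over `max(bh_reject.sum(), 1)`; average both across the 1000 repetitions.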

Problem 13.2: Genomics Application

Simulate gene expression:

  • 10,000 genes

  • 20 samples (10 per group)

  • 100 truly differentially expressed

  • Apply multiple testing corrections

  • Create volcano plot

  • Manhattan plot

  • Report discoveries

# Your implementation here

Problem 13.3: Power Analysis

For fixed m = 100:

  • Vary n: [10, 20, 30, 50, 100]

  • Vary effect size: [0.2, 0.5, 0.8]

  • For each combination:

    • Calculate power under BH

    • Calculate power under Bonferroni

  • Create heatmaps

  • Determine required n for 80% power

# Your implementation here

Project Ideas

Project 1: Complete ML Pipeline

Dataset: Titanic survival

  • EDA with 10+ visualizations

  • Feature engineering

  • Try 5 different models

  • Hyperparameter tuning

  • Ensemble methods

  • Final evaluation

  • Professional report

# Project 1 implementation

Project 2: Time Series Prediction

Dataset: Stock prices or weather

  • Explore temporal patterns

  • Create lag features

  • Train-test split (temporal)

  • Compare: ARIMA, Random Forest, LSTM

  • Evaluate predictions

  • Discuss limitations

# Project 2 implementation

Project 3: Image Classification

Dataset: CIFAR-10 or Fashion-MNIST

  • Build CNN from scratch

  • Try transfer learning

  • Data augmentation

  • Compare architectures

  • Visualize learned features

  • Report final accuracy

# Project 3 implementation

Challenge Problems

Challenge 1: Build from Scratch

Implement without sklearn:

  • Linear regression (gradient descent)

  • Logistic regression

  • k-NN

  • Decision tree

  • k-Means

  • PCA

Compare performance with sklearn

# Challenge 1 implementation
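A hint for the first item — a minimal batch-gradient-descent linear regression with a synthetic sanity check (learning rate and iteration count are ad hoc choices that work for standardized features):

```python
import numpy as np

def linreg_gd(X, y, lr=0.1, n_iter=2000):
    """Linear regression fit by batch gradient descent on MSE."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # d(MSE)/dw
        w -= lr * grad
    return w

# Sanity check against known coefficients
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)
w = linreg_gd(X, y)   # approx [intercept, 2, -3]
```

Check your result against `sklearn.linear_model.LinearRegression` (or the normal equations) — the comparison step the challenge asks for.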

Challenge 2: Kaggle Competition

  • Choose active competition

  • Form team or solo

  • Apply techniques from all chapters

  • Document approach

  • Submit predictions

  • Aim for top 25%

# Challenge 2 implementation

Challenge 3: Research Paper Implementation

  • Choose recent ML paper

  • Understand methodology

  • Implement algorithm

  • Reproduce results

  • Apply to new dataset

  • Write summary report

# Challenge 3 implementation

Solutions and Hints

General Tips

For All Exercises:

  1. Always set random seed for reproducibility

  2. Split data before any analysis

  3. Scale features when necessary

  4. Use cross-validation for model selection

  5. Visualize results

  6. Document your code

  7. Interpret findings

Common Mistakes to Avoid:

  • Data leakage (scaling before split)

  • Not validating assumptions

  • Overfitting to test set

  • Ignoring class imbalance

  • Not checking for multicollinearity

  • Forgetting to standardize

Debugging Checklist:

  • Data loaded correctly?

  • Train-test split done?

  • Features scaled?

  • No data leakage?

  • Sensible hyperparameters?

  • Converged properly?

  • Results reproducible?

Progress Tracker

Beginner Level (Complete 5):

  • Chapter 1: Exercises 1.1-1.3

  • Chapter 2: Exercises 2.1-2.3

  • Chapter 3: Exercises 3.1-3.3

  • Chapter 4: Exercises 4.1-4.3

  • Chapter 5: Exercises 5.1-5.3

Intermediate Level (Complete 5):

  • Chapter 6: Exercises 6.1-6.3

  • Chapter 7: Exercises 7.1-7.3

  • Chapter 8: Exercises 8.1-8.3

  • Chapter 9: Exercises 9.1-9.3

  • Chapter 10: Exercises 10.1-10.3

Advanced Level (Complete 5):

  • Chapter 11: Exercises 11.1-11.3

  • Chapter 12: Exercises 12.1-12.3

  • Chapter 13: Exercises 13.1-13.3

  • 2 Projects from list

  • 1 Challenge problem

Master Level (Complete all):

  • All chapter exercises

  • All projects from the list

  • 3+ challenge problems

  • Kaggle competition participation

  • Research paper implementation

Certificate of Completion

Upon completing all exercises, you will have:

✅ Mastered statistical learning fundamentals
✅ Implemented 50+ algorithms
✅ Completed multiple real-world projects
✅ Built comprehensive ML portfolio
✅ Become ready for industry or research positions

Document your journey and share your work!

Good luck with your statistical learning journey! 🚀