# Setup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import multivariate_normal
sns.set_style('whitegrid')
np.random.seed(42)
print("Libraries loaded!")
## Chapter 9: Linear Regression

### Exercise 9.1: Maximum Likelihood Estimation 🟢
Generate synthetic data: y = 3x + 2 + noise.

Tasks:

- Generate 50 samples with Gaussian noise (σ = 0.5)
- Implement MLE using the normal equations: w = (X^T X)^(-1) X^T y
- Plot the data, the true line, and the fitted line
- Compare with sklearn's LinearRegression
# Your code here
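If you want to check your implementation afterwards, here is one minimal sketch of the closed-form MLE fit via the normal equations. The uniform x-range and the seed are arbitrary assumptions, not part of the exercise:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 50)
y = 3 * x + 2 + rng.normal(0, 0.5, 50)     # true line plus Gaussian noise

X = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
w = np.linalg.solve(X.T @ X, X.T @ y)      # w = (X^T X)^(-1) X^T y

print(w)                                   # intercept near 2, slope near 3
```

Using `np.linalg.solve` instead of explicitly inverting X^T X is both faster and numerically safer.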
### Exercise 9.2: Polynomial Regression & Overfitting 🟡
Explore bias-variance tradeoff with polynomial features.
Tasks:

- Generate data from y = sin(2πx) + noise (20 samples)
- Fit polynomials of degree 1, 3, 5, 9, 15
- Plot all fits on the same graph
- Compute the training MSE for each degree
- Which degree overfits? Why?
# Your code here
# Hint: Use np.vander() for polynomial features
### Exercise 9.3: Ridge Regression (L2 Regularization) 🟡
Implement Ridge regression from scratch.
Tasks:

- Generate high-dimensional data (n = 50 samples, p = 100 features)
- Implement Ridge: w = (X^T X + λI)^(-1) X^T y
- Try different λ values: [0.001, 0.01, 0.1, 1, 10, 100]
- Plot ||w|| vs log(λ)
- Find the optimal λ using cross-validation
# Your code here
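A minimal sketch of the closed-form ridge solution and the shrinking weight norm, assuming standard-normal features and an arbitrary seed (the cross-validation part is left to the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 100                                   # more features than samples
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.1, n)

def ridge(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [0.001, 0.01, 0.1, 1, 10, 100]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]
print(norms)  # ||w|| shrinks monotonically as lambda grows
```

Note that with p > n the matrix X^T X is singular, so the unregularized normal equations would fail here; the λI term is what makes the system solvable at all.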
### Exercise 9.4: Bayesian Linear Regression 🔴
Implement Bayesian regression with uncertainty quantification.
Tasks:

- Start with the prior: w ~ N(0, α^(-1) I), where α is the prior precision
- Compute the posterior after observing data (conjugate Gaussian update)
- Make predictions with uncertainty (posterior predictive distribution)
- Plot: data, posterior mean, and confidence bands (±2σ)
- Show how uncertainty decreases with more data (try N = 5, 20, 100)
# Your code here
# Posterior: w ~ N(m_N, S_N) where
# S_N^(-1) = α I + β X^T X
# m_N = β S_N X^T y
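The posterior update in the hint above can be sketched in a few lines. The values of α and β (the prior and noise precisions) and the test input x* = 0.5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 1 / 0.5 ** 2        # prior precision, noise precision (1/sigma^2)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 3 * x + 2 + rng.normal(0, 0.5, 20)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
m_N = beta * S_N @ X.T @ y                               # posterior mean

# Predictive distribution at x* = 0.5: noise term + parameter-uncertainty term
phi = np.array([1.0, 0.5])
pred_mean = phi @ m_N
pred_var = 1 / beta + phi @ S_N @ phi
print(m_N, pred_mean, pred_var)
```

The second term of `pred_var` is the one that shrinks as N grows, which is what the last task asks you to visualize.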
## Chapter 10: PCA (Principal Component Analysis)

### Exercise 10.1: PCA from Scratch 🟢
Implement PCA using eigendecomposition.
Tasks:

- Generate 2D data (200 points) with correlation
- Center the data (subtract the mean)
- Compute the covariance matrix
- Find the eigenvectors (principal components)
- Project the data onto the first PC
- Visualize: original data, PCs as arrows, projections
# Your code here
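The numerical steps (everything except the plotting) can be sketched as follows; the mixing matrix used to create correlation is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])  # correlated cloud

Xc = X - X.mean(axis=0)                 # 1. center
C = np.cov(Xc, rowvar=False)            # 2. covariance matrix
vals, vecs = np.linalg.eigh(C)          # 3. eigendecomposition (ascending order)
order = np.argsort(vals)[::-1]          #    sort descending by variance
vals, vecs = vals[order], vecs[:, order]

proj = Xc @ vecs[:, 0]                  # 4. project onto the first PC
print(vals)                             # variance captured by each PC
```

A useful self-check: the sample variance of `proj` should equal the leading eigenvalue exactly.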
### Exercise 10.2: Dimensionality Reduction on Digits 🟡
Apply PCA to handwritten digits dataset.
Tasks:

- Load the digits dataset (8×8 images, so 64 features)
- Implement PCA using SVD (more efficient than eigendecomposition)
- Plot the explained variance ratio (scree plot)
- How many components are needed for 95% variance?
- Reduce to 2D and visualize with a different color per digit
- Reconstruct images using k = 5, 10, 20, 64 components
# Your code here
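One way to sketch the SVD route (plots and reconstructions left to the exercise); the rows of `Vt` play the role of the eigenvectors of the covariance matrix:

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                       # shape (1797, 64)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)          # explained variance ratio per component
k95 = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
Z = Xc @ Vt[:2].T                            # 2D projection for the scatter plot
print(k95, Z.shape)
```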
### Exercise 10.3: PCA vs Random Projections 🟡
Compare PCA with random projections for dimensionality reduction.
Tasks:

- Generate high-dimensional data (n = 100, p = 50)
- Reduce to d = 2 using PCA
- Reduce to d = 2 using a random Gaussian projection
- Compute the reconstruction error for both
- Repeat 10 times and plot the error distributions
- Why does PCA win?
# Your code here
### Exercise 10.4: Whitening Transformation 🔴
Implement PCA whitening (decorrelation + unit variance).
Tasks:

- Generate correlated 2D data
- Apply PCA whitening: X_white = X_pca @ Λ^(-1/2)
- Verify that the whitened data has identity covariance
- Visualize: original data ellipse → whitened data (circle)
- Application: preprocessing for ICA (Independent Component Analysis)
# Your code here
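A minimal whitening sketch, assuming an arbitrary 2×2 mixing matrix to create the correlation. Dividing each PC coordinate by the square root of its eigenvalue is exactly the Λ^(-1/2) rescaling from the task:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])  # correlated data
Xc = X - X.mean(axis=0)

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = (Xc @ vecs) / np.sqrt(vals)    # rotate onto PCs, then rescale to unit variance

print(np.cov(X_white, rowvar=False))     # identity matrix up to rounding
```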
## Chapter 11: Gaussian Mixture Models

### Exercise 11.1: EM Algorithm Implementation 🟢
Implement Gaussian Mixture Model with EM algorithm.
Tasks:

- Generate data from 3 Gaussians (known parameters)
- Implement the E-step (compute responsibilities)
- Implement the M-step (update means, covariances, weights)
- Iterate until convergence (monitor the log-likelihood)
- Plot: data, true clusters, EM-discovered clusters
# Your code here
# E-step: r_nk = π_k N(x_n|μ_k,Σ_k) / Σ_j π_j N(x_n|μ_j,Σ_j)
# M-step: update π, μ, Σ using the responsibilities
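A compact EM sketch for two components (the exercise asks for three; the blob locations, seed, and fixed iteration count are simplifying assumptions, and the log-likelihood monitoring is left out):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([-3.0, 0.0], 0.7, (150, 2)),   # two well-separated blobs
               rng.normal([3.0, 0.0], 0.7, (150, 2))])

K = 2
pi = np.full(K, 1 / K)               # mixture weights
mu = X[[0, -1]]                      # init means from two data points
Sigma = np.array([np.eye(2)] * K)

for _ in range(50):
    # E-step: responsibilities r_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                            for k in range(K)])
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, covariances from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k]

print(sorted(mu[:, 0]))              # component means near -3 and +3
```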
### Exercise 11.2: Model Selection with BIC 🟡
Use Bayesian Information Criterion to select number of components.
Tasks:

- Generate data from 4 Gaussians
- Fit GMMs with K = 1 to 10 components
- Compute BIC for each: BIC = -2 log L + k log(n), where k is the number of free parameters
- Plot BIC vs K
- Select the optimal K (minimum BIC)
- Compare with AIC (which uses a different penalty, 2k)
# Your code here
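Using sklearn's GaussianMixture, the BIC sweep can be sketched like this; the cluster centers, spread, and the range K = 1..8 are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
centers = np.array([[-4.0, -4.0], [-4.0, 4.0], [4.0, -4.0], [4.0, 4.0]])
X = np.vstack([rng.normal(c, 0.8, (100, 2)) for c in centers])

bics = []
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))              # -2 log L + (num free params) log n

best_k = int(np.argmin(bics)) + 1
print(best_k)                            # the BIC minimum should sit at K = 4
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` gives the AIC comparison the last task asks for.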
### Exercise 11.3: GMM vs K-Means 🟡
Compare soft clustering (GMM) with hard clustering (K-Means).
Tasks:

- Generate overlapping clusters (3 Gaussians with some overlap)
- Fit a GMM and K-Means (K = 3)
- Visualize the cluster assignments (hard vs soft)
- For the GMM: show points with high uncertainty (entropy of the responsibilities)
- Which method better captures uncertainty in the overlap regions?
# Your code here
### Exercise 11.4: Anomaly Detection with GMM 🔴
Use GMM for outlier detection via density estimation.
Tasks:

- Generate inliers from 2 Gaussians (500 points)
- Add uniformly distributed outliers (50 points)
- Fit a GMM to all the data (K = 2)
- Compute the density of each point: p(x) = Σ_k π_k N(x|μ_k,Σ_k)
- Identify outliers (low density, bottom 5%)
- Visualize: decision boundary at the density threshold, mark the true outliers
# Your code here
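A sketch of the density-thresholding step (visualization omitted); the inlier centers, the uniform range, and the seed are assumptions. `score_samples` returns log p(x) under the fitted mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
inliers = np.vstack([rng.normal([-3.0, 0.0], 0.5, (250, 2)),
                     rng.normal([3.0, 0.0], 0.5, (250, 2))])
outliers = rng.uniform(-8, 8, (50, 2))
X = np.vstack([inliers, outliers])           # true outliers are the last 50 rows

gmm = GaussianMixture(n_components=2, n_init=3, random_state=0).fit(X)
log_dens = gmm.score_samples(X)              # log p(x) under the fitted mixture
threshold = np.percentile(log_dens, 5)       # flag the lowest-density 5%
flagged = log_dens < threshold

print(flagged.sum(), flagged[-50:].sum())    # total flags vs flags hitting true outliers
```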
## Chapter 12: Support Vector Machines

### Exercise 12.1: Linear SVM with Hard Margin 🟢
Understand maximum margin concept.
Tasks:

- Generate linearly separable data (2 classes, 2D)
- Fit a linear SVM (sklearn with a large C for a hard margin)
- Identify the support vectors
- Plot: data, decision boundary, margin, support vectors
- Verify: the distance from the support vectors to the hyperplane is 1/||w||
# Your code here
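The margin check in the last task can be sketched like this (cluster locations and seed are assumptions). For support vectors, |w·x + b| = 1, so their distance to the hyperplane is 1/||w||:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (50, 2)),
               rng.normal([2.0, 2.0], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors lie on the margin, so |w.x + b| should be ~1 for each of them
margins = np.abs(X[clf.support_] @ w + b)
print(margins, 1 / np.linalg.norm(w))
```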
### Exercise 12.2: Kernel Trick - XOR Problem 🟡
Solve non-linearly separable problem with kernels.
Tasks:

- Generate an XOR pattern (4 clusters at the corners)
- Try a linear SVM → it should fail (plot the decision boundary)
- Try polynomial kernels (degree 2, 3) → these should work
- Try an RBF kernel with varying γ (0.1, 1, 10)
- Visualize all the decision boundaries
- Which kernel is best for XOR?
# Your code here
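A quick sketch of the linear-vs-RBF contrast on XOR-style data. Labeling uniform points by quadrant sign is one simple way to get the XOR pattern; the γ and C values here are assumptions, not tuned answers:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # XOR-style labels by quadrant

lin = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma=10, C=10).fit(X, y)
print(lin.score(X, y), rbf.score(X, y))     # linear stays near chance, RBF fits the pattern
```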
### Exercise 12.3: Soft Margin & C Parameter 🟡
Explore effect of regularization parameter C.
Tasks:

- Generate overlapping classes (not perfectly separable)
- Fit SVMs with C = [0.01, 0.1, 1, 10, 100]
- For each C:
  - Count the support vectors
  - Compute the training accuracy
  - Plot the decision boundary
- Explain: what does a small C do? A large C?
- Which C generalizes best? (Split into train/test.)
# Your code here
### Exercise 12.4: Multi-class SVM 🔴
Extend binary SVM to multi-class classification.
Tasks:

- Load the digits dataset (10 classes)
- Implement the One-vs-Rest strategy:
  - Train 10 binary SVMs (one per class)
  - Predict: argmax of the decision functions
- Implement the One-vs-One strategy:
  - Train C(10,2) = 45 binary SVMs
  - Predict: majority voting
- Compare accuracy and training time
- Use an RBF kernel with a grid search over γ and C
# Your code here
## Bonus Challenge: Complete ML Pipeline 🔴🔴

### End-to-End Classification with All Techniques

Build a complete pipeline combining everything you've learned.
Dataset: sklearn.datasets.make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, n_classes=3)

Pipeline steps:

1. Data exploration:
   - Visualize feature correlations (heatmap)
   - Check the class balance
   - Split into train (70%), validation (15%), test (15%)
2. Preprocessing:
   - Standardize the features (zero mean, unit variance)
   - Apply PCA to reduce to 10 dimensions
   - Visualize the first 2 PCs with class colors
3. Model comparison:
   - Fit 3 models:
     - Logistic regression (baseline)
     - SVM with an RBF kernel
     - GMM-based classifier (Bayes classifier)
   - Tune hyperparameters on the validation set
4. Evaluation:
   - Compute test accuracy, precision, recall, F1
   - Plot confusion matrices
   - ROC curves (one-vs-rest)
5. Analysis:
   - Which features are most important? (PCA loadings)
   - Where do the models disagree?
   - Ensemble: majority voting over the 3 models

Bonus: add Bayesian regression for uncertainty in the predictions!
# Your complete ML pipeline here - good luck!
## Self-Assessment

Can you answer these conceptual questions?
Linear Regression:

- Why do we need regularization? When does Ridge work better than Lasso?
- What's the difference between MLE and MAP estimation?
- How does Bayesian regression quantify uncertainty?
PCA:

- Why does PCA find directions of maximum variance?
- When would PCA fail? (Think: non-linear structure.)
- How is PCA related to SVD?
GMM:

- What's the advantage of soft clustering over hard clustering?
- Why does EM converge to local optima? How can you mitigate this?
- When would you use a GMM vs K-Means?
SVM:

- Why maximize the margin?
- How does the kernel trick enable non-linear boundaries?
- What's the tradeoff controlled by the parameter C?
## Next Steps

- Check the solutions in mml_solutions_part2.ipynb
- Implement variations (different kernels, priors, etc.)
- Apply the techniques to real datasets (Kaggle, UCI ML Repository)
- Read the original papers:
  - Vapnik (1995) - SVM theory
  - Dempster et al. (1977) - EM algorithm
  - Pearson (1901) - PCA origins
You now understand the math behind ML!

These 4 algorithms (regression, PCA, GMM, SVM) form the foundation for:

- Deep learning (gradient descent, regularization)
- Modern ML (kernel methods, Bayesian inference)
- Data science (dimensionality reduction, clustering)

Keep building!