# Setup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import multivariate_normal
sns.set_style('whitegrid')
np.random.seed(42)
print("Libraries loaded!")
## Chapter 9: Linear Regression

### Exercise 9.1: Maximum Likelihood Estimation 🟢
Generate synthetic data: y = 3x + 2 + noise.

Tasks:

- Generate 50 samples with Gaussian noise (σ = 0.5)
- Implement MLE using the normal equations: w = (X^T X)^(-1) X^T y
- Plot the data, the true line, and the fitted line
- Compare with sklearn's LinearRegression
# Your code here
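If you want to check your implementation afterwards, here is one minimal sketch of the closed-form MLE fit via the normal equations. The uniform x-range and the seed are arbitrary assumptions, not part of the exercise:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 50)
y = 3 * x + 2 + rng.normal(0, 0.5, 50)     # true line plus Gaussian noise

X = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
w = np.linalg.solve(X.T @ X, X.T @ y)      # w = (X^T X)^(-1) X^T y

print(w)                                   # intercept near 2, slope near 3
```

Using `np.linalg.solve` instead of explicitly inverting X^T X is both faster and numerically safer.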
### Exercise 9.2: Polynomial Regression & Overfitting 🟡
Explore bias-variance tradeoff with polynomial features.
Tasks:

- Generate data from y = sin(2πx) + noise (20 samples)
- Fit polynomials of degree 1, 3, 5, 9, 15
- Plot all fits on the same graph
- Compute the training MSE for each degree
- Which degree overfits? Why?
# Your code here
# Hint: Use np.vander() for polynomial features
### Exercise 9.3: Ridge Regression (L2 Regularization) 🟡
Implement Ridge regression from scratch.
Tasks:

- Generate high-dimensional data (n = 50 samples, p = 100 features)
- Implement Ridge: w = (X^T X + λI)^(-1) X^T y
- Try different λ values: [0.001, 0.01, 0.1, 1, 10, 100]
- Plot ||w|| vs log(λ)
- Find the optimal λ using cross-validation
# Your code here
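A minimal sketch of the closed-form ridge solution and the shrinking weight norm, assuming standard-normal features and an arbitrary seed (the cross-validation part is left to the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 100                                   # more features than samples
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.1, n)

def ridge(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [0.001, 0.01, 0.1, 1, 10, 100]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]
print(norms)  # ||w|| shrinks monotonically as lambda grows
```

Note that with p > n the matrix X^T X is singular, so the unregularized normal equations would fail here; the λI term is what makes the system solvable at all.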
### Exercise 9.4: Bayesian Linear Regression 🔴
Implement Bayesian regression with uncertainty quantification.
Tasks:

- Start with the prior: w ~ N(0, α^(-1) I), where α is the prior precision
- Compute the posterior after observing data (conjugate Gaussian update)
- Make predictions with uncertainty (posterior predictive distribution)
- Plot: data, posterior mean, and confidence bands (±2σ)
- Show how uncertainty decreases with more data (try N = 5, 20, 100)
# Your code here
# Posterior: w ~ N(m_N, S_N) where
# S_N^(-1) = α I + β X^T X
# m_N = β S_N X^T y
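The posterior update in the hint above can be sketched in a few lines. The values of α and β (the prior and noise precisions) and the test input x* = 0.5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 1 / 0.5 ** 2        # prior precision, noise precision (1/sigma^2)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 3 * x + 2 + rng.normal(0, 0.5, 20)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
m_N = beta * S_N @ X.T @ y                               # posterior mean

# Predictive distribution at x* = 0.5: noise term + parameter-uncertainty term
phi = np.array([1.0, 0.5])
pred_mean = phi @ m_N
pred_var = 1 / beta + phi @ S_N @ phi
print(m_N, pred_mean, pred_var)
```

The second term of `pred_var` is the one that shrinks as N grows, which is what the last task asks you to visualize.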
## Chapter 10: PCA (Principal Component Analysis)

### Exercise 10.1: PCA from Scratch 🟢
Implement PCA using eigendecomposition.
Tasks:

- Generate 2D data (200 points) with correlation
- Center the data (subtract the mean)
- Compute the covariance matrix
- Find the eigenvectors (principal components)
- Project the data onto the first PC
- Visualize: original data, PCs as arrows, projections
# Your code here
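The numerical steps (everything except the plotting) can be sketched as follows; the mixing matrix used to create correlation is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])  # correlated cloud

Xc = X - X.mean(axis=0)                 # 1. center
C = np.cov(Xc, rowvar=False)            # 2. covariance matrix
vals, vecs = np.linalg.eigh(C)          # 3. eigendecomposition (ascending order)
order = np.argsort(vals)[::-1]          #    sort descending by variance
vals, vecs = vals[order], vecs[:, order]

proj = Xc @ vecs[:, 0]                  # 4. project onto the first PC
print(vals)                             # variance captured by each PC
```

A useful self-check: the sample variance of `proj` should equal the leading eigenvalue exactly.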
### Exercise 10.2: Dimensionality Reduction on Digits 🟡
Apply PCA to handwritten digits dataset.
Tasks:

- Load the digits dataset (8×8 images, so 64 features)
- Implement PCA using SVD (more efficient than eigendecomposition)
- Plot the explained variance ratio (scree plot)
- How many components are needed for 95% variance?
- Reduce to 2D and visualize with a different color per digit
- Reconstruct images using k = 5, 10, 20, 64 components
# Your code here
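One way to sketch the SVD route (plots and reconstructions left to the exercise); the rows of `Vt` play the role of the eigenvectors of the covariance matrix:

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                       # shape (1797, 64)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)          # explained variance ratio per component
k95 = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
Z = Xc @ Vt[:2].T                            # 2D projection for the scatter plot
print(k95, Z.shape)
```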
### Exercise 10.3: PCA vs Random Projections 🟡
Compare PCA with random projections for dimensionality reduction.
Tasks:

- Generate high-dimensional data (n = 100, p = 50)
- Reduce to d = 2 using PCA
- Reduce to d = 2 using a random Gaussian projection
- Compute the reconstruction error for both
- Repeat 10 times and plot the error distributions
- Why does PCA win?
# Your code here
### Exercise 10.4: Whitening Transformation 🔴
Implement PCA whitening (decorrelation + unit variance).
Tasks:

- Generate correlated 2D data
- Apply PCA whitening: X_white = X_pca @ Λ^(-1/2)
- Verify that the whitened data has identity covariance
- Visualize: original data ellipse → whitened data (circle)
- Application: preprocessing for ICA (Independent Component Analysis)
# Your code here
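A minimal whitening sketch, assuming an arbitrary 2×2 mixing matrix to create the correlation. Dividing each PC coordinate by the square root of its eigenvalue is exactly the Λ^(-1/2) rescaling from the task:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])  # correlated data
Xc = X - X.mean(axis=0)

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = (Xc @ vecs) / np.sqrt(vals)    # rotate onto PCs, then rescale to unit variance

print(np.cov(X_white, rowvar=False))     # identity matrix up to rounding
```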
## Chapter 11: Gaussian Mixture Models

### Exercise 11.1: EM Algorithm Implementation 🟢
Implement Gaussian Mixture Model with EM algorithm.
Tasks:

- Generate data from 3 Gaussians (known parameters)
- Implement the E-step (compute responsibilities)
- Implement the M-step (update means, covariances, weights)
- Iterate until convergence (monitor the log-likelihood)
- Plot: data, true clusters, EM-discovered clusters
# Your code here
# E-step: r_nk = π_k N(x_n|μ_k,Σ_k) / Σ_j π_j N(x_n|μ_j,Σ_j)
# M-step: update π, μ, Σ using the responsibilities
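A compact EM sketch for two components (the exercise asks for three; the blob locations, seed, and fixed iteration count are simplifying assumptions, and the log-likelihood monitoring is left out):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([-3.0, 0.0], 0.7, (150, 2)),   # two well-separated blobs
               rng.normal([3.0, 0.0], 0.7, (150, 2))])

K = 2
pi = np.full(K, 1 / K)               # mixture weights
mu = X[[0, -1]]                      # init means from two data points
Sigma = np.array([np.eye(2)] * K)

for _ in range(50):
    # E-step: responsibilities r_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                            for k in range(K)])
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, covariances from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k]

print(sorted(mu[:, 0]))              # component means near -3 and +3
```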
### Exercise 11.2: Model Selection with BIC 🟡
Use Bayesian Information Criterion to select number of components.
Tasks:

- Generate data from 4 Gaussians
- Fit GMMs with K = 1 to 10 components
- Compute BIC for each: BIC = -2 log L + k log(n), where k is the number of free parameters
- Plot BIC vs K
- Select the optimal K (minimum BIC)
- Compare with AIC (which uses a different penalty, 2k)
# Your code here
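Using sklearn's GaussianMixture, the BIC sweep can be sketched like this; the cluster centers, spread, and the range K = 1..8 are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
centers = np.array([[-4.0, -4.0], [-4.0, 4.0], [4.0, -4.0], [4.0, 4.0]])
X = np.vstack([rng.normal(c, 0.8, (100, 2)) for c in centers])

bics = []
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))              # -2 log L + (num free params) log n

best_k = int(np.argmin(bics)) + 1
print(best_k)                            # the BIC minimum should sit at K = 4
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` gives the AIC comparison the last task asks for.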
### Exercise 11.3: GMM vs K-Means 🟡
Compare soft clustering (GMM) with hard clustering (K-Means).
Tasks:

- Generate overlapping clusters (3 Gaussians with some overlap)
- Fit a GMM and K-Means (K = 3)
- Visualize the cluster assignments (hard vs soft)
- For the GMM: show points with high uncertainty (entropy of the responsibilities)
- Which method better captures uncertainty in the overlap regions?
# Your code here
### Exercise 11.4: Anomaly Detection with GMM 🔴
Use GMM for outlier detection via density estimation.
Tasks:

- Generate inliers from 2 Gaussians (500 points)
- Add uniformly distributed outliers (50 points)
- Fit a GMM to all the data (K = 2)
- Compute the density of each point: p(x) = Σ_k π_k N(x|μ_k,Σ_k)
- Identify outliers (low density, bottom 5%)
- Visualize: decision boundary at the density threshold, mark the true outliers
# Your code here
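A sketch of the density-thresholding step (visualization omitted); the inlier centers, the uniform range, and the seed are assumptions. `score_samples` returns log p(x) under the fitted mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
inliers = np.vstack([rng.normal([-3.0, 0.0], 0.5, (250, 2)),
                     rng.normal([3.0, 0.0], 0.5, (250, 2))])
outliers = rng.uniform(-8, 8, (50, 2))
X = np.vstack([inliers, outliers])           # true outliers are the last 50 rows

gmm = GaussianMixture(n_components=2, n_init=3, random_state=0).fit(X)
log_dens = gmm.score_samples(X)              # log p(x) under the fitted mixture
threshold = np.percentile(log_dens, 5)       # flag the lowest-density 5%
flagged = log_dens < threshold

print(flagged.sum(), flagged[-50:].sum())    # total flags vs flags hitting true outliers
```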
## Chapter 12: Support Vector Machines

### Exercise 12.1: Linear SVM with Hard Margin 🟢
Understand maximum margin concept.
Tasks:

- Generate linearly separable data (2 classes, 2D)
- Fit a linear SVM (sklearn with a large C for a hard margin)
- Identify the support vectors
- Plot: data, decision boundary, margin, support vectors
- Verify: the distance from the support vectors to the hyperplane is 1/||w||
# Your code here
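The margin check in the last task can be sketched like this (cluster locations and seed are assumptions). For support vectors, |w·x + b| = 1, so their distance to the hyperplane is 1/||w||:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (50, 2)),
               rng.normal([2.0, 2.0], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors lie on the margin, so |w.x + b| should be ~1 for each of them
margins = np.abs(X[clf.support_] @ w + b)
print(margins, 1 / np.linalg.norm(w))
```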
### Exercise 12.2: Kernel Trick - XOR Problem 🟡
Solve non-linearly separable problem with kernels.
Tasks:

- Generate an XOR pattern (4 clusters at the corners)
- Try a linear SVM → it should fail (plot the decision boundary)
- Try polynomial kernels (degree 2, 3) → these should work
- Try an RBF kernel with varying γ (0.1, 1, 10)
- Visualize all the decision boundaries
- Which kernel is best for XOR?
# Your code here
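A quick sketch of the linear-vs-RBF contrast on XOR-style data. Labeling uniform points by quadrant sign is one simple way to get the XOR pattern; the γ and C values here are assumptions, not tuned answers:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # XOR-style labels by quadrant

lin = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma=10, C=10).fit(X, y)
print(lin.score(X, y), rbf.score(X, y))     # linear stays near chance, RBF fits the pattern
```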
### Exercise 12.3: Soft Margin & C Parameter 🟡
Explore effect of regularization parameter C.
Tasks:

- Generate overlapping classes (not perfectly separable)
- Fit SVMs with C = [0.01, 0.1, 1, 10, 100]
- For each C:
  - Count the support vectors
  - Compute the training accuracy
  - Plot the decision boundary
- Explain: what does a small C do? A large C?
- Which C generalizes best? (Split into train/test.)
# Your code here
### Exercise 12.4: Multi-class SVM 🔴
Extend binary SVM to multi-class classification.
Tasks:

- Load the digits dataset (10 classes)
- Implement the One-vs-Rest strategy:
  - Train 10 binary SVMs (one per class)
  - Predict: argmax of the decision functions
- Implement the One-vs-One strategy:
  - Train C(10,2) = 45 binary SVMs
  - Predict: majority voting
- Compare accuracy and training time
- Use an RBF kernel with a grid search over γ and C
# Your code here
## Bonus Challenge: Complete ML Pipeline 🔴🔴

### End-to-End Classification with All Techniques

Build a complete pipeline combining everything you've learned.
Dataset: sklearn.datasets.make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, n_classes=3)

Pipeline steps:

1. Data exploration:
   - Visualize feature correlations (heatmap)
   - Check the class balance
   - Split into train (70%), validation (15%), test (15%)
2. Preprocessing:
   - Standardize the features (zero mean, unit variance)
   - Apply PCA to reduce to 10 dimensions
   - Visualize the first 2 PCs with class colors
3. Model comparison:
   - Fit 3 models:
     - Logistic regression (baseline)
     - SVM with an RBF kernel
     - GMM-based classifier (Bayes classifier)
   - Tune hyperparameters on the validation set
4. Evaluation:
   - Compute test accuracy, precision, recall, F1
   - Plot confusion matrices
   - ROC curves (one-vs-rest)
5. Analysis:
   - Which features are most important? (PCA loadings)
   - Where do the models disagree?
   - Ensemble: majority voting over the 3 models

Bonus: add Bayesian regression for uncertainty in the predictions!
# Your complete ML pipeline here - good luck!
## Self-Assessment

Can you answer these conceptual questions?
Linear Regression:

- Why do we need regularization? When does Ridge work better than Lasso?
- What's the difference between MLE and MAP estimation?
- How does Bayesian regression quantify uncertainty?
PCA:

- Why does PCA find directions of maximum variance?
- When would PCA fail? (Think: non-linear structure.)
- How is PCA related to SVD?
GMM:

- What's the advantage of soft clustering over hard clustering?
- Why does EM converge to local optima? How can you mitigate this?
- When would you use a GMM vs K-Means?
SVM:

- Why maximize the margin?
- How does the kernel trick enable non-linear boundaries?
- What's the tradeoff controlled by the parameter C?
## Next Steps

- Check the solutions in mml_solutions_part2.ipynb
- Implement variations (different kernels, priors, etc.)
- Apply the techniques to real datasets (Kaggle, UCI ML Repository)
- Read the original papers:
  - Vapnik (1995) - SVM theory
  - Dempster et al. (1977) - EM algorithm
  - Pearson (1901) - PCA origins
You now understand the math behind ML!

These 4 algorithms (regression, PCA, GMM, SVM) form the foundation for:

- Deep learning (gradient descent, regularization)
- Modern ML (kernel methods, Bayesian inference)
- Data science (dimensionality reduction, clustering)

Keep building!