# Setup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import multivariate_normal

sns.set_style('whitegrid')
np.random.seed(42)

print("βœ… Libraries loaded!")

Chapter 9: Linear Regression

Exercise 9.1: Maximum Likelihood Estimation 🟢

Generate synthetic data: y = 3x + 2 + noise

Tasks:

  1. Generate 50 samples with Gaussian noise (σ=0.5)

  2. Implement MLE using normal equations: w = (X^T X)^(-1) X^T y

  3. Plot data, true line, and fitted line

  4. Compare with sklearn's LinearRegression

# Your code here
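Tasks 1–2 can be sketched as follows (the seed, the x-range, and the name `w_hat` are arbitrary choices; `np.linalg.solve` is used instead of explicitly inverting X^T X, which is numerically safer):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 5, 50)                       # 50 samples
y = 3 * x + 2 + rng.normal(0, 0.5, 50)          # y = 3x + 2 + noise (sigma = 0.5)

X = np.column_stack([x, np.ones_like(x)])       # design matrix with bias column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
print(w_hat)                                    # close to [3, 2]
```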

Exercise 9.2: Polynomial Regression & Overfitting 🟡

Explore bias-variance tradeoff with polynomial features.

Tasks:

  1. Generate data from y = sin(2πx) + noise (20 samples)

  2. Fit polynomials of degree 1, 3, 5, 9, 15

  3. Plot all fits on same graph

  4. Compute training MSE for each degree

  5. Which degree overfits? Why?

# Your code here
# Hint: Use np.vander() for polynomial features
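Following the hint, one sketch of tasks 1, 2, and 4 (plotting is omitted; `train_mse` is an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)   # 20 noisy samples

train_mse = {}
for deg in [1, 3, 5, 9, 15]:
    V = np.vander(x, deg + 1)                        # polynomial features
    w, *_ = np.linalg.lstsq(V, y, rcond=None)        # least-squares fit
    train_mse[deg] = np.mean((V @ w - y) ** 2)
print(train_mse)    # training MSE only shrinks with degree (overfitting!)
```

Training error can never increase as the degree grows (the models are nested), which is exactly why it cannot detect overfitting on its own.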

Exercise 9.3: Ridge Regression (L2 Regularization) 🟡

Implement Ridge regression from scratch.

Tasks:

  1. Generate high-dimensional data (n=50 samples, p=100 features)

  2. Implement Ridge: w = (X^T X + λI)^(-1) X^T y

  3. Try different λ values: [0.001, 0.01, 0.1, 1, 10, 100]

  4. Plot ||w|| vs log(λ)

  5. Find optimal λ using cross-validation

# Your code here
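The core of tasks 1–4 in one sketch (the sparse true weight vector is an arbitrary illustration; plotting and cross-validation are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 100                                   # more features than samples
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = 1.0
y = X @ w_true + rng.normal(0, 0.1, n)

norms = []
for lam in [0.001, 0.01, 0.1, 1, 10, 100]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge closed form
    norms.append(np.linalg.norm(w))
print(norms)   # ||w|| shrinks monotonically as lambda grows
```

Note that with p > n the plain normal equations are singular, so some λ > 0 is not just a regularizer here but what makes the system solvable at all.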

Exercise 9.4: Bayesian Linear Regression 🔴

Implement Bayesian regression with uncertainty quantification.

Tasks:

  1. Start with prior: w ~ N(0, α^(-1) I), with prior precision α

  2. Compute posterior after observing data (use conjugate Gaussian)

  3. Make predictions with uncertainty (predictive distribution)

  4. Plot: data, posterior mean, and confidence bands (±2σ)

  5. Show how uncertainty decreases with more data (try N=5, 20, 100)

# Your code here
# Posterior: w ~ N(m_N, S_N), where (α = prior precision, β = noise precision)
# S_N^(-1) = α I + β X^T X
# m_N = β S_N X^T y
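The posterior formulas above translate almost line for line into code; the values of α and β (prior and noise precision) below are arbitrary illustrative choices, as is the query point:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0                      # prior precision, noise precision
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])    # features phi(x) = (1, x)
y = 2 + 3 * x + rng.normal(0, 1 / np.sqrt(beta), 20)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)   # posterior covariance
m_N = beta * S_N @ X.T @ y                                # posterior mean
phi = np.array([1.0, 0.5])                   # query point x* = 0.5
pred_var = 1 / beta + phi @ S_N @ phi        # predictive variance at x*
print(m_N, pred_var)
```

The predictive variance always exceeds the noise floor 1/β; the extra phi^T S_N phi term is the model uncertainty that shrinks as N grows.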

Chapter 10: PCA (Principal Component Analysis)

Exercise 10.1: PCA from Scratch 🟢

Implement PCA using eigendecomposition.

Tasks:

  1. Generate 2D data (200 points) with correlation

  2. Center the data (subtract mean)

  3. Compute covariance matrix

  4. Find eigenvectors (principal components)

  5. Project data onto first PC

  6. Visualize: original data, PCs as arrows, projections

# Your code here
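Steps 2–5 can be sketched as follows (the generating covariance is arbitrary; visualization is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)

Xc = X - X.mean(axis=0)                      # 2. center
C = Xc.T @ Xc / (len(Xc) - 1)                # 3. covariance matrix
evals, evecs = np.linalg.eigh(C)             # 4. eigendecomposition (ascending)
order = np.argsort(evals)[::-1]              #    reorder to descending variance
evals, evecs = evals[order], evecs[:, order]
proj = Xc @ evecs[:, 0]                      # 5. project onto first PC
print(evals)                                 # variance captured by each PC
```

A useful sanity check: the variance of the projected data equals the top eigenvalue exactly.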

Exercise 10.2: Dimensionality Reduction on Digits 🟡

Apply PCA to handwritten digits dataset.

Tasks:

  1. Load digits dataset (8×8 images, so 64 features)

  2. Implement PCA using SVD (more efficient than eigendecomposition)

  3. Plot explained variance ratio (scree plot)

  4. How many components for 95% variance?

  5. Reduce to 2D and visualize with different colors per digit

  6. Reconstruct images using k=5, 10, 20, 64 components

# Your code here
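A sketch of tasks 2–4 via SVD (the `searchsorted` trick is one reasonable way to count components for 95% variance):

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                        # (1797, 64)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # PCA via SVD
var = S**2 / (len(X) - 1)                     # eigenvalues of the covariance
ratio = np.cumsum(var) / var.sum()            # cumulative explained variance
k95 = int(np.searchsorted(ratio, 0.95)) + 1   # components for 95% variance
print(k95)
```

The squared singular values of the centered data, divided by n−1, are exactly the covariance eigenvalues, which is why SVD and eigendecomposition give the same scree plot.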

Exercise 10.3: PCA vs Random Projections 🟡

Compare PCA with random projections for dimensionality reduction.

Tasks:

  1. Generate high-dim data (n=100, p=50)

  2. Reduce to d=2 using PCA

  3. Reduce to d=2 using random Gaussian projection

  4. Compute reconstruction error for both

  5. Repeat 10 times and plot error distributions

  6. Why does PCA win?

# Your code here
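A hedged sketch of tasks 2–4 for a single trial; decoding the random projection via the pseudoinverse is one common choice of reconstruction. By the Eckart–Young theorem, PCA's rank-d reconstruction error is the minimum possible over all rank-d approximations, so it can never lose this comparison:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, d = 100, 50, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
err_pca = np.linalg.norm(Xc - Xc @ Vt[:d].T @ Vt[:d])   # rank-d PCA reconstruction

P = rng.normal(size=(p, d)) / np.sqrt(d)                # random Gaussian projection
err_rp = np.linalg.norm(Xc - Xc @ P @ np.linalg.pinv(P))
print(err_pca, err_rp)    # PCA error is provably the smaller one
```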

Exercise 10.4: Whitening Transformation 🔴

Implement PCA whitening (decorrelation + unit variance).

Tasks:

  1. Generate correlated 2D data

  2. Apply PCA whitening: X_white = X_pca @ Λ^(-1/2)

  3. Verify whitened data has identity covariance

  4. Visualize: original data ellipse → whitened data (circle)

  5. Application: preprocess data for ICA (Independent Component Analysis)

# Your code here
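Task 2's transform in code (the generating covariance is arbitrary; `eigh` returns eigenvalues in ascending order, which is fine here since every direction gets rescaled):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], [[4.0, 1.8], [1.8, 1.0]], size=1000)

Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc.T))        # eigendecompose covariance
X_white = Xc @ evecs @ np.diag(evals ** -0.5)      # rotate, then rescale by Lambda^(-1/2)
print(np.cov(X_white.T))                           # identity covariance (task 3)
```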

Chapter 11: Gaussian Mixture Models

Exercise 11.1: EM Algorithm Implementation 🟢

Implement Gaussian Mixture Model with EM algorithm.

Tasks:

  1. Generate data from 3 Gaussians (known parameters)

  2. Implement E-step (compute responsibilities)

  3. Implement M-step (update means, covariances, weights)

  4. Iterate until convergence (monitor log-likelihood)

  5. Plot: data, true clusters, EM-discovered clusters

# Your code here
# E-step: r_nk = π_k N(x_n|μ_k,Σ_k) / Σ_j π_j N(x_n|μ_j,Σ_j)
# M-step: Update π, μ, Σ using responsibilities
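A compact sketch of the E/M updates in the comments above; initializing the means at random data points and adding a small covariance ridge are pragmatic choices, not the only ones:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 100)
               for m in [(0, 0), (5, 5), (0, 5)]])        # 3 known Gaussians

K, n = 3, len(X)
pi = np.full(K, 1 / K)
mu = X[rng.choice(n, K, replace=False)]                   # init at random points
Sigma = np.array([np.eye(2)] * K)

for _ in range(50):
    # E-step: r_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
    R = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)])
    ll = np.log(R.sum(axis=1)).sum()                      # monitor log-likelihood
    R /= R.sum(axis=1, keepdims=True)
    # M-step: update pi, mu, Sigma from the responsibilities
    Nk = R.sum(axis=0)
    pi = Nk / n
    mu = (R.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (R[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(2)
print(np.round(mu, 1), ll)
```

In practice the E-step should be done with log-densities (log-sum-exp) to avoid underflow; the plain pdf version above is fine for this 2-D toy problem.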

Exercise 11.2: Model Selection with BIC 🟡

Use Bayesian Information Criterion to select number of components.

Tasks:

  1. Generate data from 4 Gaussians

  2. Fit GMMs with K=1 to 10 components

  3. Compute BIC for each: BIC = -2 log L + k log(n), where k = number of free parameters

  4. Plot BIC vs K

  5. Select optimal K (minimum BIC)

  6. Compare with AIC (different penalty)

# Your code here
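Using sklearn's GaussianMixture and its built-in `bic()` as a shortcut sketch of tasks 1–2 and 5 (the cluster layout and seeds are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.5, (100, 2))
               for m in [(0, 0), (4, 0), (0, 4), (4, 4)]])   # 4 Gaussians

bic = [GaussianMixture(k, random_state=0).fit(X).bic(X) for k in range(1, 11)]
best_k = int(np.argmin(bic)) + 1                             # minimum-BIC model
print(best_k)
```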

Exercise 11.3: GMM vs K-Means 🟡

Compare soft clustering (GMM) with hard clustering (K-Means).

Tasks:

  1. Generate overlapping clusters (3 Gaussians with some overlap)

  2. Fit GMM and K-Means (K=3)

  3. Visualize cluster assignments (hard vs soft)

  4. For GMM: show points with high uncertainty (entropy of responsibilities)

  5. Which method better captures uncertainty in overlap regions?

# Your code here
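A sketch of the uncertainty comparison in tasks 3–4, using the entropy of the responsibilities (the 0.5-nat entropy cutoff is an arbitrary illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(m, 1.0, (150, 2))
               for m in [(0, 0), (3, 0), (1.5, 2.5)]])       # overlapping clusters

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # hard
R = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X) # soft
entropy = -(R * np.log(R + 1e-12)).sum(axis=1)   # assignment uncertainty per point
uncertain = (entropy > 0.5).mean()               # fraction of ambiguous points
print(uncertain)
```

K-Means has no analogue of this entropy: every point is assigned with full confidence, which is exactly what it gets wrong in the overlap regions.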

Exercise 11.4: Anomaly Detection with GMM 🔴

Use GMM for outlier detection via density estimation.

Tasks:

  1. Generate inliers from 2 Gaussians (500 points)

  2. Add outliers uniformly distributed (50 points)

  3. Fit GMM to all data (K=2)

  4. Compute density for each point: p(x) = Σ_k π_k N(x|μ_k,Σ_k)

  5. Identify outliers (low density, bottom 5%)

  6. Visualize: decision boundary at density threshold, mark true outliers

# Your code here
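Tasks 1–5 sketched with sklearn's GaussianMixture (`score_samples` returns the log-density, so the bottom-5% threshold of task 5 is taken on log p(x); visualization omitted):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
inliers = np.vstack([rng.normal((0, 0), 0.5, (250, 2)),
                     rng.normal((4, 4), 0.5, (250, 2))])     # 500 inliers
outliers = rng.uniform(-3, 7, (50, 2))                       # 50 uniform outliers
X = np.vstack([inliers, outliers])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_p = gmm.score_samples(X)                  # log p(x) for every point
threshold = np.percentile(log_p, 5)           # bottom 5% of density
flagged = log_p < threshold
print(flagged.sum())
```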

Chapter 12: Support Vector Machines

Exercise 12.1: Linear SVM with Hard Margin 🟢

Understand maximum margin concept.

Tasks:

  1. Generate linearly separable data (2 classes, 2D)

  2. Fit linear SVM (sklearn with large C for hard margin)

  3. Identify support vectors

  4. Plot: data, decision boundary, margin, support vectors

  5. Verify: distance from support vectors to hyperplane = 1/||w||

# Your code here
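A sketch of tasks 1–3 and 5 (C=1e6 only approximates a hard margin, and the data layout is deliberately chosen to be comfortably separable):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(12)
X = np.vstack([rng.normal((-2, -2), 0.5, (30, 2)),
               rng.normal((2, 2), 0.5, (30, 2))])   # linearly separable
y = np.repeat([0, 1], 30)

svm = SVC(kernel='linear', C=1e6).fit(X, y)         # huge C ~ hard margin
w, b = svm.coef_[0], svm.intercept_[0]
d = np.abs(X[svm.support_] @ w + b) / np.linalg.norm(w)  # SV distances (task 5)
print(d, 1 / np.linalg.norm(w))                     # distances match 1/||w||
```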

Exercise 12.2: Kernel Trick - XOR Problem 🟡

Solve non-linearly separable problem with kernels.

Tasks:

  1. Generate XOR pattern (4 clusters at corners)

  2. Try linear SVM → should fail (plot decision boundary)

  3. Try polynomial kernel (degree 2, 3) → should work

  4. Try RBF kernel with varying γ (0.1, 1, 10)

  5. Visualize all decision boundaries

  6. Which kernel is best for XOR?

# Your code here
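Tasks 2–4 compressed into a training-accuracy comparison (cluster noise and the single γ value are illustrative; decision-boundary plots are omitted):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(13)
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
X = np.vstack([c + rng.normal(0, 0.2, (50, 2)) for c in centers])
y = np.repeat([0, 0, 1, 1], 50)                     # XOR labeling of the corners

accs = {name: SVC(kernel=name, **kw).fit(X, y).score(X, y)
        for name, kw in [('linear', {}), ('poly', {'degree': 2}),
                         ('rbf', {'gamma': 1.0})]}
print(accs)    # linear struggles; poly-2 and RBF handle XOR
```

The degree-2 polynomial kernel works because its feature map contains the cross term x1·x2, whose sign is exactly the XOR label.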

Exercise 12.3: Soft Margin & C Parameter 🟡

Explore effect of regularization parameter C.

Tasks:

  1. Generate overlapping classes (not perfectly separable)

  2. Fit SVMs with C = [0.01, 0.1, 1, 10, 100]

  3. For each C:

    • Count support vectors

    • Compute training accuracy

    • Plot decision boundary

  4. Explain: What does small C do? Large C?

  5. Which C generalizes best? (split into train/test)

# Your code here
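A sketch of the support-vector count from task 3 (the data layout is an arbitrary pair of overlapping blobs; accuracy and boundary plots are omitted):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(14)
X = np.vstack([rng.normal((0, 0), 1.2, (100, 2)),
               rng.normal((2, 2), 1.2, (100, 2))])  # overlapping classes
y = np.repeat([0, 1], 100)

n_sv = {C: SVC(kernel='linear', C=C).fit(X, y).support_.size
        for C in [0.01, 0.1, 1, 10, 100]}
print(n_sv)    # small C -> wider margin -> more support vectors
```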

Exercise 12.4: Multi-class SVM 🔴

Extend binary SVM to multi-class classification.

Tasks:

  1. Load digits dataset (10 classes)

  2. Implement One-vs-Rest strategy:

    • Train 10 binary SVMs (one per class)

    • Predict: argmax of decision functions

  3. Implement One-vs-One strategy:

    • Train C(10,2)=45 binary SVMs

    • Predict: majority voting

  4. Compare accuracy and training time

  5. Use RBF kernel with grid search for Ξ³ and C

# Your code here
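A minimal One-vs-Rest sketch for tasks 1–2. Comparing raw `decision_function` values across independently trained binary SVMs is a simplification (their scales are not calibrated against each other); sklearn's `OneVsRestClassifier` handles this more carefully:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# One-vs-Rest: one binary SVM per class, predict by argmax of decision values
models = [SVC(kernel='rbf', gamma='scale').fit(Xtr, ytr == c) for c in range(10)]
scores = np.column_stack([m.decision_function(Xte) for m in models])
acc = (scores.argmax(axis=1) == yte).mean()
print(acc)
```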

🎯 Bonus Challenge: Complete ML Pipeline 🔴🔴

End-to-End Classification with All Techniques

Build a complete pipeline combining everything you've learned.

Dataset: sklearn.datasets.make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, n_classes=3)

Pipeline Steps:

  1. Data Exploration:

    • Visualize feature correlations (heatmap)

    • Check class balance

    • Split into train (70%), validation (15%), test (15%)

  2. Preprocessing:

    • Standardize features (zero mean, unit variance)

    • Apply PCA to reduce to 10 dimensions

    • Visualize first 2 PCs with class colors

  3. Model Comparison:

    • Fit 3 models:

      • Logistic regression (baseline)

      • SVM with RBF kernel

      • GMM-based classifier (Bayes classifier)

    • Tune hyperparameters on validation set

  4. Evaluation:

    • Compute test accuracy, precision, recall, F1

    • Plot confusion matrices

    • ROC curves (one-vs-rest)

  5. Analysis:

    • Which features are most important? (PCA loadings)

    • Where do models disagree?

    • Ensemble: majority voting of 3 models

Bonus: Add Bayesian regression for uncertainty in predictions!

# Your complete ML pipeline here - Good luck! 🚀
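Not a complete pipeline, but a hedged skeleton of steps 1–3's plumbing using sklearn's Pipeline, with the baseline model only and the split simplified to train/test:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# scale -> PCA(10) -> baseline classifier, chained so test data never leaks
pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                     LogisticRegression(max_iter=1000)).fit(Xtr, ytr)
acc = pipe.score(Xte, yte)
print(acc)
```

Fitting the scaler and PCA inside the pipeline (rather than on all data up front) is what keeps the test set honest.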

📊 Self-Assessment

Can you answer these conceptual questions?

Linear Regression:

  • Why do we need regularization? When does Ridge work better than Lasso?

  • What’s the difference between MLE and MAP estimation?

  • How does Bayesian regression quantify uncertainty?

PCA:

  • Why does PCA find directions of maximum variance?

  • When would PCA fail? (Think: non-linear structure)

  • How is PCA related to SVD?

GMM:

  • What’s the advantage of soft clustering over hard clustering?

  • Why does EM converge to local optima? How to mitigate?

  • When would you use GMM vs K-Means?

SVM:

  • Why maximize the margin?

  • How does the kernel trick enable non-linear boundaries?

  • What’s the tradeoff controlled by parameter C?

🎓 Next Steps

  1. ✅ Check solutions in mml_solutions_part2.ipynb

  2. 📖 Implement variations (different kernels, priors, etc.)

  3. 🏆 Apply to real datasets (Kaggle, UCI ML Repository)

  4. 📚 Read original papers:

    • Vapnik (1995) - SVM theory

    • Dempster et al. (1977) - EM algorithm

    • Pearson (1901) - PCA origins

You now understand the math behind ML! 🎉

These 4 algorithms (regression, PCA, GMM, SVM) form the foundation for:

  • Deep learning (gradient descent, regularization)

  • Modern ML (kernel methods, Bayesian inference)

  • Data science (dimensionality reduction, clustering)

Keep building! πŸš€