Lecture 1: Linear Regression (10 Problems)¶
1.1 🟢 Basic Implementation¶
Implement linear regression without using numpy’s advanced functions:
Test on: y = 2x + 3 + noise
```python
def manual_linear_regression(X, y):
    # Use only basic Python and loops
    # Return theta (parameters)
    pass

# Your implementation here
```
1.2 🟡 Gradient Descent Variants¶
Implement and compare:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch GD (batch_size=32)
On dataset with m=10,000, compare:
Convergence speed (iterations to converge)
Final cost
Time per iteration
# Your implementation here
1.3 🟡 Learning Rate Schedule¶
Implement learning rate decay:
α(t) = α₀ / (1 + decay_rate × t)
Compare with constant learning rate on California Housing.
# Your implementation here
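The decay schedule above can be sketched directly; this is a minimal illustration (the function name `decayed_lr` is ours, and `alpha0`/`decay_rate` are the hyperparameters you would tune):

```python
import numpy as np

def decayed_lr(alpha0, decay_rate, t):
    # 1/t decay: alpha(t) = alpha0 / (1 + decay_rate * t)
    return alpha0 / (1.0 + decay_rate * t)

# Example: learning rate over the first few iterations
lrs = np.array([decayed_lr(0.1, 0.01, t) for t in range(5)])
```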
1.4 🔴 Closed-Form vs Iterative¶
For different feature dimensions n = [10, 100, 1000, 5000]:
Time Normal Equation
Time Gradient Descent (1000 iterations)
Plot time vs n
Identify crossover point
# Your implementation here
1.5 🟢 Feature Scaling Impact¶
Generate data where features have different scales (e.g., [0.001, 1000]).
Train without scaling
Train with standardization
Compare convergence
# Your implementation here
1.6 🟡 Polynomial Regression¶
Fit polynomials of degree 1-10 to: y = sin(2πx) + noise
Use cross-validation to select degree
Plot train vs validation error
Identify optimal degree
# Your implementation here
1.7 🔴 Regularization Path¶
For Ridge regression, try λ ∈ [10⁻⁶, 10⁶]:
Plot coefficient values vs log(λ)
Identify when coefficients shrink to zero
Find optimal λ using CV
# Your implementation here
1.8 🟡 Weighted Linear Regression¶
Implement locally weighted regression:
weight[i] = exp(-||x⁽ⁱ⁾ - x||² / (2τ²))
Try different τ values
Compare with standard linear regression
Visualize predictions
# Your implementation here
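The weight formula above translates to a few lines of NumPy; this sketch (function name ours) computes the Gaussian weights for one query point:

```python
import numpy as np

def lwr_weights(X, x_query, tau):
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    d2 = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * tau ** 2))
```

Small τ makes the fit very local (only nearby points get weight); large τ approaches standard linear regression.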
1.9 🔴 Online Learning¶
Implement online gradient descent:
Receive examples one at a time
Update parameters after each example
Track running average cost
Compare with batch learning
# Your implementation here
1.10 🏆 Distributed Gradient Descent¶
Simulate distributed training:
Split data across 4 “workers”
Each computes local gradients
Aggregate gradients
Update parameters
Compare speedup vs communication cost
# Your implementation here
Lecture 2: Logistic Regression (10 Problems)¶
2.1 🟢 Sigmoid Properties¶
Prove mathematically:
σ’(z) = σ(z)(1 - σ(z))
1 - σ(z) = σ(-z)
σ(z) → 1 as z → ∞
Plot sigmoid, derivative, and second derivative.
# Your implementation here
2.2 🟡 Decision Boundary Visualization¶
For 2D dataset:
Train logistic regression
Plot decision boundary
Show regions with P(y=1|x) > 0.5
Add confidence contours
# Your implementation here
2.3 🟡 Multi-class Classification¶
Implement One-vs-All:
```python
def train_one_vs_all(X, y, num_classes):
    # Train K binary classifiers
    # Return K parameter vectors
    pass

# Test on iris dataset (3 classes)
```
2.4 🔴 Newton’s Method¶
Implement Newton’s method for logistic regression:
θ := θ - H⁻¹∇J(θ)
where H is the Hessian matrix. Compare convergence speed with gradient descent.
# Your implementation here
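One Newton update for the logistic log-loss can be sketched as follows (a minimal illustration on a toy problem, not a full training loop; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def newton_step(theta, X, y):
    # One Newton update: theta <- theta - H^{-1} grad J(theta)
    m = X.shape[0]
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m
    R = h * (1 - h)                 # curvature weights, one per example
    H = (X.T * R) @ X / m           # Hessian of the average log-loss
    return theta - np.linalg.solve(H, grad)

# Tiny demo: one step on a 2-feature toy problem (intercept column included)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta0 = np.zeros(2)
theta1 = newton_step(theta0, X, y)
```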
2.5 🟢 Regularized Logistic Regression¶
Add L2 regularization:
J(θ) = -(1/m)Σ[y log h + (1-y) log(1-h)] + (λ/2m)Σθⱼ²
Try λ = [0.001, 0.01, 0.1, 1.0, 10]. Plot validation error vs λ.
# Your implementation here
2.6 🟡 Imbalanced Classes¶
Create dataset: 95% class 0, 5% class 1
Train standard logistic regression
Calculate: accuracy, precision, recall, F1
Apply class weights
Compare performance
# Your implementation here
2.7 🔴 Calibration Analysis¶
For logistic regression predictions:
Bin predictions into 10 buckets
For each bucket, calculate actual positive rate
Plot calibration curve
Compare well-calibrated vs poorly-calibrated
# Your implementation here
2.8 🟡 Feature Engineering¶
For text classification:
Implement TF-IDF vectorization
Train logistic regression
Identify most important words
Compare with bag-of-words
# Your implementation here
2.9 🔴 Softmax Regression¶
Implement softmax (multi-class logistic regression):
P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)
Derive the gradient and implement it from scratch. Test on MNIST digits.
# Your implementation here
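When implementing the softmax formula above, subtract the row maximum before exponentiating so that large scores do not overflow; a minimal sketch (function name ours):

```python
import numpy as np

def softmax(scores):
    # Shift by the row max: exp(s - max) avoids overflow and leaves
    # the ratios, and hence the probabilities, unchanged
    s = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)
```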
2.10 🏆 Adversarial Examples¶
Generate adversarial examples:
Find minimal perturbation that flips prediction
Visualize perturbations
Test adversarial training
Measure robustness
# Your implementation here
Lecture 3: Regularization (10 Problems)¶
3.1 🟢 Overfitting Demonstration¶
Generate data: y = x² + noise (n=20 samples)
Fit polynomials degree 1-15
Plot all fits
Show overfitting visually
# Your implementation here
3.2 🟡 Cross-Validation Implementation¶
Implement k-fold CV from scratch:
```python
def k_fold_cv(X, y, k, model):
    # Split data into k folds
    # Train on k-1, validate on 1
    # Return average validation error
    pass

# Your implementation here
```
3.3 🟡 Ridge vs Lasso Comparison¶
On dataset with correlated features:
Plot regularization paths for both
Identify when Lasso sets coefficients to 0
Compare prediction performance
# Your implementation here
3.4 🔴 Elastic Net Tuning¶
Grid search over:
α ∈ [0, 0.25, 0.5, 0.75, 1.0]
λ ∈ [0.001, 0.01, 0.1, 1.0, 10]
Create heatmap of CV error. Identify optimal combination.
# Your implementation here
3.5 🟢 Early Stopping¶
Implement early stopping:
Monitor validation error
Stop when no improvement for 10 epochs
Restore best weights
Compare with fixed epochs
# Your implementation here
3.6 🟡 Dropout Implementation¶
Implement dropout regularization:
```python
import numpy as np

def dropout_forward(X, p=0.5):
    # Inverted dropout: drop units with probability p,
    # rescale survivors by 1/(1-p) so expected activations match test time
    mask = (np.random.rand(*X.shape) > p) / (1 - p)
    return X * mask, mask

# Test on neural network
```
3.7 🔴 Bayesian Interpretation¶
Show Ridge regression equivalent to MAP estimation with:
Gaussian prior on θ: θ ~ N(0, (1/λ)I)
Likelihood: y|x,θ ~ N(θᵀx, σ²)
Derive λ from prior variance.
# Your implementation here
3.8 🟡 Feature Selection¶
Compare feature selection methods:
Forward selection
Backward elimination
Lasso
Random Forest importance
On dataset with 50 features (10 relevant).
# Your implementation here
3.9 🔴 Non-convex Regularization¶
Implement L0 regularization (count non-zeros):
J(θ) = MSE + λ × ||θ||₀
Use greedy approximation or IHT algorithm.
# Your implementation here
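The IHT approach mentioned above alternates a gradient step on the MSE with projection onto the set of k-sparse vectors; a minimal sketch under the assumption of a noiseless, well-conditioned design (all names and the step size are ours):

```python
import numpy as np

def hard_threshold(theta, k):
    # Keep the k largest-magnitude entries, zero out the rest
    out = np.zeros_like(theta)
    idx = np.argsort(np.abs(theta))[-k:]
    out[idx] = theta[idx]
    return out

def iht(X, y, k, lr=0.5, n_iters=300):
    # Iterative Hard Thresholding: gradient step on the MSE,
    # then project onto { theta : ||theta||_0 <= k }
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = hard_threshold(theta - lr * grad, k)
    return theta
```

The step size lr must respect the scaling of X (here roughly unit-variance features); too large a step diverges.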
3.10 🏆 Regularization for Deep Learning¶
Compare regularization in deep neural networks:
L2 weight decay
Dropout
Batch normalization
Data augmentation
Label smoothing
On CIFAR-10 dataset.
# Your implementation here
Lecture 4: Generative Models (10 Problems)¶
4.1 🟢 GDA Implementation¶
Implement Gaussian Discriminant Analysis:
```python
def fit_gda(X, y):
    # Estimate μ₀, μ₁, Σ for each class
    # Return parameters
    pass

# Test on synthetic 2D data
```
4.2 🟡 GDA vs Logistic Regression¶
Generate data from:
Gaussian distributions
Non-Gaussian distributions
Compare performance of GDA vs Logistic Regression. When does each work better?
# Your implementation here
4.3 🟡 Naive Bayes for Text¶
Implement Multinomial Naive Bayes:
P(x|y) = Π P(xᵢ|y)^count(xᵢ)
Apply to 20 Newsgroups dataset. Calculate accuracy on test set.
# Your implementation here
4.4 🔴 Laplace Smoothing¶
Compare Naive Bayes with different smoothing:
No smoothing (α=0)
Laplace smoothing (α=1)
Lidstone smoothing (α=0.1, 0.5, 2.0)
How does α affect rare words?
# Your implementation here
4.5 🟢 Event Models¶
Implement both:
Multinomial event model
Bernoulli event model
For email spam detection. Compare accuracy.
# Your implementation here
4.6 🟡 Continuous Features¶
Extend Naive Bayes to continuous features:
Assume Gaussian: P(xⱼ|y) ~ N(μⱼy, σⱼy²)
Estimate parameters
Test on iris dataset
# Your implementation here
4.7 🔴 Kernel Density Estimation¶
Implement KDE for class-conditional densities:
P(x|y) = (1/(nh)) Σ K((x-xᵢ)/h)
Use Gaussian kernel. Compare with parametric GDA.
# Your implementation here
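A one-dimensional Gaussian-kernel version of the estimator can be sketched as follows (function name ours; note the 1/(nh) normalization that makes the result a proper density):

```python
import numpy as np

def kde_gaussian(x_query, samples, h):
    # p(x) ≈ (1/(n h)) * sum_i K((x - x_i) / h), K the standard normal pdf
    u = (x_query - samples) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum() / (len(samples) * h)
```

For class-conditional densities, fit one KDE per class on that class's samples, then classify via Bayes' rule with the class priors.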
4.8 🟡 Discriminative vs Generative¶
Theoretical comparison:
Sample complexity
Asymptotic error
When to prefer each
Empirical study with varying training size.
# Your implementation here
4.10 🏆 Semi-Supervised Learning¶
Use generative models for semi-supervised learning:
Train on labeled data
Use EM to leverage unlabeled data
Compare with supervised-only
# Your implementation here
Lecture 5: SVMs (10 Problems)¶
5.1 🟢 Margin Calculation¶
For linearly separable data:
```python
def compute_margin(X, y, w, b):
    # Return geometric margin
    pass

# Visualize margin for 2D dataset
```
5.2 🟡 Kernel Functions¶
Implement kernels:
Linear: K(x,z) = xᵀz
Polynomial: K(x,z) = (xᵀz + c)^d
RBF: K(x,z) = exp(-γ||x-z||²)
Sigmoid: K(x,z) = tanh(κxᵀz + Θ)
Test on 2D XOR problem.
# Your implementation here
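The four kernels listed above translate directly to NumPy; a minimal sketch for single vector pairs (function names and default hyperparameters are ours):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)
```

For training you would evaluate these over all pairs to build the Gram matrix K with K[i, j] = K(xᵢ, xⱼ).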
5.3 🟡 Soft Margin SVM¶
Implement soft margin:
minimize: (1/2)||w||² + C × Σξᵢ
subject to: yᵢ(wᵀxᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
Solve using quadratic programming.
# Your implementation here
5.4 🔴 SVM Dual Formulation¶
Derive dual form:
maximize: Σαᵢ - (1/2)ΣΣαᵢαⱼyᵢyⱼxᵢᵀxⱼ
subject to: Σαᵢyᵢ = 0, 0 ≤ αᵢ ≤ C
Implement and compare with primal.
# Your implementation here
5.5 🟢 Hyperparameter Tuning¶
Grid search for SVM:
C ∈ [0.001, 0.01, 0.1, 1, 10, 100]
γ ∈ [0.001, 0.01, 0.1, 1, 10] (for RBF)
Create heatmap of accuracy.
# Your implementation here
5.6 🟡 Support Vector Analysis¶
After training:
Identify support vectors
Calculate percentage of SVs
Visualize SVs on plot
How does C affect # of SVs?
# Your implementation here
5.7 🔴 Multi-class SVM¶
Implement:
One-vs-One (OvO)
One-vs-All (OvA)
Compare:
Training time
Number of models
Accuracy
Confusion matrix
# Your implementation here
5.8 🟡 SVM for Regression (SVR)¶
Implement ε-insensitive loss:
L(y, f(x)) = max(0, |y - f(x)| - ε)
Compare with linear regression. Visualize ε-tube.
# Your implementation here
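The ε-insensitive loss above is a one-liner; a minimal sketch (function name ours) that you can reuse when visualizing the ε-tube:

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # Zero loss inside the eps-tube, linear loss outside it
    return np.maximum(0.0, np.abs(y - y_pred) - eps)
```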
5.9 🔴 Kernel PCA¶
Implement kernel PCA:
Compute kernel matrix K
Center in feature space
Eigendecomposition
Project data
Apply to Swiss roll dataset.
# Your implementation here
5.10 🏆 Custom Kernel Design¶
Design problem-specific kernel for:
String matching (sequence alignment)
Graph similarity
Image comparison
Implement and test on real data.
# Your implementation here
Lecture 6: Neural Networks - Basics (15 Problems)¶
6.1 🟢 Activation Functions¶
Implement and plot:
Sigmoid
Tanh
ReLU
Leaky ReLU
ELU
Swish
Compare derivatives.
# Your implementation here
6.2 🟡 Backpropagation Step-by-Step¶
For 3-layer network (2-4-1):
Forward pass (show all activations)
Backward pass (show all gradients)
Parameter updates
Verify with numerical gradients
# Your implementation here
6.3 🟡 Gradient Checking¶
Implement numerical gradient:
grad_approx = (J(θ+ε) - J(θ-ε)) / (2ε)
Compare with analytical gradient. When is difference acceptable?
# Your implementation here
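The central-difference formula above, applied coordinate-by-coordinate, gives a full numerical gradient; a minimal sketch (function name ours, J any scalar cost function):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    # Central differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2 eps)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```

A relative difference between this and the analytical gradient below about 1e-7 is commonly taken as a pass.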
6.4 🔴 Weight Initialization¶
Compare initialization schemes:
Zero initialization
Random uniform [-1, 1]
Xavier/Glorot
He initialization
Train on MNIST, compare convergence.
# Your implementation here
6.5 🟢 Mini-batch Training¶
Implement mini-batch training:
Shuffle data each epoch
Process batches of size 32
Update parameters after each batch
Track loss per epoch
# Your implementation here
6.6 🟡 Learning Rate Schedules¶
Implement:
Step decay: α × 0.5 every 10 epochs
Exponential decay: α × e^(-kt)
1/t decay: α₀ / (1 + kt)
Cosine annealing
Compare convergence curves.
# Your implementation here
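The four schedules listed above can be sketched as small functions of the step/epoch counter (names and default constants are ours):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    # Halve the rate every `every` epochs
    return alpha0 * drop ** (epoch // every)

def exp_decay(alpha0, t, k=0.1):
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, t, k=0.1):
    return alpha0 / (1 + k * t)

def cosine_annealing(alpha0, t, T):
    # Decays from alpha0 to 0 over T steps along a half cosine
    return 0.5 * alpha0 * (1 + np.cos(np.pi * t / T))
```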
6.7 🔴 Batch Normalization¶
Implement batch norm layer:
```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_norm = (X - mu) / np.sqrt(var + eps)
    return gamma * X_norm + beta

# Train with/without, compare
```
6.8 🟡 Dropout Regularization¶
Implement dropout:
Training: randomly drop neurons
Testing: use all neurons, scale activations
Try p = [0.1, 0.3, 0.5, 0.7]
Compare overfitting
# Your implementation here
6.9 🔴 Vanishing Gradients¶
Create deep network (10+ layers) with sigmoid:
Plot gradient magnitude per layer
Observe vanishing gradients
Switch to ReLU, compare
Try residual connections
# Your implementation here
6.10 🟢 Architecture Search¶
Try architectures for MNIST:
[256]
[512, 256]
[512, 256, 128]
[1024, 512, 256, 128]
Which gives best accuracy vs parameters?
# Your implementation here
6.11 🟡 Early Stopping¶
Implement early stopping:
Monitor validation loss
Save best model
Restore if no improvement for patience=10
Compare with fixed epochs.
# Your implementation here
6.12 🔴 Learning Curves¶
Plot learning curves:
Training/validation loss vs epoch
Training/validation accuracy vs epoch
Diagnose: overfitting, underfitting, or good fit
# Your implementation here
6.13 🟡 Optimization Algorithms¶
Implement and compare:
SGD
SGD + Momentum
AdaGrad
RMSprop
Adam
On same network and dataset.
# Your implementation here
6.14 🔴 Neural Network from Scratch¶
Implement complete neural network without libraries:
Forward propagation
Backpropagation
Multiple layers
Activation functions
Training loop
Test on iris dataset.
# Your implementation here
6.15 🏆 Interpretability¶
Visualize what network learned:
Plot first layer weights as images
Activation maximization
Gradient-based saliency maps
Layer-wise relevance propagation
# Your implementation here
Comprehensive Projects (10 Projects)¶
The following sections contain major projects that integrate multiple concepts.
Project 1: House Price Prediction 🏆¶
Goal: Predict house prices with RMSE < $50k
Dataset: Kaggle House Prices
Techniques:
Feature engineering
Polynomial features
Ridge/Lasso regularization
Ensemble methods
Deliverables:
EDA report with 10+ visualizations
Model comparison table
Final model with predictions
Written analysis
# Project 1 implementation
Project 2: Spam Email Detection 🏆¶
Goal: Classify emails with > 95% F1-score
Dataset: Enron spam dataset
Techniques:
Text preprocessing
TF-IDF vectorization
Naive Bayes
Logistic regression
SVM
Deliverables:
Preprocessing pipeline
Model comparison
Error analysis
Deployment-ready classifier
# Project 2 implementation
Project 3: MNIST Digit Recognition 🏆¶
Goal: Achieve > 98% test accuracy
Dataset: MNIST (70k images)
Techniques:
Neural networks
CNNs
Data augmentation
Ensemble
Deliverables:
Multiple architectures tested
Learning curves
Confusion matrix analysis
Misclassified examples study
# Project 3 implementation
Solutions and Hints¶
General Guidelines¶
For All Exercises:
✅ Set random seed: np.random.seed(42)
✅ Split data before anything else
✅ Scale features when needed
✅ Use cross-validation
✅ Visualize results
✅ Document code
✅ Interpret findings
Common Mistakes:
❌ Data leakage (scaling before split)
❌ Not checking for NaN/inf
❌ Ignoring class imbalance
❌ Not validating assumptions
❌ Overfitting to test set
Specific Hints¶
Linear Regression:
Feature scaling crucial for gradient descent
Normal equation fails if XᵀX not invertible
Check condition number of XᵀX
Learning rate typically 0.01 - 0.1
Logistic Regression:
Initialize weights near zero
Check sigmoid doesn’t overflow
Use log-sum-exp trick for numerical stability
Class weights help with imbalance
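The overflow and log-sum-exp hints above can be combined into one numerically stable loss; a sketch (function name ours), using the identity log(1 + e⁻ᶻ) = log(1 + e⁻|ᶻ|) + max(-z, 0):

```python
import numpy as np

def stable_log_loss(z, y):
    # Per-example logistic loss -[y log σ(z) + (1-y) log(1-σ(z))],
    # rewritten as log(1 + exp(-|z|)) + max(z, 0) - y*z so that no
    # large positive number is ever exponentiated
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0) - y * z
```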
Neural Networks:
Xavier init for tanh: U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]
He init for ReLU: N(0, √(2/n_in))
Gradient clipping prevents exploding gradients
Batch norm reduces internal covariate shift
SVM:
Scale features to [0,1] or standardize
Start with RBF kernel
C controls trade-off: small C → large margin
γ controls decision boundary smoothness
Progress Tracker¶
By Lecture¶
Lecture 1: 0/10 problems
Lecture 2: 0/10 problems
Lecture 3: 0/10 problems
Lecture 4: 0/10 problems
Lecture 5: 0/10 problems
Lecture 6: 0/15 problems
Projects¶
Project 1: House Prices
Project 2: Spam Detection
Project 3: MNIST Digits
Project 4: Customer Segmentation
Project 5: Sentiment Analysis
Project 6: Image Classification
Project 7: Anomaly Detection
Project 8: Recommender System
Project 9: Time Series
Project 10: Reinforcement Learning
Challenges¶
Challenge 1: Algorithms from Scratch
Challenge 2: Kaggle Competition
Challenge 3: Paper Implementation
Challenge 4: AutoML System
Challenge 5: Interpretable AI
Total Progress: Track your journey to CS229 mastery!
Good luck with your machine learning journey! 🚀