Lecture 1: Linear Regression (10 Problems)

1.1 🟢 Basic Implementation

Implement linear regression without using numpy’s advanced functions:

Test on: y = 2x + 3 + noise

def manual_linear_regression(X, y):
    # Use only basic Python and loops
    # Return theta (parameters)
    pass

# Your implementation here
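A possible pure-Python starter (the learning rate, epoch count, and test data below are illustrative, not prescribed):

```python
import random

def manual_linear_regression(X, y, lr=0.03, epochs=5000):
    # Batch gradient descent with plain Python loops only: X is a list of
    # feature lists, y a list of targets; theta[0] is the intercept.
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(epochs):
        grads = [0.0] * (n + 1)
        for i in range(m):
            pred = theta[0] + sum(theta[j + 1] * X[i][j] for j in range(n))
            err = (pred - y[i]) / m
            grads[0] += err
            for j in range(n):
                grads[j + 1] += err * X[i][j]
        for j in range(n + 1):
            theta[j] -= lr * grads[j]
    return theta

# Sanity check on y = 2x + 3 + noise
random.seed(42)
X = [[i / 10.0] for i in range(100)]
y = [2 * x[0] + 3 + random.gauss(0, 0.1) for x in X]
theta = manual_linear_regression(X, y)
```

With this small, low-noise dataset the recovered parameters should land near (3, 2).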

1.2 🟡 Gradient Descent Variants

Implement and compare:

  • Batch Gradient Descent

  • Stochastic Gradient Descent

  • Mini-batch GD (batch_size=32)

On a dataset with m = 10,000 examples, compare:


  • Convergence speed (iterations to converge)

  • Final cost

  • Time per iteration

# Your implementation here

1.3 🟡 Learning Rate Schedule

Implement learning rate decay:

α(t) = α₀ / (1 + decay_rate × t)

Compare with constant learning rate on California Housing.

# Your implementation here
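The decay formula itself is a one-liner; a sketch with illustrative α₀ and decay_rate values:

```python
def lr_schedule(alpha0, decay_rate, t):
    # 1/t decay: the rate is halved once decay_rate * t reaches 1
    return alpha0 / (1 + decay_rate * t)

# With alpha0 = 0.1 and decay_rate = 0.01, the rate shrinks monotonically
rates = [lr_schedule(0.1, 0.01, t) for t in (0, 100, 300)]
```

Plugging this into a gradient-descent loop (one `t` per epoch or per iteration, your choice) gives the comparison against a constant rate.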

1.4 🔴 Closed-Form vs Iterative

For different feature dimensions n = [10, 100, 1000, 5000]:

  • Time Normal Equation

  • Time Gradient Descent (1000 iterations)

  • Plot time vs n

  • Identify crossover point

# Your implementation here

1.5 🟢 Feature Scaling Impact

Generate data where features have different scales (e.g., [0.001, 1000]).

  • Train without scaling

  • Train with standardization

  • Compare convergence

# Your implementation here

1.6 🟡 Polynomial Regression

Fit polynomials of degree 1-10 to: y = sin(2πx) + noise

  • Use cross-validation to select degree

  • Plot train vs validation error

  • Identify optimal degree

# Your implementation here

1.7 🔴 Regularization Path

For Ridge regression, try λ ∈ [10⁻⁶, 10⁶]:

  • Plot coefficient values vs log(λ)

  • Identify when coefficients shrink to zero

  • Find optimal λ using CV

# Your implementation here

1.8 🟡 Weighted Linear Regression

Implement locally weighted regression:

weight[i] = exp(-||x - xᵢ||² / (2τ²))

  • Try different τ values

  • Compare with standard linear regression

  • Visualize predictions

# Your implementation here
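One way to sketch a single-query prediction via weighted normal equations (function name and τ value are illustrative):

```python
import numpy as np

def locally_weighted_predict(X, y, x_query, tau):
    # Weight each training point by exp(-||x_query - x_i||^2 / (2 tau^2)),
    # then solve the weighted normal equations for this one query point
    Xb = np.column_stack([np.ones(len(X)), X])
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    XtW = Xb.T * w                       # equivalent to Xb.T @ diag(w)
    theta = np.linalg.solve(XtW @ Xb, XtW @ y)
    return float(xq @ theta)

# On exactly linear data every tau recovers the line
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1
pred = locally_weighted_predict(X, y, np.array([0.5]), tau=0.3)
```

Note that a fresh system is solved per query point, which is what makes LWR expensive at prediction time.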

1.9 🔴 Online Learning

Implement online gradient descent:

  • Receive examples one at a time

  • Update parameters after each example

  • Track running average cost

  • Compare with batch learning

# Your implementation here
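A minimal least-mean-squares sketch of the online loop (the noiseless stream and learning rate are illustrative):

```python
import numpy as np

def online_sgd(stream, n_features, lr=0.05):
    # Update theta after every single example and track the running
    # average of the per-example squared error
    theta = np.zeros(n_features + 1)
    avg_cost, costs = 0.0, []
    for t, (x, target) in enumerate(stream, start=1):
        xb = np.concatenate([[1.0], x])
        err = float(xb @ theta - target)
        theta -= lr * err * xb
        avg_cost += (err ** 2 - avg_cost) / t    # incremental running mean
        costs.append(avg_cost)
    return theta, costs

# Stream examples from y = 2x + 3, one at a time
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, (2000, 1))
theta, costs = online_sgd(((x, 2 * x[0] + 3) for x in xs), n_features=1)
```

Because the stream is consumed as a generator, no example is ever seen twice, in contrast with batch learning.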

1.10 🏆 Distributed Gradient Descent

Simulate distributed training:

  • Split data across 4 “workers”

  • Each computes local gradients

  • Aggregate gradients

  • Update parameters

  • Compare speedup vs communication cost

# Your implementation here

Lecture 2: Logistic Regression (10 Problems)

2.1 🟢 Sigmoid Properties

Prove mathematically:

  1. σ’(z) = σ(z)(1 - σ(z))

  2. 1 - σ(z) = σ(-z)

  3. σ(z) → 1 as z → ∞

Plot sigmoid, derivative, and second derivative.

# Your implementation here
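The plots are left to matplotlib, but properties 1-3 can also be checked numerically; a sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 201)
s = sigmoid(z)

# Property 1: sigma'(z) = sigma(z)(1 - sigma(z)), via central differences
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = s * (1 - s)

# Property 2: 1 - sigma(z) = sigma(-z)
reflected = sigmoid(-z)
```

A numerical check is no substitute for the proof, but it catches algebra slips quickly.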

2.2 🟡 Decision Boundary Visualization

For 2D dataset:

  • Train logistic regression

  • Plot decision boundary

  • Show regions with P(y=1|x) > 0.5

  • Add confidence contours

# Your implementation here

2.3 🟡 Multi-class Classification

Implement One-vs-All:

def train_one_vs_all(X, y, num_classes):
    # Train K binary classifiers
    # Return K parameter vectors
    pass

# Test on iris dataset (3 classes)

2.4 🔴 Newton’s Method

Implement Newton’s method for logistic regression:

θ := θ - H⁻¹∇J(θ)

where H is the Hessian matrix of J(θ). Compare convergence speed with gradient descent.

# Your implementation here
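A sketch of the Newton update for logistic regression (the 1-D synthetic data and iteration count are illustrative; note that perfectly separable data would send θ to infinity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=8):
    # theta := theta - H^{-1} grad, with grad = X^T (h - y) / m and
    # H = X^T diag(h (1 - h)) X / m
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m
        H = (X * (h * (1 - h))[:, None]).T @ X / m
        theta -= np.linalg.solve(H, grad)
    return theta

# Two overlapping 1-D classes, with an intercept column
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
X = np.column_stack([np.ones(200), x])
y = np.concatenate([np.zeros(100), np.ones(100)])
theta = newton_logistic(X, y)
acc = float(((sigmoid(X @ theta) > 0.5) == y).mean())
```

Newton typically converges in well under ten iterations here, versus hundreds for plain gradient descent.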

2.5 🟢 Regularized Logistic Regression

Add L2 regularization:

J(θ) = -(1/m)Σ[y log h + (1-y) log(1-h)] + (λ/2m)Σθⱼ²

Try λ = [0.001, 0.01, 0.1, 1.0, 10]. Plot validation error vs λ.

# Your implementation here

2.6 🟡 Imbalanced Classes

Create dataset: 95% class 0, 5% class 1

  • Train standard logistic regression

  • Calculate: accuracy, precision, recall, F1

  • Apply class weights

  • Compare performance

# Your implementation here

2.7 🔴 Calibration Analysis

For logistic regression predictions:

  • Bin predictions into 10 buckets

  • For each bucket, calculate actual positive rate

  • Plot calibration curve

  • Compare well-calibrated vs poorly-calibrated

# Your implementation here

2.8 🟡 Feature Engineering

For text classification:

  • Implement TF-IDF vectorization

  • Train logistic regression

  • Identify most important words

  • Compare with bag-of-words

# Your implementation here

2.9 🔴 Softmax Regression

Implement softmax (multi-class logistic regression):

P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)

Derive gradient and implement from scratch. Test on MNIST digits.

# Your implementation here
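A from-scratch sketch of softmax regression (three Gaussian blobs stand in for MNIST here; the learning rate and epoch count are illustrative):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, K, lr=0.1, epochs=1000):
    # Gradient of the cross-entropy loss: X^T (softmax(X Theta) - Y) / m
    m, n = X.shape
    Theta = np.zeros((n, K))
    Y = np.eye(K)[y]                       # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ Theta)
        Theta -= lr * X.T @ (P - Y) / m
    return Theta

rng = np.random.default_rng(1)
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)
Xb = np.column_stack([np.ones(150), X])
Theta = train_softmax(Xb, y, K=3)
preds = np.argmax(Xb @ Theta, axis=1)
```

The row-max subtraction inside `softmax` is the standard trick that keeps `exp` from overflowing; it cancels in the ratio.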

2.10 🏆 Adversarial Examples

Generate adversarial examples:

  • Find minimal perturbation that flips prediction

  • Visualize perturbations

  • Test adversarial training

  • Measure robustness

# Your implementation here

Lecture 3: Regularization (10 Problems)

3.1 🟢 Overfitting Demonstration

Generate data: y = x² + noise (n=20 samples)

  • Fit polynomials degree 1-15

  • Plot all fits

  • Show overfitting visually

# Your implementation here

3.2 🟡 Cross-Validation Implementation

Implement k-fold CV from scratch:

def k_fold_cv(X, y, k, model):
    # Split data into k folds
    # Train on k-1, validate on 1
    # Return average validation error
    pass

# Your implementation here
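One generic sketch, taking the model as a `fit`/`predict`/`metric` triple (the least-squares usage below is illustrative):

```python
import numpy as np

def k_fold_cv(X, y, k, fit, predict, metric):
    # Shuffle indices, split into k folds, train on k-1 folds and
    # validate on the held-out fold; return the mean validation score
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr], y[tr])
        scores.append(metric(y[folds[i]], predict(model, X[folds[i]])))
    return float(np.mean(scores))

# Usage with ordinary least squares as the model
np.random.seed(0)
X = np.random.rand(100, 1)
y = 3 * X.ravel() + 1 + 0.01 * np.random.randn(100)
add1 = lambda X: np.column_stack([np.ones(len(X)), X])
fit = lambda X, y: np.linalg.lstsq(add1(X), y, rcond=None)[0]
predict = lambda th, X: add1(X) @ th
mse = lambda y, p: float(np.mean((y - p) ** 2))
cv_error = k_fold_cv(X, y, 5, fit, predict, mse)
```

Shuffling before splitting matters: if the data is ordered by class or by time, contiguous folds leak that structure.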

3.3 🟡 Ridge vs Lasso Comparison

On dataset with correlated features:

  • Plot regularization paths for both

  • Identify when Lasso sets coefficients to 0

  • Compare prediction performance

# Your implementation here

3.4 🔴 Elastic Net Tuning

Grid search over:

  • α ∈ [0, 0.25, 0.5, 0.75, 1.0]

  • λ ∈ [0.001, 0.01, 0.1, 1.0, 10]

Create heatmap of CV error. Identify optimal combination.

# Your implementation here

3.5 🟢 Early Stopping

Implement early stopping:

  • Monitor validation error

  • Stop when no improvement for 10 epochs

  • Restore best weights

  • Compare with fixed epochs

# Your implementation here
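The stopping rule itself can be isolated from the training loop; a sketch with an illustrative loss trace:

```python
def early_stopping(val_losses, patience=10):
    # Return (stop_epoch, best_epoch): stop once the validation loss has
    # failed to improve for `patience` consecutive epochs; the caller
    # restores the weights saved at best_epoch
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Loss improves until epoch 2, then degrades for the rest of training
val_losses = [1.0, 0.5, 0.4] + [0.4 + 0.01 * i for i in range(1, 15)]
stop_epoch, best_epoch = early_stopping(val_losses, patience=10)
```

In a real loop you would checkpoint the model whenever `best_epoch` advances, rather than replaying a recorded trace.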

3.6 🟡 Dropout Implementation

Implement dropout regularization:

import numpy as np

def dropout_forward(X, p=0.5):
    # Inverted dropout: drop units with probability p and scale the
    # survivors by 1/(1-p) so expected activations are unchanged at test time
    mask = (np.random.rand(*X.shape) > p) / (1 - p)
    return X * mask, mask

# Test on neural network

3.7 🔴 Bayesian Interpretation

Show Ridge regression equivalent to MAP estimation with:

  • Gaussian prior on θ: θ ~ N(0, (1/λ)I)

  • Likelihood: y|x,θ ~ N(θᵀx, σ²)

Derive λ from prior variance.

# Your implementation here

3.8 🟡 Feature Selection

Compare feature selection methods:

  • Forward selection

  • Backward elimination

  • Lasso

  • Random Forest importance

On dataset with 50 features (10 relevant).

# Your implementation here

3.9 🔴 Non-convex Regularization

Implement L0 regularization (count non-zeros):

J(θ) = MSE + λ × ||θ||₀

Use greedy approximation or IHT algorithm.

# Your implementation here

3.10 🏆 Regularization for Deep Learning

Compare regularization in deep neural networks:

  • L2 weight decay

  • Dropout

  • Batch normalization

  • Data augmentation

  • Label smoothing

On CIFAR-10 dataset.

# Your implementation here

Lecture 4: Generative Models (10 Problems)

4.1 🟢 GDA Implementation

Implement Gaussian Discriminant Analysis:

def fit_gda(X, y):
    # Estimate μ₀, μ₁, Σ for each class
    # Return parameters
    pass

# Test on synthetic 2D data
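A sketch of the parameter estimates (maximum likelihood with a single pooled covariance; the synthetic data is illustrative):

```python
import numpy as np

def fit_gda(X, y):
    # Class prior phi, per-class means mu0/mu1, and one covariance
    # matrix Sigma pooled over both classes
    phi = float(y.mean())
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 0, mu0, mu1)
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, (500, 2)),
               rng.normal([2.0, 0.0], 1.0, (500, 2))])
y = np.concatenate([np.zeros(500), np.ones(500)])
phi, mu0, mu1, Sigma = fit_gda(X, y)
```

Sharing Sigma across classes is what makes the GDA decision boundary linear; per-class covariances would give quadratic boundaries instead.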

4.2 🟡 GDA vs Logistic Regression

Generate data from:

  1. Gaussian distributions

  2. Non-Gaussian distributions

Compare performance of GDA vs Logistic Regression. When does each work better?

# Your implementation here

4.3 🟡 Naive Bayes for Text

Implement Multinomial Naive Bayes:

P(x|y) = Π P(xᵢ|y)^count(xᵢ)

Apply to 20 Newsgroups dataset. Calculate accuracy on test set.

# Your implementation here

4.4 🔴 Laplace Smoothing

Compare Naive Bayes with different smoothing:

  • No smoothing (α=0)

  • Laplace smoothing (α=1)

  • Lidstone smoothing (α=0.1, 0.5, 2.0)

How does α affect rare words?

# Your implementation here

4.5 🟢 Event Models

Implement both:

  • Multinomial event model

  • Bernoulli event model

For email spam detection. Compare accuracy.

# Your implementation here

4.6 🟡 Continuous Features

Extend Naive Bayes to continuous features:

  • Assume Gaussian: P(xⱼ|y) ~ N(μⱼy, σⱼy²)

  • Estimate parameters

  • Test on iris dataset

# Your implementation here

4.7 🔴 Kernel Density Estimation

Implement KDE for class-conditional densities:

P(x|y) = (1/(nh)) Σᵢ K((x - xᵢ)/h)

Use Gaussian kernel. Compare with parametric GDA.

# Your implementation here
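A 1-D sketch with a Gaussian kernel (bandwidth and sample size are illustrative); the 1/(nh) factor is what makes the estimate integrate to one:

```python
import numpy as np

def kde_gaussian(x, samples, h):
    # p(x) = (1/(n h)) sum_i K((x - x_i) / h), K = standard normal pdf
    u = (x - samples[:, None]) / h           # shape (n, len(x))
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / h

rng = np.random.default_rng(0)
samples = rng.standard_normal(2000)
grid = np.linspace(-6, 6, 601)
p = kde_gaussian(grid, samples, h=0.2)
integral = float(p.sum() * (grid[1] - grid[0]))
```

For class-conditional densities, run this once per class on that class's samples and combine with the class priors.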

4.8 🟡 Discriminative vs Generative

Theoretical comparison:

  • Sample complexity

  • Asymptotic error

  • When to prefer each

Empirical study with varying training size.

# Your implementation here

4.9 🔴 Hidden Markov Models

Implement HMM for sequence labeling:

  • Forward algorithm (likelihood)

  • Viterbi algorithm (best path)

  • Baum-Welch (parameter learning)

Apply to POS tagging.

# Your implementation here

4.10 🏆 Semi-Supervised Learning

Use generative models for semi-supervised learning:

  • Train on labeled data

  • Use EM to leverage unlabeled data

  • Compare with supervised-only

# Your implementation here

Lecture 5: SVMs (10 Problems)

5.1 🟢 Margin Calculation

For linearly separable data:

def compute_margin(X, y, w, b):
    # Return geometric margin
    pass

# Visualize margin for 2D dataset

5.2 🟡 Kernel Functions

Implement kernels:

  • Linear: K(x,z) = xᵀz

  • Polynomial: K(x,z) = (xᵀz + c)^d

  • RBF: K(x,z) = exp(-γ||x-z||²)

  • Sigmoid: K(x,z) = tanh(κxᵀz + Θ)

Test on 2D XOR problem.

# Your implementation here
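The four kernels as Gram-matrix functions (default hyperparameters are illustrative; the sigmoid kernel is not positive semi-definite for all κ, Θ):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, c=1.0, d=3):
    return (X @ Z.T + c) ** d

def rbf_kernel(X, Z, gamma=1.0):
    # ||x - z||^2 expanded as x.x + z.z - 2 x.z, computed in one shot
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :] - 2 * X @ Z.T)
    return np.exp(-gamma * sq)

def sigmoid_kernel(X, Z, kappa=1.0, theta0=0.0):
    return np.tanh(kappa * X @ Z.T + theta0)

# Two orthonormal points: inner product 0, squared distance 2
X = np.array([[1.0, 0.0], [0.0, 1.0]])
K_rbf = rbf_kernel(X, X)
```

For the XOR test, only the nonlinear kernels can separate the classes; the linear kernel cannot.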

5.3 🟡 Soft Margin SVM

Implement soft margin:

minimize: (1/2)||w||² + C × Σξᵢ
subject to: yᵢ(wᵀxᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0

Solve using quadratic programming.

# Your implementation here

5.4 🔴 SVM Dual Formulation

Derive dual form:

maximize: Σαᵢ - (1/2)ΣΣαᵢαⱼyᵢyⱼxᵢᵀxⱼ
subject to: Σαᵢyᵢ = 0, 0 ≤ αᵢ ≤ C

Implement and compare with primal.

# Your implementation here

5.5 🟢 Hyperparameter Tuning

Grid search for SVM:

  • C ∈ [0.001, 0.01, 0.1, 1, 10, 100]

  • γ ∈ [0.001, 0.01, 0.1, 1, 10] (for RBF)

Create heatmap of accuracy.

# Your implementation here

5.6 🟡 Support Vector Analysis

After training:

  • Identify support vectors

  • Calculate percentage of SVs

  • Visualize SVs on plot

  • How does C affect # of SVs?

# Your implementation here

5.7 🔴 Multi-class SVM

Implement:

  • One-vs-One (OvO)

  • One-vs-All (OvA)

Compare:

  • Training time

  • Number of models

  • Accuracy

  • Confusion matrix

# Your implementation here

5.8 🟡 SVM for Regression (SVR)

Implement ε-insensitive loss:

L(y, f(x)) = max(0, |y - f(x)| - ε)

Compare with linear regression. Visualize ε-tube.

# Your implementation here
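The loss itself is a one-liner worth getting right before touching the optimizer (ε default is illustrative):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    # Zero inside the eps-tube around the prediction, linear outside it
    return np.maximum(0.0, np.abs(y - f) - eps)

# A point inside the tube costs nothing; one outside costs the overshoot
losses = eps_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 1.3]))
```

The flat region inside the tube is what makes SVR solutions sparse in support vectors.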

5.9 🔴 Kernel PCA

Implement kernel PCA:

  • Compute kernel matrix K

  • Center in feature space

  • Eigendecomposition

  • Project data

Apply to Swiss roll dataset.

# Your implementation here

5.10 🏆 Custom Kernel Design

Design problem-specific kernel for:

  • String matching (sequence alignment)

  • Graph similarity

  • Image comparison

Implement and test on real data.

# Your implementation here

Lecture 6: Neural Networks - Basics (15 Problems)

6.1 🟢 Activation Functions

Implement and plot:

  • Sigmoid

  • Tanh

  • ReLU

  • Leaky ReLU

  • ELU

  • Swish

Compare derivatives.

# Your implementation here

6.2 🟡 Backpropagation Step-by-Step

For 3-layer network (2-4-1):

  • Forward pass (show all activations)

  • Backward pass (show all gradients)

  • Parameter updates

  • Verify with numerical gradients

# Your implementation here

6.3 🟡 Gradient Checking

Implement numerical gradient:

grad_approx = (J(θ+ε) - J(θ-ε)) / (2ε)

Compare with analytical gradient. When is difference acceptable?

# Your implementation here
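A sketch of the central-difference check against a known analytic gradient (the quadratic test function is illustrative):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    # Central difference per coordinate:
    # (J(theta + eps e_i) - J(theta - eps e_i)) / (2 eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def relative_error(a, b):
    return float(np.linalg.norm(a - b) /
                 (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12))

# J = sum(theta^2) has analytic gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
err = relative_error(
    numerical_gradient(lambda t: float(np.sum(t ** 2)), theta),
    2 * theta)
```

A common rule of thumb: relative error below ~1e-7 is excellent, around 1e-4 is suspicious, and above 1e-2 almost certainly means a bug.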

6.4 🔴 Weight Initialization

Compare initialization schemes:

  • Zero initialization

  • Random uniform [-1, 1]

  • Xavier/Glorot

  • He initialization

Train on MNIST, compare convergence.

# Your implementation here

6.5 🟢 Mini-batch Training

Implement mini-batch training:

  • Shuffle data each epoch

  • Process batches of size 32

  • Update parameters after each batch

  • Track loss per epoch

# Your implementation here

6.6 🟡 Learning Rate Schedules

Implement:

  • Step decay: α × 0.5 every 10 epochs

  • Exponential decay: α × e^(-kt)

  • 1/t decay: α₀ / (1 + kt)

  • Cosine annealing

Compare convergence curves.

# Your implementation here

6.7 🔴 Batch Normalization

Implement batch norm layer:

import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_norm = (X - mu) / np.sqrt(var + eps)
    return gamma * X_norm + beta

# Train with/without, compare

6.8 🟡 Dropout Regularization

Implement dropout:

  • Training: randomly drop neurons

  • Testing: use all neurons, scale activations

  • Try p = [0.1, 0.3, 0.5, 0.7]

  • Compare overfitting

# Your implementation here

6.9 🔴 Vanishing Gradients

Create deep network (10+ layers) with sigmoid:

  • Plot gradient magnitude per layer

  • Observe vanishing gradients

  • Switch to ReLU, compare

  • Try residual connections

# Your implementation here

6.11 🟡 Early Stopping

Implement early stopping:

  • Monitor validation loss

  • Save best model

  • Restore if no improvement for patience=10

Compare with fixed epochs.

# Your implementation here

6.12 🔴 Learning Curves

Plot learning curves:

  • Training/validation loss vs epoch

  • Training/validation accuracy vs epoch

  • Diagnose: overfitting, underfitting, or good fit

# Your implementation here

6.13 🟡 Optimization Algorithms

Implement and compare:

  • SGD

  • SGD + Momentum

  • AdaGrad

  • RMSprop

  • Adam

On same network and dataset.

# Your implementation here

6.14 🔴 Neural Network from Scratch

Implement complete neural network without libraries:

  • Forward propagation

  • Backpropagation

  • Multiple layers

  • Activation functions

  • Training loop

Test on iris dataset.

# Your implementation here

6.15 🏆 Interpretability

Visualize what network learned:

  • Plot first layer weights as images

  • Activation maximization

  • Gradient-based saliency maps

  • Layer-wise relevance propagation

# Your implementation here

Comprehensive Projects (10 Projects)

The following sections contain major projects that integrate multiple concepts.

Project 1: House Price Prediction 🏆

Goal: Predict house prices with RMSE < $50k

Dataset: Kaggle House Prices
Techniques:

  • Feature engineering

  • Polynomial features

  • Ridge/Lasso regularization

  • Ensemble methods

Deliverables:

  • EDA report with 10+ visualizations

  • Model comparison table

  • Final model with predictions

  • Written analysis

# Project 1 implementation

Project 2: Spam Email Detection 🏆

Goal: Classify emails with > 95% F1-score

Dataset: Enron spam dataset
Techniques:

  • Text preprocessing

  • TF-IDF vectorization

  • Naive Bayes

  • Logistic regression

  • SVM

Deliverables:

  • Preprocessing pipeline

  • Model comparison

  • Error analysis

  • Deployment-ready classifier

# Project 2 implementation

Project 3: MNIST Digit Recognition 🏆

Goal: Achieve > 98% test accuracy

Dataset: MNIST (70k images)
Techniques:

  • Neural networks

  • CNNs

  • Data augmentation

  • Ensemble

Deliverables:

  • Multiple architectures tested

  • Learning curves

  • Confusion matrix analysis

  • Misclassified examples study

# Project 3 implementation

Solutions and Hints

General Guidelines

For All Exercises:

  1. ✅ Set random seed: np.random.seed(42)

  2. ✅ Split data before anything else

  3. ✅ Scale features when needed

  4. ✅ Use cross-validation

  5. ✅ Visualize results

  6. ✅ Document code

  7. ✅ Interpret findings

Common Mistakes:

  • ❌ Data leakage (scaling before split)

  • ❌ Not checking for NaN/inf

  • ❌ Ignoring class imbalance

  • ❌ Not validating assumptions

  • ❌ Overfitting to test set

Specific Hints

Linear Regression:

  • Feature scaling crucial for gradient descent

  • Normal equation fails if XᵀX not invertible

  • Check condition number of XᵀX

  • Learning rate typically 0.01 - 0.1

Logistic Regression:

  • Initialize weights near zero

  • Check sigmoid doesn’t overflow

  • Use log-sum-exp trick for numerical stability

  • Class weights help with imbalance

Neural Networks:

  • Xavier init for tanh: U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]

  • He init for ReLU: N(0, √(2/n_in))

  • Gradient clipping prevents exploding gradients

  • Batch norm reduces internal covariate shift

SVM:

  • Scale features to [0,1] or standardize

  • Start with RBF kernel

  • C controls trade-off: small C → large margin

  • γ controls decision boundary smoothness

Progress Tracker

By Lecture

  • Lecture 1: 0/10 problems

  • Lecture 2: 0/10 problems

  • Lecture 3: 0/10 problems

  • Lecture 4: 0/10 problems

  • Lecture 5: 0/10 problems

  • Lecture 6: 0/15 problems

Projects

  • Project 1: House Prices

  • Project 2: Spam Detection

  • Project 3: MNIST Digits

  • Project 4: Customer Segmentation

  • Project 5: Sentiment Analysis

  • Project 6: Image Classification

  • Project 7: Anomaly Detection

  • Project 8: Recommender System

  • Project 9: Time Series

  • Project 10: Reinforcement Learning

Challenges

  • Challenge 1: Algorithms from Scratch

  • Challenge 2: Kaggle Competition

  • Challenge 3: Paper Implementation

  • Challenge 4: AutoML System

  • Challenge 5: Interpretable AI

Total Progress: Track your journey to CS229 mastery!

Good luck with your machine learning journey! 🚀