Lecture 1: Linear Regression (10 Problems)¶
1.1 🟢 Basic Implementation¶
Implement linear regression without using numpy’s advanced functions:
Test on: y = 2x + 3 + noise
```python
def manual_linear_regression(X, y):
    # Use only basic Python and loops
    # Return theta (parameters)
    pass

# Your implementation here
```
1.2 🟡 Gradient Descent Variants¶
Implement and compare:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch GD (batch_size=32)
On dataset with m=10,000, compare:
Convergence speed (iterations to converge)
Final cost
Time per iteration
# Your implementation here
1.3 🟡 Learning Rate Schedule¶
Implement learning rate decay:
α(t) = α₀ / (1 + decay_rate × t)
Compare with constant learning rate on California Housing.
# Your implementation here
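The decay schedule above can be sketched directly; this is a minimal illustration (the function name `decayed_lr` is ours, and `alpha0`/`decay_rate` are the hyperparameters you would tune):

```python
import numpy as np

def decayed_lr(alpha0, decay_rate, t):
    # 1/t decay: alpha(t) = alpha0 / (1 + decay_rate * t)
    return alpha0 / (1.0 + decay_rate * t)

# Example: learning rate over the first few iterations
lrs = np.array([decayed_lr(0.1, 0.01, t) for t in range(5)])
```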
1.4 🔴 Closed-Form vs Iterative¶
For different feature dimensions n = [10, 100, 1000, 5000]:
Time Normal Equation
Time Gradient Descent (1000 iterations)
Plot time vs n
Identify crossover point
# Your implementation here
1.5 🟢 Feature Scaling Impact¶
Generate data where features have different scales (e.g., [0.001, 1000]).
Train without scaling
Train with standardization
Compare convergence
# Your implementation here
1.6 🟡 Polynomial Regression¶
Fit polynomials of degree 1-10 to: y = sin(2πx) + noise
Use cross-validation to select degree
Plot train vs validation error
Identify optimal degree
# Your implementation here
1.7 🔴 Regularization Path¶
For Ridge regression, try λ ∈ [10⁻⁶, 10⁶]:
Plot coefficient values vs log(λ)
Identify when coefficients shrink to zero
Find optimal λ using CV
# Your implementation here
1.8 🟡 Weighted Linear Regression¶
Implement locally weighted regression:
weight[i] = exp(-||x⁽ⁱ⁾ - x||² / (2τ²))
Try different τ values
Compare with standard linear regression
Visualize predictions
# Your implementation here
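The weight formula above translates to a few lines of NumPy; this sketch (function name ours) computes the Gaussian weights for one query point:

```python
import numpy as np

def lwr_weights(X, x_query, tau):
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    d2 = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * tau ** 2))
```

Small τ makes the fit very local (only nearby points get weight); large τ approaches standard linear regression.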
1.9 🔴 Online Learning¶
Implement online gradient descent:
Receive examples one at a time
Update parameters after each example
Track running average cost
Compare with batch learning
# Your implementation here
1.10 🏆 Distributed Gradient Descent¶
Simulate distributed training:
Split data across 4 “workers”
Each computes local gradients
Aggregate gradients
Update parameters
Compare speedup vs communication cost
# Your implementation here
Lecture 2: Logistic Regression (10 Problems)¶
2.1 🟢 Sigmoid Properties¶
Prove mathematically:
σ’(z) = σ(z)(1 - σ(z))
1 - σ(z) = σ(-z)
σ(z) → 1 as z → ∞
Plot sigmoid, derivative, and second derivative.
# Your implementation here
2.2 🟡 Decision Boundary Visualization¶
For 2D dataset:
Train logistic regression
Plot decision boundary
Show regions with P(y=1|x) > 0.5
Add confidence contours
# Your implementation here
2.3 🟡 Multi-class Classification¶
Implement One-vs-All:
```python
def train_one_vs_all(X, y, num_classes):
    # Train K binary classifiers
    # Return K parameter vectors
    pass

# Test on iris dataset (3 classes)
```
2.4 🔴 Newton’s Method¶
Implement Newton’s method for logistic regression:
θ := θ - H⁻¹∇J(θ)
where H is the Hessian matrix. Compare convergence speed with gradient descent.
# Your implementation here
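One Newton update for the logistic log-loss can be sketched as follows (a minimal illustration on a toy problem, not a full training loop; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def newton_step(theta, X, y):
    # One Newton update: theta <- theta - H^{-1} grad J(theta)
    m = X.shape[0]
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m
    R = h * (1 - h)                 # curvature weights, one per example
    H = (X.T * R) @ X / m           # Hessian of the average log-loss
    return theta - np.linalg.solve(H, grad)

# Tiny demo: one step on a 2-feature toy problem (intercept column included)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta0 = np.zeros(2)
theta1 = newton_step(theta0, X, y)
```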
2.5 🟢 Regularized Logistic Regression¶
Add L2 regularization:
J(θ) = -(1/m)Σ[y log h + (1-y) log(1-h)] + (λ/2m)Σθⱼ²
Try λ = [0.001, 0.01, 0.1, 1.0, 10]. Plot validation error vs λ.
# Your implementation here
2.6 🟡 Imbalanced Classes¶
Create dataset: 95% class 0, 5% class 1
Train standard logistic regression
Calculate: accuracy, precision, recall, F1
Apply class weights
Compare performance
# Your implementation here
2.7 🔴 Calibration Analysis¶
For logistic regression predictions:
Bin predictions into 10 buckets
For each bucket, calculate actual positive rate
Plot calibration curve
Compare well-calibrated vs poorly-calibrated
# Your implementation here
2.8 🟡 Feature Engineering¶
For text classification:
Implement TF-IDF vectorization
Train logistic regression
Identify most important words
Compare with bag-of-words
# Your implementation here
2.9 🔴 Softmax Regression¶
Implement softmax (multi-class logistic regression):
P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)
Derive the gradient and implement it from scratch. Test on MNIST digits.
# Your implementation here
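When implementing the softmax formula above, subtract the row maximum before exponentiating so that large scores do not overflow; a minimal sketch (function name ours):

```python
import numpy as np

def softmax(scores):
    # Shift by the row max: exp(s - max) avoids overflow and leaves
    # the ratios, and hence the probabilities, unchanged
    s = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)
```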
2.10 🏆 Adversarial Examples¶
Generate adversarial examples:
Find minimal perturbation that flips prediction
Visualize perturbations
Test adversarial training
Measure robustness
# Your implementation here
Lecture 3: Regularization (10 Problems)¶
3.1 🟢 Overfitting Demonstration¶
Generate data: y = x² + noise (n=20 samples)
Fit polynomials degree 1-15
Plot all fits
Show overfitting visually
# Your implementation here
3.2 🟡 Cross-Validation Implementation¶
Implement k-fold CV from scratch:
```python
def k_fold_cv(X, y, k, model):
    # Split data into k folds
    # Train on k-1, validate on 1
    # Return average validation error
    pass

# Your implementation here
```
3.3 🟡 Ridge vs Lasso Comparison¶
On dataset with correlated features:
Plot regularization paths for both
Identify when Lasso sets coefficients to 0
Compare prediction performance
# Your implementation here
3.4 🔴 Elastic Net Tuning¶
Grid search over:
α ∈ [0, 0.25, 0.5, 0.75, 1.0]
λ ∈ [0.001, 0.01, 0.1, 1.0, 10]
Create heatmap of CV error. Identify optimal combination.
# Your implementation here
3.5 🟢 Early Stopping¶
Implement early stopping:
Monitor validation error
Stop when no improvement for 10 epochs
Restore best weights
Compare with fixed epochs
# Your implementation here
3.6 🟡 Dropout Implementation¶
Implement dropout regularization:
```python
import numpy as np

def dropout_forward(X, p=0.5):
    # Inverted dropout: drop units with probability p,
    # rescale survivors by 1/(1-p) so expected activations match test time
    mask = (np.random.rand(*X.shape) > p) / (1 - p)
    return X * mask, mask

# Test on neural network
```
3.7 🔴 Bayesian Interpretation¶
Show Ridge regression equivalent to MAP estimation with:
Gaussian prior on θ: θ ~ N(0, (1/λ)I)
Likelihood: y|x,θ ~ N(θᵀx, σ²)
Derive λ from prior variance.
# Your implementation here
3.8 🟡 Feature Selection¶
Compare feature selection methods:
Forward selection
Backward elimination
Lasso
Random Forest importance
On dataset with 50 features (10 relevant).
# Your implementation here
3.9 🔴 Non-convex Regularization¶
Implement L0 regularization (count non-zeros):
J(θ) = MSE + λ × ||θ||₀
Use greedy approximation or IHT algorithm.
# Your implementation here
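The IHT approach mentioned above alternates a gradient step on the MSE with projection onto the set of k-sparse vectors; a minimal sketch under the assumption of a noiseless, well-conditioned design (all names and the step size are ours):

```python
import numpy as np

def hard_threshold(theta, k):
    # Keep the k largest-magnitude entries, zero out the rest
    out = np.zeros_like(theta)
    idx = np.argsort(np.abs(theta))[-k:]
    out[idx] = theta[idx]
    return out

def iht(X, y, k, lr=0.5, n_iters=300):
    # Iterative Hard Thresholding: gradient step on the MSE,
    # then project onto { theta : ||theta||_0 <= k }
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = hard_threshold(theta - lr * grad, k)
    return theta
```

The step size lr must respect the scaling of X (here roughly unit-variance features); too large a step diverges.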
3.10 🏆 Regularization for Deep Learning¶
Compare regularization in deep neural networks:
L2 weight decay
Dropout
Batch normalization
Data augmentation
Label smoothing
On CIFAR-10 dataset.
# Your implementation here
Lecture 4: Generative Models (10 Problems)¶
4.1 🟢 GDA Implementation¶
Implement Gaussian Discriminant Analysis:
```python
def fit_gda(X, y):
    # Estimate μ₀, μ₁, Σ for each class
    # Return parameters
    pass

# Test on synthetic 2D data
```
4.2 🟡 GDA vs Logistic Regression¶
Generate data from:
Gaussian distributions
Non-Gaussian distributions
Compare performance of GDA vs Logistic Regression. When does each work better?
# Your implementation here
4.3 🟡 Naive Bayes for Text¶
Implement Multinomial Naive Bayes:
P(x|y) = Π P(xᵢ|y)^count(xᵢ)
Apply to 20 Newsgroups dataset. Calculate accuracy on test set.
# Your implementation here
4.4 🔴 Laplace Smoothing¶
Compare Naive Bayes with different smoothing:
No smoothing (α=0)
Laplace smoothing (α=1)
Lidstone smoothing (α=0.1, 0.5, 2.0)
How does α affect rare words?
# Your implementation here
4.5 🟢 Event Models¶
Implement both:
Multinomial event model
Bernoulli event model
For email spam detection. Compare accuracy.
# Your implementation here
4.6 🟡 Continuous Features¶
Extend Naive Bayes to continuous features:
Assume Gaussian: P(xⱼ|y) ~ N(μⱼy, σⱼy²)
Estimate parameters
Test on iris dataset
# Your implementation here
4.7 🔴 Kernel Density Estimation¶
Implement KDE for class-conditional densities:
P(x|y) = (1/(nh)) Σ K((x-xᵢ)/h)
Use Gaussian kernel. Compare with parametric GDA.
# Your implementation here
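A one-dimensional Gaussian-kernel version of the estimator can be sketched as follows (function name ours; note the 1/(nh) normalization that makes the result a proper density):

```python
import numpy as np

def kde_gaussian(x_query, samples, h):
    # p(x) ≈ (1/(n h)) * sum_i K((x - x_i) / h), K the standard normal pdf
    u = (x_query - samples) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum() / (len(samples) * h)
```

For class-conditional densities, fit one KDE per class on that class's samples, then classify via Bayes' rule with the class priors.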
4.8 🟡 Discriminative vs Generative¶
Theoretical comparison:
Sample complexity
Asymptotic error
When to prefer each
Empirical study with varying training size.
# Your implementation here
4.10 🏆 Semi-Supervised Learning¶
Use generative models for semi-supervised learning:
Train on labeled data
Use EM to leverage unlabeled data
Compare with supervised-only
# Your implementation here
Lecture 5: SVMs (10 Problems)¶
5.1 🟢 Margin Calculation¶
For linearly separable data:
```python
def compute_margin(X, y, w, b):
    # Return geometric margin
    pass

# Visualize margin for 2D dataset
```
5.2 🟡 Kernel Functions¶
Implement kernels:
Linear: K(x,z) = xᵀz
Polynomial: K(x,z) = (xᵀz + c)^d
RBF: K(x,z) = exp(-γ||x-z||²)
Sigmoid: K(x,z) = tanh(κxᵀz + Θ)
Test on 2D XOR problem.
# Your implementation here
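The four kernels listed above translate directly to NumPy; a minimal sketch for single vector pairs (function names and default hyperparameters are ours):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)
```

For training you would evaluate these over all pairs to build the Gram matrix K with K[i, j] = K(xᵢ, xⱼ).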
5.3 🟡 Soft Margin SVM¶
Implement soft margin:
minimize: (1/2)||w||² + C × Σξᵢ
subject to: yᵢ(wᵀxᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
Solve using quadratic programming.
# Your implementation here
5.4 🔴 SVM Dual Formulation¶
Derive dual form:
maximize: Σαᵢ - (1/2)ΣΣαᵢαⱼyᵢyⱼxᵢᵀxⱼ
subject to: Σαᵢyᵢ = 0, 0 ≤ αᵢ ≤ C
Implement and compare with primal.
# Your implementation here
5.5 🟢 Hyperparameter Tuning¶
Grid search for SVM:
C ∈ [0.001, 0.01, 0.1, 1, 10, 100]
γ ∈ [0.001, 0.01, 0.1, 1, 10] (for RBF)
Create heatmap of accuracy.
# Your implementation here
5.6 🟡 Support Vector Analysis¶
After training:
Identify support vectors
Calculate percentage of SVs
Visualize SVs on plot
How does C affect # of SVs?
# Your implementation here
5.7 🔴 Multi-class SVM¶
Implement:
One-vs-One (OvO)
One-vs-All (OvA)
Compare:
Training time
Number of models
Accuracy
Confusion matrix
# Your implementation here
5.8 🟡 SVM for Regression (SVR)¶
Implement ε-insensitive loss:
L(y, f(x)) = max(0, |y - f(x)| - ε)
Compare with linear regression. Visualize ε-tube.
# Your implementation here
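The ε-insensitive loss above is a one-liner; a minimal sketch (function name ours) that you can reuse when visualizing the ε-tube:

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # Zero loss inside the eps-tube, linear loss outside it
    return np.maximum(0.0, np.abs(y - y_pred) - eps)
```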
5.9 🔴 Kernel PCA¶
Implement kernel PCA:
Compute kernel matrix K
Center in feature space
Eigendecomposition
Project data
Apply to Swiss roll dataset.
# Your implementation here
5.10 🏆 Custom Kernel Design¶
Design problem-specific kernel for:
String matching (sequence alignment)
Graph similarity
Image comparison
Implement and test on real data.
# Your implementation here
Lecture 6: Neural Networks - Basics (15 Problems)¶
6.1 🟢 Activation Functions¶
Implement and plot:
Sigmoid
Tanh
ReLU
Leaky ReLU
ELU
Swish
Compare derivatives.
# Your implementation here
6.2 🟡 Backpropagation Step-by-Step¶
For 3-layer network (2-4-1):
Forward pass (show all activations)
Backward pass (show all gradients)
Parameter updates
Verify with numerical gradients
# Your implementation here
6.3 🟡 Gradient Checking¶
Implement numerical gradient:
grad_approx = (J(θ+ε) - J(θ-ε)) / (2ε)
Compare with analytical gradient. When is difference acceptable?
# Your implementation here
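The central-difference formula above, applied coordinate-by-coordinate, gives a full numerical gradient; a minimal sketch (function name ours, J any scalar cost function):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    # Central differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2 eps)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```

A relative difference between this and the analytical gradient below about 1e-7 is commonly taken as a pass.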
6.4 🔴 Weight Initialization¶
Compare initialization schemes:
Zero initialization
Random uniform [-1, 1]
Xavier/Glorot
He initialization
Train on MNIST, compare convergence.
# Your implementation here
6.5 🟢 Mini-batch Training¶
Implement mini-batch training:
Shuffle data each epoch
Process batches of size 32
Update parameters after each batch
Track loss per epoch
# Your implementation here
6.6 🟡 Learning Rate Schedules¶
Implement:
Step decay: α × 0.5 every 10 epochs
Exponential decay: α × e^(-kt)
1/t decay: α₀ / (1 + kt)
Cosine annealing
Compare convergence curves.
# Your implementation here
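The four schedules listed above can be sketched as small functions of the step/epoch counter (names and default constants are ours):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    # Halve the rate every `every` epochs
    return alpha0 * drop ** (epoch // every)

def exp_decay(alpha0, t, k=0.1):
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, t, k=0.1):
    return alpha0 / (1 + k * t)

def cosine_annealing(alpha0, t, T):
    # Decays from alpha0 to 0 over T steps along a half cosine
    return 0.5 * alpha0 * (1 + np.cos(np.pi * t / T))
```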
6.7 🔴 Batch Normalization¶
Implement batch norm layer:
```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_norm = (X - mu) / np.sqrt(var + eps)
    return gamma * X_norm + beta

# Train with/without, compare
```
6.8 🟡 Dropout Regularization¶
Implement dropout:
Training: randomly drop neurons
Testing: use all neurons, scale activations
Try p = [0.1, 0.3, 0.5, 0.7]
Compare overfitting
# Your implementation here
6.9 🔴 Vanishing Gradients¶
Create deep network (10+ layers) with sigmoid:
Plot gradient magnitude per layer
Observe vanishing gradients
Switch to ReLU, compare
Try residual connections
# Your implementation here
6.10 🟢 Architecture Search¶
Try architectures for MNIST:
[256]
[512, 256]
[512, 256, 128]
[1024, 512, 256, 128]
Which gives best accuracy vs parameters?
# Your implementation here
6.11 🟡 Early Stopping¶
Implement early stopping:
Monitor validation loss
Save best model
Restore if no improvement for patience=10
Compare with fixed epochs.
# Your implementation here
6.12 🔴 Learning Curves¶
Plot learning curves:
Training/validation loss vs epoch
Training/validation accuracy vs epoch
Diagnose: overfitting, underfitting, or good fit
# Your implementation here
6.13 🟡 Optimization Algorithms¶
Implement and compare:
SGD
SGD + Momentum
AdaGrad
RMSprop
Adam
On same network and dataset.
# Your implementation here
6.14 🔴 Neural Network from Scratch¶
Implement complete neural network without libraries:
Forward propagation
Backpropagation
Multiple layers
Activation functions
Training loop
Test on iris dataset.
# Your implementation here
6.15 🏆 Interpretability¶
Visualize what network learned:
Plot first layer weights as images
Activation maximization
Gradient-based saliency maps
Layer-wise relevance propagation
# Your implementation here
Comprehensive Projects (10 Projects)¶
The following sections contain major projects that integrate multiple concepts.
Project 1: House Price Prediction 🏆¶
Goal: Predict house prices with RMSE < $50k
Dataset: Kaggle House Prices
Techniques:
Feature engineering
Polynomial features
Ridge/Lasso regularization
Ensemble methods
Deliverables:
EDA report with 10+ visualizations
Model comparison table
Final model with predictions
Written analysis
# Project 1 implementation
Project 2: Spam Email Detection 🏆¶
Goal: Classify emails with > 95% F1-score
Dataset: Enron spam dataset
Techniques:
Text preprocessing
TF-IDF vectorization
Naive Bayes
Logistic regression
SVM
Deliverables:
Preprocessing pipeline
Model comparison
Error analysis
Deployment-ready classifier
# Project 2 implementation
Project 3: MNIST Digit Recognition 🏆¶
Goal: Achieve > 98% test accuracy
Dataset: MNIST (70k images)
Techniques:
Neural networks
CNNs
Data augmentation
Ensemble
Deliverables:
Multiple architectures tested
Learning curves
Confusion matrix analysis
Misclassified examples study
# Project 3 implementation
Solutions and Hints¶
General Guidelines¶
For All Exercises:
✅ Set random seed: np.random.seed(42)
✅ Split data before anything else
✅ Scale features when needed
✅ Use cross-validation
✅ Visualize results
✅ Document code
✅ Interpret findings
Common Mistakes:
❌ Data leakage (scaling before split)
❌ Not checking for NaN/inf
❌ Ignoring class imbalance
❌ Not validating assumptions
❌ Overfitting to test set
Specific Hints¶
Linear Regression:
Feature scaling crucial for gradient descent
Normal equation fails if XᵀX not invertible
Check condition number of XᵀX
Learning rate typically 0.01 - 0.1
Logistic Regression:
Initialize weights near zero
Check sigmoid doesn’t overflow
Use log-sum-exp trick for numerical stability
Class weights help with imbalance
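The overflow and log-sum-exp hints above can be combined into one numerically stable loss; a sketch (function name ours), using the identity log(1 + e⁻ᶻ) = log(1 + e⁻|ᶻ|) + max(-z, 0):

```python
import numpy as np

def stable_log_loss(z, y):
    # Per-example logistic loss -[y log σ(z) + (1-y) log(1-σ(z))],
    # rewritten as log(1 + exp(-|z|)) + max(z, 0) - y*z so that no
    # large positive number is ever exponentiated
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0) - y * z
```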
Neural Networks:
Xavier init for tanh: U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]
He init for ReLU: N(0, √(2/n_in))
Gradient clipping prevents exploding gradients
Batch norm reduces internal covariate shift
SVM:
Scale features to [0,1] or standardize
Start with RBF kernel
C controls trade-off: small C → large margin
γ controls decision boundary smoothness
Progress Tracker¶
By Lecture¶
Lecture 1: 0/10 problems
Lecture 2: 0/10 problems
Lecture 3: 0/10 problems
Lecture 4: 0/10 problems
Lecture 5: 0/10 problems
Lecture 6: 0/15 problems
Projects¶
Project 1: House Prices
Project 2: Spam Detection
Project 3: MNIST Digits
Project 4: Customer Segmentation
Project 5: Sentiment Analysis
Project 6: Image Classification
Project 7: Anomaly Detection
Project 8: Recommender System
Project 9: Time Series
Project 10: Reinforcement Learning
Challenges¶
Challenge 1: Algorithms from Scratch
Challenge 2: Kaggle Competition
Challenge 3: Paper Implementation
Challenge 4: AutoML System
Challenge 5: Interpretable AI
Total Progress: Track your journey to CS229 mastery!
Good luck with your machine learning journey! 🚀