Lecture 1: Introduction & Linear Regression

📹 Watch Lecture

What is Machine Learning?

Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell (1998): A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Types of Machine Learning

1. Supervised Learning

  • Given “right answers” (labeled data)

  • Learn mapping from input to output

  • Examples:

    • Regression: Predict continuous values

    • Classification: Predict discrete labels

2. Unsupervised Learning

  • No labels provided

  • Find structure in data

  • Examples:

    • Clustering

    • Dimensionality reduction

    • Density estimation

3. Reinforcement Learning

  • Learn from rewards/penalties

  • Agent interacts with environment

  • Examples:

    • Game playing

    • Robotics

    • Autonomous vehicles

Linear Regression

Problem Setup

Training Set:

  • m training examples: \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\)

  • \(x \in \mathbb{R}^n\): input features (n-dimensional)

  • \(y \in \mathbb{R}\): output (real-valued)

Goal: Learn function \(h: X \rightarrow Y\) (hypothesis)

Hypothesis

For linear regression: \[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n\]

Vectorized form: \[h_\theta(x) = \theta^T x\]

where:

  • \(\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T\) (parameters)

  • \(x = [1, x_1, x_2, \ldots, x_n]^T\) (features with bias term)
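
As a concrete sketch (toy parameter values, not fit to any data), the vectorized hypothesis is a single dot product in NumPy:

```python
import numpy as np

# Toy parameters theta = [theta_0, theta_1, theta_2]^T (illustrative values only)
theta = np.array([50.0, 0.1, 20.0])

# One example x = [1, x_1, x_2]^T; the leading 1 multiplies the bias term theta_0
x = np.array([1.0, 2104.0, 3.0])  # e.g. 2,104 sq ft, 3 bedrooms

# h_theta(x) = theta^T x
h = theta @ x  # 50 + 0.1*2104 + 20*3 = 320.4
print(h)
```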

Cost Function

Mean Squared Error (MSE): \[J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\]

Matrix form: \[J(\theta) = \frac{1}{2m} (X\theta - y)^T(X\theta - y)\]
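
A quick numeric check on a toy training set (made up for illustration) confirms the sum form and the matrix form compute the same cost:

```python
import numpy as np

# Toy training set: m = 3 examples, design matrix with a bias column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.array([1.0, 0.5])
m = len(y)

# Sum form: J = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
J_sum = (1 / (2 * m)) * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# Matrix form: J = (1/2m) * (X@theta - y)^T (X@theta - y)
r = X @ theta - y
J_mat = (1 / (2 * m)) * (r @ r)

print(J_sum, J_mat)  # identical values
```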

Learning Algorithms

1. Gradient Descent (Batch): \[\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\]

2. Normal Equation (Closed-form): \[\theta = (X^TX)^{-1}X^Ty\]

3. Stochastic Gradient Descent (SGD): update parameters using one example at a time
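
To make the normal equation concrete before the full examples below, here is a minimal sketch on synthetic data (assumed true parameters θ₀ = 4, θ₁ = 3) showing it recovers the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from y = 4 + 3x plus small noise
m = 100
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + 0.01 * rng.standard_normal(m)

# Design matrix with a bias column
X = np.c_[np.ones(m), x]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # close to the true [4, 3]
```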

Motivating Example: Portland Housing Prices

From Andrew Ng’s CS229 Lecture 2:

“Let’s say you want to predict or estimate the prices of houses. The way you build a learning algorithm is start by collecting a dataset of houses and their prices. This is data from Portland, Oregon…”

Dataset collected from Craigslist (Portland, OR):

  • Feature: Size of house (square feet)

  • Target: Price (in thousands of dollars)

  • Example: 2,104 sq ft house with asking price $400,000

Supervised Learning Process:

  1. Training Set → Learning Algorithm → Hypothesis (h)

  2. Hypothesis: Takes house size as input, outputs estimated price

  3. Goal: Find the best hypothesis that fits the training data

Why Start with Linear Regression?

“Linear regression is maybe the simplest possible learning algorithm - a supervised learning regression problem.” - Andrew Ng

Even though self-driving cars (like ALVINN) use more complex models, linear regression provides the foundation for understanding:

  • How to represent a hypothesis

  • How to define a cost function

  • How to optimize parameters

  • The workflow of machine learning algorithms

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Portland Housing Data (from CS229 Lecture 2 - Craigslist data)
portland_data = {
    'size_sqft': [2104, 1600, 2400, 1416, 3000, 1985, 1534, 1427, 1380, 1494,
                  1940, 2000, 1890, 4478, 1268, 2300, 1320, 1236, 2609, 3031],
    'bedrooms': [3, 3, 3, 2, 4, 4, 3, 3, 3, 3,
                 4, 3, 3, 5, 3, 4, 2, 3, 4, 4],
    'price_1000s': [400, 330, 369, 232, 540, 300, 315, 199, 212, 243,
                    240, 347, 330, 700, 240, 380, 225, 160, 363, 412]
}

df_portland = pd.DataFrame(portland_data)

print("🏠 CS229 Portland Housing Dataset")
print("="*70)
print(f"\nDataset from Andrew Ng's lecture (Craigslist Portland, Oregon)")
print(f"Samples: {len(df_portland)}")
print(f"\nFirst 5 examples:")
print(df_portland.head())
print(f"\nSummary Statistics:")
print(df_portland.describe())

# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Size vs Price
axes[0].scatter(df_portland['size_sqft'], df_portland['price_1000s'], 
                alpha=0.6, s=100, c='steelblue', edgecolors='black', linewidth=1)
axes[0].set_xlabel('Size (sq ft)', fontsize=12)
axes[0].set_ylabel('Price ($1000s)', fontsize=12)
axes[0].set_title('Portland Housing: Size vs Price', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Bedrooms vs Price
axes[1].scatter(df_portland['bedrooms'], df_portland['price_1000s'], 
                alpha=0.6, s=100, c='coral', edgecolors='black', linewidth=1)
axes[1].set_xlabel('Number of Bedrooms', fontsize=12)
axes[1].set_ylabel('Price ($1000s)', fontsize=12)
axes[1].set_title('Portland Housing: Bedrooms vs Price', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observation:")
print("   • Price generally increases with house size")
print("   • Price also correlates with number of bedrooms")
print("   • Linear relationship appears reasonable as first approximation")

1.1 Univariate Linear Regression: Portland Housing

From Lecture 2: “How do you represent the hypothesis H? In linear regression, the hypothesis is going to be that the output (price) is a linear function of the input (size).”

\[h_\theta(x) = \theta_0 + \theta_1 x\]

Where:

  • \(x\) = size of house (square feet)

  • \(h_\theta(x)\) = predicted price

  • \(\theta_0\) = intercept (base price)

  • \(\theta_1\) = slope (price per sq ft)

“For the mathematicians in the room, you’ll say technically this is an affine function. If it was a linear function, there’s no θ₀… but machine learning sometimes just calls this a linear function.” - Andrew Ng

# Portland Housing: Univariate Linear Regression (Size → Price)
print("📊 Univariate Linear Regression: Size vs Price")
print("="*70)

# Extract size and price from Portland data
X_portland = df_portland['size_sqft'].values.reshape(-1, 1)
y_portland = df_portland['price_1000s'].values.reshape(-1, 1)

print(f"\nDataset: {len(X_portland)} houses from Portland, OR")
print(f"Input (X):  House size (sq ft)")
print(f"Output (Y): Price ($1000s)")

# Add bias term (x0 = 1) to create design matrix
X_b = np.c_[np.ones((len(X_portland), 1)), X_portland]

print(f"\nDesign Matrix X shape: {X_b.shape}")
print(f"   Column 0: x₀ = 1 (bias term)")
print(f"   Column 1: x₁ = size (sq ft)")

# SOLUTION 1: Normal Equation (Analytical)
# θ = (X^T X)^(-1) X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_portland

print(f"\n🎯 Normal Equation Solution:")
print(f"   θ₀ (intercept): ${theta_normal[0][0]:.2f}k  (base price)")
print(f"   θ₁ (slope):     ${theta_normal[1][0]:.4f}k per sq ft")
print(f"\n   Interpretation:")
print(f"   • Base price (0 sq ft house): ${theta_normal[0][0]:.2f}k")
print(f"   • Each additional sq ft adds: ${theta_normal[1][0]*1000:.2f}")

# Make predictions
X_test = np.array([[0], [5000]])  # 0 to 5000 sq ft
X_test_b = np.c_[np.ones((2, 1)), X_test]
y_pred_line = X_test_b @ theta_normal

# Predict for some example houses
examples = np.array([[1500], [2000], [2500], [3000]])
examples_b = np.c_[np.ones((len(examples), 1)), examples]
predictions = examples_b @ theta_normal

print(f"\n📈 Example Predictions:")
for size, pred in zip(examples, predictions):
    print(f"   {size[0]:,} sq ft → ${pred[0]:.2f}k (${pred[0]*1000:.0f})")

# Visualize
plt.figure(figsize=(12, 6))
plt.scatter(X_portland, y_portland, alpha=0.7, s=100, c='steelblue', 
            edgecolors='black', linewidth=1, label='Training data (Portland)')
plt.plot(X_test, y_pred_line, 'r-', linewidth=3, 
         label=f'Fit: y = {theta_normal[1][0]:.4f}x + {theta_normal[0][0]:.2f}')

# Highlight specific prediction
highlight_size = 2104  # First house in dataset
highlight_pred = theta_normal[0][0] + theta_normal[1][0] * highlight_size
plt.scatter([highlight_size], [highlight_pred], s=300, c='red', marker='*', 
            edgecolors='black', linewidth=2, zorder=5,
            label=f'Example: 2,104 sq ft → ${highlight_pred:.0f}k')

plt.xlabel('Size (square feet)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('Linear Regression: Portland Housing Prices', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate performance metrics
y_pred_train = X_b @ theta_normal
mse = np.mean((y_pred_train - y_portland)**2)
rmse = np.sqrt(mse)
r2 = 1 - (np.sum((y_portland - y_pred_train)**2) / np.sum((y_portland - y_portland.mean())**2))

print(f"\n📊 Model Performance:")
print(f"   Mean Squared Error (MSE):  {mse:.2f}")
print(f"   Root MSE (RMSE):          ${rmse:.2f}k")
print(f"   R² Score:                  {r2:.4f} ({r2*100:.2f}% variance explained)")

print(f"\n💡 Key Insight:")
print(f"   The Normal Equation gives us the optimal parameters in closed form:")
print(f"   θ* = (X^T X)^(-1) X^T y")

1.2 Gradient Descent Implementation

Batch Gradient Descent Algorithm

Repeat until convergence: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\]

Vectorized: \[\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\]

# Implement Gradient Descent from scratch
def compute_cost(X, y, theta):
    """Compute MSE cost function"""
    m = len(y)
    predictions = X @ theta
    cost = (1/(2*m)) * np.sum((predictions - y)**2)
    return cost

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch Gradient Descent"""
    m = len(y)
    cost_history = np.zeros(num_iters)
    theta_history = np.zeros((num_iters, len(theta)))
    
    for i in range(num_iters):
        # Compute predictions
        predictions = X @ theta
        
        # Compute gradients
        gradients = (1/m) * X.T @ (predictions - y)
        
        # Update parameters
        theta = theta - alpha * gradients
        
        # Save cost and theta
        cost_history[i] = compute_cost(X, y, theta)
        theta_history[i] = theta.flatten()
    
    return theta, cost_history, theta_history

print("🔄 Gradient Descent Implementation")
print("="*70)

# Initialize parameters
theta_init = np.random.randn(2, 1)
alpha = 0.1
num_iters = 1000

print(f"\n⚙️ Hyperparameters:")
print(f"   Learning rate (α): {alpha}")
print(f"   Iterations: {num_iters}")
print(f"   Initial θ: {theta_init.flatten()}")

# Feature scaling: the raw sq-ft values (~2,000) would make the gradients
# explode at this learning rate, so standardize the size feature first
X_b_scaled = np.c_[np.ones((len(X_portland), 1)),
                   (X_portland - X_portland.mean()) / X_portland.std()]

# Run gradient descent
theta_gd, cost_history, theta_history = gradient_descent(X_b_scaled, y_portland, theta_init, alpha, num_iters)

print(f"\n🎯 Gradient Descent Solution:")
print(f"   θ₀ (intercept): {theta_gd[0][0]:.4f}")
print(f"   θ₁ (slope):     {theta_gd[1][0]:.4f}")
print(f"   Final cost:     {cost_history[-1]:.6f}")

# Compare with the Normal Equation (on the same scaled design matrix)
theta_best = np.linalg.inv(X_b_scaled.T @ X_b_scaled) @ X_b_scaled.T @ y_portland
print(f"\n📊 Comparison:")
print(f"   Method          θ₀        θ₁        Final Cost")
print(f"   Normal Eq:   {theta_best[0][0]:7.4f}  {theta_best[1][0]:7.4f}  {compute_cost(X_b_scaled, y_portland, theta_best):.6f}")
print(f"   Grad Descent:{theta_gd[0][0]:7.4f}  {theta_gd[1][0]:7.4f}  {cost_history[-1]:.6f}")

# Visualize convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cost history
axes[0].plot(cost_history, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost J(θ)')
axes[0].set_title('Gradient Descent: Cost vs Iteration')
axes[0].grid(True, alpha=0.3)

# Parameter evolution
axes[1].plot(theta_history[:, 0], label='θ₀ (intercept)', linewidth=2)
axes[1].plot(theta_history[:, 1], label='θ₁ (slope)', linewidth=2)
axes[1].axhline(theta_best[0, 0], color='C0', linestyle='--', alpha=0.5, label='θ₀ optimal')
axes[1].axhline(theta_best[1, 0], color='C1', linestyle='--', alpha=0.5, label='θ₁ optimal')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('Parameter Evolution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observation: Gradient descent converges to same solution as Normal Equation!")

1.3 Learning Rate Analysis

The learning rate α is critical:

  • Too small: Slow convergence

  • Too large: May not converge, oscillate, or diverge

  • Just right: Fast and stable convergence

# Compare different learning rates
print("📊 Learning Rate Comparison")
print("="*70)

learning_rates = [0.001, 0.01, 0.1, 0.5]
colors = ['red', 'orange', 'green', 'blue']
num_iters = 500

# Standardize the size feature (as in Section 1.2) so the larger rates don't diverge
X_b_lr = np.c_[np.ones((len(X_portland), 1)),
               (X_portland - X_portland.mean()) / X_portland.std()]

plt.figure(figsize=(12, 6))

for alpha, color in zip(learning_rates, colors):
    theta_init = np.random.randn(2, 1)
    theta, cost_history, _ = gradient_descent(X_b_lr, y_portland, theta_init, alpha, num_iters)
    
    converged = cost_history[-1] < cost_history[0] * 0.01
    status = "✓ Converged" if converged else "✗ Slow"
    
    plt.plot(cost_history, color=color, linewidth=2, 
             label=f'α = {alpha} ({status})', alpha=0.8)
    
    print(f"\nα = {alpha:5.3f}: Final cost = {cost_history[-1]:.6f} {status}")

plt.xlabel('Iteration')
plt.ylabel('Cost J(θ)')
plt.title('Effect of Learning Rate on Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("\n💡 Key Insights:")
print("   • α = 0.001: Too slow, needs many iterations")
print("   • α = 0.01:  Better, but still slow")
print("   • α = 0.1:   Good balance - fast and stable")
print("   • α = 0.5:   Very fast convergence")
print("\n   Recommendation: Start with α around 0.1 and tune if needed")

1.4 Multivariate Linear Regression

Now extend to multiple features: \[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n\]

# California Housing Dataset (8 features)
print("🏠 Multivariate Linear Regression: California Housing")
print("="*70)

# Load data
housing = fetch_california_housing(as_frame=True)
X_housing = housing.data.values
y_housing = housing.target.values.reshape(-1, 1)

print(f"\n📊 Dataset:")
print(f"   Samples: {X_housing.shape[0]:,}")
print(f"   Features: {X_housing.shape[1]}")
print(f"   Feature names: {list(housing.feature_names)}")
print(f"   Target: Median house value (in $100k)")

# Feature Scaling (Important for gradient descent!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_housing)

# Add bias term
X_b_multi = np.c_[np.ones((X_scaled.shape[0], 1)), X_scaled]

print(f"\n📏 Feature Scaling:")
print(f"   Before scaling - mean: {X_housing.mean(axis=0)[:3]}...")
print(f"   After scaling  - mean: {X_scaled.mean(axis=0)[:3]}...")
print(f"   After scaling  - std:  {X_scaled.std(axis=0)[:3]}...")

# Train-test split
m = len(X_b_multi)
indices = np.random.permutation(m)
train_size = int(0.8 * m)
train_idx, test_idx = indices[:train_size], indices[train_size:]

X_train = X_b_multi[train_idx]
y_train = y_housing[train_idx]
X_test = X_b_multi[test_idx]
y_test = y_housing[test_idx]

print(f"\n🔀 Train-Test Split:")
print(f"   Training samples: {len(X_train):,}")
print(f"   Test samples: {len(X_test):,}")

# Gradient Descent
theta_init = np.zeros((X_b_multi.shape[1], 1))
alpha = 0.1
num_iters = 1000

theta_gd_multi, cost_history_multi, _ = gradient_descent(X_train, y_train, theta_init, alpha, num_iters)

# Normal Equation (for comparison)
theta_normal_multi = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train

# Evaluate
def evaluate_model(X, y, theta, name):
    predictions = X @ theta
    mse = np.mean((predictions - y)**2)
    rmse = np.sqrt(mse)
    r2 = 1 - (np.sum((y - predictions)**2) / np.sum((y - y.mean())**2))
    return mse, rmse, r2

train_mse_gd, train_rmse_gd, train_r2_gd = evaluate_model(X_train, y_train, theta_gd_multi, "GD Train")
test_mse_gd, test_rmse_gd, test_r2_gd = evaluate_model(X_test, y_test, theta_gd_multi, "GD Test")

train_mse_normal, train_rmse_normal, train_r2_normal = evaluate_model(X_train, y_train, theta_normal_multi, "Normal Train")
test_mse_normal, test_rmse_normal, test_r2_normal = evaluate_model(X_test, y_test, theta_normal_multi, "Normal Test")

print(f"\n📊 Results:")
print(f"\n   Gradient Descent:")
print(f"      Train RMSE: {train_rmse_gd:.4f} | R²: {train_r2_gd:.4f}")
print(f"      Test RMSE:  {test_rmse_gd:.4f} | R²: {test_r2_gd:.4f}")
print(f"\n   Normal Equation:")
print(f"      Train RMSE: {train_rmse_normal:.4f} | R²: {train_r2_normal:.4f}")
print(f"      Test RMSE:  {test_rmse_normal:.4f} | R²: {test_r2_normal:.4f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Cost convergence
axes[0].plot(cost_history_multi, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost J(θ)')
axes[0].set_title('Training Cost vs Iteration')
axes[0].grid(True, alpha=0.3)

# Feature weights
feature_names = ['Bias'] + list(housing.feature_names)
axes[1].barh(feature_names, theta_gd_multi.flatten(), alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title('Learned Feature Weights')
axes[1].grid(True, alpha=0.3, axis='x')

# Predictions vs Actual
y_pred_test = X_test @ theta_gd_multi
axes[2].scatter(y_test, y_pred_test, alpha=0.3)
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual Price ($100k)')
axes[2].set_ylabel('Predicted Price ($100k)')
axes[2].set_title(f'Predictions vs Actual (R²={test_r2_gd:.3f})')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   • Feature scaling essential for multivariate gradient descent")
print("   • Gradient descent converges to same solution as Normal Equation")
print("   • Model explains ~60% of variance in house prices")
print("   • MedInc (median income) has largest positive coefficient")

1.5 Gradient Descent Variants

Comparison

Method          Updates per iteration   Pros                             Cons
-------------   ---------------------   ------------------------------   -------------------------------
Batch GD        All m examples          Stable, guaranteed convergence   Slow for large datasets
Stochastic GD   1 example               Fast, works with online data     Noisy, may not converge exactly
Mini-batch GD   b examples (b < m)      Balance of both                  Need to tune batch size

# Implement SGD and Mini-batch GD
def stochastic_gradient_descent(X, y, theta, alpha, num_epochs):
    """Stochastic Gradient Descent"""
    m = len(y)
    cost_history = []
    
    for epoch in range(num_epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Update for each example
        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]
            
            prediction = xi @ theta
            gradient = xi.T @ (prediction - yi)
            theta = theta - alpha * gradient
        
        # Record cost after each epoch
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    
    return theta, cost_history

def minibatch_gradient_descent(X, y, theta, alpha, num_epochs, batch_size=32):
    """Mini-batch Gradient Descent"""
    m = len(y)
    cost_history = []
    
    for epoch in range(num_epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Process mini-batches
        for i in range(0, m, batch_size):
            Xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            
            predictions = Xi @ theta
            gradients = (1/len(yi)) * Xi.T @ (predictions - yi)
            theta = theta - alpha * gradients
        
        # Record cost after each epoch
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    
    return theta, cost_history

print("⚡ Gradient Descent Variants Comparison")
print("="*70)

# Use subset of data for faster demo
sample_size = 5000
X_sample = X_b_multi[:sample_size]
y_sample = y_housing[:sample_size]

num_epochs = 50
alpha = 0.01

# Batch GD
theta_init = np.zeros((X_sample.shape[1], 1))
theta_batch, cost_batch, _ = gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs)

# SGD
theta_sgd, cost_sgd = stochastic_gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs)

# Mini-batch GD (batch_size=32)
theta_minibatch, cost_minibatch = minibatch_gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs, batch_size=32)

print(f"\n📊 Final Costs:")
print(f"   Batch GD:      {cost_batch[-1]:.6f}")
print(f"   SGD:           {cost_sgd[-1]:.6f}")
print(f"   Mini-batch GD: {cost_minibatch[-1]:.6f}")

# Visualize convergence
plt.figure(figsize=(12, 6))
plt.plot(cost_batch, 'b-', linewidth=2, label='Batch GD', alpha=0.7)
plt.plot(cost_sgd, 'r-', linewidth=1, label='Stochastic GD', alpha=0.7)
plt.plot(cost_minibatch, 'g-', linewidth=2, label='Mini-batch GD (batch=32)', alpha=0.7)
plt.xlabel('Epoch')
plt.ylabel('Cost J(θ)')
plt.title('Gradient Descent Variants: Convergence Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\n💡 Observations:")
print("   • Batch GD: Smooth convergence, most stable")
print("   • SGD: Noisy but fast, oscillates around minimum")
print("   • Mini-batch: Good balance - less noisy than SGD, faster than Batch")
print("\n   In practice: Mini-batch GD is most commonly used (especially in deep learning)")

Key Takeaways

Linear Regression Fundamentals

Hypothesis: \[h_\theta(x) = \theta^T x\]

Cost Function: \[J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\]

Learning Algorithms

1. Normal Equation

  • Closed-form solution: \(\theta = (X^TX)^{-1}X^Ty\)

  • Pros: Direct solution, no iterations

  • Cons: Slow for large n (O(n³)), doesn’t scale

2. Gradient Descent

  • Iterative: \(\theta := \theta - \alpha \nabla J(\theta)\)

  • Pros: Scales to large n, works for other algorithms

  • Cons: Needs tuning (α), many iterations

Best Practices

  • Feature Scaling: Always scale features for gradient descent

  • Learning Rate: Start with α = 0.01 or 0.1; adjust based on convergence

  • Convergence: Monitor the cost function and check that it decreases every iteration

  • Variants: Use mini-batch GD for large datasets

  • Evaluation: Use a train-test split; report R² and RMSE

When to Use

Normal Equation:

  • n < 10,000

  • Need exact solution

  • \(X^TX\) is invertible

Gradient Descent:

  • n > 10,000

  • Large datasets

  • Foundation for advanced algorithms
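
One caveat worth illustrating: when \(X^TX\) is exactly singular (e.g., a duplicated feature), `np.linalg.inv` raises `LinAlgError`, while `np.linalg.pinv` or `np.linalg.lstsq` still return a least-squares solution. A toy sketch:

```python
import numpy as np

# Toy design matrix with a duplicated feature, so X^T X is singular
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# np.linalg.inv(X.T @ X) would raise LinAlgError here; the pseudo-inverse
# instead returns the minimum-norm least-squares solution
theta = np.linalg.pinv(X) @ y
print(theta)

# np.linalg.lstsq produces an equally good fit for rank-deficient X
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ theta, y), np.allclose(X @ theta_ls, y))
```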

Next Steps

Lecture 2: Classification (Logistic Regression)
Lecture 3: Regularization (Overfitting prevention)
Lecture 4: Neural Networks

References

  • CS229 Lecture Notes: Linear Regression

  • Andrew Ng, Stanford University

  • Machine Learning Yearning (deeplearning.ai)

Practice Exercises

Exercise 1.1: Implement from Scratch

Without using sklearn or numpy’s advanced functions:

  1. Implement compute_cost() using only loops

  2. Implement gradient_descent() using only loops

  3. Test on simple 1D dataset

  4. Verify matches numpy version

Exercise 1.2: Learning Rate Experiment

  1. Try α = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]

  2. For each, run 1000 iterations

  3. Plot cost vs iteration for all α

  4. Identify: (a) optimal α, (b) too small, (c) too large

  5. What happens when α > 1.0?

Exercise 1.3: Feature Engineering

Using California Housing:

  1. Create polynomial features (x², x₁x₂, etc.)

  2. Train model with original vs polynomial features

  3. Compare R² on train and test sets

  4. Discuss overfitting

Exercise 1.4: Batch Size Analysis

For mini-batch GD:

  1. Try batch sizes: [1, 8, 32, 128, 512]

  2. Measure: (a) time per epoch, (b) final cost

  3. Plot convergence for each

  4. Determine optimal batch size

Exercise 1.5: Normal Equation vs GD

Generate datasets with n = [10, 100, 1000, 10000]:

  1. Time Normal Equation for each n

  2. Time Gradient Descent for each n

  3. Plot time vs n for both methods

  4. At what n does GD become faster?

Exercise 1.6: Vectorization Benefits

  1. Implement gradient descent with loops (no vectorization)

  2. Implement vectorized version

  3. Compare execution time on dataset with m=10000

  4. Calculate speedup factor

Exercise 1.7: Real Dataset

Load diabetes dataset:

from sklearn.datasets import load_diabetes

  1. Apply linear regression

  2. Analyze which features are most important

  3. Create residual plots

  4. Check for heteroscedasticity

Exercise 1.8: Regularization Preview

  1. Generate data with many correlated features

  2. Fit standard linear regression

  3. Observe coefficient instability

  4. Add L2 penalty: \(J(\theta) = MSE + \lambda \sum \theta_j^2\)

  5. Compare coefficients with/without regularization
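
As a hint for Exercise 1.8 (not part of the lecture itself), the L2-penalized cost also has a closed-form solution, \(\theta = (X^TX + \lambda I)^{-1}X^Ty\). A minimal sketch with an assumed λ = 1 and synthetic correlated features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated features make plain least-squares coefficients unstable
m = 50
x1 = rng.standard_normal(m)
x2 = x1 + 0.01 * rng.standard_normal(m)  # near-duplicate of x1
y = 3 * x1 + 2 * x2 + 0.1 * rng.standard_normal(m)
X = np.c_[np.ones(m), x1, x2]

lam = 1.0  # assumed regularization strength for illustration
I = np.eye(X.shape[1])
I[0, 0] = 0.0  # conventionally, the bias term is not penalized

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_ridge = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y

print("OLS:  ", theta_ols)    # x1/x2 weights can be large and offsetting
print("Ridge:", theta_ridge)  # shrunk toward a stable split of the shared signal
```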