Lecture 1: Introduction & Linear Regression

📹 Watch Lecture

What is Machine Learning?

Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell (1998): A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Types of Machine Learning

1. Supervised Learning

  • Given “right answers” (labeled data)

  • Learn mapping from input to output

  • Examples:

    • Regression: Predict continuous values

    • Classification: Predict discrete labels

2. Unsupervised Learning

  • No labels provided

  • Find structure in data

  • Examples:

    • Clustering

    • Dimensionality reduction

    • Density estimation

3. Reinforcement Learning

  • Learn from rewards/penalties

  • Agent interacts with environment

  • Examples:

    • Game playing

    • Robotics

    • Autonomous vehicles

Linear Regression

Problem Setup

Training Set:

  • m training examples: \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\)

  • \(x \in \mathbb{R}^n\): input features (n-dimensional)

  • \(y \in \mathbb{R}\): output (real-valued)

Goal: Learn function \(h: X \rightarrow Y\) (hypothesis)

Hypothesis

For linear regression: \[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n\]

Vectorized form: \[h_\theta(x) = \theta^T x\]

where:

  • \(\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T\) (parameters)

  • \(x = [1, x_1, x_2, \ldots, x_n]^T\) (features with bias term)
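
As a concrete sketch (toy parameter values, not fit to any data), the vectorized hypothesis is a single dot product in NumPy:

```python
import numpy as np

# Toy parameters theta = [theta_0, theta_1, theta_2]^T (illustrative values only)
theta = np.array([50.0, 0.1, 20.0])

# One example x = [1, x_1, x_2]^T; the leading 1 multiplies the bias term theta_0
x = np.array([1.0, 2104.0, 3.0])  # e.g. 2,104 sq ft, 3 bedrooms

# h_theta(x) = theta^T x
h = theta @ x  # 50 + 0.1*2104 + 20*3 = 320.4
print(h)
```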

Cost Function

Mean Squared Error (MSE): \[J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\]

Matrix form: \[J(\theta) = \frac{1}{2m} (X\theta - y)^T(X\theta - y)\]
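
A quick numeric check on a toy training set (made up for illustration) confirms the sum form and the matrix form compute the same cost:

```python
import numpy as np

# Toy training set: m = 3 examples, design matrix with a bias column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.array([1.0, 0.5])
m = len(y)

# Sum form: J = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
J_sum = (1 / (2 * m)) * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# Matrix form: J = (1/2m) * (X@theta - y)^T (X@theta - y)
r = X @ theta - y
J_mat = (1 / (2 * m)) * (r @ r)

print(J_sum, J_mat)  # identical values
```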

Learning Algorithms

1. Gradient Descent (Batch): \[\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\]

2. Normal Equation (Closed-form): \[\theta = (X^TX)^{-1}X^Ty\]

3. Stochastic Gradient Descent (SGD): update parameters using one example at a time
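
To make the normal equation concrete before the full examples below, here is a minimal sketch on synthetic data (assumed true parameters θ₀ = 4, θ₁ = 3) showing it recovers the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from y = 4 + 3x plus small noise
m = 100
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + 0.01 * rng.standard_normal(m)

# Design matrix with a bias column
X = np.c_[np.ones(m), x]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # close to the true [4, 3]
```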

Motivating Example: Portland Housing Prices

From Andrew Ng’s CS229 Lecture 2:

“Let’s say you want to predict or estimate the prices of houses. The way you build a learning algorithm is start by collecting a dataset of houses and their prices. This is data from Portland, Oregon…”

Dataset collected from Craigslist (Portland, OR):

  • Feature: Size of house (square feet)

  • Target: Price (in thousands of dollars)

  • Example: 2,104 sq ft house with asking price $400,000

Supervised Learning Process:

  1. Training Set → Learning Algorithm → Hypothesis (h)

  2. Hypothesis: Takes house size as input, outputs estimated price

  3. Goal: Find the best hypothesis that fits the training data

Why Start with Linear Regression?

“Linear regression is maybe the simplest possible learning algorithm - a supervised learning regression problem.” - Andrew Ng

Even though self-driving cars (like ALVINN) use more complex models, linear regression provides the foundation for understanding:

  • How to represent a hypothesis

  • How to define a cost function

  • How to optimize parameters

  • The workflow of machine learning algorithms

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Portland Housing Data (from CS229 Lecture 2 - Craigslist data)
portland_data = {
    'size_sqft': [2104, 1600, 2400, 1416, 3000, 1985, 1534, 1427, 1380, 1494,
                  1940, 2000, 1890, 4478, 1268, 2300, 1320, 1236, 2609, 3031],
    'bedrooms': [3, 3, 3, 2, 4, 4, 3, 3, 3, 3,
                 4, 3, 3, 5, 3, 4, 2, 3, 4, 4],
    'price_1000s': [400, 330, 369, 232, 540, 300, 315, 199, 212, 243,
                    240, 347, 330, 700, 240, 380, 225, 160, 363, 412]
}

df_portland = pd.DataFrame(portland_data)

print("🏠 CS229 Portland Housing Dataset")
print("="*70)
print(f"\nDataset from Andrew Ng's lecture (Craigslist Portland, Oregon)")
print(f"Samples: {len(df_portland)}")
print(f"\nFirst 5 examples:")
print(df_portland.head())
print(f"\nSummary Statistics:")
print(df_portland.describe())

# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Size vs Price
axes[0].scatter(df_portland['size_sqft'], df_portland['price_1000s'], 
                alpha=0.6, s=100, c='steelblue', edgecolors='black', linewidth=1)
axes[0].set_xlabel('Size (sq ft)', fontsize=12)
axes[0].set_ylabel('Price ($1000s)', fontsize=12)
axes[0].set_title('Portland Housing: Size vs Price', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Bedrooms vs Price
axes[1].scatter(df_portland['bedrooms'], df_portland['price_1000s'], 
                alpha=0.6, s=100, c='coral', edgecolors='black', linewidth=1)
axes[1].set_xlabel('Number of Bedrooms', fontsize=12)
axes[1].set_ylabel('Price ($1000s)', fontsize=12)
axes[1].set_title('Portland Housing: Bedrooms vs Price', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observation:")
print("   • Price generally increases with house size")
print("   • Price also correlates with number of bedrooms")
print("   • Linear relationship appears reasonable as first approximation")

1.1 Univariate Linear Regression: Portland Housing

From Lecture 2: “How do you represent the hypothesis H? In linear regression, the hypothesis is going to be that the output (price) is a linear function of the input (size).”

\[h_\theta(x) = \theta_0 + \theta_1 x\]

Where:

  • \(x\) = size of house (square feet)

  • \(h_\theta(x)\) = predicted price

  • \(\theta_0\) = intercept (base price)

  • \(\theta_1\) = slope (price per sq ft)

“For the mathematicians in the room, you’ll say technically this is an affine function. If it was a linear function, there’s no θ₀… but machine learning sometimes just calls this a linear function.” - Andrew Ng

# Portland Housing: Univariate Linear Regression (Size → Price)
print("📊 Univariate Linear Regression: Size vs Price")
print("="*70)

# Extract size and price from Portland data
X_portland = df_portland['size_sqft'].values.reshape(-1, 1)
y_portland = df_portland['price_1000s'].values.reshape(-1, 1)

print(f"\nDataset: {len(X_portland)} houses from Portland, OR")
print(f"Input (X):  House size (sq ft)")
print(f"Output (Y): Price ($1000s)")

# Add bias term (x0 = 1) to create design matrix
X_b = np.c_[np.ones((len(X_portland), 1)), X_portland]

print(f"\nDesign Matrix X shape: {X_b.shape}")
print(f"   Column 0: x₀ = 1 (bias term)")
print(f"   Column 1: x₁ = size (sq ft)")

# SOLUTION 1: Normal Equation (Analytical)
# θ = (X^T X)^(-1) X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_portland

print(f"\n🎯 Normal Equation Solution:")
print(f"   θ₀ (intercept): ${theta_normal[0][0]:.2f}k  (base price)")
print(f"   θ₁ (slope):     ${theta_normal[1][0]:.4f}k per sq ft")
print(f"\n   Interpretation:")
print(f"   • Base price (0 sq ft house): ${theta_normal[0][0]:.2f}k")
print(f"   • Each additional sq ft adds: ${theta_normal[1][0]*1000:.2f}")

# Make predictions
X_test = np.array([[0], [5000]])  # 0 to 5000 sq ft
X_test_b = np.c_[np.ones((2, 1)), X_test]
y_pred_line = X_test_b @ theta_normal

# Predict for some example houses
examples = np.array([[1500], [2000], [2500], [3000]])
examples_b = np.c_[np.ones((len(examples), 1)), examples]
predictions = examples_b @ theta_normal

print(f"\n📈 Example Predictions:")
for size, pred in zip(examples, predictions):
    print(f"   {size[0]:,} sq ft → ${pred[0]:.2f}k (${pred[0]*1000:.0f})")

# Visualize
plt.figure(figsize=(12, 6))
plt.scatter(X_portland, y_portland, alpha=0.7, s=100, c='steelblue', 
            edgecolors='black', linewidth=1, label='Training data (Portland)')
plt.plot(X_test, y_pred_line, 'r-', linewidth=3, 
         label=f'Fit: y = {theta_normal[1][0]:.4f}x + {theta_normal[0][0]:.2f}')

# Highlight specific prediction
highlight_size = 2104  # First house in dataset
highlight_pred = theta_normal[0][0] + theta_normal[1][0] * highlight_size
plt.scatter([highlight_size], [highlight_pred], s=300, c='red', marker='*', 
            edgecolors='black', linewidth=2, zorder=5,
            label=f'Example: 2,104 sq ft → ${highlight_pred:.0f}k')

plt.xlabel('Size (square feet)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('Linear Regression: Portland Housing Prices', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate performance metrics
y_pred_train = X_b @ theta_normal
mse = np.mean((y_pred_train - y_portland)**2)
rmse = np.sqrt(mse)
r2 = 1 - (np.sum((y_portland - y_pred_train)**2) / np.sum((y_portland - y_portland.mean())**2))

print(f"\n📊 Model Performance:")
print(f"   Mean Squared Error (MSE):  {mse:.2f}")
print(f"   Root MSE (RMSE):          ${rmse:.2f}k")
print(f"   R² Score:                  {r2:.4f} ({r2*100:.2f}% variance explained)")

print(f"\n💡 Key Insight:")
print(f"   The Normal Equation gives us the optimal parameters in closed form:")
print(f"   θ* = (X^T X)^(-1) X^T y")

1.2 Gradient Descent Implementation

Batch Gradient Descent Algorithm

Repeat until convergence: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\]

Vectorized: \[\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\]

# Implement Gradient Descent from scratch
def compute_cost(X, y, theta):
    """Compute MSE cost function"""
    m = len(y)
    predictions = X @ theta
    cost = (1/(2*m)) * np.sum((predictions - y)**2)
    return cost

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch Gradient Descent"""
    m = len(y)
    cost_history = np.zeros(num_iters)
    theta_history = np.zeros((num_iters, len(theta)))
    
    for i in range(num_iters):
        # Compute predictions
        predictions = X @ theta
        
        # Compute gradients
        gradients = (1/m) * X.T @ (predictions - y)
        
        # Update parameters
        theta = theta - alpha * gradients
        
        # Save cost and theta
        cost_history[i] = compute_cost(X, y, theta)
        theta_history[i] = theta.flatten()
    
    return theta, cost_history, theta_history

print("🔄 Gradient Descent Implementation")
print("="*70)

# Initialize parameters
theta_init = np.random.randn(2, 1)
alpha = 0.1
num_iters = 1000

print(f"\n⚙️ Hyperparameters:")
print(f"   Learning rate (α): {alpha}")
print(f"   Iterations: {num_iters}")
print(f"   Initial θ: {theta_init.flatten()}")

# Feature scaling: the raw sq-ft values (~2,000) would make the gradients
# explode at this learning rate, so standardize the size feature first
X_b_scaled = np.c_[np.ones((len(X_portland), 1)),
                   (X_portland - X_portland.mean()) / X_portland.std()]

# Run gradient descent
theta_gd, cost_history, theta_history = gradient_descent(X_b_scaled, y_portland, theta_init, alpha, num_iters)

print(f"\n🎯 Gradient Descent Solution:")
print(f"   θ₀ (intercept): {theta_gd[0][0]:.4f}")
print(f"   θ₁ (slope):     {theta_gd[1][0]:.4f}")
print(f"   Final cost:     {cost_history[-1]:.6f}")

# Compare with the Normal Equation (on the same scaled design matrix)
theta_best = np.linalg.inv(X_b_scaled.T @ X_b_scaled) @ X_b_scaled.T @ y_portland
print(f"\n📊 Comparison:")
print(f"   Method          θ₀        θ₁        Final Cost")
print(f"   Normal Eq:   {theta_best[0][0]:7.4f}  {theta_best[1][0]:7.4f}  {compute_cost(X_b_scaled, y_portland, theta_best):.6f}")
print(f"   Grad Descent:{theta_gd[0][0]:7.4f}  {theta_gd[1][0]:7.4f}  {cost_history[-1]:.6f}")

# Visualize convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cost history
axes[0].plot(cost_history, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost J(θ)')
axes[0].set_title('Gradient Descent: Cost vs Iteration')
axes[0].grid(True, alpha=0.3)

# Parameter evolution
axes[1].plot(theta_history[:, 0], label='θ₀ (intercept)', linewidth=2)
axes[1].plot(theta_history[:, 1], label='θ₁ (slope)', linewidth=2)
axes[1].axhline(theta_best[0, 0], color='C0', linestyle='--', alpha=0.5, label='θ₀ optimal')
axes[1].axhline(theta_best[1, 0], color='C1', linestyle='--', alpha=0.5, label='θ₁ optimal')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('Parameter Evolution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observation: Gradient descent converges to same solution as Normal Equation!")

1.3 Learning Rate Analysis

The learning rate α is critical:

  • Too small: Slow convergence

  • Too large: May not converge, oscillate, or diverge

  • Just right: Fast and stable convergence

# Compare different learning rates
print("📊 Learning Rate Comparison")
print("="*70)

learning_rates = [0.001, 0.01, 0.1, 0.5]
colors = ['red', 'orange', 'green', 'blue']
num_iters = 500

# Standardize the size feature (as in Section 1.2) so the larger rates don't diverge
X_b_lr = np.c_[np.ones((len(X_portland), 1)),
               (X_portland - X_portland.mean()) / X_portland.std()]

plt.figure(figsize=(12, 6))

for alpha, color in zip(learning_rates, colors):
    theta_init = np.random.randn(2, 1)
    theta, cost_history, _ = gradient_descent(X_b_lr, y_portland, theta_init, alpha, num_iters)
    
    converged = cost_history[-1] < cost_history[0] * 0.01
    status = "✓ Converged" if converged else "✗ Slow"
    
    plt.plot(cost_history, color=color, linewidth=2, 
             label=f'α = {alpha} ({status})', alpha=0.8)
    
    print(f"\nα = {alpha:5.3f}: Final cost = {cost_history[-1]:.6f} {status}")

plt.xlabel('Iteration')
plt.ylabel('Cost J(θ)')
plt.title('Effect of Learning Rate on Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("\n💡 Key Insights:")
print("   • α = 0.001: Too slow, needs many iterations")
print("   • α = 0.01:  Better, but still slow")
print("   • α = 0.1:   Good balance - fast and stable")
print("   • α = 0.5:   Very fast convergence")
print("\n   Recommendation: Start with α around 0.1 and tune if needed")

1.4 Multivariate Linear Regression

Now extend to multiple features: \[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n\]

# California Housing Dataset (8 features)
print("🏠 Multivariate Linear Regression: California Housing")
print("="*70)

# Load data
housing = fetch_california_housing(as_frame=True)
X_housing = housing.data.values
y_housing = housing.target.values.reshape(-1, 1)

print(f"\n📊 Dataset:")
print(f"   Samples: {X_housing.shape[0]:,}")
print(f"   Features: {X_housing.shape[1]}")
print(f"   Feature names: {list(housing.feature_names)}")
print(f"   Target: Median house value (in $100k)")

# Feature Scaling (Important for gradient descent!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_housing)

# Add bias term
X_b_multi = np.c_[np.ones((X_scaled.shape[0], 1)), X_scaled]

print(f"\n📏 Feature Scaling:")
print(f"   Before scaling - mean: {X_housing.mean(axis=0)[:3]}...")
print(f"   After scaling  - mean: {X_scaled.mean(axis=0)[:3]}...")
print(f"   After scaling  - std:  {X_scaled.std(axis=0)[:3]}...")

# Train-test split
m = len(X_b_multi)
indices = np.random.permutation(m)
train_size = int(0.8 * m)
train_idx, test_idx = indices[:train_size], indices[train_size:]

X_train = X_b_multi[train_idx]
y_train = y_housing[train_idx]
X_test = X_b_multi[test_idx]
y_test = y_housing[test_idx]

print(f"\n🔀 Train-Test Split:")
print(f"   Training samples: {len(X_train):,}")
print(f"   Test samples: {len(X_test):,}")

# Gradient Descent
theta_init = np.zeros((X_b_multi.shape[1], 1))
alpha = 0.1
num_iters = 1000

theta_gd_multi, cost_history_multi, _ = gradient_descent(X_train, y_train, theta_init, alpha, num_iters)

# Normal Equation (for comparison)
theta_normal_multi = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train

# Evaluate
def evaluate_model(X, y, theta, name):
    predictions = X @ theta
    mse = np.mean((predictions - y)**2)
    rmse = np.sqrt(mse)
    r2 = 1 - (np.sum((y - predictions)**2) / np.sum((y - y.mean())**2))
    return mse, rmse, r2

train_mse_gd, train_rmse_gd, train_r2_gd = evaluate_model(X_train, y_train, theta_gd_multi, "GD Train")
test_mse_gd, test_rmse_gd, test_r2_gd = evaluate_model(X_test, y_test, theta_gd_multi, "GD Test")

train_mse_normal, train_rmse_normal, train_r2_normal = evaluate_model(X_train, y_train, theta_normal_multi, "Normal Train")
test_mse_normal, test_rmse_normal, test_r2_normal = evaluate_model(X_test, y_test, theta_normal_multi, "Normal Test")

print(f"\n📊 Results:")
print(f"\n   Gradient Descent:")
print(f"      Train RMSE: {train_rmse_gd:.4f} | R²: {train_r2_gd:.4f}")
print(f"      Test RMSE:  {test_rmse_gd:.4f} | R²: {test_r2_gd:.4f}")
print(f"\n   Normal Equation:")
print(f"      Train RMSE: {train_rmse_normal:.4f} | R²: {train_r2_normal:.4f}")
print(f"      Test RMSE:  {test_rmse_normal:.4f} | R²: {test_r2_normal:.4f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Cost convergence
axes[0].plot(cost_history_multi, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost J(θ)')
axes[0].set_title('Training Cost vs Iteration')
axes[0].grid(True, alpha=0.3)

# Feature weights
feature_names = ['Bias'] + list(housing.feature_names)
axes[1].barh(feature_names, theta_gd_multi.flatten(), alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title('Learned Feature Weights')
axes[1].grid(True, alpha=0.3, axis='x')

# Predictions vs Actual
y_pred_test = X_test @ theta_gd_multi
axes[2].scatter(y_test, y_pred_test, alpha=0.3)
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual Price ($100k)')
axes[2].set_ylabel('Predicted Price ($100k)')
axes[2].set_title(f'Predictions vs Actual (R²={test_r2_gd:.3f})')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   • Feature scaling essential for multivariate gradient descent")
print("   • Gradient descent converges to same solution as Normal Equation")
print("   • Model explains ~60% of variance in house prices")
print("   • MedInc (median income) has largest positive coefficient")

1.5 Gradient Descent Variants

Comparison

Method          Updates per iteration   Pros                             Cons
-------------   ---------------------   ------------------------------   -------------------------------
Batch GD        All m examples          Stable, guaranteed convergence   Slow for large datasets
Stochastic GD   1 example               Fast, works with online data     Noisy, may not converge exactly
Mini-batch GD   b examples (b < m)      Balance of both                  Need to tune batch size

# Implement SGD and Mini-batch GD
def stochastic_gradient_descent(X, y, theta, alpha, num_epochs):
    """Stochastic Gradient Descent"""
    m = len(y)
    cost_history = []
    
    for epoch in range(num_epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Update for each example
        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]
            
            prediction = xi @ theta
            gradient = xi.T @ (prediction - yi)
            theta = theta - alpha * gradient
        
        # Record cost after each epoch
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    
    return theta, cost_history

def minibatch_gradient_descent(X, y, theta, alpha, num_epochs, batch_size=32):
    """Mini-batch Gradient Descent"""
    m = len(y)
    cost_history = []
    
    for epoch in range(num_epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Process mini-batches
        for i in range(0, m, batch_size):
            Xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            
            predictions = Xi @ theta
            gradients = (1/len(yi)) * Xi.T @ (predictions - yi)
            theta = theta - alpha * gradients
        
        # Record cost after each epoch
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    
    return theta, cost_history

print("⚡ Gradient Descent Variants Comparison")
print("="*70)

# Use subset of data for faster demo
sample_size = 5000
X_sample = X_b_multi[:sample_size]
y_sample = y_housing[:sample_size]

num_epochs = 50
alpha = 0.01

# Batch GD
theta_init = np.zeros((X_sample.shape[1], 1))
theta_batch, cost_batch, _ = gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs)

# SGD
theta_sgd, cost_sgd = stochastic_gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs)

# Mini-batch GD (batch_size=32)
theta_minibatch, cost_minibatch = minibatch_gradient_descent(X_sample, y_sample, theta_init.copy(), alpha, num_epochs, batch_size=32)

print(f"\n📊 Final Costs:")
print(f"   Batch GD:      {cost_batch[-1]:.6f}")
print(f"   SGD:           {cost_sgd[-1]:.6f}")
print(f"   Mini-batch GD: {cost_minibatch[-1]:.6f}")

# Visualize convergence
plt.figure(figsize=(12, 6))
plt.plot(cost_batch, 'b-', linewidth=2, label='Batch GD', alpha=0.7)
plt.plot(cost_sgd, 'r-', linewidth=1, label='Stochastic GD', alpha=0.7)
plt.plot(cost_minibatch, 'g-', linewidth=2, label='Mini-batch GD (batch=32)', alpha=0.7)
plt.xlabel('Epoch')
plt.ylabel('Cost J(θ)')
plt.title('Gradient Descent Variants: Convergence Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\n💡 Observations:")
print("   • Batch GD: Smooth convergence, most stable")
print("   • SGD: Noisy but fast, oscillates around minimum")
print("   • Mini-batch: Good balance - less noisy than SGD, faster than Batch")
print("\n   In practice: Mini-batch GD is most commonly used (especially in deep learning)")

Key Takeaways

Linear Regression Fundamentals

Hypothesis: \[h_\theta(x) = \theta^T x\]

Cost Function: \[J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\]

Learning Algorithms

1. Normal Equation

  • Closed-form solution: \(\theta = (X^TX)^{-1}X^Ty\)

  • Pros: Direct solution, no iterations

  • Cons: Slow for large n (O(n³)), doesn’t scale

2. Gradient Descent

  • Iterative: \(\theta := \theta - \alpha \nabla J(\theta)\)

  • Pros: Scales to large n, works for other algorithms

  • Cons: Needs tuning (α), many iterations

Best Practices

  • Feature Scaling: Always scale features for gradient descent

  • Learning Rate: Start with α = 0.01 or 0.1; adjust based on convergence

  • Convergence: Monitor the cost function and check that it decreases every iteration

  • Variants: Use mini-batch GD for large datasets

  • Evaluation: Use a train-test split; report R² and RMSE

When to Use

Normal Equation:

  • n < 10,000

  • Need exact solution

  • \(X^TX\) is invertible

Gradient Descent:

  • n > 10,000

  • Large datasets

  • Foundation for advanced algorithms
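
One caveat worth illustrating: when \(X^TX\) is exactly singular (e.g., a duplicated feature), `np.linalg.inv` raises `LinAlgError`, while `np.linalg.pinv` or `np.linalg.lstsq` still return a least-squares solution. A toy sketch:

```python
import numpy as np

# Toy design matrix with a duplicated feature, so X^T X is singular
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# np.linalg.inv(X.T @ X) would raise LinAlgError here; the pseudo-inverse
# instead returns the minimum-norm least-squares solution
theta = np.linalg.pinv(X) @ y
print(theta)

# np.linalg.lstsq produces an equally good fit for rank-deficient X
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ theta, y), np.allclose(X @ theta_ls, y))
```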

Next Steps

Lecture 2: Classification (Logistic Regression)
Lecture 3: Regularization (Overfitting prevention)
Lecture 4: Neural Networks

References

  • CS229 Lecture Notes: Linear Regression

  • Andrew Ng, Stanford University

  • Machine Learning Yearning (deeplearning.ai)

Practice Exercises

Exercise 1.1: Implement from Scratch

Without using sklearn or numpy’s advanced functions:

  1. Implement compute_cost() using only loops

  2. Implement gradient_descent() using only loops

  3. Test on simple 1D dataset

  4. Verify matches numpy version

Exercise 1.2: Learning Rate Experiment

  1. Try α = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]

  2. For each, run 1000 iterations

  3. Plot cost vs iteration for all α

  4. Identify: (a) optimal α, (b) too small, (c) too large

  5. What happens when α > 1.0?

Exercise 1.3: Feature Engineering

Using California Housing:

  1. Create polynomial features (x², x₁x₂, etc.)

  2. Train model with original vs polynomial features

  3. Compare R² on train and test sets

  4. Discuss overfitting

Exercise 1.4: Batch Size Analysis

For mini-batch GD:

  1. Try batch sizes: [1, 8, 32, 128, 512]

  2. Measure: (a) time per epoch, (b) final cost

  3. Plot convergence for each

  4. Determine optimal batch size

Exercise 1.5: Normal Equation vs GD

Generate datasets with n = [10, 100, 1000, 10000]:

  1. Time Normal Equation for each n

  2. Time Gradient Descent for each n

  3. Plot time vs n for both methods

  4. At what n does GD become faster?

Exercise 1.6: Vectorization Benefits

  1. Implement gradient descent with loops (no vectorization)

  2. Implement vectorized version

  3. Compare execution time on dataset with m=10000

  4. Calculate speedup factor

Exercise 1.7: Real Dataset

Load diabetes dataset:

from sklearn.datasets import load_diabetes

  1. Apply linear regression

  2. Analyze which features are most important

  3. Create residual plots

  4. Check for heteroscedasticity

Exercise 1.8: Regularization Preview

  1. Generate data with many correlated features

  2. Fit standard linear regression

  3. Observe coefficient instability

  4. Add L2 penalty: \(J(\theta) = MSE + \lambda \sum \theta_j^2\)

  5. Compare coefficients with/without regularization
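
As a hint for Exercise 1.8 (not part of the lecture itself), the L2-penalized cost also has a closed-form solution, \(\theta = (X^TX + \lambda I)^{-1}X^Ty\). A minimal sketch with an assumed λ = 1 and synthetic correlated features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated features make plain least-squares coefficients unstable
m = 50
x1 = rng.standard_normal(m)
x2 = x1 + 0.01 * rng.standard_normal(m)  # near-duplicate of x1
y = 3 * x1 + 2 * x2 + 0.1 * rng.standard_normal(m)
X = np.c_[np.ones(m), x1, x2]

lam = 1.0  # assumed regularization strength for illustration
I = np.eye(X.shape[1])
I[0, 0] = 0.0  # conventionally, the bias term is not penalized

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_ridge = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y

print("OLS:  ", theta_ols)    # x1/x2 weights can be large and offsetting
print("Ridge:", theta_ridge)  # shrunk toward a stable split of the shared signal
```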