Lecture 2: Linear Regression and Gradient Descent


From Andrew Ng’s CS229 Lecture 2 (Autumn 2018)

2.1 Supervised Learning Framework

The Process

Training Set → Learning Algorithm → Hypothesis \(h\)

“The job of the learning algorithm is to output a function… a hypothesis. The job of the hypothesis is to take as input any size of a house and try to tell you what it thinks should be the price of that house.” - Andrew Ng

Key Design Decision: Representing the Hypothesis

Simple linear model: \(h_\theta(x) = \theta_0 + \theta_1 x\)

“The mathematicians in the room will say technically this is an affine function. If it was a linear function, there’d be no \(\theta_0\), technically… but machine learning sometimes just calls this a linear function.” - Andrew Ng

Multiple features: \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\)

Where:

  • \(x_1\) = size of house

  • \(x_2\) = number of bedrooms

Compact Notation

Summation form: \(h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j\)

Convention: Define \(x_0 = 1\) (dummy feature)

Then:

  • \(\theta = [\theta_0, \theta_1, \theta_2]^T\) (3-dimensional parameter vector)

  • \(x = [x_0, x_1, x_2]^T = [1, \text{size}, \text{bedrooms}]^T\)

Vector form: \(h_\theta(x) = \theta^T x\)
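In code, the vector form is a single dot product. A quick sketch with made-up parameter values (only the house's features come from the notes; \(\theta\) here is illustrative):

```python
import numpy as np

# Hypothetical house: 2104 sq ft, 3 bedrooms; prepend the dummy feature x_0 = 1
x = np.array([1.0, 2104.0, 3.0])
# Illustrative (made-up) parameter values theta_0, theta_1, theta_2
theta = np.array([50.0, 0.1, 20.0])

h_vec = theta @ x                                # h_theta(x) = theta^T x
h_sum = sum(theta[j] * x[j] for j in range(3))   # summation form gives the same value
```

Both forms compute \(50 + 0.1 \cdot 2104 + 20 \cdot 3 = 320.4\); the dot product is just the summation written as one operation.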

Notation Summary

“Design choices of what is the workflow, what is the dataset, what is the hypothesis, how do you represent the hypothesis - these are the key decisions you have to make in pretty much every supervised learning algorithm’s design.” - Andrew Ng

| Symbol | Meaning | Example |
|--------|---------|---------|
| \(m\) | Number of training examples | 50 houses |
| \(n\) | Number of features | 2 (size, bedrooms) |
| \(x\) | Input features/attributes | [1, 2104, 3] |
| \(y\) | Output/target variable | 400 (thousands) |
| \((x^{(i)}, y^{(i)})\) | \(i\)-th training example | \((x^{(1)}, y^{(1)})\) = first house |
| \(\theta\) | Parameters of the model | \([\theta_0, \theta_1, \theta_2]\) |
| \(h_\theta(x)\) | Hypothesis (prediction) | Estimated price |

“The superscript parentheses i, that’s not exponentiation… this is just a way of writing an index into the table of training examples.” - Andrew Ng

Example:

  • \(x_1^{(1)} = 2104\) (size of first house)

  • \(x_1^{(2)} = 1416\) (size of second house)

2.2 Cost Function: Ordinary Least Squares

Goal: Choose Good Parameters

“The learning algorithm’s job is to choose parameters \(\theta\) that allows you to make good predictions about prices of houses.” - Andrew Ng

Intuition: Choose \(\theta\) such that \(h_\theta(x)\) is close to \(y\) for training examples

The Cost Function

Definition: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)

Breaking it down:

  • \((h_\theta(x^{(i)}) - y^{(i)})^2\) = squared difference between prediction and truth

  • \(\sum_{i=1}^{m}\) = sum over all \(m\) training examples

  • \(\frac{1}{2m}\) = average (and makes calculus cleaner)

“By convention we put a one-half there because when we take derivatives to minimize this later, putting a one-half there would make some of the math a little bit simpler.” - Andrew Ng

Goal: Find \(\theta\) that minimizes \(J(\theta)\)
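A tiny worked example of evaluating \(J(\theta)\). The predictions and targets below are made-up numbers, just to make the formula concrete:

```python
import numpy as np

# Made-up predictions h_theta(x^(i)) and targets y^(i) for m = 2 examples
predictions = np.array([300.0, 250.0])
targets = np.array([320.0, 240.0])
m = len(targets)

# J(theta) = (1/2m) * sum of squared errors
# errors are -20 and +10, so J = (400 + 100) / (2 * 2) = 125.0
J = (1 / (2 * m)) * np.sum((predictions - targets) ** 2)
```

Note that the large error (-20) contributes four times as much to \(J\) as the small one (+10); squaring is what makes outlier-sized errors dominate the cost.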

Why Squared Error?

“The questions I’ve often gotten is why squared error? Why not absolute error, or error to the power of 4? We’ll talk more about that when we talk about generalized linear models next week. Using squared error corresponds to a Gaussian.” - Andrew Ng

Intuition (preview):

  • Assumes errors are normally distributed

  • Equivalent to maximum likelihood under Gaussian noise

  • Convex optimization problem (single global minimum)

  • Differentiable everywhere (unlike absolute error)

2.3 Gradient Descent Algorithm

“Let’s see how you can implement an algorithm to find the value of \(\theta\) that minimizes \(J(\theta)\). We’re going to use an algorithm called gradient descent.” - Andrew Ng

Algorithm Overview

Initialize:

  • Start with some \(\theta\) (e.g., \(\theta = \vec{0}\))

Repeat:

  • Keep changing \(\theta\) to reduce \(J(\theta)\)

Visual Intuition

“Imagine that you are standing on this hill… What you do in gradient descent is turn around all 360 degrees and look all around you and see if you were to take a tiny little step, in what direction should you take a little step to go downhill as fast as possible?” - Andrew Ng

The surface:

  • Horizontal axes: \(\theta_0, \theta_1\) (parameters)

  • Vertical axis: \(J(\theta)\) (cost)

  • Goal: Reach lowest elevation (minimum cost)

Process:

  1. Stand at current point

  2. Look all around (360°)

  3. Take small step in steepest downhill direction

  4. Repeat until convergence

Local vs Global Minima

“One property of gradient descent is that depending on where you initialize parameters, you can get to different points… If you had started it just a few steps over to the right, you would have gotten to a different local optimum.” - Andrew Ng

For linear regression:

“It turns out that when you run gradient descent on linear regression, there will not be local optimum” - Andrew Ng

Why? The cost function \(J(\theta)\) is convex (bowl-shaped) → single global minimum

The Update Rule

Notation:

  • := means “gets assigned to” (assignment)

  • = means “equals” (assertion of fact)

“I’m gonna use colon equals to denote assignment. So \(a := a + 1\) means increment the value of \(a\) by 1. Whereas \(a = b\) means I’m asserting that the value of \(a\) is equal to the value of \(b\).” - Andrew Ng

Gradient descent update: \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)

For all \(j = 0, 1, 2, \ldots, n\) simultaneously (compute every gradient component with the current \(\theta\) before overwriting any of it)

Where:

  • \(\alpha\) = learning rate (step size)

  • \(\frac{\partial}{\partial \theta_j} J(\theta)\) = partial derivative (gradient component)
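To see the mechanics of the update rule in isolation, here is a one-dimensional toy cost \(J(\theta) = \theta^2\) (not the housing data), whose derivative is \(2\theta\):

```python
# Toy 1-D cost J(theta) = theta^2, with derivative dJ/dtheta = 2*theta
alpha = 0.1   # learning rate
theta = 3.0   # arbitrary starting point
for _ in range(3):
    grad = 2 * theta              # partial derivative of J at the current theta
    theta = theta - alpha * grad  # the gradient descent update rule
# theta shrinks toward the minimum at 0: 3.0 -> 2.4 -> 1.92 -> 1.536
```

Each step moves opposite the derivative, and the steps get smaller as \(\theta\) nears the minimum because the gradient itself shrinks, even with a fixed \(\alpha\).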

Computing the Gradient

For linear regression: \(\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\)

Derivation (using chain rule): \[\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \right]\]

\[= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})\]
\[= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\]
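A standard way to sanity-check this derivation is to compare the analytic gradient against centered finite differences of \(J\) on random data. All data and names below are illustrative:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum of squared errors."""
    m = len(y)
    return (1 / (2 * m)) * np.sum((X @ theta - y) ** 2)

def analytic_gradient(theta, X, y):
    """(1/m) * X^T (X theta - y), the vectorized form of the derivation above."""
    m = len(y)
    return (1 / m) * (X.T @ (X @ theta - y))

rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]  # random design matrix with x_0 = 1
y = rng.normal(size=5)
theta = rng.normal(size=3)

# Centered differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * np.eye(3)[j], X, y)
     - cost(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
    for j in range(3)
])
```

If the derivation is right, `numeric` and `analytic_gradient(theta, X, y)` agree to several decimal places; this "gradient check" is the same trick used to debug backpropagation.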

Batch Gradient Descent

Complete algorithm:

Repeat until convergence {
    For j = 0 to n:
        θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
}

“Batch” because each step uses all \(m\) training examples

Vectorized form: \(\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)\)

Where:

  • \(X\) is \(m \times (n+1)\) design matrix

  • \(y\) is \(m \times 1\) vector of target values

2.4 Stochastic Gradient Descent (SGD)

Motivation: Computational Efficiency

Problem with batch GD:

  • Each update requires scanning entire dataset

  • If \(m\) = 1 million examples → very slow!

Stochastic GD idea:

  • Update parameters using one example at a time

  • Much faster iterations

SGD Algorithm

Randomly shuffle training examples

Repeat {
    For i = 1 to m:
        For j = 0 to n:
            θ_j := θ_j - α * (h_θ(x^(i)) - y^(i)) * x_j^(i)
}

Key difference: Update after each example, not after seeing all examples
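The SGD loop above can be sketched directly in NumPy. The dataset below is a tiny made-up one whose targets lie exactly on the line \(y = 2 + 3x\), so SGD has a clean solution to find; the learning rate and epoch count are illustrative:

```python
import numpy as np

def sgd(X, y, alpha, epochs, seed=0):
    """Stochastic gradient descent: shuffle, then update on ONE example at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # randomly shuffle training examples
            error = X[i] @ theta - y[i]    # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]  # update immediately, before the next example
    return theta

# Noise-free toy data on the line y = 2 + 3x, with x_0 = 1 prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([2.0, 5.0, 8.0, 11.0])
theta = sgd(X, y, alpha=0.05, epochs=500)
# theta approaches [2, 3]
```

Because this toy data is noise-free, SGD settles onto the exact solution; on real data the per-example noise makes it bounce around the minimum instead, as described below.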

Batch GD vs SGD

| Aspect | Batch GD | Stochastic GD |
|--------|----------|---------------|
| Update frequency | After full pass through data | After each example |
| Gradient | Exact (average over all data) | Noisy (from single example) |
| Convergence | Smooth path to minimum | Oscillates around minimum |
| Speed per iteration | Slow | Fast |
| Large datasets | Impractical | Practical |
| Parallelization | Easy | Hard |

Convergence Behavior

Batch GD:

  • Follows gradient precisely

  • Smooth descent to minimum

  • Converges to exact minimum

Stochastic GD:

  • “Drunk walk” toward minimum

  • Oscillates due to noise from individual examples

  • Converges to region near minimum

  • Never settles exactly (keeps bouncing)

Practical Tips

Learning rate:

  • For SGD, often decrease \(\alpha\) over time: \(\alpha = \frac{c_1}{\text{iteration} + c_2}\)

  • Allows convergence while maintaining fast initial progress
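A minimal sketch of that decay schedule; the constants \(c_1, c_2\) here are arbitrary illustrative choices, tuned per problem in practice:

```python
# alpha_t = c1 / (t + c2): large early steps, shrinking over time
c1, c2 = 1.0, 50.0  # illustrative constants
alphas = [c1 / (t + c2) for t in (0, 50, 150, 450)]
# step sizes shrink: 0.02, 0.01, 0.005, 0.002
```

Early on \(\alpha_t \approx c_1 / c_2\), giving fast progress; as \(t\) grows the shrinking steps damp SGD's oscillation so it can settle near the minimum.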

Mini-batch GD (best of both worlds):

  • Use \(b\) examples per update (e.g., \(b = 32, 64, 128\))

  • Balances speed and gradient accuracy

  • Enables GPU parallelization
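A sketch of the mini-batch variant on the same kind of made-up noise-free data; batch size, learning rate, and epoch count are illustrative:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=2, epochs=1000, seed=0):
    """Mini-batch GD: average the gradient over batch_size examples per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            errors = X[batch] @ theta - y[batch]
            theta -= alpha * (X[batch].T @ errors) / len(batch)
    return theta

# Noise-free toy data on the line y = 2 + 3x, with x_0 = 1 prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([2.0, 5.0, 8.0, 11.0])
theta = minibatch_gd(X, y)
```

With `batch_size=1` this reduces to SGD, and with `batch_size=m` to batch GD; intermediate sizes trade gradient noise against cost per update.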

Setup: Import Libraries

Before implementing gradient descent, we load NumPy for vectorized linear algebra operations and Matplotlib for visualizing convergence behavior. Feature normalization, matrix multiplications, and cost function evaluations all rely on NumPy’s efficient array operations – the same operations that underpin virtually every ML framework. Visualization is equally important: plotting the cost function \(J(\theta)\) across iterations is the standard way to verify that gradient descent is converging properly and to diagnose issues like a learning rate that is too large or too small.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

2.5 Implementation: Batch Gradient Descent

What: Building gradient descent from scratch

We now translate the mathematical update rule \(\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\) into working code. The implementation covers data loading, feature normalization, the cost function \(J(\theta)\), and the iterative gradient descent loop.

Why: Understanding the mechanics behind every ML optimizer

Every modern deep learning optimizer (Adam, RMSProp, AdaGrad) is a variation of this fundamental algorithm. By implementing vanilla batch gradient descent, you build intuition for learning rates, convergence diagnostics, and the critical role of feature scaling – skills that transfer directly to training neural networks at any scale.

How it works

The vectorized gradient \(\nabla_\theta J = \frac{1}{m} X^T(X\theta - y)\) computes the average direction of steepest ascent across all \(m\) training examples in a single matrix operation, and we step in the opposite direction by \(\alpha\) (the learning rate). Feature normalization ensures all dimensions of \(\theta\) converge at similar rates; without it, elongated cost contours cause oscillation along one axis while progressing slowly along another.

# Portland housing data (simplified)
# Features: size (sq ft), bedrooms
# Target: price (thousands of dollars)

X = np.array([
    [2104, 3],
    [1416, 2],
    [1534, 3],
    [852, 2],
    [1940, 4],
    [1752, 3],
    [1212, 2],
    [1668, 3]
])

y = np.array([400, 232, 315, 178, 370, 342, 240, 325])

m = len(y)  # number of training examples
n = X.shape[1]  # number of features

print(f"Training set size: {m} examples")
print(f"Number of features: {n}")
print(f"\nFirst 3 examples:")
for i in range(3):
    print(f"  House {i+1}: {X[i,0]} sq ft, {X[i,1]} bedrooms → ${y[i]}k")
def feature_normalize(X):
    """
    Normalize features to have mean=0, std=1
    Important for gradient descent to converge faster!
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# Normalize features
X_norm, mu, sigma = feature_normalize(X)

# Add x_0 = 1 (intercept term)
X_with_bias = np.c_[np.ones(m), X_norm]

print("Feature normalization:")
print(f"  Original: size mean = {mu[0]:.0f}, std = {sigma[0]:.0f}")
print(f"  Normalized: mean = {np.mean(X_norm[:,0]):.2f}, std = {np.std(X_norm[:,0]):.2f}")
print(f"\nDesign matrix shape: {X_with_bias.shape}")
print(f"  (m={m} examples, n+1={n+1} features including intercept)")
def compute_cost(X, y, theta):
    """
    Compute cost function J(θ)
    J(θ) = (1/2m) * Σ(h_θ(x^(i)) - y^(i))^2
    """
    m = len(y)
    predictions = X @ theta  # h_θ(x) = θ^T x
    errors = predictions - y
    cost = (1/(2*m)) * np.sum(errors**2)
    return cost

# Test with initial θ = [0, 0, 0]
theta_init = np.zeros(n + 1)
initial_cost = compute_cost(X_with_bias, y, theta_init)
print(f"Initial cost (θ = 0): J(θ) = {initial_cost:.2f}")
print("  (Half the mean of y² since all predictions are 0)")
def gradient_descent(X, y, theta, alpha, num_iters):
    """
    Batch Gradient Descent
    
    Repeat until convergence:
        θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
    """
    m = len(y)
    cost_history = np.zeros(num_iters)
    theta_history = np.zeros((num_iters, len(theta)))
    
    for iteration in range(num_iters):
        # Compute predictions
        predictions = X @ theta
        
        # Compute errors
        errors = predictions - y
        
        # Compute gradient (vectorized!)
        gradient = (1/m) * (X.T @ errors)
        
        # Update parameters
        theta = theta - alpha * gradient
        
        # Save history
        cost_history[iteration] = compute_cost(X, y, theta)
        theta_history[iteration] = theta
        
    return theta, cost_history, theta_history

# Run gradient descent
alpha = 0.01  # learning rate
num_iters = 400
theta_init = np.zeros(n + 1)

theta_final, cost_history, theta_history = gradient_descent(
    X_with_bias, y, theta_init, alpha, num_iters
)

print(f"Gradient Descent Results:")
print(f"  Learning rate α = {alpha}")
print(f"  Iterations: {num_iters}")
print(f"\nLearned parameters:")
print(f"  θ_0 (intercept) = {theta_final[0]:.2f}")
print(f"  θ_1 (size)      = {theta_final[1]:.2f}")
print(f"  θ_2 (bedrooms)  = {theta_final[2]:.2f}")
print(f"\nFinal cost: J(θ) = {cost_history[-1]:.2f}")
print(f"Initial cost: J(θ) = {cost_history[0]:.2f}")
print(f"Improvement: {100*(1 - cost_history[-1]/cost_history[0]):.1f}% reduction")
# Visualize convergence
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Cost vs iterations
ax1.plot(cost_history, linewidth=2)
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Cost J(θ)', fontsize=12)
ax1.set_title('Gradient Descent Convergence', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Parameter evolution
for j in range(len(theta_final)):
    ax2.plot(theta_history[:, j], label=f'θ_{j}', linewidth=2)
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Parameter Value', fontsize=12)
ax2.set_title('Parameter Evolution', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Left: Cost decreases smoothly (convex optimization!)")
print("Right: Parameters converge to optimal values")
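To actually price a new house with the learned parameters, a new input must be normalized with the same mu and sigma computed from the training set before prepending \(x_0 = 1\). The self-contained sketch below repeats the fit from the cells above and predicts for a hypothetical 1650 sq ft, 3-bedroom listing (that listing is made up for illustration):

```python
import numpy as np

# Same training data as the cells above
X = np.array([[2104, 3], [1416, 2], [1534, 3], [852, 2],
              [1940, 4], [1752, 3], [1212, 2], [1668, 3]], dtype=float)
y = np.array([400, 232, 315, 178, 370, 342, 240, 325], dtype=float)
m = len(y)

# Normalize features and add the intercept column, as above
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xb = np.c_[np.ones(m), (X - mu) / sigma]

# Batch gradient descent (alpha = 0.01, 400 iterations, as above)
theta = np.zeros(3)
for _ in range(400):
    theta -= (0.01 / m) * (Xb.T @ (Xb @ theta - y))

# Hypothetical new listing: 1650 sq ft, 3 bedrooms
x_new = np.array([1650.0, 3.0])
x_feat = np.r_[1.0, (x_new - mu) / sigma]  # normalize with TRAINING mu/sigma
price = x_feat @ theta
print(f"Predicted price: ${price:.0f}k")
```

Forgetting to reuse the training-set mu and sigma at prediction time is a classic bug: the model was fit in normalized coordinates, so a raw input like `[1650, 3]` would be wildly out of scale.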

Summary

Key Takeaways from Lecture 2

1. Linear Regression Framework:

  • Hypothesis: \(h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j\)

  • Cost function: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)

  • Goal: Find \(\theta\) that minimizes \(J(\theta)\)

2. Gradient Descent:

  • Iterative optimization algorithm

  • Update rule: \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)

  • Batch GD: Uses all training examples per update

  • SGD: Uses one example per update (faster for large datasets)

3. Important Concepts:

  • Feature normalization crucial for convergence

  • Learning rate \(\alpha\) controls step size

  • Linear regression has convex cost → no local minima

  • Vectorization makes implementation efficient

Next Lecture: Normal equations (analytical solution) and more on optimization