Lecture 2: Linear Regression and Gradient Descent


From Andrew Ng’s CS229 Lecture 2 (Autumn 2018)

2.1 Supervised Learning Framework

The Process

Training Set → Learning Algorithm → Hypothesis \(h\)

“The job of the learning algorithm is to output a function… a hypothesis. The job of the hypothesis is to take as input any size of a house and try to tell you what it thinks should be the price of that house.” - Andrew Ng

Key Design Decision: Representing the Hypothesis

Simple linear model: \(h_\theta(x) = \theta_0 + \theta_1 x\)

“The mathematicians in the room will say technically this is an affine function. If it was a linear function, there’d be no \(\theta_0\), technically… but machine learning sometimes just calls this a linear function.” - Andrew Ng

Multiple features: \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\)

Where:

  • \(x_1\) = size of house

  • \(x_2\) = number of bedrooms

Compact Notation

Summation form: \(h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j\)

Convention: Define \(x_0 = 1\) (dummy feature)

Then:

  • \(\theta = [\theta_0, \theta_1, \theta_2]^T\) (3-dimensional parameter vector)

  • \(x = [x_0, x_1, x_2]^T = [1, \text{size}, \text{bedrooms}]^T\)

Vector form: \(h_\theta(x) = \theta^T x\)
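In code, the vector form is a single dot product. A quick sketch with made-up parameter values (only the house's features come from the notes; \(\theta\) here is illustrative):

```python
import numpy as np

# Hypothetical house: 2104 sq ft, 3 bedrooms; prepend the dummy feature x_0 = 1
x = np.array([1.0, 2104.0, 3.0])
# Illustrative (made-up) parameter values theta_0, theta_1, theta_2
theta = np.array([50.0, 0.1, 20.0])

h_vec = theta @ x                                # h_theta(x) = theta^T x
h_sum = sum(theta[j] * x[j] for j in range(3))   # summation form gives the same value
```

Both forms compute \(50 + 0.1 \cdot 2104 + 20 \cdot 3 = 320.4\); the dot product is just the summation written as one operation.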

Notation Summary

“Design choices of what is the workflow, what is the dataset, what is the hypothesis, how do you represent the hypothesis - these are the key decisions you have to make in pretty much every supervised learning algorithm’s design.” - Andrew Ng

| Symbol | Meaning | Example |
|--------|---------|---------|
| \(m\) | Number of training examples | 50 houses |
| \(n\) | Number of features | 2 (size, bedrooms) |
| \(x\) | Input features/attributes | [1, 2104, 3] |
| \(y\) | Output/target variable | 400 (thousands) |
| \((x^{(i)}, y^{(i)})\) | \(i\)-th training example | \((x^{(1)}, y^{(1)})\) = first house |
| \(\theta\) | Parameters of the model | \([\theta_0, \theta_1, \theta_2]\) |
| \(h_\theta(x)\) | Hypothesis (prediction) | Estimated price |

“The superscript parentheses i, that’s not exponentiation… this is just a way of writing an index into the table of training examples.” - Andrew Ng

Example:

  • \(x_1^{(1)} = 2104\) (size of first house)

  • \(x_1^{(2)} = 1416\) (size of second house)

2.2 Cost Function: Ordinary Least Squares

Goal: Choose Good Parameters

“The learning algorithm’s job is to choose parameters \(\theta\) that allows you to make good predictions about prices of houses.” - Andrew Ng

Intuition: Choose \(\theta\) such that \(h_\theta(x)\) is close to \(y\) for training examples

The Cost Function

Definition: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)

Breaking it down:

  • \((h_\theta(x^{(i)}) - y^{(i)})^2\) = squared difference between prediction and truth

  • \(\sum_{i=1}^{m}\) = sum over all \(m\) training examples

  • \(\frac{1}{2m}\) = average (and makes calculus cleaner)

“By convention we put a one-half there because when we take derivatives to minimize this later, putting a one-half there would make some of the math a little bit simpler.” - Andrew Ng

Goal: Find \(\theta\) that minimizes \(J(\theta)\)
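A tiny worked example of evaluating \(J(\theta)\). The predictions and targets below are made-up numbers, just to make the formula concrete:

```python
import numpy as np

# Made-up predictions h_theta(x^(i)) and targets y^(i) for m = 2 examples
predictions = np.array([300.0, 250.0])
targets = np.array([320.0, 240.0])
m = len(targets)

# J(theta) = (1/2m) * sum of squared errors
# errors are -20 and +10, so J = (400 + 100) / (2 * 2) = 125.0
J = (1 / (2 * m)) * np.sum((predictions - targets) ** 2)
```

Note that the large error (-20) contributes four times as much to \(J\) as the small one (+10); squaring is what makes outlier-sized errors dominate the cost.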

Why Squared Error?

“The questions I’ve often gotten is why squared error? Why not absolute error, or error to the power of 4? We’ll talk more about that when we talk about generalized linear models next week. Using squared error corresponds to a Gaussian.” - Andrew Ng

Intuition (preview):

  • Assumes errors are normally distributed

  • Equivalent to maximum likelihood under Gaussian noise

  • Convex optimization problem (single global minimum)

  • Differentiable everywhere (unlike absolute error)

2.3 Gradient Descent Algorithm

“Let’s see how you can implement an algorithm to find the value of \(\theta\) that minimizes \(J(\theta)\). We’re going to use an algorithm called gradient descent.” - Andrew Ng

Algorithm Overview

Initialize:

  • Start with some \(\theta\) (e.g., \(\theta = \vec{0}\))

Repeat:

  • Keep changing \(\theta\) to reduce \(J(\theta)\)

Visual Intuition

“Imagine that you are standing on this hill… What you do in gradient descent is turn around all 360 degrees and look all around you and see if you were to take a tiny little step, in what direction should you take a little step to go downhill as fast as possible?” - Andrew Ng

The surface:

  • Horizontal axes: \(\theta_0, \theta_1\) (parameters)

  • Vertical axis: \(J(\theta)\) (cost)

  • Goal: Reach lowest elevation (minimum cost)

Process:

  1. Stand at current point

  2. Look all around (360°)

  3. Take small step in steepest downhill direction

  4. Repeat until convergence

Local vs Global Minima

“One property of gradient descent is that depending on where you initialize parameters, you can get to different points… If you had started it just a few steps over to the right, you would have gotten to a different local optimum.” - Andrew Ng

For linear regression:

“It turns out that when you run gradient descent on linear regression, there will not be local optimum” - Andrew Ng

Why? The cost function \(J(\theta)\) is convex (bowl-shaped) → single global minimum

The Update Rule

Notation:

  • := means “gets assigned to” (assignment)

  • = means “equals” (assertion of fact)

“I’m gonna use colon equals to denote assignment. So \(a := a + 1\) means increment the value of \(a\) by 1. Whereas \(a = b\) means I’m asserting that the value of \(a\) is equal to the value of \(b\).” - Andrew Ng

Gradient descent update: \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)

For all \(j = 0, 1, 2, \ldots, n\) simultaneously (compute every gradient component with the current \(\theta\) before overwriting any of it)

Where:

  • \(\alpha\) = learning rate (step size)

  • \(\frac{\partial}{\partial \theta_j} J(\theta)\) = partial derivative (gradient component)
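To see the mechanics of the update rule in isolation, here is a one-dimensional toy cost \(J(\theta) = \theta^2\) (not the housing data), whose derivative is \(2\theta\):

```python
# Toy 1-D cost J(theta) = theta^2, with derivative dJ/dtheta = 2*theta
alpha = 0.1   # learning rate
theta = 3.0   # arbitrary starting point
for _ in range(3):
    grad = 2 * theta              # partial derivative of J at the current theta
    theta = theta - alpha * grad  # the gradient descent update rule
# theta shrinks toward the minimum at 0: 3.0 -> 2.4 -> 1.92 -> 1.536
```

Each step moves opposite the derivative, and the steps get smaller as \(\theta\) nears the minimum because the gradient itself shrinks, even with a fixed \(\alpha\).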

Computing the Gradient

For linear regression: \(\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\)

Derivation (using chain rule): \[\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \right]\]

\[= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})\]
\[= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\]
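A standard way to sanity-check this derivation is to compare the analytic gradient against centered finite differences of \(J\) on random data. All data and names below are illustrative:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum of squared errors."""
    m = len(y)
    return (1 / (2 * m)) * np.sum((X @ theta - y) ** 2)

def analytic_gradient(theta, X, y):
    """(1/m) * X^T (X theta - y), the vectorized form of the derivation above."""
    m = len(y)
    return (1 / m) * (X.T @ (X @ theta - y))

rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]  # random design matrix with x_0 = 1
y = rng.normal(size=5)
theta = rng.normal(size=3)

# Centered differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * np.eye(3)[j], X, y)
     - cost(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
    for j in range(3)
])
```

If the derivation is right, `numeric` and `analytic_gradient(theta, X, y)` agree to several decimal places; this "gradient check" is the same trick used to debug backpropagation.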

Batch Gradient Descent

Complete algorithm:

Repeat until convergence {
    For j = 0 to n:
        θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
}

“Batch” because each step uses all \(m\) training examples

Vectorized form: \(\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)\)

Where:

  • \(X\) is \(m \times (n+1)\) design matrix

  • \(y\) is \(m \times 1\) vector of target values

2.4 Stochastic Gradient Descent (SGD)

Motivation: Computational Efficiency

Problem with batch GD:

  • Each update requires scanning entire dataset

  • If \(m\) = 1 million examples → very slow!

Stochastic GD idea:

  • Update parameters using one example at a time

  • Much faster iterations

SGD Algorithm

Randomly shuffle training examples

Repeat {
    For i = 1 to m:
        For j = 0 to n:
            θ_j := θ_j - α * (h_θ(x^(i)) - y^(i)) * x_j^(i)
}

Key difference: Update after each example, not after seeing all examples
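The SGD loop above can be sketched directly in NumPy. The dataset below is a tiny made-up one whose targets lie exactly on the line \(y = 2 + 3x\), so SGD has a clean solution to find; the learning rate and epoch count are illustrative:

```python
import numpy as np

def sgd(X, y, alpha, epochs, seed=0):
    """Stochastic gradient descent: shuffle, then update on ONE example at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # randomly shuffle training examples
            error = X[i] @ theta - y[i]    # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]  # update immediately, before the next example
    return theta

# Noise-free toy data on the line y = 2 + 3x, with x_0 = 1 prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([2.0, 5.0, 8.0, 11.0])
theta = sgd(X, y, alpha=0.05, epochs=500)
# theta approaches [2, 3]
```

Because this toy data is noise-free, SGD settles onto the exact solution; on real data the per-example noise makes it bounce around the minimum instead, as described below.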

Batch GD vs SGD

| Aspect | Batch GD | Stochastic GD |
|--------|----------|---------------|
| Update frequency | After full pass through data | After each example |
| Gradient | Exact (average over all data) | Noisy (from single example) |
| Convergence | Smooth path to minimum | Oscillates around minimum |
| Speed per iteration | Slow | Fast |
| Large datasets | Impractical | Practical |
| Parallelization | Easy | Hard |

Convergence Behavior

Batch GD:

  • Follows gradient precisely

  • Smooth descent to minimum

  • Converges to exact minimum

Stochastic GD:

  • “Drunk walk” toward minimum

  • Oscillates due to noise from individual examples

  • Converges to region near minimum

  • Never settles exactly (keeps bouncing)

Practical Tips

Learning rate:

  • For SGD, often decrease \(\alpha\) over time: \(\alpha = \frac{c_1}{\text{iteration} + c_2}\)

  • Allows convergence while maintaining fast initial progress
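A minimal sketch of that decay schedule; the constants \(c_1, c_2\) here are arbitrary illustrative choices, tuned per problem in practice:

```python
# alpha_t = c1 / (t + c2): large early steps, shrinking over time
c1, c2 = 1.0, 50.0  # illustrative constants
alphas = [c1 / (t + c2) for t in (0, 50, 150, 450)]
# step sizes shrink: 0.02, 0.01, 0.005, 0.002
```

Early on \(\alpha_t \approx c_1 / c_2\), giving fast progress; as \(t\) grows the shrinking steps damp SGD's oscillation so it can settle near the minimum.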

Mini-batch GD (best of both worlds):

  • Use \(b\) examples per update (e.g., \(b = 32, 64, 128\))

  • Balances speed and gradient accuracy

  • Enables GPU parallelization
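A sketch of the mini-batch variant on the same kind of made-up noise-free data; batch size, learning rate, and epoch count are illustrative:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=2, epochs=1000, seed=0):
    """Mini-batch GD: average the gradient over batch_size examples per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            errors = X[batch] @ theta - y[batch]
            theta -= alpha * (X[batch].T @ errors) / len(batch)
    return theta

# Noise-free toy data on the line y = 2 + 3x, with x_0 = 1 prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([2.0, 5.0, 8.0, 11.0])
theta = minibatch_gd(X, y)
```

With `batch_size=1` this reduces to SGD, and with `batch_size=m` to batch GD; intermediate sizes trade gradient noise against cost per update.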

Setup: Import Libraries

Before implementing gradient descent, we load NumPy for vectorized linear algebra operations and Matplotlib for visualizing convergence behavior. Feature normalization, matrix multiplications, and cost function evaluations all rely on NumPy’s efficient array operations – the same operations that underpin virtually every ML framework. Visualization is equally important: plotting the cost function \(J(\theta)\) across iterations is the standard way to verify that gradient descent is converging properly and to diagnose issues like a learning rate that is too large or too small.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

2.5 Implementation: Batch Gradient Descent

What: Building gradient descent from scratch

We now translate the mathematical update rule \(\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\) into working code. The implementation covers data loading, feature normalization, the cost function \(J(\theta)\), and the iterative gradient descent loop.

Why: Understanding the mechanics behind every ML optimizer

Every modern deep learning optimizer (Adam, RMSProp, AdaGrad) is a variation of this fundamental algorithm. By implementing vanilla batch gradient descent, you build intuition for learning rates, convergence diagnostics, and the critical role of feature scaling – skills that transfer directly to training neural networks at any scale.

How it works

The vectorized gradient \(\nabla_\theta J = \frac{1}{m} X^T(X\theta - y)\) computes the average direction of steepest ascent across all \(m\) training examples in a single matrix operation, and we step in the opposite direction by \(\alpha\) (the learning rate). Feature normalization ensures all dimensions of \(\theta\) converge at similar rates; without it, elongated cost contours cause oscillation along one axis while progressing slowly along another.

# Portland housing data (simplified)
# Features: size (sq ft), bedrooms
# Target: price (thousands of dollars)

X = np.array([
    [2104, 3],
    [1416, 2],
    [1534, 3],
    [852, 2],
    [1940, 4],
    [1752, 3],
    [1212, 2],
    [1668, 3]
])

y = np.array([400, 232, 315, 178, 370, 342, 240, 325])

m = len(y)  # number of training examples
n = X.shape[1]  # number of features

print(f"Training set size: {m} examples")
print(f"Number of features: {n}")
print(f"\nFirst 3 examples:")
for i in range(3):
    print(f"  House {i+1}: {X[i,0]} sq ft, {X[i,1]} bedrooms → ${y[i]}k")
def feature_normalize(X):
    """
    Normalize features to have mean=0, std=1
    Important for gradient descent to converge faster!
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# Normalize features
X_norm, mu, sigma = feature_normalize(X)

# Add x_0 = 1 (intercept term)
X_with_bias = np.c_[np.ones(m), X_norm]

print("Feature normalization:")
print(f"  Original: size mean = {mu[0]:.0f}, std = {sigma[0]:.0f}")
print(f"  Normalized: mean = {np.mean(X_norm[:,0]):.2f}, std = {np.std(X_norm[:,0]):.2f}")
print(f"\nDesign matrix shape: {X_with_bias.shape}")
print(f"  (m={m} examples, n+1={n+1} features including intercept)")
def compute_cost(X, y, theta):
    """
    Compute cost function J(θ)
    J(θ) = (1/2m) * Σ(h_θ(x^(i)) - y^(i))^2
    """
    m = len(y)
    predictions = X @ theta  # h_θ(x) = θ^T x
    errors = predictions - y
    cost = (1/(2*m)) * np.sum(errors**2)
    return cost

# Test with initial θ = [0, 0, 0]
theta_init = np.zeros(n + 1)
initial_cost = compute_cost(X_with_bias, y, theta_init)
print(f"Initial cost (θ = 0): J(θ) = {initial_cost:.2f}")
print("  (Half the mean of y² since all predictions are 0)")
def gradient_descent(X, y, theta, alpha, num_iters):
    """
    Batch Gradient Descent
    
    Repeat until convergence:
        θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
    """
    m = len(y)
    cost_history = np.zeros(num_iters)
    theta_history = np.zeros((num_iters, len(theta)))
    
    for iteration in range(num_iters):
        # Compute predictions
        predictions = X @ theta
        
        # Compute errors
        errors = predictions - y
        
        # Compute gradient (vectorized!)
        gradient = (1/m) * (X.T @ errors)
        
        # Update parameters
        theta = theta - alpha * gradient
        
        # Save history
        cost_history[iteration] = compute_cost(X, y, theta)
        theta_history[iteration] = theta
        
    return theta, cost_history, theta_history

# Run gradient descent
alpha = 0.01  # learning rate
num_iters = 400
theta_init = np.zeros(n + 1)

theta_final, cost_history, theta_history = gradient_descent(
    X_with_bias, y, theta_init, alpha, num_iters
)

print(f"Gradient Descent Results:")
print(f"  Learning rate α = {alpha}")
print(f"  Iterations: {num_iters}")
print(f"\nLearned parameters:")
print(f"  θ_0 (intercept) = {theta_final[0]:.2f}")
print(f"  θ_1 (size)      = {theta_final[1]:.2f}")
print(f"  θ_2 (bedrooms)  = {theta_final[2]:.2f}")
print(f"\nFinal cost: J(θ) = {cost_history[-1]:.2f}")
print(f"Initial cost: J(θ) = {cost_history[0]:.2f}")
print(f"Improvement: {100*(1 - cost_history[-1]/cost_history[0]):.1f}% reduction")
# Visualize convergence
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Cost vs iterations
ax1.plot(cost_history, linewidth=2)
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Cost J(θ)', fontsize=12)
ax1.set_title('Gradient Descent Convergence', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Parameter evolution
for j in range(len(theta_final)):
    ax2.plot(theta_history[:, j], label=f'θ_{j}', linewidth=2)
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Parameter Value', fontsize=12)
ax2.set_title('Parameter Evolution', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Left: Cost decreases smoothly (convex optimization!)")
print("Right: Parameters converge to optimal values")
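To actually price a new house with the learned parameters, a new input must be normalized with the same mu and sigma computed from the training set before prepending \(x_0 = 1\). The self-contained sketch below repeats the fit from the cells above and predicts for a hypothetical 1650 sq ft, 3-bedroom listing (that listing is made up for illustration):

```python
import numpy as np

# Same training data as the cells above
X = np.array([[2104, 3], [1416, 2], [1534, 3], [852, 2],
              [1940, 4], [1752, 3], [1212, 2], [1668, 3]], dtype=float)
y = np.array([400, 232, 315, 178, 370, 342, 240, 325], dtype=float)
m = len(y)

# Normalize features and add the intercept column, as above
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xb = np.c_[np.ones(m), (X - mu) / sigma]

# Batch gradient descent (alpha = 0.01, 400 iterations, as above)
theta = np.zeros(3)
for _ in range(400):
    theta -= (0.01 / m) * (Xb.T @ (Xb @ theta - y))

# Hypothetical new listing: 1650 sq ft, 3 bedrooms
x_new = np.array([1650.0, 3.0])
x_feat = np.r_[1.0, (x_new - mu) / sigma]  # normalize with TRAINING mu/sigma
price = x_feat @ theta
print(f"Predicted price: ${price:.0f}k")
```

Forgetting to reuse the training-set mu and sigma at prediction time is a classic bug: the model was fit in normalized coordinates, so a raw input like `[1650, 3]` would be wildly out of scale.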

Summary

Key Takeaways from Lecture 2

1. Linear Regression Framework:

  • Hypothesis: \(h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j\)

  • Cost function: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)

  • Goal: Find \(\theta\) that minimizes \(J(\theta)\)

2. Gradient Descent:

  • Iterative optimization algorithm

  • Update rule: \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)

  • Batch GD: Uses all training examples per update

  • SGD: Uses one example per update (faster for large datasets)

3. Important Concepts:

  • Feature normalization crucial for convergence

  • Learning rate \(\alpha\) controls step size

  • Linear regression has convex cost → no local minima

  • Vectorization makes implementation efficient

Next Lecture: Normal equations (analytical solution) and more on optimization