Lecture 2: Linear Regression and Gradient Descent¶
From Andrew Ng’s CS229 Lecture 2 (Autumn 2018)¶
2.1 Supervised Learning Framework¶
The Process¶
Training Set → Learning Algorithm → Hypothesis \(h\)
“The job of the learning algorithm is to output a function… a hypothesis. The job of the hypothesis is to take as input any size of a house and try to tell you what it thinks should be the price of that house.” - Andrew Ng
Key Design Decision: Representing the Hypothesis¶
Simple linear model: $$h_\theta(x) = \theta_0 + \theta_1 x$$
“The mathematicians in the room will say technically this is an affine function. It would be a linear function if there were no \(\theta_0\), technically… but machine learning sometimes just calls this a linear function.” - Andrew Ng
Multiple features: $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
Where:
\(x_1\) = size of house
\(x_2\) = number of bedrooms
Compact Notation¶
Summation form: $$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$$
Convention: Define \(x_0 = 1\) (dummy feature)
Then:
\(\theta = [\theta_0, \theta_1, \theta_2]^T\) (3-dimensional parameter vector)
\(x = [x_0, x_1, x_2]^T = [1, \text{size}, \text{bedrooms}]^T\)
Vector form: $$h_\theta(x) = \theta^T x$$
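The vector form is just a dot product, which can be checked numerically. A minimal sketch, assuming illustrative parameter values (an intercept of 50, \$0.1k per square foot, \$20k per bedroom — not learned from data):

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2] (illustrative only)
theta = np.array([50.0, 0.1, 20.0])

# Feature vector with the x_0 = 1 convention: [1, size, bedrooms]
x = np.array([1.0, 2104.0, 3.0])

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 50 + 0.1*2104 + 20*3 = 320.4
```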
Notation Summary¶
“Design choices of what is the workflow, what is the dataset, what is the hypothesis, how do you represent the hypothesis - these are the key decisions you have to make in pretty much every supervised learning algorithm’s design.” - Andrew Ng
| Symbol | Meaning | Example |
|---|---|---|
| \(m\) | Number of training examples | 50 houses |
| \(n\) | Number of features | 2 (size, bedrooms) |
| \(x\) | Input features/attributes | [1, 2104, 3] |
| \(y\) | Output/target variable | 400 (thousands) |
| \((x^{(i)}, y^{(i)})\) | \(i\)-th training example | \((x^{(1)}, y^{(1)})\) = first house |
| \(\theta\) | Parameters of the model | \([\theta_0, \theta_1, \theta_2]\) |
| \(h_\theta(x)\) | Hypothesis (prediction) | Estimated price |
“The superscript parentheses i, that’s not exponentiation… this is just a way of writing an index into the table of training examples.” - Andrew Ng
Example:
\(x_1^{(1)} = 2104\) (size of first house)
\(x_1^{(2)} = 1416\) (size of second house)
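In code, this convention maps directly onto row/column indexing of a design matrix (here using the first two houses from the Portland dataset introduced later in these notes):

```python
import numpy as np

# Rows are training examples i, columns are features j
X = np.array([
    [2104, 3],   # x^(1): first house
    [1416, 2],   # x^(2): second house
])

# x_1^(1): feature 1 (size) of example 1 -> row 0, col 0 with 0-indexing
print(X[0, 0])  # 2104
# x_1^(2): feature 1 (size) of example 2
print(X[1, 0])  # 1416
```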
2.2 Cost Function: Ordinary Least Squares¶
Goal: Choose Good Parameters¶
“The learning algorithm’s job is to choose parameters \(\theta\) that allows you to make good predictions about prices of houses.” - Andrew Ng
Intuition: Choose \(\theta\) such that \(h_\theta(x)\) is close to \(y\) for training examples
The Cost Function¶
Definition: $$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$
Breaking it down:
\((h_\theta(x^{(i)}) - y^{(i)})^2\) = squared difference between prediction and truth
\(\sum_{i=1}^{m}\) = sum over all \(m\) training examples
\(\frac{1}{2m}\) = average (and makes calculus cleaner)
“By convention we put a one-half there because when we take derivatives to minimize this later, putting a one-half there would make some of the math a little bit simpler.” - Andrew Ng
Goal: Find \(\theta\) that minimizes \(J(\theta)\)
Why Squared Error?¶
“The questions I’ve often gotten is why squared error? Why not absolute error, or error to the power of 4? We’ll talk more about that when we talk about generalized linear models next week. Using squared error corresponds to a Gaussian.” - Andrew Ng
Intuition (preview):
Assumes errors are normally distributed
Equivalent to maximum likelihood under Gaussian noise
Convex optimization problem (single global minimum)
Differentiable everywhere (unlike absolute error)
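A sketch of the maximum-likelihood argument previewed above, assuming the model \(y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}\) with i.i.d. Gaussian noise \(\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)\):

```latex
\begin{aligned}
p(y^{(i)} \mid x^{(i)}; \theta)
  &= \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
\ell(\theta) &= \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2
\end{aligned}
```

Maximizing \(\ell(\theta)\) means minimizing \(\sum_i (y^{(i)} - \theta^T x^{(i)})^2\), which is exactly the least-squares objective. The full derivation appears in the generalized linear models lecture.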
2.3 Gradient Descent Algorithm¶
“Let’s see how you can implement an algorithm to find the value of \(\theta\) that minimizes \(J(\theta)\). We’re going to use an algorithm called gradient descent.” - Andrew Ng
Algorithm Overview¶
Initialize:
Start with some \(\theta\) (e.g., \(\theta = \vec{0}\))
Repeat:
Keep changing \(\theta\) to reduce \(J(\theta)\)
Visual Intuition¶
“Imagine that you are standing on this hill… What you do in gradient descent is turn around all 360 degrees and look all around you and see if you were to take a tiny little step, in what direction should you take a little step to go downhill as fast as possible?” - Andrew Ng
The surface:
Horizontal axes: \(\theta_0, \theta_1\) (parameters)
Vertical axis: \(J(\theta)\) (cost)
Goal: Reach lowest elevation (minimum cost)
Process:
Stand at current point
Look all around (360°)
Take small step in steepest downhill direction
Repeat until convergence
Local vs Global Minima¶
“One property of gradient descent is that depending on where you initialize parameters, you can get to different points… If you had started it just a few steps over to the right, you would have gotten to a different local optimum.” - Andrew Ng
For linear regression:
“It turns out that when you run gradient descent on linear regression, there will not be local optimum” - Andrew Ng
Why? The cost function \(J(\theta)\) is convex (bowl-shaped) → single global minimum
The Update Rule¶
Notation:
`:=` means “gets assigned to” (assignment)
`=` means “equals” (assertion of fact)
“I’m gonna use colon equals to denote assignment. So \(a := a + 1\) means increment the value of \(a\) by 1. Whereas \(a = b\) means I’m asserting that the value of \(a\) is equal to the value of \(b\).” - Andrew Ng
Gradient descent update: $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
For all \(j = 0, 1, 2, \ldots, n\) simultaneously
Where:
\(\alpha\) = learning rate (step size)
\(\frac{\partial}{\partial \theta_j} J(\theta)\) = partial derivative (gradient component)
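The word “simultaneously” matters: every component of the gradient must be evaluated at the old \(\theta\) before any component of \(\theta\) is overwritten. A minimal sketch with made-up gradient values:

```python
import numpy as np

def simultaneous_update(theta, grad, alpha):
    """Update every theta_j using the gradient evaluated at the OLD theta.

    One vectorized assignment makes the update simultaneous by construction.
    Updating theta_0 first and then reusing the new value when updating
    theta_1 would be a different (incorrect) algorithm.
    """
    return theta - alpha * grad

theta = np.array([0.0, 0.0])
grad = np.array([2.0, -4.0])   # illustrative gradient values
theta = simultaneous_update(theta, grad, alpha=0.1)
print(theta)  # [-0.2  0.4]
```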
Computing the Gradient¶
For linear regression: $$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
Derivation (using the chain rule): $$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \right] = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})$$

Since \(h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j\), the inner derivative is \(\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}\), which gives the formula above. (The one-half cancels the 2 from differentiating the square.)
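A standard way to verify a gradient formula like this is a finite-difference check: the analytic gradient should agree with central differences of \(J\) to high precision. A minimal sketch on a tiny made-up dataset:

```python
import numpy as np

# Tiny dataset with the x_0 = 1 convention already applied
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
m = len(y)

def J(theta):
    errors = X @ theta - y
    return (1 / (2 * m)) * np.sum(errors ** 2)

theta = np.array([0.5, 0.5])

# Analytic gradient: (1/m) * X^T (X theta - y)
analytic = (1 / m) * X.T @ (X @ theta - y)

# Numerical gradient: central difference on each component
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny, on the order of 1e-9
```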
Batch Gradient Descent¶
Complete algorithm:
Repeat until convergence {
For j = 0 to n:
θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
}
“Batch” because each step uses all \(m\) training examples
Vectorized form: $$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$$
Where:
\(X\) is \(m \times (n+1)\) design matrix
\(y\) is \(m \times 1\) vector of target values
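The vectorized update and the per-component summation form are the same computation, which is easy to sanity-check on random data (a sketch, using arbitrary synthetic values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
X = np.c_[np.ones(m), rng.normal(size=(m, n))]  # design matrix with bias
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)
alpha = 0.1

# Vectorized update: theta - (alpha/m) X^T (X theta - y)
theta_vec = theta - (alpha / m) * X.T @ (X @ theta - y)

# Component-wise update (summation form), computed simultaneously
errors = X @ theta - y
theta_loop = theta.copy()
for j in range(n + 1):
    theta_loop[j] = theta[j] - alpha * (1 / m) * np.sum(errors * X[:, j])

print(np.allclose(theta_vec, theta_loop))  # True
```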
2.4 Stochastic Gradient Descent (SGD)¶
Motivation: Computational Efficiency¶
Problem with batch GD:
Each update requires scanning entire dataset
If \(m\) = 1 million examples → very slow!
Stochastic GD idea:
Update parameters using one example at a time
Much faster iterations
SGD Algorithm¶
Randomly shuffle training examples
Repeat {
For i = 1 to m:
For j = 0 to n:
θ_j := θ_j - α * (h_θ(x^(i)) - y^(i)) * x_j^(i)
}
Key difference: Update after each example, not after seeing all examples
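The pseudocode above can be sketched directly in NumPy. This is a minimal illustration on synthetic noiseless data (the shuffling, per-example update, and learning rate are the assumptions to tune in practice):

```python
import numpy as np

def sgd(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):        # shuffle each pass
            error = X[i] @ theta - y[i]     # residual on ONE example
            theta -= alpha * error * X[i]   # noisy gradient step
    return theta

# Synthetic data: y = 1 + 2x, with a bias column prepended
X = np.c_[np.ones(20), np.linspace(-1, 1, 20)]
y = 1 + 2 * X[:, 1]
theta = sgd(X, y)
print(theta)  # close to [1, 2]
```

With noiseless data SGD settles near the true parameters; with real noisy data it keeps bouncing in a region around the minimum, as described below.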
Batch GD vs SGD¶
| Aspect | Batch GD | Stochastic GD |
|---|---|---|
| Update frequency | After full pass through data | After each example |
| Gradient | Exact (average over all data) | Noisy (from single example) |
| Convergence | Smooth path to minimum | Oscillates around minimum |
| Speed per iteration | Slow | Fast |
| Large datasets | Impractical | Practical |
| Parallelization | Easy | Hard |
Convergence Behavior¶
Batch GD:
Follows gradient precisely
Smooth descent to minimum
Converges to exact minimum
Stochastic GD:
“Drunk walk” toward minimum
Oscillates due to noise from individual examples
Converges to region near minimum
Never settles exactly (keeps bouncing)
Practical Tips¶
Learning rate:
For SGD, often decrease \(\alpha\) over time: \(\alpha = \frac{c_1}{\text{iteration} + c_2}\)
Allows convergence while maintaining fast initial progress
Mini-batch GD (best of both worlds):
Use \(b\) examples per update (e.g., \(b = 32, 64, 128\))
Balances speed and gradient accuracy
Enables GPU parallelization
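The two tips combine naturally: mini-batch updates with a decaying schedule \(\alpha = \frac{c_1}{t + c_2}\). A sketch on synthetic data, where the batch size and the constants \(c_1, c_2\) are illustrative choices, not recommended defaults:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=4, epochs=200, c1=5.0, c2=10.0, seed=0):
    """Mini-batch GD with decaying learning rate alpha = c1 / (t + c2)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    t = 0
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            alpha = c1 / (t + c2)   # step size shrinks over time
            theta -= alpha * grad
            t += 1
    return theta

# Synthetic data: y = 3 + 0.5x, with a bias column prepended
X = np.c_[np.ones(16), np.linspace(-1, 1, 16)]
y = 3 + 0.5 * X[:, 1]
theta = minibatch_gd(X, y)
print(theta)  # approaches [3, 0.5]
```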
Setup: Import Libraries¶
Before implementing gradient descent, we load NumPy for vectorized linear algebra operations and Matplotlib for visualizing convergence behavior. Feature normalization, matrix multiplications, and cost function evaluations all rely on NumPy’s efficient array operations – the same operations that underpin virtually every ML framework. Visualization is equally important: plotting the cost function \(J(\theta)\) across iterations is the standard way to verify that gradient descent is converging properly and to diagnose issues like a learning rate that is too large or too small.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded!")
2.5 Implementation: Batch Gradient Descent¶
What: Building gradient descent from scratch¶
We now translate the mathematical update rule \(\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\) into working code. The implementation covers data loading, feature normalization, the cost function \(J(\theta)\), and the iterative gradient descent loop.
Why: Understanding the mechanics behind every ML optimizer¶
Every modern deep learning optimizer (Adam, RMSProp, AdaGrad) is a variation of this fundamental algorithm. By implementing vanilla batch gradient descent, you build intuition for learning rates, convergence diagnostics, and the critical role of feature scaling – skills that transfer directly to training neural networks at any scale.
How it works¶
The vectorized gradient \(\nabla_\theta J = \frac{1}{m} X^T(X\theta - y)\) computes the average direction of steepest ascent across all \(m\) training examples in a single matrix operation, and we step in the opposite direction by \(\alpha\) (the learning rate). Feature normalization ensures all dimensions of \(\theta\) converge at similar rates; without it, elongated cost contours cause oscillation along one axis while progressing slowly along another.
# Portland housing data (simplified)
# Features: size (sq ft), bedrooms
# Target: price (thousands of dollars)
X = np.array([
[2104, 3],
[1416, 2],
[1534, 3],
[852, 2],
[1940, 4],
[1752, 3],
[1212, 2],
[1668, 3]
])
y = np.array([400, 232, 315, 178, 370, 342, 240, 325])
m = len(y) # number of training examples
n = X.shape[1] # number of features
print(f"Training set size: {m} examples")
print(f"Number of features: {n}")
print(f"\nFirst 3 examples:")
for i in range(3):
print(f" House {i+1}: {X[i,0]} sq ft, {X[i,1]} bedrooms → ${y[i]}k")
def feature_normalize(X):
"""
Normalize features to have mean=0, std=1
Important for gradient descent to converge faster!
"""
mu = np.mean(X, axis=0)
sigma = np.std(X, axis=0)
X_norm = (X - mu) / sigma
return X_norm, mu, sigma
# Normalize features
X_norm, mu, sigma = feature_normalize(X)
# Add x_0 = 1 (intercept term)
X_with_bias = np.c_[np.ones(m), X_norm]
print("Feature normalization:")
print(f" Original: size mean = {mu[0]:.0f}, std = {sigma[0]:.0f}")
print(f" Normalized: mean = {np.mean(X_norm[:,0]):.2f}, std = {np.std(X_norm[:,0]):.2f}")
print(f"\nDesign matrix shape: {X_with_bias.shape}")
print(f" (m={m} examples, n+1={n+1} features including intercept)")
def compute_cost(X, y, theta):
"""
Compute cost function J(θ)
J(θ) = (1/2m) * Σ(h_θ(x^(i)) - y^(i))^2
"""
m = len(y)
predictions = X @ theta # h_θ(x) = θ^T x
errors = predictions - y
cost = (1/(2*m)) * np.sum(errors**2)
return cost
# Test with initial θ = [0, 0, 0]
theta_init = np.zeros(n + 1)
initial_cost = compute_cost(X_with_bias, y, theta_init)
print(f"Initial cost (θ = 0): J(θ) = {initial_cost:.2f}")
print("  (With all predictions 0, this is half the mean of y², which is large)")
def gradient_descent(X, y, theta, alpha, num_iters):
"""
Batch Gradient Descent
Repeat until convergence:
θ_j := θ_j - α * (1/m) * Σ(h_θ(x^(i)) - y^(i)) * x_j^(i)
"""
m = len(y)
cost_history = np.zeros(num_iters)
theta_history = np.zeros((num_iters, len(theta)))
for iteration in range(num_iters):
# Compute predictions
predictions = X @ theta
# Compute errors
errors = predictions - y
# Compute gradient (vectorized!)
gradient = (1/m) * (X.T @ errors)
# Update parameters
theta = theta - alpha * gradient
# Save history
cost_history[iteration] = compute_cost(X, y, theta)
theta_history[iteration] = theta
return theta, cost_history, theta_history
# Run gradient descent
alpha = 0.01 # learning rate
num_iters = 400
theta_init = np.zeros(n + 1)
theta_final, cost_history, theta_history = gradient_descent(
X_with_bias, y, theta_init, alpha, num_iters
)
print(f"Gradient Descent Results:")
print(f" Learning rate α = {alpha}")
print(f" Iterations: {num_iters}")
print(f"\nLearned parameters:")
print(f" θ_0 (intercept) = {theta_final[0]:.2f}")
print(f" θ_1 (size) = {theta_final[1]:.2f}")
print(f" θ_2 (bedrooms) = {theta_final[2]:.2f}")
print(f"\nFinal cost: J(θ) = {cost_history[-1]:.2f}")
print(f"Initial cost: J(θ) = {cost_history[0]:.2f}")
print(f"Improvement: {100*(1 - cost_history[-1]/cost_history[0]):.1f}% reduction")
# Visualize convergence
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Cost vs iterations
ax1.plot(cost_history, linewidth=2)
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Cost J(θ)', fontsize=12)
ax1.set_title('Gradient Descent Convergence', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Parameter evolution
for j in range(len(theta_final)):
ax2.plot(theta_history[:, j], label=f'θ_{j}', linewidth=2)
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Parameter Value', fontsize=12)
ax2.set_title('Parameter Evolution', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Left: Cost decreases smoothly (convex optimization!)")
print("Right: Parameters converge to optimal values")
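To predict the price of a new house, its features must be normalized with the same `mu` and `sigma` computed on the training set before applying \(\theta\). A sketch, using hypothetical values standing in for `theta_final`, `mu`, and `sigma` from the cells above (the actual numbers depend on the run):

```python
import numpy as np

def predict(features, theta, mu, sigma):
    """Normalize with the TRAINING mu/sigma, prepend x_0 = 1, apply theta^T x."""
    x_norm = (np.asarray(features, dtype=float) - mu) / sigma
    return theta @ np.r_[1.0, x_norm]

# Hypothetical learned values (illustrative only)
theta = np.array([300.0, 80.0, 10.0])
mu = np.array([1547.0, 2.75])
sigma = np.array([380.0, 0.66])

price = predict([1500, 3], theta, mu, sigma)
print(f"Predicted price: ${price:.1f}k")
```

Forgetting to reuse the training-set `mu` and `sigma` at prediction time is a common bug: normalizing a single new example by its own statistics is meaningless.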
Summary¶
Key Takeaways from Lecture 2¶
1. Linear Regression Framework:
Hypothesis: \(h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j\)
Cost function: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)
Goal: Find \(\theta\) that minimizes \(J(\theta)\)
2. Gradient Descent:
Iterative optimization algorithm
Update rule: \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)
Batch GD: Uses all training examples per update
SGD: Uses one example per update (faster for large datasets)
3. Important Concepts:
Feature normalization crucial for convergence
Learning rate \(\alpha\) controls step size
Linear regression has convex cost → no local minima
Vectorization makes implementation efficient
Next Lecture: Normal equations (analytical solution) and more on optimization