Optimization from Scratch: Gradient Descent & Adam¶

Source: The Math Behind Artificial Intelligence – Chapter 7

Overview¶

Optimization theory is why machine learning works – it’s the engine that turns data + architecture into a trained model.

This notebook covers:

  1. What is Optimization Theory? – the math of “finding the best”, and why the shape of the cost landscape matters

  2. Linear Regression from scratch with gradient descent (book’s step-by-step visualization)

  3. The Adam optimizer – the math inside PyTorch’s default optimizer, with a comparison of SGD vs Momentum vs RMSProp vs Adam

  4. Neural network with Adam using PyTorch (book’s credit approval example)

  5. Why Adam works – intuition, limitations, and AdamW

1. What is Optimization Theory?¶

Optimization = finding the input that minimizes (or maximizes) an objective function.

In machine learning:

  • Objective = loss function L(θ) = measure of how wrong our model is

  • Input = model parameters θ (weights, biases)

  • Goal = find θ* = argmin L(θ)

Why can’t we just solve L′(θ) = 0 analytically?

For a neural network with 1 billion parameters:

  • The loss surface is a function in 10⁹-dimensional space

  • Closed-form solution requires inverting matrices the size of the internet

  • Instead: follow the gradient downhill, iteratively

\[\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)\]
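To make the update rule concrete before applying it to a real model, here is a minimal sketch (not from the book) that runs it on the one-dimensional objective \(L(\theta) = (\theta - 3)^2\), whose gradient is \(2(\theta - 3)\); the starting point and learning rate are illustrative:

```python
# Gradient descent on L(theta) = (theta - 3)^2; the gradient is 2 * (theta - 3).
theta = 0.0    # illustrative starting point
alpha = 0.1    # illustrative learning rate
for t in range(100):
    grad = 2 * (theta - 3)
    theta -= alpha * grad
print(theta)   # converges to the minimum at theta = 3
```

Each step multiplies the distance to the minimum by \(1 - 2\alpha\), so with \(\alpha = 0.1\) the error shrinks by 20% per iteration.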
import numpy as np
import matplotlib.pyplot as plt

# --- Visualize the loss landscape ---
# Simple 2D case: L(w, b) = mean squared error for linear regression

np.random.seed(42)
X_data = np.linspace(0, 10, 50)
y_data = 3 * X_data + 2 + np.random.normal(0, 2, 50)

def mse_loss(w, b, X, y):
    y_pred = w * X + b
    return np.mean((y - y_pred) ** 2)

# Create loss landscape
w_range = np.linspace(0, 6, 100)
b_range = np.linspace(-2, 8, 100)
W, B = np.meshgrid(w_range, b_range)
Loss = np.array([[mse_loss(w, b, X_data, y_data) for w in w_range] for b in b_range])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot
contour = axes[0].contourf(W, B, Loss, levels=30, cmap='RdYlGn_r')
plt.colorbar(contour, ax=axes[0])
axes[0].scatter([3], [2], color='red', s=200, zorder=5, marker='*', label='True minimum (w=3, b=2)')
axes[0].set_xlabel('Weight w', fontsize=12)
axes[0].set_ylabel('Bias b', fontsize=12)
axes[0].set_title('MSE Loss Landscape', fontsize=13, fontweight='bold')
axes[0].legend()

# 3D surface (replace the unused second 2D axes with a 3D subplot)
axes[1].remove()
ax3d = fig.add_subplot(1, 2, 2, projection='3d')
ax3d.plot_surface(W, B, Loss, cmap='RdYlGn_r', alpha=0.8)
ax3d.set_xlabel('w')
ax3d.set_ylabel('b')
ax3d.set_zlabel('MSE Loss')
ax3d.set_title('Loss Surface (3D)', fontsize=12)

plt.tight_layout()
plt.show()

print(f"True minimum: w=3, b=2 → MSE={mse_loss(3, 2, X_data, y_data):.4f}")

2. Linear Regression from Scratch – Step by Step¶

Gradient descent on the simplest possible ML model¶

Linear regression is the ideal starting point for understanding optimization because its loss surface is a convex paraboloid – a perfect bowl with a single global minimum. This means gradient descent is guaranteed to converge, and we can visualize the entire optimization trajectory in 2D (weight \(w\) and bias \(b\)).

The MSE loss and its gradients have clean closed-form expressions:

\[L = \frac{1}{n}\sum_{i=1}^{n}(y_i - (wx_i + b))^2\]
\[\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - \hat{y}_i), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\]

At each epoch, we compute these gradients over the entire dataset and take a step of size \(\alpha\) (the learning rate) in the negative gradient direction. The gradient \(\partial L / \partial w\) is essentially the correlation between the input \(x\) and the residual error \((y - \hat{y})\): when the error is positively correlated with \(x\), the weight needs to increase. Watching the fitted line evolve from a random initialization to the true relationship provides concrete geometric intuition for what gradient descent does at every step.
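Before trusting hand-derived gradients, it is worth checking them against a central finite difference – a standard sanity check, sketched here on freshly regenerated data (the seed and step size below are illustrative, not the notebook’s):

```python
import numpy as np

# Regenerate noisy linear data (illustrative seed)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + rng.normal(0, 2, 50)

def loss(w, b):
    return np.mean((y - (w * X + b)) ** 2)

w, b, h = 0.1, 0.5, 1e-6
y_pred = w * X + b

# Analytic gradient from the formula above vs. central finite difference
dw_analytic = -2 * np.mean(X * (y - y_pred))
dw_numeric = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
print(abs(dw_analytic - dw_numeric))  # tiny: the two agree to several decimal places
```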

# --- Linear Regression from scratch (from book, Chapter 7) ---

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

# Initial parameters (from book)
w = 0.1
b = 0.5
learning_rate = 0.01
n_epochs = 200

# Track training history
history = {'w': [], 'b': [], 'loss': [], 'epoch': []}
snapshot_epochs = [0, 1, 2, 5, 10, 50, 100, 200]
snapshots = []

for epoch in range(n_epochs + 1):
    y_pred = w * X + b
    loss = np.mean((y - y_pred) ** 2)
    
    history['w'].append(w)
    history['b'].append(b)
    history['loss'].append(loss)
    history['epoch'].append(epoch)
    
    if epoch in snapshot_epochs:
        snapshots.append({'epoch': epoch, 'w': w, 'b': b,
                          'y_pred': y_pred.copy(), 'loss': loss})
    
    if epoch < n_epochs:
        # Gradient computation (from book)
        dw = -2 * np.mean(X * (y - y_pred))
        db = -2 * np.mean(y - y_pred)
        
        # Parameter update
        w = w - learning_rate * dw
        b = b - learning_rate * db

print(f"{'Epoch':>8} | {'w':>8} | {'b':>8} | {'Loss':>10}")
print("-" * 45)
for s in snapshots:
    print(f"{s['epoch']:>8} | {s['w']:>8.4f} | {s['b']:>8.4f} | {s['loss']:>10.4f}")
print(f"\nTrue values: w=3.000, b=2.000")
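Linear regression is also one of the rare models where iteration can be bypassed entirely: the normal equations give a closed-form least-squares solution, which makes a useful cross-check on the gradient-descent result. A sketch using np.linalg.lstsq (the data is regenerated so the block is self-contained):

```python
import numpy as np

# Same data-generating process as the training cell above
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 2, 50)

# Design matrix [x, 1] so the least-squares solution vector is [w, b]
A = np.column_stack([X, np.ones_like(X)])
(w_star, b_star), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"closed-form: w={w_star:.4f}, b={b_star:.4f}")  # close to the true w=3, b=2
```

Gradient descent should land very near these values; any gap between them and the true \(w=3, b=2\) comes from the noise, not the optimizer.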
# --- Step-by-step visualization (8 snapshots of the book's progression) ---

fig, axes = plt.subplots(2, 4, figsize=(18, 8))
axes = axes.flatten()

for i, s in enumerate(snapshots):
    ax = axes[i]
    
    # Data points
    ax.scatter(X, y, color='blue', alpha=0.5, s=25, label='Data')
    
    # Current fitted line
    ax.plot(X, s['y_pred'], 'r-', linewidth=2.5,
            label=f'Fitted: y={s["w"]:.2f}x+{s["b"]:.2f}')
    
    # True line
    ax.plot(X, y_true, 'g--', linewidth=1.5, alpha=0.7, label='True: y=3x+2')
    
    ax.set_title(f'Epoch {s["epoch"]}  |  Loss={s["loss"]:.2f}', fontsize=10, fontweight='bold')
    ax.set_ylim(-5, 35)
    ax.set_xlim(-0.5, 10.5)
    ax.grid(True, alpha=0.3)
    if i == 0:
        ax.legend(fontsize=8, loc='upper left')

plt.suptitle('Linear Regression Training – Step-by-Step (Chapter 7)', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
# --- Parameter trajectory on loss landscape ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Trajectory on contour plot
contour = axes[0].contourf(W, B, Loss, levels=30, cmap='RdYlGn_r', alpha=0.7)
plt.colorbar(contour, ax=axes[0])

# Plot the trajectory of w, b during training
w_hist = np.array(history['w'])
b_hist = np.array(history['b'])
axes[0].plot(w_hist, b_hist, 'b-', linewidth=1.5, alpha=0.7)
axes[0].scatter(w_hist[0], b_hist[0], color='blue', s=150, zorder=5, label=f'Start (w={w_hist[0]:.1f}, b={b_hist[0]:.1f})')
axes[0].scatter(w_hist[-1], b_hist[-1], color='red', s=150, zorder=5, label=f'End (w={w_hist[-1]:.2f}, b={b_hist[-1]:.2f})')
axes[0].scatter([3], [2], color='lime', s=200, marker='*', zorder=6, label='True minimum')
axes[0].set_xlabel('Weight w')
axes[0].set_ylabel('Bias b')
axes[0].set_title('Gradient Descent Trajectory')
axes[0].legend(fontsize=9)

# Loss curve
axes[1].semilogy(history['epoch'], history['loss'], 'b-', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE Loss (log scale)')
axes[1].set_title('Loss During Training')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3. The Adam Optimizer – Math Deep Dive¶

Adam (Adaptive Moment Estimation, Kingma & Ba, 2014) is the most widely used optimizer for deep learning.

Key idea: maintain per-parameter adaptive learning rates using running estimates of gradient moments.

Algorithm at each step t:

\[g_t = \nabla L(\theta_{t-1})\]
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(1st moment: mean)}\]
\[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(2nd moment: uncentered variance)}\]
\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{(bias correction)}\]
\[\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}\]
\[\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]

Default hyperparameters: α=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸

Intuition:

  • m_t: like momentum – smooths the gradient direction

  • v_t: like RMSProp – scales the step by gradient magnitude

  • Result: large steps for small/rare gradients, small steps for large/frequent gradients
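The bias-correction factor \(1/(1-\beta_1^t)\) is easy to tabulate (a quick sketch, not from the book):

```python
# How much the raw first moment underestimates the gradient mean early in training
beta1 = 0.9
for t in [1, 2, 5, 10, 50]:
    correction = 1 / (1 - beta1 ** t)
    print(f"t={t:>3}: m_hat / m = {correction:.3f}")
```

At \(t=1\) the raw moment \(m_1 = 0.1\,g_1\) must be scaled up by a factor of 10; by \(t=50\) the correction has decayed to nearly 1.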

# --- Adam from scratch ---

class Adam:
    """Adam optimizer implemented from scratch."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None   # 1st moment (mean)
        self.v = None   # 2nd moment (variance)
        self.t = 0      # timestep
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        
        # Update biased moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        
        # Bias-corrected estimates
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        # Parameter update
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


# Compare optimizers on linear regression
def train_with_optimizer(optimizer_fn, n_epochs=500):
    params = np.array([0.1, 0.5])  # [w, b]
    losses = []
    opt = optimizer_fn()
    
    for _ in range(n_epochs):
        w, b = params
        y_pred = w * X + b
        loss = np.mean((y - y_pred) ** 2)
        losses.append(loss)
        
        dw = -2 * np.mean(X * (y - y_pred))
        db = -2 * np.mean(y - y_pred)
        grads = np.array([dw, db])
        
        params = opt.step(params, grads)
    
    return losses, params


class SGD:
    def __init__(self, lr=0.01): self.lr = lr
    def step(self, p, g): return p - self.lr * g

class SGDMomentum:
    def __init__(self, lr=0.01, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, None
    def step(self, p, g):
        if self.v is None: self.v = np.zeros_like(p)
        self.v = self.beta * self.v + (1 - self.beta) * g
        return p - self.lr * self.v

class RMSProp:
    def __init__(self, lr=0.01, beta=0.9, eps=1e-8):  # 0.9 is the conventional RMSProp decay (0.999 is Adam's β₂)
        self.lr, self.beta, self.eps, self.v = lr, beta, eps, None
    def step(self, p, g):
        if self.v is None: self.v = np.zeros_like(p)
        self.v = self.beta * self.v + (1 - self.beta) * g**2
        return p - self.lr * g / (np.sqrt(self.v) + self.eps)


configs = [
    ('SGD', lambda: SGD(lr=0.01), 'blue'),
    ('Momentum', lambda: SGDMomentum(lr=0.01), 'green'),
    ('RMSProp', lambda: RMSProp(lr=0.01), 'orange'),
    ('Adam', lambda: Adam(lr=0.1), 'red'),
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

print(f"{'Optimizer':>12} | {'Final Loss':>12} | {'w':>8} | {'b':>8}")
print("-" * 50)
for name, opt_fn, color in configs:
    losses, final_params = train_with_optimizer(opt_fn)
    axes[0].semilogy(losses, color=color, linewidth=2, label=name)
    axes[1].semilogy(losses[:50], color=color, linewidth=2, label=name)
    print(f"{name:>12} | {losses[-1]:>12.4f} | {final_params[0]:>8.4f} | {final_params[1]:>8.4f}")

for ax, title in zip(axes, ['Full Training (500 epochs)', 'First 50 epochs (early convergence)']):
    ax.set_xlabel('Epoch')
    ax.set_ylabel('MSE Loss (log scale)')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print(f"\nTrue values: w=3.000, b=2.000")

4. Neural Network with Adam – PyTorch (from book)¶

Putting the optimizer to work on a real training loop¶

Chapter 7 of the book demonstrates Adam in action by training a binary classifier for credit approval using PyTorch. While our from-scratch Adam implementation above shows how the algorithm works internally, PyTorch’s torch.optim.Adam wraps the same logic into an efficient, GPU-accelerated optimizer that integrates with automatic differentiation.

The training loop follows the standard pattern that every deep learning practitioner should internalize: forward pass (compute predictions and loss), backward pass (loss.backward() computes gradients via autograd), and optimizer step (optimizer.step() applies the Adam update rule to all parameters). The optimizer.zero_grad() call at the start of each iteration is essential because PyTorch accumulates gradients by default – a design choice that supports gradient accumulation across mini-batches but causes incorrect updates if you forget to zero them out. After training, we can inspect Adam’s internal state (exp_avg and exp_avg_sq) to see the running moment estimates it maintains for each parameter tensor.
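The accumulation behavior is easy to demonstrate in isolation (a sketch on a single scalar parameter, independent of the book’s model):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes WITHOUT zeroing: the gradients add up
(2 * w).backward()
(2 * w).backward()
print(w.grad)        # tensor(4.) -- accumulated 2 + 2

# Zeroing in between leaves only the gradient of the last loss
w.grad.zero_()
(2 * w).backward()
print(w.grad)        # tensor(2.)
```

In a training loop, forgetting `optimizer.zero_grad()` means every `optimizer.step()` acts on the sum of all gradients since the last reset.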

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split

# --- Data setup (from book, Chapter 7) ---
torch.manual_seed(42)
x = torch.randn(10000, 4)        # 10k samples, 4 features
y = torch.randint(0, 2, (10000, 1)).float()  # binary labels

dataset = TensorDataset(x, y)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=32)

# --- Model (from book) ---
class CreditApprovalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu   = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        return self.sigmoid(self.output(self.relu(self.hidden(x))))


model = CreditApprovalNet()
optimizer = optim.Adam(model.parameters(), lr=0.0001)  # Adam (from book)
loss_fn = nn.BCELoss()

print("Model Architecture (from book, Chapter 7):")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"\nOptimizer: Adam(lr=0.0001, β₁=0.9, β₂=0.999)")
# --- Training loop ---
n_epochs = 50
train_losses = []
val_losses = []

for epoch in range(n_epochs):
    # Training
    model.train()
    epoch_train_loss = 0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        pred = model(batch_x)
        loss = loss_fn(pred, batch_y)
        loss.backward()
        optimizer.step()
        epoch_train_loss += loss.item()
    
    # Validation
    model.eval()
    epoch_val_loss = 0
    with torch.no_grad():
        for batch_x, batch_y in val_loader:
            pred = model(batch_x)
            epoch_val_loss += loss_fn(pred, batch_y).item()
    
    train_losses.append(epoch_train_loss / len(train_loader))
    val_losses.append(epoch_val_loss / len(val_loader))

# Final accuracy
model.eval()
with torch.no_grad():
    preds = (model(x) > 0.5).float()
    accuracy = (preds == y).float().mean()

# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].plot(train_losses, 'b-', linewidth=2, label='Train loss')
axes[0].plot(val_losses,   'r-', linewidth=2, label='Val loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('BCE Loss')
axes[0].set_title('Credit Approval Net – Training with Adam')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Adam internal state: inspect first layer bias gradient moments
state = optimizer.state[model.hidden.bias]
axes[1].bar(['m (1st moment)', 'v (2nd moment)'],
            [state['exp_avg'].abs().mean().item(),
             state['exp_avg_sq'].abs().mean().item()],
            color=['steelblue', 'coral'])
axes[1].set_title('Adam Internal State\n(hidden layer bias)')
axes[1].set_ylabel('Mean absolute value')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\nFinal training loss: {train_losses[-1]:.4f}")
print(f"Final val loss:      {val_losses[-1]:.4f}")
print(f"Accuracy:            {accuracy:.2%}")
print(f"\n(Note: ~50% accuracy is expected – labels are random, so the model cannot learn meaningfully.")
print(f"The goal is to see the training loop and Adam optimizer in action.)")

5. Why Adam Works – Intuition¶

Decomposing the optimizer into its mathematical components¶

Adam combines two complementary ideas, each addressing a different limitation of vanilla gradient descent:

| Component | Role | Analogy |
|---|---|---|
| \(m_t\) (1st moment) | Smooth gradient direction via exponential moving average | Ball rolling with momentum – overshoots but escapes flat regions |
| \(v_t\) (2nd moment) | Per-parameter gradient scale estimate | Automatic learning rate tuning – parameters with large gradients get smaller steps |
| Bias correction | Warm-up correction for initialization at \(m_0 = 0, v_0 = 0\) | Prevents artificially large steps at \(t=1\) when moments are underestimated |
| \(\epsilon\) | Numerical stability constant | Prevents division by zero when \(v_t \approx 0\) |
The adaptive learning rate is the key insight. For a parameter with consistently large gradients, \(\sqrt{v_t}\) is large, so the effective step \(\alpha / \sqrt{v_t}\) is small – preventing overshooting. For a parameter with sparse or small gradients, \(\sqrt{v_t}\) is small, so the step is large – allowing faster progress. This per-parameter adaptation is why Adam converges well on problems with heterogeneous gradient scales, such as embeddings (sparse gradients) combined with dense layers.
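This scale invariance can be checked numerically with the update formula itself (a sketch with made-up gradient magnitudes, using the steady-state approximation \(\hat{v} \approx g^2\) for a parameter whose gradients are consistently the same size):

```python
import numpy as np

alpha, eps = 0.001, 1e-8
for g in [100.0, 0.01]:                       # wildly different gradient scales
    v_hat = g ** 2                            # steady-state second moment estimate
    step = alpha * g / (np.sqrt(v_hat) + eps)
    print(f"gradient {g:>7}: step = {step:.8f}")
```

Both parameters move by almost exactly \(\alpha\) per step despite a 10,000x difference in gradient magnitude – the update depends on the gradient’s sign and consistency, not its scale.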

When Adam struggles: empirical evidence shows that SGD with momentum sometimes generalizes better than Adam on certain computer vision tasks (ResNets on ImageNet), possibly because Adam’s sharp adaptivity can converge to sharper minima. AdamW (Loshchilov & Hutter, 2019) fixes a subtle bug where Adam’s adaptive step interferes with L2 regularization by decoupling weight decay from the gradient-based update.
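The difference between the two weight-decay placements can be sketched in a few lines (a simplified single step with momentum and bias correction omitted; the function name and values are illustrative, not the library implementation):

```python
import numpy as np

def decayed_step(theta, g, v_hat, alpha=0.001, wd=0.01, decoupled=False):
    """One simplified adaptive step with weight decay (momentum omitted for clarity)."""
    if decoupled:  # AdamW: decay applied directly, outside the adaptive scaling
        return theta - alpha * g / (np.sqrt(v_hat) + 1e-8) - alpha * wd * theta
    g = g + wd * theta                                   # Adam + L2: decay folded into the gradient...
    return theta - alpha * g / (np.sqrt(v_hat) + 1e-8)   # ...then divided by sqrt(v_hat)

# Parameter with large historical gradients (v_hat = 100 -> sqrt = 10), zero current gradient
print(decayed_step(1.0, 0.0, 100.0, decoupled=False))  # L2 decay shrunk 10x by the adaptive scaling
print(decayed_step(1.0, 0.0, 100.0, decoupled=True))   # full alpha * wd * theta decay applied
```

With coupled L2, parameters with large \(\hat{v}\) are barely regularized at all; AdamW applies the same decay to every parameter, which is the behavior weight decay was meant to have.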

Summary¶

  • Gradient descent = follow the negative gradient downhill, iteratively: \(\theta_{t+1} = \theta_t - \alpha \nabla L\)

  • Learning rate \(\alpha\) controls step size – too large causes divergence, too small causes slow convergence

  • Adam = momentum (smooth direction via \(m_t\)) + RMSProp (adaptive scale via \(v_t\)) + bias correction

  • Linear regression has a convex loss surface (bowl-shaped) – gradient descent always finds the global minimum

  • Neural networks have non-convex losses with local minima, saddle points, and flat regions – yet Adam navigates them remarkably well in practice

Exercises¶

  1. Modify the linear regression training to use a learning rate of 0.1 instead of 0.01. What happens to the loss curve? Why?

  2. Implement AdaGrad from scratch (like Adam but without momentum: just the v_t term). Compare it with Adam on linear regression.

  3. In the neural network, change the architecture to have 3 hidden layers with 64 neurons each. Does it train better?

  4. Why does bias correction in Adam matter at small t values? Compute m_hat / m for t=1 with β₁=0.9 to see the correction factor.

  5. Research: What is the difference between Adam and AdamW? When should you prefer AdamW? (Hint: look at how weight decay is applied.)