# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from scipy.optimize import minimize, minimize_scalar
from matplotlib import animation
from IPython.display import HTML
sns.set_style('whitegrid')
np.random.seed(42)
np.set_printoptions(precision=4, suppress=True)
7.1 Gradient Descent¶
Algorithm: Gradient Descent¶
Iterative update:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t)\)
where:
α: learning rate (step size)
∇f(x_t): gradient at current point
Intuition: Move in the direction of steepest descent!
Variants:¶
Batch GD: Use all data
Stochastic GD (SGD): Use one sample at a time
Mini-batch GD: Use small batches
Momentum: Accumulate velocity
Adam: Adaptive learning rates
Advanced Optimization Theory for Deep Learning¶
1. Convergence Analysis of Gradient Descent¶
Smooth Convex Functions¶
For \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) that is \(L\)-smooth (Lipschitz continuous gradient):
Theorem (GD Convergence): With step size \(\alpha \leq \frac{1}{L}\):
\(f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2\alpha t}\)
Rate: \(O(1/t)\) (sublinear)
Strongly Convex Functions¶
If \(f\) is additionally \(\mu\)-strongly convex:
Condition number: \(\kappa = L/\mu\)
Theorem (Strong Convexity): With \(\alpha = \frac{1}{L}\):
\(\|\mathbf{x}_t - \mathbf{x}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^t \|\mathbf{x}_0 - \mathbf{x}^*\|^2\)
Rate: \(O((1-\mu/L)^t)\) (linear/exponential)
Interpretation: Convergence speed depends on \(\kappa\):
\(\kappa \approx 1\): Fast (well-conditioned)
\(\kappa \gg 1\): Slow (ill-conditioned, elongated valley)
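The dependence on the condition number can be checked numerically. The sketch below runs gradient descent with \(\alpha = 1/L\) on a toy quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}\) with \(\mathbf{A} = \mathrm{diag}(\mu, L)\); the specific values \(\mu = 1\), \(L = 10\) are illustrative choices, not from the text.

```python
import numpy as np

# GD on f(x) = 1/2 x^T A x with A = diag(mu, L) and alpha = 1/L.
# The high-curvature component is annihilated immediately (1 - alpha*L = 0);
# the low-curvature component decays geometrically at rate 1 - mu/L.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
alpha = 1.0 / L

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - alpha * (A @ x)          # gradient of 1/2 x^T A x is A x

print(x[0], (1 - mu / L) ** 50)      # slow component matches (1 - mu/L)^t
print(x[1])                          # fast component is exactly 0
```

The slow eigendirection is what makes ill-conditioned problems painful: its error shrinks by only \(1 - \mu/L\) per step.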
2. Stochastic Gradient Descent (SGD)¶
Replace the full gradient with a noisy estimate from a mini-batch:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \nabla f_{\mathcal{B}_t}(\mathbf{x}_t)\)
where \(\mathcal{B}_t\) is random mini-batch, \(\mathbb{E}[\nabla f_{\mathcal{B}_t}] = \nabla f\).
SGD Convergence (Convex Case)¶
Theorem: With decreasing step size \(\alpha_t = \frac{C}{\sqrt{t}}\):
\(\mathbb{E}[f(\bar{\mathbf{x}}_T)] - f(\mathbf{x}^*) \leq O\!\left(\frac{1}{\sqrt{T}}\right)\)
where \(\bar{\mathbf{x}}_T = \frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\) (averaged iterates).
Trade-off:
Constant step size: Faster initially, but oscillates near optimum
Decreasing step size: Converges to optimum, but slower
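This trade-off shows up even on a one-dimensional toy problem. The sketch below (an illustrative setup, not from the text) minimizes \(\mathbb{E}[(w - d)^2/2]\) over samples \(d \sim \mathcal{N}(3, 1)\), whose optimum is \(w^* = \mathbb{E}[d] = 3\), comparing a constant step size against \(\alpha_t = C/\sqrt{t}\).

```python
import numpy as np

# SGD on E[(w - d)^2 / 2] with d ~ N(3, 1): the stochastic gradient for
# sample d_t is simply (w - d_t).
rng = np.random.default_rng(0)
d = rng.normal(3.0, 1.0, size=2000)

def sgd(steps, lr_fn, w0=0.0):
    w, iterates = w0, []
    for t, d_t in enumerate(d[:steps], start=1):
        w -= lr_fn(t) * (w - d_t)
        iterates.append(w)
    return np.array(iterates)

w_const = sgd(2000, lambda t: 0.1)               # constant: keeps oscillating
w_decr = sgd(2000, lambda t: 0.5 / np.sqrt(t))   # C/sqrt(t): settles down

# The constant-step run fluctuates more around w* = 3 near the end
print(np.std(w_const[-500:]), np.std(w_decr[-500:]))
```

The tail of the constant-step run keeps a noise floor set by \(\alpha\), while the decreasing schedule shrinks it over time.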
Variance Reduction¶
Problem: SGD has high variance \(\text{Var}(\nabla f_{\mathcal{B}}) = \sigma^2/|\mathcal{B}|\)
Solutions:
SVRG (Stochastic Variance Reduced Gradient): Periodically compute full gradient
SAGA: Maintain table of per-sample gradients
Large batch training: Use batch size \(\propto \sqrt{T}\) for linear speedup
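The SVRG control-variate idea can be sketched on a toy finite sum (an illustrative problem, not from the text): each epoch takes a snapshot \(\tilde{w}\) and its full gradient, and inner steps correct the per-sample gradient by \(g_i(w) - g_i(\tilde{w}) + \nabla f(\tilde{w})\). For this particular quadratic the per-sample noise cancels exactly, which makes the variance-reduction mechanism easy to see.

```python
import numpy as np

# SVRG sketch on f(w) = (1/n) sum_i (w - d_i)^2 / 2, optimum w* = mean(d)
rng = np.random.default_rng(1)
d = rng.normal(3.0, 1.0, size=100)

w, lr = 0.0, 0.5
for epoch in range(10):
    w_snap = w
    full_grad = w_snap - d.mean()      # full gradient at the snapshot
    for i in rng.integers(0, len(d), size=20):
        # variance-reduced gradient: g_i(w) - g_i(w_snap) + full_grad
        g = (w - d[i]) - (w_snap - d[i]) + full_grad
        w -= lr * g

print(w, d.mean())                     # w converges to mean(d)
```

Here \(g = w - \bar{d}\) exactly, so the iteration contracts deterministically; on general losses the correction only reduces, not eliminates, the variance.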
3. Momentum Methods¶
Classical Momentum (Polyak, 1964)¶
\(\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_{t+1}\)
Physical analogy: Ball rolling downhill with friction coefficient \((1-\beta)\).
Effective step: \(\mathbf{x}_{t+1} - \mathbf{x}_t = -\alpha \sum_{i=0}^t \beta^i \nabla f(\mathbf{x}_{t-i})\)
Benefits:
Accelerates in consistent directions
Dampens oscillations in high-curvature directions
Escapes shallow local minima
Optimal \(\beta\): For quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}\) with condition number \(\kappa\):
\(\beta^* = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^2\)
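The payoff of tuned momentum is easy to check numerically. The sketch below (a toy comparison, with the textbook choices \(\beta = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\) and \(\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}\) for heavy-ball) runs plain GD and heavy-ball momentum on a quadratic with \(\kappa = 100\).

```python
import numpy as np

# Heavy-ball momentum vs. plain GD on f(x) = 1/2 x^T A x, A = diag(1, 100)
mu, L = 1.0, 100.0
A = np.diag([mu, L])
kappa = L / mu
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2  # tuned momentum
alpha_mom = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2          # matching step size
alpha_gd = 1.0 / L

x_gd = np.array([1.0, 1.0])
x_mom = np.array([1.0, 1.0])
x_prev = x_mom.copy()
for _ in range(200):
    x_gd = x_gd - alpha_gd * (A @ x_gd)
    # heavy-ball: x_{t+1} = x_t - alpha*grad + beta*(x_t - x_{t-1})
    x_new = x_mom - alpha_mom * (A @ x_mom) + beta * (x_mom - x_prev)
    x_prev, x_mom = x_mom, x_new

print(np.linalg.norm(x_gd), np.linalg.norm(x_mom))  # momentum is far closer to 0
```

After 200 steps GD has only contracted the slow component by \((1 - 1/\kappa)^{200} \approx 0.13\), while momentum's error is essentially zero.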
Nesterov Accelerated Gradient (NAG)¶
“Look-ahead” gradient: evaluate the gradient at the momentum-predicted point:
\(\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t - \alpha \beta \mathbf{v}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_{t+1}\)
Convergence rate (strongly convex): \(O\left((1 - \frac{1}{\sqrt{\kappa}})^t\right)\) vs. GD’s \(O\left((1-\frac{1}{\kappa})^t\right)\)
Improvement: \(\sqrt{\kappa}\) vs. \(\kappa\) in exponent!
4. Adaptive Learning Rates¶
AdaGrad (Duchi et al., 2011)¶
Per-parameter learning rates based on historical gradients:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{\mathbf{G}_t} + \epsilon} \odot \nabla f(\mathbf{x}_t)\)
where \(\mathbf{G}_t = \sum_{\tau=1}^t \nabla f(\mathbf{x}_\tau)^2\) (element-wise squares).
Effect: Smaller updates for frequently updated parameters, larger for sparse.
Problem: \(\mathbf{G}_t\) grows monotonically → learning rate decays to zero.
RMSprop (Hinton, 2012)¶
Exponential moving average of squared gradients:
\(\mathbf{v}_t = \gamma \mathbf{v}_{t-1} + (1-\gamma)\nabla f(\mathbf{x}_t)^2, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{\mathbf{v}_t} + \epsilon} \odot \nabla f(\mathbf{x}_t)\)
Fix: Decaying average prevents indefinite shrinking.
Typical: \(\gamma = 0.9\), \(\epsilon = 10^{-8}\)
5. Adam: Adaptive Moment Estimation (Kingma & Ba, 2015)¶
Combines momentum and adaptive learning rates:
\(\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2\)
\(\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha\, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\)
Defaults: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\alpha = 0.001\), \(\epsilon = 10^{-8}\)
Bias correction rationale: At \(t=0\), \(\mathbf{m}_0 = \mathbf{v}_0 = 0\). Without correction:
\(\mathbf{m}_1 = (1-\beta_1) \mathbf{g}_1 \approx 0.1 \mathbf{g}_1\) (underestimate if \(\mathbf{g}_1\) is signal)
Correction: \(\hat{\mathbf{m}}_1 = \frac{0.1 \mathbf{g}_1}{1 - 0.9} = \mathbf{g}_1\) ✓
Effective learning rate per parameter:
\(\alpha_{\text{eff},i} = \frac{\alpha}{\sqrt{\hat{v}_{t,i}} + \epsilon}\)
Typically \(\alpha_{\text{eff},i} \in [10^{-5}, 10^{-2}]\) even with fixed \(\alpha = 10^{-3}\).
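The bias-correction arithmetic above can be verified in two lines. This is a minimal numeric check of the \(t = 1\) case with the default \(\beta_1 = 0.9\) (the gradient values are arbitrary illustrative numbers).

```python
import numpy as np

# Adam bias correction at t = 1: the raw first moment is 0.1 * g1
# (since m_0 = 0), and dividing by (1 - beta1^1) restores g1 exactly.
beta1 = 0.9
g1 = np.array([0.5, -2.0])

m1 = (1 - beta1) * g1              # raw first moment: underestimates by 10x
m1_hat = m1 / (1 - beta1 ** 1)     # bias-corrected: equals g1

print(m1, m1_hat)
```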
Adam Variants¶
| Variant | Modification | Benefit |
|---|---|---|
| AdamW | Decoupled weight decay: \(\mathbf{x}_t \leftarrow (1-\lambda)\mathbf{x}_t - \alpha \hat{\mathbf{m}}_t/\sqrt{\hat{\mathbf{v}}_t}\) | Better regularization |
| AMSGrad | \(\hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t)\) (non-decreasing) | Convergence guarantee |
| RAdam | Rectified bias correction with warmup | Stable early training |
| AdaBelief | Adapt to gradient prediction error \((\mathbf{g}_t - \mathbf{m}_t)^2\) | Better generalization |
6. AdamW: Weight Decay Done Right¶
Problem with Adam + L2 regularization:
L2 loss: \(\mathcal{L}(\mathbf{x}) = f(\mathbf{x}) + \frac{\lambda}{2}\|\mathbf{x}\|^2\)
Gradient: \(\nabla \mathcal{L} = \nabla f(\mathbf{x}) + \lambda \mathbf{x}\)
Adam applies adaptive learning rate to \(\lambda \mathbf{x}\) → incorrect scaling.
Solution (Loshchilov & Hutter, 2019): Decouple weight decay from the gradient-based update:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \mathbf{x}_t\right)\)
Effect: Weight decay applied with constant rate \(\alpha \lambda\), independent of adaptive scaling.
Empirical impact:
Transformers: AdamW with \(\lambda = 0.01\) outperforms Adam + L2
Better generalization on vision tasks
7. Learning Rate Schedules¶
Step Decay¶
\(\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/T \rfloor}\)
Typical: \(\gamma = 0.1\), decay every \(T\) epochs.
Exponential Decay¶
\(\alpha_t = \alpha_0 e^{-kt}\)
Cosine Annealing (Loshchilov & Hutter, 2017)¶
\(\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\frac{\pi t}{T}\right)\)
Benefit: Smooth decay; with restarts, the rate periodically jumps back up.
Cosine Annealing with Warm Restarts (SGDR):
Restart schedule at \(T_i = T_0 \cdot T_{\text{mult}}^i\).
Intuition: Escape local minima by restarting with higher learning rate.
Warmup (Goyal et al., 2017)¶
Linear increase from 0 to \(\alpha_{\max}\) over the first \(T_w\) steps:
\(\alpha_t = \alpha_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w\)
Why needed: A large initial learning rate combined with Adam's early, noisy moment estimates can cause instability.
Transformer standard: Warmup for 4000-10000 steps.
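The four schedules above reduce to one-line functions of the step count. The sketch below implements them with illustrative hyperparameter values (\(\alpha_0\), \(\gamma\), \(T\), \(T_w\), \(k\) follow the text's notation; the numbers are assumptions for the demo).

```python
import numpy as np

# Learning-rate schedules as plain functions of the step t
alpha0, alpha_min, alpha_max = 0.1, 0.0, 0.1
gamma, T, T_w, k = 0.1, 30, 10, 0.05

def step_decay(t):
    return alpha0 * gamma ** (t // T)           # drop by gamma every T steps

def exponential_decay(t):
    return alpha0 * np.exp(-k * t)              # smooth exponential decay

def cosine_annealing(t):
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + np.cos(np.pi * t / T))

def warmup(t):
    return alpha_max * min(t, T_w) / T_w        # linear ramp, then flat

print(step_decay(0), cosine_annealing(0), cosine_annealing(T), warmup(T_w))
```

At \(t = 0\) cosine annealing starts at \(\alpha_{\max}\) and reaches \(\alpha_{\min}\) at \(t = T\); restarts simply reset \(t\) to 0.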
8. Second-Order Methods¶
Newton’s Method¶
Use the second-order Taylor expansion:
\(f(\mathbf{x}_t + \Delta\mathbf{x}) \approx f(\mathbf{x}_t) + \nabla f(\mathbf{x}_t)^T \Delta\mathbf{x} + \frac{1}{2}\Delta\mathbf{x}^T \mathbf{H}_t \Delta\mathbf{x}\)
Minimize w.r.t. \(\Delta \mathbf{x}\):
\(\Delta\mathbf{x}^* = -\mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t)\)
Update: \(\mathbf{x}_{t+1} = \mathbf{x}_t - \mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t)\)
Pros:
Quadratic convergence: \(\|\mathbf{x}_{t+1} - \mathbf{x}^*\| \leq C \|\mathbf{x}_t - \mathbf{x}^*\|^2\)
Scale-invariant (automatic preconditioning)
Cons:
Compute Hessian \(\mathbf{H} \in \mathbb{R}^{d \times d}\): \(O(d^2)\) memory
Invert Hessian: \(O(d^3)\) time
Hessian may not be positive definite
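Newton's method is easy to try on the Rosenbrock function (the same test function used in the code later in this chapter), since its Hessian has a simple closed form. A sketch with an analytic Hessian:

```python
import numpy as np

# Newton's method on Rosenbrock f(x,y) = (1-x)^2 + 100(y-x^2)^2.
# From x0 = (-1, 1), pure Newton reaches the minimizer (1, 1) in two steps.
def grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

def hessian(x):
    return np.array([[2 - 400 * (x[1] - x[0] ** 2) + 800 * x[0] ** 2, -400 * x[0]],
                     [-400 * x[0], 200.0]])

x = np.array([-1.0, 1.0])
for _ in range(5):
    # x_{t+1} = x_t - H^{-1} grad  (solve, never invert explicitly)
    x = x - np.linalg.solve(hessian(x), grad(x))

print(x)   # converges to [1, 1]
```

Compare this with the hundreds of iterations first-order methods need on the same function; the catch is the \(O(d^3)\) solve, which is why this only scales to small \(d\).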
Quasi-Newton: L-BFGS¶
Approximate \(\mathbf{H}^{-1}\) using gradient history (last \(m\) steps):
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{B}_t \nabla f(\mathbf{x}_t)\)
where \(\mathbf{B}_t \approx \mathbf{H}_t^{-1}\) is updated via BFGS formula.
Memory: \(O(md)\) instead of \(O(d^2)\)
Typical: \(m = 10-20\)
Use case: Small to medium \(d\) (\(d \lesssim 10^6\)), full-batch optimization.
Problem for SGD: Stochastic curvature estimates are too noisy.
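In the full-batch regime, L-BFGS is available off the shelf. A sketch using SciPy's `L-BFGS-B` solver on the Rosenbrock function (the `maxcor` option sets the history length \(m\)):

```python
import numpy as np
from scipy.optimize import minimize

# Full-batch L-BFGS on Rosenbrock via SciPy
def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

res = minimize(rosenbrock, x0=np.array([-1.0, 1.0]),
               jac=rosenbrock_grad, method='L-BFGS-B',
               options={'maxcor': 10})   # keep m = 10 history pairs

print(res.x, res.nit)   # converges to [1, 1]
```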
Natural Gradient (Amari, 1998)¶
Use Fisher information matrix \(\mathbf{F}\) instead of Hessian:
Update: \(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{F}^{-1} \nabla f(\mathbf{x}_t)\)
Interpretation: Gradient in parameter space with Riemannian metric.
Advantage: Invariant to reparameterization of model.
Approximations: K-FAC (Kronecker-factored approximate curvature) for neural networks.
9. Gradient Clipping and Scaling¶
Gradient Clipping (Pascanu et al., 2013)¶
Prevent exploding gradients in RNNs by rescaling whenever the norm exceeds a threshold:
\(\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right)\mathbf{g}\)
Effect: Cap gradient norm at threshold \(\theta\) (typical: \(\theta = 1\) or \(5\)).
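Clipping by global norm is a three-line function; a minimal sketch:

```python
import numpy as np

# Clip a gradient vector so its norm never exceeds the threshold;
# gradients already under the threshold pass through unchanged.
def clip_by_norm(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                      # norm 5
g_clipped = clip_by_norm(g, threshold=1.0)
print(g_clipped, np.linalg.norm(g_clipped))   # direction kept, norm capped at 1
```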
Gradient Scaling¶
For mixed-precision training (FP16):
Compute gradients, then unscale before update.
Reason: FP16 underflows for small gradients → scale the loss by \(s \in [2^{10}, 2^{15}]\).
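The underflow problem is easy to reproduce with NumPy's `float16` type. A sketch (the scale factor \(2^{12}\) is an arbitrary value in the usual range):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (2^-24 ~ 6e-8),
# so it underflows to zero; after scaling by 2^12 it survives.
tiny_grad = 1e-8
scale = 2.0 ** 12

unscaled = np.float16(tiny_grad)          # underflows to 0.0
scaled = np.float16(tiny_grad * scale)    # representable in FP16

# Unscale in FP32 before the optimizer update
recovered = np.float32(scaled) / scale
print(unscaled, scaled, recovered)
```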
10. Optimization Landscape of Neural Networks¶
Loss Surface Geometry¶
High-dimensional phenomena:
Saddle points dominate: Most critical points are saddles, not local minima
Mode connectivity: Different minima connected by low-loss paths
Wide vs. sharp minima: Wide minima generalize better (flatter curvature)
Theorem (Dauphin et al., 2014): For random high-dimensional functions, probability of local minimum decreases exponentially with dimension.
Lottery Ticket Hypothesis (Frankle & Carbin, 2019)¶
Claim: Dense networks contain sparse subnetworks that can train to similar accuracy.
Finding: Initialization matters more than commonly thought.
11. Practical Recommendations¶
| Task | Optimizer | Learning Rate | Schedule |
|---|---|---|---|
| Computer Vision | AdamW | \(10^{-4}\) to \(10^{-3}\) | Cosine annealing |
| NLP (Transformers) | AdamW | \(10^{-4}\) | Warmup + linear decay |
| Reinforcement Learning | Adam | \(3 \times 10^{-4}\) | Constant or linear decay |
| Small datasets | SGD + momentum | \(0.01\) to \(0.1\) | Step decay (÷10 at epochs 60, 80) |
| Large-scale vision | LAMB/LARS | Scaled with batch size | Warmup + polynomial decay |
Hyperparameter tuning priority:
Learning rate (most important!)
Batch size
Optimizer choice
Weight decay
\(\beta\) coefficients (Adam)
Debugging tips:
Loss exploding → reduce learning rate or use gradient clipping
Slow convergence → increase learning rate or use adaptive optimizer
Overfitting → add weight decay, dropout, or data augmentation
Training unstable → add warmup, reduce batch size, use gradient accumulation
# Advanced Optimizer Implementations and Comparisons
import numpy as np
import matplotlib.pyplot as plt
from typing import Callable, Tuple, Dict, List
import seaborn as sns
# ============================================================
# 1. Optimizer Implementations
# ============================================================
class Optimizer:
"""Base optimizer class"""
def __init__(self, params: np.ndarray, lr: float):
self.params = params.copy()
self.lr = lr
self.history = [params.copy()]
def step(self, grad: np.ndarray):
raise NotImplementedError
def get_history(self):
return np.array(self.history)
class SGD(Optimizer):
"""Stochastic Gradient Descent"""
def step(self, grad: np.ndarray):
self.params -= self.lr * grad
self.history.append(self.params.copy())
class MomentumSGD(Optimizer):
"""SGD with Momentum"""
def __init__(self, params: np.ndarray, lr: float, beta: float = 0.9):
super().__init__(params, lr)
self.beta = beta
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.v = self.beta * self.v + grad
self.params -= self.lr * self.v
self.history.append(self.params.copy())
class NAG(Optimizer):
"""Nesterov Accelerated Gradient"""
def __init__(self, params: np.ndarray, lr: float, beta: float = 0.9):
super().__init__(params, lr)
self.beta = beta
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
        # simplified NAG: the look-ahead is folded into the update below,
        # which effectively steps along beta*v_{t+1} + grad
self.v = self.beta * self.v + grad
self.params -= self.lr * (self.beta * self.v + grad)
self.history.append(self.params.copy())
class AdaGrad(Optimizer):
"""AdaGrad: Adaptive Gradient"""
def __init__(self, params: np.ndarray, lr: float, eps: float = 1e-8):
super().__init__(params, lr)
self.eps = eps
self.G = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.G += grad ** 2
self.params -= self.lr * grad / (np.sqrt(self.G) + self.eps)
self.history.append(self.params.copy())
class RMSprop(Optimizer):
"""RMSprop: Root Mean Square Propagation"""
def __init__(self, params: np.ndarray, lr: float, gamma: float = 0.9, eps: float = 1e-8):
super().__init__(params, lr)
self.gamma = gamma
self.eps = eps
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.v = self.gamma * self.v + (1 - self.gamma) * grad ** 2
self.params -= self.lr * grad / (np.sqrt(self.v) + self.eps)
self.history.append(self.params.copy())
class Adam(Optimizer):
"""Adam: Adaptive Moment Estimation"""
def __init__(self, params: np.ndarray, lr: float, beta1: float = 0.9,
beta2: float = 0.999, eps: float = 1e-8):
super().__init__(params, lr)
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = np.zeros_like(params)
self.v = np.zeros_like(params)
self.t = 0
def step(self, grad: np.ndarray):
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
# Bias correction
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
self.params -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
self.history.append(self.params.copy())
class AdamW(Optimizer):
"""AdamW: Adam with decoupled Weight Decay"""
def __init__(self, params: np.ndarray, lr: float, beta1: float = 0.9,
beta2: float = 0.999, eps: float = 1e-8, weight_decay: float = 0.01):
super().__init__(params, lr)
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.weight_decay = weight_decay
self.m = np.zeros_like(params)
self.v = np.zeros_like(params)
self.t = 0
def step(self, grad: np.ndarray):
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
# AdamW: Decoupled weight decay
self.params -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * self.params)
self.history.append(self.params.copy())
# ============================================================
# 2. Test Functions
# ============================================================
def rosenbrock(x: np.ndarray) -> float:
"""Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²"""
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x: np.ndarray) -> np.ndarray:
"""Gradient of Rosenbrock"""
dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dy = 200 * (x[1] - x[0]**2)
return np.array([dx, dy])
def beale(x: np.ndarray) -> float:
"""Beale function: highly non-convex"""
return ((1.5 - x[0] + x[0]*x[1])**2 +
(2.25 - x[0] + x[0]*x[1]**2)**2 +
(2.625 - x[0] + x[0]*x[1]**3)**2)
def beale_grad(x: np.ndarray) -> np.ndarray:
"""Gradient of Beale (numerical approximation)"""
eps = 1e-8
grad = np.zeros(2)
for i in range(2):
x_plus = x.copy()
x_plus[i] += eps
grad[i] = (beale(x_plus) - beale(x)) / eps
return grad
# ============================================================
# 3. Optimizer Comparison
# ============================================================
def run_optimizer(opt_class, params_init: np.ndarray, grad_fn: Callable,
n_steps: int, **kwargs) -> np.ndarray:
"""Run optimizer for n_steps"""
opt = opt_class(params_init, **kwargs)
for _ in range(n_steps):
grad = grad_fn(opt.params)
opt.step(grad)
return opt.get_history()
# Test parameters
x0 = np.array([-1.0, 1.0])
n_steps = 100
# Run all optimizers
optimizers = {
'SGD': (SGD, {'lr': 0.001}),
'Momentum': (MomentumSGD, {'lr': 0.001, 'beta': 0.9}),
'NAG': (NAG, {'lr': 0.001, 'beta': 0.9}),
'AdaGrad': (AdaGrad, {'lr': 0.1}),
'RMSprop': (RMSprop, {'lr': 0.01, 'gamma': 0.9}),
'Adam': (Adam, {'lr': 0.01, 'beta1': 0.9, 'beta2': 0.999}),
'AdamW': (AdamW, {'lr': 0.01, 'weight_decay': 0.01}),
}
trajectories = {}
for name, (opt_class, kwargs) in optimizers.items():
trajectories[name] = run_optimizer(opt_class, x0, rosenbrock_grad, n_steps, **kwargs)
print("="*70)
print("OPTIMIZER COMPARISON ON ROSENBROCK FUNCTION")
print("="*70)
print(f"Starting point: x₀ = {x0}")
print(f"True minimum: x* = [1, 1], f(x*) = 0")
print(f"\nAfter {n_steps} steps:")
print(f"{'Optimizer':<12} {'Final x':<20} {'Final f(x)':<15} {'Distance to x*'}")
print("-"*70)
for name, traj in trajectories.items():
final = traj[-1]
f_final = rosenbrock(final)
dist = np.linalg.norm(final - np.array([1, 1]))
print(f"{name:<12} [{final[0]:6.3f}, {final[1]:6.3f}] {f_final:12.6f} {dist:8.5f}")
# ============================================================
# 4. Visualization
# ============================================================
fig = plt.figure(figsize=(18, 12))
# Create contour plot
x_range = np.linspace(-1.5, 1.5, 300)
y_range = np.linspace(-0.5, 1.5, 300)
X, Y = np.meshgrid(x_range, y_range)
Z = np.array([[rosenbrock(np.array([x, y])) for x in x_range] for y in y_range])
# Plot 1: All optimizers
ax1 = plt.subplot(2, 3, 1)
levels = np.logspace(-1, 3, 20)
cs = ax1.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax1.plot(1, 1, 'r*', markersize=20, label='Optimum')
colors = plt.cm.tab10(np.linspace(0, 1, len(trajectories)))
for (name, traj), color in zip(trajectories.items(), colors):
ax1.plot(traj[:, 0], traj[:, 1], '-', color=color, linewidth=1.5,
label=name, alpha=0.7)
ax1.plot(traj[0, 0], traj[0, 1], 'o', color=color, markersize=8)
ax1.set_xlim(-1.5, 1.5)
ax1.set_ylim(-0.5, 1.5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Optimizer Trajectories on Rosenbrock')
ax1.legend(fontsize=8, loc='upper left')
ax1.grid(True, alpha=0.3)
# Plot 2-4: Individual optimizer comparisons
for idx, (group_name, opt_names) in enumerate([
('First-order', ['SGD', 'Momentum', 'NAG']),
('Adaptive', ['AdaGrad', 'RMSprop']),
('Adam variants', ['Adam', 'AdamW'])
], 2):
ax = plt.subplot(2, 3, idx)
ax.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax.plot(1, 1, 'r*', markersize=15)
for name in opt_names:
if name in trajectories:
traj = trajectories[name]
ax.plot(traj[:, 0], traj[:, 1], '-o', linewidth=2, markersize=3,
label=name, alpha=0.8)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title(f'{group_name} Optimizers')
ax.legend()
ax.grid(True, alpha=0.3)
# Plot 5: Convergence curves
ax5 = plt.subplot(2, 3, 5)
for name, traj in trajectories.items():
f_values = np.array([rosenbrock(x) for x in traj])
ax5.semilogy(f_values, linewidth=2, label=name, alpha=0.8)
ax5.set_xlabel('Iteration')
ax5.set_ylabel('f(x) (log scale)')
ax5.set_title('Convergence Comparison')
ax5.legend(fontsize=8)
ax5.grid(True, alpha=0.3)
# Plot 6: Distance to optimum
ax6 = plt.subplot(2, 3, 6)
for name, traj in trajectories.items():
distances = np.array([np.linalg.norm(x - np.array([1, 1])) for x in traj])
ax6.semilogy(distances, linewidth=2, label=name, alpha=0.8)
ax6.set_xlabel('Iteration')
ax6.set_ylabel('||x - x*|| (log scale)')
ax6.set_title('Distance to Optimum')
ax6.legend(fontsize=8)
ax6.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ============================================================
# 5. Learning Rate Sensitivity Analysis
# ============================================================
print("\n" + "="*70)
print("LEARNING RATE SENSITIVITY (Adam)")
print("="*70)
learning_rates = [0.001, 0.01, 0.1, 1.0]
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for idx, lr in enumerate(learning_rates):
traj = run_optimizer(Adam, x0, rosenbrock_grad, n_steps=50, lr=lr)
ax = axes[idx]
ax.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax.plot(1, 1, 'r*', markersize=15)
ax.plot(traj[:, 0], traj[:, 1], 'b-o', linewidth=2, markersize=3, alpha=0.8)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_title(f'Adam (lr={lr})')
ax.set_xlabel('x')
if idx == 0:
ax.set_ylabel('y')
final = traj[-1]
dist = np.linalg.norm(final - np.array([1, 1]))
print(f"lr = {lr:5.3f}: Final distance = {dist:8.5f}")
plt.tight_layout()
plt.show()
print("\n" + "="*70)
print("KEY OBSERVATIONS")
print("="*70)
print("1. Momentum methods accelerate in consistent directions")
print("2. NAG has better convergence than classical momentum")
print("3. Adaptive methods (Adam, RMSprop) auto-tune learning rates")
print("4. AdaGrad learning rate decays too aggressively")
print("5. Adam balances momentum + adaptive learning rates")
print("6. AdamW improves regularization via decoupled weight decay")
print("="*70)
# Gradient descent on 1D function
def f_1d(x):
"""Objective function: f(x) = (x-2)² + 1"""
return (x - 2)**2 + 1
def grad_f_1d(x):
"""Gradient: f'(x) = 2(x-2)"""
return 2 * (x - 2)
def gradient_descent_1d(x0, learning_rate=0.1, n_iterations=20):
"""
Perform gradient descent.
Returns:
history: List of (x, f(x)) tuples at each iteration
"""
x = x0
history = [(x, f_1d(x))]
for _ in range(n_iterations):
grad = grad_f_1d(x)
x = x - learning_rate * grad
history.append((x, f_1d(x)))
return history
# Run gradient descent
x0 = 5.0
history = gradient_descent_1d(x0, learning_rate=0.1, n_iterations=20)
# Extract trajectory
x_vals = np.array([h[0] for h in history])
f_vals = np.array([h[1] for h in history])
print("Gradient Descent (1D)")
print(f"\nObjective: f(x) = (x-2)² + 1")
print(f"Starting point: x₀ = {x0}")
print(f"Learning rate: α = 0.1")
print(f"\nIteration x f(x)")
print("-" * 35)
for i, (x, fx) in enumerate(history[:10]):
print(f"{i:5d} {x:8.4f} {fx:8.4f}")
print("...")
print(f"{len(history)-1:5d} {x_vals[-1]:8.4f} {f_vals[-1]:8.4f}")
print(f"\nTrue minimum: x* = 2, f(x*) = 1")
print(f"Found: x ≈ {x_vals[-1]:.4f}, f(x) ≈ {f_vals[-1]:.4f}")
# Visualize
x_plot = np.linspace(-1, 6, 300)
f_plot = f_1d(x_plot)
plt.figure(figsize=(12, 5))
# Function and trajectory
plt.subplot(1, 2, 1)
plt.plot(x_plot, f_plot, 'b-', linewidth=2, label='f(x)')
plt.plot(x_vals, f_vals, 'ro-', markersize=6, linewidth=1.5, label='GD trajectory')
plt.plot(2, 1, 'g*', markersize=20, label='True minimum')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Trajectory')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
plt.semilogy(f_vals - 1, 'r.-', linewidth=2, markersize=8)
plt.xlabel('Iteration')
plt.ylabel('f(x) - f(x*)')
plt.title('Convergence (log scale)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Gradient descent on 2D function
def f_2d(x):
"""Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²"""
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def grad_f_2d(x):
"""Gradient of Rosenbrock"""
dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dy = 200 * (x[1] - x[0]**2)
return np.array([dx, dy])
def gradient_descent_2d(x0, learning_rate=0.001, n_iterations=100):
"""Gradient descent in 2D."""
x = x0.copy()
history = [x.copy()]
for _ in range(n_iterations):
grad = grad_f_2d(x)
x = x - learning_rate * grad
history.append(x.copy())
return np.array(history)
# Run
x0 = np.array([-1.0, 1.0])
history = gradient_descent_2d(x0, learning_rate=0.001, n_iterations=500)
print("Gradient Descent (2D)")
print(f"\nObjective: Rosenbrock function")
print(f"f(x,y) = (1-x)² + 100(y-x²)²")
print(f"\nStarting point: x₀ = {x0}")
print(f"Learning rate: α = 0.001")
print(f"Iterations: 500")
print(f"\nTrue minimum: x* = [1, 1], f(x*) = 0")
print(f"Found: x ≈ {history[-1]}, f(x) ≈ {f_2d(history[-1]):.6f}")
# Visualize
x_range = np.linspace(-1.5, 1.5, 200)
y_range = np.linspace(-0.5, 1.5, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = np.zeros_like(X)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
Z[i, j] = f_2d(np.array([X[i, j], Y[i, j]]))
plt.figure(figsize=(14, 6))
# Contour plot
plt.subplot(1, 2, 1)
levels = np.logspace(-1, 3, 20)
plt.contour(X, Y, Z, levels=levels, cmap='viridis')
plt.plot(history[:, 0], history[:, 1], 'r.-', markersize=4, linewidth=1, alpha=0.7)
plt.plot(history[0, 0], history[0, 1], 'go', markersize=10, label='Start')
plt.plot(1, 1, 'r*', markersize=15, label='True minimum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('GD on Rosenbrock Function')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
f_history = [f_2d(x) for x in history]
plt.semilogy(f_history, 'b.-', linewidth=2, markersize=6)
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Convergence (log scale)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nRosenbrock is a challenging function with a narrow valley!")
# Effect of learning rate
def test_learning_rates():
"""Test different learning rates."""
x0 = np.array([5.0])
learning_rates = [0.01, 0.1, 0.5, 0.9]
plt.figure(figsize=(14, 8))
for idx, lr in enumerate(learning_rates, 1):
# Run GD
history = gradient_descent_1d(x0[0], learning_rate=lr, n_iterations=20)
x_vals = np.array([h[0] for h in history])
f_vals = np.array([h[1] for h in history])
plt.subplot(2, 2, idx)
x_plot = np.linspace(-1, 6, 300)
f_plot = f_1d(x_plot)
plt.plot(x_plot, f_plot, 'b-', linewidth=2, alpha=0.3)
plt.plot(x_vals, f_vals, 'ro-', markersize=6, linewidth=1.5)
plt.plot(2, 1, 'g*', markersize=15)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title(f'Learning rate α = {lr}')
plt.grid(True, alpha=0.3)
if lr == 0.9:
plt.ylim([0, 20])
plt.tight_layout()
plt.show()
print("Learning rate selection:")
print(" α too small → slow convergence")
print(" α too large → oscillation or divergence!")
print(" α just right → fast, stable convergence")
test_learning_rates()
# Gradient descent with momentum
def gradient_descent_momentum(x0, learning_rate=0.01, momentum=0.9, n_iterations=100):
"""
GD with momentum:
v_{t+1} = β*v_t + α*∇f(x_t)
x_{t+1} = x_t - v_{t+1}
"""
x = x0.copy()
v = np.zeros_like(x0)
history = [x.copy()]
for _ in range(n_iterations):
grad = grad_f_2d(x)
v = momentum * v + learning_rate * grad
x = x - v
history.append(x.copy())
return np.array(history)
# Compare vanilla GD vs GD with momentum
x0 = np.array([-1.0, 1.0])
n_iter = 200
history_vanilla = gradient_descent_2d(x0, learning_rate=0.001, n_iterations=n_iter)
history_momentum = gradient_descent_momentum(x0, learning_rate=0.001, momentum=0.9, n_iterations=n_iter)
print("Gradient Descent with Momentum")
print(f"\nVanilla GD after {n_iter} iterations:")
print(f" x = {history_vanilla[-1]}")
print(f" f(x) = {f_2d(history_vanilla[-1]):.6f}")
print(f"\nGD with momentum (β=0.9) after {n_iter} iterations:")
print(f" x = {history_momentum[-1]}")
print(f" f(x) = {f_2d(history_momentum[-1]):.6f}")
# Visualize
plt.figure(figsize=(14, 6))
# Trajectories
plt.subplot(1, 2, 1)
levels = np.logspace(-1, 3, 20)
plt.contour(X, Y, Z, levels=levels, cmap='viridis', alpha=0.3)
plt.plot(history_vanilla[:, 0], history_vanilla[:, 1], 'b.-',
markersize=3, linewidth=1, alpha=0.6, label='Vanilla GD')
plt.plot(history_momentum[:, 0], history_momentum[:, 1], 'r.-',
markersize=3, linewidth=1, alpha=0.6, label='GD + Momentum')
plt.plot(1, 1, 'g*', markersize=15, label='Minimum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Comparison of Trajectories')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
f_vanilla = [f_2d(x) for x in history_vanilla]
f_momentum = [f_2d(x) for x in history_momentum]
plt.semilogy(f_vanilla, 'b-', linewidth=2, label='Vanilla GD')
plt.semilogy(f_momentum, 'r-', linewidth=2, label='GD + Momentum')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Convergence Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nMomentum accelerates convergence in narrow valleys!")
7.2 Constrained Optimization and Lagrange Multipliers¶
Problem:¶
Minimize f(x) subject to constraints g(x) = 0
Lagrangian:¶
\(\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)\)
Optimality Conditions (KKT):¶
At optimal (x*, λ*):
\(\nabla_x \mathcal{L} = \nabla f(x^*) + \lambda^* \nabla g(x^*) = 0\)
\(g(x^*) = 0\) (feasibility)
Intuition: At optimum, gradient of f is parallel to gradient of g!
# Lagrange multipliers example
# Minimize f(x,y) = x² + y² subject to x + y = 1
from scipy.optimize import minimize
def objective(xy):
"""f(x,y) = x² + y²"""
return xy[0]**2 + xy[1]**2
def constraint(xy):
"""g(x,y) = x + y - 1 = 0"""
return xy[0] + xy[1] - 1
# Analytical solution using Lagrange multipliers
# L = x² + y² + λ(x + y - 1)
# ∂L/∂x = 2x + λ = 0 → x = -λ/2
# ∂L/∂y = 2y + λ = 0 → y = -λ/2
# ∂L/∂λ = x + y - 1 = 0
#
# Substituting: -λ/2 - λ/2 - 1 = 0 → λ = -1
# Therefore: x* = y* = 1/2
x_analytical = np.array([0.5, 0.5])
f_analytical = objective(x_analytical)
print("Constrained Optimization Example")
print("\nProblem:")
print(" minimize f(x,y) = x² + y²")
print(" subject to x + y = 1")
print("\nAnalytical solution (Lagrange multipliers):")
print(f" x* = {x_analytical}")
print(f" f(x*) = {f_analytical:.4f}")
print(f" Constraint satisfied? x + y = {x_analytical.sum():.4f}")
# Numerical solution
x0 = np.array([0.0, 1.0])
constraints = {'type': 'eq', 'fun': constraint}
result = minimize(objective, x0, constraints=constraints)
print("\nNumerical solution (scipy.optimize):")
print(f" x* = {result.x}")
print(f" f(x*) = {result.fun:.4f}")
print(f" Success? {result.success}")
# Visualize
x_range = np.linspace(-0.5, 1.5, 200)
y_range = np.linspace(-0.5, 1.5, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + Y**2
plt.figure(figsize=(10, 8))
contour = plt.contour(X, Y, Z, levels=20, cmap='viridis')
plt.clabel(contour, inline=True, fontsize=8)
# Constraint: x + y = 1 → y = 1 - x
x_constraint = np.linspace(-0.5, 1.5, 100)
y_constraint = 1 - x_constraint
plt.plot(x_constraint, y_constraint, 'r-', linewidth=3, label='Constraint: x+y=1')
# Optimal point
plt.plot(x_analytical[0], x_analytical[1], 'g*', markersize=20, label='Optimal x*')
# Gradients at optimal point
grad_f = 2 * x_analytical # ∇f = [2x, 2y]
grad_g = np.array([1, 1]) # ∇g = [1, 1]
# Normalize for visualization
grad_f_norm = grad_f / np.linalg.norm(grad_f) * 0.3
grad_g_norm = grad_g / np.linalg.norm(grad_g) * 0.3
plt.quiver(x_analytical[0], x_analytical[1], grad_f_norm[0], grad_f_norm[1],
color='blue', width=0.008, scale=1, scale_units='xy', label='∇f')
plt.quiver(x_analytical[0], x_analytical[1], grad_g_norm[0], grad_g_norm[1],
color='orange', width=0.008, scale=1, scale_units='xy', label='∇g')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Constrained Optimization\n∇f and ∇g are parallel at optimum!')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.xlim([-0.5, 1.5])
plt.ylim([-0.5, 1.5])
plt.show()
print("\nKey insight: At x*, gradients ∇f and ∇g are parallel!")
print("This is exactly what Lagrange multipliers enforce.")
7.3 Convex Optimization¶
Definition 7.2 (Convex Function)¶
f is convex if for all x, y and λ ∈ [0,1]:
\(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)\)
Geometric interpretation: The function lies below any chord connecting two points on its graph.
Properties:¶
Any local minimum is a global minimum
Gradient descent is guaranteed to converge
Efficiently solvable
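The defining inequality suggests a quick numerical sanity check (a heuristic, not a proof): sample point pairs and test the midpoint case \(f\!\left(\frac{x+y}{2}\right) \leq \frac{f(x)+f(y)}{2}\). Convex functions always pass; \(x^4 - x^2\) fails, e.g. at \(x = -0.5\), \(y = 0.5\).

```python
import numpy as np

# Heuristic midpoint-convexity check over random point pairs
def looks_convex(f, lo=-2.0, hi=2.0, n=200):
    rng = np.random.default_rng(0)
    xs, ys = rng.uniform(lo, hi, n), rng.uniform(lo, hi, n)
    mid = (xs + ys) / 2
    # small tolerance guards against floating-point noise at equality
    return bool(np.all(f(mid) <= (f(xs) + f(ys)) / 2 + 1e-12))

print(looks_convex(lambda x: x ** 2))           # True: passes everywhere
print(looks_convex(lambda x: x ** 4 - x ** 2))  # False: chord dips below
```

Passing this test does not prove convexity (it only samples finitely many pairs), but a single failure does disprove it.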
Examples:¶
Convex:
x²
|x|
exp(x)
-log(x) for x > 0
||x||₂²
Not convex:
sin(x)
x⁴ - x²
Neural networks (generally)
# Visualize convex vs non-convex
x = np.linspace(-3, 3, 300)
# Convex functions
f_convex1 = x**2
f_convex2 = np.abs(x)
f_convex3 = np.exp(x)
# Non-convex functions
f_nonconvex1 = np.sin(x)
f_nonconvex2 = x**4 - x**2
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
# Convex
axes[0, 0].plot(x, f_convex1, 'b-', linewidth=2)
axes[0, 0].set_title('f(x) = x² (Convex)', fontsize=12)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 1].plot(x, f_convex2, 'b-', linewidth=2)
axes[0, 1].set_title('f(x) = |x| (Convex)', fontsize=12)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 2].plot(x, f_convex3, 'b-', linewidth=2)
axes[0, 2].set_title('f(x) = exp(x) (Convex)', fontsize=12)
axes[0, 2].set_ylim([0, 20])
axes[0, 2].grid(True, alpha=0.3)
# Non-convex
axes[1, 0].plot(x, f_nonconvex1, 'r-', linewidth=2)
axes[1, 0].set_title('f(x) = sin(x) (Non-convex)', fontsize=12)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 1].plot(x, f_nonconvex2, 'r-', linewidth=2)
axes[1, 1].set_title('f(x) = x⁴ - x² (Non-convex)', fontsize=12)
axes[1, 1].grid(True, alpha=0.3)
# Demonstrate convexity definition
x_demo = np.linspace(-2, 2, 100)
f_demo = x_demo**2
# Two points
x1, x2 = -1.5, 1.5
f1, f2 = x1**2, x2**2
# Line connecting them
x_line = np.linspace(x1, x2, 50)
f_line = f1 + (f2 - f1) * (x_line - x1) / (x2 - x1)
axes[1, 2].plot(x_demo, f_demo, 'b-', linewidth=2, label='f(x) = x²')
axes[1, 2].plot(x_line, f_line, 'g--', linewidth=2, label='Linear interpolation')
axes[1, 2].plot([x1, x2], [f1, f2], 'ro', markersize=10)
axes[1, 2].fill_between(x_line, f_demo[np.searchsorted(x_demo, x_line)], f_line,
alpha=0.3, color='yellow', label='f ≤ line')
axes[1, 2].set_title('Convexity Definition', fontsize=12)
axes[1, 2].legend(fontsize=9)
axes[1, 2].grid(True, alpha=0.3)
for ax in axes.flat:
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
plt.tight_layout()
plt.show()
print("Convex functions: curve lies BELOW any connecting line")
print("Non-convex: can have multiple local minima!")
# Demonstrate: Convex → unique global minimum
# Non-convex → multiple local minima
def convex_function(x):
return x**2 + 2*x + 1
def nonconvex_function(x):
return x**4 - 4*x**2 + 2*x
# Test gradient descent from multiple starting points
starting_points = [-3, -1, 0, 1, 3]
def grad_convex(x):
return 2*x + 2
def grad_nonconvex(x):
return 4*x**3 - 8*x + 2
def gd_1d_general(x0, grad_func, lr=0.01, n_iter=100):
x = x0
for _ in range(n_iter):
x = x - lr * grad_func(x)
return x
# Run GD
x_plot = np.linspace(-3, 3, 300)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Convex
axes[0].plot(x_plot, convex_function(x_plot), 'b-', linewidth=2)
results_convex = []
for x0 in starting_points:
x_final = gd_1d_general(x0, grad_convex, lr=0.1, n_iter=50)
results_convex.append(x_final)
axes[0].plot(x0, convex_function(x0), 'ro', markersize=8)
axes[0].plot(x_final, convex_function(x_final), 'g*', markersize=15)
axes[0].annotate('', xy=(x_final, convex_function(x_final)),
xytext=(x0, convex_function(x0)),
arrowprops=dict(arrowstyle='->', color='red', alpha=0.5))
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Convex: All paths lead to same minimum!')
axes[0].grid(True, alpha=0.3)
# Non-convex
axes[1].plot(x_plot, nonconvex_function(x_plot), 'r-', linewidth=2)
results_nonconvex = []
for x0 in starting_points:
x_final = gd_1d_general(x0, grad_nonconvex, lr=0.01, n_iter=100)
results_nonconvex.append(x_final)
axes[1].plot(x0, nonconvex_function(x0), 'ro', markersize=8)
axes[1].plot(x_final, nonconvex_function(x_final), 'g*', markersize=15)
axes[1].annotate('', xy=(x_final, nonconvex_function(x_final)),
xytext=(x0, nonconvex_function(x0)),
arrowprops=dict(arrowstyle='->', color='red', alpha=0.5))
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Non-convex: Different local minima!')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Convex optimization results (from different starting points):")
print(f" Converged to: {results_convex}")
print(f" All same? {len(set(np.round(results_convex, 2))) == 1}")
print("\nNon-convex optimization results:")
print(f" Converged to: {np.round(results_nonconvex, 2)}")
print(f" Different minima found!")
print("\n" + "="*60)
print("This is why convex optimization is so valuable in ML!")
Summary¶
Key Concepts from Chapter 7:¶
Gradient Descent: x_{t+1} = x_t - α∇f(x_t)
Learning Rate: Controls step size (critical!)
Momentum: Accelerates convergence
Lagrange Multipliers: Handle constraints
Convexity: Guarantees global optimum
KKT Conditions: Optimality for constrained problems
Optimization Variants:¶
| Method | Update Rule | Advantage |
|---|---|---|
| Batch GD | Full gradient | Stable |
| SGD | Single sample | Fast, online |
| Mini-batch | Batch gradient | Good trade-off |
| Momentum | v = βv + α∇f | Accelerates |
| Adam | Adaptive rates | State-of-the-art |
ML Applications:¶
| Concept | ML Application |
|---|---|
| Gradient descent | Training all models |
| SGD | Deep learning |
| Momentum | Faster training |
| Convex optimization | Linear regression, SVM |
| Lagrange multipliers | SVM dual problem |
Convexity Benefits:¶
✓ Unique global minimum
✓ Guaranteed convergence
✓ Efficient algorithms
✓ Theoretical guarantees
Next Steps:¶
Chapter 9: Linear Regression (uses gradient descent)
Chapter 10: PCA (eigenvalue optimization)
Chapter 12: SVM (constrained optimization)
Practice: Implement Adam optimizer and compare with vanilla GD!