# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from scipy.optimize import minimize, minimize_scalar
from matplotlib import animation
from IPython.display import HTML
sns.set_style('whitegrid')
np.random.seed(42)
np.set_printoptions(precision=4, suppress=True)
7.1 Gradient Descent¶
Algorithm: Gradient Descent¶
Iterative update:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t)\)
where:
α: learning rate (step size)
∇f(x_t): gradient at current point
Intuition: Move in the direction of steepest descent!
Variants:¶
Batch GD: Use all data
Stochastic GD (SGD): Use one sample at a time
Mini-batch GD: Use small batches
Momentum: Accumulate velocity
Adam: Adaptive learning rates
Advanced Optimization Theory for Deep Learning¶
1. Convergence Analysis of Gradient Descent¶
Smooth Convex Functions¶
For \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) that is \(L\)-smooth (Lipschitz continuous gradient):
Theorem (GD Convergence): With step size \(\alpha \leq \frac{1}{L}\):
\(f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2\alpha t}\)
Rate: \(O(1/t)\) (sublinear)
Strongly Convex Functions¶
If \(f\) is additionally \(\mu\)-strongly convex:
Condition number: \(\kappa = L/\mu\)
Theorem (Strong Convexity): With \(\alpha = \frac{1}{L}\):
\(\|\mathbf{x}_t - \mathbf{x}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^t \|\mathbf{x}_0 - \mathbf{x}^*\|^2\)
Rate: \(O((1-\mu/L)^t)\) (linear/exponential)
Interpretation: Convergence speed depends on \(\kappa\):
\(\kappa \approx 1\): Fast (well-conditioned)
\(\kappa \gg 1\): Slow (ill-conditioned, elongated valley)
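The dependence on the condition number can be checked numerically. The sketch below runs gradient descent with \(\alpha = 1/L\) on a toy quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}\) with \(\mathbf{A} = \mathrm{diag}(\mu, L)\); the specific values \(\mu = 1\), \(L = 10\) are illustrative choices, not from the text.

```python
import numpy as np

# GD on f(x) = 1/2 x^T A x with A = diag(mu, L) and alpha = 1/L.
# The high-curvature component is annihilated immediately (1 - alpha*L = 0);
# the low-curvature component decays geometrically at rate 1 - mu/L.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
alpha = 1.0 / L

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - alpha * (A @ x)          # gradient of 1/2 x^T A x is A x

print(x[0], (1 - mu / L) ** 50)      # slow component matches (1 - mu/L)^t
print(x[1])                          # fast component is exactly 0
```

The slow eigendirection is what makes ill-conditioned problems painful: its error shrinks by only \(1 - \mu/L\) per step.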
2. Stochastic Gradient Descent (SGD)¶
Replace the full gradient with a noisy estimate from a mini-batch:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \nabla f_{\mathcal{B}_t}(\mathbf{x}_t)\)
where \(\mathcal{B}_t\) is random mini-batch, \(\mathbb{E}[\nabla f_{\mathcal{B}_t}] = \nabla f\).
SGD Convergence (Convex Case)¶
Theorem: With decreasing step size \(\alpha_t = \frac{C}{\sqrt{t}}\):
\(\mathbb{E}[f(\bar{\mathbf{x}}_T)] - f(\mathbf{x}^*) \leq O\!\left(\frac{1}{\sqrt{T}}\right)\)
where \(\bar{\mathbf{x}}_T = \frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\) (averaged iterates).
Trade-off:
Constant step size: Faster initially, but oscillates near optimum
Decreasing step size: Converges to optimum, but slower
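This trade-off shows up even on a one-dimensional toy problem. The sketch below (an illustrative setup, not from the text) minimizes \(\mathbb{E}[(w - d)^2/2]\) over samples \(d \sim \mathcal{N}(3, 1)\), whose optimum is \(w^* = \mathbb{E}[d] = 3\), comparing a constant step size against \(\alpha_t = C/\sqrt{t}\).

```python
import numpy as np

# SGD on E[(w - d)^2 / 2] with d ~ N(3, 1): the stochastic gradient for
# sample d_t is simply (w - d_t).
rng = np.random.default_rng(0)
d = rng.normal(3.0, 1.0, size=2000)

def sgd(steps, lr_fn, w0=0.0):
    w, iterates = w0, []
    for t, d_t in enumerate(d[:steps], start=1):
        w -= lr_fn(t) * (w - d_t)
        iterates.append(w)
    return np.array(iterates)

w_const = sgd(2000, lambda t: 0.1)               # constant: keeps oscillating
w_decr = sgd(2000, lambda t: 0.5 / np.sqrt(t))   # C/sqrt(t): settles down

# The constant-step run fluctuates more around w* = 3 near the end
print(np.std(w_const[-500:]), np.std(w_decr[-500:]))
```

The tail of the constant-step run keeps a noise floor set by \(\alpha\), while the decreasing schedule shrinks it over time.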
Variance Reduction¶
Problem: SGD has high variance \(\text{Var}(\nabla f_{\mathcal{B}}) = \sigma^2/|\mathcal{B}|\)
Solutions:
SVRG (Stochastic Variance Reduced Gradient): Periodically compute full gradient
SAGA: Maintain table of per-sample gradients
Large batch training: Use batch size \(\propto \sqrt{T}\) for linear speedup
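The SVRG control-variate idea can be sketched on a toy finite sum (an illustrative problem, not from the text): each epoch takes a snapshot \(\tilde{w}\) and its full gradient, and inner steps correct the per-sample gradient by \(g_i(w) - g_i(\tilde{w}) + \nabla f(\tilde{w})\). For this particular quadratic the per-sample noise cancels exactly, which makes the variance-reduction mechanism easy to see.

```python
import numpy as np

# SVRG sketch on f(w) = (1/n) sum_i (w - d_i)^2 / 2, optimum w* = mean(d)
rng = np.random.default_rng(1)
d = rng.normal(3.0, 1.0, size=100)

w, lr = 0.0, 0.5
for epoch in range(10):
    w_snap = w
    full_grad = w_snap - d.mean()      # full gradient at the snapshot
    for i in rng.integers(0, len(d), size=20):
        # variance-reduced gradient: g_i(w) - g_i(w_snap) + full_grad
        g = (w - d[i]) - (w_snap - d[i]) + full_grad
        w -= lr * g

print(w, d.mean())                     # w converges to mean(d)
```

Here \(g = w - \bar{d}\) exactly, so the iteration contracts deterministically; on general losses the correction only reduces, not eliminates, the variance.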
3. Momentum Methods¶
Classical Momentum (Polyak, 1964)¶
\(\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_{t+1}\)
Physical analogy: Ball rolling downhill with friction coefficient \((1-\beta)\).
Effective step: \(\mathbf{x}_{t+1} - \mathbf{x}_t = -\alpha \sum_{i=0}^t \beta^i \nabla f(\mathbf{x}_{t-i})\)
Benefits:
Accelerates in consistent directions
Dampens oscillations in high-curvature directions
Escapes shallow local minima
Optimal \(\beta\): For quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}\) with condition number \(\kappa\):
\(\beta^* = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^2\)
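The payoff of tuned momentum is easy to check numerically. The sketch below (a toy comparison, with the textbook choices \(\beta = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\) and \(\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}\) for heavy-ball) runs plain GD and heavy-ball momentum on a quadratic with \(\kappa = 100\).

```python
import numpy as np

# Heavy-ball momentum vs. plain GD on f(x) = 1/2 x^T A x, A = diag(1, 100)
mu, L = 1.0, 100.0
A = np.diag([mu, L])
kappa = L / mu
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2  # tuned momentum
alpha_mom = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2          # matching step size
alpha_gd = 1.0 / L

x_gd = np.array([1.0, 1.0])
x_mom = np.array([1.0, 1.0])
x_prev = x_mom.copy()
for _ in range(200):
    x_gd = x_gd - alpha_gd * (A @ x_gd)
    # heavy-ball: x_{t+1} = x_t - alpha*grad + beta*(x_t - x_{t-1})
    x_new = x_mom - alpha_mom * (A @ x_mom) + beta * (x_mom - x_prev)
    x_prev, x_mom = x_mom, x_new

print(np.linalg.norm(x_gd), np.linalg.norm(x_mom))  # momentum is far closer to 0
```

After 200 steps GD has only contracted the slow component by \((1 - 1/\kappa)^{200} \approx 0.13\), while momentum's error is essentially zero.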
Nesterov Accelerated Gradient (NAG)¶
“Look-ahead” gradient: evaluate the gradient at the momentum-predicted point:
\(\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t - \alpha \beta \mathbf{v}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_{t+1}\)
Convergence rate (strongly convex): \(O\left((1 - \frac{1}{\sqrt{\kappa}})^t\right)\) vs. GD’s \(O\left((1-\frac{1}{\kappa})^t\right)\)
Improvement: \(\sqrt{\kappa}\) vs. \(\kappa\) in exponent!
4. Adaptive Learning Rates¶
AdaGrad (Duchi et al., 2011)¶
Per-parameter learning rates based on historical gradients:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{\mathbf{G}_t} + \epsilon} \odot \nabla f(\mathbf{x}_t)\)
where \(\mathbf{G}_t = \sum_{\tau=1}^t \nabla f(\mathbf{x}_\tau)^2\) (element-wise squares).
Effect: Smaller updates for frequently updated parameters, larger for sparse.
Problem: \(\mathbf{G}_t\) grows monotonically → learning rate decays to zero.
RMSprop (Hinton, 2012)¶
Exponential moving average of squared gradients:
\(\mathbf{v}_t = \gamma \mathbf{v}_{t-1} + (1-\gamma)\nabla f(\mathbf{x}_t)^2, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{\mathbf{v}_t} + \epsilon} \odot \nabla f(\mathbf{x}_t)\)
Fix: Decaying average prevents indefinite shrinking.
Typical: \(\gamma = 0.9\), \(\epsilon = 10^{-8}\)
5. Adam: Adaptive Moment Estimation (Kingma & Ba, 2015)¶
Combines momentum and adaptive learning rates:
\(\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2\)
\(\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha\, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\)
Defaults: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\alpha = 0.001\), \(\epsilon = 10^{-8}\)
Bias correction rationale: At \(t=0\), \(\mathbf{m}_0 = \mathbf{v}_0 = 0\). Without correction:
\(\mathbf{m}_1 = (1-\beta_1) \mathbf{g}_1 \approx 0.1 \mathbf{g}_1\) (underestimate if \(\mathbf{g}_1\) is signal)
Correction: \(\hat{\mathbf{m}}_1 = \frac{0.1 \mathbf{g}_1}{1 - 0.9} = \mathbf{g}_1\) ✓
Effective learning rate per parameter:
\(\alpha_{\text{eff},i} = \frac{\alpha}{\sqrt{\hat{v}_{t,i}} + \epsilon}\)
Typically \(\alpha_{\text{eff},i} \in [10^{-5}, 10^{-2}]\) even with fixed \(\alpha = 10^{-3}\).
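The bias-correction arithmetic above can be verified in two lines. This is a minimal numeric check of the \(t = 1\) case with the default \(\beta_1 = 0.9\) (the gradient values are arbitrary illustrative numbers).

```python
import numpy as np

# Adam bias correction at t = 1: the raw first moment is 0.1 * g1
# (since m_0 = 0), and dividing by (1 - beta1^1) restores g1 exactly.
beta1 = 0.9
g1 = np.array([0.5, -2.0])

m1 = (1 - beta1) * g1              # raw first moment: underestimates by 10x
m1_hat = m1 / (1 - beta1 ** 1)     # bias-corrected: equals g1

print(m1, m1_hat)
```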
Adam Variants¶
| Variant | Modification | Benefit |
|---|---|---|
| AdamW | Decoupled weight decay: \(\mathbf{x}_t \leftarrow (1-\lambda)\mathbf{x}_t - \alpha \hat{\mathbf{m}}_t/\sqrt{\hat{\mathbf{v}}_t}\) | Better regularization |
| AMSGrad | \(\hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t)\) (non-decreasing) | Convergence guarantee |
| RAdam | Rectified bias correction with warmup | Stable early training |
| AdaBelief | Adapt to gradient prediction error \((\mathbf{g}_t - \mathbf{m}_t)^2\) | Better generalization |
6. AdamW: Weight Decay Done Right¶
Problem with Adam + L2 regularization:
L2 loss: \(\mathcal{L}(\mathbf{x}) = f(\mathbf{x}) + \frac{\lambda}{2}\|\mathbf{x}\|^2\)
Gradient: \(\nabla \mathcal{L} = \nabla f(\mathbf{x}) + \lambda \mathbf{x}\)
Adam applies adaptive learning rate to \(\lambda \mathbf{x}\) → incorrect scaling.
Solution (Loshchilov & Hutter, 2019): Decouple weight decay from the gradient-based update:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \mathbf{x}_t\right)\)
Effect: Weight decay applied with constant rate \(\alpha \lambda\), independent of adaptive scaling.
Empirical impact:
Transformers: AdamW with \(\lambda = 0.01\) outperforms Adam + L2
Better generalization on vision tasks
7. Learning Rate Schedules¶
Step Decay¶
\(\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/T \rfloor}\)
Typical: \(\gamma = 0.1\), decay every \(T\) epochs.
Exponential Decay¶
\(\alpha_t = \alpha_0 e^{-kt}\)
Cosine Annealing (Loshchilov & Hutter, 2017)¶
\(\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\frac{\pi t}{T}\right)\)
Benefit: Smooth decay; with restarts, the rate periodically jumps back up.
Cosine Annealing with Warm Restarts (SGDR):
Restart schedule at \(T_i = T_0 \cdot T_{\text{mult}}^i\).
Intuition: Escape local minima by restarting with higher learning rate.
Warmup (Goyal et al., 2017)¶
Linear increase from 0 to \(\alpha_{\max}\) over the first \(T_w\) steps:
\(\alpha_t = \alpha_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w\)
Why needed: A large initial learning rate combined with Adam's early, noisy moment estimates can cause instability.
Transformer standard: Warmup for 4000-10000 steps.
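The four schedules above reduce to one-line functions of the step count. The sketch below implements them with illustrative hyperparameter values (\(\alpha_0\), \(\gamma\), \(T\), \(T_w\), \(k\) follow the text's notation; the numbers are assumptions for the demo).

```python
import numpy as np

# Learning-rate schedules as plain functions of the step t
alpha0, alpha_min, alpha_max = 0.1, 0.0, 0.1
gamma, T, T_w, k = 0.1, 30, 10, 0.05

def step_decay(t):
    return alpha0 * gamma ** (t // T)           # drop by gamma every T steps

def exponential_decay(t):
    return alpha0 * np.exp(-k * t)              # smooth exponential decay

def cosine_annealing(t):
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + np.cos(np.pi * t / T))

def warmup(t):
    return alpha_max * min(t, T_w) / T_w        # linear ramp, then flat

print(step_decay(0), cosine_annealing(0), cosine_annealing(T), warmup(T_w))
```

At \(t = 0\) cosine annealing starts at \(\alpha_{\max}\) and reaches \(\alpha_{\min}\) at \(t = T\); restarts simply reset \(t\) to 0.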
8. Second-Order Methods¶
Newton’s Method¶
Use the second-order Taylor expansion:
\(f(\mathbf{x}_t + \Delta\mathbf{x}) \approx f(\mathbf{x}_t) + \nabla f(\mathbf{x}_t)^T \Delta\mathbf{x} + \frac{1}{2}\Delta\mathbf{x}^T \mathbf{H}_t \Delta\mathbf{x}\)
Minimize w.r.t. \(\Delta \mathbf{x}\):
\(\Delta\mathbf{x}^* = -\mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t)\)
Update: \(\mathbf{x}_{t+1} = \mathbf{x}_t - \mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t)\)
Pros:
Quadratic convergence: \(\|\mathbf{x}_{t+1} - \mathbf{x}^*\| \leq C \|\mathbf{x}_t - \mathbf{x}^*\|^2\)
Scale-invariant (automatic preconditioning)
Cons:
Compute Hessian \(\mathbf{H} \in \mathbb{R}^{d \times d}\): \(O(d^2)\) memory
Invert Hessian: \(O(d^3)\) time
Hessian may not be positive definite
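Newton's method is easy to try on the Rosenbrock function (the same test function used in the code later in this chapter), since its Hessian has a simple closed form. A sketch with an analytic Hessian:

```python
import numpy as np

# Newton's method on Rosenbrock f(x,y) = (1-x)^2 + 100(y-x^2)^2.
# From x0 = (-1, 1), pure Newton reaches the minimizer (1, 1) in two steps.
def grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

def hessian(x):
    return np.array([[2 - 400 * (x[1] - x[0] ** 2) + 800 * x[0] ** 2, -400 * x[0]],
                     [-400 * x[0], 200.0]])

x = np.array([-1.0, 1.0])
for _ in range(5):
    # x_{t+1} = x_t - H^{-1} grad  (solve, never invert explicitly)
    x = x - np.linalg.solve(hessian(x), grad(x))

print(x)   # converges to [1, 1]
```

Compare this with the hundreds of iterations first-order methods need on the same function; the catch is the \(O(d^3)\) solve, which is why this only scales to small \(d\).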
Quasi-Newton: L-BFGS¶
Approximate \(\mathbf{H}^{-1}\) using gradient history (last \(m\) steps):
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{B}_t \nabla f(\mathbf{x}_t)\)
where \(\mathbf{B}_t \approx \mathbf{H}_t^{-1}\) is updated via BFGS formula.
Memory: \(O(md)\) instead of \(O(d^2)\)
Typical: \(m = 10-20\)
Use case: Small to medium \(d\) (\(d \lesssim 10^6\)), full-batch optimization.
Problem for SGD: Stochastic curvature estimates are too noisy.
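In the full-batch regime, L-BFGS is available off the shelf. A sketch using SciPy's `L-BFGS-B` solver on the Rosenbrock function (the `maxcor` option sets the history length \(m\)):

```python
import numpy as np
from scipy.optimize import minimize

# Full-batch L-BFGS on Rosenbrock via SciPy
def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

res = minimize(rosenbrock, x0=np.array([-1.0, 1.0]),
               jac=rosenbrock_grad, method='L-BFGS-B',
               options={'maxcor': 10})   # keep m = 10 history pairs

print(res.x, res.nit)   # converges to [1, 1]
```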
Natural Gradient (Amari, 1998)¶
Use Fisher information matrix \(\mathbf{F}\) instead of Hessian:
Update: \(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{F}^{-1} \nabla f(\mathbf{x}_t)\)
Interpretation: Gradient in parameter space with Riemannian metric.
Advantage: Invariant to reparameterization of model.
Approximations: K-FAC (Kronecker-factored approximate curvature) for neural networks.
9. Gradient Clipping and Scaling¶
Gradient Clipping (Pascanu et al., 2013)¶
Prevent exploding gradients in RNNs by rescaling whenever the norm exceeds a threshold:
\(\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right)\mathbf{g}\)
Effect: Cap gradient norm at threshold \(\theta\) (typical: \(\theta = 1\) or \(5\)).
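Clipping by global norm is a three-line function; a minimal sketch:

```python
import numpy as np

# Clip a gradient vector so its norm never exceeds the threshold;
# gradients already under the threshold pass through unchanged.
def clip_by_norm(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                      # norm 5
g_clipped = clip_by_norm(g, threshold=1.0)
print(g_clipped, np.linalg.norm(g_clipped))   # direction kept, norm capped at 1
```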
Gradient Scaling¶
For mixed-precision training (FP16):
Compute gradients, then unscale before update.
Reason: FP16 underflows for small gradients → scale the loss by \(s \in [2^{10}, 2^{15}]\).
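The underflow problem is easy to reproduce with NumPy's `float16` type. A sketch (the scale factor \(2^{12}\) is an arbitrary value in the usual range):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (2^-24 ~ 6e-8),
# so it underflows to zero; after scaling by 2^12 it survives.
tiny_grad = 1e-8
scale = 2.0 ** 12

unscaled = np.float16(tiny_grad)          # underflows to 0.0
scaled = np.float16(tiny_grad * scale)    # representable in FP16

# Unscale in FP32 before the optimizer update
recovered = np.float32(scaled) / scale
print(unscaled, scaled, recovered)
```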
10. Optimization Landscape of Neural Networks¶
Loss Surface Geometry¶
High-dimensional phenomena:
Saddle points dominate: Most critical points are saddles, not local minima
Mode connectivity: Different minima connected by low-loss paths
Wide vs. sharp minima: Wide minima generalize better (flatter curvature)
Theorem (Dauphin et al., 2014): For random high-dimensional functions, probability of local minimum decreases exponentially with dimension.
Lottery Ticket Hypothesis (Frankle & Carbin, 2019)¶
Claim: Dense networks contain sparse subnetworks that can train to similar accuracy.
Finding: Initialization matters more than commonly thought.
11. Practical Recommendations¶
| Task | Optimizer | Learning Rate | Schedule |
|---|---|---|---|
| Computer Vision | AdamW | \(10^{-4}\) to \(10^{-3}\) | Cosine annealing |
| NLP (Transformers) | AdamW | \(10^{-4}\) | Warmup + linear decay |
| Reinforcement Learning | Adam | \(3 \times 10^{-4}\) | Constant or linear decay |
| Small datasets | SGD + momentum | \(0.01\) to \(0.1\) | Step decay (÷10 at epochs 60, 80) |
| Large-scale vision | LAMB/LARS | Scaled with batch size | Warmup + polynomial decay |
Hyperparameter tuning priority:
Learning rate (most important!)
Batch size
Optimizer choice
Weight decay
\(\beta\) coefficients (Adam)
Debugging tips:
Loss exploding → reduce learning rate or use gradient clipping
Slow convergence → increase learning rate or use adaptive optimizer
Overfitting → add weight decay, dropout, or data augmentation
Training unstable → add warmup, reduce batch size, use gradient accumulation
# Advanced Optimizer Implementations and Comparisons
import numpy as np
import matplotlib.pyplot as plt
from typing import Callable, Tuple, Dict, List
import seaborn as sns
# ============================================================
# 1. Optimizer Implementations
# ============================================================
class Optimizer:
"""Base optimizer class"""
def __init__(self, params: np.ndarray, lr: float):
self.params = params.copy()
self.lr = lr
self.history = [params.copy()]
def step(self, grad: np.ndarray):
raise NotImplementedError
def get_history(self):
return np.array(self.history)
class SGD(Optimizer):
"""Stochastic Gradient Descent"""
def step(self, grad: np.ndarray):
self.params -= self.lr * grad
self.history.append(self.params.copy())
class MomentumSGD(Optimizer):
"""SGD with Momentum"""
def __init__(self, params: np.ndarray, lr: float, beta: float = 0.9):
super().__init__(params, lr)
self.beta = beta
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.v = self.beta * self.v + grad
self.params -= self.lr * self.v
self.history.append(self.params.copy())
class NAG(Optimizer):
"""Nesterov Accelerated Gradient"""
def __init__(self, params: np.ndarray, lr: float, beta: float = 0.9):
super().__init__(params, lr)
self.beta = beta
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
        # simplified NAG: the look-ahead is folded into the update below,
        # which effectively steps along beta*v_{t+1} + grad
self.v = self.beta * self.v + grad
self.params -= self.lr * (self.beta * self.v + grad)
self.history.append(self.params.copy())
class AdaGrad(Optimizer):
"""AdaGrad: Adaptive Gradient"""
def __init__(self, params: np.ndarray, lr: float, eps: float = 1e-8):
super().__init__(params, lr)
self.eps = eps
self.G = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.G += grad ** 2
self.params -= self.lr * grad / (np.sqrt(self.G) + self.eps)
self.history.append(self.params.copy())
class RMSprop(Optimizer):
"""RMSprop: Root Mean Square Propagation"""
def __init__(self, params: np.ndarray, lr: float, gamma: float = 0.9, eps: float = 1e-8):
super().__init__(params, lr)
self.gamma = gamma
self.eps = eps
self.v = np.zeros_like(params)
def step(self, grad: np.ndarray):
self.v = self.gamma * self.v + (1 - self.gamma) * grad ** 2
self.params -= self.lr * grad / (np.sqrt(self.v) + self.eps)
self.history.append(self.params.copy())
class Adam(Optimizer):
"""Adam: Adaptive Moment Estimation"""
def __init__(self, params: np.ndarray, lr: float, beta1: float = 0.9,
beta2: float = 0.999, eps: float = 1e-8):
super().__init__(params, lr)
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = np.zeros_like(params)
self.v = np.zeros_like(params)
self.t = 0
def step(self, grad: np.ndarray):
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
# Bias correction
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
self.params -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
self.history.append(self.params.copy())
class AdamW(Optimizer):
"""AdamW: Adam with decoupled Weight Decay"""
def __init__(self, params: np.ndarray, lr: float, beta1: float = 0.9,
beta2: float = 0.999, eps: float = 1e-8, weight_decay: float = 0.01):
super().__init__(params, lr)
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.weight_decay = weight_decay
self.m = np.zeros_like(params)
self.v = np.zeros_like(params)
self.t = 0
def step(self, grad: np.ndarray):
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
# AdamW: Decoupled weight decay
self.params -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * self.params)
self.history.append(self.params.copy())
# ============================================================
# 2. Test Functions
# ============================================================
def rosenbrock(x: np.ndarray) -> float:
"""Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²"""
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x: np.ndarray) -> np.ndarray:
"""Gradient of Rosenbrock"""
dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dy = 200 * (x[1] - x[0]**2)
return np.array([dx, dy])
def beale(x: np.ndarray) -> float:
"""Beale function: highly non-convex"""
return ((1.5 - x[0] + x[0]*x[1])**2 +
(2.25 - x[0] + x[0]*x[1]**2)**2 +
(2.625 - x[0] + x[0]*x[1]**3)**2)
def beale_grad(x: np.ndarray) -> np.ndarray:
"""Gradient of Beale (numerical approximation)"""
eps = 1e-8
grad = np.zeros(2)
for i in range(2):
x_plus = x.copy()
x_plus[i] += eps
grad[i] = (beale(x_plus) - beale(x)) / eps
return grad
# ============================================================
# 3. Optimizer Comparison
# ============================================================
def run_optimizer(opt_class, params_init: np.ndarray, grad_fn: Callable,
n_steps: int, **kwargs) -> np.ndarray:
"""Run optimizer for n_steps"""
opt = opt_class(params_init, **kwargs)
for _ in range(n_steps):
grad = grad_fn(opt.params)
opt.step(grad)
return opt.get_history()
# Test parameters
x0 = np.array([-1.0, 1.0])
n_steps = 100
# Run all optimizers
optimizers = {
'SGD': (SGD, {'lr': 0.001}),
'Momentum': (MomentumSGD, {'lr': 0.001, 'beta': 0.9}),
'NAG': (NAG, {'lr': 0.001, 'beta': 0.9}),
'AdaGrad': (AdaGrad, {'lr': 0.1}),
'RMSprop': (RMSprop, {'lr': 0.01, 'gamma': 0.9}),
'Adam': (Adam, {'lr': 0.01, 'beta1': 0.9, 'beta2': 0.999}),
'AdamW': (AdamW, {'lr': 0.01, 'weight_decay': 0.01}),
}
trajectories = {}
for name, (opt_class, kwargs) in optimizers.items():
trajectories[name] = run_optimizer(opt_class, x0, rosenbrock_grad, n_steps, **kwargs)
print("="*70)
print("OPTIMIZER COMPARISON ON ROSENBROCK FUNCTION")
print("="*70)
print(f"Starting point: x₀ = {x0}")
print(f"True minimum: x* = [1, 1], f(x*) = 0")
print(f"\nAfter {n_steps} steps:")
print(f"{'Optimizer':<12} {'Final x':<20} {'Final f(x)':<15} {'Distance to x*'}")
print("-"*70)
for name, traj in trajectories.items():
final = traj[-1]
f_final = rosenbrock(final)
dist = np.linalg.norm(final - np.array([1, 1]))
print(f"{name:<12} [{final[0]:6.3f}, {final[1]:6.3f}] {f_final:12.6f} {dist:8.5f}")
# ============================================================
# 4. Visualization
# ============================================================
fig = plt.figure(figsize=(18, 12))
# Create contour plot
x_range = np.linspace(-1.5, 1.5, 300)
y_range = np.linspace(-0.5, 1.5, 300)
X, Y = np.meshgrid(x_range, y_range)
Z = np.array([[rosenbrock(np.array([x, y])) for x in x_range] for y in y_range])
# Plot 1: All optimizers
ax1 = plt.subplot(2, 3, 1)
levels = np.logspace(-1, 3, 20)
cs = ax1.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax1.plot(1, 1, 'r*', markersize=20, label='Optimum')
colors = plt.cm.tab10(np.linspace(0, 1, len(trajectories)))
for (name, traj), color in zip(trajectories.items(), colors):
ax1.plot(traj[:, 0], traj[:, 1], '-', color=color, linewidth=1.5,
label=name, alpha=0.7)
ax1.plot(traj[0, 0], traj[0, 1], 'o', color=color, markersize=8)
ax1.set_xlim(-1.5, 1.5)
ax1.set_ylim(-0.5, 1.5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Optimizer Trajectories on Rosenbrock')
ax1.legend(fontsize=8, loc='upper left')
ax1.grid(True, alpha=0.3)
# Plot 2-4: Individual optimizer comparisons
for idx, (group_name, opt_names) in enumerate([
('First-order', ['SGD', 'Momentum', 'NAG']),
('Adaptive', ['AdaGrad', 'RMSprop']),
('Adam variants', ['Adam', 'AdamW'])
], 2):
ax = plt.subplot(2, 3, idx)
ax.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax.plot(1, 1, 'r*', markersize=15)
for name in opt_names:
if name in trajectories:
traj = trajectories[name]
ax.plot(traj[:, 0], traj[:, 1], '-o', linewidth=2, markersize=3,
label=name, alpha=0.8)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title(f'{group_name} Optimizers')
ax.legend()
ax.grid(True, alpha=0.3)
# Plot 5: Convergence curves
ax5 = plt.subplot(2, 3, 5)
for name, traj in trajectories.items():
f_values = np.array([rosenbrock(x) for x in traj])
ax5.semilogy(f_values, linewidth=2, label=name, alpha=0.8)
ax5.set_xlabel('Iteration')
ax5.set_ylabel('f(x) (log scale)')
ax5.set_title('Convergence Comparison')
ax5.legend(fontsize=8)
ax5.grid(True, alpha=0.3)
# Plot 6: Distance to optimum
ax6 = plt.subplot(2, 3, 6)
for name, traj in trajectories.items():
distances = np.array([np.linalg.norm(x - np.array([1, 1])) for x in traj])
ax6.semilogy(distances, linewidth=2, label=name, alpha=0.8)
ax6.set_xlabel('Iteration')
ax6.set_ylabel('||x - x*|| (log scale)')
ax6.set_title('Distance to Optimum')
ax6.legend(fontsize=8)
ax6.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ============================================================
# 5. Learning Rate Sensitivity Analysis
# ============================================================
print("\n" + "="*70)
print("LEARNING RATE SENSITIVITY (Adam)")
print("="*70)
learning_rates = [0.001, 0.01, 0.1, 1.0]
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for idx, lr in enumerate(learning_rates):
traj = run_optimizer(Adam, x0, rosenbrock_grad, n_steps=50, lr=lr)
ax = axes[idx]
ax.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.3)
ax.plot(1, 1, 'r*', markersize=15)
ax.plot(traj[:, 0], traj[:, 1], 'b-o', linewidth=2, markersize=3, alpha=0.8)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_title(f'Adam (lr={lr})')
ax.set_xlabel('x')
if idx == 0:
ax.set_ylabel('y')
final = traj[-1]
dist = np.linalg.norm(final - np.array([1, 1]))
print(f"lr = {lr:5.3f}: Final distance = {dist:8.5f}")
plt.tight_layout()
plt.show()
print("\n" + "="*70)
print("KEY OBSERVATIONS")
print("="*70)
print("1. Momentum methods accelerate in consistent directions")
print("2. NAG has better convergence than classical momentum")
print("3. Adaptive methods (Adam, RMSprop) auto-tune learning rates")
print("4. AdaGrad learning rate decays too aggressively")
print("5. Adam balances momentum + adaptive learning rates")
print("6. AdamW improves regularization via decoupled weight decay")
print("="*70)
# Gradient descent on 1D function
def f_1d(x):
"""Objective function: f(x) = (x-2)² + 1"""
return (x - 2)**2 + 1
def grad_f_1d(x):
"""Gradient: f'(x) = 2(x-2)"""
return 2 * (x - 2)
def gradient_descent_1d(x0, learning_rate=0.1, n_iterations=20):
"""
Perform gradient descent.
Returns:
history: List of (x, f(x)) tuples at each iteration
"""
x = x0
history = [(x, f_1d(x))]
for _ in range(n_iterations):
grad = grad_f_1d(x)
x = x - learning_rate * grad
history.append((x, f_1d(x)))
return history
# Run gradient descent
x0 = 5.0
history = gradient_descent_1d(x0, learning_rate=0.1, n_iterations=20)
# Extract trajectory
x_vals = np.array([h[0] for h in history])
f_vals = np.array([h[1] for h in history])
print("Gradient Descent (1D)")
print(f"\nObjective: f(x) = (x-2)² + 1")
print(f"Starting point: x₀ = {x0}")
print(f"Learning rate: α = 0.1")
print(f"\nIteration x f(x)")
print("-" * 35)
for i, (x, fx) in enumerate(history[:10]):
print(f"{i:5d} {x:8.4f} {fx:8.4f}")
print("...")
print(f"{len(history)-1:5d} {x_vals[-1]:8.4f} {f_vals[-1]:8.4f}")
print(f"\nTrue minimum: x* = 2, f(x*) = 1")
print(f"Found: x ≈ {x_vals[-1]:.4f}, f(x) ≈ {f_vals[-1]:.4f}")
# Visualize
x_plot = np.linspace(-1, 6, 300)
f_plot = f_1d(x_plot)
plt.figure(figsize=(12, 5))
# Function and trajectory
plt.subplot(1, 2, 1)
plt.plot(x_plot, f_plot, 'b-', linewidth=2, label='f(x)')
plt.plot(x_vals, f_vals, 'ro-', markersize=6, linewidth=1.5, label='GD trajectory')
plt.plot(2, 1, 'g*', markersize=20, label='True minimum')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Trajectory')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
plt.semilogy(f_vals - 1, 'r.-', linewidth=2, markersize=8)
plt.xlabel('Iteration')
plt.ylabel('f(x) - f(x*)')
plt.title('Convergence (log scale)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Gradient descent on 2D function
def f_2d(x):
"""Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²"""
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def grad_f_2d(x):
"""Gradient of Rosenbrock"""
dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dy = 200 * (x[1] - x[0]**2)
return np.array([dx, dy])
def gradient_descent_2d(x0, learning_rate=0.001, n_iterations=100):
"""Gradient descent in 2D."""
x = x0.copy()
history = [x.copy()]
for _ in range(n_iterations):
grad = grad_f_2d(x)
x = x - learning_rate * grad
history.append(x.copy())
return np.array(history)
# Run
x0 = np.array([-1.0, 1.0])
history = gradient_descent_2d(x0, learning_rate=0.001, n_iterations=500)
print("Gradient Descent (2D)")
print(f"\nObjective: Rosenbrock function")
print(f"f(x,y) = (1-x)² + 100(y-x²)²")
print(f"\nStarting point: x₀ = {x0}")
print(f"Learning rate: α = 0.001")
print(f"Iterations: 500")
print(f"\nTrue minimum: x* = [1, 1], f(x*) = 0")
print(f"Found: x ≈ {history[-1]}, f(x) ≈ {f_2d(history[-1]):.6f}")
# Visualize
x_range = np.linspace(-1.5, 1.5, 200)
y_range = np.linspace(-0.5, 1.5, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = np.zeros_like(X)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
Z[i, j] = f_2d(np.array([X[i, j], Y[i, j]]))
plt.figure(figsize=(14, 6))
# Contour plot
plt.subplot(1, 2, 1)
levels = np.logspace(-1, 3, 20)
plt.contour(X, Y, Z, levels=levels, cmap='viridis')
plt.plot(history[:, 0], history[:, 1], 'r.-', markersize=4, linewidth=1, alpha=0.7)
plt.plot(history[0, 0], history[0, 1], 'go', markersize=10, label='Start')
plt.plot(1, 1, 'r*', markersize=15, label='True minimum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('GD on Rosenbrock Function')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
f_history = [f_2d(x) for x in history]
plt.semilogy(f_history, 'b.-', linewidth=2, markersize=6)
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Convergence (log scale)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nRosenbrock is a challenging function with a narrow valley!")
# Effect of learning rate
def test_learning_rates():
"""Test different learning rates."""
x0 = np.array([5.0])
learning_rates = [0.01, 0.1, 0.5, 0.9]
plt.figure(figsize=(14, 8))
for idx, lr in enumerate(learning_rates, 1):
# Run GD
history = gradient_descent_1d(x0[0], learning_rate=lr, n_iterations=20)
x_vals = np.array([h[0] for h in history])
f_vals = np.array([h[1] for h in history])
plt.subplot(2, 2, idx)
x_plot = np.linspace(-1, 6, 300)
f_plot = f_1d(x_plot)
plt.plot(x_plot, f_plot, 'b-', linewidth=2, alpha=0.3)
plt.plot(x_vals, f_vals, 'ro-', markersize=6, linewidth=1.5)
plt.plot(2, 1, 'g*', markersize=15)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title(f'Learning rate α = {lr}')
plt.grid(True, alpha=0.3)
if lr == 0.9:
plt.ylim([0, 20])
plt.tight_layout()
plt.show()
print("Learning rate selection:")
print(" α too small → slow convergence")
print(" α too large → oscillation or divergence!")
print(" α just right → fast, stable convergence")
test_learning_rates()
# Gradient descent with momentum
def gradient_descent_momentum(x0, learning_rate=0.01, momentum=0.9, n_iterations=100):
"""
GD with momentum:
v_{t+1} = β*v_t + α*∇f(x_t)
x_{t+1} = x_t - v_{t+1}
"""
x = x0.copy()
v = np.zeros_like(x0)
history = [x.copy()]
for _ in range(n_iterations):
grad = grad_f_2d(x)
v = momentum * v + learning_rate * grad
x = x - v
history.append(x.copy())
return np.array(history)
# Compare vanilla GD vs GD with momentum
x0 = np.array([-1.0, 1.0])
n_iter = 200
history_vanilla = gradient_descent_2d(x0, learning_rate=0.001, n_iterations=n_iter)
history_momentum = gradient_descent_momentum(x0, learning_rate=0.001, momentum=0.9, n_iterations=n_iter)
print("Gradient Descent with Momentum")
print(f"\nVanilla GD after {n_iter} iterations:")
print(f" x = {history_vanilla[-1]}")
print(f" f(x) = {f_2d(history_vanilla[-1]):.6f}")
print(f"\nGD with momentum (β=0.9) after {n_iter} iterations:")
print(f" x = {history_momentum[-1]}")
print(f" f(x) = {f_2d(history_momentum[-1]):.6f}")
# Visualize
plt.figure(figsize=(14, 6))
# Trajectories
plt.subplot(1, 2, 1)
levels = np.logspace(-1, 3, 20)
plt.contour(X, Y, Z, levels=levels, cmap='viridis', alpha=0.3)
plt.plot(history_vanilla[:, 0], history_vanilla[:, 1], 'b.-',
markersize=3, linewidth=1, alpha=0.6, label='Vanilla GD')
plt.plot(history_momentum[:, 0], history_momentum[:, 1], 'r.-',
markersize=3, linewidth=1, alpha=0.6, label='GD + Momentum')
plt.plot(1, 1, 'g*', markersize=15, label='Minimum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Comparison of Trajectories')
plt.legend()
plt.grid(True, alpha=0.3)
# Convergence
plt.subplot(1, 2, 2)
f_vanilla = [f_2d(x) for x in history_vanilla]
f_momentum = [f_2d(x) for x in history_momentum]
plt.semilogy(f_vanilla, 'b-', linewidth=2, label='Vanilla GD')
plt.semilogy(f_momentum, 'r-', linewidth=2, label='GD + Momentum')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Convergence Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nMomentum accelerates convergence in narrow valleys!")
7.2 Constrained Optimization and Lagrange Multipliers¶
Problem:¶
Minimize f(x) subject to constraints g(x) = 0
Lagrangian:¶
\(\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)\)
Optimality Conditions (KKT):¶
At optimal (x*, λ*):
\(\nabla_x \mathcal{L} = \nabla f(x^*) + \lambda^* \nabla g(x^*) = 0\)
\(g(x^*) = 0\) (feasibility)
Intuition: At optimum, gradient of f is parallel to gradient of g!
# Lagrange multipliers example
# Minimize f(x,y) = x² + y² subject to x + y = 1
from scipy.optimize import minimize
def objective(xy):
"""f(x,y) = x² + y²"""
return xy[0]**2 + xy[1]**2
def constraint(xy):
"""g(x,y) = x + y - 1 = 0"""
return xy[0] + xy[1] - 1
# Analytical solution using Lagrange multipliers
# L = x² + y² + λ(x + y - 1)
# ∂L/∂x = 2x + λ = 0 → x = -λ/2
# ∂L/∂y = 2y + λ = 0 → y = -λ/2
# ∂L/∂λ = x + y - 1 = 0
#
# Substituting: -λ/2 - λ/2 - 1 = 0 → λ = -1
# Therefore: x* = y* = 1/2
x_analytical = np.array([0.5, 0.5])
f_analytical = objective(x_analytical)
print("Constrained Optimization Example")
print("\nProblem:")
print(" minimize f(x,y) = x² + y²")
print(" subject to x + y = 1")
print("\nAnalytical solution (Lagrange multipliers):")
print(f" x* = {x_analytical}")
print(f" f(x*) = {f_analytical:.4f}")
print(f" Constraint satisfied? x + y = {x_analytical.sum():.4f}")
# Numerical solution
x0 = np.array([0.0, 1.0])
constraints = {'type': 'eq', 'fun': constraint}
result = minimize(objective, x0, constraints=constraints)
print("\nNumerical solution (scipy.optimize):")
print(f" x* = {result.x}")
print(f" f(x*) = {result.fun:.4f}")
print(f" Success? {result.success}")
# Visualize
x_range = np.linspace(-0.5, 1.5, 200)
y_range = np.linspace(-0.5, 1.5, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + Y**2
plt.figure(figsize=(10, 8))
contour = plt.contour(X, Y, Z, levels=20, cmap='viridis')
plt.clabel(contour, inline=True, fontsize=8)
# Constraint: x + y = 1 → y = 1 - x
x_constraint = np.linspace(-0.5, 1.5, 100)
y_constraint = 1 - x_constraint
plt.plot(x_constraint, y_constraint, 'r-', linewidth=3, label='Constraint: x+y=1')
# Optimal point
plt.plot(x_analytical[0], x_analytical[1], 'g*', markersize=20, label='Optimal x*')
# Gradients at optimal point
grad_f = 2 * x_analytical # ∇f = [2x, 2y]
grad_g = np.array([1, 1]) # ∇g = [1, 1]
# Normalize for visualization
grad_f_norm = grad_f / np.linalg.norm(grad_f) * 0.3
grad_g_norm = grad_g / np.linalg.norm(grad_g) * 0.3
plt.quiver(x_analytical[0], x_analytical[1], grad_f_norm[0], grad_f_norm[1],
color='blue', width=0.008, scale=1, scale_units='xy', label='∇f')
plt.quiver(x_analytical[0], x_analytical[1], grad_g_norm[0], grad_g_norm[1],
color='orange', width=0.008, scale=1, scale_units='xy', label='∇g')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Constrained Optimization\n∇f and ∇g are parallel at optimum!')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.xlim([-0.5, 1.5])
plt.ylim([-0.5, 1.5])
plt.show()
print("\nKey insight: At x*, gradients ∇f and ∇g are parallel!")
print("This is exactly what Lagrange multipliers enforce.")
7.3 Convex Optimization¶
Definition 7.2 (Convex Function)¶
f is convex if for all x, y and λ ∈ [0,1]:
\(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)\)
Geometric interpretation: The function lies below any chord connecting two points on its graph.
Properties:¶
Any local minimum is a global minimum
Gradient descent is guaranteed to converge
Efficiently solvable
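The defining inequality suggests a quick numerical sanity check (a heuristic, not a proof): sample point pairs and test the midpoint case \(f\!\left(\frac{x+y}{2}\right) \leq \frac{f(x)+f(y)}{2}\). Convex functions always pass; \(x^4 - x^2\) fails, e.g. at \(x = -0.5\), \(y = 0.5\).

```python
import numpy as np

# Heuristic midpoint-convexity check over random point pairs
def looks_convex(f, lo=-2.0, hi=2.0, n=200):
    rng = np.random.default_rng(0)
    xs, ys = rng.uniform(lo, hi, n), rng.uniform(lo, hi, n)
    mid = (xs + ys) / 2
    # small tolerance guards against floating-point noise at equality
    return bool(np.all(f(mid) <= (f(xs) + f(ys)) / 2 + 1e-12))

print(looks_convex(lambda x: x ** 2))           # True: passes everywhere
print(looks_convex(lambda x: x ** 4 - x ** 2))  # False: chord dips below
```

Passing this test does not prove convexity (it only samples finitely many pairs), but a single failure does disprove it.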
Examples:¶
Convex:
x²
|x|
exp(x)
-log(x) for x > 0
||x||₂²
Not convex:
sin(x)
x⁴ - x²
Neural networks (generally)
# Visualize convex vs non-convex
x = np.linspace(-3, 3, 300)
# Convex functions
f_convex1 = x**2
f_convex2 = np.abs(x)
f_convex3 = np.exp(x)
# Non-convex functions
f_nonconvex1 = np.sin(x)
f_nonconvex2 = x**4 - x**2
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
# Convex
axes[0, 0].plot(x, f_convex1, 'b-', linewidth=2)
axes[0, 0].set_title('f(x) = x² (Convex)', fontsize=12)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 1].plot(x, f_convex2, 'b-', linewidth=2)
axes[0, 1].set_title('f(x) = |x| (Convex)', fontsize=12)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 2].plot(x, f_convex3, 'b-', linewidth=2)
axes[0, 2].set_title('f(x) = exp(x) (Convex)', fontsize=12)
axes[0, 2].set_ylim([0, 20])
axes[0, 2].grid(True, alpha=0.3)
# Non-convex
axes[1, 0].plot(x, f_nonconvex1, 'r-', linewidth=2)
axes[1, 0].set_title('f(x) = sin(x) (Non-convex)', fontsize=12)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 1].plot(x, f_nonconvex2, 'r-', linewidth=2)
axes[1, 1].set_title('f(x) = x⁴ - x² (Non-convex)', fontsize=12)
axes[1, 1].grid(True, alpha=0.3)
# Demonstrate convexity definition
x_demo = np.linspace(-2, 2, 100)
f_demo = x_demo**2
# Two points
x1, x2 = -1.5, 1.5
f1, f2 = x1**2, x2**2
# Line connecting them
x_line = np.linspace(x1, x2, 50)
f_line = f1 + (f2 - f1) * (x_line - x1) / (x2 - x1)
axes[1, 2].plot(x_demo, f_demo, 'b-', linewidth=2, label='f(x) = x²')
axes[1, 2].plot(x_line, f_line, 'g--', linewidth=2, label='Linear interpolation')
axes[1, 2].plot([x1, x2], [f1, f2], 'ro', markersize=10)
axes[1, 2].fill_between(x_line, f_demo[np.searchsorted(x_demo, x_line)], f_line,
alpha=0.3, color='yellow', label='f ≤ line')
axes[1, 2].set_title('Convexity Definition', fontsize=12)
axes[1, 2].legend(fontsize=9)
axes[1, 2].grid(True, alpha=0.3)
for ax in axes.flat:
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
plt.tight_layout()
plt.show()
print("Convex functions: curve lies BELOW any connecting line")
print("Non-convex: can have multiple local minima!")
# Demonstrate: Convex → unique global minimum
# Non-convex → multiple local minima
def convex_function(x):
return x**2 + 2*x + 1
def nonconvex_function(x):
return x**4 - 4*x**2 + 2*x
# Test gradient descent from multiple starting points
starting_points = [-3, -1, 0, 1, 3]
def grad_convex(x):
return 2*x + 2
def grad_nonconvex(x):
return 4*x**3 - 8*x + 2
def gd_1d_general(x0, grad_func, lr=0.01, n_iter=100):
x = x0
for _ in range(n_iter):
x = x - lr * grad_func(x)
return x
# Run GD
x_plot = np.linspace(-3, 3, 300)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Convex
axes[0].plot(x_plot, convex_function(x_plot), 'b-', linewidth=2)
results_convex = []
for x0 in starting_points:
x_final = gd_1d_general(x0, grad_convex, lr=0.1, n_iter=50)
results_convex.append(x_final)
axes[0].plot(x0, convex_function(x0), 'ro', markersize=8)
axes[0].plot(x_final, convex_function(x_final), 'g*', markersize=15)
axes[0].annotate('', xy=(x_final, convex_function(x_final)),
xytext=(x0, convex_function(x0)),
arrowprops=dict(arrowstyle='->', color='red', alpha=0.5))
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Convex: All paths lead to same minimum!')
axes[0].grid(True, alpha=0.3)
# Non-convex
axes[1].plot(x_plot, nonconvex_function(x_plot), 'r-', linewidth=2)
results_nonconvex = []
for x0 in starting_points:
x_final = gd_1d_general(x0, grad_nonconvex, lr=0.01, n_iter=100)
results_nonconvex.append(x_final)
axes[1].plot(x0, nonconvex_function(x0), 'ro', markersize=8)
axes[1].plot(x_final, nonconvex_function(x_final), 'g*', markersize=15)
axes[1].annotate('', xy=(x_final, nonconvex_function(x_final)),
xytext=(x0, nonconvex_function(x0)),
arrowprops=dict(arrowstyle='->', color='red', alpha=0.5))
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Non-convex: Different local minima!')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Convex optimization results (from different starting points):")
print(f" Converged to: {results_convex}")
print(f" All same? {len(set(np.round(results_convex, 2))) == 1}")
print("\nNon-convex optimization results:")
print(f" Converged to: {np.round(results_nonconvex, 2)}")
print(f" Different minima found!")
print("\n" + "="*60)
print("This is why convex optimization is so valuable in ML!")
Summary¶
Key Concepts from Chapter 7:¶
Gradient Descent: x_{t+1} = x_t - α∇f(x_t)
Learning Rate: Controls step size (critical!)
Momentum: Accelerates convergence
Lagrange Multipliers: Handle constraints
Convexity: Guarantees global optimum
KKT Conditions: Optimality for constrained problems
Optimization Variants:¶
| Method | Update Rule | Advantage |
|---|---|---|
| Batch GD | Full gradient | Stable |
| SGD | Single sample | Fast, online |
| Mini-batch | Batch gradient | Good trade-off |
| Momentum | v = βv + α∇f | Accelerates |
| Adam | Adaptive rates | State-of-the-art |
ML Applications:¶
| Concept | ML Application |
|---|---|
| Gradient descent | Training all models |
| SGD | Deep learning |
| Momentum | Faster training |
| Convex optimization | Linear regression, SVM |
| Lagrange multipliers | SVM dual problem |
Convexity Benefits:¶
✓ Unique global minimum
✓ Guaranteed convergence
✓ Efficient algorithms
✓ Theoretical guarantees
Next Steps:¶
Chapter 9: Linear Regression (uses gradient descent)
Chapter 10: PCA (eigenvalue optimization)
Chapter 12: SVM (constrained optimization)
Practice: Implement Adam optimizer and compare with vanilla GD!