Chapter 2: Gradient DescentΒΆ

How Neural Networks Learn: Minimizing a Cost Function

Training a neural network means finding the set of weights and biases that make it perform well on training data. To make “perform well” precise, we define a cost function (also called a loss function) that quantifies how far the network’s predictions are from the correct answers. For classification, a common choice is the mean squared error:

\[C(\vec{w}, \vec{b}) = \frac{1}{n} \sum_{i=1}^{n} \| \vec{y}_i - \hat{\vec{y}}_i \|^2\]

where \(\vec{y}_i\) is the desired output (e.g., a one-hot vector with 1 at the correct digit) and \(\hat{\vec{y}}_i\) is the network’s actual output. The cost is a function of all weights and biases – a scalar-valued function of thousands of variables. Training is an optimization problem: find the parameter values that minimize \(C\). Since the cost landscape is high-dimensional and non-convex, we cannot solve this analytically. Instead, we use an iterative approach: gradient descent.
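The formula above can be sketched directly in NumPy. This is a minimal illustration, not the book's actual training code: `mse_cost` and the toy targets are hypothetical, using one-hot vectors for the digits 3 and 7 and a deliberately uncommitted network output.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: (1/n) * sum over examples of the squared
    Euclidean distance between desired and actual output vectors."""
    n = y_true.shape[0]
    return np.sum((y_true - y_pred) ** 2) / n

# Two one-hot targets (correct digits: 3 and 7)
y_true = np.zeros((2, 10))
y_true[0, 3] = 1.0
y_true[1, 7] = 1.0

# A hypothetical, uncommitted network that outputs 0.1 everywhere
y_pred = np.full((2, 10), 0.1)

print(mse_cost(y_true, y_pred))  # high cost: predictions are far from the targets
```

A perfect network would drive this number to zero; training searches for the weights and biases that get as close as possible.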

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)

The Gradient Descent Algorithm: Following the Slope Downhill

Imagine standing in a foggy mountain landscape where you can only feel the slope beneath your feet. To reach the lowest valley, you repeatedly check which direction is steepest downhill and take a step in that direction. This is gradient descent in a nutshell.

The gradient \(\nabla C\) is a vector of partial derivatives – one for each parameter – that points in the direction of steepest increase in the cost. To decrease the cost, we step in the opposite direction:

\[\vec{w}_{new} = \vec{w}_{old} - \eta \nabla C\]

The learning rate \(\eta\) controls the step size. Too large and you overshoot the minimum (or diverge entirely); too small and training takes forever. The gradient tells you both the direction and the relative magnitude of improvement for each parameter. In a network with 13,000 parameters, the gradient is a 13,000-dimensional vector, and each entry tells you how sensitive the cost is to a tiny change in that specific weight or bias. Computing this gradient efficiently is the job of backpropagation, covered in the next chapter.

# Simple gradient descent demo on the 1-D cost C(w) = (w - 2)^2,
# whose minimum is at w = 2
w = 4.0
eta = 0.3  # learning rate
for i in range(10):
    grad = 2 * (w - 2)   # dC/dw
    w = w - eta * grad   # step in the direction opposite the gradient
    print(f"Step {i}: w={w:.4f}")
print(f"Final: {w:.4f}")

Stochastic Gradient Descent: Trading Precision for Speed

Computing the exact gradient requires evaluating the cost over the entire training set – for MNIST, that means running all 60,000 images through the network before making a single parameter update. Stochastic Gradient Descent (SGD) dramatically speeds this up by estimating the gradient from a small random subset (a mini-batch) of, say, 32 or 128 examples.

Each mini-batch gradient is a noisy approximation of the true gradient, but the noise is actually beneficial: it helps the optimizer escape shallow local minima and saddle points that would trap standard gradient descent. Over many mini-batch updates, the random fluctuations average out and the parameters converge toward a good solution. Modern variants like Adam (Adaptive Moment Estimation) maintain per-parameter learning rates and momentum terms, adapting automatically to the loss landscape’s curvature. Nearly all modern deep learning uses some form of mini-batch SGD.
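The mini-batch loop can be sketched on a one-parameter linear model. The data, the model `y = w*x`, and all hyperparameters here are illustrative assumptions, not from the chapter; the point is the structure: shuffle, slice into mini-batches, estimate the gradient from each batch, update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + noise; SGD should recover a slope near 3
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + rng.normal(0, 0.1, size=1000)

w = 0.0            # single parameter to learn
eta = 0.1          # learning rate
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Gradient of the mini-batch MSE (1/m) * sum (w*x - y)^2 w.r.t. w
        grad = 2 * np.mean((w * xb - yb) * xb)
        w -= eta * grad                    # noisy but cheap update

print(w)  # close to 3.0
```

Each update sees only 32 examples instead of all 1000, yet after many noisy steps the parameter settles near the true slope, which is exactly the trade SGD makes.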

Looking Ahead: Computing Gradients with Backpropagation

Gradient descent tells us what to do with gradients (step downhill), but not how to compute them. For a network with thousands of weights arranged in layers of matrix multiplications and nonlinear activations, computing \(\frac{\partial C}{\partial w}\) for every weight by finite differences would require millions of forward passes. Backpropagation solves this elegantly: it computes all gradients in a single backward pass through the network, using the chain rule of calculus to propagate error signals from the output layer back to the input layer. This is the subject of the next two chapters.
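To see why finite differences are so expensive, here is a sketch on a toy quadratic cost. The function names are illustrative. Note the loop: each of the `len(w)` parameters needs its own extra cost evaluation, which for a real network means a full forward pass per weight.

```python
import numpy as np

def cost(w):
    """Toy quadratic cost over a parameter vector, minimized at w = 1."""
    return np.sum((w - 1.0) ** 2)

def finite_diff_grad(w, eps=1e-6):
    """Estimate dC/dw_j one parameter at a time: one extra
    cost evaluation (i.e., forward pass) per parameter."""
    grad = np.zeros_like(w)
    base = cost(w)
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += eps
        grad[j] = (cost(w_step) - base) / eps
    return grad

w = np.zeros(5)
print(finite_diff_grad(w))  # ~ [-2, -2, -2, -2, -2]
print(2 * (w - 1.0))        # analytic gradient, for comparison
```

With 5 parameters this is 5 extra evaluations; with 13,000 it is 13,000 extra forward passes per update. Backpropagation gets all 13,000 partial derivatives from a single backward pass instead.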