Chapter 3: What is Backpropagation?ΒΆ
The Algorithm That Makes Deep Learning Possible¶
Gradient descent requires knowing \(\frac{\partial C}{\partial w}\) for every weight in the network: the sensitivity of the total cost to each individual parameter. For a network with 13,000 weights, computing each partial derivative independently (e.g., by nudging one weight and measuring the cost change) would require 13,000 forward passes per training step. Backpropagation computes all these gradients simultaneously in just one forward pass plus one backward pass, making deep learning computationally feasible.
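To make the efficiency gap concrete, here is a small sketch (the one-layer "network," its sizes, and the quadratic cost are illustrative assumptions, not the chapter's exact model). The finite-difference loop needs one extra forward pass per weight, while the chain-rule expression yields the whole gradient vector at once:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=5)   # fixed input
y = 1.0                  # target output
w = rng.normal(size=5)   # 5 weights (illustrative)

def cost(w):
    """One forward pass: tanh neuron with quadratic cost."""
    pred = np.tanh(w @ x)
    return 0.5 * (pred - y) ** 2

# Finite differences: one extra forward pass per weight (5 here, 13,000 for the
# network in the text).
eps = 1e-6
grad_fd = np.array([(cost(w + eps * np.eye(len(w))[i]) - cost(w)) / eps
                    for i in range(len(w))])

# Chain rule: one forward pass plus one cheap backward computation gives
# all partial derivatives simultaneously.
z = w @ x
pred = np.tanh(z)
grad_bp = (pred - y) * (1 - np.tanh(z) ** 2) * x

print(np.allclose(grad_fd, grad_bp, atol=1e-4))
```

Both methods agree numerically; the difference is purely in cost, which is why the finite-difference approach is used only to sanity-check backpropagation, never to train.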
The core insight is that the chain rule of calculus can be applied layer by layer, starting from the output and working backward. Each layer receives an "error signal" from the layer above it and uses it to compute both (a) the gradients for its own weights and (b) the error signal to pass further back. This backward flow of gradient information gives the algorithm its name. Backpropagation was popularized by Rumelhart, Hinton, and Williams in 1986 and remains the foundation of all modern neural network training.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')                  # consistent plot styling
plt.rcParams['figure.figsize'] = (16, 10)   # default figure size for this chapter
np.random.seed(42)                          # reproducible randomness
Intuition: How Each Weight Influences the Cost¶
Consider a single training example where the network should output "7" but instead gives high activation to "3." Backpropagation asks: for each weight in the network, how would a tiny increase in that weight change the final cost? A weight connecting to a neuron that is already active and pointing toward the wrong answer should be decreased. A weight connecting an active neuron toward the correct answer should be increased.
More precisely, the gradient \(\frac{\partial C}{\partial w_{jk}^{(l)}}\) decomposes into three factors via the chain rule: (1) how much the cost changes with the neuron's activation, (2) how much the activation changes with the weighted sum, and (3) how much the weighted sum changes with the weight. Each factor is easy to compute locally, and their product gives the desired gradient. This local-computation structure is what makes backpropagation efficient: no neuron needs global knowledge of the network.
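The three-factor decomposition can be computed directly for a single sigmoid neuron. This is a minimal sketch, assuming a quadratic cost \(C = \frac{1}{2}(a - y)^2\); the particular values of the weight, bias, and input are arbitrary:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

a_prev, w, b, y = 0.8, 0.5, 0.1, 1.0  # illustrative values

z = w * a_prev + b        # weighted sum
a = sigmoid(z)            # activation
C = 0.5 * (a - y) ** 2    # quadratic cost (assumed)

# The three local factors of the chain rule:
dC_da = a - y             # (1) cost w.r.t. activation
da_dz = a * (1 - a)       # (2) activation w.r.t. weighted sum (sigmoid')
dz_dw = a_prev            # (3) weighted sum w.r.t. weight

grad = dC_da * da_dz * dz_dw   # dC/dw: product of the three factors

# Sanity check against a numerical nudge of the weight:
eps = 1e-6
C_eps = 0.5 * (sigmoid((w + eps) * a_prev + b) - y) ** 2
print(np.isclose(grad, (C_eps - C) / eps, atol=1e-4))
```

Note that each factor uses only quantities available at that neuron (its activation and its input), which is the locality the paragraph describes.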
The Chain Rule: Propagating Gradients Layer by Layer¶
Backpropagation is fundamentally an application of the chain rule from calculus. If the network computes \(C = f_3(f_2(f_1(\vec{x})))\) where each \(f_i\) represents one layer's computation (linear transform followed by activation), then:

\[
\frac{\partial C}{\partial \vec{x}} = \frac{\partial f_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial \vec{x}}
\]
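A scalar version of this composition makes the chain rule easy to verify. The particular functions below (sin, exp, square) are arbitrary stand-ins for the layers \(f_1, f_2, f_3\); the point is that the derivative is a product of local derivatives evaluated at the cached intermediate values:

```python
import numpy as np

# Stand-in "layers" and their derivatives (illustrative choices).
f1, df1 = np.sin, np.cos
f2, df2 = np.exp, np.exp
f3, df3 = np.square, lambda u: 2 * u

x = 0.7
u1 = f1(x)    # cached intermediate from the forward pass
u2 = f2(u1)
C = f3(u2)

# Backward: multiply local derivatives from the output inward.
dC = df3(u2) * df2(u1) * df1(x)

# Numerical check of dC/dx.
eps = 1e-7
num = (f3(f2(f1(x + eps))) - C) / eps
print(np.isclose(dC, num, atol=1e-5))
```

Each factor is evaluated at an intermediate value saved during the forward pass, which is exactly what frameworks cache in their computational graphs.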
The key efficiency comes from reusing intermediate results. The error signal \(\frac{\partial C}{\partial \vec{a}^{(l)}}\) at layer \(l\) is computed once and used to derive gradients for all weights in that layer, then propagated backward to compute \(\frac{\partial C}{\partial \vec{a}^{(l-1)}}\). This is why the algorithm runs in time proportional to a single forward pass: each layer does a constant amount of extra work during the backward pass. Modern frameworks like PyTorch and TensorFlow implement this automatically through automatic differentiation, recording a computational graph during the forward pass and traversing it in reverse to compute gradients.
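The reuse of the error signal can be seen end to end in a minimal two-layer network. This is a sketch, assuming sigmoid activations, a quadratic cost, and arbitrary layer sizes (4 → 3 → 2); note how each `delta` is computed once and used both for that layer's weight gradients and for propagating further back:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(4, 1))                      # input
y = rng.normal(size=(2, 1))                      # target
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))

# Forward pass: cache z and a at every layer.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
C = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass: each error signal (delta) is computed once, then reused.
delta2 = (a2 - y) * a2 * (1 - a2)         # dC/dz2 at the output layer
dW2 = delta2 @ a1.T                       # layer-2 weight gradients
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # propagate back: dC/dz1
dW1 = delta1 @ x.T                        # layer-1 weight gradients
# (bias gradients are just the deltas themselves: db2 = delta2, db1 = delta1)

# Spot-check one entry of dW1 against a numerical nudge.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a2p = sigmoid(W2 @ sigmoid(W1p @ x + b1) + b2)
Cp = 0.5 * np.sum((a2p - y) ** 2)
print(np.isclose(dW1[0, 0], (Cp - C) / eps, atol=1e-4))
```

The backward pass touches each layer once and does work comparable to the forward pass, which is the linear-time behavior the paragraph describes; autodiff frameworks generalize this same pattern to arbitrary computational graphs.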