Chapter 4: Backpropagation Calculus
The Mathematics Behind Backpropagation
While the previous chapter gave the intuition for backpropagation, this chapter works through the calculus in detail. The chain rule for a composition of functions \(f(g(x))\) states that \(\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\). In a neural network, each neuron's output is a composition of many functions: the weighted sum \(z = w_1 a_1 + w_2 a_2 + \cdots + b\), followed by the activation \(a = \sigma(z)\), feeding into the next layer, and ultimately into the cost \(C\).
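As a sanity check, the chain rule can be verified numerically. The sketch below uses hypothetical choices \(g(x) = x^2\) and \(f(u) = \sin(u)\), so \(\frac{df}{dx} = \cos(x^2) \cdot 2x\), and compares the analytic product of local derivatives against a central finite difference:

```python
import numpy as np

def g(x):
    return x ** 2          # inner function

def f(u):
    return np.sin(u)       # outer function

x = 1.3
analytic = np.cos(g(x)) * 2 * x  # df/dg * dg/dx, per the chain rule

eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
print(analytic, numeric)  # the two estimates should agree closely
```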
The chain rule lets us decompose the derivative of \(C\) with respect to any weight \(w\) into a product of local derivatives along the path from that weight to the cost. For a weight \(w^{(L)}\) in the last layer:

\[
\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial w^{(L)}}
\]
Each of these three terms has a simple, intuitive form: the first measures how the cost responds to the output activation, the second is just \(\sigma'(z)\) (the activation function's derivative), and the third is simply the previous layer's activation \(a^{(L-1)}\). For weights in earlier layers, additional chain rule factors accumulate, but the structure remains the same.
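The three-factor product can be made concrete for a single last-layer neuron. This sketch assumes a sigmoid activation and a squared-error cost \(C = \tfrac{1}{2}(a - y)^2\) (neither is fixed by the text above; both are illustrative choices), and checks the chain rule gradient against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One last-layer neuron: z = w * a_prev + b, a = sigmoid(z).
# Assumed cost (illustrative): C = 0.5 * (a - y)**2.
w, b, a_prev, y = 0.6, -0.2, 0.8, 1.0

z = w * a_prev + b
a = sigmoid(z)

dC_da = a - y                            # how the cost responds to the activation
da_dz = sigmoid(z) * (1 - sigmoid(z))    # sigma'(z) for the sigmoid
dz_dw = a_prev                           # the previous layer's activation

grad_w = dC_da * da_dz * dz_dw           # chain rule: product of the three terms

# Verify against a central finite difference in w
def cost(w_):
    return 0.5 * (sigmoid(w_ * a_prev + b) - y) ** 2

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(grad_w, numeric)  # the two gradients should agree closely
```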
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)  # default figure size for this chapter
np.random.seed(42)  # make the random examples reproducible
Computational Graphs: Visualizing the Flow of Gradients
A computational graph represents each operation in the network as a node: additions, multiplications, activation functions. The forward pass flows left to right, computing activations. The backward pass flows right to left, computing gradients by applying the chain rule at each node.
At every node, the local operation defines a simple derivative (e.g., for multiplication \(z = xy\), \(\frac{\partial z}{\partial x} = y\)). The incoming gradient from downstream is multiplied by this local derivative and passed upstream. At branch points where one value feeds into multiple downstream paths, gradients from all paths are summed (the multivariate chain rule). This "multiply-and-pass-back" pattern means that all gradients in the entire network are computed in a single backward traversal of the graph, with computational cost roughly equal to the forward pass. This is what makes training networks with millions or billions of parameters practical: frameworks like PyTorch build the computational graph dynamically during the forward pass (autograd) and traverse it automatically when you call .backward().
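The multiply-and-pass-back pattern, including the summation at branch points, can be hand-rolled for a tiny graph. The example below is a sketch, not a framework: it computes \(c = (xy)(x + y)\), where \(x\) and \(y\) each branch into two downstream paths, then walks the graph right to left and checks the result against a finite difference:

```python
import numpy as np

# Tiny computational graph:
#   u = x * y        (multiply node)
#   v = x + y        (add node)
#   c = u * v        (multiply node)
# x and y each feed two paths, so their gradients must be summed.
x, y = 2.0, 3.0

# Forward pass (left to right): compute values at each node
u = x * y
v = x + y
c = u * v

# Backward pass (right to left): multiply the incoming gradient
# by each node's local derivative and pass it upstream.
dc_dc = 1.0
dc_du = dc_dc * v            # local derivative of c = u * v w.r.t. u
dc_dv = dc_dc * u            # local derivative of c = u * v w.r.t. v
# Branch points: sum the gradients arriving from both paths
dc_dx = dc_du * y + dc_dv * 1.0   # via u = x*y and via v = x+y
dc_dy = dc_du * x + dc_dv * 1.0

# Finite-difference check on dc/dx
def graph(x_, y_):
    return (x_ * y_) * (x_ + y_)

eps = 1e-6
num_x = (graph(x + eps, y) - graph(x - eps, y)) / (2 * eps)
print(dc_dx, num_x)  # the two estimates should agree closely
```

One backward traversal produced the gradients for every input at once, which is exactly the property that makes backpropagation scale.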