Chapter 4: Backpropagation Calculus
The Mathematics Behind Backpropagation
While the previous chapter gave the intuition for backpropagation, this chapter works through the calculus in detail. The chain rule for a composition of functions \(f(g(x))\) states that \(\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\). In a neural network, each neuron's output is a composition of many functions: the weighted sum \(z = w_1 a_1 + w_2 a_2 + \cdots + b\), followed by the activation \(a = \sigma(z)\), feeding into the next layer, and ultimately into the cost \(C\).
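As a sanity check, the chain rule can be verified numerically. The sketch below uses hypothetical choices \(g(x) = x^2\) and \(f(u) = \sin(u)\), so \(\frac{df}{dx} = \cos(x^2) \cdot 2x\), and compares the analytic product of local derivatives against a central finite difference:

```python
import numpy as np

def g(x):
    return x ** 2          # inner function

def f(u):
    return np.sin(u)       # outer function

x = 1.3
analytic = np.cos(g(x)) * 2 * x  # df/dg * dg/dx, per the chain rule

eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
print(analytic, numeric)  # the two estimates should agree closely
```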
The chain rule lets us decompose the derivative of \(C\) with respect to any weight \(w\) into a product of local derivatives along the path from that weight to the cost. For a weight \(w^{(L)}\) in the last layer:

\[
\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial w^{(L)}}
\]
Each of these three terms has a simple, intuitive form: the first measures how the cost responds to the output activation, the second is just \(\sigma'(z)\) (the activation function's derivative), and the third is simply the previous layer's activation \(a^{(L-1)}\). For weights in earlier layers, additional chain rule factors accumulate, but the structure remains the same.
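The three-factor product can be made concrete for a single last-layer neuron. This sketch assumes a sigmoid activation and a squared-error cost \(C = \tfrac{1}{2}(a - y)^2\) (neither is fixed by the text above; both are illustrative choices), and checks the chain rule gradient against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One last-layer neuron: z = w * a_prev + b, a = sigmoid(z).
# Assumed cost (illustrative): C = 0.5 * (a - y)**2.
w, b, a_prev, y = 0.6, -0.2, 0.8, 1.0

z = w * a_prev + b
a = sigmoid(z)

dC_da = a - y                            # how the cost responds to the activation
da_dz = sigmoid(z) * (1 - sigmoid(z))    # sigma'(z) for the sigmoid
dz_dw = a_prev                           # the previous layer's activation

grad_w = dC_da * da_dz * dz_dw           # chain rule: product of the three terms

# Verify against a central finite difference in w
def cost(w_):
    return 0.5 * (sigmoid(w_ * a_prev + b) - y) ** 2

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(grad_w, numeric)  # the two gradients should agree closely
```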
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)  # default figure size for this chapter
np.random.seed(42)  # make the random examples reproducible
Computational Graphs: Visualizing the Flow of Gradients
A computational graph represents each operation in the network as a node: additions, multiplications, activation functions. The forward pass flows left to right, computing activations. The backward pass flows right to left, computing gradients by applying the chain rule at each node.
At every node, the local operation defines a simple derivative (e.g., for multiplication \(z = xy\), \(\frac{\partial z}{\partial x} = y\)). The incoming gradient from downstream is multiplied by this local derivative and passed upstream. At branch points where one value feeds into multiple downstream paths, gradients from all paths are summed (the multivariate chain rule). This "multiply-and-pass-back" pattern means that all gradients in the entire network are computed in a single backward traversal of the graph, with computational cost roughly equal to the forward pass. This is what makes training networks with millions or billions of parameters practical: frameworks like PyTorch build the computational graph dynamically during the forward pass (autograd) and traverse it automatically when you call .backward().
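The multiply-and-pass-back pattern, including the summation at branch points, can be hand-rolled for a tiny graph. The example below is a sketch, not a framework: it computes \(c = (xy)(x + y)\), where \(x\) and \(y\) each branch into two downstream paths, then walks the graph right to left and checks the result against a finite difference:

```python
import numpy as np

# Tiny computational graph:
#   u = x * y        (multiply node)
#   v = x + y        (add node)
#   c = u * v        (multiply node)
# x and y each feed two paths, so their gradients must be summed.
x, y = 2.0, 3.0

# Forward pass (left to right): compute values at each node
u = x * y
v = x + y
c = u * v

# Backward pass (right to left): multiply the incoming gradient
# by each node's local derivative and pass it upstream.
dc_dc = 1.0
dc_du = dc_dc * v            # local derivative of c = u * v w.r.t. u
dc_dv = dc_dc * u            # local derivative of c = u * v w.r.t. v
# Branch points: sum the gradients arriving from both paths
dc_dx = dc_du * y + dc_dv * 1.0   # via u = x*y and via v = x+y
dc_dy = dc_du * x + dc_dv * 1.0

# Finite-difference check on dc/dx
def graph(x_, y_):
    return (x_ * y_) * (x_ + y_)

eps = 1e-6
num_x = (graph(x + eps, y) - graph(x - eps, y)) / (2 * eps)
print(dc_dx, num_x)  # the two estimates should agree closely
```

One backward traversal produced the gradients for every input at once, which is exactly the property that makes backpropagation scale.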