Neural Network Mathematics
Backpropagation, vanishing/exploding gradients, batch normalization, and attention mechanism math.
Neural networks are compositions of differentiable functions, and their power comes from the ability to compute gradients of a scalar loss with respect to millions of parameters simultaneously. The mathematical machinery behind this, the chain rule of calculus applied systematically through a computation graph, is called backpropagation. Understanding this math is not optional for serious ML practitioners: debugging training failures, designing new architectures, and choosing the right normalization or activation strategy all require fluency in how gradients flow through network layers.
This notebook builds the core mathematical concepts layer by layer: from the chain rule in backpropagation, through the gradient pathologies that plagued early deep networks, to the innovations (ReLU, residual connections, batch normalization, attention) that made modern deep learning possible. Each concept is implemented from scratch in NumPy so you can see exactly what frameworks like PyTorch compute behind the scenes.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
1. Backpropagation: Chain Rule in Action
Computing gradients efficiently through a computation graph
Backpropagation is the algorithm that makes training neural networks tractable. The goal is to compute \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) in the network, so gradient descent can update them. Naively computing each partial derivative independently would be prohibitively expensive, but the chain rule allows us to reuse intermediate computations by propagating gradients backward through the network.
For a simple two-layer network with input \(x\), hidden layer \(h\), and output \(\hat{y}\):
Forward pass computes predictions: $$h = \sigma(W_1 x + b_1), \quad \hat{y} = W_2 h + b_2, \quad L = \frac{1}{2}(\hat{y} - y)^2$$
Backward pass computes gradients via the chain rule: $$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W_2} = (\hat{y} - y) \cdot h^T$$
Each layer only needs the gradient from the layer above it (the "upstream gradient") and its own local derivative. This modularity is why frameworks like PyTorch can automatically differentiate arbitrary computation graphs: each operation defines a forward and backward method, and the chain rule glues them together.
# Implement backpropagation from scratch
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
s = sigmoid(x)
return s * (1 - s)
class SimpleNN:
def __init__(self, input_size, hidden_size, output_size):
# Initialize weights
self.W1 = np.random.randn(input_size, hidden_size) * 0.1
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * 0.1
self.b2 = np.zeros((1, output_size))
def forward(self, X):
"""Forward pass - store intermediate values for backprop"""
self.z1 = X @ self.W1 + self.b1
self.h = sigmoid(self.z1)
self.z2 = self.h @ self.W2 + self.b2
self.y_pred = self.z2 # Linear output
return self.y_pred
def backward(self, X, y, learning_rate=0.01):
"""Backward pass - compute gradients and update weights"""
m = X.shape[0]
# Output layer gradients
dL_dypred = self.y_pred - y # Gradient of MSE
dL_dW2 = (self.h.T @ dL_dypred) / m
dL_db2 = np.sum(dL_dypred, axis=0, keepdims=True) / m
# Hidden layer gradients (chain rule!)
dL_dh = dL_dypred @ self.W2.T
dL_dz1 = dL_dh * sigmoid_derivative(self.z1)
dL_dW1 = (X.T @ dL_dz1) / m
dL_db1 = np.sum(dL_dz1, axis=0, keepdims=True) / m
# Update weights
self.W2 -= learning_rate * dL_dW2
self.b2 -= learning_rate * dL_db2
self.W1 -= learning_rate * dL_dW1
self.b1 -= learning_rate * dL_db1
return dL_dW1, dL_dW2
def loss(self, y_pred, y):
return np.mean((y_pred - y)**2)
# Test on simple data
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([[0], [1], [1], [0]]) # XOR problem
nn = SimpleNN(input_size=2, hidden_size=4, output_size=1)
# Train
losses = []
for epoch in range(5000):
y_pred = nn.forward(X_train)
loss = nn.loss(y_pred, y_train)
losses.append(loss)
nn.backward(X_train, y_train, learning_rate=0.5)
if epoch % 1000 == 0:
print(f"Epoch {epoch}, Loss: {loss:.4f}")
print("\nFinal predictions:")
print(nn.forward(X_train))
# Visualize learning
plt.figure(figsize=(10, 6))
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Training Loss (XOR Problem)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
print("Backpropagation successfully learns non-linear XOR function!")
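A standard sanity check for a hand-written backward pass is numerical gradient checking: compare the analytic gradient against a central finite difference. Below is a minimal illustrative sketch (the `numerical_gradient` helper is not part of the class above) on a small least-squares loss:

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of df/dw for a scalar function f."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        orig = w[idx]
        w[idx] = orig + eps
        f_plus = f()
        w[idx] = orig - eps
        f_minus = f()
        w[idx] = orig                     # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Check the analytic gradient of L = 0.5 * ||W x - y||^2 with respect to W,
# which the chain rule gives as dL/dW = (W x - y) x^T
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=(2,))
y = rng.normal(size=(3,))

loss = lambda: 0.5 * np.sum((W @ x - y) ** 2)
analytic = np.outer(W @ x - y, x)
numeric = numerical_gradient(loss, W)
rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
print(f"max relative error: {rel_err:.1e}")   # tiny: the backprop formula is correct
```

If the relative error is not far below 1, the analytic gradient is wrong; this is the same check you would run against `SimpleNN.backward` after any modification.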
2. Vanishing Gradients Problem
Problem: In deep networks, gradients can become extremely small
Why? The chain rule multiplies many per-layer factors. Writing \(h_k\) for the activations of layer \(k\) in an \(n\)-layer network: $$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_n} \cdot \frac{\partial h_n}{\partial h_{n-1}} \cdots \frac{\partial h_2}{\partial h_1}$$
For sigmoid: \(\sigma'(x) \leq 0.25\), so gradients shrink exponentially with depth!
Solutions:
ReLU activation (derivative = 1 for x > 0)
Residual connections
Batch normalization
Careful initialization
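The "careful initialization" point can be made concrete. If weights are drawn with variance \(2/\text{fan\_in}\) (He initialization, matched to ReLU), the scale of activations stays roughly constant with depth, whereas a fixed small scale makes them collapse layer by layer. A rough simulation (layer width, depth, and batch size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)

def activation_scale(weight_scale_fn, n_layers=30, width=256):
    """Push a batch through random ReLU layers; track activation RMS per layer."""
    x = rng.normal(size=(512, width))
    scales = []
    for _ in range(n_layers):
        W = rng.normal(size=(width, width)) * weight_scale_fn(width)
        x = np.maximum(0, x @ W)                  # ReLU layer
        scales.append(np.sqrt(np.mean(x ** 2)))   # RMS of the activations
    return scales

naive = activation_scale(lambda fan_in: 0.01)                  # fixed small scale
he = activation_scale(lambda fan_in: np.sqrt(2.0 / fan_in))    # He initialization

print(f"fixed 0.01 scale, layer 30 RMS: {naive[-1]:.1e}")  # collapses toward zero
print(f"He init,          layer 30 RMS: {he[-1]:.2f}")     # stays O(1)
```

Since the backward pass multiplies by the same weight matrices, collapsing forward activations go hand in hand with vanishing gradients; He initialization keeps both in a healthy range.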
# Demonstrate vanishing gradients
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
# Simulate gradient flow through deep network
def gradient_flow(n_layers, activation='sigmoid'):
"""Simulate how gradient magnitude changes with depth"""
gradient = 1.0 # Start with gradient of 1
gradients = [gradient]
for layer in range(n_layers):
# Simulate activation derivative
if activation == 'sigmoid':
# Average derivative of sigmoid is about 0.25
gradient *= 0.25
elif activation == 'tanh':
# Average derivative of tanh is about 0.4
gradient *= 0.4
elif activation == 'relu':
# ReLU derivative is 0 or 1; average ~0.5 (if half neurons active)
gradient *= 0.5
# Also multiply by weight derivative (assume ~0.5)
gradient *= 0.5
gradients.append(gradient)
return gradients
# Compare activations
n_layers = 20
sigmoid_grads = gradient_flow(n_layers, 'sigmoid')
tanh_grads = gradient_flow(n_layers, 'tanh')
relu_grads = gradient_flow(n_layers, 'relu')
plt.figure(figsize=(12, 6))
plt.plot(sigmoid_grads, 'o-', label='Sigmoid', linewidth=2, markersize=5)
plt.plot(tanh_grads, 's-', label='Tanh', linewidth=2, markersize=5)
plt.plot(relu_grads, '^-', label='ReLU', linewidth=2, markersize=5)
plt.xlabel('Layer Depth', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Gradient Flow in Deep Networks', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
print("Sigmoid/Tanh: Gradients vanish exponentially with depth!")
print("ReLU: Better gradient flow, but still degrades")
# Visualize activation function derivatives
x = np.linspace(-5, 5, 1000)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Sigmoid
axes[0, 0].plot(x, sigmoid(x), linewidth=2)
axes[0, 0].set_title('Sigmoid', fontsize=12)
axes[0, 0].grid(True, alpha=0.3)
axes[1, 0].plot(x, sigmoid_derivative(x), linewidth=2, color='red')
axes[1, 0].set_title('Sigmoid Derivative (max = 0.25)', fontsize=12)
axes[1, 0].axhline(y=0.25, color='k', linestyle='--', alpha=0.5)
axes[1, 0].grid(True, alpha=0.3)
# Tanh
axes[0, 1].plot(x, np.tanh(x), linewidth=2)
axes[0, 1].set_title('Tanh', fontsize=12)
axes[0, 1].grid(True, alpha=0.3)
axes[1, 1].plot(x, 1 - np.tanh(x)**2, linewidth=2, color='red')
axes[1, 1].set_title('Tanh Derivative (max = 1)', fontsize=12)
axes[1, 1].axhline(y=1, color='k', linestyle='--', alpha=0.5)
axes[1, 1].grid(True, alpha=0.3)
# ReLU
axes[0, 2].plot(x, relu(x), linewidth=2)
axes[0, 2].set_title('ReLU', fontsize=12)
axes[0, 2].grid(True, alpha=0.3)
axes[1, 2].plot(x, relu_derivative(x), linewidth=2, color='red')
axes[1, 2].set_title('ReLU Derivative (0 or 1)', fontsize=12)
axes[1, 2].axhline(y=1, color='k', linestyle='--', alpha=0.5)
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("ReLU has constant derivative of 1 for x > 0")
print("→ Better gradient flow in deep networks!")
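The mirror image of vanishing is exploding gradients: when the per-layer factors exceed 1, the product grows exponentially instead of shrinking. Gradient clipping (mentioned under Next Steps) is the standard mitigation. A minimal sketch in the spirit of the simulation above:

```python
import numpy as np

def gradient_through_depth(factor, n_layers=30):
    """Product of per-layer gradient factors, as in gradient_flow above."""
    g = 1.0
    grads = [g]
    for _ in range(n_layers):
        g *= factor
        grads.append(g)
    return grads

exploding = gradient_through_depth(1.5)   # per-layer factor > 1 -> blow-up
print(f"gradient after 30 layers: {exploding[-1]:.1e}")   # ~1.9e5

def clip_by_norm(grad, max_norm=5.0):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([300.0, -400.0])             # norm 500, far too large for a stable step
clipped = clip_by_norm(g)
print(clipped, np.linalg.norm(clipped))   # direction preserved, norm capped at 5
```

Clipping does not fix the underlying instability, but it keeps individual updates bounded so training can survive occasional large gradients.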
3. Residual Connections (ResNets)
Skip connections that let gradients flow freely through deep networks
The key insight of Residual Networks (He et al., 2015) is deceptively simple: instead of asking a layer to learn a desired mapping \(H(x)\), ask it to learn the residual \(F(x) = H(x) - x\), and add the input back: $$y = F(x) + x$$
This changes the gradient dynamics fundamentally. During backpropagation: $$\frac{\partial y}{\partial x} = F'(x) + 1$$
The "+1" term is the critical innovation: it creates a "gradient highway" that allows gradients to flow directly from later layers to earlier ones, bypassing any vanishing that might occur through \(F'(x)\). Even if the learned function \(F\) has near-zero gradients, the identity shortcut ensures the gradient signal survives. This is why ResNets can be trained with 100+ layers while plain networks of the same depth fail to converge. The idea has become so fundamental that nearly every modern architecture, from Transformers to diffusion models, incorporates some form of residual connection.
# Compare gradient flow: Standard vs Residual
def gradient_flow_residual(n_layers):
"""Gradient flow with residual connections"""
gradient = 1.0
gradients = [gradient]
for layer in range(n_layers):
# With residual: derivative includes +1
layer_derivative = 0.25 # F'(x) from activation
gradient *= (layer_derivative + 1) # Residual connection!
gradients.append(gradient)
return gradients
n_layers = 50
standard_grads = gradient_flow(n_layers, 'sigmoid')
residual_grads = gradient_flow_residual(n_layers)
plt.figure(figsize=(12, 6))
plt.plot(standard_grads, 'o-', label='Standard Network', linewidth=2, markersize=4)
plt.plot(residual_grads, 's-', label='ResNet (with skip connections)', linewidth=2, markersize=4)
plt.xlabel('Layer Depth', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Gradient Flow: Standard vs Residual Networks', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
print("Residual connections prevent vanishing gradients!")
print("→ Enables training of very deep networks (100+ layers)")
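The effect of the skip connection can also be checked on an actual Jacobian rather than scalar factors. For a toy residual block \(y = x + \sigma(Wx)\) with near-zero weights, the Jacobian is approximately the identity, so gradients pass through undiminished; without the skip, the same layer nearly kills them. A small illustrative sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d)) * 1e-3     # near-zero weights: F contributes almost nothing
x = rng.normal(size=(d,))

# Residual block y = x + sigmoid(W x); its Jacobian is dy/dx = I + diag(sigma') W
z = W @ x
s_prime = sigmoid(z) * (1 - sigmoid(z))
J_plain = np.diag(s_prime) @ W         # same layer without the skip connection
J_residual = np.eye(d) + J_plain       # the "+1" shows up as the identity matrix

print(f"plain    ||J|| = {np.linalg.norm(J_plain):.1e}")    # tiny: gradient vanishes
print(f"residual ||J|| = {np.linalg.norm(J_residual):.2f}") # ~||I||: gradient survives
```

The scalar "+1" in the derivation becomes an identity matrix in the vector case, which is exactly the "gradient highway" described above.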
4. Batch Normalization
Problem: Internal covariate shift - layer inputs change during training
Solution: Normalize layer inputs to have mean=0, variance=1
Algorithm (for mini-batch):
Compute mean: \(\mu_B = \frac{1}{m} \sum_{i=1}^m x_i\)
Compute variance: \(\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2\)
Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)
Scale and shift: \(y_i = \gamma \hat{x}_i + \beta\)
Benefits: Faster training, higher learning rates, regularization effect
def batch_norm(X, gamma=1, beta=0, epsilon=1e-8):
"""Batch normalization"""
# Compute statistics
mean = np.mean(X, axis=0)
var = np.var(X, axis=0)
# Normalize
X_norm = (X - mean) / np.sqrt(var + epsilon)
# Scale and shift
X_out = gamma * X_norm + beta
return X_out, X_norm, mean, var
# Example: layer activations before and after batch norm
np.random.seed(42)
activations = np.random.randn(100, 3) * 5 + 10 # Mean~10, std~5
activations_bn, _, _, _ = batch_norm(activations)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Before batch norm
axes[0].hist(activations[:, 0], bins=30, alpha=0.7, label='Feature 1')
axes[0].hist(activations[:, 1], bins=30, alpha=0.7, label='Feature 2')
axes[0].hist(activations[:, 2], bins=30, alpha=0.7, label='Feature 3')
axes[0].set_title('Before Batch Normalization', fontsize=14)
axes[0].set_xlabel('Value', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# After batch norm
axes[1].hist(activations_bn[:, 0], bins=30, alpha=0.7, label='Feature 1')
axes[1].hist(activations_bn[:, 1], bins=30, alpha=0.7, label='Feature 2')
axes[1].hist(activations_bn[:, 2], bins=30, alpha=0.7, label='Feature 3')
axes[1].set_title('After Batch Normalization', fontsize=14)
axes[1].set_xlabel('Value', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Before BN:")
print(f" Mean: {np.mean(activations, axis=0)}")
print(f" Std: {np.std(activations, axis=0)}")
print("\nAfter BN:")
print(f" Mean: {np.mean(activations_bn, axis=0)}")
print(f" Std: {np.std(activations_bn, axis=0)}")
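One practical detail the forward pass above omits: at inference time there may be no batch to compute statistics from (batch size 1, or streaming input), so frameworks maintain exponential moving averages of the batch statistics during training and use those at test time. A minimal sketch of this train/inference split (class name and momentum value are illustrative, loosely following common framework conventions):

```python
import numpy as np

class BatchNorm1d:
    """Batch normalization with running statistics for inference (a sketch)."""
    def __init__(self, n_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(n_features)       # learnable scale
        self.beta = np.zeros(n_features)       # learnable shift
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training=True):
        if training:
            mean, var = X.mean(axis=0), X.var(axis=0)
            # Exponential moving average of the batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        X_hat = (X - mean) / np.sqrt(var + self.eps)
        return self.gamma * X_hat + self.beta

rng = np.random.default_rng(42)
bn = BatchNorm1d(3)
for _ in range(200):   # "training" loop: accumulate running statistics
    bn(rng.normal(loc=10.0, scale=5.0, size=(64, 3)), training=True)

x_single = rng.normal(loc=10.0, scale=5.0, size=(1, 3))
y = bn(x_single, training=False)   # batch of 1 works: uses running stats
print(bn.running_mean.round(1), bn.running_var.round(1))   # ~[10 10 10], ~[25 25 25]
```

This is also why calling `model.eval()` in PyTorch changes BatchNorm behavior: it switches from batch statistics to the accumulated running statistics.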
5. Attention Mechanism (Simplified)
Core idea: Focus on relevant parts of input
Scaled Dot-Product Attention: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Where:
Q = Queries (what we're looking for)
K = Keys (what each position represents)
V = Values (actual information)
\(d_k\) = key dimension; dividing by \(\sqrt{d_k}\) keeps the dot products from growing with dimension and saturating the softmax
def softmax(x, axis=-1):
"""Numerically stable softmax"""
exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
"""Compute scaled dot-product attention"""
d_k = K.shape[-1]
# Compute attention scores
scores = Q @ K.T / np.sqrt(d_k)
# Apply softmax
attention_weights = softmax(scores)
# Weighted sum of values
output = attention_weights @ V
return output, attention_weights
# Example: Simple sequence
sequence_length = 5
d_model = 4
# Create random Q, K, V matrices
np.random.seed(42)
Q = np.random.randn(sequence_length, d_model)
K = np.random.randn(sequence_length, d_model)
V = np.random.randn(sequence_length, d_model)
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("Input shapes:")
print(f" Q: {Q.shape}")
print(f" K: {K.shape}")
print(f" V: {V.shape}")
print(f"\nAttention weights shape: {attention_weights.shape}")
print(f"Output shape: {output.shape}")
# Visualize attention weights
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights, annot=True, fmt='.3f', cmap='YlOrRd',
xticklabels=range(1, sequence_length+1),
yticklabels=range(1, sequence_length+1))
plt.xlabel('Key Position', fontsize=12)
plt.ylabel('Query Position', fontsize=12)
plt.title('Attention Weights Matrix', fontsize=14)
plt.show()
print("\nEach row sums to 1 (softmax property)")
print(f"Row sums: {np.sum(attention_weights, axis=1)}")
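In autoregressive models such as GPT-style decoders, position \(i\) must not attend to later positions \(j > i\). This is enforced with a causal mask: the forbidden scores are set to \(-\infty\) before the softmax, so their attention weights become exactly zero. A self-contained sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                # forbid future positions
    weights = softmax(scores)                               # exp(-inf) = 0
    return weights @ V, weights

rng = np.random.default_rng(42)
T, d = 5, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
_, w = causal_attention(Q, K, V)
print(np.triu(w, k=1).sum())   # 0.0: no attention to the future
print(w[0])                    # first position can only attend to itself
```

Each row still sums to 1, but all mass above the diagonal is gone; row 0 is forced to `[1, 0, 0, 0, 0]` because it has only one allowed position.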
6. Convolution Mathematics
Local pattern detection with weight sharing
Convolution is the mathematical operation at the heart of CNNs, enabling networks to detect local spatial patterns (edges, textures, shapes) while keeping the parameter count manageable through weight sharing: the same filter is applied at every position in the input.
1D Convolution slides a kernel \(g\) across a signal \(f\): $$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \cdot g[n-m]$$
2D Convolution extends this to images, sliding a kernel \(K\) across an image \(I\): $$(I * K)[i,j] = \sum_m \sum_n I[i-m, j-n] \cdot K[m,n]$$
The output at each position is a weighted sum of the local neighborhood, where the weights are the kernel values. Different kernels detect different features: a vertical Sobel filter detects vertical edges, a Gaussian kernel performs smoothing, and a Laplacian detects regions of rapid intensity change. In a CNN, the network learns the kernel values through backpropagation, discovering whatever features are most useful for the task. (In practice, deep learning frameworks compute cross-correlation, i.e. they skip the kernel flip in the formula above; since the kernel is learned anyway, the distinction does not matter, and it is what the code below implements.) The property of translation equivariance, whereby a shifted input produces a correspondingly shifted output, makes convolutions naturally suited for visual data where objects can appear anywhere in the image.
# Simple 2D convolution
def convolve2d(image, kernel):
"""Naive 2D convolution (valid mode; no kernel flip, i.e. cross-correlation)"""
k_h, k_w = kernel.shape
i_h, i_w = image.shape
out_h = i_h - k_h + 1
out_w = i_w - k_w + 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
# Element-wise multiplication and sum
output[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
return output
# Create simple image
image = np.array([
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]
])
# Edge detection kernels
vertical_edge = np.array([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
]) # Sobel vertical
horizontal_edge = np.array([
[-1, -2, -1],
[0, 0, 0],
[1, 2, 1]
]) # Sobel horizontal
# Apply convolutions
vertical_features = convolve2d(image, vertical_edge)
horizontal_features = convolve2d(image, horizontal_edge)
# Visualize
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original Image', fontsize=12)
axes[0].axis('off')
axes[1].imshow(vertical_edge, cmap='RdBu', vmin=-2, vmax=2)
axes[1].set_title('Vertical Edge Kernel', fontsize=12)
axes[1].axis('off')
axes[2].imshow(vertical_features, cmap='gray')
axes[2].set_title('Vertical Features', fontsize=12)
axes[2].axis('off')
axes[3].imshow(horizontal_features, cmap='gray')
axes[3].set_title('Horizontal Features', fontsize=12)
axes[3].axis('off')
plt.tight_layout()
plt.show()
print(f"Input shape: {image.shape}")
print(f"Kernel shape: {vertical_edge.shape}")
print(f"Output shape: {vertical_features.shape}")
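Real convolutional layers also use zero padding and stride; the output size follows \(\lfloor (n + 2p - k)/s \rfloor + 1\) per spatial dimension. A sketch extending the naive implementation above (like it, this computes cross-correlation):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D cross-correlation with stride and zero padding (a sketch)."""
    if padding:
        image = np.pad(image, padding)            # zero-pad all four borders
    k_h, k_w = kernel.shape
    i_h, i_w = image.shape
    out_h = (i_h - k_h) // stride + 1             # floor((n + 2p - k)/s) + 1
    out_w = (i_w - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + k_h, c:c + k_w] * kernel)
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
print(conv2d(img, k).shape)             # (4, 4): (6 - 3)/1 + 1
print(conv2d(img, k, stride=2).shape)   # (2, 2): (6 - 3)//2 + 1
print(conv2d(img, k, padding=1).shape)  # (6, 6): "same" padding for a 3x3 kernel
```

Padding of \((k-1)/2\) with stride 1 preserves the spatial size ("same" convolution), which is why 3x3 kernels with padding 1 are ubiquitous in CNN architectures.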
Summary
Backpropagation: Chain rule for efficient gradient computation
Vanishing Gradients: Major problem in deep networks, mitigated by ReLU, ResNets, BatchNorm
Residual Connections: Skip connections enable training of very deep networks
Batch Normalization: Normalizes layer inputs, accelerates training
Attention: Focuses on relevant input parts, foundation of Transformers
Convolution: Local pattern detection with parameter sharing
Key Insights:
Modern architectures combine multiple techniques:
ResNets: Skip connections + BatchNorm
Transformers: Attention + LayerNorm
CNNs: Convolutions + BatchNorm + ReLU
Gradient flow is critical:
Deep networks need careful design
Skip connections preserve gradients
Normalization stabilizes training
Mathematical foundations matter:
Understanding the math enables debugging
Helps design new architectures
Informs hyperparameter choices
Next Steps:
Study Layer Normalization vs Batch Normalization
Explore Multi-Head Attention in detail
Learn about modern architectures (Vision Transformers, etc.)
Understand gradient clipping and accumulation