Neural Network Mathematics

Backpropagation, vanishing/exploding gradients, batch normalization, and attention mechanism math.

Neural networks are compositions of differentiable functions, and their power comes from the ability to compute gradients of a scalar loss with respect to millions of parameters simultaneously. The mathematical machinery behind this – the chain rule of calculus applied systematically through a computation graph – is called backpropagation. Understanding this math is not optional for serious ML practitioners: debugging training failures, designing new architectures, and choosing the right normalization or activation strategy all require fluency in how gradients flow through network layers.

This notebook builds the core mathematical concepts layer by layer: from the chain rule in backpropagation, through the gradient pathologies that plagued early deep networks, to the innovations (ReLU, residual connections, batch normalization, attention) that made modern deep learning possible. Each concept is implemented from scratch in NumPy so you can see exactly what frameworks like PyTorch compute behind the scenes.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

1. Backpropagation: Chain Rule in Action

Computing gradients efficiently through a computation graph

Backpropagation is the algorithm that makes training neural networks tractable. The goal is to compute \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) in the network, so gradient descent can update them. Naively computing each partial derivative independently would be prohibitively expensive, but the chain rule allows us to reuse intermediate computations by propagating gradients backward through the network.

For a simple two-layer network with input \(x\), hidden layer \(h\), and output \(\hat{y}\):

Forward pass computes predictions:

\[h = \sigma(W_1 x + b_1), \quad \hat{y} = W_2 h + b_2, \quad L = \frac{1}{2}(\hat{y} - y)^2\]

Backward pass computes gradients via the chain rule:

\[\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W_2} = (\hat{y} - y) \cdot h^T\]

\[\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}\]

Each layer only needs the gradient from the layer above it (the “upstream gradient”) and its own local derivative. This modularity is why frameworks like PyTorch can automatically differentiate arbitrary computation graphs – each operation defines a forward and backward method, and the chain rule glues them together.
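That modularity can be sketched with two toy operation classes (the names `Multiply` and `Square` are illustrative, not any framework's API): each caches what it needs during `forward`, and in `backward` multiplies the upstream gradient by its local derivative.

```python
class Multiply:
    """Toy op: forward caches inputs, backward applies the local derivative."""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, upstream):
        # d(ab)/da = b, d(ab)/db = a; chain rule multiplies by the upstream gradient
        return upstream * self.b, upstream * self.a

class Square:
    def forward(self, x):
        self.x = x
        return x ** 2
    def backward(self, upstream):
        return upstream * 2 * self.x

# Compose L = (a*b)^2 and backpropagate; compare with dL/da = 2ab^2, dL/db = 2a^2 b
mul, sq = Multiply(), Square()
a, b = 3.0, 4.0
L = sq.forward(mul.forward(a, b))   # 144.0
dL_dprod = sq.backward(1.0)         # upstream gradient into the multiply node
dL_da, dL_db = mul.backward(dL_dprod)
print(L, dL_da, dL_db)              # 144.0 96.0 72.0
```

Calling `backward` in reverse order of the forward calls is exactly what an autodiff engine automates over a full computation graph.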

# Implement backpropagation from scratch
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

class SimpleNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))
        
    def forward(self, X):
        """Forward pass - store intermediate values for backprop"""
        self.z1 = X @ self.W1 + self.b1
        self.h = sigmoid(self.z1)
        self.z2 = self.h @ self.W2 + self.b2
        self.y_pred = self.z2  # Linear output
        return self.y_pred
    
    def backward(self, X, y, learning_rate=0.01):
        """Backward pass - compute gradients and update weights"""
        m = X.shape[0]
        
        # Output layer gradients
        dL_dypred = self.y_pred - y  # Gradient of MSE
        dL_dW2 = (self.h.T @ dL_dypred) / m
        dL_db2 = np.sum(dL_dypred, axis=0, keepdims=True) / m
        
        # Hidden layer gradients (chain rule!)
        dL_dh = dL_dypred @ self.W2.T
        dL_dz1 = dL_dh * sigmoid_derivative(self.z1)
        dL_dW1 = (X.T @ dL_dz1) / m
        dL_db1 = np.sum(dL_dz1, axis=0, keepdims=True) / m
        
        # Update weights
        self.W2 -= learning_rate * dL_dW2
        self.b2 -= learning_rate * dL_db2
        self.W1 -= learning_rate * dL_dW1
        self.b1 -= learning_rate * dL_db1
        
        return dL_dW1, dL_dW2
    
    def loss(self, y_pred, y):
        # 0.5 * MSE, so that dL/dy_pred = (y_pred - y) matches the backward pass
        return 0.5 * np.mean((y_pred - y)**2)

# Test on simple data
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([[0], [1], [1], [0]])  # XOR problem

nn = SimpleNN(input_size=2, hidden_size=4, output_size=1)

# Train
losses = []
for epoch in range(5000):
    y_pred = nn.forward(X_train)
    loss = nn.loss(y_pred, y_train)
    losses.append(loss)
    nn.backward(X_train, y_train, learning_rate=0.5)
    
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

print("\nFinal predictions:")
print(nn.forward(X_train))
# Visualize learning
plt.figure(figsize=(10, 6))
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Training Loss (XOR Problem)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("Backpropagation successfully learns non-linear XOR function!")
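A standard sanity check for any hand-written backward pass is to compare analytic gradients against central finite differences. The sketch below repeats the same chain-rule computation as `SimpleNN.backward` on a small standalone network (the shapes and the ½·MSE loss are assumptions chosen to match the math above) and checks every entry of the `W1` gradient:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
y = rng.normal(size=(4, 1))
W1, b1 = rng.normal(size=(2, 3)) * 0.5, np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)) * 0.5, np.zeros((1, 1))

def loss_fn(W1_val):
    h = sigmoid(X @ W1_val + b1)
    return 0.5 * np.mean((h @ W2 + b2 - y) ** 2)

# Analytic gradient: same chain rule as SimpleNN.backward
m = X.shape[0]
z1 = X @ W1 + b1
h = sigmoid(z1)
y_pred = h @ W2 + b2
dL_dypred = (y_pred - y) / m
dL_dz1 = (dL_dypred @ W2.T) * h * (1 - h)
dL_dW1 = X.T @ dL_dz1

# Central-difference estimate of every entry of dL/dW1
eps = 1e-6
num_grad = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_grad[i, j] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)

rel_err = np.abs(num_grad - dL_dW1).max() / (np.abs(dL_dW1).max() + 1e-12)
print(f"max relative error: {rel_err:.2e}")  # should be tiny, ~1e-8 or below
```

A relative error near machine precision confirms the analytic gradient; an error near 1 usually means a missing chain-rule term or a transposed matrix.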

2. Vanishing Gradients Problem

Problem: In deep networks, gradients can become extremely small

Why? The chain rule multiplies many small derivatives through the layers. Writing \(h_k\) for the activations of layer \(k\):

\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_n} \cdot \frac{\partial h_n}{\partial h_{n-1}} \cdot \ldots \cdot \frac{\partial h_2}{\partial h_1}\]

For sigmoid: \(\sigma'(x) \leq 0.25\), so gradients shrink exponentially with depth!

Solutions:

  • ReLU activation (derivative = 1 for x > 0)

  • Residual connections

  • Batch normalization

  • Careful initialization
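The "careful initialization" bullet can be made concrete: Xavier/Glorot initialization scales weights by \(1/\sqrt{n_{in}}\) (suited to sigmoid/tanh), He initialization by \(\sqrt{2/n_{in}}\) (suited to ReLU), so that activation variance stays roughly constant across layers. A quick simulation (the layer width, depth, and batch size here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, depth = 256, 20

def output_std(init_std, activation):
    """Push a random batch through `depth` linear+activation layers, return final std."""
    x = rng.normal(size=(512, n))
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * init_std
        x = np.maximum(0, x @ W) if activation == 'relu' else np.tanh(x @ W)
    return x.std()

std_tanh_xavier = output_std(1 / np.sqrt(n), 'tanh')   # Xavier: 1/sqrt(n_in)
std_relu_he     = output_std(np.sqrt(2 / n), 'relu')   # He: sqrt(2/n_in)
std_relu_naive  = output_std(0.01, 'relu')             # naive small init

print(f"tanh + Xavier: {std_tanh_xavier:.3f}")  # stays a healthy order of magnitude
print(f"ReLU + He:     {std_relu_he:.3f}")      # variance roughly preserved
print(f"ReLU + 0.01:   {std_relu_naive:.2e}")   # activations collapse toward zero
```

With the naive small init the signal (and, by symmetry, the gradient) underflows long before layer 20 – the same vanishing behavior the simulation below models analytically.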

# Demonstrate vanishing gradients
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Simulate gradient flow through deep network
def gradient_flow(n_layers, activation='sigmoid'):
    """Simulate how gradient magnitude changes with depth"""
    gradient = 1.0  # Start with gradient of 1
    gradients = [gradient]
    
    for layer in range(n_layers):
        # Simulate activation derivative
        if activation == 'sigmoid':
            # Average derivative of sigmoid is about 0.25
            gradient *= 0.25
        elif activation == 'tanh':
            # Average derivative of tanh is about 0.4
            gradient *= 0.4
        elif activation == 'relu':
            # ReLU derivative is 0 or 1; average ~0.5 (if half neurons active)
            gradient *= 0.5
        
        # Also multiply by weight derivative (assume ~0.5)
        gradient *= 0.5
        gradients.append(gradient)
    
    return gradients

# Compare activations
n_layers = 20
sigmoid_grads = gradient_flow(n_layers, 'sigmoid')
tanh_grads = gradient_flow(n_layers, 'tanh')
relu_grads = gradient_flow(n_layers, 'relu')

plt.figure(figsize=(12, 6))
plt.plot(sigmoid_grads, 'o-', label='Sigmoid', linewidth=2, markersize=5)
plt.plot(tanh_grads, 's-', label='Tanh', linewidth=2, markersize=5)
plt.plot(relu_grads, '^-', label='ReLU', linewidth=2, markersize=5)
plt.xlabel('Layer Depth', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Gradient Flow in Deep Networks', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("Sigmoid/Tanh: Gradients vanish exponentially with depth!")
print("ReLU: Better gradient flow, but still degrades")
# Visualize activation function derivatives
x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Sigmoid
axes[0, 0].plot(x, sigmoid(x), linewidth=2)
axes[0, 0].set_title('Sigmoid', fontsize=12)
axes[0, 0].grid(True, alpha=0.3)

axes[1, 0].plot(x, sigmoid_derivative(x), linewidth=2, color='red')
axes[1, 0].set_title('Sigmoid Derivative (max = 0.25)', fontsize=12)
axes[1, 0].axhline(y=0.25, color='k', linestyle='--', alpha=0.5)
axes[1, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(x, np.tanh(x), linewidth=2)
axes[0, 1].set_title('Tanh', fontsize=12)
axes[0, 1].grid(True, alpha=0.3)

axes[1, 1].plot(x, 1 - np.tanh(x)**2, linewidth=2, color='red')
axes[1, 1].set_title('Tanh Derivative (max = 1)', fontsize=12)
axes[1, 1].axhline(y=1, color='k', linestyle='--', alpha=0.5)
axes[1, 1].grid(True, alpha=0.3)

# ReLU
axes[0, 2].plot(x, relu(x), linewidth=2)
axes[0, 2].set_title('ReLU', fontsize=12)
axes[0, 2].grid(True, alpha=0.3)

axes[1, 2].plot(x, relu_derivative(x), linewidth=2, color='red')
axes[1, 2].set_title('ReLU Derivative (0 or 1)', fontsize=12)
axes[1, 2].axhline(y=1, color='k', linestyle='--', alpha=0.5)
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("ReLU has constant derivative of 1 for x > 0")
print("→ Better gradient flow in deep networks!")

3. Residual Connections (ResNets)

Skip connections that let gradients flow freely through deep networks

The key insight of Residual Networks (He et al., 2015) is deceptively simple: instead of asking a layer to learn a desired mapping \(H(x)\), ask it to learn the residual \(F(x) = H(x) - x\), and add the input back:

\[y = F(x) + x\]

This changes the gradient dynamics fundamentally. During backpropagation:

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(\frac{\partial F}{\partial x} + 1\right)\]

The “+1” term is the critical innovation – it creates a “gradient highway” that allows gradients to flow directly from later layers to earlier ones, bypassing any vanishing that might occur through \(F'(x)\). Even if the learned function \(F\) has near-zero gradients, the identity shortcut ensures the gradient signal survives. This is why ResNets can be trained with 100+ layers while plain networks of the same depth fail to converge. The idea has become so fundamental that nearly every modern architecture – from Transformers to diffusion models – incorporates some form of residual connection.

# Compare gradient flow: Standard vs Residual
def gradient_flow_residual(n_layers):
    """Gradient flow with residual connections"""
    gradient = 1.0
    gradients = [gradient]
    
    for layer in range(n_layers):
        # With residual: derivative includes +1
        layer_derivative = 0.25  # F'(x) from activation
        gradient *= (layer_derivative + 1)  # Residual connection!
        gradients.append(gradient)
    
    return gradients

n_layers = 50
standard_grads = gradient_flow(n_layers, 'sigmoid')
residual_grads = gradient_flow_residual(n_layers)

plt.figure(figsize=(12, 6))
plt.plot(standard_grads, 'o-', label='Standard Network', linewidth=2, markersize=4)
plt.plot(residual_grads, 's-', label='ResNet (with skip connections)', linewidth=2, markersize=4)
plt.xlabel('Layer Depth', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Gradient Flow: Standard vs Residual Networks', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("Residual connections prevent vanishing gradients!")
print("→ Enables training of very deep networks (100+ layers)")
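The “+1” highway can also be verified on an actual residual block rather than a simulation. A minimal sketch with \(F(x) = \sigma(xW)\) (the single-layer \(F\) and the shapes are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d)) * 0.1
x = rng.normal(size=(1, d))

# Forward: y = F(x) + x with F(x) = sigmoid(x W)
z = x @ W
y = sigmoid(z) + x

# Backward: dL/dx = upstream @ dF/dx + upstream  (the "+1" identity path)
upstream = np.ones((1, d))                       # pretend dL/dy = 1 everywhere
s = sigmoid(z)
grad_through_F = (upstream * s * (1 - s)) @ W.T  # chain rule through F only
grad_x = grad_through_F + upstream               # identity path adds upstream directly

print("gradient through F:", np.linalg.norm(grad_through_F))
print("total gradient:    ", np.linalg.norm(grad_x))
```

Even if `grad_through_F` were driven to zero, `grad_x` would still equal `upstream` – the identity path never vanishes.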

4. Batch Normalization

Problem: Internal covariate shift - the distribution of layer inputs changes during training (this was the original motivation; later work suggests BN helps mainly by smoothing the optimization landscape)

Solution: Normalize layer inputs to have mean=0, variance=1

Algorithm (for mini-batch):

  1. Compute mean: \(\mu_B = \frac{1}{m} \sum_{i=1}^m x_i\)

  2. Compute variance: \(\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2\)

  3. Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)

  4. Scale and shift: \(y_i = \gamma \hat{x}_i + \beta\)

Benefits: Faster training, higher learning rates, regularization effect

def batch_norm(X, gamma=1, beta=0, epsilon=1e-8):
    """Batch normalization"""
    # Compute statistics
    mean = np.mean(X, axis=0)
    var = np.var(X, axis=0)
    
    # Normalize
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    
    # Scale and shift
    X_out = gamma * X_norm + beta
    
    return X_out, X_norm, mean, var

# Example: layer activations before and after batch norm
np.random.seed(42)
activations = np.random.randn(100, 3) * 5 + 10  # Mean~10, std~5

activations_bn, _, _, _ = batch_norm(activations)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Before batch norm
axes[0].hist(activations[:, 0], bins=30, alpha=0.7, label='Feature 1')
axes[0].hist(activations[:, 1], bins=30, alpha=0.7, label='Feature 2')
axes[0].hist(activations[:, 2], bins=30, alpha=0.7, label='Feature 3')
axes[0].set_title('Before Batch Normalization', fontsize=14)
axes[0].set_xlabel('Value', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# After batch norm
axes[1].hist(activations_bn[:, 0], bins=30, alpha=0.7, label='Feature 1')
axes[1].hist(activations_bn[:, 1], bins=30, alpha=0.7, label='Feature 2')
axes[1].hist(activations_bn[:, 2], bins=30, alpha=0.7, label='Feature 3')
axes[1].set_title('After Batch Normalization', fontsize=14)
axes[1].set_xlabel('Value', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Before BN:")
print(f"  Mean: {np.mean(activations, axis=0)}")
print(f"  Std: {np.std(activations, axis=0)}")
print("\nAfter BN:")
print(f"  Mean: {np.mean(activations_bn, axis=0)}")
print(f"  Std: {np.std(activations_bn, axis=0)}")
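Training through batch norm also requires its backward pass, which is subtler than the forward pass because \(\mu_B\) and \(\sigma_B^2\) depend on every sample in the batch. A sketch of the standard gradient formulas from the original derivation (variable names are my own):

```python
import numpy as np

def batch_norm_backward(dout, X, gamma, epsilon=1e-8):
    """Gradients of batch norm w.r.t. its input X and the learned gamma, beta."""
    m = X.shape[0]
    mean = X.mean(axis=0)
    var = X.var(axis=0)                 # population variance, matching the forward pass
    inv_std = 1.0 / np.sqrt(var + epsilon)
    X_hat = (X - mean) * inv_std

    dX_hat = dout * gamma
    # Standard compact form: the subtracted terms are the paths through mean and variance
    dX = (inv_std / m) * (m * dX_hat
                          - dX_hat.sum(axis=0)
                          - X_hat * (dX_hat * X_hat).sum(axis=0))
    dgamma = (dout * X_hat).sum(axis=0)
    dbeta = dout.sum(axis=0)
    return dX, dgamma, dbeta

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3)) * 2 + 5
gamma = rng.normal(size=3)
dout = rng.normal(size=(8, 3))          # pretend upstream gradient
dX, dgamma, dbeta = batch_norm_backward(dout, X, gamma)
print("dX shape:", dX.shape, " dgamma:", np.round(dgamma, 3))
```

Dropping the two subtracted terms (i.e. treating the batch statistics as constants) is a classic bug that a finite-difference check catches immediately.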

5. Attention Mechanism (Simplified)

Core idea: Focus on relevant parts of input

Scaled Dot-Product Attention:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

Where:

  • Q = Queries (what we’re looking for)

  • K = Keys (what each position represents)

  • V = Values (actual information)

  • \(d_k\) = key dimension; dividing by \(\sqrt{d_k}\) keeps the dot products from growing with dimension and saturating the softmax

def softmax(x, axis=-1):
    """Numerically stable softmax (subtract the max before exponentiating)"""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute scaled dot-product attention"""
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Apply softmax
    attention_weights = softmax(scores)
    
    # Weighted sum of values
    output = attention_weights @ V
    
    return output, attention_weights

# Example: Simple sequence
sequence_length = 5
d_model = 4

# Create random Q, K, V matrices
np.random.seed(42)
Q = np.random.randn(sequence_length, d_model)
K = np.random.randn(sequence_length, d_model)
V = np.random.randn(sequence_length, d_model)

output, attention_weights = scaled_dot_product_attention(Q, K, V)

print("Input shapes:")
print(f"  Q: {Q.shape}")
print(f"  K: {K.shape}")
print(f"  V: {V.shape}")
print(f"\nAttention weights shape: {attention_weights.shape}")
print(f"Output shape: {output.shape}")

# Visualize attention weights
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights, annot=True, fmt='.3f', cmap='YlOrRd',
           xticklabels=range(1, sequence_length+1),
           yticklabels=range(1, sequence_length+1))
plt.xlabel('Key Position', fontsize=12)
plt.ylabel('Query Position', fontsize=12)
plt.title('Attention Weights Matrix', fontsize=14)
plt.show()

print("\nEach row sums to 1 (softmax property)")
print(f"Row sums: {np.sum(attention_weights, axis=1)}")
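One practically important variant: in autoregressive Transformers (e.g. GPT-style decoders), each position may only attend to itself and earlier positions. This is done by setting the masked scores to \(-\infty\) before the softmax, which zeroes their weights. A sketch reusing the same attention math:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention where each query sees only positions <= its own."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True entries above the diagonal mark "future" positions to be masked out
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)  # exp(-inf) = 0 after the softmax shift
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(42)
T, d = 5, 4
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
out, w = causal_attention(Q, K, V)
print(np.round(w, 3))  # lower-triangular weights; the first row is [1, 0, 0, 0, 0]
```

Rows still sum to 1: the probability mass that would have gone to future positions is redistributed over the visible ones.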

6. Convolution Mathematics

Local pattern detection with weight sharing

Convolution is the mathematical operation at the heart of CNNs, enabling networks to detect local spatial patterns (edges, textures, shapes) while keeping the parameter count manageable through weight sharing – the same filter is applied at every position in the input.

1D Convolution slides a kernel \(g\) across a signal \(f\):

\[(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \cdot g[n-m]\]

2D Convolution extends this to images, sliding a kernel \(K\) across an image \(I\):

\[(I * K)[i,j] = \sum_m \sum_n I[i-m, j-n] \cdot K[m,n]\]

The output at each position is a weighted sum of the local neighborhood, where the weights are the kernel values. Different kernels detect different features: a vertical Sobel filter detects vertical edges, a Gaussian kernel performs smoothing, and a Laplacian detects regions of rapid intensity change. In a CNN, the network learns the kernel values through backpropagation, discovering whatever features are most useful for the task. The property of translation equivariance – a shifted input produces a correspondingly shifted output – makes convolutions naturally suited for visual data where objects can appear anywhere in the image.

# Simple 2D convolution
def convolve2d(image, kernel):
    """Naive 2D cross-correlation in 'valid' mode (no kernel flip) -
    this is what deep learning frameworks call 'convolution'"""
    k_h, k_w = kernel.shape
    i_h, i_w = image.shape
    
    out_h = i_h - k_h + 1
    out_w = i_w - k_w + 1
    output = np.zeros((out_h, out_w))
    
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication and sum
            output[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    
    return output

# Create simple image
image = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25]
])

# Edge detection kernels
vertical_edge = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
])  # Sobel vertical

horizontal_edge = np.array([
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1]
])  # Sobel horizontal

# Apply convolutions
vertical_features = convolve2d(image, vertical_edge)
horizontal_features = convolve2d(image, horizontal_edge)

# Visualize
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original Image', fontsize=12)
axes[0].axis('off')

axes[1].imshow(vertical_edge, cmap='RdBu', vmin=-2, vmax=2)
axes[1].set_title('Vertical Edge Kernel', fontsize=12)
axes[1].axis('off')

axes[2].imshow(vertical_features, cmap='gray')
axes[2].set_title('Vertical Features', fontsize=12)
axes[2].axis('off')

axes[3].imshow(horizontal_features, cmap='gray')
axes[3].set_title('Horizontal Features', fontsize=12)
axes[3].axis('off')

plt.tight_layout()
plt.show()

print(f"Input shape: {image.shape}")
print(f"Kernel shape: {vertical_edge.shape}")
print(f"Output shape: {vertical_features.shape}")
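The 5×5 → 3×3 shape arithmetic above generalizes: with input size \(n\), kernel size \(k\), zero padding \(p\), and stride \(s\), the output size is \(\lfloor (n + 2p - k)/s \rfloor + 1\). A sketch extending the naive convolution with padding and stride (the parameter names are mine):

```python
import numpy as np

def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

def convolve2d_strided(image, kernel, padding=0, stride=1):
    """Cross-correlation (as in CNNs) with zero padding and stride."""
    if padding:
        image = np.pad(image, padding)  # zero-pad all four sides
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r+k_h, c:c+k_w] * kernel)
    return out

img = np.arange(49, dtype=float).reshape(7, 7)
ker = np.ones((3, 3))
print(convolve2d_strided(img, ker).shape)                       # (5, 5): "valid"
print(convolve2d_strided(img, ker, padding=1).shape)            # (7, 7): "same"
print(convolve2d_strided(img, ker, padding=1, stride=2).shape)  # (4, 4)
```

Padding with \(p = (k-1)/2\) preserves the spatial size ("same" convolution), while stride \(s > 1\) downsamples, which is how CNNs trade spatial resolution for depth.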

Summary

✅ Backpropagation: Chain rule for efficient gradient computation

✅ Vanishing Gradients: Major problem in deep networks, solved by ReLU, ResNets, BatchNorm

✅ Residual Connections: Skip connections enable training of very deep networks

✅ Batch Normalization: Normalizes layer inputs, accelerates training

✅ Attention: Focuses on relevant input parts, foundation of Transformers

✅ Convolution: Local pattern detection with parameter sharing

Key Insights:

  1. Modern architectures combine multiple techniques:

    • ResNets: Skip connections + BatchNorm

    • Transformers: Attention + LayerNorm

    • CNNs: Convolutions + BatchNorm + ReLU

  2. Gradient flow is critical:

    • Deep networks need careful design

    • Skip connections preserve gradients

    • Normalization stabilizes training

  3. Mathematical foundations matter:

    • Understanding the math enables debugging

    • Helps design new architectures

    • Informs hyperparameter choices

Next Steps:

  • Study Layer Normalization vs Batch Normalization

  • Explore Multi-Head Attention in detail

  • Learn about modern architectures (Vision Transformers, etc.)

  • Understand gradient clipping and accumulation