Lecture 12: Backpropagation & Deep Learning

📹 Watch Lecture

From Andrew Ng’s CS229 Lecture 12 (Autumn 2018)

“We’re going to derive the backpropagation with the chain rule and after that, we’re going to talk about how to improve our neural networks. It’s not because you designed a neural network that it’s going to work - there’s a lot of hacks and tricks you need to know.” - TA

Quick Recap: Forward Propagation

The Neural Network:

  • Cat image flattened: x₁ to xₙ

  • Layer 1: 3 neurons (w₁, b₁)

  • Layer 2: 2 neurons (w₂, b₂)

  • Layer 3: 1 neuron (w₃, b₃)

  • Output: ŷ (probability of cat)

Batch Processing:

  • Forward propagate m examples at once

  • Why? Vectorization - use GPU parallelization

  • Each example has loss L⁽ⁱ⁾

  • Average loss = Cost function J

Loss Function: Binary Cross-Entropy

L = -[y log(ŷ) + (1 - y) log(1 - ŷ)]

Also called logistic loss function
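As a minimal sketch (toy prediction values, not from the lecture), the loss in NumPy:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-10):
    """L = -[y log(y_hat) + (1 - y) log(1 - y_hat)], clipped for numerical stability."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Confident correct prediction -> small loss; confident wrong prediction -> large loss
print(binary_cross_entropy(1, 0.9))   # ≈ 0.105
print(binary_cross_entropy(1, 0.1))   # ≈ 2.303
```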

Backpropagation Strategy

Key Insight:

“The relationship between w₃ and the cost is easier than the relationship between w₁ and the cost because w₁ had much more connections going through the network”

Approach:

  1. Start from output layer (w₃, b₃)

  2. Work backwards to layer 2 (w₂, b₂)

  3. Finally reach layer 1 (w₁, b₁)

  4. Use chain rule throughout

Gradient Descent Update:

wˡ := wˡ - α ∂J/∂wˡ
bˡ := bˡ - α ∂J/∂bˡ

Linearity of Derivatives

Important property:

∂J/∂w = (1/m) Σᵢ ∂L⁽ⁱ⁾/∂w

Derivative of average = average of derivatives
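A quick numerical sanity check of this property on a toy squared loss (values are illustrative):

```python
import numpy as np

# Toy per-example loss L_i(w) = (w - x_i)^2; cost J(w) = average of the L_i
rng = np.random.default_rng(0)
x = rng.normal(size=5)
w, h = 1.3, 1e-6

J = lambda w: np.mean((w - x) ** 2)

dJ_numeric = (J(w + h) - J(w - h)) / (2 * h)   # derivative of the average (central difference)
avg_dL = np.mean(2 * (w - x))                  # average of the per-example derivatives

print(np.isclose(dJ_numeric, avg_dL))   # True
```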

Topics Covered

  1. Backpropagation Derivation

    • Chain rule application

    • Computing gradients layer by layer

  2. Improving Neural Networks

    • Initialization strategies

    • Activation function choices

    • Regularization techniques

    • Optimization algorithms

    • Batch normalization

  3. Practical Tips

    • Debugging gradient computations

    • Monitoring training progress

    • Avoiding common pitfalls

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

7.1 Backpropagation Derivation

“We’re going to derive the backpropagation with the chain rule and after that, we’re going to talk about how to improve our neural networks. Because in practice, it’s not because you designed a neural network that it’s going to work, there’s a lot of hacks and tricks that you need to know.” - Andrew Ng

Review: The Training Process

Forward Propagation:

  • Feed m examples through network

  • Compute loss \(L^{(i)}\) for each example

  • Average to get cost: \(J = \frac{1}{m} \sum_{i=1}^m L^{(i)}\)

“Vectorization. We want to use what our GPU can do and parallelize the computation.” - Andrew Ng

Backward Propagation:

  • Compute gradients: \(\frac{\partial J}{\partial W^{[l]}}, \frac{\partial J}{\partial b^{[l]}}\) for all layers

  • Update parameters using gradient descent

Deriving Gradients for Layer 3 (Output Layer)

Binary cross-entropy loss: \(L^{(i)} = -y^{(i)} \log(\hat{y}^{(i)}) - (1-y^{(i)}) \log(1-\hat{y}^{(i)})\)

Gradient computation: \(\frac{\partial L}{\partial W^{[3]}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W^{[3]}}\)

Since \(\hat{y} = a^{[3]} = \sigma(Z^{[3]})\) and \(Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}\):

Step-by-step derivation:

  1. Derivative of log: \(\frac{\partial}{\partial \hat{y}}[\log(\hat{y})] = \frac{1}{\hat{y}}\)

  2. Derivative of sigmoid: \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)

  3. Derivative of linear part: \(\frac{\partial Z^{[3]}}{\partial W^{[3]}} = A^{[2]T}\)

Result: \(\frac{\partial L}{\partial W^{[3]}} = (\hat{y} - y) \cdot A^{[2]T} = (a^{[3]} - y) \cdot A^{[2]T}\)

“So it doesn’t look that bad actually. When we take a derivative of something kinda ugly we expect something ugly to come out but this doesn’t seem too bad.” - Andrew Ng

For the full batch (cost function): \(\frac{\partial J}{\partial W^{[3]}} = \frac{1}{m} \sum_{i=1}^m (a^{[3](i)} - y^{(i)}) \cdot a^{[2](i)T}\)

Backpropagating to Layer 2

“Now the question is how I’m gonna get this one without having too much work. I’m not gonna start over as we said last time, I’m going to use the chain rule of calculus.” - Andrew Ng

Chain rule decomposition: \(\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial a^{[3]}} \cdot \frac{\partial a^{[3]}}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}}\)

Key insight - reuse computations!

“How can we use this? How can we use the derivative we already have? We just calculated it. The first two terms here appear in the calculation for \(W^{[3]}\).” - Andrew Ng

We already computed: \(\frac{\partial L}{\partial Z^{[3]}} = \frac{\partial L}{\partial a^{[3]}} \cdot \frac{\partial a^{[3]}}{\partial Z^{[3]}} = (a^{[3]} - y)\)

Remaining terms:

  • \(\frac{\partial Z^{[3]}}{\partial A^{[2]}} = W^{[3]T}\) (transpose due to shape!)

  • \(\frac{\partial A^{[2]}}{\partial Z^{[2]}} = A^{[2]} \odot (1 - A^{[2]})\) (element-wise product; this assumes sigmoid activations in the hidden layer)

  • \(\frac{\partial Z^{[2]}}{\partial W^{[2]}} = A^{[1]T}\)

Result: \(\frac{\partial L}{\partial W^{[2]}} = \left[ W^{[3]T} (a^{[3]} - y) \odot A^{[2]} \odot (1 - A^{[2]}) \right] \cdot A^{[1]T}\)
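These results can be checked numerically. Below is a minimal sketch (toy layer sizes, random weights, a single example; none of it from the lecture) that computes both gradients and verifies one entry of \(\partial L / \partial W^{[3]}\) with a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y, a):
    return (-(y * np.log(a) + (1 - y) * np.log(1 - a))).item()

rng = np.random.default_rng(0)
n_x, n1, n2 = 4, 3, 2                      # toy layer sizes (assumed)
x = rng.normal(size=(n_x, 1))
y = np.array([[1.0]])
W1, W2, W3 = (rng.normal(size=s) for s in [(n1, n_x), (n2, n1), (1, n2)])
b1, b2, b3 = np.zeros((n1, 1)), np.zeros((n2, 1)), np.zeros((1, 1))

def forward(W3):
    A1 = sigmoid(W1 @ x + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    A3 = sigmoid(W3 @ A2 + b3)
    return A1, A2, A3

A1, A2, A3 = forward(W3)

dZ3 = A3 - y                           # dL/dZ^[3], the reusable term
dW3 = dZ3 @ A2.T                       # (a^[3] - y) A^[2]T
dZ2 = (W3.T @ dZ3) * A2 * (1 - A2)     # back through layer 2 (sigmoid)
dW2 = dZ2 @ A1.T

# Finite-difference check of one entry of dW3
h = 1e-6
W3p = W3.copy(); W3p[0, 0] += h
numeric = (bce(y, forward(W3p)[2]) - bce(y, A3)) / h
print(np.isclose(numeric, dW3[0, 0], atol=1e-4))   # True
```

This finite-difference comparison is exactly the gradient-checking habit listed under Practical Tips above.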

Why Backward Propagation?

“Why is it called backward propagation? It’s because we will start with the top layer, the one that’s closest to the loss function.” - Andrew Ng

Computational efficiency:

  • Start from output layer (simplest relationship to loss)

  • Reuse intermediate gradients

  • Each layer’s gradient depends on the next layer’s gradient

  • Store cached values from forward pass

“If you want to understand how much should we move \(W^{[1]}\) to make the loss move, it’s much more complicated than answering how much should \(W^{[3]}\) move to move the loss. Because there’s much more connections if you want to compute with \(W^{[1]}\).” - Andrew Ng

The Importance of Caching

“During backpropagation, there is a lot of terms that appear that were computed during forward propagation… If we don’t cache anything, we have to recompute them.” - Andrew Ng

What to cache during forward propagation:

  • All activations: \(A^{[1]}, A^{[2]}, A^{[3]}\)

  • All linear outputs: \(Z^{[1]}, Z^{[2]}, Z^{[3]}\)

  • All weights: \(W^{[1]}, W^{[2]}, W^{[3]}\) (needed for backprop)

Trade-off:

  • Benefit: Faster backward pass (no recomputation)

  • Cost: Higher memory usage

“It’s all for computational efficiency. It has some memory costs.” - Andrew Ng

class AdamOptimizer:
    """
    Adam optimizer implementation.
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.t = 0
        self.m = {}
        self.v = {}
    
    def update(self, params, grads):
        """
        Update parameters using Adam.
        """
        self.t += 1
        
        for key in params.keys():
            # Initialize momentum and velocity
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            
            # Update biased second raw moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            
            # Update parameters
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return params

print("Adam optimizer implemented!")

7.2 Activation Functions: Choosing Wisely

“In practice, when you do this process of training forward propagation, backward propagation updates, you don’t end up having a good network most of the time. In order to get a good network, you need to improve it. You need to use a bunch of techniques.” - Andrew Ng

Three Common Activation Functions

1. Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

  • Range: (0, 1)

  • Derivative: \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)

  • Use case: Output layer for binary classification (probability interpretation)

  • Problem: Vanishing gradients at extreme values

“If z is very big or very low in the negatives, your gradient is very close to 0… The slope of this graph is very, very small. It’s almost flat.” - Andrew Ng

2. Tanh: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)

  • Range: (-1, 1)

  • Derivative: \(\tanh'(z) = 1 - \tanh^2(z)\)

  • Use case: Hidden layers (zero-centered), reinforcement learning rewards

  • Problem: Also suffers from vanishing gradients

3. ReLU (Rectified Linear Unit): \(\text{ReLU}(z) = \max(0, z)\)

  • Range: [0, ∞)

  • Derivative: \(\text{ReLU}'(z) = \mathbb{1}_{z > 0} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\)

  • Use case: Hidden layers (most common), regression output

  • Advantage: No vanishing gradient problem for positive z

“ReLU on the other hand doesn’t have this problem. If z is very big in the positives, there is no saturation. The gradient just passes and the gradient is 1.” - Andrew Ng
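As a quick reference, the three activations and their derivatives in NumPy (a sketch; the saturation-check value z = 10 is illustrative):

```python
import numpy as np

# The three activations and their derivatives, as stated above
sigmoid = lambda z: 1 / (1 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))
d_tanh = lambda z: 1 - np.tanh(z) ** 2
relu = lambda z: np.maximum(0, z)
d_relu = lambda z: np.where(z > 0, 1.0, 0.0)

# Saturation: sigmoid/tanh gradients vanish for large |z|; ReLU's does not
z = 10.0
print(d_sigmoid(z))   # ≈ 4.54e-05  (nearly flat)
print(d_tanh(z))      # ≈ 8.24e-09  (nearly flat)
print(d_relu(z))      # 1.0         (gradient passes through)
```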

Why Non-Linear Activations Are Essential

Thought experiment: What if we use identity activation \(a = z\) everywhere?

Starting with the 3-layer network: \(\hat{y} = a^{[3]} = Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}\)

Substituting \(A^{[2]} = Z^{[2]}\): \(\hat{y} = W^{[3]} (W^{[2]} A^{[1]} + b^{[2]}) + b^{[3]}\)

Substituting \(A^{[1]} = Z^{[1]}\): \(\hat{y} = W^{[3]} W^{[2]} (W^{[1]} X + b^{[1]}) + W^{[3]} b^{[2]} + b^{[3]}\)

Final form: \(\hat{y} = (W^{[3]} W^{[2]} W^{[1]}) X + (W^{[3]} W^{[2]} b^{[1]} + W^{[3]} b^{[2]} + b^{[3]}) = W'X + b'\)

“The insight here is that we need activation functions. The reason is, if you don’t choose activation functions, no matter how deep is your network, it’s going to be equivalent to a linear regression.” - Andrew Ng

This is just linear regression! All those layers collapse into one linear transformation.

“The complexity of the network comes from the activation function… If we’re trying to detect cats, we want to mimic the formula of detecting cats. We don’t know this formula, so we want to mimic it using a lot of parameters. If we just have a linear regression, we cannot mimic this because we are going to look at pixel by pixel and assign every weight to a certain pixel.” - Andrew Ng
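The collapse is easy to verify numerically (random toy matrices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))              # 4 features, 5 examples (toy sizes)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
W3, b3 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))

# Three "layers" with identity activation...
y_deep = W3 @ (W2 @ (W1 @ X + b1) + b2) + b3

# ...collapse to a single linear map W'X + b'
W_prime = W3 @ W2 @ W1
b_prime = W3 @ W2 @ b1 + W3 @ b2 + b3
y_linear = W_prime @ X + b_prime

print(np.allclose(y_deep, y_linear))   # True
```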

Practical Guidelines

For hidden layers:

  • Default choice: ReLU (fast training, no vanishing gradient)

  • Alternative: Leaky ReLU, tanh

For output layer:

  • Binary classification: Sigmoid (0-1 probability)

  • Multi-class: Softmax (probability distribution)

  • Regression (positive values): ReLU

  • Regression (any values): Linear (identity)

  • Reinforcement learning: Tanh (-1 to 1 rewards)

“Usually you would use the same activation functions inside every layer… You would call this layer a ReLU layer meaning it’s a fully connected layer with ReLU activation.” - Andrew Ng

Hyperparameter tuning:

“These are all hyperparameters. In practice, you’re not going to choose these randomly, you’re going to try a bunch of them and choose some that seem to help your model train.” - Andrew Ng

class BatchNormalization:
    """
    Batch Normalization layer.
    """
    def __init__(self, num_features, epsilon=1e-5, momentum=0.9):
        self.num_features = num_features
        self.epsilon = epsilon
        self.momentum = momentum
        
        # Learnable parameters
        self.gamma = np.ones((num_features, 1))
        self.beta = np.zeros((num_features, 1))
        
        # Running statistics (for inference)
        self.running_mean = np.zeros((num_features, 1))
        self.running_var = np.ones((num_features, 1))
        
        # Cache for backprop
        self.cache = {}
    
    def forward(self, Z, training=True):
        """
        Forward pass.
        Z: (num_features, m)
        """
        if training:
            # Batch statistics
            mu = np.mean(Z, axis=1, keepdims=True)
            var = np.var(Z, axis=1, keepdims=True)
            
            # Update running statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Use running statistics during inference
            mu = self.running_mean
            var = self.running_var
        
        # Normalize
        Z_norm = (Z - mu) / np.sqrt(var + self.epsilon)
        
        # Scale and shift
        out = self.gamma * Z_norm + self.beta
        
        # Cache for backprop
        self.cache = {'Z': Z, 'Z_norm': Z_norm, 'mu': mu, 'var': var}
        
        return out
    
    def backward(self, dout):
        """
        Backward pass.
        """
        Z = self.cache['Z']
        Z_norm = self.cache['Z_norm']
        mu = self.cache['mu']
        var = self.cache['var']
        m = Z.shape[1]
        
        # Gradients of learnable parameters
        dgamma = np.sum(dout * Z_norm, axis=1, keepdims=True)
        dbeta = np.sum(dout, axis=1, keepdims=True)
        
        # Gradient of normalized Z
        dZ_norm = dout * self.gamma
        
        # Gradient of Z (chain rule through normalization)
        std = np.sqrt(var + self.epsilon)
        dZ = (1 / m) * (1 / std) * (
            m * dZ_norm - np.sum(dZ_norm, axis=1, keepdims=True) - 
            Z_norm * np.sum(dZ_norm * Z_norm, axis=1, keepdims=True)
        )
        
        return dZ, dgamma, dbeta

print("Batch Normalization implemented!")

7.3 Input Normalization

“If z is too big, or z is too low in the negative numbers, it will lead to saturation of the network. So in order to avoid that you can use normalization of the input.” - Andrew Ng

The Problem: Unnormalized Inputs

If input features have different scales (e.g., \(x_1 \in [0, 1]\), \(x_2 \in [1000, 10000]\)):

  • Computing \(Z = WX + b\) can lead to very large or small \(Z\) values

  • Large \(|Z|\) → Saturated activations → Vanishing gradients

  • Loss function becomes elongated, slowing convergence

Two-Step Normalization Process

Step 1: Zero-center the data: \(\mu = \frac{1}{m} \sum_{i=1}^m X^{(i)}\), then \(\tilde{X} = X - \mu\)

Step 2: Standardize to unit variance: \(\sigma^2 = \frac{1}{m} \sum_{i=1}^m (\tilde{X}^{(i)})^2\), then \(X_{normalized} = \frac{\tilde{X}}{\sigma}\)

Impact on Optimization

Before normalization (elongated loss):

  • Gradient descent oscillates

  • Many iterations needed

  • Zig-zag path to minimum

After normalization (circular loss):

  • Gradient points directly to minimum

  • Faster convergence

  • Fewer iterations

“Why is this one easier to train? It’s because if you have the starting point that is here, the gradient descent algorithm is going to go towards approximately the steepest slope… But the steeper slope in this loss contour is always pointing towards the middle. So if you start somewhere, it will directly go towards the minimum of your loss function.” - Andrew Ng

Critical Point: Train/Test Consistency

“This Sigma and Mu are computed over the training set. You have a training set, you compute the mean of the training set, the standard deviation of the training set, and these Sigma and Mu have to be used on the test set as well.” - Andrew Ng

Correct procedure:

  1. Compute \(\mu_{train}\) and \(\sigma_{train}\) on training data

  2. Normalize training data: \(X_{train,norm} = \frac{X_{train} - \mu_{train}}{\sigma_{train}}\)

  3. Use same statistics on test data: \(X_{test,norm} = \frac{X_{test} - \mu_{train}}{\sigma_{train}}\)

Why?

“Your network is used to seeing this type of transformation as an input. So you want the distribution of the inputs at the first neuron to be always the same, no matter if it’s a train or the test set.” - Andrew Ng
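A minimal sketch of the procedure (toy Gaussian data, assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy data (assumed)
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics come from the training set ONLY
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and the SAME mu and sigma are applied to both splits
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print(X_train_norm.mean(axis=0).round(6))  # ~0 by construction
print(X_test_norm.mean(axis=0).round(2))   # near 0, but not exactly
```

This is the same fit-on-train / apply-everywhere split that `StandardScaler`'s `fit_transform` and `transform` implement in the digits example later in these notes.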

class Dropout:
    """
    Dropout regularization layer.
    """
    def __init__(self, keep_prob=0.8):
        self.keep_prob = keep_prob
        self.mask = None
    
    def forward(self, A, training=True):
        """
        Apply dropout during training.
        """
        if training:
            # Create dropout mask
            self.mask = (np.random.rand(*A.shape) < self.keep_prob).astype(float)
            # Apply mask and scale
            return A * self.mask / self.keep_prob
        else:
            # No dropout during inference
            return A
    
    def backward(self, dA):
        """
        Backprop through dropout.
        """
        return dA * self.mask / self.keep_prob

print("Dropout implemented!")

7.4 Weight Initialization: Avoiding Vanishing/Exploding Gradients

The Vanishing/Exploding Gradient Problem

Toy example: 10-layer network, 2D input, identity activations, zero biases

Forward propagation collapses to: \(\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} X\)

Case 1: Weights slightly > 1. If all \(W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}\):

\[\begin{split}\hat{y} = \begin{bmatrix} 1.5^L & 0 \\ 0 & 1.5^L \end{bmatrix} X\end{split}\]

With \(L=10\): \(1.5^{10} \approx 57.7\) → Exploding gradients!

Case 2: Weights slightly < 1. If all \(W^{[l]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}\):

\[\begin{split}\hat{y} = \begin{bmatrix} 0.5^L & 0 \\ 0 & 0.5^L \end{bmatrix} X\end{split}\]

With \(L=10\): \(0.5^{10} \approx 0.001\) → Vanishing gradients!
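The toy example is a few lines of NumPy:

```python
import numpy as np

L = 10
X = np.ones((2, 1))
results = {}

for scale, label in [(1.5, "exploding"), (0.5, "vanishing")]:
    W = scale * np.eye(2)                # every layer uses the same weights
    out = X
    for _ in range(L):                   # y_hat = W^[10] ... W^[1] X, identity activations
        out = W @ out
    results[label] = out[0, 0]
    print(f"{label}: {out[0, 0]:.4f}")   # 57.6650 and 0.0010
```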

“You see, the issue with vanishing exploding gradients is that all the errors add up like multiply each other. If you end up with numbers that are smaller than one, you will get a totally vanished gradient. If you have values that are a little bigger than 1 you will get exploding gradients.” - Andrew Ng

The Initialization Insight

Single neuron analysis: \(Z = W_1 X_1 + W_2 X_2 + \cdots + W_n X_n\)

With \(n\) terms, if all \(W_i\) are large → \(Z\) explodes

Solution: Make weights smaller as \(n\) increases

“The interesting thing to notice is that we have n terms here. So in order for Z to not explode, we would like all of these terms to be small. If W’s are too big, then this term will explode with the size of the inputs of the layer… The larger n, the smaller \(W_i\) has to be.” - Andrew Ng

Intuition: Initialize \(W_i\) with variance \(\propto \frac{1}{n}\) (i.e. standard deviation \(\propto \frac{1}{\sqrt{n}}\))

Common Initialization Schemes

1. Xavier/Glorot Initialization (for sigmoid/tanh):

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1 / n[l-1])

“This initialization has been shown to work very well for sigmoid activations.” - Andrew Ng

2. He Initialization (for ReLU):

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2 / n[l-1])

“If you use ReLU, it’s been observed that putting a 2 here instead of a 1 would make the network train better. And again, it’s very practical.” - Andrew Ng

3. Glorot Initialization, forward-backward variant (averages fan-in \(n^{[l-1]}\) and fan-out \(n^{[l]}\)):

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2 / (n[l-1] + n[l]))

“The quick intuition is that we’re doing the same thing but also for the backpropagated gradients. \(n^{[l]}\) is the number of inputs you have during backpropagation and \(n^{[l-1]}\) is the number of inputs during forward propagation. So taking a geometric average of those.” - Andrew Ng

Why Random Initialization?

“The reason we have a random function here is because if you don’t initialize your weights randomly, you will end up with some problem called the symmetry problem where every neuron is going to learn kind of the same thing. To avoid that, you will make the neurons start at different places and let them evolve independently from each other as much as possible.” - Andrew Ng

Symmetry problem:

  • If all weights identical → all neurons learn identical features

  • Breaking symmetry allows diverse feature learning
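The effect of the scaling factor can be seen directly. A sketch (toy width, depth, and batch size, all assumed) compares how the activation spread evolves through deep ReLU layers under He scaling versus plain \(\sqrt{1/n}\) scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, m = 256, 20, 1000          # toy width / depth / batch (assumed values)
X = rng.normal(size=(n, m))

def forward_stats(scale):
    """Push X through `depth` random ReLU layers; return the final activation spread."""
    A = X
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * scale(n)
        A = np.maximum(0, W @ A)
    return A.std()

he_std = forward_stats(lambda n: np.sqrt(2.0 / n))      # He scaling
xavier_std = forward_stats(lambda n: np.sqrt(1.0 / n))  # sqrt(1/n) scaling with ReLU

print(f"He: {he_std:.3f}  (stays O(1))")
print(f"sqrt(1/n): {xavier_std:.6f}  (shrinks layer after layer)")
```

The factor of 2 in He initialization compensates for ReLU zeroing out roughly half of each layer's activations.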

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

class AdvancedNeuralNetwork:
    """
    Advanced neural network with modern techniques.
    """
    def __init__(self, layers_dims, learning_rate=0.001, 
                 use_batchnorm=True, dropout_prob=0.8):
        """
        layers_dims: list of layer sizes [input, hidden1, hidden2, ..., output]
        """
        self.layers_dims = layers_dims
        self.L = len(layers_dims) - 1
        self.use_batchnorm = use_batchnorm
        
        # Initialize parameters
        self.params = {}
        for l in range(1, self.L + 1):
            # He initialization
            self.params[f'W{l}'] = np.random.randn(layers_dims[l], layers_dims[l-1]) * \
                                    np.sqrt(2.0 / layers_dims[l-1])
            self.params[f'b{l}'] = np.zeros((layers_dims[l], 1))
        
        # Batch normalization layers
        self.bn_layers = {}
        if use_batchnorm:
            for l in range(1, self.L):
                self.bn_layers[f'bn{l}'] = BatchNormalization(layers_dims[l])
        
        # Dropout layers
        self.dropout_layers = {}
        for l in range(1, self.L):
            self.dropout_layers[f'dropout{l}'] = Dropout(dropout_prob)
        
        # Adam optimizer
        self.optimizer = AdamOptimizer(learning_rate)
        
        self.cache = {}
        self.costs = []
    
    def forward(self, X, training=True):
        """
        Forward propagation with batch norm and dropout.
        """
        A = X
        self.cache['A0'] = X
        
        # Hidden layers
        for l in range(1, self.L):
            Z = self.params[f'W{l}'] @ A + self.params[f'b{l}']
            
            # Batch normalization (applied before the activation)
            if self.use_batchnorm:
                Z = self.bn_layers[f'bn{l}'].forward(Z, training)
            
            # Cache the value that actually enters the activation, so the
            # ReLU derivative in backward() is evaluated at the right point
            self.cache[f'Z{l}'] = Z
            
            # Activation
            A = relu(Z)
            self.cache[f'A{l}_pre_dropout'] = A
            
            # Dropout
            A = self.dropout_layers[f'dropout{l}'].forward(A, training)
            self.cache[f'A{l}'] = A
        
        # Output layer (no batchnorm or dropout)
        Z = self.params[f'W{self.L}'] @ A + self.params[f'b{self.L}']
        self.cache[f'Z{self.L}'] = Z
        A = sigmoid(Z)
        self.cache[f'A{self.L}'] = A
        
        return A
    
    def compute_cost(self, Y):
        """
        Binary cross-entropy cost.
        """
        m = Y.shape[1]
        A = self.cache[f'A{self.L}']
        A_clipped = np.clip(A, 1e-10, 1 - 1e-10)
        cost = -1/m * np.sum(Y * np.log(A_clipped) + (1 - Y) * np.log(1 - A_clipped))
        return cost
    
    def backward(self, Y):
        """
        Backpropagation.
        """
        m = Y.shape[1]
        grads = {}
        
        # Output layer
        dA = self.cache[f'A{self.L}'] - Y
        dZ = dA
        grads[f'W{self.L}'] = 1/m * dZ @ self.cache[f'A{self.L-1}'].T
        grads[f'b{self.L}'] = 1/m * np.sum(dZ, axis=1, keepdims=True)
        dA = self.params[f'W{self.L}'].T @ dZ
        
        # Hidden layers (backward)
        for l in reversed(range(1, self.L)):
            # Dropout backward
            dA = self.dropout_layers[f'dropout{l}'].backward(dA)
            
            # ReLU backward
            dZ = dA * relu_derivative(self.cache[f'Z{l}'])
            
            # Batch norm backward
            if self.use_batchnorm:
                dZ, dgamma, dbeta = self.bn_layers[f'bn{l}'].backward(dZ)
                # Stored for inspection; note the Adam update in train() only
                # touches self.params (W, b), so gamma/beta stay at their
                # initial values in this simplified implementation
                grads[f'gamma{l}'] = dgamma
                grads[f'beta{l}'] = dbeta
            
            # Linear backward
            grads[f'W{l}'] = 1/m * dZ @ self.cache[f'A{l-1}'].T
            grads[f'b{l}'] = 1/m * np.sum(dZ, axis=1, keepdims=True)
            
            if l > 1:
                dA = self.params[f'W{l}'].T @ dZ
        
        return grads
    
    def train(self, X, Y, num_epochs=100, print_cost=False):
        """
        Train the network.
        """
        for epoch in range(num_epochs):
            # Forward
            A = self.forward(X, training=True)
            
            # Cost
            cost = self.compute_cost(Y)
            self.costs.append(cost)
            
            # Backward
            grads = self.backward(Y)
            
            # Update with Adam
            self.params = self.optimizer.update(self.params, grads)
            
            if print_cost and epoch % 100 == 0:
                print(f"Epoch {epoch}: Cost = {cost:.4f}")
    
    def predict(self, X):
        """
        Predict (inference mode).
        """
        A = self.forward(X, training=False)
        return (A > 0.5).astype(int)

print("Advanced Neural Network implemented!")

7.5 Training on Digits Dataset

What and Why

Moving from toy 2D datasets to the handwritten digits dataset (8x8 pixel images, 10 classes) tests whether our neural network implementation scales to a realistic multi-class classification problem. This is a classic benchmark that bridges the gap between understanding the math and applying neural networks to real data.

How It Works

The 64-dimensional input (flattened 8x8 images) is fed through the network. In the general 10-class setting, a softmax output layer produces 10 class probabilities and the cross-entropy loss \(L = -\sum_{k} y_k \log \hat{y}_k\) penalizes confident wrong predictions; the demo below simplifies this to a binary task (digit 0 vs digit 1) with a sigmoid output, matching the network implemented above. Backpropagation computes gradients through all layers, and training accuracy typically reaches 95%+ with a properly tuned single hidden layer.

Connection to ML

The digits dataset is the precursor to MNIST, and training on it demonstrates the full ML pipeline: data loading, preprocessing (normalization), model training, and evaluation. The same architecture principles scale to modern image classification with deeper networks.

# Load and prepare data
digits = load_digits()
X_all = digits.data
y_all = digits.target

# Binary classification: 0 vs 1
mask = (y_all == 0) | (y_all == 1)
X_binary = X_all[mask]
y_binary = y_all[mask]

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Transpose for neural network
X_train_T = X_train_scaled.T
X_test_T = X_test_scaled.T
Y_train = y_train.reshape(1, -1)
Y_test = y_test.reshape(1, -1)

print(f"Training: {X_train_T.shape}, Test: {X_test_T.shape}")

# Compare: Without vs With advanced techniques
print("\n" + "="*60)
print("Training WITHOUT advanced techniques...")
print("="*60)
nn_basic = AdvancedNeuralNetwork([64, 32, 16, 1], 
                                 learning_rate=0.001,
                                 use_batchnorm=False,
                                 dropout_prob=1.0)  # No dropout
nn_basic.train(X_train_T, Y_train, num_epochs=1000, print_cost=True)

print("\n" + "="*60)
print("Training WITH advanced techniques...")
print("="*60)
nn_advanced = AdvancedNeuralNetwork([64, 32, 16, 1],
                                    learning_rate=0.001,
                                    use_batchnorm=True,
                                    dropout_prob=0.8)
nn_advanced.train(X_train_T, Y_train, num_epochs=1000, print_cost=True)

# Evaluate
train_pred_basic = nn_basic.predict(X_train_T)
test_pred_basic = nn_basic.predict(X_test_T)
train_pred_adv = nn_advanced.predict(X_train_T)
test_pred_adv = nn_advanced.predict(X_test_T)

print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)
print("\nBasic Network:")
print(f"  Train Accuracy: {np.mean(train_pred_basic == Y_train):.2%}")
print(f"  Test Accuracy:  {np.mean(test_pred_basic == Y_test):.2%}")
print("\nAdvanced Network (BatchNorm + Dropout + Adam):")
print(f"  Train Accuracy: {np.mean(train_pred_adv == Y_train):.2%}")
print(f"  Test Accuracy:  {np.mean(test_pred_adv == Y_test):.2%}")

# Plot cost curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].plot(nn_basic.costs, linewidth=2, label='Basic')
axes[0].plot(nn_advanced.costs, linewidth=2, label='Advanced')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Cost', fontsize=12)
axes[0].set_title('Training Cost Comparison', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Zoom in on later epochs
axes[1].plot(nn_basic.costs[100:], linewidth=2, label='Basic')
axes[1].plot(nn_advanced.costs[100:], linewidth=2, label='Advanced')
axes[1].set_xlabel('Epoch (after 100)', fontsize=12)
axes[1].set_ylabel('Cost', fontsize=12)
axes[1].set_title('Training Cost (Zoomed)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

7.6 Optimization Algorithms

Mini-Batch Gradient Descent

“In practice, there is a trade-off between these two which is called mini-batch gradient descent.” - Andrew Ng

Three approaches:

  1. Batch Gradient Descent

    • Update after entire dataset (m examples)

    • Most accurate gradient

    • Very slow for large datasets

  2. Stochastic Gradient Descent (SGD)

    • Update after each example

    • Very fast iterations

    • Noisy gradient direction

  3. Mini-Batch Gradient Descent (most common)

    • Update after small batch (e.g., 32, 64, 128, 256, 512, 1024)

    • Balance between accuracy and speed

“Batch gradient descent is cool because you can use vectorization… Stochastic gradient descent’s advantage is that the updates are very quick. And imagine you have a dataset with one million images… Do you know how long it’s going to take to do one update? Very long.” - Andrew Ng

Mini-Batch Algorithm

Notation:

  • \(X = [X^{(1)}, X^{(2)}, \ldots, X^{(m)}]\) - full training set

  • \(X^{\{1\}} = [X^{(1)}, \ldots, X^{(1000)}]\) - first mini-batch (1000 examples)

  • \(X^{\{2\}} = [X^{(1001)}, \ldots, X^{(2000)}]\) - second mini-batch

  • Curly braces \(\{\}\) = mini-batch index

Algorithm:

for t = 1 to num_iterations:
    1. Select mini-batch X^{t}, Y^{t}
    2. Forward propagate the batch
    3. Compute cost: J^{t} = (1/batch_size) * Σᵢ L⁽ⁱ⁾
    4. Backward propagate the batch
    5. Update: W^[l] := W^[l] - α * ∂J^{t}/∂W^[l]
              b^[l] := b^[l] - α * ∂J^{t}/∂b^[l]
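A minimal sketch of the batching step (the helper name `mini_batches` and the column-major layout are assumptions matching the notation above):

```python
import numpy as np

def mini_batches(X, Y, batch_size=64, seed=0):
    """Yield shuffled (X_batch, Y_batch) pairs covering one epoch.

    Assumes the column-major layout used above: X is (features, m).
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)            # shuffle so batch composition varies
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

# Toy usage: 1000 examples at batch_size=64 -> 16 batches, the last with 40
X = np.random.randn(10, 1000)
Y = np.random.randint(0, 2, size=(1, 1000))
sizes = [xb.shape[1] for xb, _ in mini_batches(X, Y)]
print(len(sizes), sizes[-1])   # 16 40
```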

Cost Function Behavior

Batch GD: Smooth decrease (each iteration uses all data)

Mini-Batch GD: Noisy decrease (approximates gradient)

“For batch gradient descent, your cost function would have looked like a smooth curve. On the other hand, if you use Mini-batch gradient descent, you’re most likely to see something that is also decreasing as a trend, but because the gradient is approximated and doesn’t necessarily go straight to the lower point of the loss function, you will see a kind of noisy graph.” - Andrew Ng

Smaller batch size → More noise (but faster iterations)

Convergence Paths

Batch GD: Direct path to minimum, heavy iterations

Mini-Batch/SGD: Meandering path, fast iterations

“The difference is there seem to be less iteration with the batch gradient algorithm, but the iterations are much heavier to compute. So each of the mini-batch iterations are going to be very quick, while the batch ones are going to be slow to compute. This is a trade off.” - Andrew Ng

# Learning rate schedule functions
def step_decay(initial_lr, epoch, drop_every=200, drop_factor=0.5):
    return initial_lr * (drop_factor ** (epoch // drop_every))

def exponential_decay(initial_lr, epoch, k=0.005):
    return initial_lr * np.exp(-k * epoch)

def inverse_time_decay(initial_lr, epoch, k=0.5):
    return initial_lr / (1 + k * epoch / 100)

# Visualize schedules
epochs = np.arange(1000)
initial_lr = 0.01

lrs_step = [step_decay(initial_lr, e) for e in epochs]
lrs_exp = [exponential_decay(initial_lr, e) for e in epochs]
lrs_inv = [inverse_time_decay(initial_lr, e) for e in epochs]

plt.figure(figsize=(12, 7))
plt.plot(epochs, lrs_step, linewidth=2, label='Step Decay')
plt.plot(epochs, lrs_exp, linewidth=2, label='Exponential Decay')
plt.plot(epochs, lrs_inv, linewidth=2, label='Inverse Time Decay')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Learning Rate', fontsize=12)
plt.title('Learning Rate Schedules', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nLearning rate schedules:")
print("- Step decay: Sudden drops at intervals")
print("- Exponential: Smooth exponential decrease")
print("- Inverse time: Fast initial decrease, then slow")

7.6 Momentum: Accelerating Gradient Descent

The Problem: Oscillating Gradients

Consider an elongated loss function (high condition number):

      vertical: narrow
         
    _____|_____
   /           \
  /             \   horizontal: wide
 |               |
  \_____________/

Standard gradient descent:

  • Takes orthogonal steps to current contour

  • Oscillates vertically (unwanted)

  • Slow progress horizontally (desired direction)

“Your gradient descent algorithm is going to follow the falling ball, it’s going to be orthogonal to the current contour… going there, and then there, and then there, and then there, and so on.” - Andrew Ng

What we want:

  • Dampen oscillations in vertical direction

  • Accelerate in horizontal direction

Momentum Intuition

“What you would like is to move it faster on the horizontal line and slower on the vertical side… We’re going to use a technique called momentum, which is going to look at the past gradients.” - Andrew Ng

Key idea: Average past updates to smooth out oscillations

Vertical direction:

  • Step 1: Go up

  • Step 2: Go down

  • Average: Small net movement ✓

Horizontal direction:

  • Step 1: Go right

  • Step 2: Go right

  • Average: Large net movement ✓

“If you look at the past update and you take an average of the past updates, the average on the vertical side is going to be small, because one went up, one went down. But on the horizontal axis, both went to the same direction. So the update will not change too much on the vertical axis.” - Andrew Ng

Physics analogy:

“For those of you who do physics, sometimes you can think of momentum as friction. Like if you launch a rocket and you wanna move it quickly around, it’s not gonna move because the rocket has a certain weight and has a certain momentum. You cannot change its direction very, very noisily.” - Andrew Ng

Momentum Algorithm

Standard gradient descent: \(W := W - \alpha \frac{\partial J}{\partial W}\)

Gradient descent with momentum: \(V := \beta V + (1-\beta) \frac{\partial J}{\partial W}\), then \(W := W - \alpha V\)

Parameters:

  • \(V\) = velocity (running average of gradients)

  • \(\beta\) = momentum coefficient (typical: 0.9)

    • Higher \(\beta\) → More smoothing

    • Lower \(\beta\) → More responsive to current gradient

Exponentially weighted average:

  • Weights recent gradients more heavily

  • Gradually forgets old gradients

  • Creates “momentum” in consistent directions
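The smoothing effect can be seen with two made-up gradient streams: one that flips sign every step (the "vertical" oscillation) and one that always points the same way (the "horizontal" direction).

```python
import numpy as np

# Exponentially weighted average of two toy gradient streams.
beta = 0.9
v_osc, v_consistent = 0.0, 0.0
for step in range(20):
    g_osc = 1.0 if step % 2 == 0 else -1.0        # up, down, up, down...
    g_con = 1.0                                   # always the same direction
    v_osc = beta * v_osc + (1 - beta) * g_osc
    v_consistent = beta * v_consistent + (1 - beta) * g_con

print(f"oscillating velocity: {v_osc:+.3f}")      # near zero: dampened
print(f"consistent velocity:  {v_consistent:+.3f}")  # near one: preserved
```

The alternating gradients largely cancel inside the running average, while the consistent ones accumulate, which is the dampen-vertical / accelerate-horizontal behavior described above.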

Implementation

“In terms of implementation it’s one more line of code, in terms of memory, it’s just one additional variable, and it actually has a big impact on the optimization.” - Andrew Ng

# Standard GD
W = W - alpha * dW

# With momentum
V = beta * V + (1 - beta) * dW
W = W - alpha * V

Computational cost: Minimal (one extra variable per parameter)

Performance gain: Significant (especially on ill-conditioned problems)

Advanced Optimizers

“There are much more optimization algorithms… In CS230, we teach something called RMSProp and Adam. Those are most likely the ones that are used the most in deep learning.” - Andrew Ng

Adam optimizer:

  • Combines momentum with adaptive learning rates

  • Most popular choice for deep learning (2018)

  • Was validated on a wide variety of applications before researchers adopted it

“If you come up with an optimization algorithm, you still have to prove that it works very well on a wide variety of applications before researchers adopt it for their research. Adam brings momentum to the deep learning optimization algorithms.” - Andrew Ng

from sklearn.datasets import make_circles

# Generate challenging dataset
X_circles, y_circles = make_circles(n_samples=300, noise=0.15, factor=0.3, random_state=42)
X_circles_T = X_circles.T
Y_circles = y_circles.reshape(1, -1)

# Train with different optimizers (AdamOptimizer and AdvancedNeuralNetwork
# are defined in the earlier sections of this notebook)
optimizers = {
    'Adam (β1=0.9, β2=0.999)': AdamOptimizer(learning_rate=0.01, beta1=0.9, beta2=0.999),
    'Adam (β1=0.8, β2=0.99)': AdamOptimizer(learning_rate=0.01, beta1=0.8, beta2=0.99),
}

results = {}

for name, optimizer in optimizers.items():
    print(f"\nTraining with {name}...")
    nn = AdvancedNeuralNetwork([2, 16, 8, 1], learning_rate=0.01,
                              use_batchnorm=False, dropout_prob=1.0)
    nn.optimizer = optimizer  # Replace optimizer
    nn.train(X_circles_T, Y_circles, num_epochs=500)
    results[name] = nn.costs

# Plot comparison
plt.figure(figsize=(12, 7))
for name, costs in results.items():
    plt.plot(costs, linewidth=2, label=name)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('Optimizer Comparison on Non-Linear Dataset', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()

print("\nFinal Costs:")
for name, costs in results.items():
    print(f"  {name}: {costs[-1]:.6f}")

Summary: Key Concepts from Lecture 12

1. Backpropagation Mathematics

The chain rule in action:

  • Start from output layer (closest to loss)

  • Reuse intermediate computations

  • Each layer’s gradient depends on next layer’s gradient

Key derivatives:

  • Sigmoid: \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)

  • Tanh: \(\tanh'(z) = 1 - \tanh^2(z)\)

  • ReLU: \(\text{ReLU}'(z) = \mathbb{1}_{z>0}\)
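The three derivative formulas can be sanity-checked against a centered finite difference (a quick numerical test, not part of the lecture; the evaluation point 0.7 is arbitrary):

```python
import numpy as np

# Numerically verify each activation derivative with a centered difference.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
acts = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh":    (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu":    (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
}
z, h = 0.7, 1e-5        # avoid z = 0, where ReLU is not differentiable
errs = {}
for name, (f, df) in acts.items():
    numeric = (f(z + h) - f(z - h)) / (2 * h)
    errs[name] = abs(df(z) - numeric)
    print(f"{name}: analytic={df(z):.6f}  numeric={numeric:.6f}")
```

All three analytic derivatives match the numerical estimate to several decimal places; the same centered-difference trick is the basis of full gradient checking for backprop.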

Efficiency principle:

“What’s interesting about it is that I’m not gonna redo the work I did, I’m just gonna store the right values while back-propagating and continue to derivate.” - Andrew Ng

2. Activation Functions

Why non-linearity is essential:

“If you don’t choose activation functions, no matter how deep is your network, it’s going to be equivalent to a linear regression. The complexity of the network comes from the activation function.” - Andrew Ng

Practical choices:

  • ReLU: Default for hidden layers (no vanishing gradient)

  • Sigmoid: Binary classification output

  • Tanh: Zero-centered alternative to sigmoid

  • Linear: Regression output

Vanishing gradient problem:

“If z is very big or very low in the negatives, your gradient is very close to 0… It will be super hard to update my parameters that are early in the network because the gradient is just going to vanish.” - Andrew Ng

3. Input Normalization

Two-step process:

  1. Zero-center: \(X := X - \mu\)

  2. Standardize: \(X := X / \sigma\)

Impact on training:

  • Transforms elongated loss → circular loss

  • Gradient points directly to minimum

  • Faster convergence

Critical rule:

“These Sigma and Mu have to be used on the test set as well. You should not compute the mean of the test set and the standard deviation of the test set… Instead, you should use the Mu and the Sigma that were computed on the train set.” - Andrew Ng
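In code, the rule amounts to fitting μ and σ on the training set only and reusing them everywhere else (a minimal sketch; the arrays and their scales are made up):

```python
import numpy as np

# Normalize with train-set statistics only; reuse them on the test set.
np.random.seed(1)
X_train = np.random.randn(5, 200) * 10 + 3    # features x examples, made up
X_test  = np.random.randn(5, 50) * 10 + 3

mu    = X_train.mean(axis=1, keepdims=True)   # per-feature mean (train only)
sigma = X_train.std(axis=1, keepdims=True)    # per-feature std  (train only)

X_train_norm = (X_train - mu) / sigma
X_test_norm  = (X_test - mu) / sigma          # same mu, sigma: no leakage

print(X_train_norm.mean(), X_train_norm.std())  # 0 and 1 by construction
print(X_test_norm.mean(), X_test_norm.std())    # close to 0 and 1, not exact
```

The test-set statistics come out close to, but not exactly, 0 and 1; that small mismatch is expected and correct, because the network must see test data transformed the same way as training data.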

4. Weight Initialization

The problem:

  • Weights too large → Exploding gradients

  • Weights too small → Vanishing gradients

  • Identical weights → Symmetry problem (all neurons learn same features)

Solutions:

  • Xavier: \(\sqrt{1/n_{in}}\) for sigmoid/tanh

  • He: \(\sqrt{2/n_{in}}\) for ReLU

  • Glorot (an averaged variant of Xavier, from the same author): \(\sqrt{2/(n_{in} + n_{out})}\) balances both the forward and backward passes

“The larger n, the smaller it has to be \(W_i\). So based on this intuition, it seems that it would be a good idea to initialize \(W_i\)’s with something that is close to 1 over n.” - Andrew Ng
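The 1/n intuition can be made concrete by pushing a signal through a deep stack of ReLU layers with two different weight scales (the layer width, depth, and batch size here are arbitrary choices for illustration):

```python
import numpy as np

# Compare a naive std-1 initialization with He initialization by tracking
# the activation scale through 20 ReLU layers.
np.random.seed(0)
n = 256                                       # neurons per layer (made up)
a_naive = np.random.randn(n, 100)
a_he = a_naive.copy()

for layer in range(20):
    W_naive = np.random.randn(n, n) * 1.0             # std 1: too large
    W_he    = np.random.randn(n, n) * np.sqrt(2 / n)  # He: sqrt(2 / n_in)
    a_naive = np.maximum(0, W_naive @ a_naive)
    a_he    = np.maximum(0, W_he @ a_he)

print(f"naive std after 20 layers: {a_naive.std():.3e}")  # explodes
print(f"He    std after 20 layers: {a_he.std():.3e}")     # stays order 1
```

With std-1 weights each layer multiplies the activation variance by roughly n/2, so the signal explodes within a few layers; the He scale cancels that factor and keeps activations (and therefore gradients) in a usable range.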

5. Optimization Strategies

Mini-batch gradient descent:

  • Batch size: 32, 64, 128, 256, 512, 1024 (powers of 2)

  • Balances vectorization benefits with update frequency

  • Most practical choice

Momentum:

  • Dampens oscillations in high-curvature directions

  • Accelerates in low-curvature directions

  • Minimal implementation cost, significant benefit

“If you look at the past updates, the average on the vertical side is going to be small because one went up, one went down. But on the horizontal axis, both went to the same direction.” - Andrew Ng

6. Practical Training Workflow

Steps to improve your network:

  1. Normalize inputs (zero-center and standardize)

  2. Initialize weights properly (He/Xavier based on activation)

  3. Choose activation (ReLU for hidden, sigmoid/softmax for output)

  4. Select optimizer (SGD with momentum or Adam)

  5. Tune hyperparameters (learning rate, batch size, momentum coefficient)

  6. Cache activations during forward pass for efficient backprop

“In practice, when you do this process of training forward propagation, backward propagation updates, you don’t end up having a good network most of the time. You need to use a bunch of techniques that will make your network work in practice.” - Andrew Ng

7. Key Equations Reference

Forward propagation: \(Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}\), \(A^{[l]} = g^{[l]}(Z^{[l]})\)

Backpropagation: \(\frac{\partial J}{\partial W^{[l]}} = \frac{\partial J}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}} = dZ^{[l]} \cdot A^{[l-1]T}\)

Gradient descent with momentum: \(V := \beta V + (1-\beta) \frac{\partial J}{\partial W}\), then \(W := W - \alpha V\)
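The backprop formula for a single linear layer can be checked against a numerical gradient (a small sanity test with a squared-error cost; the layer sizes are arbitrary):

```python
import numpy as np

# Verify dJ/dW = dZ @ A_prev.T for one linear layer, against a
# centered-difference numerical gradient.
np.random.seed(0)
n_in, n_out, m = 3, 2, 5
A_prev = np.random.randn(n_in, m)
W = np.random.randn(n_out, n_in)
b = np.zeros((n_out, 1))
Y = np.random.randn(n_out, m)

def cost(W):
    Z = W @ A_prev + b
    return 0.5 * np.mean(np.sum((Z - Y) ** 2, axis=0))

# Analytic gradient: dZ = (Z - Y) / m, then dW = dZ @ A_prev.T
Z = W @ A_prev + b
dW = ((Z - Y) / m) @ A_prev.T

# Numerical gradient, one weight at a time
num = np.zeros_like(W)
h = 1e-6
for i in range(n_out):
    for j in range(n_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        num[i, j] = (cost(Wp) - cost(Wm)) / (2 * h)

print(np.max(np.abs(dW - num)))  # tiny: the formulas agree
```

This is exactly the gradient-checking procedure suggested in the practice exercises, applied to one layer.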

8. Hyperparameters to Tune

  • Architecture: Number of layers, neurons per layer

  • Activation functions: ReLU, tanh, sigmoid

  • Learning rate \(\alpha\): 0.001, 0.01, 0.1

  • Batch size: 32, 64, 128, 256

  • Momentum \(\beta\): 0.9, 0.99

  • Initialization: He, Xavier, Glorot

  • Number of epochs: Until convergence on validation set

“These are all hyperparameters. In practice, you’re not going to choose these randomly, you’re going to try a bunch of them and choose some that seem to help your model train. There’s a lot of experimental results in deep learning and we don’t really understand fully why certain things work better than others.” - Andrew Ng
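"Try a bunch of them" is typically done with random search over the ranges listed above. A minimal sketch follows; `train_and_score` is a placeholder for "train the network, return validation score", not a real training routine.

```python
import numpy as np

# Random search over the hyperparameters listed above.
rng = np.random.default_rng(0)

def sample_config():
    return {
        "learning_rate": 10 ** rng.uniform(-3, -1),        # 0.001 .. 0.1
        "batch_size": int(rng.choice([32, 64, 128, 256])),
        "momentum": float(rng.choice([0.9, 0.99])),
        "init": str(rng.choice(["he", "xavier"])),
    }

def train_and_score(cfg):
    # Placeholder objective: pretend mid-range learning rates score best.
    return 1.0 - abs(np.log10(cfg["learning_rate"]) + 2)

best_cfg, best_score = None, -np.inf
for trial in range(20):
    cfg = sample_config()
    score = train_and_score(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg, round(best_score, 3))
```

Sampling the learning rate on a log scale (rather than uniformly) matters in practice, since the useful values span several orders of magnitude.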

Conclusion:

“That’s all for deep learning in CS229 so far.” - Andrew Ng

Practice Exercises

  1. Batch Normalization: Implement from scratch. Verify forward and backward pass with gradient checking.

  2. Optimizer Comparison: Compare SGD, Momentum, RMSprop, and Adam on MNIST. Plot convergence curves.

  3. Dropout Study: Train networks with dropout = [0.0, 0.2, 0.5, 0.8]. How does it affect overfitting?

  4. Learning Rate Finder: Implement LR range test. Train with exponentially increasing LR, plot loss. Find optimal LR.

  5. Deep Network: Build 5-10 layer network. Does batch norm help with vanishing gradients?

  6. Transfer Learning: Pre-train on large dataset, fine-tune on small dataset. Compare with training from scratch.

  7. Learning Rate Schedules: Implement cosine annealing. Compare with step decay.

  8. Gradient Clipping: Implement gradient clipping (max norm). Does it stabilize training?

  9. Multi-class: Extend to 10-class MNIST. Implement softmax output and cross-entropy loss.

  10. Hyperparameter Tuning: Use random search to find best [learning rate, hidden size, dropout, batch norm]. Report best config.

References

  1. “Batch Normalization: Accelerating Deep Network Training”: Ioffe & Szegedy (2015)

  2. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”: Srivastava et al. (2014)

  3. “Adam: A Method for Stochastic Optimization”: Kingma & Ba (2014)

  4. “Delving Deep into Rectifiers”: He et al. (2015)

  5. “Deep Learning”: Goodfellow et al., Chapters 7-8

  6. “Understanding the Difficulty of Training Deep Feedforward Neural Networks”: Glorot & Bengio (2010)

Next: Lecture 8: Clustering