Lecture 11: Introduction to Neural Networks

📹 Watch Lecture

From Andrew Ng’s CS229 Lecture 11 (Autumn 2018)

“Deep learning is a set of techniques that is a subset of machine learning… specifically for problems in computer vision, natural language processing and speech recognition.” - Andrew Ng

Deep Learning Revolution

Three key enablers:

  1. Computation: GPUs and parallelization

  2. Data: Large datasets from digitalization

  3. Algorithms: New techniques for training deep networks

Course Overview

Topics covered:

  • Logistic regression as a neural network

  • Neural network architecture

  • Forward propagation

  • Backpropagation algorithm

  • Training deep networks

  • Practical tricks and tips

The Cat Detection Problem

Binary Classification Task:

  • Input: 64×64 RGB image (64 × 64 × 3 = 12,288 numbers)

  • Output: 1 if cat present, 0 if no cat

  • Flatten 3D image matrix into vector

  • Apply logistic regression: ŷ = sigmoid(Wx + b)

“In computer science, you know that images can be represented as 3D matrices. If I tell you this is a color image of size 64 by 64, how many numbers do I have to represent those pixels? 64 × 64 × 3. Three for the RGB channel.” - Andrew Ng
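The flatten-then-classify pipeline can be sketched in a few lines. This is a minimal sketch with untrained placeholder parameters, not a fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((64, 64, 3))   # one 64x64 RGB image
x = image.reshape(-1, 1)          # flatten to a (12288, 1) column vector
print(x.shape)                    # (12288, 1)

# Untrained placeholder parameters (random, for illustration only)
w = rng.standard_normal((1, 12288)) * 0.01
b = 0.0

z = w @ x + b                     # linear part, shape (1, 1)
y_hat = 1 / (1 + np.exp(-z))      # sigmoid -> "cat" probability
print(y_hat[0, 0])                # a value in (0, 1)
```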

From Logistic Regression to Neural Networks

Logistic Regression:

  • One neuron

  • Linear part: z = Wx + b

  • Activation: a = sigmoid(z)

Neural Network:

  • Stack multiple neurons in layers

  • Stack multiple layers

  • More parameters → More flexibility → Can model complex patterns

Why Neural Networks?

Advantages:

  • Learn features automatically (no manual feature engineering)

  • Scale with data (more data → better performance)

  • Universal approximators (can approximate any continuous function, given enough hidden units)

Challenges:

  • Computationally expensive

  • Need lots of data

  • Many hyperparameters to tune

  • “Black box” - hard to interpret

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_circles, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

6.1 Logistic Regression as a Neural Network

“Neuron equals linear plus activation. We define a neuron as an operation that has two parts, one linear part, and one activation part.” - Andrew Ng

Training Process

Three steps:

  1. Initialize parameters (w, b)

  2. Optimize using gradient descent: find w, b that minimize loss

  3. Predict using optimized parameters

Loss Function (Logistic Loss):

\[\mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})\]

“The idea is: how can I minimize this function? I want to find w and b that minimize this function and I’m going to use a gradient descent algorithm.” - Andrew Ng

Parameter Count:

  • Logistic regression for 64×64×3 image: 12,289 parameters (12,288 weights + 1 bias)

  • Count trick: Number of edges + number of neurons (each neuron has a bias)

  • Problem: Parameters scale with input size (we’ll fix this with layers)

From One to Multiple Classes

Goal evolution:

  • Start: Binary classification (cat vs no cat)

  • Next: Multi-class (cat vs lion vs iguana)

Solution: Multiple output neurons

  • 3 neurons, each detecting one class

  • Each neuron: \(\hat{y}_i^{[1]} = a_i^{[1]} = \sigma(w_i^{[1]} x + b_i^{[1]})\)

  • Notation: \([1]\) = layer index, \(_i\) = neuron index within layer

  • Total parameters: 3 × 12,289 = 36,867 parameters

“This network is going to be robust because the three neurons aren’t communicating together. We can totally train them independently from each other… The sigmoid here doesn’t depend on the sigmoid here.” - Andrew Ng

Key insight: Multi-label vs Multi-class

  • Multi-label: Image can have both cat AND lion → Use sigmoid for each neuron

  • Multi-class: Image has ONLY one animal → Use softmax (constrains outputs to sum to 1)

“Think about health care… Usually there is no overlap between these [diseases]. This model would still work but would not be optimal… You want to start with another model where you put the constraint that there is only one disease that we want to predict and let the model learn with all the neurons learn together by creating interaction between them.” - Andrew Ng
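The sigmoid-vs-softmax distinction is easy to see numerically. A small sketch on made-up scores for the three animal classes:

```python
import numpy as np

z = np.array([2.0, 1.0, -1.0])  # made-up scores for cat, lion, iguana

# Multi-label: independent sigmoids, each output in (0, 1)
sigmoid_out = 1 / (1 + np.exp(-z))

# Multi-class: softmax, outputs constrained to sum to 1
softmax_out = np.exp(z) / np.exp(z).sum()

print(sigmoid_out)        # several entries can exceed 0.5 at once
print(softmax_out.sum())  # sums to 1 (up to float rounding)
```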

6.2 Activation Functions

Activation functions introduce non-linearity into neural networks.

Common Activations:

  1. Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

  2. Tanh: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)

  3. ReLU: \(\text{ReLU}(z) = \max(0, z)\)

  4. Leaky ReLU: \(\text{LeakyReLU}(z) = \max(0.01z, z)\)

# Activation functions and derivatives
def sigmoid(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid."""
    s = sigmoid(z)
    return s * (1 - s)

def tanh(z):
    """Hyperbolic tangent."""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh."""
    return 1 - np.tanh(z)**2

def relu(z):
    """Rectified Linear Unit."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU."""
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Derivative of Leaky ReLU."""
    return np.where(z > 0, 1, alpha)

# Visualize activations
z = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sigmoid
axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='σ(z)')
axes[0, 0].plot(z, sigmoid_derivative(z), 'r--', linewidth=2, label="σ'(z)")
axes[0, 0].axhline(0, color='black', linewidth=0.5)
axes[0, 0].axvline(0, color='black', linewidth=0.5)
axes[0, 0].set_title('Sigmoid', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('z', fontsize=11)
axes[0, 0].set_ylabel('Activation', fontsize=11)
axes[0, 0].legend(fontsize=10)
axes[0, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(z, tanh(z), 'b-', linewidth=2, label='tanh(z)')
axes[0, 1].plot(z, tanh_derivative(z), 'r--', linewidth=2, label="tanh'(z)")
axes[0, 1].axhline(0, color='black', linewidth=0.5)
axes[0, 1].axvline(0, color='black', linewidth=0.5)
axes[0, 1].set_title('Tanh', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('z', fontsize=11)
axes[0, 1].set_ylabel('Activation', fontsize=11)
axes[0, 1].legend(fontsize=10)
axes[0, 1].grid(True, alpha=0.3)

# ReLU
axes[1, 0].plot(z, relu(z), 'b-', linewidth=2, label='ReLU(z)')
axes[1, 0].plot(z, relu_derivative(z), 'r--', linewidth=2, label="ReLU'(z)")
axes[1, 0].axhline(0, color='black', linewidth=0.5)
axes[1, 0].axvline(0, color='black', linewidth=0.5)
axes[1, 0].set_title('ReLU', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('z', fontsize=11)
axes[1, 0].set_ylabel('Activation', fontsize=11)
axes[1, 0].legend(fontsize=10)
axes[1, 0].grid(True, alpha=0.3)

# Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), 'b-', linewidth=2, label='Leaky ReLU(z)')
axes[1, 1].plot(z, leaky_relu_derivative(z), 'r--', linewidth=2, label="Leaky ReLU'(z)")
axes[1, 1].axhline(0, color='black', linewidth=0.5)
axes[1, 1].axvline(0, color='black', linewidth=0.5)
axes[1, 1].set_title('Leaky ReLU', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('z', fontsize=11)
axes[1, 1].set_ylabel('Activation', fontsize=11)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nActivation Properties:")
print("- Sigmoid: Range (0,1), vanishing gradient for |z| > 3")
print("- Tanh: Range (-1,1), zero-centered, still vanishing gradient")
print("- ReLU: No vanishing gradient for z>0, dead neurons for z<0")
print("- Leaky ReLU: Fixes dying ReLU problem")

6.3 Building Multi-Layer Neural Networks

“Let’s go have fun with neural networks.” - Andrew Ng

Network Architecture Vocabulary

Layer types:

  1. Input layer: First layer that receives raw features

  2. Hidden layer: Middle layers - “hidden” because they don’t see inputs or outputs directly

  3. Output layer: Final layer producing predictions

“The reason we call it hidden is because the inputs and the outputs are hidden from this layer. The only thing that this layer sees as input is what the previous layer gave it. So it’s an abstraction of the inputs but it’s not the inputs.” - Andrew Ng

Fully-Connected (Dense) Layers:

  • Every neuron in layer L connected to every neuron in layer L+1

  • Network discovers useful features automatically

  • No manual feature engineering needed

“We will let the network figure out what are the interesting features. And oftentimes, the network is going to be able better than the humans to find what are the features that are representative.” - Andrew Ng

Why Deep Networks Work: Hierarchical Learning

Example: Cat detection

  • Layer 1: Detects edges (horizontal, vertical, diagonal)

  • Layer 2: Combines edges → Detects parts (ears, mouth, whiskers)

  • Layer 3: Combines parts → Detects whole cat face

Example: House price prediction

  • Inputs: # bedrooms, size, zip code, neighborhood wealth

  • Layer 1 neurons might learn:

    • School quality (from zip code + wealth)

    • Walkability (from zip code)

    • Family size fit (from size + bedrooms)

  • Layer 2: Combines learned features → Predicts price

“The deeper you go, the more complex information the neurons are able to understand.” - Andrew Ng

Three-Layer Network Example

Architecture:

  • Input: n features

  • Hidden layer 1: 3 neurons

  • Hidden layer 2: 2 neurons

  • Output: 1 neuron (binary classification)

Forward Propagation Equations:

\[Z^{[1]} = W^{[1]} X + b^{[1]} \quad \text{(shape: 3×1)}\]
\[A^{[1]} = \sigma(Z^{[1]}) \quad \text{(shape: 3×1)}\]
\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \quad \text{(shape: 2×1)}\]
\[A^{[2]} = \sigma(Z^{[2]}) \quad \text{(shape: 2×1)}\]
\[Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]} \quad \text{(shape: 1×1)}\]
\[A^{[3]} = \sigma(Z^{[3]}) = \hat{y} \quad \text{(shape: 1×1)}\]
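The shapes in these equations can be verified with a quick sketch using random weights (n = 5 input features is an arbitrary choice for the demo):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n = 5                      # arbitrary number of input features
x = np.random.randn(n, 1)  # one example as a column vector

# Random (untrained) weights matching the 3 -> 2 -> 1 architecture
W1, b1 = np.random.randn(3, n), np.zeros((3, 1))
W2, b2 = np.random.randn(2, 3), np.zeros((2, 1))
W3, b3 = np.random.randn(1, 2), np.zeros((1, 1))

A1 = sigmoid(W1 @ x + b1)    # (3, 1)
A2 = sigmoid(W2 @ A1 + b2)   # (2, 1)
A3 = sigmoid(W3 @ A2 + b3)   # (1, 1) = y_hat

print(A1.shape, A2.shape, A3.shape)  # (3, 1) (2, 1) (1, 1)
```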

Notation:

  • \([l]\) = layer index (superscript in square brackets)

  • \(Z^{[l]}\) = linear part (pre-activation)

  • \(A^{[l]}\) = activation output

  • \(W^{[l]}\) = weights for layer l

  • \(b^{[l]}\) = bias for layer l

Parameter count:

  • Layer 1: \(3n + 3\) (3 neurons × n inputs + 3 biases)

  • Layer 2: \(2 \times 3 + 2 = 8\) (2 neurons × 3 inputs + 2 biases)

  • Layer 3: \(1 \times 2 + 1 = 3\) (1 neuron × 2 inputs + 1 bias)

  • Total: \(3n + 14\) parameters

“Most of the parameters are still in the input layer.” - Andrew Ng

Key insight: Parameter count = edges + neurons (one bias per neuron)
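The edges-plus-neurons trick generalizes to any layer sizes. A small helper (hypothetical, not from the lecture) that applies it:

```python
def count_params(layer_sizes):
    """Parameters = edges between layers + one bias per (non-input) neuron."""
    edges = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per neuron
    return edges + biases

n = 12288  # 64 x 64 x 3 image, flattened
print(count_params([n, 3, 2, 1]))  # 3n + 14 = 36878
print(count_params([n, 1]))        # logistic regression: 12289
```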

class NeuralNetwork:
    """
    Simple 2-layer neural network for binary classification.
    Architecture: Input -> Hidden (ReLU) -> Output (Sigmoid)
    """
    def __init__(self, input_size, hidden_size, learning_rate=0.01):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        # Xavier initialization
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, 1))
        
        # Cache for backprop
        self.cache = {}
        self.costs = []
    
    def forward(self, X):
        """
        Forward propagation.
        X: (n_features, m_samples)
        """
        # Layer 1
        Z1 = self.W1 @ X + self.b1  # (hidden_size, m)
        A1 = relu(Z1)
        
        # Layer 2
        Z2 = self.W2 @ A1 + self.b2  # (1, m)
        A2 = sigmoid(Z2)
        
        # Cache for backprop
        self.cache = {
            'X': X,
            'Z1': Z1,
            'A1': A1,
            'Z2': Z2,
            'A2': A2
        }
        
        return A2
    
    def compute_cost(self, Y):
        """
        Compute binary cross-entropy cost.
        Y: (1, m)
        """
        m = Y.shape[1]
        A2 = self.cache['A2']
        
        # Clip to avoid log(0)
        A2_clipped = np.clip(A2, 1e-10, 1 - 1e-10)
        
        cost = -1/m * np.sum(Y * np.log(A2_clipped) + (1 - Y) * np.log(1 - A2_clipped))
        return cost
    
    def backward(self, Y):
        """
        Backpropagation.
        """
        m = Y.shape[1]
        X = self.cache['X']
        A1 = self.cache['A1']
        A2 = self.cache['A2']
        Z1 = self.cache['Z1']
        
        # Layer 2 gradients
        dZ2 = A2 - Y  # (1, m)
        dW2 = 1/m * dZ2 @ A1.T  # (1, hidden_size)
        db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)
        
        # Layer 1 gradients
        dA1 = self.W2.T @ dZ2  # (hidden_size, m)
        dZ1 = dA1 * relu_derivative(Z1)  # (hidden_size, m)
        dW1 = 1/m * dZ1 @ X.T  # (hidden_size, n_features)
        db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)  # (hidden_size, 1)
        
        return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    
    def update_parameters(self, gradients):
        """
        Gradient descent update.
        """
        self.W1 -= self.learning_rate * gradients['dW1']
        self.b1 -= self.learning_rate * gradients['db1']
        self.W2 -= self.learning_rate * gradients['dW2']
        self.b2 -= self.learning_rate * gradients['db2']
    
    def train(self, X, Y, num_iterations=1000, print_cost=False):
        """
        Train the neural network.
        X: (n_features, m)
        Y: (1, m)
        """
        for i in range(num_iterations):
            # Forward
            A2 = self.forward(X)
            
            # Cost
            cost = self.compute_cost(Y)
            self.costs.append(cost)
            
            # Backward
            gradients = self.backward(Y)
            
            # Update
            self.update_parameters(gradients)
            
            if print_cost and i % 100 == 0:
                print(f"Iteration {i}: Cost = {cost:.4f}")
    
    def predict(self, X):
        """
        Predict class labels.
        """
        A2 = self.forward(X)
        predictions = (A2 > 0.5).astype(int)
        return predictions

print("Neural Network class implemented!")

6.4 Vectorization and Batch Processing

“We want to be able to parallelize our code or our computation as much as possible by giving batches of inputs and parallelizing these equations.” - Andrew Ng

From Single Example to Mini-Batch

Single example: \(X\) is \(n \times 1\) column vector (one image)

Mini-batch: \(X\) is \(n \times m\) matrix where:

  • Each column \(X^{(i)}\) is one training example

  • \(m\) = batch size

  • Notation: \((i)\) in parentheses = example index

Broadcasting in NumPy

The problem: \(Z^{[1]} = W^{[1]} X + b^{[1]}\)

  • \(Z^{[1]}\): \(3 \times m\) (want one output per example)

  • \(W^{[1]}\): \(3 \times n\) (parameters stay constant!)

  • \(X\): \(n \times m\)

  • \(b^{[1]}\): \(3 \times 1\) (parameters stay constant!)

Issue: Can’t add \(3 \times m\) matrix to \(3 \times 1\) vector!

Solution: Broadcasting

“Broadcasting is the fact that we don’t want to change the number of parameters, it should stay the same. But we still want this operation to be able to be written in parallel version.” - Andrew Ng

NumPy automatically replicates \(b^{[1]}\) from \((3 \times 1)\) to \((3 \times m)\):

\[\tilde{b}^{[1]} = [b^{[1]}, b^{[1]}, \ldots, b^{[1]}] \quad \text{(repeat } m \text{ times)}\]

“In Python there is a package called NumPy… If you sum this 3 by m matrix with a 3 by 1 parameter vector, it’s going to automatically reproduce the parameter vector m times so that the equation works.” - Andrew Ng
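Broadcasting is easy to demonstrate directly (the sizes here are small placeholders):

```python
import numpy as np

np.random.seed(0)
m = 4                                # batch size
W = np.random.randn(3, 5)            # (3, n) weights
X = np.random.randn(5, m)            # (n, m) batch of examples
b = np.array([[0.1], [0.2], [0.3]])  # (3, 1) bias, one per neuron

Z = W @ X + b    # b broadcasts across the m columns
print(Z.shape)   # (3, 4)

# Broadcasting is equivalent to explicitly tiling b m times:
print(np.allclose(Z, W @ X + np.tile(b, (1, m))))  # True
```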

Batched Forward Propagation

Layer 1:

\[Z^{[1]} = W^{[1]} X + b^{[1]} \quad \text{(3×m)}\]
\[A^{[1]} = \sigma(Z^{[1]}) \quad \text{(3×m)}\]

Layer 2:

\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \quad \text{(2×m)}\]
\[A^{[2]} = \sigma(Z^{[2]}) \quad \text{(2×m)}\]

Layer 3:

\[Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]} \quad \text{(1×m)}\]
\[A^{[3]} = \sigma(Z^{[3]}) = \hat{Y} \quad \text{(1×m)}\]

Key point: Capital letters denote batched versions, lowercase for single examples

# XOR dataset
X_xor = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]])
Y_xor = np.array([[0, 1, 1, 0]])

# Train neural network
nn_xor = NeuralNetwork(input_size=2, hidden_size=4, learning_rate=0.5)
nn_xor.train(X_xor, Y_xor, num_iterations=5000, print_cost=True)

# Predictions
predictions = nn_xor.predict(X_xor)
accuracy = np.mean(predictions == Y_xor)

print(f"\nFinal Accuracy: {accuracy:.2%}")
print("\nPredictions vs True:")
for i in range(4):
    print(f"  Input: {X_xor[:, i]}, Predicted: {predictions[0, i]}, True: {int(Y_xor[0, i])}")

# Plot cost
plt.figure(figsize=(10, 6))
plt.plot(nn_xor.costs, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('XOR Problem: Training Cost', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

6.5 Decision Boundary Visualization

What and Why

Unlike logistic regression, which can only produce linear decision boundaries, a neural network with hidden layers can learn nonlinear boundaries that curve and fold through feature space. Visualizing these boundaries demonstrates the expressive power gained by composing multiple layers of learned transformations.

How It Works

The code evaluates the trained network on a dense 2D grid and plots a contour at the 0.5 probability threshold. Each hidden neuron contributes a linear boundary that gets composed through activation functions to form the final nonlinear surface. The smoothness and complexity of the boundary depend on the number of hidden units and the activation function used.

Connection to ML

This visualization illustrates the universal approximation theorem in action – with enough hidden units, a single-hidden-layer network can approximate any continuous decision boundary. In practice, deeper networks achieve complex boundaries with fewer parameters.

# Generate non-linear dataset
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)

# Prepare data
X_circles_T = X_circles.T
Y_circles = y_circles.reshape(1, -1)

# Train neural network
nn_circles = NeuralNetwork(input_size=2, hidden_size=8, learning_rate=0.1)
nn_circles.train(X_circles_T, Y_circles, num_iterations=3000)

# Create mesh for decision boundary
h = 0.01
x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict on mesh
Z = nn_circles.predict(np.c_[xx.ravel(), yy.ravel()].T)
Z = Z.reshape(xx.shape)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Decision boundary
axes[0].contourf(xx, yy, Z, alpha=0.6, cmap='RdBu')
axes[0].scatter(X_circles[y_circles==0, 0], X_circles[y_circles==0, 1],
               c='blue', marker='o', s=50, edgecolors='k', alpha=0.7, label='Class 0')
axes[0].scatter(X_circles[y_circles==1, 0], X_circles[y_circles==1, 1],
               c='red', marker='s', s=50, edgecolors='k', alpha=0.7, label='Class 1')
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Neural Network Decision Boundary', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Cost curve
axes[1].plot(nn_circles.costs, linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Cost', fontsize=12)
axes[1].set_title('Training Cost', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Accuracy
predictions = nn_circles.predict(X_circles_T)
accuracy = np.mean(predictions == Y_circles)
print(f"\nAccuracy: {accuracy:.2%}")

6.6 MNIST Digit Classification

Let’s apply our neural network to handwritten digit recognition.

# Load digits dataset (1797 samples, 8x8 images)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# Binary classification: 0 vs 1
mask = (y_digits == 0) | (y_digits == 1)
X_binary = X_digits[mask]
y_binary = y_digits[mask]

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Prepare for neural network (transpose)
X_train_T = X_train_scaled.T
X_test_T = X_test_scaled.T
Y_train = y_train.reshape(1, -1)
Y_test = y_test.reshape(1, -1)

print(f"Training set: {X_train_T.shape}")
print(f"Test set: {X_test_T.shape}")

# Train neural network
nn_digits = NeuralNetwork(input_size=64, hidden_size=32, learning_rate=0.1)
print("\nTraining Neural Network...")
nn_digits.train(X_train_T, Y_train, num_iterations=2000, print_cost=True)

# Evaluate
train_pred = nn_digits.predict(X_train_T)
test_pred = nn_digits.predict(X_test_T)

train_acc = np.mean(train_pred == Y_train)
test_acc = np.mean(test_pred == Y_test)

print(f"\nTraining Accuracy: {train_acc:.2%}")
print(f"Test Accuracy: {test_acc:.2%}")

# Plot some predictions
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    pred = int(test_pred[0, i])
    true = int(y_test[i])
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred}, True: {true}', color=color, fontsize=10)
    ax.axis('off')

plt.suptitle('MNIST Digit Predictions (0 vs 1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

6.7 Loss Functions and Optimization

Cost vs Loss

Loss function \(L\): Error for a single example
Cost function \(J\): Average error over the entire batch

\[J(\hat{Y}, Y) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})\]

“Usually we would call it loss if there is only one example in the batch, and cost if there are multiple examples in a batch.” - Andrew Ng

Binary Cross-Entropy (Logistic Loss):

\[L(\hat{y}, y) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})\]
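A quick numeric illustration of loss vs cost (the predicted probabilities here are made up):

```python
import numpy as np

y_hat = np.array([0.9, 0.2, 0.7])  # made-up predicted probabilities
y = np.array([1, 0, 1])            # true labels

# Loss: one value per example
losses = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Cost: average loss over the batch
cost = losses.mean()

print(losses)  # per-example losses
print(cost)
```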

Optimization: Gradient Descent

Goal: Find \(W^{[1]}, W^{[2]}, W^{[3]}, b^{[1]}, b^{[2]}, b^{[3]}\) that minimize \(J\)

Update rule for all layers \(l = 1, 2, 3\):

\[W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}\]
\[b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}\]

“Remember: model equals architecture plus parameters. We have our architecture, if we have our parameters we’re done.” - Andrew Ng

Backward Propagation: The Chain Rule

“Why is it called backward propagation? It’s because we will start with the top layer, the one that’s closest to the loss function.” - Andrew Ng

Key insight: Compute gradients starting from output layer, moving backwards

Why start with \(W^{[3]}\), not \(W^{[1]}\)?

“If you want to understand how much should we move \(W^{[1]}\) to make the loss move, it’s much more complicated than answering the question how much should \(W^{[3]}\) move to move the loss. Because there’s much more connections if you want to compute with \(W^{[1]}\).” - Andrew Ng

Chain rule decomposition:

For \(W^{[3]}\):

\[\frac{\partial J}{\partial W^{[3]}} = \frac{\partial J}{\partial A^{[3]}} \cdot \frac{\partial A^{[3]}}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial W^{[3]}}\]

For \(W^{[2]}\) (reuse previous computation!):

\[\frac{\partial J}{\partial W^{[2]}} = \underbrace{\frac{\partial J}{\partial A^{[3]}} \cdot \frac{\partial A^{[3]}}{\partial Z^{[3]}}}_{\text{already computed!}} \cdot \frac{\partial Z^{[3]}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}}\]

For \(W^{[1]}\):

\[\frac{\partial J}{\partial W^{[1]}} = \underbrace{\frac{\partial J}{\partial Z^{[2]}}}_{\text{already computed!}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}}\]

Efficiency:

  • Store intermediate gradients while backpropagating

  • Reuse them for earlier layers

  • Don’t redo work!

“What’s interesting about it is that I’m not gonna redo the work I did, I’m just gonna store the right values while back-propagating and continue to derivate.” - Andrew Ng

Important: Follow the computational graph from forward propagation

  • Can’t skip connections that don’t exist

  • \(Z^{[2]}\) depends on \(W^{[2]}\), but \(A^{[1]}\) doesn’t depend on \(W^{[2]}\)

  • Choose the path carefully to avoid cancellations

Batch Gradient Descent vs Stochastic Gradient Descent

“Stochastic gradient descent updates the weights and bias after you see every example. So the direction of the gradient is quite noisy… while gradient descent or batch gradient descent updates after you’ve seen the whole batch of examples. And the gradient is much more precise.” - Andrew Ng

Batch GD:

  • Update after entire dataset

  • More accurate gradient direction

  • Slower per update

SGD:

  • Update after each example

  • Noisier gradient

  • Faster iterations

Mini-batch GD (most common):

  • Update after small batch (e.g., 32, 64, 128 examples)

  • Balance between accuracy and speed
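One epoch of mini-batch gradient descent can be sketched on top of the NeuralNetwork class above. This helper (`train_one_epoch` is a hypothetical name, and 32 an arbitrary batch size) works with any object exposing `forward`/`backward`/`update_parameters`:

```python
import numpy as np

def train_one_epoch(nn, X, Y, batch_size=32):
    """One epoch of mini-batch gradient descent.

    X is (n_features, m), Y is (1, m); nn is any object exposing
    forward/backward/update_parameters, e.g. the NeuralNetwork class above.
    """
    m = X.shape[1]
    perm = np.random.permutation(m)   # shuffle examples each epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        nn.forward(X[:, idx])           # forward pass on the mini-batch
        grads = nn.backward(Y[:, idx])  # gradients from this batch only
        nn.update_parameters(grads)     # one update per mini-batch
```

With batch_size equal to m this reduces to batch GD, and with batch_size 1 it becomes SGD.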

# Test different initializations
class NeuralNetworkInit(NeuralNetwork):
    def __init__(self, input_size, hidden_size, learning_rate=0.01, init_method='xavier'):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        # Different initializations
        if init_method == 'zeros':
            self.W1 = np.zeros((hidden_size, input_size))
            self.W2 = np.zeros((1, hidden_size))
        elif init_method == 'small':
            self.W1 = np.random.randn(hidden_size, input_size) * 0.01
            self.W2 = np.random.randn(1, hidden_size) * 0.01
        elif init_method == 'large':
            self.W1 = np.random.randn(hidden_size, input_size) * 1.0
            self.W2 = np.random.randn(1, hidden_size) * 1.0
        elif init_method == 'xavier':
            self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(1.0 / input_size)
            self.W2 = np.random.randn(1, hidden_size) * np.sqrt(1.0 / hidden_size)
        elif init_method == 'he':
            self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
            self.W2 = np.random.randn(1, hidden_size) * np.sqrt(2.0 / hidden_size)
        
        self.b1 = np.zeros((hidden_size, 1))
        self.b2 = np.zeros((1, 1))
        self.cache = {}
        self.costs = []

# Test on circles dataset
init_methods = ['small', 'large', 'xavier', 'he']
results = {}

for method in init_methods:
    nn = NeuralNetworkInit(input_size=2, hidden_size=8, 
                          learning_rate=0.1, init_method=method)
    nn.train(X_circles_T, Y_circles, num_iterations=1000)
    results[method] = nn.costs

# Plot comparison
plt.figure(figsize=(12, 7))
for method, costs in results.items():
    plt.plot(costs, linewidth=2, label=method.capitalize())

plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('Weight Initialization Comparison', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFinal Costs:")
for method, costs in results.items():
    print(f"  {method.capitalize()}: {costs[-1]:.4f}")

6.8 Gradient Checking

Verify the backpropagation implementation using numerical gradients:

\[\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}\]

def gradient_check(nn, X, Y, epsilon=1e-7):
    """
    Check gradients using numerical approximation.
    """
    # Forward and backward
    nn.forward(X)
    gradients = nn.backward(Y)
    
    # Get parameters
    params = {'W1': nn.W1, 'b1': nn.b1, 'W2': nn.W2, 'b2': nn.b2}
    
    # Numerical gradients
    num_gradients = {}
    
    for param_name, param in params.items():
        num_grad = np.zeros_like(param)
        
        # Iterate through each element
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        
        count = 0
        while not it.finished and count < 10:  # Check only first 10 for speed
            idx = it.multi_index
            old_value = param[idx]
            
            # J(theta + epsilon)
            param[idx] = old_value + epsilon
            nn.forward(X)
            cost_plus = nn.compute_cost(Y)
            
            # J(theta - epsilon)
            param[idx] = old_value - epsilon
            nn.forward(X)
            cost_minus = nn.compute_cost(Y)
            
            # Numerical gradient
            num_grad[idx] = (cost_plus - cost_minus) / (2 * epsilon)
            
            # Restore
            param[idx] = old_value
            it.iternext()
            count += 1
        
        num_gradients['d' + param_name] = num_grad
    
    # Compare
    print("\nGradient Check Results:")
    for param_name in ['W1', 'b1', 'W2', 'b2']:
        grad_key = 'd' + param_name
        
        # Get non-zero elements for comparison
        mask = num_gradients[grad_key] != 0
        if np.any(mask):
            analytical = gradients[grad_key][mask]
            numerical = num_gradients[grad_key][mask]
            
            diff = np.linalg.norm(analytical - numerical) / \
                   (np.linalg.norm(analytical) + np.linalg.norm(numerical))
            
            status = "✓ PASS" if diff < 1e-5 else "✗ FAIL"
            print(f"  {param_name}: Difference = {diff:.2e} {status}")

# Test gradient checking
nn_check = NeuralNetwork(input_size=2, hidden_size=3, learning_rate=0.01)
X_small = X_xor[:, :2]  # Use small dataset
Y_small = Y_xor[:, :2]

gradient_check(nn_check, X_small, Y_small)

6.9 Hidden Layer Visualization

What and Why

Examining hidden layer activations reveals the intermediate representations the network learns to solve the classification task. Each hidden neuron acts as a learned feature detector, and understanding what these detectors respond to is key to interpreting neural networks.

How It Works

The code reshapes each hidden neuron’s incoming weight vector (one row of \(W^{[1]}\)) into an 8×8 image and displays it as a heatmap over pixel space. Pixels with large positive weights excite the neuron and pixels with large negative weights suppress it, so each heatmap shows the input pattern that neuron responds to most strongly. Neurons with similar weight patterns are redundant, while neurons with complementary patterns contribute unique discriminative information.

Connection to ML

Hidden layer visualization is the foundation of neural network interpretability. The same principle extends to feature visualization in CNNs (what patterns activate specific filters), attention maps in transformers, and representation learning analysis in deep learning research.

# Train on digits again
nn_vis = NeuralNetwork(input_size=64, hidden_size=16, learning_rate=0.1)
nn_vis.train(X_train_T, Y_train, num_iterations=2000)

# Visualize hidden layer weights
fig, axes = plt.subplots(4, 4, figsize=(10, 10))

for i, ax in enumerate(axes.flat):
    if i < nn_vis.hidden_size:
        # Reshape weights to 8x8 image
        weights = nn_vis.W1[i].reshape(8, 8)
        ax.imshow(weights, cmap='RdBu', aspect='auto')
        ax.set_title(f'Neuron {i+1}', fontsize=9)
    ax.axis('off')

plt.suptitle('Hidden Layer Learned Features\n(Weights from Input to Hidden Layer)',
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nEach neuron learns different features of the input digits.")

Summary: Key Concepts from Lecture 11

1. Neural Network Fundamentals

Model = Architecture + Parameters

  • Architecture: How neurons and layers are arranged

  • Parameters: Weights \(W\) and biases \(b\) learned from data

Neuron = Linear + Activation: \(a = \sigma(Wx + b)\)
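The single-neuron equation above can be sketched directly in NumPy. This is a minimal illustration (the weights, inputs, and sizes are made up for the example, not taken from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(W, x, b):
    """One neuron = linear part z = Wx + b, then a nonlinear activation."""
    z = W @ x + b       # linear combination of the inputs
    return sigmoid(z)   # activation

# A single neuron with 3 inputs
W = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.0
a = neuron(W, x, b)     # a single activation in (0, 1)
```

Stacking many such neurons in a layer just turns `W` into a matrix and `a` into a vector; stacking layers feeds one layer's activations into the next.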

2. Deep Learning Enablers

“Deep learning is really computationally expensive and people had to find techniques to parallelize the code and use GPUs… The second part is data available has been growing after the Internet bubble… And finally algorithms - people have come up with new techniques.” - Andrew Ng

Three key factors:

  1. Computation: GPUs, parallelization

  2. Data: Large datasets from digitalization

  3. Algorithms: New training techniques

3. From Logistic Regression to Deep Networks

Evolution:

  • 1 neuron → Logistic regression

  • Multiple output neurons → Multi-label (sigmoid) or Multi-class (softmax)

  • Multiple layers → Deep neural network

Multi-class vs Multi-label:

  • Multi-label (sigmoid): Image can have cat AND lion → outputs independent

  • Multi-class (softmax): Image has ONLY cat OR lion → outputs sum to 1
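The distinction shows up directly in how the output scores are turned into probabilities. A small sketch (the three logits are invented for illustration):

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0])   # raw scores for e.g. cat, lion, dog

# Multi-label: independent sigmoids -- each probability stands on its own,
# so an image can be "cat" AND "lion" at the same time
multi_label = 1.0 / (1.0 + np.exp(-logits))

# Multi-class: softmax -- the outputs compete and always sum to 1,
# so the image is cat OR lion, not both
exp_shifted = np.exp(logits - logits.max())   # shift for numerical stability
multi_class = exp_shifted / exp_shifted.sum()

print(multi_label)        # each entry in (0, 1); the sum can exceed 1
print(multi_class.sum())  # sums to 1
```

The softmax coupling is what Ng means by "all neurons learning together": raising one class's score necessarily lowers the others' probabilities.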

“Think about health care… Usually there is no overlap between diseases. You want to classify a specific disease among a large number of diseases… You want the model to learn with all neurons learning together by creating interaction between them.” - Andrew Ng

4. Hierarchical Feature Learning

Why deep networks work:

  • Layer 1: Low-level features (edges)

  • Layer 2: Mid-level features (parts - ears, mouth)

  • Layer 3: High-level features (whole objects)

“The deeper you go, the more complex information the neurons are able to understand.” - Andrew Ng

Fully-connected layers:

  • Connect every neuron to every neuron in next layer

  • Let network discover features automatically

  • “End-to-end learning” - just input and output, no manual feature engineering

“We will let the network figure out what are the interesting features. And oftentimes, the network is going to be able to do better than the humans.” - Andrew Ng

5. Training Process

Three steps:

  1. Initialize parameters (random values)

  2. Optimize via gradient descent:

    • Forward propagation: \(X \rightarrow \hat{Y}\)

    • Compute loss: \(J(\hat{Y}, Y)\)

    • Backward propagation: Compute \(\frac{\partial J}{\partial W}, \frac{\partial J}{\partial b}\)

    • Update: \(W := W - \alpha \frac{\partial J}{\partial W}\)

  3. Predict using optimized parameters
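The three steps above can be sketched end to end with a one-neuron (logistic regression) model on toy data. Everything here is illustrative (synthetic data, made-up sizes and learning rate), not the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m examples with n features; labels come from a known linear rule
n, m = 2, 200
X = rng.normal(size=(n, m))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize parameters (small random weights, zero bias)
W = rng.normal(scale=0.01, size=(1, n))
b = np.zeros((1, 1))
alpha = 0.5

# Step 2: optimize via gradient descent
for _ in range(500):
    Y_hat = sigmoid(W @ X + b)            # forward propagation: X -> Y_hat
    dZ = Y_hat - Y                        # backward propagation (logistic loss)
    dW = (dZ @ X.T) / m
    db = dZ.mean(axis=1, keepdims=True)
    W -= alpha * dW                       # update: W := W - alpha * dJ/dW
    b -= alpha * db

# Step 3: predict using the optimized parameters
preds = (sigmoid(W @ X + b) > 0.5).astype(float)
accuracy = (preds == Y).mean()
```

Because the toy labels are linearly separable, this loop should reach high training accuracy; a real network repeats the same three steps with more layers.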

6. Implementation Tips

Vectorization and broadcasting:

  • Process entire batch at once (parallelization)

  • NumPy automatically broadcasts bias vectors

  • Parameters don’t scale with batch size
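All three bullets can be seen in a few lines of NumPy (shapes chosen only for illustration):

```python
import numpy as np

m, n_in, n_hidden = 5, 4, 3            # batch of 5 examples, illustrative sizes
X = np.ones((n_in, m))                 # columns are examples
W = np.full((n_hidden, n_in), 0.1)
b = np.array([[1.0], [2.0], [3.0]])    # one bias per hidden unit: shape (3, 1)

# One matrix multiply processes the whole batch at once;
# b has shape (3, 1) but NumPy broadcasts it across the m columns of W @ X
Z = W @ X + b                          # shape (3, 5)

# The parameter count is fixed by the architecture, not the batch size
n_params = W.size + b.size             # 3*4 + 3 = 15, whatever m is
```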

Backpropagation efficiency:

  • Start from output layer (closest to loss)

  • Use chain rule to reuse intermediate gradients

  • Cache intermediate values (the \(Z\)'s and \(A\)'s) during the forward pass for reuse in the backward pass
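A minimal two-layer sketch makes the reuse concrete: `dZ2` is computed once at the output and then used three times. This assumes sigmoid activations and the logistic loss (so the output-layer gradient simplifies to `A2 - Y`); the sizes and data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8

# Values that would have been cached during the forward pass
X  = rng.normal(size=(4, m))
W1 = rng.normal(size=(3, 4)); b1 = np.zeros((3, 1))
W2 = rng.normal(size=(1, 3)); b2 = np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
A1 = sigmoid(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)
Y  = (rng.random((1, m)) > 0.5).astype(float)

# Backward pass starts at the output layer (closest to the loss)...
dZ2 = A2 - Y                           # computed once
dW2 = (dZ2 @ A1.T) / m                 # ...reused for the weight gradient,
db2 = dZ2.mean(axis=1, keepdims=True)  # ...the bias gradient,
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)     # ...and to push the gradient back (chain rule)
dW1 = (dZ1 @ X.T) / m
db1 = dZ1.mean(axis=1, keepdims=True)
```

Without this reuse, each layer's gradient would redo the work of every layer after it; with it, backprop costs roughly the same as one forward pass.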

7. Architecture Decisions

How to choose network size?

“Nobody knows the right answer, so you have to test it. We would try ten different architectures, train the network, look at validation set accuracy, and decide which one seems to be the best.” - Andrew Ng

Guidelines:

  • More complex problem → Deeper network

  • More data → Can support larger network

  • Start simple, increase complexity as needed

  • Use validation set to compare architectures

8. What’s Next?

“One thing that I’d like you to do is just think about the things that can be tweaked in a neural network. When you build a neural network, you are not done, you have to tweak the activations, you have to tweak the loss function. There’s many things you can tweak.” - Andrew Ng

Topics for Lecture 12:

  • Detailed backpropagation derivations

  • Activation function choices (ReLU, tanh, etc.)

  • Initialization strategies

  • Regularization techniques

  • Practical training tips

Key Equations Reference

Forward Propagation (3-layer network):

\[Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]})\]

\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})\]

\[Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}, \quad \hat{Y} = A^{[3]} = \sigma(Z^{[3]})\]
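The three forward-propagation equations above translate line for line into NumPy. This sketch assumes sigmoid activations at every layer, as in the equations; the layer sizes are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_3layer(X, params):
    """Forward pass for a 3-layer all-sigmoid network."""
    W1, b1, W2, b2, W3, b3 = params
    A1 = sigmoid(W1 @ X + b1)       # Z1 = W1 X  + b1,  A1 = sigma(Z1)
    A2 = sigmoid(W2 @ A1 + b2)      # Z2 = W2 A1 + b2,  A2 = sigma(Z2)
    Y_hat = sigmoid(W3 @ A2 + b3)   # Z3 = W3 A2 + b3,  Y_hat = sigma(Z3)
    return Y_hat

# Shapes for a 4 -> 3 -> 2 -> 1 network on a batch of 5 examples
rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 4)), np.zeros((3, 1)),
          rng.normal(size=(2, 3)), np.zeros((2, 1)),
          rng.normal(size=(1, 2)), np.zeros((1, 1)))
X = rng.normal(size=(4, 5))
Y_hat = forward_3layer(X, params)   # shape (1, 5), entries in (0, 1)
```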

Cost Function:

\[J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})\]

Gradient Descent:

\[W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \quad b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}\]

Practice Exercises

  1. Activation Comparison: Implement a network with different activations (sigmoid, tanh, ReLU). Compare convergence speed and final accuracy.

  2. Deeper Network: Extend to 3-4 layers. Implement forward and backprop. Does it improve performance on MNIST?

  3. Momentum: Add momentum to gradient descent: \(v_t = \beta v_{t-1} + (1-\beta) \nabla J\). Does training become faster/more stable?

  4. Mini-batch: Implement mini-batch gradient descent instead of full-batch. Compare speed and convergence.

  5. Batch Normalization: Add batch norm layer. Verify it helps with vanishing gradients in deep networks.

  6. Vanishing Gradient: Build 10-layer network with sigmoid. Visualize gradient magnitudes per layer. Confirm vanishing gradient problem.

  7. Dropout: Implement dropout regularization. Does it improve test accuracy?

  8. Learning Rate Decay: Implement exponential decay: \(\alpha_t = \alpha_0 e^{-kt}\). Plot cost curves.

  9. Multi-class: Extend to 10-class MNIST (all digits). Use softmax output. Report confusion matrix.

  10. Feature Visualization: Use t-SNE to visualize hidden layer activations. Do classes cluster?

References

  1. CS229 Lecture Notes: Andrew Ng’s notes on neural networks

  2. “Gradient-Based Learning Applied to Document Recognition”: LeCun et al. (1998)

  3. “Understanding the Difficulty of Training Deep Feedforward Neural Networks”: Glorot & Bengio (2010)

  4. “Delving Deep into Rectifiers”: He et al. (2015) - He initialization

  5. “Deep Learning”: Goodfellow et al., Chapters 6-8

  6. “Neural Networks and Deep Learning”: Michael Nielsen (online book)

Next: Lecture 12: Neural Networks - Advanced