Lecture 11: Introduction to Neural Networks¶
From Andrew Ng’s CS229 Lecture 11 (Autumn 2018)¶
“Deep learning is a set of techniques that is a subset of machine learning… specifically for problems in computer vision, natural language processing and speech recognition.” - Andrew Ng
Deep Learning Revolution¶
Three key enablers:
Computation: GPUs and parallelization
Data: Large datasets from digitalization
Algorithms: New techniques for training deep networks
Course Overview¶
Topics covered:
Logistic regression as a neural network
Neural network architecture
Forward propagation
Backpropagation algorithm
Training deep networks
Practical tricks and tips
The Cat Detection Problem¶
Binary Classification Task:
Input: 64×64 RGB image (64 × 64 × 3 = 12,288 numbers)
Output: 1 if cat present, 0 if no cat
Flatten 3D image matrix into vector
Apply logistic regression: ŷ = sigmoid(Wx + b)
“In computer science, you know that images can be represented as 3D matrices. If I tell you this is a color image of size 64 by 64, how many numbers do I have to represent those pixels? 64 × 64 × 3. Three for the RGB channel.” - Andrew Ng
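The flattening step can be sketched in NumPy (the random image and tiny weights below are illustrative placeholders, not trained values):

```python
import numpy as np

# Hypothetical 64x64 RGB image (values in [0, 1]); a stand-in for real pixel data
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Flatten the 3D matrix into a single column vector of 64*64*3 = 12,288 numbers
x = image.reshape(-1, 1)
print(x.shape)  # (12288, 1)

# Logistic regression on the flattened vector: y_hat = sigmoid(Wx + b)
W = rng.standard_normal((1, 12288)) * 0.01  # untrained placeholder weights
b = 0.0
y_hat = 1 / (1 + np.exp(-(W @ x + b)))
print(y_hat.shape)  # (1, 1): one probability that the image contains a cat
```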
From Logistic Regression to Neural Networks¶
Logistic Regression:
One neuron
Linear part: z = Wx + b
Activation: a = sigmoid(z)
Neural Network:
Stack multiple neurons in layers
Stack multiple layers
More parameters → More flexibility → Can model complex patterns
Why Neural Networks?¶
Advantages:
Learn features automatically (no manual feature engineering)
Scale with data (more data → better performance)
Universal approximators (can approximate any continuous function, given enough hidden units)
Challenges:
Computationally expensive
Need lots of data
Many hyperparameters to tune
“Black box” - hard to interpret
Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_circles, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded successfully!")
6.1 Logistic Regression as a Neural Network¶
“Neuron equals linear plus activation. We define a neuron as an operation that has two parts, one linear part, and one activation part.” - Andrew Ng
Training Process¶
Three steps:
Initialize parameters (w, b)
Optimize using gradient descent: find w, b that minimize loss
Predict using optimized parameters
Loss Function (Logistic Loss): \(\mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})\)
“The idea is: how can I minimize this function? I want to find w and b that minimize this function and I’m going to use a gradient descent algorithm.” - Andrew Ng
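As a quick sanity check, the logistic loss can be computed directly (a minimal sketch; `logistic_loss` is a helper name introduced here):

```python
import numpy as np

def logistic_loss(y, y_hat, eps=1e-10):
    """Logistic loss for a single example; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Confident and correct -> small loss; confident and wrong -> large loss
print(logistic_loss(1, 0.9))  # ~0.105
print(logistic_loss(1, 0.1))  # ~2.303
```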
Parameter Count:
Logistic regression for 64×64×3 image: 12,289 parameters (12,288 weights + 1 bias)
Count trick: Number of edges + number of neurons (each neuron has a bias)
Problem: Parameters scale with input size (we’ll fix this with layers)
From One to Multiple Classes¶
Goal evolution:
Start: Binary classification (cat vs no cat)
Next: Multi-class (cat vs lion vs iguana)
Solution: Multiple output neurons
3 neurons, each detecting one class
Each neuron: \(\hat{y}_i^{[1]} = a_i^{[1]} = \sigma(w_i^{[1]} x + b_i^{[1]})\)
Notation: \([1]\) = layer index, \(_i\) = neuron index within layer
Total parameters: 3 × 12,289 = 36,867 parameters
“This network is going to be robust because the three neurons aren’t communicating together. We can totally train them independently from each other… The sigmoid here doesn’t depend on the sigmoid here.” - Andrew Ng
Key insight: Multi-label vs Multi-class
Multi-label: Image can have both cat AND lion → Use sigmoid for each neuron
Multi-class: Image has ONLY one animal → Use softmax (constrains outputs to sum to 1)
“Think about health care… Usually there is no overlap between these [diseases]. This model would still work but would not be optimal… You want to start with another model where you put the constraint that there is only one disease that we want to predict and let the model learn with all the neurons learn together by creating interaction between them.” - Andrew Ng
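The sigmoid-vs-softmax distinction is easy to see numerically (the logits below are arbitrary examples):

```python
import numpy as np

z = np.array([2.0, 1.0, -1.0])  # made-up logits for cat, lion, iguana

# Multi-label: independent sigmoids; outputs need not sum to 1
sigmoid_out = 1 / (1 + np.exp(-z))

# Multi-class: softmax couples the neurons; outputs always sum to 1
softmax_out = np.exp(z) / np.sum(np.exp(z))

print(sigmoid_out.round(3))  # each in (0, 1), independent of the others
print(softmax_out.round(3), softmax_out.sum())  # a distribution over the classes
```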
6.2 Activation Functions¶
Activation functions introduce non-linearity into neural networks.
Common Activations:
Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
Tanh: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
ReLU: \(\text{ReLU}(z) = \max(0, z)\)
Leaky ReLU: \(\text{LeakyReLU}(z) = \max(0.01z, z)\)
# Activation functions and derivatives
def sigmoid(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid."""
    s = sigmoid(z)
    return s * (1 - s)

def tanh(z):
    """Hyperbolic tangent."""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh."""
    return 1 - np.tanh(z)**2

def relu(z):
    """Rectified Linear Unit."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU."""
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Derivative of Leaky ReLU."""
    return np.where(z > 0, 1, alpha)
# Visualize activations
z = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Sigmoid
axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='σ(z)')
axes[0, 0].plot(z, sigmoid_derivative(z), 'r--', linewidth=2, label="σ'(z)")
axes[0, 0].axhline(0, color='black', linewidth=0.5)
axes[0, 0].axvline(0, color='black', linewidth=0.5)
axes[0, 0].set_title('Sigmoid', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('z', fontsize=11)
axes[0, 0].set_ylabel('Activation', fontsize=11)
axes[0, 0].legend(fontsize=10)
axes[0, 0].grid(True, alpha=0.3)
# Tanh
axes[0, 1].plot(z, tanh(z), 'b-', linewidth=2, label='tanh(z)')
axes[0, 1].plot(z, tanh_derivative(z), 'r--', linewidth=2, label="tanh'(z)")
axes[0, 1].axhline(0, color='black', linewidth=0.5)
axes[0, 1].axvline(0, color='black', linewidth=0.5)
axes[0, 1].set_title('Tanh', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('z', fontsize=11)
axes[0, 1].set_ylabel('Activation', fontsize=11)
axes[0, 1].legend(fontsize=10)
axes[0, 1].grid(True, alpha=0.3)
# ReLU
axes[1, 0].plot(z, relu(z), 'b-', linewidth=2, label='ReLU(z)')
axes[1, 0].plot(z, relu_derivative(z), 'r--', linewidth=2, label="ReLU'(z)")
axes[1, 0].axhline(0, color='black', linewidth=0.5)
axes[1, 0].axvline(0, color='black', linewidth=0.5)
axes[1, 0].set_title('ReLU', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('z', fontsize=11)
axes[1, 0].set_ylabel('Activation', fontsize=11)
axes[1, 0].legend(fontsize=10)
axes[1, 0].grid(True, alpha=0.3)
# Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), 'b-', linewidth=2, label='Leaky ReLU(z)')
axes[1, 1].plot(z, leaky_relu_derivative(z), 'r--', linewidth=2, label="Leaky ReLU'(z)")
axes[1, 1].axhline(0, color='black', linewidth=0.5)
axes[1, 1].axvline(0, color='black', linewidth=0.5)
axes[1, 1].set_title('Leaky ReLU', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('z', fontsize=11)
axes[1, 1].set_ylabel('Activation', fontsize=11)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nActivation Properties:")
print("- Sigmoid: Range (0,1), vanishing gradient for |z| > 3")
print("- Tanh: Range (-1,1), zero-centered, still vanishing gradient")
print("- ReLU: No vanishing gradient for z>0, dead neurons for z<0")
print("- Leaky ReLU: Fixes dying ReLU problem")
6.3 Building Multi-Layer Neural Networks¶
“Let’s go have fun with neural networks.” - Andrew Ng
Network Architecture Vocabulary¶
Layer types:
Input layer: First layer that receives raw features
Hidden layer: Middle layers - “hidden” because they don’t see inputs or outputs directly
Output layer: Final layer producing predictions
“The reason we call it hidden is because the inputs and the outputs are hidden from this layer. The only thing that this layer sees as input is what the previous layer gave it. So it’s an abstraction of the inputs but it’s not the inputs.” - Andrew Ng
Fully-Connected (Dense) Layers:
Every neuron in layer L connected to every neuron in layer L+1
Network discovers useful features automatically
No manual feature engineering needed
“We will let the network figure out what are the interesting features. And oftentimes, the network is going to be able better than the humans to find what are the features that are representative.” - Andrew Ng
Why Deep Networks Work: Hierarchical Learning¶
Example: Cat detection
Layer 1: Detects edges (horizontal, vertical, diagonal)
Layer 2: Combines edges → Detects parts (ears, mouth, whiskers)
Layer 3: Combines parts → Detects whole cat face
Example: House price prediction
Inputs: # bedrooms, size, zip code, neighborhood wealth
Layer 1 neurons might learn:
School quality (from zip code + wealth)
Walkability (from zip code)
Family size fit (from size + bedrooms)
Layer 2: Combines learned features → Predicts price
“The deeper you go, the more complex information the neurons are able to understand.” - Andrew Ng
Three-Layer Network Example¶
Architecture:
Input: n features
Hidden layer 1: 3 neurons
Hidden layer 2: 2 neurons
Output: 1 neuron (binary classification)
Forward Propagation Equations:
Notation:
\([l]\) = layer index (superscript in square brackets)
\(Z^{[l]}\) = linear part (pre-activation)
\(A^{[l]}\) = activation output
\(W^{[l]}\) = weights for layer l
\(b^{[l]}\) = bias for layer l
Parameter count:
Layer 1: \(3n + 3\) (3 neurons × n inputs + 3 biases)
Layer 2: \(2 \times 3 + 2 = 8\) (2 neurons × 3 inputs + 2 biases)
Layer 3: \(2 \times 1 + 1 = 3\) (1 neuron × 2 inputs + 1 bias)
Total: \(3n + 14\) parameters
“Most of the parameters are still in the input layer.” - Andrew Ng
Key insight: Parameter count = edges + neurons (one bias per neuron)
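The edges-plus-neurons counting trick can be written as a small helper (a sketch; `count_parameters` is a name introduced here):

```python
def count_parameters(layer_sizes):
    """Parameter count for a fully-connected network: edges + neurons.

    layer_sizes: [n_input, n_hidden_1, ..., n_output]
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights: one per edge
        total += n_out         # biases: one per neuron
    return total

# The three-layer example above with n = 12,288 inputs: 3n + 14 parameters
print(count_parameters([12288, 3, 2, 1]))  # 36878
# Plain logistic regression on the same input: 12,289 parameters
print(count_parameters([12288, 1]))  # 12289
```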
class NeuralNetwork:
    """
    Simple 2-layer neural network for binary classification.
    Architecture: Input -> Hidden (ReLU) -> Output (Sigmoid)
    """
    def __init__(self, input_size, hidden_size, learning_rate=0.01):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        # He initialization (sqrt(2/n_in) scaling, suited to ReLU)
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, 1))
        # Cache for backprop
        self.cache = {}
        self.costs = []

    def forward(self, X):
        """
        Forward propagation.
        X: (n_features, m_samples)
        """
        # Layer 1
        Z1 = self.W1 @ X + self.b1  # (hidden_size, m)
        A1 = relu(Z1)
        # Layer 2
        Z2 = self.W2 @ A1 + self.b2  # (1, m)
        A2 = sigmoid(Z2)
        # Cache for backprop
        self.cache = {
            'X': X,
            'Z1': Z1,
            'A1': A1,
            'Z2': Z2,
            'A2': A2
        }
        return A2

    def compute_cost(self, Y):
        """
        Compute binary cross-entropy cost.
        Y: (1, m)
        """
        m = Y.shape[1]
        A2 = self.cache['A2']
        # Clip to avoid log(0)
        A2_clipped = np.clip(A2, 1e-10, 1 - 1e-10)
        cost = -1/m * np.sum(Y * np.log(A2_clipped) + (1 - Y) * np.log(1 - A2_clipped))
        return cost

    def backward(self, Y):
        """
        Backpropagation.
        """
        m = Y.shape[1]
        X = self.cache['X']
        A1 = self.cache['A1']
        A2 = self.cache['A2']
        Z1 = self.cache['Z1']
        # Layer 2 gradients
        dZ2 = A2 - Y                                    # (1, m)
        dW2 = 1/m * dZ2 @ A1.T                          # (1, hidden_size)
        db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)
        # Layer 1 gradients
        dA1 = self.W2.T @ dZ2                           # (hidden_size, m)
        dZ1 = dA1 * relu_derivative(Z1)                 # (hidden_size, m)
        dW1 = 1/m * dZ1 @ X.T                           # (hidden_size, n_features)
        db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)  # (hidden_size, 1)
        return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}

    def update_parameters(self, gradients):
        """
        Gradient descent update.
        """
        self.W1 -= self.learning_rate * gradients['dW1']
        self.b1 -= self.learning_rate * gradients['db1']
        self.W2 -= self.learning_rate * gradients['dW2']
        self.b2 -= self.learning_rate * gradients['db2']

    def train(self, X, Y, num_iterations=1000, print_cost=False):
        """
        Train the neural network.
        X: (n_features, m)
        Y: (1, m)
        """
        for i in range(num_iterations):
            # Forward
            A2 = self.forward(X)
            # Cost
            cost = self.compute_cost(Y)
            self.costs.append(cost)
            # Backward
            gradients = self.backward(Y)
            # Update
            self.update_parameters(gradients)
            if print_cost and i % 100 == 0:
                print(f"Iteration {i}: Cost = {cost:.4f}")

    def predict(self, X):
        """
        Predict class labels.
        """
        A2 = self.forward(X)
        predictions = (A2 > 0.5).astype(int)
        return predictions

print("Neural Network class implemented!")
6.4 Vectorization and Batch Processing¶
“We want to be able to parallelize our code or our computation as much as possible by giving batches of inputs and parallelizing these equations.” - Andrew Ng
From Single Example to Mini-Batch¶
Single example: \(x\) is an \(n \times 1\) column vector (one image)
Mini-batch: \(X\) is \(n \times m\) matrix where:
Each column \(X^{(i)}\) is one training example
\(m\) = batch size
Notation: \((i)\) in parentheses = example index
Broadcasting in NumPy¶
The problem: \(Z^{[1]} = W^{[1]} X + b^{[1]}\)
\(Z^{[1]}\): \(3 \times m\) (want one output per example)
\(W^{[1]}\): \(3 \times n\) (parameters stay constant!)
\(X\): \(n \times m\)
\(b^{[1]}\): \(3 \times 1\) (parameters stay constant!)
Issue: Can’t add \(3 \times m\) matrix to \(3 \times 1\) vector!
Solution: Broadcasting
“Broadcasting is the fact that we don’t want to change the number of parameters, it should stay the same. But we still want this operation to be able to be written in parallel version.” - Andrew Ng
NumPy automatically replicates \(b^{[1]}\) from \(3 \times 1\) to \(3 \times m\): \(\tilde{b}^{[1]} = [b^{[1]}, b^{[1]}, \ldots, b^{[1]}]\) (repeated \(m\) times)
“In Python there is a package called NumPy… If you sum this 3 by m matrix with a 3 by 1 parameter vector, it’s going to automatically reproduce the parameter vector m times so that the equation works.” - Andrew Ng
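A minimal demonstration of this broadcasting behavior (shapes chosen as small stand-ins for the 3-neuron example):

```python
import numpy as np

n, m = 12, 4                          # 12 features, batch of 4 examples
W1 = np.ones((3, n))                  # 3 hidden neurons
b1 = np.array([[1.0], [2.0], [3.0]])  # (3, 1): parameters stay constant
X = np.ones((n, m))

# b1 is broadcast from (3, 1) to (3, m): one copy per column/example
Z1 = W1 @ X + b1
print(Z1.shape)  # (3, 4)

# Identical to explicitly tiling b1 m times
Z1_tiled = W1 @ X + np.tile(b1, (1, m))
print(np.allclose(Z1, Z1_tiled))  # True
```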
Batched Forward Propagation¶
Layer 1: \(Z^{[1]} = W^{[1]} X + b^{[1]}\) (3×m), \(A^{[1]} = \sigma(Z^{[1]})\) (3×m)
Layer 2: \(Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\) (2×m), \(A^{[2]} = \sigma(Z^{[2]})\) (2×m)
Layer 3: \(Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}\) (1×m), \(A^{[3]} = \sigma(Z^{[3]}) = \hat{Y}\) (1×m)
Key point: Capital letters denote batched versions, lowercase for single examples
# XOR dataset
X_xor = np.array([[0, 0, 1, 1],
[0, 1, 0, 1]])
Y_xor = np.array([[0, 1, 1, 0]])
# Train neural network
nn_xor = NeuralNetwork(input_size=2, hidden_size=4, learning_rate=0.5)
nn_xor.train(X_xor, Y_xor, num_iterations=5000, print_cost=True)
# Predictions
predictions = nn_xor.predict(X_xor)
accuracy = np.mean(predictions == Y_xor)
print(f"\nFinal Accuracy: {accuracy:.2%}")
print("\nPredictions vs True:")
for i in range(4):
print(f" Input: {X_xor[:, i]}, Predicted: {predictions[0, i]}, True: {int(Y_xor[0, i])}")
# Plot cost
plt.figure(figsize=(10, 6))
plt.plot(nn_xor.costs, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('XOR Problem: Training Cost', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
6.5 Decision Boundary Visualization¶
What and Why¶
Unlike logistic regression, which can only produce linear decision boundaries, a neural network with hidden layers can learn nonlinear boundaries that curve and fold through feature space. Visualizing these boundaries demonstrates the expressive power gained by composing multiple layers of learned transformations.
How It Works¶
The code evaluates the trained network on a dense 2D grid and plots a contour at the 0.5 probability threshold. Each hidden neuron contributes a linear boundary that gets composed through activation functions to form the final nonlinear surface. The smoothness and complexity of the boundary depend on the number of hidden units and the activation function used.
Connection to ML¶
This visualization illustrates the universal approximation theorem in action – with enough hidden units, a single-hidden-layer network can approximate any continuous decision boundary. In practice, deeper networks achieve complex boundaries with fewer parameters.
# Generate non-linear dataset
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)
# Prepare data
X_circles_T = X_circles.T
Y_circles = y_circles.reshape(1, -1)
# Train neural network
nn_circles = NeuralNetwork(input_size=2, hidden_size=8, learning_rate=0.1)
nn_circles.train(X_circles_T, Y_circles, num_iterations=3000)
# Create mesh for decision boundary
h = 0.01
x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Predict on mesh
Z = nn_circles.predict(np.c_[xx.ravel(), yy.ravel()].T)
Z = Z.reshape(xx.shape)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Decision boundary
axes[0].contourf(xx, yy, Z, alpha=0.6, cmap='RdBu')
axes[0].scatter(X_circles[y_circles==0, 0], X_circles[y_circles==0, 1],
c='blue', marker='o', s=50, edgecolors='k', alpha=0.7, label='Class 0')
axes[0].scatter(X_circles[y_circles==1, 0], X_circles[y_circles==1, 1],
c='red', marker='s', s=50, edgecolors='k', alpha=0.7, label='Class 1')
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Neural Network Decision Boundary', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Cost curve
axes[1].plot(nn_circles.costs, linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Cost', fontsize=12)
axes[1].set_title('Training Cost', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Accuracy
predictions = nn_circles.predict(X_circles_T)
accuracy = np.mean(predictions == Y_circles)
print(f"\nAccuracy: {accuracy:.2%}")
6.6 Handwritten Digit Classification¶
Let’s apply our neural network to handwritten digit recognition.
# Load digits dataset (1797 samples, 8x8 images)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target
# Binary classification: 0 vs 1
mask = (y_digits == 0) | (y_digits == 1)
X_binary = X_digits[mask]
y_binary = y_digits[mask]
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Prepare for neural network (transpose)
X_train_T = X_train_scaled.T
X_test_T = X_test_scaled.T
Y_train = y_train.reshape(1, -1)
Y_test = y_test.reshape(1, -1)
print(f"Training set: {X_train_T.shape}")
print(f"Test set: {X_test_T.shape}")
# Train neural network
nn_digits = NeuralNetwork(input_size=64, hidden_size=32, learning_rate=0.1)
print("\nTraining Neural Network...")
nn_digits.train(X_train_T, Y_train, num_iterations=2000, print_cost=True)
# Evaluate
train_pred = nn_digits.predict(X_train_T)
test_pred = nn_digits.predict(X_test_T)
train_acc = np.mean(train_pred == Y_train)
test_acc = np.mean(test_pred == Y_test)
print(f"\nTraining Accuracy: {train_acc:.2%}")
print(f"Test Accuracy: {test_acc:.2%}")
# Plot some predictions
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    pred = int(test_pred[0, i])
    true = int(y_test[i])
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred}, True: {true}', color=color, fontsize=10)
    ax.axis('off')
plt.suptitle('Handwritten Digit Predictions (0 vs 1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
6.7 Loss Functions and Optimization¶
Cost vs Loss¶
Loss function \(L\): error for a single example
Cost function \(J\): average error over the entire batch
“Usually we would call it loss if there is only one example in the batch, and cost if there are multiple examples in a batch.” - Andrew Ng
Binary Cross-Entropy (Logistic Loss): \(L(\hat{y}, y) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})\)
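The loss/cost distinction in code (a sketch; the labels and predictions are made-up values):

```python
import numpy as np

def loss(y, y_hat, eps=1e-10):
    """Logistic loss for one example (y and y_hat may also be arrays)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(Y, Y_hat):
    """Cost: the loss averaged over every example in the batch."""
    return float(np.mean(loss(Y, Y_hat)))

# Made-up labels and predictions for a batch of 4 examples
Y = np.array([1, 0, 1, 1])
Y_hat = np.array([0.9, 0.2, 0.8, 0.6])
print(cost(Y, Y_hat))  # mean of the four per-example losses, ~0.266
```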
Optimization: Gradient Descent¶
Goal: Find \(W^{[1]}, W^{[2]}, W^{[3]}, b^{[1]}, b^{[2]}, b^{[3]}\) that minimize \(J\)
Update rule for all layers \(l = 1, 2, 3\): \(W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \quad b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}\)
“Remember: model equals architecture plus parameters. We have our architecture, if we have our parameters we’re done.” - Andrew Ng
Backward Propagation: The Chain Rule¶
“Why is it called backward propagation? It’s because we will start with the top layer, the one that’s closest to the loss function.” - Andrew Ng
Key insight: Compute gradients starting from output layer, moving backwards
Why start with \(W^{[3]}\), not \(W^{[1]}\)?
“If you want to understand how much should we move \(W^{[1]}\) to make the loss move, it’s much more complicated than answering the question how much should \(W^{[3]}\) move to move the loss. Because there’s much more connections if you want to compute with \(W^{[1]}\).” - Andrew Ng
Chain rule decomposition:
For \(W^{[3]}\): \(\frac{\partial J}{\partial W^{[3]}} = \frac{\partial J}{\partial A^{[3]}} \cdot \frac{\partial A^{[3]}}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial W^{[3]}}\)
For \(W^{[2]}\) (reuse previous computation!): \(\frac{\partial J}{\partial W^{[2]}} = \underbrace{\frac{\partial J}{\partial A^{[3]}} \cdot \frac{\partial A^{[3]}}{\partial Z^{[3]}}}_{\text{already computed!}} \cdot \frac{\partial Z^{[3]}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}}\)
For \(W^{[1]}\): \(\frac{\partial J}{\partial W^{[1]}} = \underbrace{\frac{\partial J}{\partial Z^{[2]}}}_{\text{already computed!}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}}\)
Efficiency:
Store intermediate gradients while backpropagating
Reuse them for earlier layers
Don’t redo work!
“What’s interesting about it is that I’m not gonna redo the work I did, I’m just gonna store the right values while back-propagating and continue to derivate.” - Andrew Ng
Important: Follow the computational graph from forward propagation
Can’t skip connections that don’t exist
\(Z^{[2]}\) depends on \(W^{[2]}\), but \(A^{[1]}\) doesn’t depend on \(W^{[2]}\)
Choose the path carefully to avoid cancellations
Batch Gradient Descent vs Stochastic Gradient Descent¶
“Stochastic gradient descent updates the weights and bias after you see every example. So the direction of the gradient is quite noisy… while gradient descent or batch gradient descent updates after you’ve seen the whole batch of examples. And the gradient is much more precise.” - Andrew Ng
Batch GD:
Update after entire dataset
More accurate gradient direction
Slower per update
SGD:
Update after each example
Noisier gradient
Faster iterations
Mini-batch GD (most common):
Update after small batch (e.g., 32, 64, 128 examples)
Balance between accuracy and speed
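A mini-batch split can be sketched as a generator; each epoch, one forward/backward/update step would run per batch (the `minibatches` helper is a name introduced here for illustration):

```python
import numpy as np

def minibatches(X, Y, batch_size, seed=0):
    """Yield shuffled mini-batches from X of shape (n, m) and Y of shape (1, m)."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

# Toy data: 2 features, 10 examples
X = np.arange(20, dtype=float).reshape(2, 10)
Y = np.ones((1, 10))

# One epoch would run forward/backward/update once per batch
batches = list(minibatches(X, Y, batch_size=4))
print([xb.shape[1] for xb, yb in batches])  # [4, 4, 2]
```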
# Test different initializations
class NeuralNetworkInit(NeuralNetwork):
    def __init__(self, input_size, hidden_size, learning_rate=0.01, init_method='xavier'):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        # Different initializations
        if init_method == 'zeros':
            self.W1 = np.zeros((hidden_size, input_size))
            self.W2 = np.zeros((1, hidden_size))
        elif init_method == 'small':
            self.W1 = np.random.randn(hidden_size, input_size) * 0.01
            self.W2 = np.random.randn(1, hidden_size) * 0.01
        elif init_method == 'large':
            self.W1 = np.random.randn(hidden_size, input_size) * 1.0
            self.W2 = np.random.randn(1, hidden_size) * 1.0
        elif init_method == 'xavier':
            self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(1.0 / input_size)
            self.W2 = np.random.randn(1, hidden_size) * np.sqrt(1.0 / hidden_size)
        elif init_method == 'he':
            self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
            self.W2 = np.random.randn(1, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.b2 = np.zeros((1, 1))
        self.cache = {}
        self.costs = []
# Test on circles dataset
init_methods = ['small', 'large', 'xavier', 'he']
results = {}
for method in init_methods:
    nn = NeuralNetworkInit(input_size=2, hidden_size=8,
                           learning_rate=0.1, init_method=method)
    nn.train(X_circles_T, Y_circles, num_iterations=1000)
    results[method] = nn.costs
# Plot comparison
plt.figure(figsize=(12, 7))
for method, costs in results.items():
    plt.plot(costs, linewidth=2, label=method.capitalize())
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('Weight Initialization Comparison', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nFinal Costs:")
for method, costs in results.items():
    print(f"  {method.capitalize()}: {costs[-1]:.4f}")
6.8 Gradient Checking¶
Verify the backpropagation implementation using numerical gradients: \(\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}\)
def gradient_check(nn, X, Y, epsilon=1e-7):
    """
    Check gradients using numerical approximation.
    """
    # Forward and backward
    nn.forward(X)
    gradients = nn.backward(Y)
    # Get parameters
    params = {'W1': nn.W1, 'b1': nn.b1, 'W2': nn.W2, 'b2': nn.b2}
    # Numerical gradients
    num_gradients = {}
    for param_name, param in params.items():
        num_grad = np.zeros_like(param)
        # Iterate through each element
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        count = 0
        while not it.finished and count < 10:  # Check only first 10 for speed
            idx = it.multi_index
            old_value = param[idx]
            # J(theta + epsilon)
            param[idx] = old_value + epsilon
            nn.forward(X)
            cost_plus = nn.compute_cost(Y)
            # J(theta - epsilon)
            param[idx] = old_value - epsilon
            nn.forward(X)
            cost_minus = nn.compute_cost(Y)
            # Numerical gradient
            num_grad[idx] = (cost_plus - cost_minus) / (2 * epsilon)
            # Restore
            param[idx] = old_value
            it.iternext()
            count += 1
        num_gradients['d' + param_name] = num_grad
    # Compare
    print("\nGradient Check Results:")
    for param_name in ['W1', 'b1', 'W2', 'b2']:
        grad_key = 'd' + param_name
        # Get non-zero elements for comparison
        mask = num_gradients[grad_key] != 0
        if np.any(mask):
            analytical = gradients[grad_key][mask]
            numerical = num_gradients[grad_key][mask]
            diff = np.linalg.norm(analytical - numerical) / \
                   (np.linalg.norm(analytical) + np.linalg.norm(numerical))
            status = "✓ PASS" if diff < 1e-5 else "✗ FAIL"
            print(f"  {param_name}: Difference = {diff:.2e} {status}")
# Test gradient checking
nn_check = NeuralNetwork(input_size=2, hidden_size=3, learning_rate=0.01)
X_small = X_xor[:, :2] # Use small dataset
Y_small = Y_xor[:, :2]
gradient_check(nn_check, X_small, Y_small)
Summary: Key Concepts from Lecture 11¶
1. Neural Network Fundamentals¶
Model = Architecture + Parameters
Architecture: How neurons and layers are arranged
Parameters: Weights \(W\) and biases \(b\) learned from data
Neuron = Linear + Activation: \(a = \sigma(Wx + b)\)
2. Deep Learning Enablers¶
“Deep learning is really computationally expensive and people had to find techniques to parallelize the code and use GPUs… The second part is data available has been growing after the Internet bubble… And finally algorithms - people have come up with new techniques.” - Andrew Ng
Three key factors:
Computation: GPUs, parallelization
Data: Large datasets from digitalization
Algorithms: New training techniques
3. From Logistic Regression to Deep Networks¶
Evolution:
1 neuron → Logistic regression
Multiple output neurons → Multi-label (sigmoid) or Multi-class (softmax)
Multiple layers → Deep neural network
Multi-class vs Multi-label:
Multi-label (sigmoid): Image can have cat AND lion → outputs independent
Multi-class (softmax): Image has ONLY cat OR lion → outputs sum to 1
“Think about health care… Usually there is no overlap between diseases. You want to classify a specific disease among a large number of diseases… You want the model to learn with all neurons learning together by creating interaction between them.” - Andrew Ng
4. Hierarchical Feature Learning¶
Why deep networks work:
Layer 1: Low-level features (edges)
Layer 2: Mid-level features (parts - ears, mouth)
Layer 3: High-level features (whole objects)
“The deeper you go, the more complex information the neurons are able to understand.” - Andrew Ng
Fully-connected layers:
Connect every neuron to every neuron in next layer
Let network discover features automatically
“End-to-end learning” - just input and output, no manual feature engineering
“We will let the network figure out what are the interesting features. And oftentimes, the network is going to be able better than the humans.” - Andrew Ng
5. Training Process¶
Three steps:
Initialize parameters (random values)
Optimize via gradient descent:
Forward propagation: \(X \rightarrow \hat{Y}\)
Compute loss: \(J(\hat{Y}, Y)\)
Backward propagation: Compute \(\frac{\partial J}{\partial W}, \frac{\partial J}{\partial b}\)
Update: \(W := W - \alpha \frac{\partial J}{\partial W}\)
Predict using optimized parameters
6. Implementation Tips¶
Vectorization and broadcasting:
Process entire batch at once (parallelization)
NumPy automatically broadcasts bias vectors
Parameters don’t scale with batch size
Backpropagation efficiency:
Start from output layer (closest to loss)
Use chain rule to reuse intermediate gradients
Store values during backward pass
7. Architecture Decisions¶
How to choose network size?
“Nobody knows the right answer, so you have to test it. We would try ten different architectures, train the network, look at validation set accuracy, and decide which one seems to be the best.” - Andrew Ng
Guidelines:
More complex problem → Deeper network
More data → Can support larger network
Start simple, increase complexity as needed
Use validation set to compare architectures
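This try-and-compare workflow can be sketched with scikit-learn's `MLPClassifier` (the candidate hidden-layer sizes below are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A small nonlinear dataset split into train and validation sets
X, y = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Train each candidate architecture and compare validation accuracy
scores = {}
for hidden in [(2,), (8,), (32,), (8, 8)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=42)
    clf.fit(X_tr, y_tr)
    scores[hidden] = clf.score(X_val, y_val)
    print(hidden, f"validation accuracy: {scores[hidden]:.2%}")

best = max(scores, key=scores.get)
print("Best architecture:", best)
```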
8. What’s Next?¶
“One thing that I’d like you to do is just think about the things that can be tweaked in a neural network. When you build a neural network, you are not done, you have to tweak the activations, you have to tweak the loss function. There’s many things you can tweak.” - Andrew Ng
Topics for Lecture 12:
Detailed backpropagation derivations
Activation function choices (ReLU, tanh, etc.)
Initialization strategies
Regularization techniques
Practical training tips
Key Equations Reference¶
Forward Propagation (3-layer network):
\(Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]})\)
\(Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})\)
\(Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}, \quad \hat{Y} = A^{[3]} = \sigma(Z^{[3]})\)
Cost Function: \(J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})\)
Gradient Descent: \(W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \quad b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}\)
Practice Exercises¶
Activation Comparison: Implement a network with different activations (sigmoid, tanh, ReLU). Compare convergence speed and final accuracy.
Deeper Network: Extend to 3-4 layers. Implement forward and backprop. Does it improve performance on MNIST?
Momentum: Add momentum to gradient descent: \(v_t = \beta v_{t-1} + (1-\beta) \nabla J\). Does training become faster/more stable?
Mini-batch: Implement mini-batch gradient descent instead of full-batch. Compare speed and convergence.
Batch Normalization: Add batch norm layer. Verify it helps with vanishing gradients in deep networks.
Vanishing Gradient: Build 10-layer network with sigmoid. Visualize gradient magnitudes per layer. Confirm vanishing gradient problem.
Dropout: Implement dropout regularization. Does it improve test accuracy?
Learning Rate Decay: Implement exponential decay: \(\alpha_t = \alpha_0 e^{-kt}\). Plot cost curves.
Multi-class: Extend to 10-class MNIST (all digits). Use softmax output. Report confusion matrix.
Feature Visualization: Use t-SNE to visualize hidden layer activations. Do classes cluster?
References¶
CS229 Lecture Notes: Andrew Ng’s notes on neural networks
“Gradient-Based Learning Applied to Document Recognition”: LeCun et al. (1998)
“Understanding the Difficulty of Training Deep Feedforward Neural Networks”: Glorot & Bengio (2010)
“Delving Deep into Rectifiers”: He et al. (2015) - He initialization
“Deep Learning”: Goodfellow et al., Chapters 6-8
“Neural Networks and Deep Learning”: Michael Nielsen (online book)