Phase 5: Neural Networks - Post-Quiz
Time: 15 minutes
Questions: 10
Passing Score: 70%
Purpose: Validate your learning after completing Phase 5
Question 1 (Medium)
What is the output of the sigmoid activation function when input = 0?
A) 0.0
B) 0.5 ✓
C) 1.0
D) Undefined
Explanation
Answer: B) 0.5
Sigmoid function: σ(x) = 1 / (1 + e^(-x))
When x = 0:
σ(0) = 1 / (1 + e^0)
     = 1 / (1 + 1)
     = 1 / 2
     = 0.5
This is the midpoint of the sigmoid curve, which ranges from 0 to 1.
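The arithmetic can be checked numerically. The following is a minimal sketch using only the standard library; the function name `sigmoid` and the sample inputs -10 and 10 are illustrative choices, not part of the quiz.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

mid = sigmoid(0.0)     # e^0 = 1, so this is 1 / (1 + 1)
low = sigmoid(-10.0)   # large negative input: close to 0
high = sigmoid(10.0)   # large positive input: close to 1
print(mid, low, high)
```

The values at -10 and 10 show the curve flattening toward its two limits on either side of the midpoint.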
Reference: Phase 6 - Activation Functions
Question 2 (Hard)
import numpy as np

# relu and sigmoid are assumed to be defined elsewhere
def forward_pass(X, W1, b1, W2, b2):
    Z1 = np.dot(X, W1) + b1
    A1 = relu(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return A2
In this 2-layer network, what does Z1 represent?
A) Activated output of first layer
B) Pre-activation output of first layer ✓
C) Input to the network
D) Final output
Explanation
Answer: B) Pre-activation output of first layer
Notation:
Z = Pre-activation (weighted sum + bias, before activation function)
A = Activation (after applying activation function)
So the sequence is:
Z1 = XW1 + b1 → Pre-activation
A1 = ReLU(Z1) → Activation
Z2 = A1W2 + b2 → Pre-activation
A2 = Sigmoid(Z2) → Final output
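To make the Z/A distinction concrete, here is a small sketch that runs the same four steps on random data. The layer sizes (3 inputs, 5 hidden units, 1 output), the 0.1 weight scale, and the helper definitions of `relu` and `sigmoid` are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 samples, 3 features (assumed sizes)
W1 = rng.normal(size=(3, 5)) * 0.1   # 3 inputs -> 5 hidden units
b1 = np.zeros((1, 5))
W2 = rng.normal(size=(5, 1)) * 0.1   # 5 hidden units -> 1 output
b2 = np.zeros((1, 1))

Z1 = X @ W1 + b1    # pre-activation, shape (4, 5), may contain negatives
A1 = relu(Z1)       # activation, shape (4, 5), never negative
Z2 = A1 @ W2 + b2   # pre-activation, shape (4, 1)
A2 = sigmoid(Z2)    # final output, shape (4, 1), strictly between 0 and 1
print(Z1.shape, A1.min(), A2.ravel())
```

Each Z is a plain affine transform of the previous layer's output; each A is what the next layer actually receives.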
Reference: Phase 5 - Forward Propagation
Question 3 (Medium)
Why is the ReLU activation function preferred over sigmoid in hidden layers?
A) It's easier to compute
B) It mitigates the vanishing gradient problem ✓
C) It always outputs positive values
D) It's more accurate
Explanation
Answer: B) It mitigates the vanishing gradient problem
ReLU advantages:
Gradient is exactly 0 or 1, so it does not shrink as it passes through active units the way sigmoid's does
Cheaper to compute (no exponential), which speeds up training
Helps mitigate vanishing gradients in deep networks
ReLU: f(x) = max(0, x)
Gradient: f'(x) = 1 if x > 0, else 0
Sigmoid problems:
Gradient is at most 0.25 and saturates toward 0 for large |x|
Causes vanishing gradients in deep networks
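A rough way to see the difference is to multiply the activation derivatives along a chain of layers. The sketch below is a simplification under stated assumptions: it ignores the weight matrices (which also scale the gradient), uses a hypothetical depth of 10, and puts every sigmoid unit at z = 0, which is the best case for sigmoid.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)             # peaks at 0.25 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for positive inputs, 0 otherwise

depth = 10
sigmoid_chain = np.prod(sigmoid_grad(np.zeros(depth)))   # 0.25 ** 10
relu_chain = np.prod(relu_grad(np.ones(depth)))          # active units: 1 ** 10
saturated = sigmoid_grad(np.array([8.0]))[0]             # far from 0: tiny gradient
print(sigmoid_chain, relu_chain, saturated)
```

Even in sigmoid's best case the product shrinks by a factor of four per layer, while a path of active ReLU units passes the gradient through unchanged. A ReLU unit with a non-positive input contributes 0, so ReLU mitigates the problem rather than eliminating every gradient issue.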
Reference: Phase 6 - Activation Functions
Question 4 (Hard)
What is the derivative of the ReLU function at x = 0?
A) 0
B) 1
C) 0.5
D) Technically undefined, but set to 0 in practice ✓
Explanation
Answer: D) Technically undefined, but set to 0 in practice
ReLU: f(x) = max(0, x)
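Derivative:
f'(x) = 1 if x > 0
f'(x) = 0 if x < 0
f'(x) = undefined at x = 0 (ReLU is continuous there, but it has a kink: the left-hand slope is 0 and the right-hand slope is 1)
In practice: We set f'(0) = 0 by convention. Any value between 0 and 1 is a valid subgradient, and the choice rarely matters in gradient descent because inputs are almost never exactly 0.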
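The two one-sided slopes at 0 can be estimated numerically, alongside a derivative that follows the "set it to 0" convention. This is a minimal sketch; the step size `h` and the function names are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Strict inequality: the gradient at exactly x == 0 is taken to be 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
h = 1e-6
left_slope = (relu(0.0) - relu(-h)) / h    # slope approaching 0 from the left
right_slope = (relu(h) - relu(0.0)) / h    # slope approaching 0 from the right
print(relu_grad(x), left_slope, right_slope)
```

Because the left and right slopes disagree, no single derivative exists at 0, and the code simply picks one side.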
Reference: Phase 6 - Activation Function Derivatives
Question 5 (Medium)
In gradient descent, weights are updated using:
A) W = W + learning_rate * gradient
B) W = W - learning_rate * gradient ✓
C) W = W * learning_rate * gradient
D) W = W / learning_rate * gradient
Explanation
Answer: B) W = W - learning_rate * gradient
Gradient Descent Update Rule:
W_new = W_old - α * ∂L/∂W
Where:
α = learning rate
∂L/∂W = gradient of loss w.r.t. weight
Why subtract? The gradient points in the direction of steepest increase in loss, so we step in the opposite direction to decrease it.
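The update rule can be tried on a one-parameter problem. The sketch below assumes a toy loss L(w) = (w - 3)^2 with its minimum at w = 3, a learning rate of 0.1, and 50 steps; none of these values come from the course material.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0
history = [loss(w)]
for _ in range(50):
    w = w - learning_rate * grad(w)   # step against the gradient
    history.append(loss(w))

# Option A from the question: adding the gradient moves uphill instead.
w_wrong = 0.0 + learning_rate * grad(0.0)
print(w, history[-1], loss(w_wrong))
```

With subtraction the loss falls at every step and `w` approaches 3; with addition a single step already increases the loss.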
Reference: Phase 5 - Gradient Descent
Question 6 (Hard)
def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1/m) * np.dot(A_prev.T, dZ)
    db = (1/m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev
What does dW represent?
A) Change in weights
B) Gradient of loss with respect to weights ✓
C) New weight values
D) Weight updates after learning rate
Explanation
Answer: B) Gradient of loss with respect to weights
Backpropagation calculates:
dW = ∂L/∂W (gradient of loss w.r.t. weights)
db = ∂L/∂b (gradient of loss w.r.t. biases)
dA_prev = ∂L/∂A_prev (gradient to pass to previous layer)
The actual weight update is:
W = W - learning_rate * dW
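One way to convince yourself that dW really is a gradient is a finite-difference check. The sketch below reuses `backprop_step` from the question and assumes a single linear layer with a half mean-squared-error loss; `loss_fn`, the array sizes, and the choice of weight entry are hypothetical and exist only for this check.

```python
import numpy as np

def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1 / m) * np.dot(A_prev.T, dZ)
    db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev

def loss_fn(A_prev, W, b, Y):
    # Assumed loss for this check: mean over samples of half squared error.
    Z = A_prev @ W + b
    return 0.5 * np.sum((Z - Y) ** 2) / A_prev.shape[0]

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(6, 4))
W = rng.normal(size=(4, 2))
b = np.zeros((1, 2))
Y = rng.normal(size=(6, 2))

dZ = (A_prev @ W + b) - Y   # per-sample error; the 1/m is applied inside backprop_step
dW, db, dA_prev = backprop_step(dZ, A_prev, W)

# Central finite difference for a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 0] += eps
W_minus[1, 0] -= eps
numeric = (loss_fn(A_prev, W_plus, b, Y) - loss_fn(A_prev, W_minus, b, Y)) / (2 * eps)
print(dW[1, 0], numeric)
```

The analytic entry `dW[1, 0]` and the numeric slope should agree closely, which is what "gradient of loss with respect to weights" means in practice.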
Reference: Phase 5 - Backpropagation Implementation
Question 7 (Medium)
What is the purpose of dividing by m (batch size) in gradient calculation?
A) To speed up computation
B) To average gradients across the batch ✓
C) To normalize weights
D) To prevent overflow
Explanation
Answer: B) To average gradients across the batch
When training on a batch of m examples, we calculate the average gradient across all examples:
dW = (1/m) * Σ(gradient_for_each_example)
Why average?
Keeps the gradient's magnitude roughly independent of batch size
Lets the same learning rate behave consistently across batch sizes
Averaging over more examples gives a lower-variance gradient estimate (more stable training)
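The effect of the 1/m factor can be seen by duplicating a batch. In this sketch the arrays are random placeholders with assumed sizes (8 examples, 3 inputs, 2 outputs); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(8, 3))   # 8 examples, 3 features (assumed sizes)
dZ = rng.normal(size=(8, 2))

m = A_prev.shape[0]
dW_avg = (1 / m) * (A_prev.T @ dZ)

# Same quantity, written as an explicit mean of per-example gradients.
per_example = [np.outer(A_prev[i], dZ[i]) for i in range(m)]
dW_mean = np.mean(per_example, axis=0)

# Duplicate the batch: the summed gradient doubles, the average does not change.
A_dup = np.vstack([A_prev, A_prev])
dZ_dup = np.vstack([dZ, dZ])
dW_sum_small = A_prev.T @ dZ
dW_sum_big = A_dup.T @ dZ_dup
dW_avg_big = (1 / A_dup.shape[0]) * (A_dup.T @ dZ_dup)
print(np.abs(dW_sum_big).max(), np.abs(dW_avg_big).max())
```

Without the division, doubling the batch would double the step size for a fixed learning rate.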
Reference: Phase 5 - Mini-Batch Gradient Descent
Question 8 (Hard)
Which initialization strategy is best for deep networks with ReLU?
A) All zeros
B) All ones
C) Random small values from N(0, 0.01)
D) He initialization ✓
Explanation
Answer: D) He initialization
He Initialization:
W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
Why it works:
Designed for ReLU activation
Keeps the variance of activations roughly constant from layer to layer
Helps prevent vanishing/exploding gradients
Wrong answers:
All zeros: Neurons learn same features (symmetry problem)
All ones: Same symmetry problem, and the large identical weights can also make pre-activations grow layer by layer
N(0, 0.01): Too small for deep networks (vanishing gradients)
For sigmoid/tanh: Use Xavier initialization instead
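The difference between He initialization and small fixed-scale weights shows up when a signal is pushed through many ReLU layers. The following is a rough experiment, not a benchmark: the depth (20), width (256), batch of 512 random inputs, omitted biases, and the function name `final_activation_std` are all assumptions made for illustration, and "small" here means a standard deviation of 0.01.

```python
import numpy as np

def final_activation_std(init, n_layers=20, width=256, seed=0):
    """Push random inputs through ReLU layers and return the output spread."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))
    for _ in range(n_layers):
        if init == "he":
            W = rng.normal(size=(width, width)) * np.sqrt(2 / width)
        else:  # "small": fixed standard deviation of 0.01
            W = rng.normal(size=(width, width)) * 0.01
        a = np.maximum(0, a @ W)   # ReLU layer, biases omitted
    return a.std()

he_std = final_activation_std("he")
small_std = final_activation_std("small")
print(he_std, small_std)
```

With He scaling the activations stay on a usable scale after 20 layers, while the small fixed scale drives them toward zero, which in turn starves the backward pass of gradient signal.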
Reference: Phase 5 - Weight Initialization
Question 9 (Medium)
What is "one epoch" in neural network training?
A) One forward pass
B) One backward pass
C) One complete pass through the entire training dataset ✓
D) One weight update
Explanation
Answer: C) One complete pass through the entire training dataset
Training terminology:
Iteration: One forward + backward pass on one batch
Epoch: Complete pass through all training data
Batch: Subset of training data processed together
Example:
1000 training samples, batch size = 100
1 epoch = 10 iterations (1000/100)
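The same arithmetic in code, including the case where the batch size does not divide the dataset evenly. This sketch assumes the final partial batch is kept (some data loaders drop it instead); the function name is hypothetical.

```python
import math

def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    # Round up so a final, smaller batch still counts as one iteration.
    return math.ceil(n_samples / batch_size)

even = iterations_per_epoch(1000, 100)    # the example above: 10
uneven = iterations_per_epoch(1000, 64)   # 15 full batches + 1 partial = 16
total_updates = 5 * even                  # 5 epochs -> 50 weight updates
print(even, uneven, total_updates)
```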
Reference: Phase 5 - Training Process
Question 10 (Hard)
# Training loop
for epoch in range(100):
    # Forward pass
    A = forward_pass(X, W1, b1, W2, b2)
    loss = compute_loss(A, Y)
    # Backward pass
    dW1, db1, dW2, db2 = backprop(X, Y, A, W1, b1, W2, b2)
    # Update weights
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
This implements which optimization algorithm?
A) Stochastic Gradient Descent (SGD)
B) Batch Gradient Descent ✓
C) Mini-Batch Gradient Descent
D) Adam
Explanation
Answer: B) Batch Gradient Descent
Clue: forward_pass(X, ...) processes the entire dataset X at once, so each epoch performs exactly one weight update.
Gradient Descent variants:
Batch GD:
Uses entire dataset for each update
Exact gradient of the full training loss, but each update is slow on large datasets
What's shown in the code
Stochastic GD (SGD):
Uses one example at a time
Fast but noisy updates
Mini-Batch GD:
Uses small batches (e.g., 32, 64, 128)
Good balance: fast + stable
Most commonly used in practice
Adam:
Adaptive learning rates
Requires running estimates of the gradient's first and second moments (not shown in the code)
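Batch, mini-batch, and stochastic gradient descent differ only in how many examples feed each update, so one batching helper can express all three. The sketch below is an illustration under assumptions: `iterate_batches` is a hypothetical helper, the data is random, and Adam is left out because it changes the update rule rather than the batching.

```python
import numpy as np

def iterate_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs covering the data once."""
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = rng.normal(size=(1000, 1))

# Number of weight updates each variant would make in one epoch.
updates_per_epoch = {
    "batch": sum(1 for _ in iterate_batches(X, Y, X.shape[0], rng)),
    "mini-batch": sum(1 for _ in iterate_batches(X, Y, 100, rng)),
    "stochastic": sum(1 for _ in iterate_batches(X, Y, 1, rng)),
}
print(updates_per_epoch)
```

In a training loop, the forward pass, backward pass, and weight update would sit inside the loop over batches, so the quiz's code corresponds to the "batch" row: one update per epoch.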
Reference: Phase 5 - Optimization Algorithms
Scoring Guide
0-5 correct (0-50%): Review Phase 5 content more carefully. Focus on:
Forward and backward propagation
Activation functions and their derivatives
Weight initialization and updates
6-7 correct (60-70%): Good progress! Review the questions you missed and practice implementing a neural network from scratch.
8-9 correct (80-90%): Excellent! You have solid understanding. Practice on real datasets to reinforce concepts.
10 correct (100%): Outstanding! You've mastered neural network fundamentals. Ready for advanced topics (CNNs, RNNs, Transformers).
Compare Your Scores
Pre-Quiz Score: ___ / 10
Post-Quiz Score: ___ / 10
Improvement: +___ points
Typical improvement: 3-5 points
Strong improvement: 6+ points
Next Steps
Score < 70%:
Re-watch Phase 5 videos
Redo the assignment
Work through challenges
Implement a neural network from scratch
Score ≥ 70%:
✓ Move to Phase 6 (Advanced topics)
Try the bonus challenges
Build a project using neural networks
Score = 100%:
Mentor others in the community
Contribute to the repository
Explore research papers on neural architectures
Congratulations on completing Phase 5!