Neural Networks: From Basics to Transformers
What is a Neural Network?
A neural network is a computational model inspired by the human brain that learns to perform tasks by considering examples, without being programmed with task-specific rules.
The Core Idea
Instead of writing explicit rules:
# Traditional programming
if word in positive_words:
sentiment = "positive"
else:
sentiment = "negative"
Neural networks learn patterns from data:
# Machine learning approach
model = train_on_examples(training_data)
sentiment = model.predict(new_text)
Why Neural Networks?
Problems they solve:
Recognize patterns in complex data
Handle non-linear relationships
Adapt to new patterns automatically
Scale to large datasets
Transfer knowledge between tasks
Real-world applications:
Image recognition (faces, objects, medical scans)
Natural language (translation, summarization, chatbots)
Speech recognition and synthesis
Recommendation systems
Game playing (Chess, Go, video games)
Drug discovery and protein folding
The Biological Inspiration
Human Neurons
In your brain:
Dendrites receive signals from other neurons
Cell body processes these signals
Axon sends output to other neurons if threshold is reached
Synapses connect neurons with varying strengths
Artificial Neurons (Perceptrons)
A mathematical approximation:
Inputs (x₁, x₂, …, xₙ) come from the previous layer
Weights (w₁, w₂, …, wₙ) represent connection strength
Bias (b) represents the neuron's threshold
Activation function determines the output
inputs
  │
[x₁]──w₁──┐
[x₂]──w₂──┤
[x₃]──w₃──┼──→ Σ(wᵢxᵢ + b) ──→ activation ──→ output
 ...      │
[xₙ]──wₙ──┘
Mathematical Formula
output = f(Σ(wᵢ × xᵢ) + b)
Where:
- xᵢ = input values
- wᵢ = weights (learnable parameters)
- b = bias (learnable parameter)
- f = activation function
- Σ = sum
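The formula can be checked in a few lines of NumPy. The inputs, weights, and bias below are made-up illustrative values, with ReLU as the activation:

```python
import numpy as np

# Made-up inputs, weights, and bias for a single neuron
x = np.array([0.5, -1.0, 2.0])   # inputs x_i
w = np.array([0.8, 0.2, -0.5])   # weights w_i
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum: Σ(w_i × x_i) + b
output = max(0.0, z)             # ReLU as the activation function f

print(z, output)                 # z ≈ -0.7, so the ReLU output is 0.0
```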
Components of Neural Networks
1. Layers
Input Layer:
Receives raw data
One neuron per feature
Example: a 28×28 image = 784 input neurons
Hidden Layers:
Process and transform data
Multiple layers = “deep” learning
Each layer learns increasingly abstract features
Output Layer:
Produces final prediction
Size depends on task:
Binary classification: 1 neuron
Multi-class (10 classes): 10 neurons
Regression: 1 neuron
Input → [Hidden Layer 1] → [Hidden Layer 2] → Output
 784      [128 neurons]      [64 neurons]      10
2. Weights and Biases
Weights: The “knowledge” of the network
Initially random
Updated during training
Determine strength of connections
Biases: Offset values
One per neuron
Allow neurons to activate even with zero input
Help model fit data better
Total parameters:
Layer 1: (784 × 128) + 128 = 100,480 parameters
Layer 2: (128 × 64) + 64 = 8,256 parameters
Output: (64 × 10) + 10 = 650 parameters
Total: 109,386 learnable parameters
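These totals can be reproduced with a small helper; a sketch, assuming the 784 → 128 → 64 → 10 layout used throughout this section:

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + one bias per neuron
    return total

print(count_parameters([784, 128, 64, 10]))  # 109386
```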
3. Architecture
Fully Connected (Dense) Layers:
Every neuron connects to all neurons in next layer
Most common in basic networks
Specialized Layers:
Convolutional (Conv): For images, spatial patterns
Recurrent (RNN, LSTM, GRU): For sequences, temporal patterns
Attention: For focusing on relevant information
Dropout: For regularization (randomly disable neurons)
Batch Normalization: For training stability
Forward Propagation
Forward propagation is how data flows through the network to produce predictions.
Step-by-Step Process
1. Input Layer → Hidden Layer 1
# For each neuron in hidden layer 1
z1 = W1 @ x + b1 # Linear transformation (matrix multiplication)
a1 = activation(z1) # Apply activation function
2. Hidden Layer 1 → Hidden Layer 2
z2 = W2 @ a1 + b2
a2 = activation(z2)
3. Hidden Layer 2 → Output
z3 = W3 @ a2 + b3
output = softmax(z3) # For classification
Example: 2-Layer Network
import numpy as np
# Input: 4 features
x = np.array([1.0, 0.5, 0.2, 0.9])
# Layer 1: 4 β 3 neurons
W1 = np.random.randn(3, 4) # Shape: (3, 4)
b1 = np.random.randn(3)
z1 = W1 @ x + b1
a1 = np.maximum(0, z1) # ReLU activation
# Layer 2: 3 β 2 neurons (output)
W2 = np.random.randn(2, 3)
b2 = np.random.randn(2)
z2 = W2 @ a1 + b2
# a2 = softmax(z2) for classification
print(f"Input shape: {x.shape}")
print(f"Hidden activation shape: {a1.shape}")
print(f"Output shape: {z2.shape}")
Activation Functions
Activation functions introduce non-linearity, allowing networks to learn complex patterns.
Why Non-linearity?
Without activation functions, multiple layers collapse into one:
W2 @ (W1 @ x + b1) + b2 = (W2 @ W1) @ x + (W2 @ b1 + b2)
                        = W_combined @ x + b_combined
# This is just a single linear layer!
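This collapse is easy to demonstrate numerically; the matrices below are random stand-ins, not values from any trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

# Two linear layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...equal one combined linear layer
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```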
Common Activation Functions
1. ReLU (Rectified Linear Unit) ⭐
Formula: f(x) = max(0, x)
f(x) = { x  if x > 0
       { 0  if x ≤ 0
Graph:
  │
  │    ╱
  │   ╱
  │  ╱
──┼──────
  │
Pros:
✅ Computationally efficient
✅ Helps with the vanishing gradient problem
✅ Sparse activation (many zeros)
Cons:
❌ “Dying ReLU”: neurons can get stuck at 0
Usage: Hidden layers in most modern networks
2. Sigmoid
Formula: f(x) = 1 / (1 + e^(-x))
Output range: (0, 1)
Graph:
1.0 ┤      ╱───
    │     ╱
0.5 ┤    ╱
    │   ╱
0.0 ┼──╱
    └──────────
Pros:
✅ Output interpretable as probability
✅ Smooth gradient
Cons:
❌ Vanishing gradients for large |x|
❌ Outputs not zero-centered
Usage: Binary classification output, gates in LSTM
3. Tanh (Hyperbolic Tangent)
Formula: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Output range: (-1, 1)
Graph:
 1.0 ┤      ╱───
     │     ╱
 0.0 ┼    ╱
     │   ╱
-1.0 ┤──╱
     └──────────
Pros:
✅ Zero-centered (better than sigmoid)
✅ Stronger gradients than sigmoid
Cons:
❌ Still suffers from vanishing gradients
Usage: RNN/LSTM hidden states
4. Softmax
Formula: f(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
Converts vector to probability distribution:
Input:  [2.0, 1.0, 0.5]
Output: [0.629, 0.231, 0.140]  # Sums to 1.0
Usage: Multi-class classification output layer
5. GELU (Gaussian Error Linear Unit)
Formula: f(x) = x * Φ(x), where Φ is the Gaussian CDF
Usage: Modern transformers (GPT, BERT)
Why better than ReLU:
Smooth, differentiable everywhere
Better gradient flow
Used in state-of-the-art models
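For reference, all five activations can be sketched in NumPy. The GELU here uses the common tanh approximation rather than the exact Gaussian CDF, and the softmax subtracts the max before exponentiating, a standard numerical-stability trick:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

def gelu(x):
    # tanh approximation of x * Phi(x), common in GPT/BERT implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([2.0, 1.0, 0.5])
print(softmax(x))                   # ≈ [0.629, 0.231, 0.140], sums to 1.0
print(relu(np.array([-2.0, 3.0])))  # [0. 3.]
```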
Loss Functions
Loss functions measure how wrong the model's predictions are.
Regression Tasks
Mean Squared Error (MSE)
L = (1/n) Σ (ŷᵢ - yᵢ)²
Where:
- ŷᵢ = predicted value
- yᵢ = actual value
- n = number of samples
Usage: Continuous value prediction (house prices, temperatures)
Mean Absolute Error (MAE)
L = (1/n) Σ |ŷᵢ - yᵢ|
Benefit: Less sensitive to outliers than MSE
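A quick check of both regression losses on made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 4.0])   # made-up predictions

mse = np.mean((y_pred - y_true) ** 2)    # (0.25 + 0 + 4) / 3
mae = np.mean(np.abs(y_pred - y_true))   # (0.5 + 0 + 2) / 3

print(mse, mae)  # ≈ 1.417, ≈ 0.833
```

Note how the single large error (2.0) contributes 4 of the 4.25 summed MSE terms but only 2 of the 2.5 summed MAE terms: squaring amplifies outliers.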
Classification Tasks
Binary Cross-Entropy
L = -(1/n) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
Usage: Binary classification (spam/not spam)
Categorical Cross-Entropy
L = -(1/n) Σᵢ Σⱼ yᵢⱼ log(ŷᵢⱼ)
Where:
- i = sample index
- j = class index
- y = one-hot encoded true labels
Usage: Multi-class classification (digit recognition, sentiment analysis)
Example:
# True label: class 2 (one-hot: [0, 0, 1, 0, 0])
y_true = [0, 0, 1, 0, 0]
# Predictions (probabilities)
y_pred = [0.1, 0.2, 0.5, 0.1, 0.1]
# Loss focuses on predicted probability for true class
loss = -log(0.5) ≈ 0.693
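The worked example can be verified directly; with one-hot labels, the double sum reduces to the negative log of the probability assigned to the true class:

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 0])            # one-hot: true class is index 2
y_pred = np.array([0.1, 0.2, 0.5, 0.1, 0.1])  # predicted probabilities

# Only the true-class term survives the sum: -log(y_pred[2])
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # ≈ 0.693
```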
Backward Propagation
Backpropagation is how the network learns by computing gradients and updating weights.
The Core Idea
Goal: Minimize loss function by adjusting weights
Method: Use calculus chain rule to compute how much each weight contributes to the error
Chain Rule
∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w
Where:
- L = loss
- w = weight
- z = pre-activation (w @ x + b)
- a = post-activation f(z)
Backward Pass
Step 1: Compute output gradient
# For classification with softmax + cross-entropy
d_output = predictions - true_labels
Step 2: Propagate through layer 2
d_W2 = d_output @ a1.T
d_b2 = d_output
d_a1 = W2.T @ d_output
Step 3: Apply activation gradient
# For ReLU: gradient is 1 if input > 0, else 0
d_z1 = d_a1 * (z1 > 0)
Step 4: Propagate through layer 1
d_W1 = d_z1 @ x.T
d_b1 = d_z1
Update Weights
# Simple gradient descent
learning_rate = 0.01
W1 -= learning_rate * d_W1
b1 -= learning_rate * d_b1
W2 -= learning_rate * d_W2
b2 -= learning_rate * d_b2
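The steps above can be stitched into a runnable sketch, reusing the 2-layer network from the forward-propagation example. Since the example uses single 1-D vectors rather than batches, np.outer produces the rank-1 weight gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, 0.2, 0.9])   # same input as the forward-pass example
y = np.array([1.0, 0.0])             # one-hot target (2 classes, true class 0)

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)           # ReLU
    z2 = W2 @ a1 + b2
    e = np.exp(z2 - z2.max())
    return z1, a1, e / e.sum()       # softmax probabilities

# Forward pass
z1, a1, probs = forward(W1, b1, W2, b2)
loss_before = -np.log(probs[0])      # cross-entropy for true class 0

# Backward pass (softmax + cross-entropy output gradient)
d_output = probs - y
d_W2 = np.outer(d_output, a1)        # rank-1 gradient for a single sample
d_b2 = d_output
d_a1 = W2.T @ d_output
d_z1 = d_a1 * (z1 > 0)               # ReLU gradient: 1 where z1 > 0, else 0
d_W1 = np.outer(d_z1, x)
d_b1 = d_z1

# Gradient descent update
lr = 0.01
W1 -= lr * d_W1; b1 -= lr * d_b1
W2 -= lr * d_W2; b2 -= lr * d_b2

# Recompute the forward pass: the loss should have decreased
_, _, probs_after = forward(W1, b1, W2, b2)
loss_after = -np.log(probs_after[0])
print(loss_before, loss_after)
```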
Optimization
Optimization algorithms update network weights to minimize loss.
Gradient Descent Variants
1. Stochastic Gradient Descent (SGD)
# Update after each sample
for x, y in dataset:
    loss = compute_loss(model(x), y)
    gradients = compute_gradients(loss)
    weights -= learning_rate * gradients
Pros: Fast updates, can escape local minima
Cons: Noisy updates, slow convergence
2. Mini-Batch Gradient Descent
# Update after a batch of samples
for batch_x, batch_y in dataloader:
    loss = compute_loss(model(batch_x), batch_y)
    gradients = compute_gradients(loss)
    weights -= learning_rate * gradients
Common batch sizes: 32, 64, 128, 256
Pros: Balance between speed and stability
3. SGD with Momentum
velocity = 0
for batch in dataset:
    gradients = compute_gradients(batch)
    velocity = momentum * velocity - learning_rate * gradients
    weights += velocity
Benefit: Accelerates convergence, dampens oscillations
Modern Optimizers
Adam (Adaptive Moment Estimation) ⭐
Most popular optimizer for deep learning:
# Combines momentum and adaptive learning rates
m = 0  # First moment (running mean of gradients)
v = 0  # Second moment (running mean of squared gradients)
t = 0  # Step counter, used for bias correction
for batch in dataset:
    t += 1
    gradients = compute_gradients(batch)
    m = beta1 * m + (1 - beta1) * gradients
    v = beta2 * v + (1 - beta2) * gradients**2
    m_hat = m / (1 - beta1**t)  # Bias correction
    v_hat = v / (1 - beta2**t)
    weights -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)
Default hyperparameters:
learning_rate = 0.001
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
Benefits:
✅ Adaptive learning rates per parameter
✅ Works well with minimal tuning
✅ Efficient for large datasets
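A minimal runnable version of these update equations, applied to a toy problem of minimizing (w - 3)², whose gradient is 2(w - 3):

```python
import numpy as np

# Adam hyperparameters: the defaults listed above, with a larger toy learning rate
learning_rate, beta1, beta2, epsilon = 0.1, 0.9, 0.999, 1e-8

w = 0.0          # parameter, starting far from the minimum at w = 3
m = v = 0.0      # first and second moment estimates

for t in range(1, 501):                  # t starts at 1 for bias correction
    grad = 2 * (w - 3)                   # gradient of the loss (w - 3)^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)           # bias-corrected first moment
    v_hat = v / (1 - beta2**t)           # bias-corrected second moment
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

print(w)  # ends close to the minimum at w = 3
```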
Others
AdamW: Adam with better weight decay
RMSprop: Good for RNNs
AdaGrad: Adapts learning rate based on parameter frequency
Training Process
Complete Training Loop
import torch
import torch.nn as nn
import torch.optim as optim
# 1. Define model (NeuralNetwork is a user-defined nn.Module subclass)
model = NeuralNetwork()
# 2. Define loss function
criterion = nn.CrossEntropyLoss()
# 3. Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# 4. Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        # Forward pass
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)
        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update weights
    # Validation (evaluate() is assumed to be defined elsewhere)
    val_loss = evaluate(model, val_loader)
    print(f"Epoch {epoch}: Train Loss={loss.item():.4f}, Val Loss={val_loss:.4f}")
Training Best Practices
1. Train/Validation/Test Split
Dataset → 70% Training   (optimize weights)
        → 15% Validation (tune hyperparameters)
        → 15% Test       (final evaluation)
2. Normalization
# Normalize each feature to zero mean, unit variance
X = (X - X.mean(axis=0)) / X.std(axis=0)
Why: Helps optimization converge faster
3. Weight Initialization
# Xavier/Glorot initialization for tanh
W = np.random.randn(n_in, n_out) * np.sqrt(1 / n_in)
# He initialization for ReLU
W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
4. Learning Rate Scheduling
# Reduce learning rate when validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
scheduler.step(val_loss)
5. Early Stopping
# Stop training when validation loss stops improving
best_val_loss = float('inf')
patience_counter = 0
patience = 10  # epochs to wait for improvement

for epoch in range(num_epochs):
    val_loss = evaluate(model, val_loader)  # after training for one epoch
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break
6. Regularization
Dropout:
# Randomly disable neurons during training
layer = nn.Linear(128, 64)
dropout = nn.Dropout(p=0.5)  # Randomly zero 50% of activations during training
Weight Decay (L2 regularization):
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
Common Architectures
1. Feedforward Neural Network (FNN)
Input → Dense → ReLU → Dense → ReLU → Dense → Output
Use cases: Tabular data, simple classification
2. Convolutional Neural Network (CNN)
Image → Conv → ReLU → Pool → Conv → ReLU → Pool → Flatten → Dense → Output
Use cases: Image classification, object detection, computer vision
Key innovation: Learns spatial hierarchies (edges → shapes → objects)
3. Recurrent Neural Network (RNN)
Word₁ → [RNN] → hidden state → Word₂ → [RNN] → ...
          ↓                              ↓
       Output₁                        Output₂
Use cases: Time series, text, sequential data
Problem: Vanishing gradients on long sequences
4. LSTM (Long Short-Term Memory)
Improved RNN with gates to control information flow:
Forget gate: What to forget from memory
Input gate: What new information to add
Output gate: What to output
Use cases: Machine translation, speech recognition
From RNNs to Transformers
The Sequential Processing Problem
RNNs must process one token at a time:
"The cat sat on the mat"
Step 1: Process "The" → hidden_state₁
Step 2: Process "cat" → hidden_state₂ (depends on step 1)
Step 3: Process "sat" → hidden_state₃ (depends on step 2)
...
❌ Cannot parallelize
❌ Long-range dependencies fade
❌ Slow training
The Attention Revolution
Key insight: What if we could look at ALL words simultaneously?
"The cat sat on the mat"
For predicting next word:
- Look at all positions at once
- Compute relevance scores (attention weights)
- Focus more on important words
- Process in parallel
✅ Parallelizable
✅ Long-range dependencies preserved
✅ Fast training
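The "look at all positions at once" idea is what scaled dot-product attention computes. A minimal self-attention sketch; the random vectors and sizes (6 tokens, dimension 4) are arbitrary stand-ins for learned token representations:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # relevance of every position to every other
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))    # 6 tokens ("The cat sat on the mat"), dim 4
out, attn = attention(X, X, X)     # self-attention: Q = K = V = X

print(out.shape)         # (6, 4): every position attends to all 6 positions at once
print(attn.sum(axis=1))  # each row of attention weights sums to 1
```

Unlike the RNN loop above, nothing here is sequential: every row of the score matrix is computed in one matrix multiplication.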
Transformer Benefits
Parallel Processing: All positions computed simultaneously
Long Context: No vanishing gradients over distance
Flexibility: Same architecture for many tasks
Scalability: Can train on massive datasets
Transfer Learning: Pre-train once, fine-tune for many tasks
This is why transformers have become the dominant architecture for:
Natural Language Processing (GPT, BERT, T5)
Computer Vision (Vision Transformer)
Multi-modal models (CLIP, GPT-4)
Audio (Whisper)
Code generation (Codex, GitHub Copilot)
Next Steps
Now that you understand neural network basics, proceed to:
attention_explained.md - Deep dive into the attention mechanism
transformer_architecture.md - Complete transformer breakdown
Run the Python examples - Hands-on implementation
The journey continues!