Transformer Architecture: Complete GuideΒΆ
Table of ContentsΒΆ
OverviewΒΆ
The Transformer, introduced in βAttention Is All You Needβ (2017), revolutionized AI by replacing recurrent layers with attention mechanisms.
Key InnovationΒΆ
Before (RNN/LSTM):
Sequential processing β Slow training, gradient issues
After (Transformer):
Parallel attention β Fast training, better performance
Core ComponentsΒΆ
Multi-Head Self-Attention: Focus on relevant parts
Position-wise Feed-Forward: Process each position
Positional Encoding: Add sequence order information
Layer Normalization: Stabilize training
Residual Connections: Help gradient flow
The Big PictureΒΆ
Original Transformer (Encoder-Decoder)ΒΆ
Input Sequence β [ENCODER] β Context β [DECODER] β Output Sequence
"Hello" β β "Bonjour"
6x layers 6x layers
β β
Representation Generation
Architecture DiagramΒΆ
OUTPUT
β
[Linear + Softmax]
β
[Decoder Stack]
(N x Decoder)
/ β \
[Multi-Head [Feed [Add &
Attention] Forward] Norm]
β β β
[Encoder Stack Output] |
β |
[Encoder Stack] |
(N x Encoder) |
β |
[Input Embedding + Positional Encoding]
β
INPUT
Input ProcessingΒΆ
1. Token EmbeddingΒΆ
Convert tokens to dense vectors:
# Input: Token IDs
tokens = [2054, 2003, 1996] # "what is the"
# Embedding layer
embedding = nn.Embedding(vocab_size=30000, embedding_dim=512)
embedded = embedding(tokens)
# Shape: (seq_len, d_model) = (3, 512)
2. Positional EncodingΒΆ
Problem: Attention has no notion of order!
"cat sat on mat" == "mat on sat cat" (to attention mechanism)
Solution: Add position information to embeddings.
Formula (Sinusoidal Encoding):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension index (0, 1, 2, ..., d_model/2)
- d_model = embedding dimension (512 in original)
Implementation:
import torch
import math
def positional_encoding(seq_len, d_model):
"""Generate positional encodings"""
pe = torch.zeros(seq_len, d_model)
position = torch.arange(0, seq_len).unsqueeze(1) # (seq_len, 1)
div_term = torch.exp(
torch.arange(0, d_model, 2) *
-(math.log(10000.0) / d_model)
)
# Apply sin to even indices
pe[:, 0::2] = torch.sin(position * div_term)
# Apply cos to odd indices
pe[:, 1::2] = torch.cos(position * div_term)
return pe
# Generate positional encodings
pos_enc = positional_encoding(seq_len=100, d_model=512)
Why Sinusoidal?
β Unique encoding for each position
β Relative positions have consistent patterns
β Can extrapolate to longer sequences than seen during training
β No learned parameters (generalizes better)
Final Input:
input_representation = token_embedding + positional_encoding
# Both have shape (seq_len, d_model)
Encoder BlockΒΆ
Each encoder consists of two main components:
1. Multi-Head Self-AttentionΒΆ
class EncoderBlock(nn.Module):
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
# Multi-head attention
self.attention = MultiHeadAttention(d_model, num_heads)
self.norm1 = nn.LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
# Feed-forward network
self.ffn = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model)
)
self.norm2 = nn.LayerNorm(d_model)
self.dropout2 = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Multi-head attention with residual connection
attn_output = self.attention(x, x, x, mask)
x = self.norm1(x + self.dropout1(attn_output))
# Feed-forward with residual connection
ffn_output = self.ffn(x)
x = self.norm2(x + self.dropout2(ffn_output))
return x
Components BreakdownΒΆ
Multi-Head Self-AttentionΒΆ
Input (seq_len, d_model)
β
[Linear projections to Q, K, V]
β
[Split into num_heads]
β
[Scaled dot-product attention per head]
β
[Concatenate heads]
β
[Linear projection]
β
Output (seq_len, d_model)
Parameters:
Original transformer: 8 heads, d_model=512, d_k=d_v=64
Add & Norm (Residual Connection + Layer Normalization)ΒΆ
Residual Connection:
output = x + sublayer(x)
Why?
Helps gradients flow during backpropagation
Allows training very deep networks
Model can learn identity function if needed
Layer Normalization:
# Normalize across features for each sample
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
normalized = (x - mean) / (std + epsilon)
output = gamma * normalized + beta # Learnable parameters
Why?
Stabilizes training
Reduces internal covariate shift
Allows higher learning rates
Position-wise Feed-Forward NetworkΒΆ
FFN(x) = max(0, x @ W1 + b1) @ W2 + b2
= ReLU(x @ W1 + b1) @ W2 + b2
Architecture:
Input (seq_len, d_model=512)
β
[Linear Layer: 512 β 2048]
β
[ReLU Activation]
β
[Linear Layer: 2048 β 512]
β
Output (seq_len, d_model=512)
Why?
Adds non-linearity and model capacity
Processes each position independently
d_ff typically 4x larger than d_model
Complete EncoderΒΆ
Stack N encoder blocks (N=6 in original):
class Encoder(nn.Module):
def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
self.layers = nn.ModuleList([
EncoderBlock(d_model, num_heads, d_ff, dropout)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(d_model)
def forward(self, x, mask=None):
for layer in self.layers:
x = layer(x, mask)
return self.norm(x)
Decoder BlockΒΆ
ArchitectureΒΆ
Similar to encoder but with additional components:
class DecoderBlock(nn.Module):
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
# Masked multi-head self-attention
self.self_attention = MultiHeadAttention(d_model, num_heads)
self.norm1 = nn.LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
# Cross-attention to encoder output
self.cross_attention = MultiHeadAttention(d_model, num_heads)
self.norm2 = nn.LayerNorm(d_model)
self.dropout2 = nn.Dropout(dropout)
# Feed-forward network
self.ffn = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model)
)
self.norm3 = nn.LayerNorm(d_model)
self.dropout3 = nn.Dropout(dropout)
def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
# Masked self-attention
self_attn = self.self_attention(x, x, x, tgt_mask)
x = self.norm1(x + self.dropout1(self_attn))
# Cross-attention to encoder
cross_attn = self.cross_attention(x, encoder_output, encoder_output, src_mask)
x = self.norm2(x + self.dropout2(cross_attn))
# Feed-forward
ffn_output = self.ffn(x)
x = self.norm3(x + self.dropout3(ffn_output))
return x
Three Attention MechanismsΒΆ
1. Masked Self-Attention (Causal Attention)ΒΆ
Purpose: Prevent decoder from seeing future tokens
When generating word 3, can only see words 0, 1, 2
Mask (lower triangular):
[[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]
0 positions get -β in attention scores,
resulting in 0 attention weight after softmax
Why?
During training: Model learns to generate sequentially
During inference: Matches training behavior
2. Cross-Attention (Encoder-Decoder Attention)ΒΆ
Purpose: Decoder attends to encoder outputs
Query: From decoder current state
Key & Value: From encoder output
Example (Translation):
English (encoder): "I love machine learning"
French (decoder): "J'adore"
When generating next French word:
- Query: Current decoder state for "J'adore"
- Attends to: All English words
- Focuses on: "love" and "machine learning"
- Generates: "l'apprentissage"
3. Feed-Forward NetworkΒΆ
Same as encoder FFN.
Three Transformer VariantsΒΆ
1. Encoder-Only (BERT)ΒΆ
Input β [Encoder Stack] β Output
Use cases:
- Text classification (sentiment, topic)
- Named entity recognition
- Question answering (extractive)
- Text similarity
Examples: BERT, RoBERTa, DistilBERT, ALBERT
Architecture:
class BERTModel:
def __init__(self):
self.embedding = Embedding()
self.encoder = Encoder(num_layers=12)
def forward(self, input_ids):
x = self.embedding(input_ids)
encoded = self.encoder(x)
return encoded # Use for downstream tasks
Training: Masked Language Modeling (MLM)
Input: "The [MASK] sat on the mat"
Target: "cat"
2. Decoder-Only (GPT)ΒΆ
Input β [Decoder Stack] β Next Token
Use cases:
- Text generation
- Completion
- Chatbots
- Code generation
Examples: GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral
Architecture:
class GPTModel:
def __init__(self):
self.embedding = Embedding()
self.decoder = Decoder(num_layers=12) # Only self-attention, no cross-attention
self.lm_head = nn.Linear(d_model, vocab_size)
def forward(self, input_ids):
x = self.embedding(input_ids)
decoded = self.decoder(x, causal_mask=True)
logits = self.lm_head(decoded)
return logits # Probabilities for next token
Training: Causal Language Modeling (CLM)
Input: "The cat sat"
Target: "cat sat on"
3. Encoder-Decoder (T5, BART)ΒΆ
Input β [Encoder] β Context β [Decoder] β Output
Use cases:
- Machine translation
- Summarization
- Question answering (generative)
- Text-to-text tasks
Examples: T5, BART, mT5, MarianMT
Architecture:
class Seq2SeqModel:
def __init__(self):
self.encoder = Encoder(num_layers=6)
self.decoder = Decoder(num_layers=6)
def forward(self, src_ids, tgt_ids):
# Encode source
encoder_output = self.encoder(src_ids)
# Decode target
decoder_output = self.decoder(tgt_ids, encoder_output)
return decoder_output
Training: Seq2Seq
Source: "Translate to French: I love AI"
Target: "J'adore l'IA"
Training and InferenceΒΆ
Training ProcessΒΆ
# 1. Forward pass
encoder_output = encoder(source_tokens)
decoder_output = decoder(target_tokens, encoder_output)
logits = output_layer(decoder_output)
# 2. Compute loss
loss = cross_entropy(logits, target_labels)
# 3. Backward pass
loss.backward()
# 4. Update weights
optimizer.step()
Inference (Generation)ΒΆ
Autoregressive Generation:
def generate(model, prompt, max_length=50):
"""Generate text token by token"""
tokens = tokenize(prompt)
for _ in range(max_length):
# Forward pass
logits = model(tokens)
# Get next token prediction
next_token_logits = logits[-1] # Last position
next_token = sample(next_token_logits) # Sample or argmax
# Append to sequence
tokens.append(next_token)
# Stop if EOS token
if next_token == EOS_TOKEN:
break
return detokenize(tokens)
Sampling Strategies:
Greedy: Always pick most likely token
next_token = torch.argmax(logits)
Top-k: Sample from top k most likely tokens
top_k_logits, top_k_indices = torch.topk(logits, k=50)
probs = softmax(top_k_logits)
next_token = sample(top_k_indices, probs)
Top-p (Nucleus): Sample from tokens comprising top p probability mass
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
nucleus = cumsum <= p
nucleus_probs = sorted_probs[nucleus]
next_token = sample(sorted_indices[nucleus], nucleus_probs)
Temperature: Control randomness
logits = logits / temperature # Higher temp = more random
probs = softmax(logits)
Modern InnovationsΒΆ
1. Efficient AttentionΒΆ
Problem: Standard attention is O(nΒ²) in sequence length
Solutions:
Sparse Attention (GPT-3):
Only attend to subset of positions
Local + strided patterns
O(nβn) complexity
Linear Attention:
Reformulate attention to be O(n)
Kernel-based methods
Examples: Performer, Linear Transformer
Flash Attention:
GPU memory-optimized
Faster training/inference
No approximation, exact attention
2. Architectural ImprovementsΒΆ
Pre-Norm vs Post-Norm:
# Post-Norm (original)
x = norm(x + sublayer(x))
# Pre-Norm (more stable)
x = x + sublayer(norm(x))
RMSNorm: Simpler, faster than LayerNorm
SwiGLU Activation: Better than ReLU for LLMs
3. Positional EncodingsΒΆ
Relative Positional Encoding (T5):
Attention scores modified based on relative distance
Better for variable-length sequences
Rotary Position Embedding (RoPE):
Used in GPT-NeoX, LLaMA
Encodes relative positions via rotation
Better extrapolation to longer sequences
ALiBi (Attention with Linear Biases):
Add distance-based bias to attention scores
Simple, effective for long sequences
4. Mixture of Experts (MoE)ΒΆ
Input β [Router] β Select K experts β Average outputs
Benefits:
- Larger model capacity
- Same inference cost (only K experts active)
- Used in GPT-4, Mixtral
5. Model SizesΒΆ
Evolution:
2018 BERT-base: 110M parameters
2019 GPT-2: 1.5B parameters
2020 GPT-3: 175B parameters
2023 GPT-4: ~1.7T parameters (rumored, MoE)
SummaryΒΆ
Key Takeaways:
Attention replaces recurrence for sequence modeling
Parallel processing makes training efficient
Positional encoding adds sequence order information
Residual connections enable deep networks
Layer normalization stabilizes training
Three variants:
Encoder-only: Understanding tasks (BERT)
Decoder-only: Generation tasks (GPT)
Encoder-decoder: Transformation tasks (T5)
Modern developments:
Efficient attention mechanisms
Better positional encodings
Architectural improvements
Scaling to trillions of parameters
Next Steps:
Implement transformer components in code
Fine-tune pre-trained models
Build applications with Hugging Face Transformers
The transformer architecture is the foundation of modern AI! π