Transformer Architecture: Complete Guide

Table of Contents

  1. Overview

  2. The Big Picture

  3. Input Processing

  4. Encoder Block

  5. Decoder Block

  6. Three Transformer Variants

  7. Training and Inference

  8. Modern Innovations

Overview

The Transformer, introduced in “Attention Is All You Need” (Vaswani et al., 2017), replaced recurrent layers with attention mechanisms and has since become the dominant architecture for language models.

Key Innovation

Before (RNN/LSTM):
  Sequential processing → Slow training, gradient issues

After (Transformer):
  Parallel attention → Fast training, better performance

Core Components

  1. Multi-Head Self-Attention: Focus on relevant parts

  2. Position-wise Feed-Forward: Process each position

  3. Positional Encoding: Add sequence order information

  4. Layer Normalization: Stabilize training

  5. Residual Connections: Help gradient flow

The Big Picture

Original Transformer (Encoder-Decoder)

Input Sequence → [ENCODER] → Context → [DECODER] → Output Sequence
   "Hello"         ↓                       ↓           "Bonjour"
                6x layers              6x layers
                   ↓                       ↓
              Representation          Generation

Architecture Diagram

                    OUTPUT
                      ↑
                [Linear + Softmax]
                      ↑
                [Decoder Stack]
                 (N x Decoder)
              /      ↑      \
    [Multi-Head  [Feed     [Add &
     Attention]  Forward]   Norm]
         ↑          ↑          ↑
    [Encoder Stack Output]    |
         ↑                     |
    [Encoder Stack]            |
     (N x Encoder)             |
         ↑                     |
    [Input Embedding + Positional Encoding]
         ↑
      INPUT

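Input Processing

1. Token Embedding

Convert tokens to dense vectors:

import torch
import torch.nn as nn

# Input: Token IDs (nn.Embedding expects an integer tensor, not a Python list)
tokens = torch.tensor([2054, 2003, 1996])  # "what is the"

# Embedding layer: one learned d_model-sized vector per vocabulary entry
embedding = nn.Embedding(num_embeddings=30000, embedding_dim=512)
embedded = embedding(tokens)
# Shape: (seq_len, d_model) = (3, 512)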

2. Positional Encoding

Problem: Attention has no notion of order!

"cat sat on mat" == "mat on sat cat" (to attention mechanism)

Solution: Add position information to embeddings.
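The problem and the fix can be checked on a toy case. The sketch below uses plain Python lists instead of tensors, made-up 2-d embeddings, identity projections (Q = K = V = the input), and a simple (sin, cos) position signal; all of these are simplifying assumptions for illustration, not part of the original architecture.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Made-up 2-d "embeddings" for four tokens
emb = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "on": [1.0, 1.0], "mat": [0.5, -0.5]}

def outputs_by_token(tokens, use_positions):
    """Run attention over the tokens and return {token: output vector}."""
    rows = []
    for pos, tok in enumerate(tokens):
        vec = list(emb[tok])
        if use_positions:
            # Toy position signal (an assumption, simpler than the real formula)
            vec = [vec[0] + math.sin(pos), vec[1] + math.cos(pos)]
        rows.append(vec)
    return dict(zip(tokens, self_attention(rows)))

def same_outputs(a, b):
    """True if every token gets the same output vector in both runs."""
    return all(math.isclose(x, y, abs_tol=1e-9) for t in a for x, y in zip(a[t], b[t]))

forward = ["cat", "sat", "on", "mat"]
backward = ["mat", "on", "sat", "cat"]

without_pos = same_outputs(outputs_by_token(forward, False), outputs_by_token(backward, False))
with_pos = same_outputs(outputs_by_token(forward, True), outputs_by_token(backward, True))
print(without_pos)  # True: word order is invisible to attention alone
print(with_pos)     # False: position information breaks the symmetry
```

Without the position signal each token receives the same output vector in either word order; once positions are added, the outputs differ.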

Formula (Sinusoidal Encoding):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension-pair index (0, 1, 2, ..., d_model/2 - 1)
- d_model = embedding dimension (512 in original)

Implementation:

import torch
import math

def positional_encoding(seq_len, d_model):
    """Generate positional encodings"""
    pe = torch.zeros(seq_len, d_model)
    
    position = torch.arange(0, seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * 
        -(math.log(10000.0) / d_model)
    )
    
    # Apply sin to even indices
    pe[:, 0::2] = torch.sin(position * div_term)
    # Apply cos to odd indices
    pe[:, 1::2] = torch.cos(position * div_term)
    
    return pe

# Generate positional encodings
pos_enc = positional_encoding(seq_len=100, d_model=512)

Why Sinusoidal?

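  • ✅ Unique encoding for each position

  • ✅ Relative positions have consistent patterns: for a fixed offset k, PE(pos + k) is a linear function (a rotation) of PE(pos)

  • ✅ May extrapolate to sequences longer than those seen during training (a hypothesis in the original paper; later work found the effect limited in practice)

  • ✅ No learned parameters (the original paper reported nearly identical results with learned position embeddings)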
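The relative-position property can be verified numerically. The following sketch uses only the standard library; the helper names `pe_pair` and `shift` are hypothetical and exist only for this illustration. It checks that the (sin, cos) pair at position pos + k equals the pair at position pos rotated by an angle that depends only on k.

```python
import math

def pe_pair(pos, i, d_model=512):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) from the sinusoidal formula."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift(pair, k, i, d_model=512):
    """Rotate a (sin, cos) pair by the angle that corresponds to offset k."""
    s, c = pair
    theta = k / (10000 ** (2 * i / d_model))
    # sin(a + t) = sin(a)cos(t) + cos(a)sin(t)
    # cos(a + t) = cos(a)cos(t) - sin(a)sin(t)
    return (s * math.cos(theta) + c * math.sin(theta),
            c * math.cos(theta) - s * math.sin(theta))

pos, k, i = 7, 5, 3
direct = pe_pair(pos + k, i)             # encoding computed at position 12
rotated = shift(pe_pair(pos, i), k, i)   # encoding at position 7, shifted by 5
print(direct)
print(rotated)
```

The two printed pairs agree up to floating-point rounding, and the rotation does not depend on pos, which is why a fixed offset looks the same everywhere in the sequence.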

Final Input:

input_representation = token_embedding + positional_encoding
# Both have shape (seq_len, d_model)

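Encoder Block

Each encoder block has two sub-layers, multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and followed by layer normalization:

1. Multi-Head Self-Attention

import torch.nn as nn

# MultiHeadAttention is assumed to be defined elsewhere and called as (query, key, value, mask)
class EncoderBlock(nn.Module):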
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Multi-head attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        
        return x

Components Breakdown

Multi-Head Self-Attention

Input (seq_len, d_model)
    ↓
[Linear projections to Q, K, V]
    ↓
[Split into num_heads]
    ↓
[Scaled dot-product attention per head]
    ↓
[Concatenate heads]
    ↓
[Linear projection]
    ↓
Output (seq_len, d_model)

Parameters:

  • Original transformer: 8 heads, d_model=512, d_k=d_v=64
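The `MultiHeadAttention` module used by the block classes is not defined in this guide. As a rough sketch of what each head computes, the code below implements scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, plus the split and concatenate steps from the diagram. It uses plain Python lists, omits the learned Q/K/V and output projections, and the function names are assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head; q, k, v are lists of rows."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        output.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                       for j in range(len(v[0]))])
    return output

def split_heads(x, num_heads):
    """Split each d_model-wide row into num_heads chunks of width d_model // num_heads."""
    d_k = len(x[0]) // num_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

def concat_heads(heads):
    """Inverse of split_heads: join the per-head rows back into d_model-wide rows."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]

# Toy input: 3 positions, d_model = 4, 2 heads (so d_k = 2)
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
heads = split_heads(x, num_heads=2)
attended = [scaled_dot_product_attention(h, h, h) for h in heads]
out = concat_heads(attended)

print(len(out), len(out[0]))  # output has the same shape as the input
print(512 // 8)               # d_k for the original Transformer's 8 heads
```

Each head attends over a narrower slice of the features, so the total cost is similar to a single full-width head while letting different heads learn different attention patterns.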

Add & Norm (Residual Connection + Layer Normalization)

Residual Connection:

output = x + sublayer(x)

Why?

  • Helps gradients flow during backpropagation

  • Allows training very deep networks

  • Model can learn identity function if needed

Layer Normalization:

# Normalize across features for each sample
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # Biased variance, as in nn.LayerNorm
normalized = (x - mean) / torch.sqrt(var + epsilon)  # epsilon goes inside the square root
output = gamma * normalized + beta  # Learnable parameters

Why?

  • Stabilizes training

  • Keeps activation scales consistent across layers and examples

  • Allows higher learning rates

Position-wise Feed-Forward Network

FFN(x) = max(0, x @ W1 + b1) @ W2 + b2
       = ReLU(x @ W1 + b1) @ W2 + b2

Architecture:

Input (seq_len, d_model=512)
    ↓
[Linear Layer: 512 β†’ 2048]
    ↓
[ReLU Activation]
    ↓
[Linear Layer: 2048 β†’ 512]
    ↓
Output (seq_len, d_model=512)

Why?

  • Adds non-linearity and model capacity

  • Processes each position independently

  • d_ff typically 4x larger than d_model
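As a quick piece of arithmetic, the sketch below counts the parameters of one feed-forward sub-layer with the original sizes (d_model=512, d_ff=2048). The helper name `ffn_param_count` is made up for this example.

```python
def ffn_param_count(d_model, d_ff):
    """Weights and biases of the two linear layers in one position-wise FFN."""
    first = d_model * d_ff + d_ff      # W1 and b1
    second = d_ff * d_model + d_model  # W2 and b2
    return first + second

per_layer = ffn_param_count(d_model=512, d_ff=2048)
print(per_layer)      # 2099712
print(per_layer * 6)  # 12598272 across the 6 encoder layers
```

Roughly 2.1M parameters per layer sit in the feed-forward network, which is why d_ff is a major lever on model size.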

Complete Encoder

Stack N encoder blocks (N=6 in original):

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

Decoder Block

Architecture

Similar to the encoder block, but the self-attention is masked and a cross-attention sub-layer over the encoder output is added:

class DecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Masked multi-head self-attention
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        
        # Cross-attention to encoder output
        self.cross_attention = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)
    
    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Masked self-attention
        self_attn = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout1(self_attn))
        
        # Cross-attention to encoder
        cross_attn = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout2(cross_attn))
        
        # Feed-forward
        ffn_output = self.ffn(x)
        x = self.norm3(x + self.dropout3(ffn_output))
        
        return x

Three Sub-Layers

1. Masked Self-Attention (Causal Attention)

Purpose: Prevent the decoder from attending to future tokens

When generating word 3, the model can only see words 0, 1, 2
(row i of the mask lets position i attend to positions 0 through i)

Mask (lower triangular):
  [[1, 0, 0, 0],
   [1, 1, 0, 0],
   [1, 1, 1, 0],
   [1, 1, 1, 1]]

0 positions get -∞ in attention scores, 
resulting in 0 attention weight after softmax
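The masking step can be sketched in a few lines of plain Python. The raw scores below are made up, and the helper names are assumptions for this example; the point is only that masked entries receive exactly zero weight after softmax.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: 1 where position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with masked-out entries set to -inf first."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because the diagonal entry is always kept
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [[0.5, 1.0, -0.3, 2.0] for _ in range(4)]  # made-up raw attention scores
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

Every printed row sums to 1, and all entries above the diagonal are 0, so no position draws information from a later one.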

Why?

  • During training: Model learns to generate sequentially

  • During inference: Matches training behavior

2. Cross-Attention (Encoder-Decoder Attention)

Purpose: Decoder attends to encoder outputs

Query: From decoder current state
Key & Value: From encoder output

Example (Translation):
English (encoder): "I love machine learning"
French (decoder): "J'adore"

When generating next French word:
- Query: Current decoder state for "J'adore"
- Attends to: All English words
- Focuses on: "love" and "machine learning"
- Generates: "l'apprentissage"

3. Feed-Forward Network

Same as encoder FFN.

Three Transformer Variants

1. Encoder-Only (BERT)

Input → [Encoder Stack] → Output

Use cases:
- Text classification (sentiment, topic)
- Named entity recognition
- Question answering (extractive)
- Text similarity

Examples: BERT, RoBERTa, DistilBERT, ALBERT

Architecture:

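class BERTModel(nn.Module):
    # Defaults are the BERT-base sizes
    def __init__(self, vocab_size=30522, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        # Token embeddings only; BERT also adds learned position and segment embeddings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.encoder = Encoder(num_layers=12, d_model=d_model,
                               num_heads=num_heads, d_ff=d_ff)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        encoded = self.encoder(x)  # No causal mask: every token sees the full sequence
        return encoded  # Use for downstream tasks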

Training: Masked Language Modeling (MLM)

Input:  "The [MASK] sat on the mat"
Target: "cat"
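A minimal sketch of how such a training pair could be built is shown below. It is deliberately simplified: the positions to mask are passed in explicitly so the result is deterministic, whereas BERT selects about 15% of tokens at random and replaces only 80% of those with [MASK] (10% become a random token, 10% stay unchanged). The function name `mask_tokens` is hypothetical, and real pipelines work on token IDs rather than strings.

```python
IGNORE = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Return (masked_input, labels) with the given positions masked out."""
    masked = list(tokens)
    labels = [IGNORE] * len(tokens)
    for p in positions:
        labels[p] = tokens[p]   # the model must recover the original token here
        masked[p] = mask_token
    return masked, labels

tokens = ["The", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, positions=[1])
print(masked)
print(labels)
```

Only the masked position carries a label, so the loss is computed on "cat" alone while the model still reads the whole sentence in both directions.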

2. Decoder-Only (GPT)

Input → [Decoder Stack] → Next Token

Use cases:
- Text generation
- Completion
- Chatbots
- Code generation

Examples: GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral

Architecture:

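class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Decoder is assumed to be a stack of blocks with masked self-attention only (no cross-attention)
        self.decoder = Decoder(num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        decoded = self.decoder(x, causal_mask=True)
        logits = self.lm_head(decoded)
        return logits  # Unnormalized scores; softmax turns them into next-token probabilities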

Training: Causal Language Modeling (CLM)

Input:  "The cat sat"
Target: "cat sat on"
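The input/target pair above is just the same token sequence shifted by one position. A small sketch, with a hypothetical helper name:

```python
def make_clm_example(tokens):
    """Inputs are all tokens but the last; targets are the same tokens shifted left by one."""
    return tokens[:-1], tokens[1:]

inputs, targets = make_clm_example(["The", "cat", "sat", "on"])

for i in range(len(inputs)):
    # With a causal mask, position i sees only inputs[0..i]
    print(inputs[: i + 1], "->", targets[i])
```

Because of the causal mask, a single forward pass yields one next-token prediction per position, so every prefix of the sentence serves as a training example.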

3. Encoder-Decoder (T5, BART)

Input → [Encoder] → Context → [Decoder] → Output

Use cases:
- Machine translation
- Summarization
- Question answering (generative)
- Text-to-text tasks

Examples: T5, BART, mT5, MarianMT

Architecture:

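class Seq2SeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Schematic: embeddings, the output layer, and other constructor arguments are omitted
        self.encoder = Encoder(num_layers=6)
        self.decoder = Decoder(num_layers=6)

    def forward(self, src_ids, tgt_ids):
        # Encode source
        encoder_output = self.encoder(src_ids)

        # Decode target, attending to the encoder output via cross-attention
        decoder_output = self.decoder(tgt_ids, encoder_output)
        return decoder_output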

Training: Seq2Seq

Source: "Translate to French: I love AI"
Target: "J'adore l'IA"

Training and Inference

Training Process

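# 1. Forward pass (teacher forcing: the decoder is fed the target shifted right by one)
encoder_output = encoder(source_tokens)
decoder_output = decoder(target_tokens, encoder_output)
logits = output_layer(decoder_output)

# 2. Compute loss against the next token at each position
loss = cross_entropy(logits, target_labels)

# 3. Backward pass
optimizer.zero_grad()  # Clear gradients left over from the previous step
loss.backward()

# 4. Update weights
optimizer.step()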

Inference (Generation)

Autoregressive Generation:

def generate(model, prompt, max_length=50):
    """Generate text token by token"""
    tokens = tokenize(prompt)
    
    for _ in range(max_length):
        # Forward pass
        logits = model(tokens)
        
        # Get next token prediction
        next_token_logits = logits[-1]  # Last position
        next_token = sample(next_token_logits)  # Sample or argmax
        
        # Append to sequence
        tokens.append(next_token)
        
        # Stop if EOS token
        if next_token == EOS_TOKEN:
            break
    
    return detokenize(tokens)

Sampling Strategies:

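  1. Greedy: Always pick the most likely token

next_token = torch.argmax(logits, dim=-1)

  2. Top-k: Sample from the k most likely tokens

top_k_logits, top_k_indices = torch.topk(logits, k=50)
probs = torch.softmax(top_k_logits, dim=-1)
next_token = top_k_indices[torch.multinomial(probs, num_samples=1)]

  3. Top-p (Nucleus): Sample from the smallest set of tokens whose cumulative probability reaches p

probs = torch.softmax(logits, dim=-1)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
# Keep a token if the mass before it is below p, so the token that crosses
# the threshold is included and the nucleus is never empty
nucleus = (cumsum - sorted_probs) < p
nucleus_probs = sorted_probs[nucleus] / sorted_probs[nucleus].sum()  # Renormalize
next_token = sorted_indices[nucleus][torch.multinomial(nucleus_probs, num_samples=1)]

  4. Temperature: Control randomness (usually combined with one of the strategies above)

logits = logits / temperature  # Higher temperature = more random
probs = torch.softmax(logits, dim=-1)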
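To experiment with these strategies without a model, the sketch below reimplements them with the Python standard library on a made-up five-token vocabulary. The function names are assumptions for this example, and a fixed random seed is used only to make the run repeatable.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely token ids, renormalized to sum to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely ids whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores over a 5-token vocabulary
rng = random.Random(0)

greedy = max(range(len(logits)), key=lambda i: logits[i])
top_k = top_k_filter(softmax(logits), k=2)
top_p = top_p_filter(softmax(logits), p=0.9)
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
drawn = sample(top_p, rng)

print(greedy, sorted(top_k), sorted(top_p))
print(round(sharp[0], 3), round(flat[0], 3))
print(drawn)
```

Lower temperature concentrates probability on the top token, while top-k and top-p cut off the unlikely tail before sampling.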

Modern Innovations

1. Efficient Attention

Problem: Standard attention is O(n²) in sequence length, in both time and memory
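A back-of-the-envelope calculation shows how quickly the cost grows. The sketch assumes 16-bit values, 8 heads, a single sequence, and one layer; the function name is made up for this example.

```python
def attention_matrix_bytes(seq_len, num_heads, bytes_per_value=2):
    """Memory for one layer's full attention score matrices (one n x n matrix per head)."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1024, 4096, 16384):
    mib = attention_matrix_bytes(n, num_heads=8) / 2**20
    print(n, mib)  # sequence length, MiB of attention scores
```

Each 4x increase in sequence length multiplies the memory for the score matrices by 16.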

Solutions:

Sparse Attention (Sparse Transformer; also used in some GPT-3 layers):

  • Only attend to subset of positions

  • Local + strided patterns

  • O(n√n) complexity

Linear Attention:

  • Reformulate attention to be O(n)

  • Kernel-based methods

  • Examples: Performer, Linear Transformer

Flash Attention:

  • IO-aware: computes attention in tiles so the full n x n score matrix is never written to GPU main memory

  • Faster training/inference

  • No approximation, exact attention

2. Architectural Improvements

Pre-Norm vs Post-Norm:

# Post-Norm (original)
x = norm(x + sublayer(x))

# Pre-Norm (more stable)
x = x + sublayer(norm(x))

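RMSNorm: Simpler and cheaper than LayerNorm. It rescales by the root mean square of the features, with no mean subtraction and no bias (used in LLaMA).

SwiGLU Activation: A gated feed-forward variant that multiplies a Swish/SiLU-activated projection by a second linear projection. It often outperforms ReLU feed-forward layers in LLMs (used in PaLM and LLaMA).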
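Both ideas fit in a few lines. The sketch below is a plain-Python illustration for a single feature vector, with made-up weights and biases omitted; the function names are assumptions, not a library API.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction and no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

x = [3.0, 4.0]
normed = rms_norm(x, gain=[1.0, 1.0])
print(normed)

# Made-up weights: d_model = 2, hidden width = 3
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ffn_out = swiglu_ffn(normed, w_gate, w_up, w_down)
print(ffn_out)
```

After `rms_norm` the vector has a root mean square of about 1, and the gate lets the network scale each hidden unit smoothly instead of clipping it at zero as ReLU does.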

3. Positional Encodings

Relative Positional Encoding (T5):

  • Attention scores modified based on relative distance

  • Better for variable-length sequences

Rotary Position Embedding (RoPE):

  • Used in GPT-NeoX, LLaMA

  • Encodes relative positions via rotation

  • Better extrapolation to longer sequences
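The rotation idea can be illustrated in two dimensions. This is a sketch under simplifying assumptions: a single frequency pair with an arbitrary theta of 0.1 and made-up query and key vectors, whereas real RoPE applies many pairs with different frequencies across the head dimension.

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-d vector by the angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def score(q, k, q_pos, k_pos):
    """Dot product of the position-rotated query and key."""
    qr, kr = rotate(q, q_pos), rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)            # made-up query and key
near = score(q, k, q_pos=5, k_pos=3)      # offset 2
far = score(q, k, q_pos=105, k_pos=103)   # same offset, shifted by 100 positions
other = score(q, k, q_pos=5, k_pos=4)     # different offset
print(near, far, other)
```

The first two scores match up to rounding because only the offset between the two positions matters; the third differs because the offset changed.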

ALiBi (Attention with Linear Biases):

  • Add distance-based bias to attention scores

  • Simple, effective for long sequences
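A small sketch of the bias matrix follows, assuming causal attention and a power-of-two head count, for which the slopes form a geometric sequence (1/2, 1/4, ..., 1/256 for 8 heads). The function names are made up for this example.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes: 2^(-8/n), 2^(-16/n), ... for n heads."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to causal attention scores: slope * (key_pos - query_pos), never positive."""
    return [[slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes[0], slopes[-1])
for row in bias:
    print(row)
```

The penalty grows linearly with distance, so nearby keys are favored, and heads with smaller slopes keep a longer effective range.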

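4. Mixture of Experts (MoE)

Input → [Router] → Select K experts → Weighted sum of their outputs

Benefits:
- Larger model capacity
- Per-token compute depends on K, not on the total number of experts (all experts must still be held in memory)
- Used in Mixtral; reportedly also in GPT-4 (unconfirmed)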
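The routing step can be sketched for a single token as follows. The experts here are toy functions standing in for feed-forward networks, the router scores are made up, and renormalizing the gate weights with a softmax over the selected experts is one common choice rather than the only one.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, experts, k=2):
    """Route one token to its top-k experts and combine their outputs by gate weight."""
    top = sorted(range(len(experts)), key=lambda e: router_logits[e], reverse=True)[:k]
    gates = softmax([router_logits[e] for e in top])  # renormalize over the chosen experts
    outputs = [experts[e](x) for e in top]            # only k experts are evaluated
    combined = [sum(g * out[j] for g, out in zip(gates, outputs)) for j in range(len(x))]
    return combined, top

# Four toy "experts" (real experts are feed-forward networks)
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [a * a for a in v],
]
x = [1.0, 2.0]
router_logits = [0.1, 2.0, -1.0, 1.0]  # made-up router scores for this token
y, chosen = moe_layer(x, router_logits, experts, k=2)
print(chosen, y)
```

Only two of the four experts run for this token, which is how an MoE layer adds parameters without a matching increase in per-token compute.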

5. Model Sizes

Evolution:

  • 2018 BERT-base: 110M parameters

  • 2019 GPT-2: 1.5B parameters

  • 2020 GPT-3: 175B parameters

  • 2023 GPT-4: ~1.7T parameters (rumored, MoE)

Summary

Key Takeaways:

  1. Attention replaces recurrence for sequence modeling

  2. Parallel processing makes training efficient

  3. Positional encoding adds sequence order information

  4. Residual connections enable deep networks

  5. Layer normalization stabilizes training

Three variants:

  • Encoder-only: Understanding tasks (BERT)

  • Decoder-only: Generation tasks (GPT)

  • Encoder-decoder: Transformation tasks (T5)

Modern developments:

  • Efficient attention mechanisms

  • Better positional encodings

  • Architectural improvements

  • Scaling to trillions of parameters

Next Steps:

  • Implement transformer components in code

  • Fine-tune pre-trained models

  • Build applications with Hugging Face Transformers

The transformer architecture is the foundation of modern AI! 🚀