import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

1. GPT vs BERT

Feature      | GPT                   | BERT
-------------|-----------------------|---------------
Architecture | Decoder-only          | Encoder-only
Attention    | Causal (masked)       | Bidirectional
Objective    | Next token prediction | MLM + NSP
Use case     | Generation            | Understanding

Autoregressive Language Modeling

\[P(x_{1:T}) = \prod_{t=1}^T P(x_t | x_{<t})\]

Training: maximize the log-likelihood \(\mathcal{L} = \sum_{t=1}^T \log P(x_t | x_{<t})\)
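The chain-rule factorization can be checked numerically. A minimal sketch with made-up per-step conditionals (the numbers are purely illustrative, not from any model):

```python
import math

# Hypothetical per-step conditionals P(x_t | x_<t) for a 4-token sequence
step_probs = [0.5, 0.25, 0.1, 0.4]

# Joint probability is the product of the chain-rule factors
joint = math.prod(step_probs)

# Log-likelihood (what training maximizes) is the sum of log factors
log_lik = sum(math.log(p) for p in step_probs)

assert abs(math.exp(log_lik) - joint) < 1e-12
print(f"P(x_1:T) = {joint:.4f}, log-likelihood = {log_lik:.4f}")
```

Working in log space is what makes the per-position losses additive during training.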

1.5. Causal Masking: Mathematical Foundation

Why Causal Masking?

In autoregressive generation, we must prevent “information leakage” from future tokens:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]

where the causal mask \(M\) is:

\[\begin{split}M_{ij} = \begin{cases} 0 & \text{if } i \geq j \text{ (can attend)} \\ -\infty & \text{if } i < j \text{ (cannot attend)} \end{cases}\end{split}\]

Interpretation:

  • Token at position \(i\) can only attend to positions \(j \leq i\)

  • After softmax, \(\exp(-\infty) = 0\) → future positions get zero weight

  • Ensures \(P(x_t | x_{<t})\) only uses past context

Comparison of Attention Patterns

Model     | Attention Type                | Mask Pattern              | Training Objective
----------|-------------------------------|---------------------------|----------------------
GPT       | Causal (unidirectional)       | Lower triangular          | Next token prediction
BERT      | Bidirectional                 | Full matrix (some masked) | MLM + NSP
Prefix LM | Bidirectional prefix + causal | Block matrix              | Span prediction

Mathematical Consequences

1. Training Efficiency:

For sequence length \(T\), we get \(T\) training signals per sequence:

\[\mathcal{L} = -\sum_{t=1}^T \log P(x_t | x_1, \ldots, x_{t-1})\]

Each position learns to predict the next token given all previous context.

2. Generation Constraint:

At inference, must generate sequentially:

  • Cannot parallelize generation across time

  • Complexity: \(O(T)\) forward passes for \(T\) tokens

  • But: Can batch generate multiple sequences

3. Context Window:

Effective context is cumulative:

  • Position 1: sees nothing (just BOS)

  • Position \(t\): sees \(t-1\) previous tokens

  • Position \(T\): sees full sequence

Implementation Details

# Creating causal mask
def create_causal_mask(seq_len):
    """Lower triangular matrix for causal attention."""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    # Convert 0s to -inf, 1s to 0
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

# Example: 5x5 causal mask
mask = create_causal_mask(5)
print("Causal mask pattern:")
print(mask)

Attention scores before/after masking:

Before:  [[0.8, 0.6, 0.5, 0.4, 0.3],    After softmax with mask:
          [0.7, 0.9, 0.6, 0.5, 0.4],     [[1.0, 0.0, 0.0, 0.0, 0.0],
          [0.6, 0.5, 0.8, 0.7, 0.5],  →   [0.4, 0.6, 0.0, 0.0, 0.0],
          [0.5, 0.6, 0.7, 0.9, 0.8],      [0.2, 0.3, 0.5, 0.0, 0.0],
          [0.4, 0.5, 0.6, 0.7, 1.0]]      [0.1, 0.2, 0.2, 0.3, 0.2]]

Causal vs Bidirectional Trade-offs

GPT (Causal) Advantages:

  • ✅ Natural for generation tasks

  • ✅ Coherent long-form text

  • ✅ Simple training objective

  • ✅ No special tokens needed

GPT Disadvantages:

  • ❌ Cannot see future context

  • ❌ Less effective for classification

  • ❌ Slower inference (sequential)

BERT (Bidirectional) Advantages:

  • ✅ Full context for understanding

  • ✅ Better for classification/NER

  • ✅ Parallel inference

BERT Disadvantages:

  • ❌ Cannot generate naturally

  • ❌ Needs [MASK] tokens

  • ❌ Two-stage training (MLM → task)

GPT Components

GPT uses a stack of Transformer decoder layers with causal (left-to-right) self-attention: each token can only attend to itself and preceding tokens, enforced by masking out the strict upper triangle of the attention score matrix (the allowed positions form a lower-triangular pattern). This autoregressive structure means the model factorizes the joint probability of a sequence as \(p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1})\). The architecture is otherwise similar to BERT – layer normalization, multi-head attention, and feed-forward layers with residual connections – but the causal mask is what enables GPT to generate text token by token. Modern GPT variants (GPT-2, GPT-3, GPT-4) scale this architecture to billions of parameters, demonstrating that autoregressive language models exhibit emergent capabilities with scale.

2.5. Positional Encoding: Theory and Variants

The Position Problem

Self-attention is permutation equivariant: permuting the input tokens simply permutes the outputs, so attention by itself has no notion of token order.

\[\text{Attention}(\text{permute}(X)) = \text{permute}(\text{Attention}(X))\]

We need to inject positional information.
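This can be verified directly: without positional information, shuffling the tokens just shuffles the outputs. A minimal check, using raw \(X\) as queries, keys, and values for brevity rather than learned projections:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(5, 8)  # 5 tokens, d_model = 8

def attn(X):
    # Unmasked self-attention with Q = K = V = X (no projections)
    scores = X @ X.T / X.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ X

perm = torch.tensor([3, 1, 4, 0, 2])
# Permutation equivariance: attn(permute(X)) == permute(attn(X))
assert torch.allclose(attn(X[perm]), attn(X)[perm], atol=1e-6)
print("attention alone carries no order information")
```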

Absolute Positional Encoding (GPT-2, GPT-3)

Learned embeddings:

\[x_t' = x_t + p_t\]

where \(p_t \in \mathbb{R}^d\) is a learned position embedding.

Advantages:

  • Simple and effective

  • Can learn task-specific patterns

  • Works well in practice

Disadvantages:

  • Fixed maximum length

  • Poor extrapolation beyond training length

  • No explicit distance notion

Sinusoidal Positional Encoding (Original Transformer)

Formula:

\[\text{PE}_{(t, 2i)} = \sin\left(\frac{t}{10000^{2i/d}}\right)\]
\[\text{PE}_{(t, 2i+1)} = \cos\left(\frac{t}{10000^{2i/d}}\right)\]

Key Properties:

  1. Deterministic: No learning required

  2. Unbounded length: Works for any sequence length

  3. Relative position: \(\text{PE}_{t+k}\) is linear function of \(\text{PE}_t\)

Mathematical intuition:

Different dimensions oscillate at different frequencies:

  • Low dimensions: fast oscillation (capture local patterns)

  • High dimensions: slow oscillation (capture global structure)
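The formula above is straightforward to implement; a minimal sketch:

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)      # 0, 2, 4, ...
    angles = t / (10000 ** (two_i / d_model))                     # (max_len, d/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)  # torch.Size([128, 64])
```

Plotting a few columns of `pe` makes the frequency spectrum visible: early columns oscillate rapidly, late columns drift slowly.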

Relative Positional Encoding

Instead of absolute positions, encode relative distances:

\[\text{Attention}_{ij} = \frac{x_i W_Q (x_j W_K + r_{i-j})^T}{\sqrt{d}}\]

where \(r_{i-j}\) is relative position encoding for distance \(i-j\).

Advantages:

  • Better length generalization

  • Explicit relative distance modeling

  • Used in T5, DeBERTa

Rotary Position Embedding (RoPE) - Modern Approach

Used in: GPT-Neo, GPT-J, LLaMA, PaLM

Idea: Rotate query and key vectors based on position

\[\begin{split}f_{\text{RoPE}}(x, t) = \begin{bmatrix} \cos(t\theta_1) & -\sin(t\theta_1) \\ \sin(t\theta_1) & \cos(t\theta_1) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\end{split}\]

Applied to each 2D subspace with different frequencies \(\theta_i = 10000^{-2i/d}\).

Key property: Inner product encodes relative position!

\[\langle f(q, m), f(k, n) \rangle = g(q, k, m-n)\]

where \(m, n\) are positions. The dot product only depends on relative distance \(m-n\).

Advantages:

  • Strong length extrapolation (often further improved with position interpolation)

  • Efficient computation (element-wise)

  • No learned parameters

  • State-of-the-art for LLMs
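The rotation can be applied directly to query/key features by pairing even and odd dimensions. A sketch (not an optimized implementation) that also checks the relative-position property stated above:

```python
import torch

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.
    x: (seq_len, d) with d even."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # one frequency per pair
    pos = torch.arange(seq_len).float().unsqueeze(1)      # (seq_len, 1)
    angles = pos * theta                                  # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q = torch.randn(1, 8).repeat(6, 1)   # same query vector at every position
k = torch.randn(1, 8).repeat(6, 1)   # same key vector at every position
rq, rk = rope(q), rope(k)
# Dot products at the same relative offset m - n = 2 must agree
assert torch.allclose(rq[3] @ rk[1], rq[5] @ rk[3], atol=1e-5)
```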

ALiBi (Attention with Linear Biases)

Used in: BLOOM, MPT

Add bias to attention scores based on distance:

\[\text{Attention}_{ij} = \text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d}} - m \cdot |i-j|\right)\]

where \(m\) is a head-specific slope (each head has different \(m\)).

Advantages:

  • No positional embeddings needed

  • Excellent extrapolation

  • Simple implementation

  • Saves parameters
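The bias is just a per-head linear function of token distance, precomputed once. A sketch using the geometric slope sequence \(2^{-8h/H}\) for head \(h\) of \(H\) (the schedule proposed in the ALiBi paper):

```python
import torch

def alibi_bias(n_heads, seq_len):
    """(n_heads, T, T) additive attention bias: -slope_h * |i - j|."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads)
                           for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()   # |i - j|
    return -slopes[:, None, None] * dist

bias = alibi_bias(8, 16)
print(bias.shape)  # torch.Size([8, 16, 16])
# Usage: scores = q @ k.transpose(-2, -1) / d_k**0.5 + bias, then causal mask + softmax
```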

Comparison Table

Method           | Parameters       | Extrapolation | Complexity | Used In
-----------------|------------------|---------------|------------|---------------------
Learned Absolute | \(O(L \cdot d)\) | ❌ Poor       | \(O(1)\)   | GPT-2, GPT-3
Sinusoidal       | 0                | ⚠️ Moderate   | \(O(1)\)   | Original Transformer
Relative         | \(O(L \cdot d)\) | ✅ Good       | \(O(L)\)   | T5, DeBERTa
RoPE             | 0                | ✅ Excellent  | \(O(1)\)   | LLaMA, GPT-J
ALiBi            | 0                | ✅ Excellent  | \(O(1)\)   | BLOOM, MPT

Practical Considerations

When to use each:

  1. Learned Absolute - When sequence length is fixed and known

  2. Sinusoidal - Simple baseline, no parameters

  3. RoPE - Best for LLMs - extrapolates well, efficient

  4. ALiBi - When extrapolation critical, simplest implementation

  5. Relative - When relative positions are semantically important

class CausalSelfAttention(nn.Module):
    """Masked self-attention for autoregressive generation."""
    def __init__(self, d_model, n_heads, dropout=0.1, max_len=512):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # QKV projections
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        
        # Causal mask
        self.register_buffer('mask', torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len))
    
    def forward(self, x):
        B, T, C = x.size()
        
        # QKV
        qkv = self.qkv(x)
        q, k, v = qkv.split(self.d_model, dim=2)
        
        # Reshape for multi-head
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        
        # Attention
        scores = (q @ k.transpose(-2, -1)) / np.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.attn_dropout(attn)
        
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.resid_dropout(self.out_proj(out))
        
        return out

class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        return self.dropout(self.linear2(F.gelu(self.linear1(x))))

class GPTBlock(nn.Module):
    """GPT transformer block."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)
    
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
class GPT(nn.Module):
    """GPT language model."""
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=8, 
                 d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)
        
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        
        # Weight tying
        self.token_emb.weight = self.head.weight
        
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx):
        B, T = idx.size()
        
        # Embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        x = self.dropout(self.token_emb(idx) + self.pos_emb(pos))
        
        # Transformer blocks
        for block in self.blocks:
            x = block(x)
        
        x = self.ln_f(x)
        logits = self.head(x)
        
        return logits
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively."""
        for _ in range(max_new_tokens):
            # Crop to the positional-embedding capacity
            max_len = self.pos_emb.num_embeddings
            idx_cond = idx if idx.size(1) <= max_len else idx[:, -max_len:]
            
            # Forward
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            
            # Top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        
        return idx

# Create model
vocab_size = 1000
model = GPT(vocab_size, d_model=128, n_layers=4, n_heads=4, d_ff=512).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Training

GPT is trained with a causal language modeling objective: predict the next token given all previous tokens. The loss is cross-entropy averaged over all positions in the sequence: \(\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p(x_t | x_{<t})\). Because the causal mask allows all positions to be predicted in a single forward pass (unlike autoregressive generation which processes one token at a time), training is highly parallelizable and efficient on GPUs. Key training hyperparameters include learning rate schedule (typically with linear warm-up and cosine decay), weight decay, and gradient clipping. Monitoring both training loss and validation perplexity helps detect overfitting, which becomes less of a concern as model and dataset scale increase.
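The validation metric mentioned above is related to the loss by \(\text{PPL} = \exp(\mathcal{L})\). A quick sketch with random (untrained) logits, just to show the computation:

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
logits = torch.randn(4, 63, vocab_size)          # model outputs for positions 0..T-2
targets = torch.randint(0, vocab_size, (4, 63))  # next-token targets (positions 1..T-1)

# Mean cross-entropy over all positions, then exponentiate
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
ppl = loss.exp()
# Untrained logits give perplexity on the order of vocab_size (the uniform baseline)
print(f"loss={loss.item():.3f}  perplexity={ppl.item():.0f}")
```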

3.5. Scaling Laws for Language Models

Kaplan et al. (2020): Power Law Scaling

Empirical Discovery: Model performance follows predictable power laws!

\[L(N, D, C) \sim \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \left(\frac{C_c}{C}\right)^{\alpha_C}\]

where:

  • \(L\) = test loss

  • \(N\) = number of parameters

  • \(D\) = dataset size (tokens)

  • \(C\) = compute budget (FLOPs)

  • \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\)

Key Findings

1. Performance improves as power laws:

\[L(N) \sim N^{-0.076} \quad \text{(when compute not limited)}\]

More parameters β†’ lower loss (until convergence).

2. Data and model size should scale together:

For compute budget \(C\):

\[N_{\text{optimal}} \propto C^{0.73}\]
\[D_{\text{optimal}} \propto C^{0.27}\]

Insight: Should scale model size faster than data size.

3. Extremely large models are undertrained:

GPT-3 (175B parameters) with 300B tokens is not optimal!

According to scaling laws:

  • Should have used ~400B+ tokens

  • Or used smaller model with same compute

Chinchilla Scaling Laws (Hoffmann et al., 2022)

Re-analysis with more data points:

\[N_{\text{optimal}} \propto C^{0.50}\]
\[D_{\text{optimal}} \propto C^{0.50}\]

Key difference: Model size and data should scale equally!

Practical implications:

Compute Budget | Kaplan et al.            | Chinchilla              | Reality
---------------|--------------------------|-------------------------|-----------------------------------
1e23 FLOPs     | 400B params, 200B tokens | 70B params, 1.4T tokens | Chinchilla: 70B, 1.4T
GPT-3          | 175B params, 300B tokens | ~30B params, 2T tokens  | Oversized model, undertrained data

Result: Chinchilla (70B) outperforms GPT-3 (175B) with 2.5× fewer parameters!

LLaMA Approach

Meta’s strategy based on Chinchilla:

  • LLaMA-7B: Trained on 1T tokens

  • LLaMA-13B: Trained on 1T tokens

  • LLaMA-65B: Trained on 1.4T tokens

Conclusion: Smaller, longer-trained models > larger, shorter-trained models

Practical Scaling Recommendations

For a fixed compute budget \(C\), a common approximation is \(C \approx 6ND\) (about 6 FLOPs per parameter per training token). Combined with the Chinchilla rule of thumb of roughly 20 tokens per parameter:

\[N \approx \sqrt{C / 120} \quad \text{(parameters)}\]
\[D \approx 20N \quad \text{(tokens)}\]

Example: \(C = 10^{23}\) FLOPs

  • \(N \approx 29\)B parameters

  • \(D \approx 580\)B tokens
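This arithmetic is easy to script. A minimal sketch assuming training cost \(C \approx 6ND\) and roughly 20 tokens per parameter (the Chinchilla rule of thumb); different fitted scaling laws give somewhat different constants:

```python
def compute_optimal(C, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Rough compute-optimal model/data sizing from a FLOP budget C."""
    # Solve C = 6 * N * (20 * N) for N, then derive D
    N = (C / (flops_per_param_token * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = compute_optimal(1e23)
print(f"N ~ {N / 1e9:.0f}B params, D ~ {D / 1e9:.0f}B tokens")
```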

Scaling Law Visualization

Loss vs Parameters (log-log):

Loss (log) |
           |  ╲
           |   ╲  Slope ≈ -0.076
           |    ╲
           |     ╲___
           |         ╲___
           |             ╲___
           |________________╲___________
                     Parameters (log)

Compute-Optimal Frontier:

Data (tokens) |
              |        ╱ Chinchilla
              |      ╱   (equal scaling)
              |    ╱
              |  ╱  Kaplan
              |╱    (data scales slower)
              |_________________________
                   Parameters

Emergent Abilities

Surprising finding: some abilities appear suddenly at scale!

Ability              | Emergence Threshold (approx.)
---------------------|------------------------------
Arithmetic           | ~13B parameters
Multi-step reasoning | ~60B parameters
In-context learning  | ~10B parameters
Code generation      | ~20B parameters

Not smooth scaling - discrete jumps in capability!

Bitter Lesson (Sutton, 2019)

Historical pattern:

  • Initially: Hand-crafted features/architecture wins

  • Eventually: Scale + compute + simple methods win

GPT validates this:

  • Simple architecture (transformer)

  • Simple objective (next token prediction)

  • Massive scale → SOTA on many tasks

Open Questions

  1. When do scaling laws break?

    • Do they hold to 1T+ parameters?

    • What about multimodal models?

  2. Sample efficiency:

    • Humans learn from far less data

    • Can we improve data efficiency?

  3. Emergent abilities:

    • Can we predict them?

    • What causes discrete jumps?

  4. Optimal architecture:

    • Does depth/width ratio matter at scale?

    • Sparse vs dense models?

# Dummy training
def train_gpt(model, data_loader, epochs=3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    
    losses = []
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for batch_idx, batch in enumerate(data_loader):
            batch = batch.to(device)
            
            # Forward
            logits = model(batch)
            
            # Loss: positions 0..T-2 predict tokens 1..T-1 (shifted by one)
            loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   batch[:, 1:].reshape(-1))
            
            # Backward
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            epoch_loss += loss.item()
            
            if batch_idx % 10 == 0:
                print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")
        
        avg_loss = epoch_loss / len(data_loader)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1} avg loss: {avg_loss:.4f}")
    
    return losses

# Create dummy data
dummy_data = torch.randint(0, vocab_size, (200, 64))
data_loader = torch.utils.data.DataLoader(dummy_data, batch_size=8, shuffle=True)

losses = train_gpt(model, data_loader, epochs=2)

plt.figure(figsize=(8, 5))
plt.plot(losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('GPT Training')
plt.grid(True, alpha=0.3)
plt.show()

Generation Strategies

Once trained, GPT generates text by sampling from the predicted next-token distribution one token at a time. Several decoding strategies control the quality-diversity trade-off:

  • Greedy decoding: Always pick the most probable token. Deterministic but repetitive.

  • Temperature sampling: Scale logits by \(1/T\) before softmax. \(T < 1\) sharpens the distribution (more deterministic), \(T > 1\) flattens it (more diverse).

  • Top-k sampling: Restrict to the \(k\) most probable tokens, then sample.

  • Nucleus (top-p) sampling: Restrict to the smallest set of tokens whose cumulative probability exceeds \(p\), then sample.

Top-p sampling is generally preferred in practice because it adapts the candidate set size to the entropy of each distribution, avoiding both repetitive and incoherent outputs.

4.5. Advanced Generation Strategies: Complete Analysis

Decoding Algorithms Comparison

Strategy        | Deterministic | Diversity | Quality   | Speed      | Use Case
----------------|---------------|-----------|-----------|------------|---------------------------
Greedy          | ✅ Yes        | ❌ Low    | ⚠️ Medium | ⚡⚡⚡ Fast  | Fast prototyping
Beam Search     | ✅ Yes        | ❌ Low    | ✅ High   | ⚡⚡ Medium | Translation, summarization
Sampling        | ❌ No         | ✅ High   | ⚠️ Varies | ⚡⚡⚡ Fast  | Creative generation
Top-k           | ❌ No         | ⚠️ Medium | ✅ Good   | ⚡⚡⚡ Fast  | General purpose
Nucleus (Top-p) | ❌ No         | ✅ High   | ✅ High   | ⚡⚡ Medium | Best overall

1. Greedy Decoding

\[x_t = \arg\max_{x} P(x | x_{<t})\]

Problems:

  • Repetition: Gets stuck in loops

  • Generic text: Lacks creativity

  • Local optima: Misses better global sequences

Example:

Input: "The cat"
Greedy: "The cat sat on the mat and the cat sat on the mat and the cat..."

3. Temperature Sampling

Adjust logits before sampling:

\[P(x_t = x \mid x_{<t}) = \frac{\exp(z_x / T)}{\sum_{x'} \exp(z_{x'} / T)}\]

where \(z_x\) is the logit for token \(x\), and \(T\) is the temperature.

Effect of temperature:

\(T\)            | Behavior               | Distribution Shape
-----------------|------------------------|-------------------
\(T \to 0\)      | Greedy (deterministic) | Sharp peak
\(T = 1\)        | Original distribution  | Unchanged
\(T > 1\)        | More random            | Flattened
\(T \to \infty\) | Uniform sampling       | Flat

Mathematical intuition:

Temperature controls entropy:

\[H = -\sum_x P(x) \log P(x)\]
  • Low \(T\) β†’ Low entropy β†’ Confident

  • High \(T\) β†’ High entropy β†’ Uncertain
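The entropy-temperature relationship is easy to see numerically. A small sketch with arbitrary example logits:

```python
import torch
import torch.nn.functional as F

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum().item()

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])  # arbitrary example values
ents = []
for T in (0.5, 1.0, 2.0):
    p = F.softmax(logits / T, dim=-1)
    ents.append(entropy(p))
    print(f"T={T}: entropy={ents[-1]:.3f}")

# Entropy grows with temperature: sharper at low T, flatter at high T
assert ents[0] < ents[1] < ents[2]
```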

4. Top-k Sampling

Sample from top-\(k\) most probable tokens:

\[\begin{split}P_{\text{top-k}}(x_t) = \begin{cases} \frac{P(x_t)}{\sum_{x \in V_k} P(x)} & \text{if } x_t \in V_k \\ 0 & \text{otherwise} \end{cases}\end{split}\]

where \(V_k\) is set of top-\(k\) tokens.

Problems:

  • Fixed \(k\) doesn’t adapt to distribution shape

  • When should \(k=5\) vs \(k=50\)?

Example:

Flat distribution:     Peaked distribution:
P = [0.2, 0.19, 0.18,  P = [0.8, 0.05, 0.04, 
     0.17, 0.16, ...]       0.03, 0.02, ...]

k=5: Good (covers 90%) k=5: Bad (includes noise)
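The fixed-\(k\) problem can be quantified by checking how much probability mass top-5 keeps, and how small the weakest kept token is. A toy illustration with made-up distributions like those above:

```python
import torch

flat   = torch.tensor([0.20, 0.19, 0.18, 0.17, 0.16, 0.10])
peaked = torch.tensor([0.80, 0.06, 0.05, 0.04, 0.03, 0.02])

for name, p in [("flat", flat), ("peaked", peaked)]:
    top5 = p.topk(5).values
    print(f"{name}: top-5 mass = {top5.sum():.2f}, "
          f"smallest kept prob = {top5.min():.2f}")
# flat:   top-5 mass = 0.90, smallest kept prob = 0.16  (reasonable)
# peaked: top-5 mass = 0.98, smallest kept prob = 0.03  (keeps near-noise tokens)
```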

5. Nucleus Sampling (Top-p)

Best modern approach (Holtzman et al., 2019)

Sample from the smallest set of tokens with cumulative probability ≥ \(p\):

\[V_p = \min \left\{ V : \sum_{x \in V} P(x) \geq p \right\}\]

Then sample from renormalized distribution over \(V_p\).

Algorithm:

def nucleus_sampling(logits, p=0.9):
    """Sample from the smallest token set with cumulative probability >= p.
    Expects 1-D logits of shape (vocab_size,)."""
    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(
        torch.softmax(logits, dim=-1),
        descending=True
    )

    # Cumulative probabilities over the sorted vocabulary
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

    # Index of the first token that pushes cumulative mass past p
    # (kept, so the nucleus always reaches at least p)
    cutoff_index = int((cumsum_probs > p).nonzero()[0, 0])

    # Keep only the nucleus
    nucleus_probs = sorted_probs[:cutoff_index + 1]
    nucleus_indices = sorted_indices[:cutoff_index + 1]

    # Renormalize and sample
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    next_token = nucleus_indices[
        torch.multinomial(nucleus_probs, 1)
    ]

    return next_token

Advantages:

  • ✅ Adaptive: vocabulary size changes based on the distribution

  • ✅ Quality: filters low-probability garbage

  • ✅ Diversity: still allows exploration

  • ✅ Consistent: works across different contexts

Typical values: \(p \in [0.9, 0.95]\)

6. Repetition Penalty

Problem: Models often repeat phrases.

Solution: Penalize tokens that already appeared:

\[\begin{split}\text{score}(x_t) = \begin{cases} \frac{\text{logit}(x_t)}{\theta} & \text{if } x_t \in x_{<t} \\ \text{logit}(x_t) & \text{otherwise} \end{cases}\end{split}\]

where \(\theta > 1\) is the repetition penalty (typically 1.2).
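One subtlety in the uniform division above: dividing a negative logit by \(\theta > 1\) makes it larger, i.e. more likely. Practical implementations (e.g. the CTRL paper's formulation) therefore divide positive logits and multiply negative ones; a sketch:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Penalize tokens already generated (CTRL-style sign handling).
    logits: (vocab_size,) tensor; generated_ids: 1-D tensor of token ids."""
    logits = logits.clone()
    for tok in set(generated_ids.tolist()):
        if logits[tok] > 0:
            logits[tok] = logits[tok] / theta   # shrink positive logits
        else:
            logits[tok] = logits[tok] * theta   # push negative logits lower
    return logits

penalized = apply_repetition_penalty(
    torch.tensor([2.0, -1.0, 0.5]), torch.tensor([0, 1]), theta=1.2
)
print(penalized)  # tensor([ 1.6667, -1.2000,  0.5000])
```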

7. Contrastive Search (SimCTG)

Idea: Balance likelihood and diversity.

\[x_t = \arg\max_x \left[ (1-\alpha) \cdot \log P(x|x_{<t}) - \alpha \cdot \max_{i<t} \text{sim}(h_x, h_i) \right]\]
  • First term: Language model score

  • Second term: Penalty for similarity to previous tokens

  • \(\alpha\): Trade-off parameter

Result: Coherent + diverse text!

Practical Recommendations

For different tasks:

Task               | Best Strategy                | Parameters
-------------------|------------------------------|--------------------------
Creative writing   | Nucleus + temperature        | \(p=0.95\), \(T=0.8\)
Translation        | Beam search                  | \(k=5\), length penalty
Code generation    | Nucleus                      | \(p=0.9\), \(T=0.2\)
Chatbot            | Nucleus + repetition penalty | \(p=0.9\), \(\theta=1.2\)
Summarization      | Beam search                  | \(k=4\)
Question answering | Greedy or beam               | \(k=1\) or \(k=3\)

Implementation Example

def advanced_generate(model, context, max_len=100,
                     strategy='nucleus', p=0.9, temp=1.0,
                     k=50, rep_penalty=1.0, eos_token=None):
    """Unified generation with multiple strategies (batch size 1)."""

    for _ in range(max_len):
        logits = model(context)[:, -1, :] / temp

        # Apply repetition penalty to already-generated tokens
        if rep_penalty != 1.0:
            for token in set(context[0].tolist()):
                logits[0, token] /= rep_penalty

        probs = F.softmax(logits, dim=-1)

        if strategy == 'greedy':
            next_token = probs.argmax(dim=-1, keepdim=True)

        elif strategy == 'top_k':
            top_probs, top_indices = probs.topk(k)
            next_token = top_indices.gather(
                -1, torch.multinomial(top_probs, 1)
            )

        elif strategy == 'nucleus':
            sorted_probs, sorted_indices = probs.sort(descending=True)
            cumsum = sorted_probs.cumsum(dim=-1)
            # nonzero() returns (row, col) pairs; the column of the first
            # entry past p is the nucleus cutoff
            cutoff = int((cumsum > p).nonzero()[0, 1]) + 1
            nucleus_probs = sorted_probs[:, :cutoff]
            nucleus_probs = nucleus_probs / nucleus_probs.sum()
            idx = torch.multinomial(nucleus_probs, 1)
            next_token = sorted_indices.gather(-1, idx)

        context = torch.cat([context, next_token], dim=1)

        # Stop at end-of-sequence if one is defined
        if eos_token is not None and next_token.item() == eos_token:
            break

    return context
# Test generation
model.eval()
context = torch.randint(0, vocab_size, (1, 10)).to(device)

print("Context:", context[0].tolist())

# Sampling at T=1.0 (note: generate() always samples, so this is not greedy)
generated = model.generate(context, max_new_tokens=20, temperature=1.0)
print("\nSampled (T=1.0):", generated[0].tolist())

# Temperature sampling
generated = model.generate(context, max_new_tokens=20, temperature=0.8)
print("\nTemp=0.8:", generated[0].tolist())

# Top-k
generated = model.generate(context, max_new_tokens=20, temperature=1.0, top_k=10)
print("\nTop-k=10:", generated[0].tolist())

Summary

GPT Architecture:

Decoder-only: Causal masked attention

Autoregressive: \(P(x_t | x_{<t})\)

Key Components:

  • Token + position embeddings

  • Causal self-attention (masked)

  • Feed-forward networks

  • Layer normalization

Training:

\[\mathcal{L} = -\sum_t \log P(x_t | x_{<t})\]

Generation:

  • Greedy: argmax

  • Temperature: control randomness

  • Top-k: sample from top k tokens

  • Top-p (nucleus): cumulative probability threshold

GPT Variants:

  • GPT-2: 1.5B parameters

  • GPT-3: 175B parameters

  • GPT-4: Multimodal

  • InstructGPT: RLHF fine-tuning

Applications:

  • Text generation

  • Code completion

  • Question answering

  • Translation

Next Steps:

  • 12_bert_architecture.ipynb - Compare encoder

  • Explore instruction tuning

  • Fine-tune on specific tasks