import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

1. GPT vs BERT

Feature      | GPT                   | BERT
-------------|-----------------------|---------------
Architecture | Decoder-only          | Encoder-only
Attention    | Causal (masked)       | Bidirectional
Objective    | Next token prediction | MLM + NSP
Use case     | Generation            | Understanding

Autoregressive Language Modeling

\[P(x_{1:T}) = \prod_{t=1}^T P(x_t | x_{<t})\]

Training: maximize the log-likelihood \(\mathcal{L} = \sum_{t=1}^T \log P(x_t | x_{<t})\)
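The chain-rule factorization can be checked numerically. A minimal sketch with made-up per-step conditionals (the numbers are purely illustrative, not from any model):

```python
import math

# Hypothetical per-step conditionals P(x_t | x_<t) for a 4-token sequence
step_probs = [0.5, 0.25, 0.1, 0.4]

# Joint probability is the product of the chain-rule factors
joint = math.prod(step_probs)

# Log-likelihood (what training maximizes) is the sum of log factors
log_lik = sum(math.log(p) for p in step_probs)

assert abs(math.exp(log_lik) - joint) < 1e-12
print(f"P(x_1:T) = {joint:.4f}, log-likelihood = {log_lik:.4f}")
```

Working in log space is what makes the per-position losses additive during training.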

1.5. Causal Masking: Mathematical Foundation

Why Causal Masking?

In autoregressive generation, we must prevent “information leakage” from future tokens:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]

where the causal mask \(M\) is:

\[\begin{split}M_{ij} = \begin{cases} 0 & \text{if } i \geq j \text{ (can attend)} \\ -\infty & \text{if } i < j \text{ (cannot attend)} \end{cases}\end{split}\]

Interpretation:

  • Token at position \(i\) can only attend to positions \(j \leq i\)

  • After softmax, \(\exp(-\infty) = 0\) → future positions get zero weight

  • Ensures \(P(x_t | x_{<t})\) only uses past context

Comparison of Attention Patterns

Model     | Attention Type                | Mask Pattern              | Training Objective
----------|-------------------------------|---------------------------|----------------------
GPT       | Causal (unidirectional)       | Lower triangular          | Next token prediction
BERT      | Bidirectional                 | Full matrix (some masked) | MLM + NSP
Prefix LM | Bidirectional prefix + causal | Block matrix              | Span prediction

Mathematical Consequences

1. Training Efficiency:

For sequence length \(T\), we get \(T\) training signals per sequence:

\[\mathcal{L} = -\sum_{t=1}^T \log P(x_t | x_1, \ldots, x_{t-1})\]

Each position learns to predict the next token given all previous context.

2. Generation Constraint:

At inference, must generate sequentially:

  • Cannot parallelize generation across time

  • Complexity: \(O(T)\) forward passes for \(T\) tokens

  • But: Can batch generate multiple sequences

3. Context Window:

Effective context is cumulative:

  • Position 1: sees nothing (just BOS)

  • Position \(t\): sees \(t-1\) previous tokens

  • Position \(T\): sees full sequence

Implementation Details

# Creating causal mask
def create_causal_mask(seq_len):
    """Lower triangular matrix for causal attention."""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    # Convert 0s to -inf, 1s to 0
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

# Example: 5x5 causal mask
mask = create_causal_mask(5)
print("Causal mask pattern:")
print(mask)

Attention scores before/after masking:

Before:  [[0.8, 0.6, 0.5, 0.4, 0.3],    After softmax with mask:
          [0.7, 0.9, 0.6, 0.5, 0.4],     [[1.0, 0.0, 0.0, 0.0, 0.0],
          [0.6, 0.5, 0.8, 0.7, 0.5],  →   [0.4, 0.6, 0.0, 0.0, 0.0],
          [0.5, 0.6, 0.7, 0.9, 0.8],      [0.2, 0.3, 0.5, 0.0, 0.0],
          [0.4, 0.5, 0.6, 0.7, 1.0]]      [0.1, 0.2, 0.2, 0.3, 0.2]]

Causal vs Bidirectional Trade-offs

GPT (Causal) Advantages:

  • ✅ Natural for generation tasks

  • ✅ Coherent long-form text

  • ✅ Simple training objective

  • ✅ No special tokens needed

GPT Disadvantages:

  • ❌ Cannot see future context

  • ❌ Less effective for classification

  • ❌ Slower inference (sequential)

BERT (Bidirectional) Advantages:

  • ✅ Full context for understanding

  • ✅ Better for classification/NER

  • ✅ Parallel inference

BERT Disadvantages:

  • ❌ Cannot generate naturally

  • ❌ Needs [MASK] tokens

  • ❌ Two-stage training (MLM → task)

GPT Components

GPT uses a stack of Transformer decoder layers with causal (left-to-right) self-attention: each token can only attend to itself and preceding tokens, enforced by masking out the strict upper triangle of the attention score matrix (the allowed positions form a lower-triangular pattern). This autoregressive structure means the model factorizes the joint probability of a sequence as \(p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1})\). The architecture is otherwise similar to BERT – layer normalization, multi-head attention, and feed-forward layers with residual connections – but the causal mask is what enables GPT to generate text token by token. Modern GPT variants (GPT-2, GPT-3, GPT-4) scale this architecture to billions of parameters, demonstrating that autoregressive language models exhibit emergent capabilities with scale.

2.5. Positional Encoding: Theory and Variants

The Position Problem

Self-attention is permutation equivariant: permuting the input tokens simply permutes the outputs, so attention by itself has no notion of token order.

\[\text{Attention}(\text{permute}(X)) = \text{permute}(\text{Attention}(X))\]

We need to inject positional information.
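This can be verified directly: without positional information, shuffling the tokens just shuffles the outputs. A minimal check, using raw \(X\) as queries, keys, and values for brevity rather than learned projections:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(5, 8)  # 5 tokens, d_model = 8

def attn(X):
    # Unmasked self-attention with Q = K = V = X (no projections)
    scores = X @ X.T / X.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ X

perm = torch.tensor([3, 1, 4, 0, 2])
# Permutation equivariance: attn(permute(X)) == permute(attn(X))
assert torch.allclose(attn(X[perm]), attn(X)[perm], atol=1e-6)
print("attention alone carries no order information")
```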

Absolute Positional Encoding (GPT-2, GPT-3)

Learned embeddings:

\[x_t' = x_t + p_t\]

where \(p_t \in \mathbb{R}^d\) is a learned position embedding.

Advantages:

  • Simple and effective

  • Can learn task-specific patterns

  • Works well in practice

Disadvantages:

  • Fixed maximum length

  • Poor extrapolation beyond training length

  • No explicit distance notion

Sinusoidal Positional Encoding (Original Transformer)

Formula:

\[\text{PE}_{(t, 2i)} = \sin\left(\frac{t}{10000^{2i/d}}\right)\]
\[\text{PE}_{(t, 2i+1)} = \cos\left(\frac{t}{10000^{2i/d}}\right)\]

Key Properties:

  1. Deterministic: No learning required

  2. Unbounded length: Works for any sequence length

  3. Relative position: \(\text{PE}_{t+k}\) is linear function of \(\text{PE}_t\)

Mathematical intuition:

Different dimensions oscillate at different frequencies:

  • Low dimensions: fast oscillation (capture local patterns)

  • High dimensions: slow oscillation (capture global structure)
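The formula above is straightforward to implement; a minimal sketch:

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)      # 0, 2, 4, ...
    angles = t / (10000 ** (two_i / d_model))                     # (max_len, d/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)  # torch.Size([128, 64])
```

Plotting a few columns of `pe` makes the frequency spectrum visible: early columns oscillate rapidly, late columns drift slowly.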

Relative Positional Encoding

Instead of absolute positions, encode relative distances:

\[\text{Attention}_{ij} = \frac{x_i W_Q (x_j W_K + r_{i-j})^T}{\sqrt{d}}\]

where \(r_{i-j}\) is relative position encoding for distance \(i-j\).

Advantages:

  • Better length generalization

  • Explicit relative distance modeling

  • Used in T5, DeBERTa

Rotary Position Embedding (RoPE) - Modern Approach

Used in: GPT-Neo, GPT-J, LLaMA, PaLM

Idea: Rotate query and key vectors based on position

\[\begin{split}f_{\text{RoPE}}(x, t) = \begin{bmatrix} \cos(t\theta_1) & -\sin(t\theta_1) \\ \sin(t\theta_1) & \cos(t\theta_1) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\end{split}\]

Applied to each 2D subspace with different frequencies \(\theta_i = 10000^{-2i/d}\).

Key property: Inner product encodes relative position!

\[\langle f(q, m), f(k, n) \rangle = g(q, k, m-n)\]

where \(m, n\) are positions. The dot product only depends on relative distance \(m-n\).

Advantages:

  • Strong length extrapolation (often further improved with position interpolation)

  • Efficient computation (element-wise)

  • No learned parameters

  • State-of-the-art for LLMs
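The rotation can be applied directly to query/key features by pairing even and odd dimensions. A sketch (not an optimized implementation) that also checks the relative-position property stated above:

```python
import torch

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.
    x: (seq_len, d) with d even."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # one frequency per pair
    pos = torch.arange(seq_len).float().unsqueeze(1)      # (seq_len, 1)
    angles = pos * theta                                  # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q = torch.randn(1, 8).repeat(6, 1)   # same query vector at every position
k = torch.randn(1, 8).repeat(6, 1)   # same key vector at every position
rq, rk = rope(q), rope(k)
# Dot products at the same relative offset m - n = 2 must agree
assert torch.allclose(rq[3] @ rk[1], rq[5] @ rk[3], atol=1e-5)
```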

ALiBi (Attention with Linear Biases)

Used in: BLOOM, MPT

Add bias to attention scores based on distance:

\[\text{Attention}_{ij} = \text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d}} - m \cdot |i-j|\right)\]

where \(m\) is a head-specific slope (each head has different \(m\)).

Advantages:

  • No positional embeddings needed

  • Excellent extrapolation

  • Simple implementation

  • Saves parameters
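The bias is just a per-head linear function of token distance, precomputed once. A sketch using the geometric slope sequence \(2^{-8h/H}\) for head \(h\) of \(H\) (the schedule proposed in the ALiBi paper):

```python
import torch

def alibi_bias(n_heads, seq_len):
    """(n_heads, T, T) additive attention bias: -slope_h * |i - j|."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads)
                           for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()   # |i - j|
    return -slopes[:, None, None] * dist

bias = alibi_bias(8, 16)
print(bias.shape)  # torch.Size([8, 16, 16])
# Usage: scores = q @ k.transpose(-2, -1) / d_k**0.5 + bias, then causal mask + softmax
```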

Comparison Table

Method           | Parameters       | Extrapolation | Complexity | Used In
-----------------|------------------|---------------|------------|---------------------
Learned Absolute | \(O(L \cdot d)\) | ❌ Poor       | \(O(1)\)   | GPT-2, GPT-3
Sinusoidal       | 0                | ⚠️ Moderate   | \(O(1)\)   | Original Transformer
Relative         | \(O(L \cdot d)\) | ✅ Good       | \(O(L)\)   | T5, DeBERTa
RoPE             | 0                | ✅ Excellent  | \(O(1)\)   | LLaMA, GPT-J
ALiBi            | 0                | ✅ Excellent  | \(O(1)\)   | BLOOM, MPT

Practical Considerations

When to use each:

  1. Learned Absolute - When sequence length is fixed and known

  2. Sinusoidal - Simple baseline, no parameters

  3. RoPE - Best for LLMs - extrapolates well, efficient

  4. ALiBi - When extrapolation critical, simplest implementation

  5. Relative - When relative positions are semantically important

class CausalSelfAttention(nn.Module):
    """Masked self-attention for autoregressive generation."""
    def __init__(self, d_model, n_heads, dropout=0.1, max_len=512):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # QKV projections
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        
        # Causal mask
        self.register_buffer('mask', torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len))
    
    def forward(self, x):
        B, T, C = x.size()
        
        # QKV
        qkv = self.qkv(x)
        q, k, v = qkv.split(self.d_model, dim=2)
        
        # Reshape for multi-head
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        
        # Attention
        scores = (q @ k.transpose(-2, -1)) / np.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.attn_dropout(attn)
        
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.resid_dropout(self.out_proj(out))
        
        return out

class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        return self.dropout(self.linear2(F.gelu(self.linear1(x))))

class GPTBlock(nn.Module):
    """GPT transformer block."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)
    
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
class GPT(nn.Module):
    """GPT language model."""
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=8, 
                 d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)
        
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        
        # Weight tying
        self.token_emb.weight = self.head.weight
        
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx):
        B, T = idx.size()
        
        # Embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        x = self.dropout(self.token_emb(idx) + self.pos_emb(pos))
        
        # Transformer blocks
        for block in self.blocks:
            x = block(x)
        
        x = self.ln_f(x)
        logits = self.head(x)
        
        return logits
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively."""
        for _ in range(max_new_tokens):
            # Crop to the positional-embedding capacity
            max_len = self.pos_emb.num_embeddings
            idx_cond = idx if idx.size(1) <= max_len else idx[:, -max_len:]
            
            # Forward
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            
            # Top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        
        return idx

# Create model
vocab_size = 1000
model = GPT(vocab_size, d_model=128, n_layers=4, n_heads=4, d_ff=512).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Training

GPT is trained with a causal language modeling objective: predict the next token given all previous tokens. The loss is cross-entropy averaged over all positions in the sequence: \(\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p(x_t | x_{<t})\). Because the causal mask allows all positions to be predicted in a single forward pass (unlike autoregressive generation which processes one token at a time), training is highly parallelizable and efficient on GPUs. Key training hyperparameters include learning rate schedule (typically with linear warm-up and cosine decay), weight decay, and gradient clipping. Monitoring both training loss and validation perplexity helps detect overfitting, which becomes less of a concern as model and dataset scale increase.
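The validation metric mentioned above is related to the loss by \(\text{PPL} = \exp(\mathcal{L})\). A quick sketch with random (untrained) logits, just to show the computation:

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
logits = torch.randn(4, 63, vocab_size)          # model outputs for positions 0..T-2
targets = torch.randint(0, vocab_size, (4, 63))  # next-token targets (positions 1..T-1)

# Mean cross-entropy over all positions, then exponentiate
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
ppl = loss.exp()
# Untrained logits give perplexity on the order of vocab_size (the uniform baseline)
print(f"loss={loss.item():.3f}  perplexity={ppl.item():.0f}")
```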

3.5. Scaling Laws for Language Models

Kaplan et al. (2020): Power Law Scaling

Empirical Discovery: Model performance follows predictable power laws!

\[L(N, D, C) \sim \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \left(\frac{C_c}{C}\right)^{\alpha_C}\]

where:

  • \(L\) = test loss

  • \(N\) = number of parameters

  • \(D\) = dataset size (tokens)

  • \(C\) = compute budget (FLOPs)

  • \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\)

Key Findings

1. Performance improves as power laws:

\[L(N) \sim N^{-0.076} \quad \text{(when compute not limited)}\]

More parameters β†’ lower loss (until convergence).

2. Data and model size should scale together:

For compute budget \(C\):

\[N_{\text{optimal}} \propto C^{0.73}\]
\[D_{\text{optimal}} \propto C^{0.27}\]

Insight: Should scale model size faster than data size.

3. Extremely large models are undertrained:

GPT-3 (175B parameters) with 300B tokens is not optimal!

According to scaling laws:

  • Should have used ~400B+ tokens

  • Or used smaller model with same compute

Chinchilla Scaling Laws (Hoffmann et al., 2022)

Re-analysis with more data points:

\[N_{\text{optimal}} \propto C^{0.50}\]
\[D_{\text{optimal}} \propto C^{0.50}\]

Key difference: Model size and data should scale equally!

Practical implications:

Compute Budget | Kaplan et al.            | Chinchilla              | Reality
---------------|--------------------------|-------------------------|-----------------------------------
1e23 FLOPs     | 400B params, 200B tokens | 70B params, 1.4T tokens | Chinchilla: 70B, 1.4T
GPT-3          | 175B params, 300B tokens | ~30B params, 2T tokens  | Oversized model, undertrained data

Result: Chinchilla (70B) outperforms GPT-3 (175B) with 2.5× fewer parameters!

LLaMA Approach

Meta’s strategy based on Chinchilla:

  • LLaMA-7B: Trained on 1T tokens

  • LLaMA-13B: Trained on 1T tokens

  • LLaMA-65B: Trained on 1.4T tokens

Conclusion: Smaller, longer-trained models > larger, shorter-trained models

Practical Scaling Recommendations

For a fixed compute budget \(C\), a common approximation is \(C \approx 6ND\) (about 6 FLOPs per parameter per training token). Combined with the Chinchilla rule of thumb of roughly 20 tokens per parameter:

\[N \approx \sqrt{C / 120} \quad \text{(parameters)}\]
\[D \approx 20N \quad \text{(tokens)}\]

Example: \(C = 10^{23}\) FLOPs

  • \(N \approx 29\)B parameters

  • \(D \approx 580\)B tokens
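This arithmetic is easy to script. A minimal sketch assuming training cost \(C \approx 6ND\) and roughly 20 tokens per parameter (the Chinchilla rule of thumb); different fitted scaling laws give somewhat different constants:

```python
def compute_optimal(C, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Rough compute-optimal model/data sizing from a FLOP budget C."""
    # Solve C = 6 * N * (20 * N) for N, then derive D
    N = (C / (flops_per_param_token * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = compute_optimal(1e23)
print(f"N ~ {N / 1e9:.0f}B params, D ~ {D / 1e9:.0f}B tokens")
```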

Scaling Law Visualization

Loss vs Parameters (log-log):

Loss (log) |
           |  ╲
           |   ╲  Slope ≈ -0.076
           |    ╲
           |     ╲___
           |         ╲___
           |             ╲___
           |________________╲___________
                     Parameters (log)

Compute-Optimal Frontier:

Data (tokens) |
              |        ╱ Chinchilla
              |      ╱   (equal scaling)
              |    ╱
              |  ╱  Kaplan
              |╱    (data scales slower)
              |_________________________
                   Parameters

Emergent Abilities

Surprising finding: some abilities appear suddenly at scale!

Ability              | Emergence Threshold (approx.)
---------------------|------------------------------
Arithmetic           | ~13B parameters
Multi-step reasoning | ~60B parameters
In-context learning  | ~10B parameters
Code generation      | ~20B parameters

Not smooth scaling - discrete jumps in capability!

Bitter Lesson (Sutton, 2019)

Historical pattern:

  • Initially: Hand-crafted features/architecture wins

  • Eventually: Scale + compute + simple methods win

GPT validates this:

  • Simple architecture (transformer)

  • Simple objective (next token prediction)

  • Massive scale → SOTA on many tasks

Open Questions

  1. When do scaling laws break?

    • Do they hold to 1T+ parameters?

    • What about multimodal models?

  2. Sample efficiency:

    • Humans learn from far less data

    • Can we improve data efficiency?

  3. Emergent abilities:

    • Can we predict them?

    • What causes discrete jumps?

  4. Optimal architecture:

    • Does depth/width ratio matter at scale?

    • Sparse vs dense models?

# Dummy training
def train_gpt(model, data_loader, epochs=3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    
    losses = []
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for batch_idx, batch in enumerate(data_loader):
            batch = batch.to(device)
            
            # Forward
            logits = model(batch)
            
            # Loss: positions 0..T-2 predict tokens 1..T-1 (shifted by one)
            loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   batch[:, 1:].reshape(-1))
            
            # Backward
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            epoch_loss += loss.item()
            
            if batch_idx % 10 == 0:
                print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")
        
        avg_loss = epoch_loss / len(data_loader)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1} avg loss: {avg_loss:.4f}")
    
    return losses

# Create dummy data
dummy_data = torch.randint(0, vocab_size, (200, 64))
data_loader = torch.utils.data.DataLoader(dummy_data, batch_size=8, shuffle=True)

losses = train_gpt(model, data_loader, epochs=2)

plt.figure(figsize=(8, 5))
plt.plot(losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('GPT Training')
plt.grid(True, alpha=0.3)
plt.show()

Generation Strategies

Once trained, GPT generates text by sampling from the predicted next-token distribution one token at a time. Several decoding strategies control the quality-diversity trade-off:

  • Greedy decoding: Always pick the most probable token. Deterministic but repetitive.

  • Temperature sampling: Scale logits by \(1/T\) before softmax. \(T < 1\) sharpens the distribution (more deterministic), \(T > 1\) flattens it (more diverse).

  • Top-k sampling: Restrict to the \(k\) most probable tokens, then sample.

  • Nucleus (top-p) sampling: Restrict to the smallest set of tokens whose cumulative probability exceeds \(p\), then sample.

Top-p sampling is generally preferred in practice because it adapts the candidate set size to the entropy of each distribution, avoiding both repetitive and incoherent outputs.

4.5. Advanced Generation Strategies: Complete Analysis

Decoding Algorithms Comparison

Strategy        | Deterministic | Diversity | Quality   | Speed      | Use Case
----------------|---------------|-----------|-----------|------------|---------------------------
Greedy          | ✅ Yes        | ❌ Low    | ⚠️ Medium | ⚡⚡⚡ Fast  | Fast prototyping
Beam Search     | ✅ Yes        | ❌ Low    | ✅ High   | ⚡⚡ Medium | Translation, summarization
Sampling        | ❌ No         | ✅ High   | ⚠️ Varies | ⚡⚡⚡ Fast  | Creative generation
Top-k           | ❌ No         | ⚠️ Medium | ✅ Good   | ⚡⚡⚡ Fast  | General purpose
Nucleus (Top-p) | ❌ No         | ✅ High   | ✅ High   | ⚡⚡ Medium | Best overall

1. Greedy Decoding

\[x_t = \arg\max_{x} P(x | x_{<t})\]

Problems:

  • Repetition: Gets stuck in loops

  • Generic text: Lacks creativity

  • Local optima: Misses better global sequences

Example:

Input: "The cat"
Greedy: "The cat sat on the mat and the cat sat on the mat and the cat..."

3. Temperature Sampling

Adjust logits before sampling:

\[P(x_t = x \mid x_{<t}) = \frac{\exp(z_x / T)}{\sum_{x'} \exp(z_{x'} / T)}\]

where \(z_x\) is the logit for token \(x\), and \(T\) is the temperature.

Effect of temperature:

\(T\)            | Behavior               | Distribution Shape
-----------------|------------------------|-------------------
\(T \to 0\)      | Greedy (deterministic) | Sharp peak
\(T = 1\)        | Original distribution  | Unchanged
\(T > 1\)        | More random            | Flattened
\(T \to \infty\) | Uniform sampling       | Flat

Mathematical intuition:

Temperature controls entropy:

\[H = -\sum_x P(x) \log P(x)\]
  • Low \(T\) β†’ Low entropy β†’ Confident

  • High \(T\) β†’ High entropy β†’ Uncertain
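The entropy-temperature relationship is easy to see numerically. A small sketch with arbitrary example logits:

```python
import torch
import torch.nn.functional as F

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum().item()

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])  # arbitrary example values
ents = []
for T in (0.5, 1.0, 2.0):
    p = F.softmax(logits / T, dim=-1)
    ents.append(entropy(p))
    print(f"T={T}: entropy={ents[-1]:.3f}")

# Entropy grows with temperature: sharper at low T, flatter at high T
assert ents[0] < ents[1] < ents[2]
```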

4. Top-k Sampling

Sample from top-\(k\) most probable tokens:

\[\begin{split}P_{\text{top-k}}(x_t) = \begin{cases} \frac{P(x_t)}{\sum_{x \in V_k} P(x)} & \text{if } x_t \in V_k \\ 0 & \text{otherwise} \end{cases}\end{split}\]

where \(V_k\) is set of top-\(k\) tokens.

Problems:

  • Fixed \(k\) doesn’t adapt to distribution shape

  • When should \(k=5\) vs \(k=50\)?

Example:

Flat distribution:     Peaked distribution:
P = [0.2, 0.19, 0.18,  P = [0.8, 0.05, 0.04, 
     0.17, 0.16, ...]       0.03, 0.02, ...]

k=5: Good (covers 90%) k=5: Bad (includes noise)
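The fixed-\(k\) problem can be quantified by checking how much probability mass top-5 keeps, and how small the weakest kept token is. A toy illustration with made-up distributions like those above:

```python
import torch

flat   = torch.tensor([0.20, 0.19, 0.18, 0.17, 0.16, 0.10])
peaked = torch.tensor([0.80, 0.06, 0.05, 0.04, 0.03, 0.02])

for name, p in [("flat", flat), ("peaked", peaked)]:
    top5 = p.topk(5).values
    print(f"{name}: top-5 mass = {top5.sum():.2f}, "
          f"smallest kept prob = {top5.min():.2f}")
# flat:   top-5 mass = 0.90, smallest kept prob = 0.16  (reasonable)
# peaked: top-5 mass = 0.98, smallest kept prob = 0.03  (keeps near-noise tokens)
```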

5. Nucleus Sampling (Top-p)

Best modern approach (Holtzman et al., 2019)

Sample from the smallest set of tokens with cumulative probability ≥ \(p\):

\[V_p = \min \left\{ V : \sum_{x \in V} P(x) \geq p \right\}\]

Then sample from renormalized distribution over \(V_p\).

Algorithm:

def nucleus_sampling(logits, p=0.9):
    """Sample from the smallest token set with cumulative probability >= p.
    Expects 1-D logits of shape (vocab_size,)."""
    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(
        torch.softmax(logits, dim=-1),
        descending=True
    )

    # Cumulative probabilities over the sorted vocabulary
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

    # Index of the first token that pushes cumulative mass past p
    # (kept, so the nucleus always reaches at least p)
    cutoff_index = int((cumsum_probs > p).nonzero()[0, 0])

    # Keep only the nucleus
    nucleus_probs = sorted_probs[:cutoff_index + 1]
    nucleus_indices = sorted_indices[:cutoff_index + 1]

    # Renormalize and sample
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    next_token = nucleus_indices[
        torch.multinomial(nucleus_probs, 1)
    ]

    return next_token

Advantages:

  • ✅ Adaptive: vocabulary size changes based on the distribution

  • ✅ Quality: filters low-probability garbage

  • ✅ Diversity: still allows exploration

  • ✅ Consistent: works across different contexts

Typical values: \(p \in [0.9, 0.95]\)

6. Repetition Penalty

Problem: Models often repeat phrases.

Solution: Penalize tokens that already appeared:

\[\begin{split}\text{score}(x_t) = \begin{cases} \frac{\text{logit}(x_t)}{\theta} & \text{if } x_t \in x_{<t} \\ \text{logit}(x_t) & \text{otherwise} \end{cases}\end{split}\]

where \(\theta > 1\) is the repetition penalty (typically 1.2).
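One subtlety in the uniform division above: dividing a negative logit by \(\theta > 1\) makes it larger, i.e. more likely. Practical implementations (e.g. the CTRL paper's formulation) therefore divide positive logits and multiply negative ones; a sketch:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Penalize tokens already generated (CTRL-style sign handling).
    logits: (vocab_size,) tensor; generated_ids: 1-D tensor of token ids."""
    logits = logits.clone()
    for tok in set(generated_ids.tolist()):
        if logits[tok] > 0:
            logits[tok] = logits[tok] / theta   # shrink positive logits
        else:
            logits[tok] = logits[tok] * theta   # push negative logits lower
    return logits

penalized = apply_repetition_penalty(
    torch.tensor([2.0, -1.0, 0.5]), torch.tensor([0, 1]), theta=1.2
)
print(penalized)  # tensor([ 1.6667, -1.2000,  0.5000])
```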

7. Contrastive Search (SimCTG)

Idea: Balance likelihood and diversity.

\[x_t = \arg\max_x \left[ (1-\alpha) \cdot \log P(x|x_{<t}) - \alpha \cdot \max_{i<t} \text{sim}(h_x, h_i) \right]\]
  • First term: Language model score

  • Second term: Penalty for similarity to previous tokens

  • \(\alpha\): Trade-off parameter

Result: Coherent + diverse text!

Practical Recommendations

For different tasks:

Task               | Best Strategy                | Parameters
-------------------|------------------------------|--------------------------
Creative writing   | Nucleus + temperature        | \(p=0.95\), \(T=0.8\)
Translation        | Beam search                  | \(k=5\), length penalty
Code generation    | Nucleus                      | \(p=0.9\), \(T=0.2\)
Chatbot            | Nucleus + repetition penalty | \(p=0.9\), \(\theta=1.2\)
Summarization      | Beam search                  | \(k=4\)
Question answering | Greedy or beam               | \(k=1\) or \(k=3\)

Implementation Example

def advanced_generate(model, context, max_len=100,
                     strategy='nucleus', p=0.9, temp=1.0,
                     k=50, rep_penalty=1.0, eos_token=None):
    """Unified generation with multiple strategies (batch size 1)."""

    for _ in range(max_len):
        logits = model(context)[:, -1, :] / temp

        # Apply repetition penalty to already-generated tokens
        if rep_penalty != 1.0:
            for token in set(context[0].tolist()):
                logits[0, token] /= rep_penalty

        probs = F.softmax(logits, dim=-1)

        if strategy == 'greedy':
            next_token = probs.argmax(dim=-1, keepdim=True)

        elif strategy == 'top_k':
            top_probs, top_indices = probs.topk(k)
            next_token = top_indices.gather(
                -1, torch.multinomial(top_probs, 1)
            )

        elif strategy == 'nucleus':
            sorted_probs, sorted_indices = probs.sort(descending=True)
            cumsum = sorted_probs.cumsum(dim=-1)
            # nonzero() returns (row, col) pairs; the column of the first
            # entry past p is the nucleus cutoff
            cutoff = int((cumsum > p).nonzero()[0, 1]) + 1
            nucleus_probs = sorted_probs[:, :cutoff]
            nucleus_probs = nucleus_probs / nucleus_probs.sum()
            idx = torch.multinomial(nucleus_probs, 1)
            next_token = sorted_indices.gather(-1, idx)

        context = torch.cat([context, next_token], dim=1)

        # Stop at end-of-sequence if one is defined
        if eos_token is not None and next_token.item() == eos_token:
            break

    return context
# Test generation
model.eval()
context = torch.randint(0, vocab_size, (1, 10)).to(device)

print("Context:", context[0].tolist())

# Sampling at T=1.0 (note: generate() always samples, so this is not greedy)
generated = model.generate(context, max_new_tokens=20, temperature=1.0)
print("\nSampled (T=1.0):", generated[0].tolist())

# Temperature sampling
generated = model.generate(context, max_new_tokens=20, temperature=0.8)
print("\nTemp=0.8:", generated[0].tolist())

# Top-k
generated = model.generate(context, max_new_tokens=20, temperature=1.0, top_k=10)
print("\nTop-k=10:", generated[0].tolist())

Summary

GPT Architecture:

Decoder-only: Causal masked attention

Autoregressive: \(P(x_t | x_{<t})\)

Key Components:

  • Token + position embeddings

  • Causal self-attention (masked)

  • Feed-forward networks

  • Layer normalization

Training:

\[\mathcal{L} = -\sum_t \log P(x_t | x_{<t})\]

Generation:

  • Greedy: argmax

  • Temperature: control randomness

  • Top-k: sample from top k tokens

  • Top-p (nucleus): cumulative probability threshold

GPT Variants:

  • GPT-2: 1.5B parameters

  • GPT-3: 175B parameters

  • GPT-4: Multimodal

  • InstructGPT: RLHF fine-tuning

Applications:

  • Text generation

  • Code completion

  • Question answering

  • Translation

Next Steps:

  • 12_bert_architecture.ipynb - Compare encoder

  • Explore instruction tuning

  • Fine-tune on specific tasks