import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
1. GPT vs BERT

| Feature | GPT | BERT |
|---|---|---|
| Architecture | Decoder-only | Encoder-only |
| Attention | Causal (masked) | Bidirectional |
| Objective | Next token prediction | MLM + NSP |
| Use case | Generation | Understanding |
Autoregressive Language Modeling

Training: Maximize the log-likelihood \(\mathcal{L} = \sum_{t=1}^T \log P(x_t \mid x_{<t})\)
1.5. Causal Masking: Mathematical Foundation

Why Causal Masking?

In autoregressive generation, we must prevent "information leakage" from future tokens. Masked attention is computed as

\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V\)

where the causal mask \(M\) is:

\(M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}\)

Interpretation:
Token at position \(i\) can only attend to positions \(j \leq i\)
After softmax, \(\exp(-\infty) = 0\), so future positions get zero weight
Ensures \(P(x_t \mid x_{<t})\) only uses past context
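These three points can be checked numerically with a minimal, self-contained PyTorch sketch: after adding \(-\infty\) and applying softmax, every future position receives exactly zero weight while each row still sums to 1.

```python
import torch
import torch.nn.functional as F

# Minimal demonstration that -inf masking zeroes out future positions.
T = 4
scores = torch.randn(T, T)                  # raw attention scores
mask = torch.tril(torch.ones(T, T))         # 1 = allowed, 0 = future
masked = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(masked, dim=-1)

# Row i attends only to columns j <= i; future weights are exactly zero.
print(attn)
print(bool(torch.all(attn.triu(1) == 0)))   # no weight on future positions
```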
Comparison of Attention Patterns

| Model | Attention Type | Mask Pattern | Training Objective |
|---|---|---|---|
| GPT | Causal (unidirectional) | Lower triangular | Next token prediction |
| BERT | Bidirectional | Full matrix (some masked) | MLM + NSP |
| Prefix LM | Prefix bidirectional + causal | Block matrix | Span prediction |
Mathematical Consequences
1. Training Efficiency:
For sequence length \(T\), we get \(T\) training signals per sequence:
Each position learns to predict the next token given all previous context.
2. Generation Constraint:
At inference, must generate sequentially:
Cannot parallelize generation across time
Complexity: \(O(T)\) forward passes for \(T\) tokens
But: Can batch generate multiple sequences
3. Context Window:
Effective context is cumulative:
Position 1: sees nothing (just BOS)
Position \(t\): sees \(t-1\) previous tokens
Position \(T\): sees full sequence
Implementation Details
# Creating causal mask
def create_causal_mask(seq_len):
    """Lower triangular matrix for causal attention."""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    # Convert 0s to -inf, 1s to 0
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

# Example: 5x5 causal mask
mask = create_causal_mask(5)
print("Causal mask pattern:")
print(mask)
Attention scores before/after masking (illustrative 5x5 example):

Before masking:                After softmax with causal mask:
[[0.8, 0.6, 0.5, 0.4, 0.3],   [[1.0, 0.0, 0.0, 0.0, 0.0],
 [0.7, 0.9, 0.6, 0.5, 0.4],    [0.4, 0.6, 0.0, 0.0, 0.0],
 [0.6, 0.5, 0.8, 0.7, 0.5],    [0.2, 0.3, 0.5, 0.0, 0.0],
 [0.5, 0.6, 0.7, 0.9, 0.8],    [0.2, 0.2, 0.3, 0.3, 0.0],
 [0.4, 0.5, 0.6, 0.7, 1.0]]    [0.1, 0.2, 0.2, 0.3, 0.2]]
Causal vs Bidirectional Trade-offs

GPT (Causal) Advantages: ✅ Natural for generation tasks ✅ Coherent long-form text ✅ Simple training objective ✅ No special tokens needed

GPT Disadvantages: ❌ Cannot see future context ❌ Less effective for classification ❌ Slower inference (sequential)

BERT (Bidirectional) Advantages: ✅ Full context for understanding ✅ Better for classification/NER ✅ Parallel inference

BERT Disadvantages: ❌ Cannot generate naturally ❌ Needs [MASK] tokens ❌ Two-stage training (MLM → task)
GPT Components

GPT uses a stack of Transformer decoder layers with causal (left-to-right) self-attention: each token can only attend to itself and preceding tokens, enforced by masking out the strictly upper triangle of the attention matrix. This autoregressive structure means the model factorizes the joint probability of a sequence as \(p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1})\). The architecture is otherwise similar to BERT (layer normalization, multi-head attention, and feed-forward layers with residual connections), but the causal mask is what enables GPT to generate text token by token. Modern GPT variants (GPT-2, GPT-3, GPT-4) scale this architecture to billions of parameters, demonstrating that autoregressive language models exhibit emergent capabilities with scale.
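The chain-rule factorization can be made concrete numerically. The sketch below uses random logits as a stand-in for per-position model outputs (a real model's logits at position \(t\) condition on \(x_{<t}\)); the sequence log-probability is then just a sum of per-position next-token log-probabilities.

```python
import torch
import torch.nn.functional as F

# Sketch: the autoregressive factorization turns sequence log-probability
# into a sum of per-position next-token log-probabilities.
torch.manual_seed(0)
vocab, T = 10, 5
seq = torch.randint(0, vocab, (T,))
logits = torch.randn(T, vocab)   # stand-in for model outputs at each position

log_probs = F.log_softmax(logits, dim=-1)
# log p(x_1..x_T) = sum_t log p(x_t | x_<t)
seq_logprob = log_probs[torch.arange(T), seq].sum()
print(f"log p(sequence) = {seq_logprob.item():.4f}")
```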
2.5. Positional Encoding: Theory and Variants

The Position Problem

Self-attention is permutation invariant - it doesn't know token order!
We need to inject positional information.
Absolute Positional Encoding (GPT-2, GPT-3)

Learned embeddings:

\(h_t^{(0)} = e_{x_t} + p_t\)

where \(e_{x_t}\) is the token embedding and \(p_t \in \mathbb{R}^d\) is a learned position embedding.
Advantages:
Simple and effective
Can learn task-specific patterns
Works well in practice
Disadvantages:
Fixed maximum length
Poor extrapolation beyond training length
No explicit distance notion
Sinusoidal Positional Encoding (Original Transformer)

Formula:

\(\text{PE}_{t,2i} = \sin\!\left(t / 10000^{2i/d}\right), \qquad \text{PE}_{t,2i+1} = \cos\!\left(t / 10000^{2i/d}\right)\)

Key Properties:
Deterministic: No learning required
Unbounded length: Works for any sequence length
Relative position: \(\text{PE}_{t+k}\) is a linear function of \(\text{PE}_t\)
Mathematical intuition:
Different dimensions oscillate at different frequencies:
Low dimensions: fast oscillation (capture local patterns)
High dimensions: slow oscillation (capture global structure)
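The sinusoidal scheme is short enough to implement directly; a self-contained sketch (assuming an even `d_model`):

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding from the original Transformer."""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (L, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    freq = torch.exp(-torch.log(torch.tensor(10000.0)) * i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)                # even dims: sine
    pe[:, 1::2] = torch.cos(pos * freq)                # odd dims: cosine
    return pe

pe = sinusoidal_pe(64, 16)
print(pe.shape)   # torch.Size([64, 16])
```

Note how row 0 is all zeros in the sine dimensions and all ones in the cosine dimensions, matching \(\sin(0)=0\), \(\cos(0)=1\).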
Relative Positional Encoding

Instead of absolute positions, encode relative distances in the attention scores, e.g. (Shaw et al. style):

\(e_{ij} = \frac{q_i^\top k_j + q_i^\top r_{i-j}}{\sqrt{d_k}}\)

where \(r_{i-j}\) is the relative position encoding for distance \(i-j\).
Advantages:
Better length generalization
Explicit relative distance modeling
Used in T5, DeBERTa
Rotary Position Embedding (RoPE) - Modern Approach

Used in: GPT-Neo, GPT-J, LLaMA, PaLM

Idea: Rotate query and key vectors based on position:

\(\tilde{q}_m = R_{\Theta,m}\, q_m, \qquad \tilde{k}_n = R_{\Theta,n}\, k_n\)

Applied to each 2D subspace with a different frequency \(\theta_i = 10000^{-2i/d}\).

Key property: The inner product encodes relative position:

\(\tilde{q}_m^\top \tilde{k}_n = q_m^\top R_{\Theta,\,n-m}\, k_n\)

where \(m, n\) are positions. The dot product depends on positions only through the relative distance \(m-n\).
Advantages:
Good length extrapolation (especially with interpolation/scaling tricks)
Efficient computation (element-wise)
No learned parameters
Widely used in state-of-the-art LLMs
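The rotation and its relative-position property can be sketched in a few lines. This is a simplified standalone version (vectors of shape `(T, d)` with even `d`, the hypothetical helper `rotated_at` placing one vector at one position), not a full attention implementation:

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (T, d), d even (sketch)."""
    T, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)    # per-pair frequencies
    angles = torch.arange(T).float().unsqueeze(1) * theta   # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                      # rotate each 2D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The rotated dot product depends only on the relative distance m - n.
torch.manual_seed(0)
q0, k0 = torch.randn(16), torch.randn(16)

def rotated_at(v, pos, T=8):
    x = torch.zeros(T, 16)
    x[pos] = v
    return apply_rope(x)[pos]

d1 = rotated_at(q0, 3) @ rotated_at(k0, 1)   # positions (3, 1), distance 2
d2 = rotated_at(q0, 5) @ rotated_at(k0, 3)   # positions (5, 3), distance 2
print(torch.allclose(d1, d2, atol=1e-5))     # same distance -> same score
```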
ALiBi (Attention with Linear Biases)

Used in: BLOOM, MPT

Add a bias to attention scores based on distance:

\(\text{score}_{ij} = q_i^\top k_j - m \cdot (i - j), \qquad j \leq i\)

where \(m\) is a head-specific slope (each head has a different \(m\)).
Advantages:
No positional embeddings needed
Excellent extrapolation
Simple implementation
Saves parameters
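The bias matrix is cheap to build. A sketch, assuming the paper's geometric slope recipe for a power-of-two head count:

```python
import torch

def alibi_bias(n_heads, T):
    """ALiBi bias (sketch): per-head slope times negative distance to the past."""
    # Geometric head slopes 2^-1, 2^-2, ... when n_heads is a power of two
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(T)
    dist = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0).float()  # i - j for j <= i
    return -slopes.view(-1, 1, 1) * dist   # (n_heads, T, T), added to attention scores

bias = alibi_bias(8, 5)
print(bias.shape)   # torch.Size([8, 5, 5])
```

The bias is zero on the diagonal (attending to the current token costs nothing) and grows more negative with distance, which is what gives ALiBi its graceful length extrapolation.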
Comparison Table

| Method | Parameters | Extrapolation | Complexity | Used In |
|---|---|---|---|---|
| Learned Absolute | \(O(L \cdot d)\) | ❌ Poor | \(O(1)\) | GPT-2, GPT-3 |
| Sinusoidal | 0 | ⚠️ Moderate | \(O(1)\) | Original Transformer |
| Relative | \(O(L \cdot d)\) | ✅ Good | \(O(L)\) | T5, DeBERTa |
| RoPE | 0 | ✅ Excellent | \(O(1)\) | LLaMA, GPT-J |
| ALiBi | 0 | ✅ Excellent | \(O(1)\) | BLOOM, MPT |
Practical Considerations
When to use each:
Learned Absolute - When sequence length is fixed and known
Sinusoidal - Simple baseline, no parameters
RoPE - Best for LLMs - extrapolates well, efficient
ALiBi - When extrapolation critical, simplest implementation
Relative - When relative positions are semantically important
class CausalSelfAttention(nn.Module):
    """Masked self-attention for autoregressive generation."""

    def __init__(self, d_model, n_heads, dropout=0.1, max_len=512):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # QKV projections
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        # Causal mask
        self.register_buffer('mask', torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.size()
        # QKV
        qkv = self.qkv(x)
        q, k, v = qkv.split(self.d_model, dim=2)
        # Reshape for multi-head
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Attention
        scores = (q @ k.transpose(-2, -1)) / np.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.attn_dropout(attn)
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.resid_dropout(self.out_proj(out))
        return out
class FeedForward(nn.Module):
    """Position-wise feed-forward network."""

    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.linear2(F.gelu(self.linear1(x))))
class GPTBlock(nn.Module):
    """GPT transformer block."""

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
class GPT(nn.Module):
    """GPT language model."""

    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=8,
                 d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.max_len = max_len
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying
        self.token_emb.weight = self.head.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx):
        B, T = idx.size()
        # Embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        x = self.dropout(self.token_emb(idx) + self.pos_emb(pos))
        # Transformer blocks
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively."""
        for _ in range(max_new_tokens):
            # Crop context to the model's maximum length
            idx_cond = idx if idx.size(1) <= self.max_len else idx[:, -self.max_len:]
            # Forward
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            # Top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
# Create model
vocab_size = 1000
model = GPT(vocab_size, d_model=128, n_layers=4, n_heads=4, d_ff=512).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
Training
GPT is trained with a causal language modeling objective: predict the next token given all previous tokens. The loss is cross-entropy averaged over all positions in the sequence: \(\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p(x_t | x_{<t})\). Because the causal mask allows all positions to be predicted in a single forward pass (unlike autoregressive generation which processes one token at a time), training is highly parallelizable and efficient on GPUs. Key training hyperparameters include learning rate schedule (typically with linear warm-up and cosine decay), weight decay, and gradient clipping. Monitoring both training loss and validation perplexity helps detect overfitting, which becomes less of a concern as model and dataset scale increase.
3.5. Scaling Laws for Language Models

Kaplan et al. (2020): Power Law Scaling

Empirical Discovery: Model performance follows predictable power laws:

\(L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\)

where:
\(L\) = test loss
\(N\) = number of parameters
\(D\) = dataset size (tokens)
\(C\) = compute budget (FLOPs)
\(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\)
Key Findings

1. Performance improves as power laws:
More parameters → lower loss (until convergence).

2. Data and model size should scale together:
For compute budget \(C\), Kaplan et al. suggest

\(N \propto C^{0.73}, \qquad D \propto C^{0.27}\)

Insight: Should scale model size faster than data size.
3. Extremely large models are undertrained:
GPT-3 (175B parameters) with 300B tokens is not optimal!
According to scaling laws:
Should have used ~400B+ tokens
Or used smaller model with same compute
Chinchilla Scaling Laws (Hoffmann et al. 2022)

Re-analysis with more data points.

Key difference: Model size and data should scale equally!

Practical implications:

| Compute Budget | Kaplan et al. | Chinchilla | Reality |
|---|---|---|---|
| 1e23 FLOPs | 400B params, 200B tokens | 70B params, 1.4T tokens | Chinchilla: 70B, 1.4T |
| GPT-3 | 175B params, 300B tokens | Should be ~30B, 2T tokens | Oversized model, undertrained on data |

Result: Chinchilla (70B) outperforms both GPT-3 (175B, 2.5× larger) and Gopher (280B, 4× larger)!
LLaMA Approach

Meta's strategy based on Chinchilla:
LLaMA-7B: Trained on 1T tokens
LLaMA-13B: Trained on 1T tokens
LLaMA-65B: Trained on 1.4T tokens

Conclusion: Smaller, longer-trained models > larger, shorter-trained models
Practical Scaling Recommendations

For a fixed compute budget \(C\), the Chinchilla rule scales parameters and tokens together:

\(N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}\)

Example: \(C = 10^{23}\) FLOPs
\(N \approx 41\)B parameters
\(D \approx 1.2\)T tokens
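A ballpark version of this sizing can be computed with the common rules of thumb \(C \approx 6ND\) (training FLOPs) and \(D \approx 20N\) (tokens per parameter). These constants are heuristics, not the paper's fitted coefficients, so the numbers come out somewhat lower than the example above:

```python
# Ballpark Chinchilla-style sizing using common approximations.
# C ~= 6 * N * D (training FLOPs) and D ~= 20 * N (tokens per parameter).
# Hypothetical helper; the paper's fitted coefficients differ slightly.
def compute_optimal(C):
    N = (C / (6 * 20)) ** 0.5   # parameters
    D = 20 * N                  # training tokens
    return N, D

N, D = compute_optimal(1e23)
print(f"N ~ {N / 1e9:.0f}B params, D ~ {D / 1e12:.2f}T tokens")
```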
Scaling Law Visualization

Loss vs parameters (log-log): test loss falls along a straight line with slope ≈ -0.076 as parameter count increases.

Compute-optimal frontier (data vs parameters): Chinchilla scales data and parameters equally, while Kaplan et al. scale data more slowly than parameters.
Emergent Abilities

Surprising finding: Some abilities appear suddenly at scale!

| Ability | Emergence Threshold |
|---|---|
| Arithmetic | ~13B parameters |
| Multi-step reasoning | ~60B parameters |
| In-context learning | ~10B parameters |
| Code generation | ~20B parameters |

Not smooth scaling - discrete jumps in capability!
Bitter Lesson (Sutton, 2019)

Historical pattern:
Initially: Hand-crafted features/architecture wins
Eventually: Scale + compute + simple methods win

GPT validates this:
Simple architecture (transformer)
Simple objective (next token prediction)
Massive scale → SOTA on many tasks
Open Questions
When do scaling laws break?
Do they hold to 1T+ parameters?
What about multimodal models?
Sample efficiency:
Humans learn from far less data
Can we improve data efficiency?
Emergent abilities:
Can we predict them?
What causes discrete jumps?
Optimal architecture:
Does depth/width ratio matter at scale?
Sparse vs dense models?
# Dummy training
def train_gpt(model, data_loader, epochs=3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for batch_idx, batch in enumerate(data_loader):
            batch = batch.to(device)
            # Forward
            logits = model(batch)
            # Loss: predict next token (targets are inputs shifted by one)
            loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                                   batch[:, 1:].reshape(-1))
            # Backward
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            epoch_loss += loss.item()
            if batch_idx % 10 == 0:
                print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")
        avg_loss = epoch_loss / len(data_loader)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1} avg loss: {avg_loss:.4f}")
    return losses

# Create dummy data
dummy_data = torch.randint(0, vocab_size, (200, 64))
data_loader = torch.utils.data.DataLoader(dummy_data, batch_size=8, shuffle=True)
losses = train_gpt(model, data_loader, epochs=2)
plt.figure(figsize=(8, 5))
plt.plot(losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('GPT Training')
plt.grid(True, alpha=0.3)
plt.show()
Generation Strategies
Once trained, GPT generates text by sampling from the predicted next-token distribution one token at a time. Several decoding strategies control the quality-diversity trade-off:
Greedy decoding: Always pick the most probable token. Deterministic but repetitive.
Temperature sampling: Scale logits by \(1/T\) before softmax. \(T < 1\) sharpens the distribution (more deterministic), \(T > 1\) flattens it (more diverse).
Top-k sampling: Restrict to the \(k\) most probable tokens, then sample.
Nucleus (top-p) sampling: Restrict to the smallest set of tokens whose cumulative probability exceeds \(p\), then sample.
Top-p sampling is generally preferred in practice because it adapts the candidate set size to the entropy of each distribution, avoiding both repetitive and incoherent outputs.
4.5. Advanced Generation Strategies: Complete Analysis

Decoding Algorithms Comparison

| Strategy | Deterministic | Diversity | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| Greedy | ✅ Yes | ❌ Low | ⚠️ Medium | ⚡⚡⚡ Fast | Fast prototyping |
| Beam Search | ✅ Yes | ❌ Low | ✅ High | ⚡⚡ Medium | Translation, summarization |
| Sampling | ❌ No | ✅ High | ⚠️ Varies | ⚡⚡⚡ Fast | Creative generation |
| Top-k | ❌ No | ⚠️ Medium | ✅ Good | ⚡⚡⚡ Fast | General purpose |
| Nucleus (Top-p) | ❌ No | ✅ High | ✅ High | ⚡⚡ Medium | Best overall |
1. Greedy Decoding
Problems:
Repetition: Gets stuck in loops
Generic text: Lacks creativity
Local optima: Misses better global sequences
Example:
Input: "The cat"
Greedy: "The cat sat on the mat and the cat sat on the mat and the cat..."
2. Beam Search

Keep the top-\(k\) most probable sequences at each step.

Algorithm:
Beam size = k
At each step:
1. Expand each beam with all vocab tokens
2. Score: log P(x_1,...,x_t) = sum of log probs
3. Keep top-k sequences
4. Repeat until EOS or max length

Mathematical formulation:

\(\text{score}(x_{1:t}) = \sum_{s=1}^{t} \log P(x_s \mid x_{<s})\)

Keep the top-\(k\) sequences by score.

Length normalization:
Raw scores favor shorter sequences! Apply:

\(\text{score}_{\text{norm}}(x_{1:T}) = \frac{1}{T^\alpha} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\)

where \(\alpha \in [0.6, 0.7]\) typically.

Problems:
Still deterministic → boring outputs
Computationally expensive (\(k\) forward passes per step)
Generic "safe" text
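The four algorithm steps above can be sketched as a minimal beam search over a toy next-token model (`toy_logits` is a hypothetical stand-in that only looks at the last token; a real model would condition on the full prefix, and the sketch omits EOS handling and length normalization):

```python
import torch
import torch.nn.functional as F

def beam_search(step_logits_fn, bos, beam_size, max_len):
    """Minimal beam search: keep the top-k prefixes by cumulative log-prob."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = F.log_softmax(step_logits_fn(seq), dim=-1)
            # Expand each beam with every vocabulary token
            for tok, lp in enumerate(log_probs.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Keep the top-k partial sequences by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy model: strongly prefers token (last_token + 1) mod vocab.
vocab = 5
def toy_logits(seq):
    logits = torch.zeros(vocab)
    logits[(seq[-1] + 1) % vocab] = 2.0
    return logits

best_seq, best_score = beam_search(toy_logits, bos=0, beam_size=3, max_len=4)
print(best_seq)   # [0, 1, 2, 3, 4]
```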
3. Temperature Sampling

Adjust logits before sampling:

\(P(x = v) = \frac{\exp(z_v / T)}{\sum_{v'} \exp(z_{v'} / T)}\)

where \(z_v\) is the logit for token \(v\), and \(T\) is the temperature.
Effect of temperature:

| \(T\) | Behavior | Distribution Shape |
|---|---|---|
| \(T \to 0\) | Greedy (deterministic) | Sharp peak |
| \(T = 1\) | Original distribution | Unchanged |
| \(T > 1\) | More random | Flattened |
| \(T \to \infty\) | Uniform sampling | Flat |
Mathematical intuition:
Temperature controls entropy:
Low \(T\) → Low entropy → Confident
High \(T\) → High entropy → Uncertain
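This entropy relationship is easy to verify directly: dividing the same logits by different temperatures and measuring the Shannon entropy of the resulting distribution shows it rising monotonically with \(T\).

```python
import torch
import torch.nn.functional as F

def entropy(logits, T):
    """Shannon entropy (nats) of softmax(logits / T)."""
    p = F.softmax(logits / T, dim=-1)
    return -(p * p.log()).sum().item()

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: entropy={entropy(logits, T):.3f}")
# Entropy increases with T: low T sharpens, high T flattens the distribution.
```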
4. Top-k Sampling

Sample from the top-\(k\) most probable tokens:

\(P'(x = v) = \begin{cases} P(x = v) \,/ \sum_{v' \in V_k} P(x = v') & v \in V_k \\ 0 & \text{otherwise} \end{cases}\)

where \(V_k\) is the set of top-\(k\) tokens.

Problems:
A fixed \(k\) doesn't adapt to the distribution shape
When should \(k=5\) vs \(k=50\)?

Example:

Flat distribution:            Peaked distribution:
P = [0.2, 0.19, 0.18,         P = [0.8, 0.05, 0.04,
     0.17, 0.16, ...]              0.03, 0.02, ...]
k=5: Good (covers 90%)        k=5: Bad (includes noise)
5. Nucleus Sampling (Top-p)

Best modern approach (Holtzman et al., 2019)

Sample from the smallest set of tokens with cumulative probability ≥ \(p\):

\(V_p = \text{smallest } V' \subseteq V \text{ such that } \sum_{v \in V'} P(v) \geq p\)

Then sample from the renormalized distribution over \(V_p\).
Algorithm:

def nucleus_sampling(logits, p=0.9):
    """Sample a token from the top-p nucleus of a 1D logits vector."""
    # Sort tokens by probability (descending)
    sorted_probs, sorted_indices = torch.sort(
        torch.softmax(logits, dim=-1),
        descending=True
    )
    # Cumulative probabilities
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
    # First position where cumulative probability exceeds p
    cutoff_index = int((cumsum_probs > p).nonzero()[0])
    # Keep only the nucleus (inclusive of the cutoff token)
    nucleus_probs = sorted_probs[:cutoff_index + 1]
    nucleus_indices = sorted_indices[:cutoff_index + 1]
    # Renormalize and sample
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    next_token = nucleus_indices[
        torch.multinomial(nucleus_probs, 1)
    ]
    return next_token
Advantages: ✅ Adaptive: vocabulary size changes based on the distribution ✅ Quality: filters low-probability garbage ✅ Diversity: still allows exploration ✅ Consistent: works across different contexts

Typical values: \(p \in [0.9, 0.95]\)
6. Repetition Penalty

Problem: Models often repeat phrases.

Solution: Penalize tokens that have already appeared by rescaling their logits (CTRL-style):

\(z_v' = \begin{cases} z_v / \theta & z_v > 0 \\ z_v \cdot \theta & z_v \leq 0 \end{cases} \quad \text{for previously generated } v\)

where \(\theta > 1\) is the repetition penalty (typically 1.2).
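The rescaling takes only a few lines; a standalone sketch of the CTRL-style penalty (the function name is illustrative):

```python
import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """CTRL-style repetition penalty: shrink positive logits of seen tokens,
    push negative ones further down."""
    logits = logits.clone()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= theta   # less probable
        else:
            logits[tok] *= theta   # even less probable
    return logits

logits = torch.tensor([3.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, [0, 1])
print(penalized)   # tensor([ 2.5000, -1.2000,  0.5000])
```

Dividing positive logits but multiplying negative ones ensures the penalty always pushes seen tokens toward lower probability, regardless of sign.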
7. Contrastive Search (SimCTG)

Idea: Balance likelihood and diversity:

\(x_t = \underset{v \in V^{(k)}}{\arg\max} \left[ (1-\alpha)\, P(v \mid x_{<t}) - \alpha \max_{1 \leq j < t} \text{sim}(h_v, h_{x_j}) \right]\)

First term: Language model score
Second term: Penalty for similarity to the hidden states of previous tokens
\(\alpha\): Trade-off parameter

Result: Coherent + diverse text!
Practical Recommendations

For different tasks:

| Task | Best Strategy | Parameters |
|---|---|---|
| Creative writing | Nucleus + temp | \(p=0.95\), \(T=0.8\) |
| Translation | Beam search | \(k=5\), length penalty |
| Code generation | Nucleus | \(p=0.9\), \(T=0.2\) |
| Chatbot | Nucleus + rep penalty | \(p=0.9\), \(\theta=1.2\) |
| Summarization | Beam search | \(k=4\) |
| Question answering | Greedy or beam | \(k=1\) or \(k=3\) |
Implementation Example

def advanced_generate(model, context, max_len=100,
                      strategy='nucleus', p=0.9, temp=1.0,
                      k=50, rep_penalty=1.0, eos_token=None):
    """Unified generation with multiple strategies (batch size 1)."""
    for _ in range(max_len):
        logits = model(context)[:, -1, :] / temp
        # Apply repetition penalty (divide positive, multiply negative logits)
        if rep_penalty != 1.0:
            for token in set(context[0].tolist()):
                if logits[0, token] > 0:
                    logits[0, token] /= rep_penalty
                else:
                    logits[0, token] *= rep_penalty
        probs = F.softmax(logits, dim=-1)
        if strategy == 'greedy':
            next_token = probs.argmax(dim=-1, keepdim=True)
        elif strategy == 'top_k':
            top_probs, top_indices = probs.topk(k)
            next_token = top_indices.gather(
                -1, torch.multinomial(top_probs, 1)
            )
        elif strategy == 'nucleus':
            sorted_probs, sorted_indices = probs.sort(
                descending=True
            )
            cumsum = sorted_probs.cumsum(dim=-1)
            # Column index of the first token past the threshold (row 0 = batch 0)
            cutoff = (cumsum > p).nonzero()[0, 1] + 1
            nucleus_probs = sorted_probs[:, :cutoff]
            nucleus_probs = nucleus_probs / nucleus_probs.sum()
            idx = torch.multinomial(nucleus_probs, 1)
            next_token = sorted_indices.gather(-1, idx)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        context = torch.cat([context, next_token], dim=1)
        if eos_token is not None and next_token.item() == eos_token:
            break
    return context
# Test generation
model.eval()
context = torch.randint(0, vocab_size, (1, 10)).to(device)
print("Context:", context[0].tolist())

# Standard sampling (temperature = 1.0)
generated = model.generate(context, max_new_tokens=20, temperature=1.0)
print("\nTemp=1.0:", generated[0].tolist())

# Temperature sampling (sharper distribution)
generated = model.generate(context, max_new_tokens=20, temperature=0.8)
print("\nTemp=0.8:", generated[0].tolist())

# Top-k sampling
generated = model.generate(context, max_new_tokens=20, temperature=1.0, top_k=10)
print("\nTop-k=10:", generated[0].tolist())
Summary

GPT Architecture:
Decoder-only: Causal masked attention
Autoregressive: \(P(x_t \mid x_{<t})\)

Key Components:
Token + position embeddings
Causal self-attention (masked)
Feed-forward networks
Layer normalization

Training:
Causal language modeling: cross-entropy loss on next-token prediction

Generation:
Greedy: argmax
Temperature: control randomness
Top-k: sample from top k tokens
Top-p (nucleus): cumulative probability threshold

GPT Variants:
GPT-2: 1.5B parameters
GPT-3: 175B parameters
GPT-4: Multimodal
InstructGPT: RLHF fine-tuning

Applications:
Text generation
Code completion
Question answering
Translation

Next Steps:
12_bert_architecture.ipynb - Compare encoder
Explore instruction tuning
Fine-tune on specific tasks