Attention Mechanism: The Breakthrough Innovation
Why Attention Was Invented
The Problem with RNNs
Before attention, sequence models (RNNs, LSTMs) had a fundamental limitation:
Sentence: "The cat that ate the mouse that lived in the barn ran away"
Problem: By the time the model reaches "ran", the information about "cat"
has been compressed through many timesteps and has largely faded.
Information flow:
cat → that → ate → the → mouse → ... → ran
      [compressed in hidden state, fading...]
RNN limitations:
✗ Sequential processing (slow)
✗ Information bottleneck (fixed-size hidden state)
✗ Vanishing gradients for long sequences
✗ Cannot look back at earlier words directly
The Solution: Attention
Key Innovation: Let the model look back at ANY previous word when processing the current word.
When predicting "ran":
- Attention can directly look at "cat" (high attention)
- Also look at "that", "ate", etc. (lower attention)
- Weights determine how much to focus on each word
No information loss!
Attention benefits:
✓ Direct access to any previous position
✓ Parallel processing possible
✓ No vanishing gradients
✓ Model learns what to focus on
The Core Concept
The Human Analogy
When you read this sentence: "The Eiffel Tower is in Paris, which is the capital of France."
To answer "What city has the Eiffel Tower?", you:
1. Scan the sentence
2. Attend to relevant words: "Eiffel Tower", "Paris"
3. Ignore less relevant words: "which", "is", "the", "of"
4. Form an answer from the attended information
The attention mechanism does the same thing!
The Intuition
Question: "Where is the Eiffel Tower?"
Context: "The Eiffel Tower is located in Paris, France."
Attention weights:
The     [0.05] █
Eiffel  [0.30] ████████
Tower   [0.25] ███████
is      [0.02] █
located [0.05] █
in      [0.03] █
Paris   [0.25] ███████
France  [0.05] █
Model focuses heavily on: "Eiffel", "Tower", "Paris"
From Attention to Self-Attention
Regular Attention: Query one sequence, attend to another
Used in encoder-decoder models (translation)
Example: Query = "Wo ist der Eiffelturm?" (German: "Where is the Eiffel Tower?"), Attend to = "The Eiffel Tower is in Paris"
Self-Attention: Query and attend to the SAME sequence
Used in transformers (BERT, GPT)
Example: Each word attends to every word in same sentence
Helps understand relationships within text
Self-Attention Step by Step
Let's build intuition with a simple example.
Example Sentence
"The cat sat"
Goal: For each word, compute attention to all words (including itself).
Step 1: Embed Words
Convert words to vectors (from Phase 2: Embeddings):
# Simplified 4-dimensional embeddings
embeddings = {
    "The": [0.2, 0.1, 0.5, 0.3],
    "cat": [0.5, 0.8, 0.2, 0.1],
    "sat": [0.1, 0.3, 0.9, 0.4]
}
# Stack into matrix
X = [[0.2, 0.1, 0.5, 0.3],  # The
     [0.5, 0.8, 0.2, 0.1],  # cat
     [0.1, 0.3, 0.9, 0.4]]  # sat
# Shape: (3, 4) - 3 words, 4 dimensions
Step 2: Create Q, K, V Matrices
We need three transformations of our input:
# Weight matrices (learned during training; random initialization shown)
W_q = np.random.randn(4, 4)  # Query weights
W_k = np.random.randn(4, 4)  # Key weights
W_v = np.random.randn(4, 4)  # Value weights
# Transform embeddings
Q = X @ W_q # Query: What am I looking for?
K = X @ W_k # Key: What do I contain?
V = X @ W_v # Value: What do I output?
# All have shape (3, 4)
Intuition:
Query (Q): "What information do I need?"
Key (K): "What information do I have?"
Value (V): "What information do I output?"
Step 3: Compute Attention Scores
Measure similarity between queries and keys:
scores = Q @ K.T # Matrix multiplication
# Shape: (3, 3)
# Example result:
scores = [[2.1, 1.5, 0.8],  # The attends to: The, cat, sat
          [1.5, 3.2, 1.9],  # cat attends to: The, cat, sat
          [0.8, 1.9, 2.7]]  # sat attends to: The, cat, sat
Interpretation:
scores[1, 0] = 1.5 means "cat" has score 1.5 when attending to "The"
scores[1, 1] = 3.2 means "cat" has its highest score when attending to itself
Higher score = more relevant
Step 4: Scale Scores
Divide by square root of dimension to stabilize gradients:
d_k = 4 # Dimension of keys
scaled_scores = scores / sqrt(d_k)
Why scale?
Dot products grow with dimension, producing very large score values
Scaling keeps the softmax out of its saturated, near-zero-gradient regime
This matters more as d_k grows
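To see the effect concretely, here is a small NumPy sketch (illustrative, not from the original text): with unit-variance entries, dot products have magnitudes that grow like √d_k, so unscaled scores tend to saturate the softmax into a near one-hot distribution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512  # a typical key dimension

# One query and two keys with unit-variance entries
q, k1, k2 = rng.standard_normal((3, d_k))
scores = np.array([q @ k1, q @ k2])  # magnitudes grow like sqrt(d_k)

unscaled = softmax(scores)            # typically close to one-hot
scaled = softmax(scores / np.sqrt(d_k))  # smoother distribution
print(unscaled, scaled)
```

Shrinking the gap between scores can only make the softmax flatter, which is exactly what keeps gradients flowing through all positions.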
Step 5: Apply Softmax
Convert scores to probability distribution:
attention_weights = softmax(scaled_scores)
# Example result:
attention_weights = [
    [0.55, 0.32, 0.13],  # The: 55% to itself, 32% to cat, 13% to sat
    [0.15, 0.60, 0.25],  # cat: 15% to The, 60% to itself, 25% to sat
    [0.10, 0.30, 0.60]   # sat: 10% to The, 30% to cat, 60% to itself
]
Properties:
Each row sums to 1.0
Represents how much to attend to each position
These are the famous "attention weights"
Step 6: Apply Weights to Values
Compute weighted sum of values:
output = attention_weights @ V
# Shape: (3, 4)
# For word "cat" (row 1):
output[1] = 0.15 * V[0] + 0.60 * V[1] + 0.25 * V[2]
#               ↑             ↑             ↑
#           from "The"    from "cat"    from "sat"
Result: Each word's output is a weighted combination of all words' values.
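As a sanity check, the six steps can be run end to end in NumPy (a sketch: the projection matrices are random stand-ins for learned weights, so the exact numbers will differ from the illustrations above):

```python
import numpy as np

np.random.seed(0)

# Step 1: embeddings for "The cat sat" (3 words, 4 dimensions)
X = np.array([[0.2, 0.1, 0.5, 0.3],   # The
              [0.5, 0.8, 0.2, 0.1],   # cat
              [0.1, 0.3, 0.9, 0.4]])  # sat

# Step 2: random stand-ins for the learned W_q, W_k, W_v
W_q, W_k, W_v = np.random.randn(3, 4, 4)
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Steps 3-4: scaled attention scores
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)

# Step 5: row-wise softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Step 6: weighted sum of values
output = weights @ V
print(weights.sum(axis=-1))  # each row sums to 1
print(output.shape)          # (3, 4)
```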
Complete Formula
Putting it all together:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- Q = queries (what I'm looking for)
- K = keys (what I have)
- V = values (what I output)
- d_k = dimension of keys
- / √d_k = scaling factor
- softmax = converts scores to probabilities
Query, Key, Value (QKV)
The Database Analogy
Think of attention like a database lookup:
# Database with (key, value) pairs
database = {
    "Paris": "Capital of France, home to Eiffel Tower",
    "London": "Capital of UK, home to Big Ben",
    "Tokyo": "Capital of Japan, largest city"
}
# Query
query = "Where is the Eiffel Tower?"
# Step 1: Match query to keys (illustrative scores)
scores = {
    "Paris": 0.85,   # High match!
    "London": 0.15,
    "Tokyo": 0.10
}
# Step 2: Softmax over the scores to get attention weights
# (illustrative numbers)
attention = {"Paris": 0.70, "London": 0.17, "Tokyo": 0.13}
# Step 3: Retrieve a weighted combination of values
# (conceptually - real attention blends vectors, not strings)
result = (0.70 * value("Paris")
          + 0.17 * value("London")
          + 0.13 * value("Tokyo"))
# Mostly Paris information!
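The analogy becomes runnable if the values are vectors instead of strings (a hedged sketch: the one-hot "city vectors" and the match scores are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy vector "values" standing in for each city's information
values = {
    "Paris":  np.array([1.0, 0.0, 0.0]),
    "London": np.array([0.0, 1.0, 0.0]),
    "Tokyo":  np.array([0.0, 0.0, 1.0]),
}

# Illustrative query-key match scores
scores = np.array([0.85, 0.15, 0.10])
weights = softmax(scores)

# Soft lookup: a weighted combination of ALL values,
# dominated by the best-matching key
result = sum(w * v for w, v in zip(weights, values.values()))
print(result)
```

Unlike a hard database lookup, every value contributes a little; the weights just make the best match dominate.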
In Neural Networks
# Input embeddings
X = word_embeddings # Shape: (seq_len, d_model)
# Linear transformations (learned)
Q = X @ W_q # "What to search for"
K = X @ W_k # "How to identify relevant info"
V = X @ W_v # "What info to pass forward"
# Attention computation
scores = Q @ K.T # Similarity
weights = softmax(scores) # Probability
output = weights @ V # Weighted combination
Why Three Matrices?
Question: Why not just use X directly?
Answer: Flexibility and expressiveness!
Different transformations learn different aspects
Q, K, V can focus on different features
Allows model to learn complex relationships
Gives model more parameters to optimize
Example:
Q might learn to look for "subjects"
K might learn to identify "verbs"
V might learn to extract "semantic meaning"
Scaled Dot-Product Attention
The Complete Mechanism
import torch
import torch.nn.functional as F
from math import sqrt

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries (batch, seq_len, d_k)
        K: Keys    (batch, seq_len, d_k)
        V: Values  (batch, seq_len, d_v)
        mask: Optional mask (batch, seq_len, seq_len); 0 marks blocked positions
    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    # Compute scaled attention scores
    scores = Q @ K.transpose(-2, -1) / sqrt(d_k)
    # Apply mask (for padding or causality)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax over the key dimension to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Apply weights to values
    output = attention_weights @ V
    return output, attention_weights
Masking
Padding Mask: Ignore padded tokens
# Sentence: "The cat <PAD> <PAD>"
mask = [[1, 1, 0, 0]] # Only attend to real words
Causal Mask: Prevent looking at future tokens (for GPT)
# When predicting word 2, can only see words 0, 1
mask = [[1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]
# Lower triangular matrix
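A causal mask like this can be built and applied in a few lines of NumPy (a sketch; `np.tril` constructs the lower-triangular matrix):

```python
import numpy as np

seq_len = 4
# Causal mask: position i may only attend to positions <= i
mask = np.tril(np.ones((seq_len, seq_len)))

# Apply to some raw attention scores: blocked positions get -1e9,
# so softmax assigns them essentially zero weight
scores = np.random.randn(seq_len, seq_len)
masked = np.where(mask == 0, -1e9, scores)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is all zeros
```

Each row still sums to 1; the probability mass is simply redistributed over the visible (past and current) positions.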
Multi-Head Attention
The Problem with Single Attention
Single attention head can only capture one type of relationship:
"The cat sat on the mat"
Single head might learn: Subject-Verb relationships
- "cat" → "sat" (subject-verb)
But misses:
- Spatial relationships: "sat" → "on"
- Object relationships: "on" → "mat"
The Solution: Multiple Heads
Run attention multiple times in parallel, each learning different patterns:
Head 1: Subject-Verb relationships
"The" → [0.1, 0.1, 0.8, 0.0, 0.0, 0.0]  # Focuses on "sat"
"cat" → [0.1, 0.7, 0.2, 0.0, 0.0, 0.0]  # Focuses on itself and "sat"
Head 2: Object-Preposition relationships
"sat" → [0.0, 0.0, 0.1, 0.8, 0.1, 0.0]  # Focuses on "on"
"on"  → [0.0, 0.0, 0.0, 0.1, 0.2, 0.7]  # Focuses on "mat"
Head 3: Positional relationships
Each word attends to neighbors
... up to 8 or 12 heads
Implementation
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8):
    """
    Args:
        X: Input (batch, seq_len, d_model)
        W_q, W_k, W_v, W_o: Learned weights, each (d_model, d_model)
        num_heads: Number of attention heads
    Returns:
        output: (batch, seq_len, d_model)
    """
    batch, seq_len, d_model = X.shape
    d_k = d_model // num_heads  # Split dimensions across heads

    # Create Q, K, V for all heads at once
    Q = X @ W_q  # (batch, seq_len, d_model)
    K = X @ W_k
    V = X @ W_v

    # Reshape to separate heads: (batch, seq_len, num_heads, d_k)
    Q = Q.reshape(batch, seq_len, num_heads, d_k)
    K = K.reshape(batch, seq_len, num_heads, d_k)
    V = V.reshape(batch, seq_len, num_heads, d_k)

    # Transpose to (batch, num_heads, seq_len, d_k)
    Q = Q.transpose(1, 2)
    K = K.transpose(1, 2)
    V = V.transpose(1, 2)

    # Apply attention for each head in parallel
    # (scaled_dot_product_attention defined above returns a tuple)
    attention_output, _ = scaled_dot_product_attention(Q, K, V)
    # Shape: (batch, num_heads, seq_len, d_k)

    # Concatenate heads back together
    attention_output = attention_output.transpose(1, 2)
    # Shape: (batch, seq_len, num_heads, d_k)
    attention_output = attention_output.reshape(batch, seq_len, d_model)
    # Shape: (batch, seq_len, d_model)

    # Final linear transformation mixes information across heads
    output = attention_output @ W_o
    return output
Why It Works
Ensemble Effect: Multiple heads vote on what's important
Specialized Roles: Each head can specialize:
Syntactic relationships (grammar)
Semantic relationships (meaning)
Positional relationships (nearby words)
Long-range dependencies (distant words)
Empirical Success:
GPT-3: 96 attention heads
BERT-base: 12 heads
BERT-large: 16 heads
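These head counts pair with the model width: each head attends within a d_k = d_model // num_heads subspace. A quick check with the published BERT configurations (hidden sizes 768 and 1024):

```python
# Per-head dimension for the standard BERT configurations
configs = {
    "BERT-base":  (768, 12),   # (d_model, num_heads)
    "BERT-large": (1024, 16),
}
for name, (d_model, num_heads) in configs.items():
    d_k = d_model // num_heads
    print(f"{name}: d_k = {d_k}")  # 64 for both
```

Keeping d_k fixed (here 64) while growing d_model and num_heads together is a common design choice: more heads, same per-head capacity.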
Practical Examples
Example 1: Machine Translation
English: "I love machine learning"
French: "J'adore l'apprentissage automatique"
When generating "apprentissage":
Attention weights to English words:
I [0.05]
love [0.10]
machine  [0.35] ████████
learning [0.50] ████████████
Model attends heavily to "machine learning" when generating "apprentissage"
Example 2: Question Answering
Question: "Who invented the transformer?"
Context: "The transformer architecture was invented by Vaswani et al.
in 2017 at Google Brain. The paper 'Attention is All You Need'
introduced this revolutionary architecture."
Attention when answering:
architecture [0.20] ████
invented     [0.30] ████████
by           [0.05] █
Vaswani      [0.35] █████████
et           [0.05] █
al           [0.05] █
Answer: "Vaswani et al."
Example 3: Sentiment Analysis
Review: "The movie was good but the ending was terrible"
When predicting sentiment:
Attention weights:
The [0.05]
movie    [0.15] ███
was      [0.02]
good     [0.25] ██████
but      [0.08] ██
the      [0.02]
ending   [0.18] ████
was      [0.02]
terrible [0.23] ██████
Model focuses on: "good" and "terrible"
Result: Mixed sentiment (conflicting signals)
Why It Works So Well
1. Parallel Processing
RNN (Sequential):
Time: T₁ → T₂ → T₃ → T₄ → T₅
Must wait for each step to complete
Attention (Parallel):
All positions computed simultaneously
Time: T₁ (single forward pass for the entire sequence)
Orders of magnitude faster training on GPUs
2. No Information Bottleneck
RNN: Compresses everything into a fixed-size hidden state
Attention: Direct access to all positions, no compression needed
3. Better Gradients
RNN: Gradients must flow through many timesteps (vanishing/exploding)
Attention: Direct paths from the output to any input (stable gradients)
4. Interpretability
Can visualize attention weights to see what model focuses on:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(attention_weights,
xticklabels=words,
yticklabels=words)
plt.title("Attention Weights")
plt.show()
5. Transfer Learning
Pre-trained attention models (BERT, GPT) transfer well to new tasks:
Learn general language understanding
Fine-tune on specific tasks
Requires less task-specific data
Summary
Attention is the key innovation that enabled:
Modern language models (GPT, BERT, T5)
Vision transformers
Multi-modal models
State-of-the-art results across domains
Core concepts to remember:
Attention weights determine what to focus on
QKV mechanism allows flexible learning
Multi-head attention captures multiple relationships
Parallel processing makes training fast
Direct connections enable long-range dependencies
Next: See how attention is used in the complete transformer architecture → transformer_architecture.md