Attention Mechanism: The Breakthrough Innovation

Table of Contents

  1. Why Attention Was Invented

  2. The Core Concept

  3. Self-Attention Step by Step

  4. Query, Key, Value (QKV)

  5. Scaled Dot-Product Attention

  6. Multi-Head Attention

  7. Practical Examples

  8. Why It Works So Well

Why Attention Was Invented

The Problem with RNNs

Before attention, sequence models (RNNs, LSTMs) had a fundamental limitation:

Sentence: "The cat that ate the mouse that lived in the barn ran away"

Problem: By the time the model gets to "ran", information about "cat"
has been compressed through many timesteps and has started to fade.

Information flow:
  cat → that → ate → the → mouse → ... → ran
  ↓                                       ↓
[compressed in hidden state, fading...]

RNN limitations:

  • ❌ Sequential processing (slow)

  • ❌ Information bottleneck (fixed-size hidden state)

  • ❌ Vanishing gradients for long sequences

  • ❌ Cannot look back at earlier words directly

The Solution: Attention

Key Innovation: Let the model look back at ANY previous word when processing the current word.

When predicting "ran":
- Attention can directly look at "cat" (high attention)
- Also look at "that", "ate", etc. (lower attention)
- Weights determine how much to focus on each word

Earlier words are read directly instead of through a compressed hidden state.

Attention benefits:

  • ✅ Direct access to any previous position

  • ✅ Parallel processing possible

  • ✅ Short gradient paths between distant positions (less vanishing)

  • ✅ Model learns what to focus on

The Core Concept

The Human Analogy

Suppose you read this sentence: "The Eiffel Tower is in Paris, which is the capital of France."

To answer "What city has the Eiffel Tower?", you:

  1. Scan the sentence

  2. Attend to relevant words: "Eiffel Tower", "Paris"

  3. Ignore less relevant words: "which", "is", "the", "of"

  4. Form the answer from the attended information

The attention mechanism works in a similar way.

The Intuition

Question: "Where is the Eiffel Tower?"
Context:  "The Eiffel Tower is located in Paris, France."

Attention weights:
  The      [0.05]  ▁
  Eiffel   [0.30]  ████████
  Tower    [0.25]  ███████
  is       [0.02]  ▁
  located  [0.05]  █
  in       [0.03]  ▁
  Paris    [0.25]  ███████
  France   [0.05]  █

Model focuses heavily on: "Eiffel", "Tower", "Paris"

From Attention to Self-Attention

Regular (cross-) Attention: Queries come from one sequence and attend to another

  • Used in encoder-decoder models (translation)

  • Example: Query = "Wo ist der Eiffelturm?" (German for "Where is the Eiffel Tower?"), Attend to = "The Eiffel Tower is in Paris"

Self-Attention: Queries, keys, and values all come from the SAME sequence

  • Used in transformers (BERT, GPT)

  • Example: Each word attends to every word in the same sentence

  • Helps understand relationships within text
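
The difference between the two is only where the queries, keys, and values come from. The following is a minimal, dependency-free sketch of that idea; the helper names `softmax` and `attend` and the toy 2-dimensional vectors are assumptions made for illustration, and no learned projections are applied.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Give each query vector a weighted average of the value vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy vectors (hypothetical values chosen for the example).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 tokens of one sequence
target = [[0.5, 0.5], [1.0, 0.0]]               # 2 tokens of another sequence

# Cross-attention: queries from one sequence, keys/values from the other.
cross = attend(target, source, source)
# Self-attention: queries, keys, and values from the same sequence.
self_out = attend(source, source, source)

print(len(cross), len(self_out))
```

The output always has one row per query, so the cross-attention result has as many rows as `target`, while the self-attention result has as many rows as `source`.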

Self-Attention Step by Step

Let's build intuition with a simple example.

Example Sentence

"The cat sat"

Goal: For each word, compute attention to all words (including itself).

Step 1: Embed Words

Convert words to vectors (from Phase 2: Embeddings):

# Simplified 4-dimensional embeddings
embeddings = {
    "The": [0.2, 0.1, 0.5, 0.3],
    "cat": [0.5, 0.8, 0.2, 0.1],
    "sat": [0.1, 0.3, 0.9, 0.4]
}

# Stack into matrix
X = [[0.2, 0.1, 0.5, 0.3],  # The
     [0.5, 0.8, 0.2, 0.1],  # cat
     [0.1, 0.3, 0.9, 0.4]]  # sat
# Shape: (3, 4) - 3 words, 4 dimensions

Step 2: Create Q, K, V Matrices

We need three transformations of our input:

import numpy as np

# Weight matrices (learned during training; random placeholders here)
rng = np.random.default_rng(0)
W_q = rng.standard_normal((4, 4))  # Query weights
W_k = rng.standard_normal((4, 4))  # Key weights
W_v = rng.standard_normal((4, 4))  # Value weights

# Transform embeddings
X = np.asarray(X)  # (3, 4) matrix from Step 1
Q = X @ W_q  # Query: What am I looking for?
K = X @ W_k  # Key: What do I contain?
V = X @ W_v  # Value: What do I output?

# All have shape (3, 4)

Intuition:

  • Query (Q): "What information do I need?"

  • Key (K): "What information do I have?"

  • Value (V): "What information do I output?"

Step 3: Compute Attention Scores

Measure similarity between queries and keys:

scores = Q @ K.T  # Matrix multiplication
# Shape: (3, 3)

# Example result (illustrative values):
scores = [[2.1, 1.5, 0.8],  # The attends to: The, cat, sat
          [1.5, 3.2, 1.9],  # cat attends to: The, cat, sat
          [0.8, 1.9, 2.7]]  # sat attends to: The, cat, sat

Interpretation:

  • scores[1, 0] = 1.5 means "cat" has score 1.5 when attending to "The"

  • scores[1, 1] = 3.2 means "cat" has its highest score when attending to itself

  • Higher score = more relevant

Step 4: Scale Scores

Divide by the square root of the key dimension to stabilize gradients:

import numpy as np

d_k = 4  # Dimension of keys
scaled_scores = np.asarray(scores) / np.sqrt(d_k)

Why scale?

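  • Dot products grow in magnitude as d_k grows; scaling keeps them in a moderate range

  • Very large scores push softmax toward one-hot outputs, where gradients become tiny

  • Becomes more important with larger dimensions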
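
To see why the dimension matters, the sketch below measures how spread out raw dot products are for random vectors, before and after dividing by the square root of d_k. It assumes query and key components are independent with zero mean and unit variance, which is an idealization; the function name `dot_product_std` is made up for this example.

```python
import math
import random
import statistics

def dot_product_std(d_k, trials=1000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

# Columns: d_k, spread of raw scores, spread after dividing by sqrt(d_k)
for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Under these assumptions the raw spread grows roughly like the square root of d_k, while the scaled spread stays close to 1 regardless of dimension, which keeps the softmax inputs in a comparable range.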

Step 5: Apply Softmax

Convert each row of scores into a probability distribution:

attention_weights = softmax(scaled_scores)  # applied row by row

# Illustrative result (rounded; not the exact softmax of the Step 3 scores):
attention_weights = [
    [0.55, 0.32, 0.13],  # The: 55% to itself, 32% to cat, 13% to sat
    [0.15, 0.60, 0.25],  # cat: 15% to The, 60% to itself, 25% to sat
    [0.10, 0.30, 0.60]   # sat: 10% to The, 30% to cat, 60% to itself
]

Properties:

  • Each row sums to 1.0

  • Represents how much to attend to each position

  • These are the famous "attention weights"
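
For readers who want to check the arithmetic, here is a small sketch that applies a row-wise softmax to the example scores from Step 3 after scaling by the square root of d_k = 4. The `softmax` helper is written out by hand for this example and subtracts the row maximum for numerical stability.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[2.1, 1.5, 0.8],   # The
          [1.5, 3.2, 1.9],   # cat
          [0.8, 1.9, 2.7]]   # sat
d_k = 4

attention_weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]

for word, row in zip(["The", "cat", "sat"], attention_weights):
    print(word, [round(w, 2) for w in row])
```

For these scores the rows come out near [0.44, 0.33, 0.23], [0.22, 0.51, 0.27], and [0.19, 0.33, 0.49]: the same ordering as the rounded illustration, but somewhat flatter, and each row still sums to 1.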

Step 6: Apply Weights to Values

Compute weighted sum of values:

output = attention_weights @ V
# Shape: (3, 4)

# For word "cat" (row 1):
output[1] = 0.15 * V[0] + 0.60 * V[1] + 0.25 * V[2]
#           ↑            ↑             ↑
#        from "The"  from "cat"   from "sat"

Result: Each word's output is a weighted combination of all words' values.
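
The six steps can be sketched end to end without any libraries. This is an illustrative version, not an optimized one: matrices are plain lists of rows, the helper names are made up for the example, and the weight matrices are seeded random stand-ins for values that would normally be learned.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)      # Step 2
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                        # Step 3
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores] # Step 4
    weights = [softmax(row) for row in scaled]                     # Step 5
    return matmul(weights, V), weights                             # Step 6

X = [[0.2, 0.1, 0.5, 0.3],   # The
     [0.5, 0.8, 0.2, 0.1],   # cat
     [0.1, 0.3, 0.9, 0.4]]   # sat

rng = random.Random(0)  # stand-in for learned weights (assumption)
W_q, W_k, W_v = ([[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
                 for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(len(output), len(output[0]))   # 3 rows, 4 columns
```

The output keeps the input's shape: one 4-dimensional vector per word, each a blend of all three value vectors.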

Complete Formula

Putting it all together:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q = queries (what I'm looking for)
- K = keys (what I have)
- V = values (what I output)
- d_k = dimension of keys
- / √d_k = scaling factor
- softmax = converts to probabilities
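
Written out per position, the same formula reads as follows. The symbols here are notation introduced for this note rather than taken from the text above: q_i, k_j, and v_j denote row i of Q and row j of K and V, n is the sequence length, and α_ij is the attention weight that position i places on position j.

```latex
\alpha_{ij} = \frac{\exp\!\left( q_i \cdot k_j / \sqrt{d_k} \right)}
                   {\sum_{m=1}^{n} \exp\!\left( q_i \cdot k_m / \sqrt{d_k} \right)},
\qquad
\mathrm{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j
```

Each output row is therefore a convex combination of the value rows, since the weights α_ij are non-negative and sum to 1 over j.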

Query, Key, Value (QKV)

The Database Analogy

Think of attention like a database lookup:

# Database with (key, value) pairs
database = {
    "Paris": "Capital of France, home to Eiffel Tower",
    "London": "Capital of UK, home to Big Ben",
    "Tokyo": "Capital of Japan, largest city"
}

# Query
query = "Where is the Eiffel Tower?"

# Step 1: Match query to keys
scores = {
    "Paris": 0.85,   # High match!
    "London": 0.15,
    "Tokyo": 0.10
}

# Step 2: Softmax to get attention weights
attention = softmax(scores)  # approx. [0.51, 0.25, 0.24]

# Step 3: Retrieve a weighted combination of values
# (conceptual: real attention blends value vectors, not strings)
result = (0.51 * database["Paris"] +
          0.25 * database["London"] +
          0.24 * database["Tokyo"])
# Paris contributes the most
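
Because strings cannot really be multiplied by weights, the analogy is easier to run with vectors. The sketch below is one hypothetical way to express a "soft" lookup: unlike a dictionary, which returns exactly one value for an exact key match, it always returns a blend of all values. The function name `soft_lookup` and the toy keys and values are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Blend all values, weighting each by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    blended = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return blended, weights

# Hypothetical 2-d keys and 1-d values standing in for three database rows.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

# The query points in the same direction as the first key.
blended, weights = soft_lookup([4.0, 0.0], keys, values)
print([round(w, 3) for w in weights], round(blended[0], 2))
```

The first key matches the query best, so its value dominates the result, but the other two rows still contribute a small amount.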

In Neural Networks

# Input embeddings
X = word_embeddings  # Shape: (seq_len, d_model)

# Linear transformations (learned)
Q = X @ W_q  # "What to search for"
K = X @ W_k  # "How to identify relevant info"
V = X @ W_v  # "What info to pass forward"

# Attention computation (scaling and masking omitted for brevity)
scores = Q @ K.T           # Similarity
weights = softmax(scores)  # Probability
output = weights @ V       # Weighted combination

Why Three Matrices?

Question: Why not just use X directly?

Answer: Flexibility and expressiveness!

  • Different transformations learn different aspects

  • Q, K, V can focus on different features

  • Allows the model to learn complex relationships

  • Gives the model more parameters to optimize

Example:

  • Q might learn to look for "subjects"

  • K might learn to identify "verbs"

  • V might learn to extract "semantic meaning"
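
One concrete consequence of skipping the projections: if scores were computed from X alone, the score of word i for word j would always equal the score of word j for word i. The sketch below illustrates this with the embeddings from the step-by-step example; the two projection matrices are hand-picked hypothetical values, not learned ones.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

X = [[0.2, 0.1, 0.5, 0.3],   # The
     [0.5, 0.8, 0.2, 0.1],   # cat
     [0.1, 0.3, 0.9, 0.4]]   # sat

# No projections: scores = X X^T, which is symmetric by construction.
raw = matmul(X, transpose(X))

# Hypothetical projections: queries read features 0-1, keys read features 2-3.
W_q = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
W_k = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
projected = matmul(matmul(X, W_q), transpose(matmul(X, W_k)))

print(raw[0][1], raw[1][0])              # identical
print(projected[0][1], projected[1][0])  # different
```

With separate query and key projections, "The" can score "cat" differently from how "cat" scores "The", which is what lets directional relationships be represented.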

Scaled Dot-Product Attention

The Complete Mechanism

import math

import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries (batch, seq_len, d_k)
        K: Keys (batch, seq_len, d_k)
        V: Values (batch, seq_len, d_v)
        mask: Optional tensor broadcastable to (batch, seq_len, seq_len);
              0 marks positions to ignore
    
    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Compute attention scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    
    # Apply mask (for padding or causality): masked positions get a very
    # negative score, so softmax assigns them (almost) zero weight
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Softmax over the key dimension to get attention weights
    attention_weights = torch.softmax(scores, dim=-1)
    
    # Apply weights to values
    output = attention_weights @ V
    
    return output, attention_weights

Masking

Padding Mask: Ignore padded tokens

# Sentence: "The cat <PAD> <PAD>"
mask = [[1, 1, 0, 0]]  # Only attend to real words
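
A padding mask like this can be derived directly from the tokens. The following is a small sketch that assumes padding is marked with the literal token "<PAD>", as in the example; the function name `padding_mask` is hypothetical.

```python
def padding_mask(tokens, pad_token="<PAD>"):
    """Return 1 for real tokens and 0 for padding."""
    return [0 if tok == pad_token else 1 for tok in tokens]

batch = [["The", "cat", "<PAD>", "<PAD>"],
         ["The", "cat", "sat", "down"]]

masks = [padding_mask(sentence) for sentence in batch]
print(masks)  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

In practice the mask is usually built from token IDs rather than strings, but the idea is the same: one flag per position, shared by every query in that sentence.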

Causal Mask: Prevent looking at future tokens (for GPT)

# Row i lists what position i may attend to: itself and earlier positions
mask = [[1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]
# Lower triangular matrix
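
The sketch below builds this lower-triangular mask for any length and shows its effect on the attention weights. As a simplification, it sets blocked weights to zero directly instead of filling the scores with a large negative number; the resulting weights are the same up to floating-point rounding. The helper names are made up for this example.

```python
import math

def causal_mask(n):
    """n x n lower-triangular mask: position i may see positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions; blocked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)

# With equal scores everywhere, row i spreads its weight evenly over 0..i.
uniform_scores = [0.0, 0.0, 0.0, 0.0]
weights = [masked_softmax(uniform_scores, row) for row in mask]

for row in weights:
    print([round(w, 2) for w in row])
```

No weight ever lands above the diagonal, so a position's output cannot depend on tokens that come after it.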

Multi-Head Attention

The Problem with Single Attention

A single attention head produces only one set of weights per position, so it tends to capture one type of relationship at a time:

"The cat sat on the mat"

Single head might learn: Subject-Verb relationships
- "cat" → "sat" (subject-verb)

But misses:
- Spatial relationships: "sat" → "on"
- Object relationships: "on" → "mat"

The Solution: Multiple Heads

Run attention multiple times in parallel, each learning different patterns:

Head 1: Subject-Verb relationships
  "The" → [0.1, 0.1, 0.8, 0.0, 0.0, 0.0]  # Focuses on "sat"
  "cat" → [0.1, 0.7, 0.2, 0.0, 0.0, 0.0]  # Focuses on itself and "sat"

Head 2: Object-Preposition relationships
  "sat" → [0.0, 0.0, 0.1, 0.8, 0.1, 0.0]  # Focuses on "on"
  "on"  → [0.0, 0.0, 0.0, 0.1, 0.2, 0.7]  # Focuses on "mat"

Head 3: Positional relationships
  Each word attends to neighbors

... and so on; 8, 12, or more heads per layer are common

Implementation

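def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8):
    """
    Args:
        X: Input (batch, seq_len, d_model)
        W_q, W_k, W_v, W_o: Learned projections, each (d_model, d_model)
        num_heads: Number of attention heads (must divide d_model)
    
    Returns:
        output: (batch, seq_len, d_model)
    """
    batch, seq_len, d_model = X.shape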
    d_k = d_model // num_heads  # Split dimensions across heads
    
    # Create Q, K, V for all heads at once
    Q = X @ W_q  # (batch, seq_len, d_model)
    K = X @ W_k
    V = X @ W_v
    
    # Reshape to separate heads
    # (batch, seq_len, num_heads, d_k)
    Q = Q.reshape(batch, seq_len, num_heads, d_k)
    K = K.reshape(batch, seq_len, num_heads, d_k)
    V = V.reshape(batch, seq_len, num_heads, d_k)
    
    # Transpose to (batch, num_heads, seq_len, d_k)
    Q = Q.transpose(1, 2)
    K = K.transpose(1, 2)
    V = V.transpose(1, 2)
    
    # Apply attention for each head in parallel
    # (the function also returns the attention weights, unused here)
    attention_output, _ = scaled_dot_product_attention(Q, K, V)
    # Shape: (batch, num_heads, seq_len, d_k)
    
    # Concatenate heads
    attention_output = attention_output.transpose(1, 2)
    # Shape: (batch, seq_len, num_heads, d_k)
    
    attention_output = attention_output.reshape(batch, seq_len, d_model)
    # Shape: (batch, seq_len, d_model)
    
    # Final linear transformation
    output = attention_output @ W_o
    
    return output
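
The reshape and transpose steps are the part that most often causes confusion, so here is a dependency-free sketch of what they do for a single sequence (no batch dimension). The helper names `split_heads` and `merge_heads` are made up for this illustration; each head simply receives a contiguous slice of every position's feature vector.

```python
def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_k) for one sequence."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

def merge_heads(heads):
    """(num_heads, seq_len, d_k) -> (seq_len, d_model): concatenate per position."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[pos]]
            for pos in range(seq_len)]

x = [[1, 2, 3, 4, 5, 6, 7, 8],
     [9, 10, 11, 12, 13, 14, 15, 16]]   # seq_len=2, d_model=8

heads = split_heads(x, num_heads=4)     # 4 heads, each seeing 2 features
print(heads[0])                         # [[1, 2], [9, 10]]
print(merge_heads(heads) == x)          # True
```

Splitting and merging are exact inverses, so no information is lost by the reshape itself; the heads differ only in which slice of the projected features they attend over.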

Why It Works

Ensemble Effect: Multiple heads vote on what's important

Specialized Roles: Each head can specialize:

  • Syntactic relationships (grammar)

  • Semantic relationships (meaning)

  • Positional relationships (nearby words)

  • Long-range dependencies (distant words)

Empirical Success:

  • GPT-3 (175B): 96 attention heads per layer

  • BERT-base: 12 heads per layer

  • BERT-large: 16 heads per layer
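
Since each head works on d_model // num_heads features, the head counts above translate into per-head dimensions once the hidden size is known. The hidden sizes in this sketch (768, 1024, and 12288) are not stated in this document; they are taken from the published model configurations and should be treated as outside assumptions.

```python
# (heads per layer, hidden size d_model); hidden sizes come from the
# published model configurations, not from this document
models = {
    "BERT-base": (12, 768),
    "BERT-large": (16, 1024),
    "GPT-3 (175B)": (96, 12288),
}

head_dims = {name: d_model // heads for name, (heads, d_model) in models.items()}

for name, d_k in head_dims.items():
    print(f"{name}: d_k = {d_k}")
```

Even as models grow, the per-head dimension stays small (64 or 128 here); larger models mostly add more heads rather than making each head much wider.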

Practical Examples

Example 1: Machine Translation

English: "I love machine learning"
French:  "J'adore l'apprentissage automatique"

When generating "apprentissage":
Attention weights to English words:
  I          [0.05]
  love       [0.10]
  machine    [0.35] ████████
  learning   [0.50] ████████████

The model attends mostly to "learning" (the direct translation) and to "machine" when generating "apprentissage"

Example 2: Question Answering

Question: "Who invented the transformer?"
Context:  "The transformer architecture was invented by Vaswani et al. 
           in 2017 at Google Brain. The paper 'Attention is All You Need' 
           introduced this revolutionary architecture."

Attention when answering:
  architecture [0.20] ████
  invented     [0.30] ████████
  by           [0.05] █
  Vaswani      [0.35] █████████
  et           [0.05] █
  al           [0.05] █

Answer: "Vaswani et al."

Example 3: Sentiment Analysis

Review: "The movie was good but the ending was terrible"

When predicting sentiment:
Attention weights:
  The      [0.05]
  movie    [0.15] ███
  was      [0.02]
  good     [0.25] ██████
  but      [0.08] ██
  the      [0.02]
  ending   [0.18] ████
  was      [0.02]
  terrible [0.23] ██████

Model focuses on: "good" and "terrible"
Result: Mixed sentiment (conflicting signals)

Why It Works So Well

1. Parallel Processing

RNN (Sequential):

Time: T₁ → T₂ → T₃ → T₄ → T₅
Must wait for each step to complete

Attention (Parallel):

All positions computed simultaneously
Time: T₁ (one parallel pass over the whole sequence during training)
Typically much faster to train on GPUs than a step-by-step RNN

2. No Information Bottleneck

RNN: Compresses everything into a fixed-size hidden state

Attention: Direct access to all positions, no compression needed

3. Better Gradients

RNN: Gradients must flow through many timesteps (vanishing/exploding)

Attention: Direct paths from output to any input (more stable gradients)

4. Interpretability

You can visualize the attention weights to see what the model focuses on:

import matplotlib.pyplot as plt
import seaborn as sns

# attention_weights: (seq_len, seq_len) array; words: the seq_len tokens
sns.heatmap(attention_weights, 
            xticklabels=words, 
            yticklabels=words)
plt.title("Attention Weights")
plt.show()
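
When a plotting library is not available, the bar-style listings used in the examples above can be produced with plain text. This is a hypothetical helper, not part of any library; the name `render_attention` and the `width` parameter are assumptions for the sketch.

```python
def render_attention(words, weights, width=20):
    """Return one text line per word with a bar proportional to its weight."""
    lines = []
    for word, weight in zip(words, weights):
        bar = "█" * round(weight * width)
        lines.append(f"{word:<10}[{weight:.2f}] {bar}")
    return lines

# Weights for one query position, taken from the translation example.
words = ["I", "love", "machine", "learning"]
weights = [0.05, 0.10, 0.35, 0.50]

for line in render_attention(words, weights):
    print(line)
```

This shows a single row of the attention matrix at a time, which is often enough when inspecting what one particular token attends to.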

5. Transfer Learning

Pre-trained attention models (BERT, GPT) transfer well to new tasks:

  • Learn general language understanding

  • Fine-tune on specific tasks

  • Requires less task-specific data

Summary

Attention is the key innovation that enabled:

  • Modern language models (GPT, BERT, T5)

  • Vision transformers

  • Multi-modal models

  • State-of-the-art results across domains

Core concepts to remember:

  1. Attention weights determine what to focus on

  2. QKV mechanism allows flexible learning

  3. Multi-head attention captures multiple relationships

  4. Parallel processing makes training fast

  5. Direct connections enable long-range dependencies

Next: See how attention is used in the complete transformer architecture → transformer_architecture.md