Chapter 7: Attention Mechanism Deep Dive
The Full Attention Formula: Scaled Dot-Product Attention
The complete self-attention computation is captured in one elegant formula:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Here \(Q\), \(K\), and \(V\) are matrices where each row corresponds to one token's query, key, or value vector respectively. The product \(QK^T\) computes all pairwise attention scores simultaneously: entry \((i, j)\) measures how much token \(i\) should attend to token \(j\). The division by \(\sqrt{d_k}\) (the square root of the key dimension) is a critical scaling factor: without it, the dot products grow large as the dimension increases, pushing the softmax into saturated regions where gradients vanish. After softmax normalization (each row sums to 1), the result is multiplied by \(V\) to produce the weighted combination of values.
This entire operation is a matrix multiplication pipeline that runs efficiently on GPUs, which is one reason Transformers can scale to sequences of thousands of tokens and models with billions of parameters.
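The formula above can be sketched directly in NumPy. This is a minimal illustration, not the chapter's reference implementation; the function name and the random test inputs are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Illustrative random inputs: 5 tokens, key dimension 8
rng = np.random.default_rng(42)
seq_len, d_k = 5, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)       # (5, 8)
print(attn.sum(axis=-1))  # each row of the attention matrix sums to 1
```

Note that the whole computation is two matrix multiplications plus an elementwise softmax, which is why it maps so well onto GPU hardware.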
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)
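To see concretely why the \(\sqrt{d_k}\) factor matters, we can measure the spread of raw dot products as the dimension grows. This small numerical check is my own sketch: for vectors with unit-variance entries, the dot product has standard deviation roughly \(\sqrt{d_k}\), and dividing by \(\sqrt{d_k}\) restores unit scale before the softmax:

```python
import numpy as np

rng = np.random.default_rng(42)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k for random unit-variance q, k, raw and scaled."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (16, 256, 4096):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:5d}  raw std ~ {raw:7.2f}  scaled std ~ {scaled:.2f}")
```

The raw standard deviation grows like \(\sqrt{d_k}\) (about 64 at \(d_k = 4096\)), which would push softmax logits deep into saturation; the scaled version stays near 1 regardless of dimension.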
Multi-Head Attention: Learning Diverse Patterns in Parallel
A single attention head can only capture one type of relationship at a time. Multi-head attention runs \(h\) independent attention computations in parallel, each with its own learned \(W_Q^{(i)}\), \(W_K^{(i)}\), \(W_V^{(i)}\) projection matrices. The outputs are concatenated and linearly projected:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W_O, \qquad \text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)})\]
Each head operates in a lower-dimensional subspace (dimension \(d_{\text{model}} / h\) instead of \(d_{\text{model}}\)), so the total computational cost is similar to single-head attention at full dimension. The power comes from specialization: empirically, different heads learn to capture different linguistic patterns. In GPT-style models, researchers have found heads that specialize in subject-verb agreement across long distances, heads that track coreference (which "it" refers to), heads that copy rare tokens, and heads that handle positional relationships. This distributed, parallel processing of diverse linguistic features is a major reason why Transformers outperform previous architectures on language tasks.
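The split-compute-concatenate-project pattern above can be sketched in NumPy. This is an illustrative sketch with random matrices standing in for learned weights; the function names are mine, not the chapter's:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h, rng):
    """h independent attention heads of dimension d_model // h,
    concatenated and mixed by a final output projection W_O."""
    seq_len, d_model = X.shape
    d_head = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections (random stand-ins for learned parameters)
        W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)                  # (seq, d_head)
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o    # (seq, d_model)

rng = np.random.default_rng(42)
X = rng.standard_normal((6, 16))   # 6 tokens, d_model = 16
out = multi_head_self_attention(X, h=4, rng=rng)
print(out.shape)  # (6, 16): output shape matches the input
```

Because each head works in a \(d_{\text{model}} / h\) subspace, the loop does roughly the same total work as one full-dimension head, which is the cost argument made above.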