Chapter 7: Attention Mechanism Deep Dive
The Full Attention Formula: Scaled Dot-Product Attention
The complete self-attention computation is captured in one elegant formula:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Here \(Q\), \(K\), and \(V\) are matrices where each row corresponds to one token's query, key, or value vector respectively. The product \(QK^T\) computes all pairwise attention scores simultaneously: entry \((i, j)\) measures how much token \(i\) should attend to token \(j\). The division by \(\sqrt{d_k}\) (the square root of the key dimension) is a critical scaling factor: without it, the dot products grow large as the dimension increases, pushing the softmax into saturated regions where gradients vanish. After softmax normalization (each row sums to 1), the result is multiplied by \(V\) to produce the weighted combination of values.
This entire operation is a matrix multiplication pipeline that runs efficiently on GPUs, which is one reason Transformers can scale to sequences of thousands of tokens and models with billions of parameters.
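The formula above can be sketched directly in NumPy. This is a minimal illustration, not the chapter's reference implementation; the function name and the random test inputs are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Illustrative random inputs: 5 tokens, key dimension 8
rng = np.random.default_rng(42)
seq_len, d_k = 5, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)       # (5, 8)
print(attn.sum(axis=-1))  # each row of the attention matrix sums to 1
```

Note that the whole computation is two matrix multiplications plus an elementwise softmax, which is why it maps so well onto GPU hardware.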
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)
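To see concretely why the \(\sqrt{d_k}\) factor matters, we can measure the spread of raw dot products as the dimension grows. This small numerical check is my own sketch: for vectors with unit-variance entries, the dot product has standard deviation roughly \(\sqrt{d_k}\), and dividing by \(\sqrt{d_k}\) restores unit scale before the softmax:

```python
import numpy as np

rng = np.random.default_rng(42)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k for random unit-variance q, k, raw and scaled."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (16, 256, 4096):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:5d}  raw std ~ {raw:7.2f}  scaled std ~ {scaled:.2f}")
```

The raw standard deviation grows like \(\sqrt{d_k}\) (about 64 at \(d_k = 4096\)), which would push softmax logits deep into saturation; the scaled version stays near 1 regardless of dimension.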
Multi-Head Attention: Learning Diverse Patterns in Parallel
A single attention head can only capture one type of relationship at a time. Multi-head attention runs \(h\) independent attention computations in parallel, each with its own learned \(W_Q^{(i)}\), \(W_K^{(i)}\), \(W_V^{(i)}\) projection matrices. The outputs are concatenated and linearly projected:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W_O, \qquad \text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)})\]
Each head operates in a lower-dimensional subspace (dimension \(d_{\text{model}} / h\) instead of \(d_{\text{model}}\)), so the total computational cost is similar to single-head attention at full dimension. The power comes from specialization: empirically, different heads learn to capture different linguistic patterns. In GPT-style models, researchers have found heads that specialize in subject-verb agreement across long distances, heads that track coreference (which "it" refers to), heads that copy rare tokens, and heads that handle positional relationships. This distributed, parallel processing of diverse linguistic features is a major reason why Transformers outperform previous architectures on language tasks.
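The split-compute-concatenate-project pattern above can be sketched in NumPy. This is an illustrative sketch with random matrices standing in for learned weights; the function names are mine, not the chapter's:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h, rng):
    """h independent attention heads of dimension d_model // h,
    concatenated and mixed by a final output projection W_O."""
    seq_len, d_model = X.shape
    d_head = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections (random stand-ins for learned parameters)
        W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)                  # (seq, d_head)
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o    # (seq, d_model)

rng = np.random.default_rng(42)
X = rng.standard_normal((6, 16))   # 6 tokens, d_model = 16
out = multi_head_self_attention(X, h=4, rng=rng)
print(out.shape)  # (6, 16): output shape matches the input
```

Because each head works in a \(d_{\text{model}} / h\) subspace, the loop does roughly the same total work as one full-dimension head, which is the cost argument made above.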