Chapter 6: Attention in Transformers
The Attention Mechanism: How Tokens Communicate
The breakthrough idea of the Transformer architecture is self-attention: a mechanism that allows every token in a sequence to directly attend to (and gather information from) every other token, regardless of distance. In the sentence "The cat sat on the mat," the token "sat" needs to know about "cat" (its subject) to make useful predictions. In a traditional recurrent network, this information must be passed step by step through many hidden states. Attention creates a direct highway for information flow.
Mathematically, attention computes a weighted average of all token representations, where the weights are determined by the compatibility between tokens. High-attention-weight pairs like ("sat", "cat") indicate that "sat" should incorporate information from "cat" into its updated representation. This dynamic, context-dependent information routing is what gives Transformers their power: the same architecture can learn to route syntactic information (subject-verb agreement), semantic information (word meaning in context), and even factual associations, all through learned attention patterns.
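A minimal sketch of the weighted-average idea, using hand-picked (not learned) numbers for the sentence above; the vectors and the attention weights here are purely illustrative:

```python
import numpy as np

# Toy 4-dimensional representations for three tokens (illustrative values).
tokens = ["The", "cat", "sat"]
reps = np.array([
    [0.1, 0.0, 0.2, 0.1],   # "The"
    [0.9, 0.1, 0.0, 0.3],   # "cat"
    [0.0, 0.8, 0.1, 0.2],   # "sat"
])

# Hypothetical attention weights for "sat": it attends mostly to "cat".
# In a real model these come out of a softmax, so they sum to 1.
weights = np.array([0.05, 0.80, 0.15])

# Updated representation of "sat": a weighted average of all token vectors.
sat_updated = weights @ reps
print(sat_updated)   # blend dominated by the "cat" vector
```

Because the weight on "cat" is 0.80, the updated vector for "sat" lies close to the "cat" representation; changing the weights would route different information into it.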
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')                 # consistent plot styling
plt.rcParams['figure.figsize'] = (16, 10)  # default figure size
np.random.seed(42)                         # reproducible examples
Query, Key, Value: The Three Roles Every Token Plays
Each token's embedding is linearly projected into three separate vectors using learned weight matrices \(W_Q\), \(W_K\), and \(W_V\):
Query (\(Q = XW_Q\)): "What information am I looking for?" This represents what the token needs from other tokens.
Key (\(K = XW_K\)): "What information do I have to offer?" This represents what the token can provide to others.
Value (\(V = XW_V\)): "What do I actually communicate when attended to?" This is the content that gets passed along.
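The three projections can be sketched with random matrices standing in for the learned \(W_Q\), \(W_K\), \(W_V\); the sizes below are illustrative, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, d_model, d_k = 6, 16, 8              # illustrative sizes

X = rng.normal(size=(seq_len, d_model))       # one embedding per token
W_Q = rng.normal(size=(d_model, d_k))         # learned in practice; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token plays all three roles: it gets a query, a key, and a value.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)              # (6, 8) (6, 8) (6, 8)
```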
The attention score between token \(i\) and token \(j\) is computed as the dot product \(Q_i \cdot K_j\), measuring compatibility between what token \(i\) is looking for and what token \(j\) offers. High dot products mean high relevance. These scores are normalized via softmax to form a probability distribution (attention weights), then used to compute a weighted sum of the Value vectors. The result: each token's representation is updated with a contextually appropriate blend of information from all other tokens in the sequence.
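Putting the steps together, here is a self-contained NumPy sketch of the whole computation (scores, softmax, weighted sum). Note that the full Transformer additionally divides the scores by \(\sqrt{d_k}\) before the softmax; that scaling is included below:

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # compatibility of each (query, key) pair
    # Softmax over keys: each row becomes a probability distribution.
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # weighted sum of values, plus the weights

# Illustrative sizes and random (untrained) projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))
W_Q, W_K, W_V = (rng.normal(size=(12, 4)) for _ in range(3))

out, attn = attention(X, W_Q, W_K, W_V)
print(out.shape, attn.shape)                      # (5, 4) (5, 5)
assert np.allclose(attn.sum(axis=-1), 1.0)        # each row of weights sums to 1
```

Each row of `attn` is one token's attention distribution over the whole sequence, and the corresponding row of `out` is that token's updated representation.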