Chapter 6: Attention in Transformers
The Attention Mechanism: How Tokens Communicate
The breakthrough idea of the Transformer architecture is self-attention: a mechanism that allows every token in a sequence to directly attend to (and gather information from) every other token, regardless of distance. In the sentence "The cat sat on the mat," the token "sat" needs to know about "cat" (its subject) to make useful predictions. In a traditional recurrent network, this information must be passed step by step through many hidden states. Attention creates a direct highway for information flow.
Mathematically, attention computes a weighted average of all token representations, where the weights are determined by the compatibility between tokens. High-attention-weight pairs like ("sat", "cat") indicate that "sat" should incorporate information from "cat" into its updated representation. This dynamic, context-dependent information routing is what gives Transformers their power: the same architecture can learn to route syntactic information (subject-verb agreement), semantic information (word meaning in context), and even factual associations, all through learned attention patterns.
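A minimal sketch of the weighted-average idea, using hand-picked (not learned) numbers for the sentence above; the vectors and the attention weights here are purely illustrative:

```python
import numpy as np

# Toy 4-dimensional representations for three tokens (illustrative values).
tokens = ["The", "cat", "sat"]
reps = np.array([
    [0.1, 0.0, 0.2, 0.1],   # "The"
    [0.9, 0.1, 0.0, 0.3],   # "cat"
    [0.0, 0.8, 0.1, 0.2],   # "sat"
])

# Hypothetical attention weights for "sat": it attends mostly to "cat".
# In a real model these come out of a softmax, so they sum to 1.
weights = np.array([0.05, 0.80, 0.15])

# Updated representation of "sat": a weighted average of all token vectors.
sat_updated = weights @ reps
print(sat_updated)   # blend dominated by the "cat" vector
```

Because the weight on "cat" is 0.80, the updated vector for "sat" lies close to the "cat" representation; changing the weights would route different information into it.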
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')                 # consistent plot styling
plt.rcParams['figure.figsize'] = (16, 10)  # default figure size
np.random.seed(42)                         # reproducible examples
Query, Key, Value: The Three Roles Every Token Plays
Each token's embedding is linearly projected into three separate vectors using learned weight matrices \(W_Q\), \(W_K\), and \(W_V\):
Query (\(Q = XW_Q\)): "What information am I looking for?" This represents what the token needs from other tokens.
Key (\(K = XW_K\)): "What information do I have to offer?" This represents what the token can provide to others.
Value (\(V = XW_V\)): "What do I actually communicate when attended to?" This is the content that gets passed along.
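The three projections can be sketched with random matrices standing in for the learned \(W_Q\), \(W_K\), \(W_V\); the sizes below are illustrative, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, d_model, d_k = 6, 16, 8              # illustrative sizes

X = rng.normal(size=(seq_len, d_model))       # one embedding per token
W_Q = rng.normal(size=(d_model, d_k))         # learned in practice; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token plays all three roles: it gets a query, a key, and a value.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)              # (6, 8) (6, 8) (6, 8)
```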
The attention score between token \(i\) and token \(j\) is computed as the dot product \(Q_i \cdot K_j\), measuring compatibility between what token \(i\) is looking for and what token \(j\) offers. High dot products mean high relevance. These scores are normalized via softmax to form a probability distribution (attention weights), then used to compute a weighted sum of the Value vectors. The result: each token's representation is updated with a contextually appropriate blend of information from all other tokens in the sequence.
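Putting the steps together, here is a self-contained NumPy sketch of the whole computation (scores, softmax, weighted sum). Note that the full Transformer additionally divides the scores by \(\sqrt{d_k}\) before the softmax; that scaling is included below:

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # compatibility of each (query, key) pair
    # Softmax over keys: each row becomes a probability distribution.
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # weighted sum of values, plus the weights

# Illustrative sizes and random (untrained) projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))
W_Q, W_K, W_V = (rng.normal(size=(12, 4)) for _ in range(3))

out, attn = attention(X, W_Q, W_K, W_V)
print(out.shape, attn.shape)                      # (5, 4) (5, 5)
assert np.allclose(attn.sum(axis=-1), 1.0)        # each row of weights sums to 1
```

Each row of `attn` is one token's attention distribution over the whole sequence, and the corresponding row of `out` is that token's updated representation.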