Chapter 8: How GPT Stores Facts

Where Knowledge Lives Inside a Neural Network

When GPT correctly completes "The capital of France is ___" with "Paris," it is retrieving a fact. But unlike a database where facts are stored in neat rows and columns, a neural network's knowledge is distributed across millions of weight parameters. No single neuron or weight "knows" that Paris is the capital of France. Instead, the answer emerges from the coordinated interaction of many neurons across many layers.

This raises a profound question for AI interpretability: can we locate where specific facts are stored? The emerging field of mechanistic interpretability aims to reverse-engineer neural networks to answer this question. Early results suggest that factual knowledge is primarily stored in the MLP (feedforward) layers of the Transformer, which act as key-value memories. Attention heads route queries to the right MLP layers, which then retrieve and inject the relevant factual information. Understanding this internal structure is crucial for building AI systems that are reliable and safe, and whose knowledge can be audited or corrected.
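The key-value view of an MLP layer can be sketched in a few lines of NumPy. In this toy model (all vectors and dimensions are hypothetical, not taken from a real GPT), each row of the input projection acts as a "key" that pattern-matches the residual stream, and the corresponding row of the output projection is the "value" that gets written back when that key fires. We plant one key-value pair by hand: a "France" pattern that retrieves a "Paris" direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 16  # toy dimensions, far smaller than a real model

# Rows of W_in act as "keys" (pattern detectors); rows of W_out act as
# "values" (directions injected back into the residual stream).
W_in = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_mlp, d_model))

# Plant one hypothetical association: a key that recognizes a "France"
# pattern, whose value writes a "Paris" direction.
france = rng.normal(size=d_model)
france /= np.linalg.norm(france)
paris = rng.normal(size=d_model)
W_in[0] = 10.0 * france   # strong key for the France pattern
W_out[0] = paris          # value retrieved when that key fires

def mlp(x):
    scores = W_in @ x                # how strongly each key matches x
    gate = np.maximum(scores, 0.0)   # ReLU: only matching keys fire
    return gate @ W_out              # weighted sum of the fired values

out = mlp(france)
# Because key 0 dominates, the output points mostly along "paris".
cos = out @ paris / (np.linalg.norm(out) * np.linalg.norm(paris))
print(f"cosine(out, paris) = {cos:.2f}")
```

This is only the retrieval half of the story: in a real Transformer, attention must first move the right information (e.g. the "France" token's representation) into position before the MLP key can match it.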

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)

Locating Facts: From Early Layers to Late Layers

Research in mechanistic interpretability has revealed a rough division of labor across the Transformer’s layers:

  • Early layers (1-6 in a 12-layer model): Detect basic syntactic patterns, part-of-speech tagging, and local context. These layers build a "parse" of the input.

  • Middle layers (6-9): Perform factual association and entity recognition. This is where the MLP layers act as learned lookup tables: the input pattern "Michael Jordan plays" activates MLP neurons that encode the association with "basketball."

  • Late layers (9-12): Refine the prediction for the specific task, incorporating context to select among candidate answers and format the output distribution.

Techniques like causal tracing (Meng et al., 2022) can pinpoint which layers and which specific weight matrices are responsible for a given factual recall. This has led to methods like ROME (Rank-One Model Editing) that can surgically update individual facts in a trained model by modifying a small number of weights, without retraining the entire network.
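The core trick behind ROME can be illustrated with a deliberately simplified sketch. Treat an MLP weight matrix as a linear associative memory that maps a key vector to a value vector (W @ k ≈ v). To rewrite the fact stored at key k, add a rank-one update chosen so the edited matrix maps k exactly to the new value. The key, value, and dimensions below are hypothetical toys; real ROME additionally whitens the update by the covariance of observed keys, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# Toy weight matrix acting as a linear associative memory: W @ key ~= value.
W = rng.normal(size=(d, d))

# Hypothetical key for a fact, and the new value we want it to retrieve.
k = rng.normal(size=d)
v_new = rng.normal(size=d)

# Rank-one edit: choose delta so that (W + delta) @ k == v_new exactly.
residual = v_new - W @ k
delta = np.outer(residual, k) / (k @ k)
W_edited = W + delta

print("recall error after edit:", np.linalg.norm(W_edited @ k - v_new))

# Other keys are perturbed in proportion to their overlap with k:
# a key orthogonal to k is untouched, a correlated one shifts slightly.
k_other = rng.normal(size=d)
print("change on another key:", np.linalg.norm((W_edited - W) @ k_other))
```

Because delta has rank one, the edit touches only the single direction spanned by k on the input side, which is why ROME can rewrite one fact while leaving most of the model's other behavior intact.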