Chapter 8: How GPT Stores Facts

Where Knowledge Lives Inside a Neural Network

When GPT correctly completes "The capital of France is ___" with "Paris," it is retrieving a fact. But unlike a database where facts are stored in neat rows and columns, a neural network's knowledge is distributed across millions of weight parameters. No single neuron or weight "knows" that Paris is the capital of France. Instead, the answer emerges from the coordinated interaction of many neurons across many layers.

This raises a profound question for AI interpretability: can we locate where specific facts are stored? The emerging field of mechanistic interpretability aims to reverse-engineer neural networks to answer this question. Early results suggest that factual knowledge is primarily stored in the MLP (feedforward) layers of the Transformer, which act as key-value memories. Attention heads route queries to the right MLP layers, which then retrieve and inject the relevant factual information. Understanding this internal structure is crucial for building AI systems that are reliable and safe, and whose knowledge can be audited or corrected.
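The key-value view of an MLP layer can be sketched in a few lines of NumPy. In this toy model (all vectors and dimensions are hypothetical, not taken from a real GPT), each row of the input projection acts as a "key" that pattern-matches the residual stream, and the corresponding row of the output projection is the "value" that gets written back when that key fires. We plant one key-value pair by hand: a "France" pattern that retrieves a "Paris" direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 16  # toy dimensions, far smaller than a real model

# Rows of W_in act as "keys" (pattern detectors); rows of W_out act as
# "values" (directions injected back into the residual stream).
W_in = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_mlp, d_model))

# Plant one hypothetical association: a key that recognizes a "France"
# pattern, whose value writes a "Paris" direction.
france = rng.normal(size=d_model)
france /= np.linalg.norm(france)
paris = rng.normal(size=d_model)
W_in[0] = 10.0 * france   # strong key for the France pattern
W_out[0] = paris          # value retrieved when that key fires

def mlp(x):
    scores = W_in @ x                # how strongly each key matches x
    gate = np.maximum(scores, 0.0)   # ReLU: only matching keys fire
    return gate @ W_out              # weighted sum of the fired values

out = mlp(france)
# Because key 0 dominates, the output points mostly along "paris".
cos = out @ paris / (np.linalg.norm(out) * np.linalg.norm(paris))
print(f"cosine(out, paris) = {cos:.2f}")
```

This is only the retrieval half of the story: in a real Transformer, attention must first move the right information (e.g. the "France" token's representation) into position before the MLP key can match it.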

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)

Locating Facts: From Early Layers to Late Layers

Research in mechanistic interpretability has revealed a rough division of labor across the Transformer’s layers:

  • Early layers (1-6 in a 12-layer model): Detect basic syntactic patterns, part-of-speech tagging, and local context. These layers build a "parse" of the input.

  • Middle layers (6-9): Perform factual association and entity recognition. This is where the MLP layers act as learned lookup tables: the input pattern "Michael Jordan plays" activates MLP neurons that encode the association with "basketball."

  • Late layers (9-12): Refine the prediction for the specific task, incorporating context to select among candidate answers and format the output distribution.

Techniques like causal tracing (Meng et al., 2022) can pinpoint which layers and which specific weight matrices are responsible for a given factual recall. This has led to methods like ROME (Rank-One Model Editing) that can surgically update individual facts in a trained model by modifying a small number of weights, without retraining the entire network.
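The core trick behind ROME can be illustrated with a deliberately simplified sketch. Treat an MLP weight matrix as a linear associative memory that maps a key vector to a value vector (W @ k ≈ v). To rewrite the fact stored at key k, add a rank-one update chosen so the edited matrix maps k exactly to the new value. The key, value, and dimensions below are hypothetical toys; real ROME additionally whitens the update by the covariance of observed keys, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# Toy weight matrix acting as a linear associative memory: W @ key ~= value.
W = rng.normal(size=(d, d))

# Hypothetical key for a fact, and the new value we want it to retrieve.
k = rng.normal(size=d)
v_new = rng.normal(size=d)

# Rank-one edit: choose delta so that (W + delta) @ k == v_new exactly.
residual = v_new - W @ k
delta = np.outer(residual, k) / (k @ k)
W_edited = W + delta

print("recall error after edit:", np.linalg.norm(W_edited @ k - v_new))

# Other keys are perturbed in proportion to their overlap with k:
# a key orthogonal to k is untouched, a correlated one shifts slightly.
k_other = rng.normal(size=d)
print("change on another key:", np.linalg.norm((W_edited - W) @ k_other))
```

Because delta has rank one, the edit touches only the single direction spanned by k on the input side, which is why ROME can rewrite one fact while leaving most of the model's other behavior intact.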