Chapter 5: GPT and Large Language Models¶
From Neural Networks to Language: How GPT Works¶
GPT (Generative Pre-trained Transformer) is a neural network trained on a deceptively simple task: given a sequence of words (or tokens), predict the next one. Despite this simplicity, scaling this objective to billions of parameters and trillions of training tokens produces models that can write essays, solve math problems, translate languages, and reason about complex scenarios.
The “Generative” means it produces text one token at a time. The “Pre-trained” means it learns general language understanding from vast text corpora before being fine-tuned for specific tasks. The “Transformer” refers to the specific neural network architecture (introduced in the 2017 paper “Attention Is All You Need”) that replaced recurrent networks with a mechanism called self-attention, enabling parallel processing of entire sequences and dramatically improving training efficiency. GPT-3 has 175 billion parameters; GPT-4 is believed to be substantially larger. These parameters collectively encode statistical patterns of language at every scale, from character-level spelling to paragraph-level reasoning.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)
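The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-head version of scaled dot-product attention with a causal mask; the random projection matrices stand in for learned weights, and the dimensions are toy values, not those of any real GPT model.

```python
import numpy as np

np.random.seed(0)

def self_attention(X):
    """Toy single-head causal self-attention over embeddings X of shape (seq_len, d)."""
    d = X.shape[1]
    # In a real model W_q, W_k, W_v are learned; here they are random for illustration
    W_q, W_k, W_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarities
    # Causal mask: each position may attend only to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of value vectors

X = np.random.randn(5, 8)  # 5 tokens, 8-dimensional embeddings
out = self_attention(X)
print("Output shape:", out.shape)
```

The causal mask is what makes the model autoregressive: because position *t* never sees positions after *t*, every position in a training sequence can serve as a next-token prediction example in parallel.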
Next-Token Prediction: Simple Objective, Emergent Intelligence¶
The training objective is autoregressive language modeling: given the sequence “The cat sat on the,” the model must output a probability distribution over the entire vocabulary for the next token. The correct answer might assign high probability to “mat,” “floor,” “chair,” and lower probability to unlikely continuations. The loss function is cross-entropy, which measures how well the predicted distribution matches the actual next token in the training data.
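For a single training example, cross-entropy reduces to the negative log-probability the model assigned to the token that actually appeared. The tiny example below uses a hand-written, illustrative distribution over a five-word vocabulary, not output from a real model:

```python
import numpy as np

# Illustrative predicted distribution for the token after "The cat sat on the"
vocab = ["mat", "floor", "chair", "moon", "idea"]
probs = np.array([0.55, 0.20, 0.15, 0.06, 0.04])  # must sum to 1

# Cross-entropy loss for one example: -log(probability of the actual next token)
loss_likely = -np.log(probs[vocab.index("mat")])    # plausible continuation
loss_unlikely = -np.log(probs[vocab.index("idea")])  # implausible continuation

print(f"Loss if 'mat' was the actual token:  {loss_likely:.3f}")
print(f"Loss if 'idea' was the actual token: {loss_unlikely:.3f}")
```

Assigning high probability to the observed token yields a small loss; assigning it low probability yields a large one, and gradient descent nudges billions of weights to shrink this loss across trillions of tokens.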
What makes this remarkable is that to predict well, the model must implicitly learn grammar, facts about the world, logical reasoning, and even aspects of common sense. If the context is “The capital of France is,” predicting “Paris” requires factual knowledge. If the context involves a logical argument, predicting the conclusion requires reasoning ability. The model does not explicitly learn rules; it discovers statistical patterns that approximate these capabilities, distributed across billions of weight parameters.
Tokenization and Embeddings: Representing Language as Vectors¶
Before text enters the Transformer, it must be converted to numbers. Tokenization splits text into sub-word units (tokens): common words like “the” get their own token, while rare words are broken into pieces (e.g., “unbelievable” might become “un” + “believ” + “able”). GPT-style models typically use vocabularies of 50,000–100,000 tokens.
Each token is then mapped to a high-dimensional embedding vector (typically 768 to 12,288 dimensions depending on model size) via a learned lookup table. These embeddings are not random; through training, tokens with similar meanings end up with similar vectors. The embedding space develops rich geometric structure: vector arithmetic like \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\) emerges naturally. The code below demonstrates basic tokenization using str.split() as a simplified stand-in for the actual byte-pair encoding (BPE) algorithm used in practice.
# Simple tokenization
text = "The quick brown fox"
tokens = text.split()
print("Tokens:", tokens)
print(f"Count: {len(tokens)}")
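Continuing the toy example, the next step is the embedding lookup: each token is assigned an integer ID, and that ID indexes a row of an embedding table. Here the table is filled with random numbers purely for illustration; in a real model its entries are learned parameters, updated by gradient descent alongside everything else.

```python
import numpy as np

tokens = ["The", "quick", "brown", "fox"]
vocab = {tok: i for i, tok in enumerate(tokens)}  # token -> integer ID
d_model = 8  # tiny embedding dimension (real models use 768-12,288)

# Random stand-in for a learned embedding table: one row per vocabulary entry
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = [vocab[t] for t in tokens]
embedded = embedding_table[token_ids]  # (seq_len, d_model): one vector per token

print("Token IDs:", token_ids)
print("Embedded shape:", embedded.shape)
```

The result is the (sequence length × embedding dimension) matrix that, after positional information is added, becomes the input to the first Transformer layer.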