Chapter 5: GPT and Large Language Models¶
From Neural Networks to Language: How GPT Works¶
GPT (Generative Pre-trained Transformer) is a neural network trained on a deceptively simple task: given a sequence of words (or tokens), predict the next one. Despite this simplicity, scaling this objective to billions of parameters and trillions of training tokens produces models that can write essays, solve math problems, translate languages, and reason about complex scenarios.
The “Generative” means it produces text one token at a time. The “Pre-trained” means it learns general language understanding from vast text corpora before being fine-tuned for specific tasks. The “Transformer” refers to the specific neural network architecture (introduced in the 2017 paper “Attention Is All You Need”) that replaced recurrent networks with a mechanism called self-attention, enabling parallel processing of entire sequences and dramatically improving training efficiency. GPT-3 has 175 billion parameters; GPT-4 is believed to be substantially larger. These parameters collectively encode statistical patterns of language at every scale, from character-level spelling to paragraph-level reasoning.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Circle
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)
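The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-head version of scaled dot-product attention with a causal mask; the random projection matrices stand in for learned weights, and the dimensions are toy values, not those of any real GPT model.

```python
import numpy as np

np.random.seed(0)

def self_attention(X):
    """Toy single-head causal self-attention over embeddings X of shape (seq_len, d)."""
    d = X.shape[1]
    # In a real model W_q, W_k, W_v are learned; here they are random for illustration
    W_q, W_k, W_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarities
    # Causal mask: each position may attend only to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of value vectors

X = np.random.randn(5, 8)  # 5 tokens, 8-dimensional embeddings
out = self_attention(X)
print("Output shape:", out.shape)
```

The causal mask is what makes the model autoregressive: because position *t* never sees positions after *t*, every position in a training sequence can serve as a next-token prediction example in parallel.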
Next-Token Prediction: Simple Objective, Emergent Intelligence¶
The training objective is autoregressive language modeling: given the sequence “The cat sat on the,” the model must output a probability distribution over the entire vocabulary for the next token. The correct answer might assign high probability to “mat,” “floor,” “chair,” and lower probability to unlikely continuations. The loss function is cross-entropy, which measures how well the predicted distribution matches the actual next token in the training data.
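For a single training example, cross-entropy reduces to the negative log-probability the model assigned to the token that actually appeared. The tiny example below uses a hand-written, illustrative distribution over a five-word vocabulary, not output from a real model:

```python
import numpy as np

# Illustrative predicted distribution for the token after "The cat sat on the"
vocab = ["mat", "floor", "chair", "moon", "idea"]
probs = np.array([0.55, 0.20, 0.15, 0.06, 0.04])  # must sum to 1

# Cross-entropy loss for one example: -log(probability of the actual next token)
loss_likely = -np.log(probs[vocab.index("mat")])    # plausible continuation
loss_unlikely = -np.log(probs[vocab.index("idea")])  # implausible continuation

print(f"Loss if 'mat' was the actual token:  {loss_likely:.3f}")
print(f"Loss if 'idea' was the actual token: {loss_unlikely:.3f}")
```

Assigning high probability to the observed token yields a small loss; assigning it low probability yields a large one, and gradient descent nudges billions of weights to shrink this loss across trillions of tokens.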
What makes this remarkable is that to predict well, the model must implicitly learn grammar, facts about the world, logical reasoning, and even aspects of common sense. If the context is “The capital of France is,” predicting “Paris” requires factual knowledge. If the context involves a logical argument, predicting the conclusion requires reasoning ability. The model does not explicitly learn rules; it discovers statistical patterns that approximate these capabilities, distributed across billions of weight parameters.
Tokenization and Embeddings: Representing Language as Vectors¶
Before text enters the Transformer, it must be converted to numbers. Tokenization splits text into sub-word units (tokens): common words like “the” get their own token, while rare words are broken into pieces (e.g., “unbelievable” might become “un” + “believ” + “able”). GPT-style models typically use vocabularies of 50,000–100,000 tokens.
Each token is then mapped to a high-dimensional embedding vector (typically 768 to 12,288 dimensions depending on model size) via a learned lookup table. These embeddings are not random; through training, tokens with similar meanings end up with similar vectors. The embedding space develops rich geometric structure: vector arithmetic like \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\) emerges naturally. The code below demonstrates basic tokenization using str.split() as a simplified stand-in for the actual byte-pair encoding (BPE) algorithm used in practice.
# Simple tokenization
text = "The quick brown fox"
tokens = text.split()
print("Tokens:", tokens)
print(f"Count: {len(tokens)}")
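Continuing the toy example, the next step is the embedding lookup: each token is assigned an integer ID, and that ID indexes a row of an embedding table. Here the table is filled with random numbers purely for illustration; in a real model its entries are learned parameters, updated by gradient descent alongside everything else.

```python
import numpy as np

tokens = ["The", "quick", "brown", "fox"]
vocab = {tok: i for i, tok in enumerate(tokens)}  # token -> integer ID
d_model = 8  # tiny embedding dimension (real models use 768-12,288)

# Random stand-in for a learned embedding table: one row per vocabulary entry
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = [vocab[t] for t in tokens]
embedded = embedding_table[token_ids]  # (seq_len, d_model): one vector per token

print("Token IDs:", token_ids)
print("Embedded shape:", embedded.shape)
```

The result is the (sequence length × embedding dimension) matrix that, after positional information is added, becomes the input to the first Transformer layer.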