# Phase 4: Tokenization

**Goal:** Understand how text is converted into numbers that language models can process. This is the foundational step before embeddings and neural networks.
## Why Tokenization Matters

Every LLM interaction starts with tokenization. When you send a prompt to GPT-4 or Claude, it is first split into tokens, the sub-word units that the model actually processes. Understanding tokenization helps you:

- **Write better prompts:** avoid patterns that waste tokens and increase cost
- **Estimate costs accurately:** APIs charge per token, not per word
- **Debug model behavior:** some languages tokenize less efficiently than English
- **Build production systems:** fast tokenization is critical for throughput

**Key fact:** "tokenization" is 4 tokens in GPT-4; "Hello" is 1 token. An average English word is ~1.3 tokens.
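The ~1.3 tokens-per-word figure can be turned into a quick back-of-the-envelope cost estimator. This is only a heuristic sketch (the real count requires the model's actual tokenizer, and the price used below is a made-up placeholder, not any provider's real rate):

```python
# Rough token/cost estimate from the ~1.3 tokens-per-word heuristic above.
TOKENS_PER_WORD = 1.3

def estimate_tokens(text: str) -> int:
    """Approximate token count for English text (heuristic, not exact)."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Estimated API cost; the per-1K-token price is caller-supplied."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Explain how byte pair encoding builds a tokenizer vocabulary"
print(estimate_tokens(prompt))  # 9 words -> ~12 tokens
```

For exact counts, swap the heuristic for a real tokenizer such as tiktoken (covered in notebook 2).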
## Notebooks: Work in This Order

| # | What You Learn | Time |
|---|---|---|
| 1 | HuggingFace tokenizers API, encode/decode, special tokens | 45 min |
| 2 | OpenAI's TikToken library, count tokens in prompts | 30 min |
| 3 | Google's SentencePiece (used in T5, LLaMA) | 30 min |
| 4 | Train a BPE tokenizer on your own data | 60 min |
| 5 | WordPiece, Unigram, and special handling | 45 min |
| 6 | Tokenization as part of the full NLP pipeline | 45 min |
| 7 | Hands-on exploration and visualization | 30 min |
| 8 | Practice problems with solutions | 45 min |
## Key Concepts

### The Three Main Tokenization Algorithms

**BPE (Byte Pair Encoding)** (used by GPT-2, GPT-3, GPT-4, RoBERTa)

- Starts with individual bytes/characters
- Iteratively merges the most frequent adjacent pair into a new token
- Result: common words are single tokens, rare words are split into subwords
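One BPE training step can be sketched in a few lines of plain Python. This is a toy illustration of the merge rule, not the optimized implementation any real library uses:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word as a tuple of characters
corpus = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
pair = most_frequent_pair(corpus)  # ('l', 'o') appears 8 times
corpus = merge_pair(corpus, pair)  # 'l','o' is now the single symbol 'lo'
```

Training repeats this loop until the vocabulary reaches its target size; notebook 4 does this for real with the HuggingFace `tokenizers` library.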
**WordPiece** (used by BERT, DistilBERT)

- Similar to BPE, but selects merges by likelihood gain rather than raw frequency
- Subword continuations are marked with a `##` prefix (e.g. `playing` → `play` + `##ing`)
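At inference time, BERT's WordPiece uses greedy longest-match-first segmentation. A minimal sketch of that lookup (the vocabulary here is a tiny hand-built set, not a real BERT vocab):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate  # longest piece in vocab wins
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "token", "##ization"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
```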
**SentencePiece / Unigram** (used by T5, LLaMA, Mistral, Gemma)

- Language-agnostic: tokenizes the raw text stream directly, with no whitespace pre-tokenization; spaces are encoded as the meta symbol `▁`, so detokenization is lossless
- Works well for multilingual models
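The `▁` whitespace trick is what makes SentencePiece round-trips lossless. A tiny illustration of the idea (this mimics the convention only; it is not the SentencePiece library):

```python
META = "\u2581"  # "▁", the SentencePiece whitespace meta symbol

def sp_encode_whitespace(text: str) -> str:
    """Mark word boundaries so the token stream can be decoded losslessly."""
    return META + text.replace(" ", META)

def sp_decode_whitespace(pieces: str) -> str:
    """Invert the encoding: meta symbols become spaces again."""
    return pieces.replace(META, " ").lstrip(" ")

encoded = sp_encode_whitespace("Hello world")
print(encoded)                        # ▁Hello▁world
print(sp_decode_whitespace(encoded))  # Hello world
```

Because spaces survive inside the token stream, no language-specific word splitter is needed, which is why this scheme suits multilingual models.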
### Token Vocabulary Size

| Model | Vocab Size | Algorithm |
|---|---|---|
| GPT-2 | 50,257 | BPE |
| GPT-4 / TikToken | 100,277 | BPE |
| BERT | 30,522 | WordPiece |
| LLaMA 3 | 128,256 | BPE (tiktoken-based) |
| T5 | 32,100 | SentencePiece |
A larger vocabulary means fewer tokens per sentence (and thus faster inference), but a larger embedding table.
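The embedding-table cost is easy to make concrete: the table holds `vocab_size × d_model` parameters. The hidden size below is an illustrative round number, not the configuration of any specific model:

```python
# Embedding table parameters = vocab_size * d_model.
# d_model = 4096 is an illustrative hidden size, not a real model config.
d_model = 4096
for name, vocab_size in [("GPT-2-sized vocab", 50_257),
                         ("LLaMA-3-sized vocab", 128_256)]:
    params = vocab_size * d_model
    print(f"{name}: {params / 1e6:.0f}M embedding parameters")
```

At this hidden size, moving from a 50K to a 128K vocabulary more than doubles the embedding parameters, which is the trade-off the sentence above describes.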
Reference GuidesΒΆ
intro.md β Conceptual introduction to tokenization
huggingface_tokenizers_guide.md β Comprehensive HuggingFace guide
README_TOKENIZERS.md β HuggingFace tokenizers library deep dive
README_TIKTOKEN.md β OpenAI TikToken library guide
04_production_guide.md β Using tokenizers in production
05_tokenizer_comparison.md β Comparing different tokenizers
06_integration_guide.md β Integrating tokenizers into pipelines
## Practice Projects

- **Token Counter Tool:** build a CLI tool that counts tokens in any text file for a given model
- **Tokenization Visualizer:** color-code tokens in a Streamlit app
- **Multilingual Efficiency Analyzer:** compare token efficiency across English, Spanish, Chinese, Arabic
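As a starting point for the visualizer project, the color-coding idea can be prototyped in the terminal before building the Streamlit UI. The whitespace split below is a placeholder tokenizer; swap in a real one (e.g. tiktoken or HuggingFace tokenizers) to see actual token boundaries:

```python
# Cycle each token through ANSI background colors so boundaries are visible.
COLORS = ["\033[41m", "\033[42m", "\033[44m", "\033[45m"]  # ANSI backgrounds
RESET = "\033[0m"

def colorize_tokens(tokens):
    """Wrap each token in a cycling ANSI background color."""
    return "".join(f"{COLORS[i % len(COLORS)]}{tok}{RESET}"
                   for i, tok in enumerate(tokens))

# Placeholder tokenizer: whitespace split (replace with a real tokenizer)
print(colorize_tokens("to ken ize".split()))
```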
## What to Learn Next

After tokenization, move on to `05-embeddings/` to learn how tokens become meaningful vectors.
## External Resources

| Resource | Type |
|---|---|
| Karpathy: Let's build the GPT Tokenizer | Video (90 min) |
| HuggingFace Tokenizers Docs | Docs |
| TikToken GitHub | Repo |
| SentencePiece GitHub | Repo |
| Tiktokenizer (web tool) | Tool |