# Phase 4: Tokenization – Start Here

Understand how LLMs convert text into numbers – the foundation of everything in NLP.

## What Is Tokenization?
Before any LLM can process text, it must be split into tokens – integer IDs from a fixed vocabulary. This phase covers how that works and why it matters.
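A minimal sketch of the idea, using a tiny hypothetical vocabulary (real tokenizers learn subword vocabularies of tens of thousands of entries from data; the pieces and IDs below are made up for illustration):

```python
# Hypothetical vocabulary mapping text pieces to integer token IDs.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3, "<unk>": 4}

def encode(text, pieces):
    """Greedily match the longest known piece at the front of the string."""
    ids = []
    while text:
        for piece in sorted(pieces, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(pieces[piece])
                text = text[len(piece):]
                break
        else:
            # No piece matches: emit an unknown-token ID and skip one character.
            ids.append(pieces["<unk>"])
            text = text[1:]
    return ids

print(encode("Hello, world!", vocab))  # [0, 1, 2, 3]
```

Real tokenizers differ in how they learn the vocabulary and break ties, which is exactly what the notebooks below explore.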
## Notebooks in This Phase
| Notebook | Topic |
|---|---|
| 01_tokenizers_quickstart.ipynb | BPE, WordPiece, SentencePiece overview |
| 02_tokenizers_training.ipynb | Train your own tokenizer on custom data |
| 03_advanced_training_methods.ipynb | Unigram LM, tokenizer design decisions |
| 06_pipeline_components.ipynb | Hugging Face tokenizer pipeline internals |
| tiktoken_example.ipynb | OpenAI tiktoken – fast BPE for GPT models |
| sentencepiece_example.ipynb | Google SentencePiece for multilingual models |
| token_exploration.ipynb | Visualize how different models tokenize text |
| token_exercises.ipynb | Hands-on exercises to solidify understanding |
## Prerequisites

- Python basics (Phase 01)
- Some familiarity with LLMs is helpful but not required
## Key Concepts You'll Learn

- BPE (Byte Pair Encoding) – used by GPT models
- WordPiece – used by BERT
- SentencePiece / Unigram – used by T5, LLaMA
- tiktoken – OpenAI's fast tokenizer
- Why token count matters for cost and context limits
- How tokenization affects multilingual models
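To make the BPE concept concrete before diving into the notebooks, here is a hedged sketch of a single BPE merge step: count adjacent symbol pairs across a word-frequency corpus, then merge the most frequent pair. The corpus and symbols are the classic toy example; this is illustrative, not the exact code any library uses.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # first-seen pair wins ties

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, each word pre-split into characters (standard BPE setup).
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

pair = most_frequent_pair(corpus)   # ("e", "s"): 6 + 3 = 9 occurrences
corpus = merge_pair(corpus, pair)   # "newest" is now ("n","e","w","es","t")
```

Training a real tokenizer simply repeats this merge loop until the vocabulary reaches a target size; notebook 02 walks through the full process on custom data.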
## Learning Path

1. 01_tokenizers_quickstart.ipynb – Start here
2. tiktoken_example.ipynb – OpenAI API users: do this early
3. sentencepiece_example.ipynb
4. 02_tokenizers_training.ipynb
5. 03_advanced_training_methods.ipynb
6. 06_pipeline_components.ipynb
7. token_exploration.ipynb
8. token_exercises.ipynb – Practice