# Phase 4: Tokenization – Start Here

Understand how LLMs convert text into numbers – the foundation of everything in NLP.

## What Is Tokenization?
Before any LLM can process text, it must be split into tokens – integer IDs from a fixed vocabulary. This phase covers how that works and why it matters.
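A minimal sketch of the idea, using a tiny hypothetical vocabulary (real tokenizers learn subword vocabularies of tens of thousands of entries from data; the pieces and IDs below are made up for illustration):

```python
# Hypothetical vocabulary mapping text pieces to integer token IDs.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3, "<unk>": 4}

def encode(text, pieces):
    """Greedily match the longest known piece at the front of the string."""
    ids = []
    while text:
        for piece in sorted(pieces, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(pieces[piece])
                text = text[len(piece):]
                break
        else:
            # No piece matches: emit an unknown-token ID and skip one character.
            ids.append(pieces["<unk>"])
            text = text[1:]
    return ids

print(encode("Hello, world!", vocab))  # [0, 1, 2, 3]
```

Real tokenizers differ in how they learn the vocabulary and break ties, which is exactly what the notebooks below explore.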
## Notebooks in This Phase
| Notebook | Topic |
|---|---|
| 01_tokenizers_quickstart.ipynb | BPE, WordPiece, SentencePiece overview |
| 02_tokenizers_training.ipynb | Train your own tokenizer on custom data |
| 03_advanced_training_methods.ipynb | Unigram LM, tokenizer design decisions |
| 06_pipeline_components.ipynb | Hugging Face tokenizer pipeline internals |
| tiktoken_example.ipynb | OpenAI tiktoken – fast BPE for GPT models |
| sentencepiece_example.ipynb | Google SentencePiece for multilingual models |
| token_exploration.ipynb | Visualize how different models tokenize text |
| token_exercises.ipynb | Hands-on exercises to solidify understanding |
## Prerequisites

- Python basics (Phase 01)
- Some familiarity with LLMs is helpful but not required
## Key Concepts You'll Learn

- BPE (Byte Pair Encoding) – used by GPT models
- WordPiece – used by BERT
- SentencePiece / Unigram – used by T5, LLaMA
- tiktoken – OpenAI's fast tokenizer
- Why token count matters for cost and context limits
- How tokenization affects multilingual models
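To make the BPE concept concrete before diving into the notebooks, here is a hedged sketch of a single BPE merge step: count adjacent symbol pairs across a word-frequency corpus, then merge the most frequent pair. The corpus and symbols are the classic toy example; this is illustrative, not the exact code any library uses.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # first-seen pair wins ties

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, each word pre-split into characters (standard BPE setup).
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

pair = most_frequent_pair(corpus)   # ("e", "s"): 6 + 3 = 9 occurrences
corpus = merge_pair(corpus, pair)   # "newest" is now ("n","e","w","es","t")
```

Training a real tokenizer simply repeats this merge loop until the vocabulary reaches a target size; notebook 02 walks through the full process on custom data.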
## Learning Path

1. 01_tokenizers_quickstart.ipynb – Start here
2. tiktoken_example.ipynb – OpenAI API users: do this early
3. sentencepiece_example.ipynb
4. 02_tokenizers_training.ipynb
5. 03_advanced_training_methods.ipynb
6. 06_pipeline_components.ipynb
7. token_exploration.ipynb
8. token_exercises.ipynb – Practice