Phase 4: Tokenization - Start Here

Understand how LLMs convert text into numbers: the foundation of everything in NLP.

What Is Tokenization?

Before any LLM can process text, the text must be split into tokens: integer IDs drawn from a fixed vocabulary. This phase covers how that works and why it matters.
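As a toy sketch of that idea, here is a hypothetical greedy tokenizer with a hand-built seven-entry vocabulary (illustrative only; real tokenizers learn vocabularies of tens of thousands of pieces from data, but the text → integer IDs → text round trip works the same way):

```python
# Toy longest-match tokenizer over a hand-built vocabulary (illustrative;
# real vocabularies like GPT's hold 30k-100k+ learned pieces).
VOCAB = {"Token": 0, "ization": 1, " turns": 2, " text": 3,
         " into": 4, " numbers": 5, ".": 6}
ID_TO_PIECE = {i: piece for piece, i in VOCAB.items()}

def tokenize(text):
    ids, i = [], 0
    while i < len(text):
        # Take the longest vocabulary piece that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

def detokenize(ids):
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = tokenize("Tokenization turns text into numbers.")
print(ids)              # [0, 1, 2, 3, 4, 5, 6]
print(detokenize(ids))  # Tokenization turns text into numbers.
```

Note that decoding is just the inverse lookup, which is why tokenization is (mostly) lossless.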

Notebooks in This Phase

Notebook                              Topic
01_tokenizers_quickstart.ipynb        BPE, WordPiece, SentencePiece overview
02_tokenizers_training.ipynb          Train your own tokenizer on custom data
03_advanced_training_methods.ipynb    Unigram LM, tokenizer design decisions
06_pipeline_components.ipynb          Hugging Face tokenizer pipeline internals
tiktoken_example.ipynb                OpenAI tiktoken, fast BPE for GPT models
sentencepiece_example.ipynb           Google SentencePiece for multilingual models
token_exploration.ipynb               Visualize how different models tokenize text
token_exercises.ipynb                 Hands-on exercises to solidify understanding

Prerequisites

  • Python basics (Phase 01)

  • Some familiarity with LLMs is helpful but not required

Key Concepts You'll Learn

  • BPE (Byte Pair Encoding): used by GPT models

  • WordPiece: used by BERT

  • SentencePiece / Unigram: used by T5, LLaMA

  • tiktoken: OpenAI's fast tokenizer

  • Why token count matters for cost and context limits

  • How tokenization affects multilingual models
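To make the first bullet concrete, here is a minimal sketch of the BPE training loop (a simplification; library implementations add byte-level fallback, explicit tie-breaking rules, and heavy optimization): start from individual characters and repeatedly merge the most frequent adjacent pair into a new vocabulary entry.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent pair in the current token sequence."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start from individual characters
for _ in range(2):                  # two merge steps: 'l'+'o', then 'lo'+'w'
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

A real BPE run keeps merging until the vocabulary reaches a target size (e.g. 50,257 entries for GPT-2), recording each merge rule so the same splits can be reproduced at inference time.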

Learning Path

01_tokenizers_quickstart.ipynb    ← Start here
tiktoken_example.ipynb            ← OpenAI API users: do this early
sentencepiece_example.ipynb
02_tokenizers_training.ipynb
03_advanced_training_methods.ipynb
06_pipeline_components.ipynb
token_exploration.ipynb
token_exercises.ipynb             ← Practice