Phase 1: Understanding Tokens¶
Learning Time: 1-2 hours | Difficulty: Beginner
What You’ll Learn¶
By the end of this module, you will understand:
✅ What tokens are and why they’re fundamental to LLMs
✅ How tokenization algorithms work (BPE, WordPiece)
✅ The difference between tokens, characters, and words
✅ How to use tiktoken to encode and decode text
✅ Why token count matters for costs and context limits
✅ How different languages and contexts affect tokenization
Quick Start¶
# Install required package
pip install tiktoken
# Run the basic example
python tiktoken_example.py
# Run advanced examples
python token_exploration.py
# Try interactive exercises
python token_exercises.py
Files in This Module¶
File |
Description |
Run Time |
|---|---|---|
|
Comprehensive theory and concepts |
Read: 15-20 min |
|
Basic tokenization example |
1 min |
|
Advanced examples and comparisons |
3-5 min |
|
Interactive practice exercises |
10-15 min |
|
Sample output for reference |
- |
Learning Path¶
Step 1: Read the Theory (15-20 minutes)¶
Start with intro.md to understand:
What tokens are and why they exist
How tokenization works under the hood
Practical implications for cost and performance
Step 2: Run Basic Example (5 minutes)¶
python tiktoken_example.py
See tokenization in action with a simple example.
Step 3: Explore Advanced Concepts (10 minutes)¶
python token_exploration.py
Discover how different types of text get tokenized:
Common vs uncommon words
Code and special characters
Different languages
Context-dependent tokenization
Step 4: Practice with Exercises (15-20 minutes)¶
python token_exercises.py
Interactive exercises to test your understanding:
Predict token counts
Compare tokenization strategies
Calculate API costs
Optimize prompts for token efficiency
Key Concepts¶
Token Fundamentals¶
# Text → Tokens → Token IDs → Model Processing
"Hello world" → ["Hello", " world"] → [9906, 1917] → [embeddings...]
Token: A unit of text (word, subword, or character) that a model processes
Token ID: A unique integer representing a token in the model’s vocabulary
Encoding: Converting text into token IDs Decoding: Converting token IDs back to text
Why Tokens Matter¶
Cost Calculation
APIs charge per token (not per word or character)
Example: GPT-4 = $0.03 per 1K input tokens
Context Limits
Models have token limits (e.g., 8K, 32K, 128K tokens)
Must fit prompt + response within limit
Processing Speed
Fewer tokens = faster inference
Efficient tokenization improves performance
Multilingual Support
Subword tokenization handles any language
No need for language-specific dictionaries
Token Count Rules of Thumb¶
Text Type |
Token Ratio |
|---|---|
English text |
1 token ≈ 4 characters |
English text |
1 token ≈ ¾ word |
Code |
1 token ≈ 3-4 characters |
Non-English |
1 token ≈ 2-3 characters |
Numbers |
Often 1-2 digits per token |
Examples:
“Hello world!” = 3 tokens
“GPT-4 is amazing” = 5 tokens
“supercalifragilisticexpialidocious” = 7 tokens
Common Pitfalls¶
❌ Mistake 1: Assuming 1 token = 1 word¶
# Wrong assumption
text = "I love programming" # 3 words
# Actual: 3 tokens ✅ (happens to match here)
text = "I love ML" # 3 words
# Actual: 4 tokens (M and L are separate) ❌
❌ Mistake 2: Ignoring leading spaces¶
# These tokenize differently!
"red" # Token: 1171
" red" # Token: 2266 (with space)
❌ Mistake 3: Underestimating non-English costs¶
# English: "How are you?" = 4 tokens
# Chinese: "你好吗?" = 6 tokens (for 3 characters!)
❌ Mistake 4: Not counting special tokens¶
# Many models add special tokens:
# <|start|> text <|end|>
# These count toward your token limit!
Advanced Topics¶
Different Tokenization Algorithms¶
Byte-Pair Encoding (BPE) - Used by GPT models
Merges frequent character pairs iteratively
Good compression, handles any text
WordPiece - Used by BERT
Similar to BPE but uses likelihood-based merging
Optimized for vocabulary size
SentencePiece - Used by LLaMA, T5
Language-agnostic, treats text as raw bytes
No pre-tokenization (no word boundaries)
Note: Learn more at https://huggingface.co/docs/transformers/en/tokenizer_summary https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/ https://medium.com/@lmpo/from-text-to-tokens-understanding-bpe-wordpiece-and-sentencepiece-in-nlp-1367d9d610af https://medium.com/@lmpo/a-brief-history-of-ai-with-deep-learning-26f7948bc87b https://medium.com/@lmpo/the-evolution-of-artificial-neurons-90619f224f63
Model-Specific Encodings¶
Model |
Encoding |
Vocab Size |
Best For |
|---|---|---|---|
GPT-4 |
cl100k_base |
~100K |
General purpose, code |
GPT-3.5 |
cl100k_base |
~100K |
General purpose |
GPT-3 |
p50k_base |
~50K |
English text |
Code Cushman |
p50k_edit |
~50K |
Code editing |
Token vs Context Window¶
Context Window = Maximum tokens the model can process at once
Input tokens (your prompt)
Output tokens (model’s response)
System tokens (instructions, special tokens)
Example with 4K context:
3,000 token prompt = only 1,000 tokens left for response
500 token prompt = 3,500 tokens available for response
Practical Applications¶
1. Estimate API Costs¶
def estimate_cost(text, model="gpt-4"):
encoding = tiktoken.get_encoding("cl100k_base")
token_count = len(encoding.encode(text))
# GPT-4 pricing (example)
input_cost_per_1k = 0.03
cost = (token_count / 1000) * input_cost_per_1k
return token_count, cost
text = "Your long prompt here..."
tokens, cost = estimate_cost(text)
print(f"Tokens: {tokens}, Cost: ${cost:.4f}")
2. Optimize Prompts¶
# Before optimization
prompt = "Please, please, can you help me understand this?"
# 10 tokens
# After optimization
prompt = "Please explain this:"
# 4 tokens (60% reduction!)
3. Split Long Documents¶
def chunk_text(text, max_tokens=1000):
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(text)
chunks = []
for i in range(0, len(tokens), max_tokens):
chunk_tokens = tokens[i:i + max_tokens]
chunks.append(encoding.decode(chunk_tokens))
return chunks
Practice Exercises¶
Work through token_exercises.py to test your understanding:
Token Counting - Predict token counts for various texts
Cost Calculator - Build a simple API cost estimator
Prompt Optimizer - Reduce token usage while keeping meaning
Language Compare - See how different languages tokenize
Context Manager - Fit text within token limits
Verification Checklist¶
Before moving to Phase 2 (Embeddings), ensure you can:
Explain what tokens are in your own words
Use tiktoken to encode and decode text
Calculate token counts for given text
Understand why “word” ≠ “token”
Estimate API costs based on token counts
Recognize how context affects tokenization
Split text to fit within token limits
Compare tokenization across languages
Common Questions¶
Q: Why not just use words instead of tokens?
A: Words create a massive vocabulary (170K+ for English alone). Tokens use subwords, keeping vocabulary manageable (~50-100K) while handling any text.
Q: Do all models use the same tokens?
A: No! Each model has its own tokenizer trained on specific data. GPT-4’s “hello” might be a different token ID than LLaMA’s “hello”.
Q: Are emojis one token?
A: Usually multiple tokens. 😀 might be 1-3 tokens depending on the encoding.
Q: Does whitespace matter?
A: Yes! “ hello” (with space) is different from “hello” (no space). Leading/trailing spaces create different tokens.
Q: Can I create my own tokenizer?
A: Yes, but it requires training on large text corpora. Most developers use pre-trained tokenizers from model providers.
Resources for Deeper Learning¶
Official Documentation¶
tiktoken GitHub - Official OpenAI tokenizer
OpenAI Tokenizer Tool - Interactive web tool
Articles & Guides¶
Videos¶
Andrej Karpathy: Tokenization - Deep dive into BPE
3Blue1Brown: But what is a GPT? - Visual explanation
Next Steps¶
Once you’re comfortable with tokens, move to Phase 2: Embeddings:
cd ../4-embeddings
cat README.md
Phase 2 Preview: Learn how tokens are converted into dense vector representations that capture semantic meaning, enabling:
Semantic search (find similar content)
Text classification
Recommendation systems
Vector databases
Need Help?
Review
intro.mdfor detailed explanationsRun examples multiple times with different inputs
Try the interactive exercises in
token_exercises.pyExperiment with the OpenAI Tokenizer Tool