Phase 1: Understanding Tokens

Learning Time: 1-2 hours | Difficulty: Beginner

What You’ll Learn

By the end of this module, you will understand:

✅ What tokens are and why they’re fundamental to LLMs
✅ How tokenization algorithms work (BPE, WordPiece)
✅ The difference between tokens, characters, and words
✅ How to use tiktoken to encode and decode text
✅ Why token count matters for costs and context limits
✅ How different languages and contexts affect tokenization

Quick Start

# Install required package
pip install tiktoken

# Run the basic example
python tiktoken_example.py

# Run advanced examples
python token_exploration.py

# Try interactive exercises
python token_exercises.py

Files in This Module

File	Description	Run Time
`intro.md`	Comprehensive theory and concepts	Read: 15-20 min
`tiktoken_example.py`	Basic tokenization example	1 min
`token_exploration.py`	Advanced examples and comparisons	3-5 min
`token_exercises.py`	Interactive practice exercises	10-15 min
`tiktoken_example_output.txt`	Sample output for reference	-

Learning Path

Step 1: Read the Theory (15-20 minutes)

Start with intro.md to understand:

What tokens are and why they exist
How tokenization works under the hood
Practical implications for cost and performance

Step 2: Run Basic Example (5 minutes)

python tiktoken_example.py

See tokenization in action with a simple example.

Step 3: Explore Advanced Concepts (10 minutes)

python token_exploration.py

Discover how different types of text get tokenized:

Common vs uncommon words
Code and special characters
Different languages
Context-dependent tokenization

Step 4: Practice with Exercises (15-20 minutes)

python token_exercises.py

Interactive exercises to test your understanding:

Predict token counts
Compare tokenization strategies
Calculate API costs
Optimize prompts for token efficiency

Key Concepts

Token Fundamentals

# Text → Tokens → Token IDs → Model Processing
"Hello world" → ["Hello", " world"] → [9906, 1917] → [embeddings...]

Token: A unit of text (word, subword, or character) that a model processes

Token ID: A unique integer representing a token in the model’s vocabulary

Encoding: Converting text into token IDs

Decoding: Converting token IDs back to text
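The pipeline above can be sketched with a toy tokenizer. This is a minimal illustration, not how tiktoken works internally: the `ToyEncoding` class, its greedy longest-match rule, and its four-entry vocabulary are all assumptions made for the example. Only the `encode`/`decode` method names mirror what tiktoken's encoding objects expose.

```python
class ToyEncoding:
    """Tiny stand-in for a real tokenizer; the vocabulary is made up for illustration."""

    def __init__(self, vocab):
        self.token_to_id = dict(vocab)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Greedy longest-match lookup (real BPE applies learned merges instead)."""
        ids = []
        pos = 0
        while pos < len(text):
            # Try the longest candidate first so " world" wins over " ".
            for end in range(len(text), pos, -1):
                piece = text[pos:end]
                if piece in self.token_to_id:
                    ids.append(self.token_to_id[piece])
                    pos = end
                    break
            else:
                raise ValueError(f"no token covers {text[pos]!r}")
        return ids

    def decode(self, ids):
        """Map each ID back to its text piece and concatenate."""
        return "".join(self.id_to_token[i] for i in ids)


# Hypothetical vocabulary: the first two IDs are borrowed from the example above,
# the other two entries exist only so punctuation and spaces can be encoded.
toy = ToyEncoding({"Hello": 9906, " world": 1917, " ": 220, "!": 0})

ids = toy.encode("Hello world")
text = toy.decode(ids)
print(ids, repr(text))  # [9906, 1917] 'Hello world'
```

Note that the space travels with the word that follows it (`" world"`), which is why leading whitespace changes the token IDs you get.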

Why Tokens Matter

Cost Calculation
- APIs charge per token (not per word or character)
- Example: GPT-4 launched at $0.03 per 1K input tokens (prices change; check the provider's current rates)
Context Limits
- Models have token limits (e.g., 8K, 32K, 128K tokens)
- Must fit prompt + response within limit
Processing Speed
- Fewer tokens = faster inference
- Efficient tokenization improves performance
Multilingual Support
- Subword tokenization handles any language
- No need for language-specific dictionaries

Token Count Rules of Thumb

Text Type	Token Ratio
English text	1 token ≈ 4 characters
English text	1 token ≈ ¾ word
Code	1 token ≈ 3-4 characters
Non-English	1 token ≈ 2-3 characters
Numbers	Often 1-3 digits per token

Examples (approximate counts; exact values depend on the encoding):

“Hello world!” = 3 tokens
“GPT-4 is amazing” = 5 tokens
“supercalifragilisticexpialidocious” = 7 tokens
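The character and word ratios from the table above can be turned into a quick estimator for cases where no tokenizer is at hand. This is a rough sketch: the function names and default ratios are assumptions taken from the English rows of the table, and the result is only a ballpark figure, not a replacement for counting with tiktoken.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count (1 token ≈ 4 characters)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(text, words_per_token=0.75):
    """Rough token estimate from word count (1 token ≈ 3/4 of a word)."""
    return math.ceil(len(text.split()) / words_per_token)


sample = "Hello world!"
print(estimate_tokens(sample))             # 12 characters / 4 -> 3
print(estimate_tokens_from_words(sample))  # 2 words / 0.75 -> 3
```

For code or non-English text, pass a smaller `chars_per_token` (for example 3.0 or 2.5) to reflect the lower ratios in the table.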

Common Pitfalls

❌ Mistake 1: Assuming 1 token = 1 word

# Wrong assumption
text = "I love programming"  # 3 words
# Actual: 3 tokens ✅ (happens to match here)

text = "I love ML"  # 3 words  
# Actual: can be 4 tokens if the abbreviation is split into M and L (depends on the encoding) ❌

❌ Mistake 2: Ignoring leading spaces

# These tokenize differently! (IDs are illustrative and depend on the encoding)
"red"     # Token: 1171
" red"    # Token: 2266 (with space)

❌ Mistake 3: Underestimating non-English costs

# English: "How are you?" = 4 tokens
# Chinese: "你好吗?" = 6 tokens (for 3 characters!)

❌ Mistake 4: Not counting special tokens

# Many models add special tokens:
# <|start|> text <|end|>
# These count toward your token limit!
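One way to account for this overhead is to add a fixed allowance per message on top of the content tokens. The sketch below is hypothetical: the function name and the default overhead values (3 tokens per message, 3 tokens to prime the reply) are assumptions in the spirit of figures published for some chat models, and the real numbers vary by model.

```python
def count_chat_tokens(message_token_counts, tokens_per_message=3, reply_priming=3):
    """Add assumed per-message and reply-priming overhead to the content tokens.

    message_token_counts: token count of each message's text, measured separately.
    tokens_per_message / reply_priming: assumed overhead values, not universal constants.
    """
    content = sum(message_token_counts)
    overhead = tokens_per_message * len(message_token_counts) + reply_priming
    return content + overhead


# Two messages of 12 and 48 content tokens: 60 + 2 * 3 + 3 = 69
print(count_chat_tokens([12, 48]))
```

Even with made-up values, the point holds: a conversation with many short messages pays proportionally more overhead than one long message.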

Advanced Topics

Different Tokenization Algorithms

Byte-Pair Encoding (BPE) - Used by GPT models
- Iteratively merges the most frequent adjacent symbol pair (GPT models apply this to bytes)
- Good compression, handles any text
WordPiece - Used by BERT
- Similar to BPE but uses likelihood-based merging
- Marks word-continuation pieces with a ## prefix
SentencePiece - Used by LLaMA, T5
- A tokenizer library that implements BPE and unigram models
- Language-agnostic: works on the raw character stream and treats whitespace as an ordinary symbol
- No pre-tokenization (no word boundaries)
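The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified, character-level toy, assuming a whitespace-split corpus; the helper names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for the example, and production tokenizers such as the ones behind GPT models work on bytes with extra pre-tokenization rules.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties resolve to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After two merges the frequent word "low" has become a single symbol, while the rarer suffixes of "lower" and "lowest" are still split into characters. This is the same effect that makes common words cheap and rare words expensive in token terms.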

Note: Learn more at:
https://huggingface.co/docs/transformers/en/tokenizer_summary
https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/
https://medium.com/@lmpo/from-text-to-tokens-understanding-bpe-wordpiece-and-sentencepiece-in-nlp-1367d9d610af
https://medium.com/@lmpo/a-brief-history-of-ai-with-deep-learning-26f7948bc87b
https://medium.com/@lmpo/the-evolution-of-artificial-neurons-90619f224f63

Model-Specific Encodings

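Model	Encoding	Vocab Size	Best For
GPT-4	cl100k_base	~100K	General purpose, code
GPT-3.5 Turbo	cl100k_base	~100K	General purpose
GPT-3 (text-davinci-002/003)	p50k_base	~50K	English text
Codex (e.g. code-cushman-001)	p50k_base	~50K	Code
Edit models (e.g. text-davinci-edit-001)	p50k_edit	~50K	Text and code editing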

Token vs Context Window

Context Window = Maximum tokens the model can process at once. This budget is shared by:

Input tokens (your prompt)
Output tokens (model’s response)
System tokens (instructions, special tokens)

Example with 4K context:

3,000 token prompt = only 1,000 tokens left for response
500 token prompt = 3,500 tokens available for response
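The arithmetic above can be wrapped in a small helper. This is a minimal sketch: the function name and the `system_tokens` parameter are assumptions, and the 4,000-token window simply follows the rounded numbers used in the example.

```python
def remaining_output_tokens(context_window, prompt_tokens, system_tokens=0):
    """Return how many tokens are left for the response after prompt and system tokens."""
    used = prompt_tokens + system_tokens
    if used > context_window:
        raise ValueError(
            f"input uses {used} tokens but the context window is {context_window}"
        )
    return context_window - used


print(remaining_output_tokens(4000, 3000))  # 1000
print(remaining_output_tokens(4000, 500))   # 3500
```

Raising an error when the input alone exceeds the window is a design choice for the sketch; a real application might instead truncate or chunk the prompt.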

Practical Applications

1. Estimate API Costs

import tiktoken

def estimate_cost(text, model="gpt-4", input_cost_per_1k=0.03):
    # Look up the encoding tiktoken associates with this model name
    encoding = tiktoken.encoding_for_model(model)
    token_count = len(encoding.encode(text))
    
    # Default is an example GPT-4 input price in USD per 1K tokens; check current pricing
    cost = (token_count / 1000) * input_cost_per_1k
    
    return token_count, cost

text = "Your long prompt here..."
tokens, cost = estimate_cost(text)
print(f"Tokens: {tokens}, Cost: ${cost:.4f}")
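Input tokens are only part of the bill, since the response is charged too. The sketch below extends the idea to both sides of a request. The function name is hypothetical, and both default prices are placeholder values for illustration (the input price matches the example above; the output price is an assumption). It takes token counts rather than text, so it runs without a tokenizer.

```python
def estimate_request_cost(
    input_tokens,
    output_tokens,
    input_price_per_1k=0.03,   # example input price in USD, as above
    output_price_per_1k=0.06,  # assumed output price in USD, for illustration only
):
    """Estimate the total cost of one request from its input and output token counts."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# 1,000 prompt tokens plus a 500-token response
print(f"${estimate_request_cost(1000, 500):.4f}")  # $0.0600
```

Output tokens are often priced higher than input tokens, so capping the response length can matter as much as trimming the prompt.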

2. Optimize Prompts

# Before optimization
prompt = "Please, please, can you help me understand this?"
# About 10 tokens

# After optimization
prompt = "Please explain this:"
# About 4 tokens (roughly 60% fewer)

3. Split Long Documents

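import tiktoken

def chunk_text(text, max_tokens=1000):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    
    chunks = []
    # Walk through the token IDs in windows of max_tokens
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        # A boundary can fall inside a multi-byte character, which
        # decode() then renders as a replacement character
        chunks.append(encoding.decode(chunk_tokens))
    
    return chunks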
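Hard cuts at fixed positions can split a sentence between two chunks, so a common variation is to let consecutive chunks overlap by a few tokens. The sketch below works on a plain list of token IDs, so it runs without tiktoken. The function name and the `overlap` parameter are assumptions introduced for the example.

```python
def chunk_token_ids(token_ids, max_tokens=1000, overlap=0):
    """Split a list of token IDs into windows of max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be between 0 and max_tokens - 1")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once a window reaches the end, to avoid a trailing duplicate chunk.
        if start + max_tokens >= len(token_ids):
            break
    return chunks


ids = list(range(10))
print(chunk_token_ids(ids, max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

To use it with real text, encode first, chunk the IDs, then decode each chunk, as `chunk_text` does above.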

Practice Exercises

Work through token_exercises.py to test your understanding:

Token Counting - Predict token counts for various texts
Cost Calculator - Build a simple API cost estimator
Prompt Optimizer - Reduce token usage while keeping meaning
Language Compare - See how different languages tokenize
Context Manager - Fit text within token limits

Verification Checklist

Before moving to Phase 2 (Embeddings), ensure you can:

Explain what tokens are in your own words
Use tiktoken to encode and decode text
Calculate token counts for given text
Understand why “word” ≠ “token”
Estimate API costs based on token counts
Recognize how context affects tokenization
Split text to fit within token limits
Compare tokenization across languages

Common Questions

Q: Why not just use words instead of tokens?
A: Words create a massive vocabulary (170K+ for English alone). Tokens use subwords, keeping vocabulary manageable (~50-100K) while handling any text.

Q: Do all models use the same tokens?
A: No! Each model has its own tokenizer trained on specific data. GPT-4’s “hello” might be a different token ID than LLaMA’s “hello”.

Q: Are emojis one token?
A: Often not. An emoji is stored as several bytes, so 😀 might be 1-3 tokens depending on the encoding.

Q: Does whitespace matter?
A: Yes! “ hello” (with space) is different from “hello” (no space). Leading/trailing spaces create different tokens.

Q: Can I create my own tokenizer?
A: Yes, but it requires training on large text corpora. Most developers use pre-trained tokenizers from model providers.

Resources for Deeper Learning

Official Documentation

tiktoken GitHub - Official OpenAI tokenizer
OpenAI Tokenizer Tool - Interactive web tool

Videos

Andrej Karpathy: Tokenization - Deep dive into BPE
3Blue1Brown: But what is a GPT? - Visual explanation

Next Steps

Once you’re comfortable with tokens, move on to Phase 2 (Embeddings):

cd ../4-embeddings
cat README.md

Phase 2 Preview: Learn how tokens are converted into dense vector representations that capture semantic meaning, enabling:

Semantic search (find similar content)
Text classification
Recommendation systems
Vector databases

Need Help?

Review intro.md for detailed explanations
Run examples multiple times with different inputs
Try the interactive exercises in token_exercises.py
Experiment with the OpenAI Tokenizer Tool