Tiktoken Example¶
Basic Tokenization Example¶
Learn the fundamentals of tokenization using tiktoken.
This example demonstrates:

- Loading a tokenization encoding
- Converting text to tokens (encoding)
- Converting tokens back to text (decoding)
- Understanding the text → tokens → IDs pipeline

Prerequisites: pip install tiktoken

Related files:

- intro.md: Theory and detailed explanations
- token_exploration.py: Advanced examples
- token_exercises.py: Interactive practice
Setup¶
Import tiktoken, OpenAI's fast BPE tokenizer library. Written in Rust with Python bindings, tiktoken is optimized for speed and is the official way to count tokens and estimate costs for OpenAI API calls. It ships pre-built encodings that match GPT-3.5, GPT-4, and the embedding models exactly.
import tiktoken
Basic Tokenization With tiktoken¶
What: Load the cl100k_base encoding (used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002), encode a text string into token IDs, decode them back, and display the per-token breakdown.
Why: Understanding how tiktoken works is essential for anyone using the OpenAI API. Token count directly determines API cost (you pay per token for both input and output) and whether your prompt fits within the modelβs context window (e.g., 128K tokens for GPT-4-turbo). The cl100k_base encoding uses a 100,000-token BPE vocabulary that was trained on a diverse multilingual corpus, making it efficient for English while still handling other languages.
How: The encode-decode pipeline is: text → encoding.encode(text) → list of integer IDs → encoding.decode(ids) → reconstructed text. Each ID maps to a specific byte sequence in the vocabulary. You can decode individual IDs to see exactly how the text was segmented.
Connection: Before sending a prompt to the OpenAI API, use len(encoding.encode(prompt)) to check token count. This avoids surprises in billing and ensures you stay within context limits.
def main():
    """
    Demonstrate basic tokenization with tiktoken.

    The cl100k_base encoding is used by:
    - GPT-4
    - GPT-3.5-turbo
    - text-embedding-ada-002
    """
    print("=" * 70)
    print("BASIC TOKENIZATION EXAMPLE")
    print("=" * 70 + "\n")

    # Step 1: Load the encoding
    # This encoding is used by GPT-4 and GPT-3.5
    print("Loading cl100k_base encoding (used by GPT-4)...")
    encoding = tiktoken.get_encoding("cl100k_base")
    print("✓ Encoding loaded\n")

    # Step 2: Define example text
    text = "hello how are you ?"
    print(f"Original text: '{text}'")
    print(f"Character count: {len(text)}\n")

    # Step 3: Encode text into token IDs
    # This converts human-readable text into numbers the model understands
    token_ids = encoding.encode(text)
    print("=" * 70)
    print("ENCODING: Text → Token IDs")
    print("=" * 70)
    print(f"Token IDs: {token_ids}")
    print(f"Number of tokens: {len(token_ids)}\n")

    # Step 4: Decode token IDs back to text
    # This reverses the process: numbers → human-readable text
    decoded_text = encoding.decode(token_ids)
    print("=" * 70)
    print("DECODING: Token IDs → Text")
    print("=" * 70)
    print(f"Decoded text: '{decoded_text}'")
    print(f"Match original? {decoded_text == text}\n")

    # Step 5: Show the token breakdown
    # Each token ID corresponds to a piece of text
    print("=" * 70)
    print("TOKEN BREAKDOWN")
    print("=" * 70)
    print(f"{'Token ID':<12} {'Text Piece'}")
    print("-" * 70)
    for tid in token_ids:
        # Decode each individual token ID to see what text it represents
        token_text = encoding.decode([tid])
        print(f"{tid:<12} '{token_text}'")

    # Summary
    print("\n" + "=" * 70)
    print("SUMMARY")
    print("=" * 70)
    print(f"Input: '{text}'")
    print(f"Tokens: {len(token_ids)}")
    print(f"Characters: {len(text)}")
    print(f"Ratio: {len(text) / len(token_ids):.2f} characters per token")

    # Additional examples
    print("\n" + "=" * 70)
    print("TRY THESE EXAMPLES")
    print("=" * 70)
    examples = [
        "The quick brown fox jumps over the lazy dog",
        "I love AI and machine learning!",
        "GPT-4 is amazing",
        "print('Hello, World!')",
        "supercalifragilisticexpialidocious",
    ]
    print("\nSee how different texts tokenize:\n")
    for example in examples:
        tokens = encoding.encode(example)
        print(f"{len(tokens):2d} tokens: '{example}'")

    print("\n" + "=" * 70)
    print("NEXT STEPS")
    print("=" * 70)
    print("1. Read intro.md for detailed explanations")
    print("2. Run token_exploration.py for advanced examples")
    print("3. Try token_exercises.py for interactive practice")
    print("4. Experiment by changing the 'text' variable above!")
    print("=" * 70 + "\n")


if __name__ == "__main__":
    main()