02 Tokenizers Training¶
HuggingFace Tokenizers - Training Examples¶
Complete examples for training different types of tokenizers.
Installation: pip install tokenizers
Time: 30-45 minutes
Setup¶
Import os and pathlib for file-system operations. Training a tokenizer often involves reading corpus files from disk and saving the resulting model to JSON, so path handling is essential throughout this notebook.
import os
from pathlib import Path
# =============================================================================
# Example 1: Train BPE Tokenizer (GPT-2 Style)
# =============================================================================
Train a BPE Tokenizer (GPT-2 Style)¶
What: Build and train a Byte Pair Encoding tokenizer that mirrors the architecture used by GPT-2 and GPT-3.
Why: BPE is the most widely used subword algorithm in modern LLMs. By training one from scratch you will see every component: the ByteLevel pre-tokenizer that converts raw Unicode into bytes (so the vocabulary never encounters an unknown character), the BpeTrainer that controls vocab size and merge frequency, and the ByteLevel decoder that reassembles bytes back into readable text.
How: BPE iteratively merges the most frequent byte-pair in the corpus. After \(V - 256\) merges (where \(V\) is the target vocab size and 256 is the base byte alphabet), the vocabulary contains both single-byte and multi-byte tokens. The add_prefix_space=True setting prepends a space to the input when it does not already start with one, so a word at the beginning of a sentence is tokenized identically to the same word mid-sentence, a GPT-2 design choice that simplifies generation.
def train_bpe_tokenizer():
"""Train a BPE tokenizer from scratch"""
print("\n" + "=" * 70)
print("EXAMPLE 1: Train BPE Tokenizer (GPT-2 Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.processors import ByteLevel as ByteLevelProcessor
# 1. Initialize
print("\nπ¦ Initializing BPE tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
# 2. Pre-tokenizer (GPT-2 style byte-level)
tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=True)
# 3. Decoder and byte-level post-processor (uses the processor imported above, as in GPT-2)
tokenizer.decoder = ByteLevelDecoder()
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=False)
# 4. Trainer configuration
trainer = BpeTrainer(
vocab_size=5000,
min_frequency=2,
special_tokens=["<|endoftext|>"],
show_progress=True
)
# 5. Training data
training_data = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming the world of AI.",
"Natural language processing helps computers understand text.",
"Tokenizers split text into manageable pieces.",
"Deep learning models require proper tokenization.",
] * 500 # Repeat for better vocab
print("π Training tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 6. Test
test_text = "The tokenizer learned to split words!"
output = tokenizer.encode(test_text)
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
print(f"π IDs: {output.ids[:10]}... (showing first 10)")
# 7. Save
save_path = "tokenizers/bpe_gpt2_style.json"
os.makedirs("tokenizers", exist_ok=True)
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
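Once the cell above has been run, the saved file can be reloaded to check the add_prefix_space behaviour described earlier. A minimal sketch (assumes tokenizers/bpe_gpt2_style.json exists from Example 1):
from tokenizers import Tokenizer

bpe = Tokenizer.from_file("tokenizers/bpe_gpt2_style.json")

# With add_prefix_space=True, a word at the start of the input should receive the
# same space-prefixed byte-level tokens as the same word mid-sentence.
print(bpe.encode("Machine learning").tokens)
print(bpe.encode("about Machine learning").tokens)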
# =============================================================================
# Example 2: Train WordPiece Tokenizer (BERT Style)
# =============================================================================
Train a WordPiece Tokenizer (BERT Style)¶
What: Build and train a WordPiece tokenizer with BERT-style normalization, [CLS]/[SEP] post-processing, and the ## continuing-subword prefix.
Why: WordPiece is BERT's tokenization algorithm and differs from BPE in a subtle but important way: instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training corpus under a unigram language model. This tends to produce slightly different vocabularies than BPE, often with better coverage of rare morphological variants.
How: The BertNormalizer lowercases text, strips accents, and handles Chinese characters. After training, TemplateProcessing automatically wraps every input in [CLS] ... [SEP] (single) or [CLS] A [SEP] B [SEP] (pair), and the WordPieceDecoder re-joins subwords by removing the ## prefix during decoding.
Connection: If you fine-tune BERT, DistilBERT, or any BERT variant, the tokenizer you load is exactly this architecture. Training your own lets you adapt the vocabulary to a specialized domain like biomedical text or legal documents.
def train_wordpiece_tokenizer():
"""Train a WordPiece tokenizer like BERT"""
print("\n" + "=" * 70)
print("EXAMPLE 2: Train WordPiece Tokenizer (BERT Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import WordPiece as WordPieceDecoder
# 1. Initialize
print("\nπ¦ Initializing WordPiece tokenizer...")
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# 2. Normalizer (BERT-style)
tokenizer.normalizer = BertNormalizer(
clean_text=True,
handle_chinese_chars=True,
strip_accents=True,
lowercase=True
)
# 3. Pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()
# 4. Trainer
trainer = WordPieceTrainer(
vocab_size=10000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
continuing_subword_prefix="##",
show_progress=True
)
# 5. Training data (simulate diverse corpus)
training_data = [
"The quick brown fox jumps over the lazy dog.",
"BERT uses WordPiece tokenization.",
"Natural language understanding is improving rapidly.",
"Machine learning models need quality data.",
"Tokenization is the first step in NLP.",
"Understanding context is crucial for language models.",
] * 500
print("π Training WordPiece tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 6. Post-processor (BERT-style)
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
# 7. Decoder
tokenizer.decoder = WordPieceDecoder(prefix="##")
# 8. Test
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
test_text = "Understanding tokenization"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
decoded = tokenizer.decode(output.ids)
print(f"π Decoded: '{decoded}'")
# 9. Save
save_path = "tokenizers/wordpiece_bert_style.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
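Because the template also defines a pair form, encoding two segments shows the [CLS]/[SEP] wrapping and the segment ids directly. A quick check, assuming the WordPiece tokenizer above has been trained and saved:
from tokenizers import Tokenizer

wp = Tokenizer.from_file("tokenizers/wordpiece_bert_style.json")

# Pair encoding: [CLS] A [SEP] B [SEP], with type id 0 for the A segment and 1 for B.
pair = wp.encode("tokenization is the first step", "context is crucial")
print(pair.tokens)
print(pair.type_ids)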
# =============================================================================
# Example 3: Train Unigram Tokenizer (SentencePiece Style)
# =============================================================================
Train a Unigram Tokenizer (SentencePiece Style)¶
What: Train a Unigram language-model tokenizer with NFKC normalization and Metaspace pre-tokenization, suitable for multilingual corpora.
Why: Unlike BPE (which builds up from characters) or WordPiece (which also builds up), the Unigram algorithm works top-down: it starts with a large candidate vocabulary and iteratively removes tokens whose removal causes the smallest increase in corpus loss. This produces a probabilistic model where each segmentation has a well-defined likelihood, enabling features like n-best segmentation and subword regularization (randomly sampling different segmentations during training to improve robustness).
How: The loss function is the negative log-likelihood of the corpus: \(\mathcal{L} = -\sum_{s \in \text{corpus}} \log P(s)\), where \(P(s)\) is the product of the individual token probabilities under the unigram assumption. At each pruning step, the token whose removal increases \(\mathcal{L}\) the least is dropped.
Connection: T5, ALBERT, and XLM-R use Unigram tokenizers trained with SentencePiece; LLaMA also ships a SentencePiece tokenizer, though it uses a BPE model rather than Unigram. The Metaspace pre-tokenizer replaces spaces with the ▁ character, which is the hallmark of SentencePiece-style models.
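To make the pruning criterion concrete, here is a toy evaluation of the loss above with made-up token probabilities (illustrative numbers only, not trainer output):
import math

# Hypothetical unigram probabilities and one segmentation of a one-sentence "corpus".
probs = {"▁machine": 0.02, "▁learn": 0.015, "ing": 0.05}
segmentation = ["▁machine", "▁learn", "ing"]

# P(s) is the product of token probabilities; the corpus loss is -log P(s) summed over sentences.
loss = -sum(math.log(probs[token]) for token in segmentation)
print(f"corpus loss: {loss:.3f} nats")
# Pruning removes the candidate token whose deletion raises this loss the least.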
def train_unigram_tokenizer():
"""Train a Unigram tokenizer for multilingual use"""
print("\n" + "=" * 70)
print("EXAMPLE 3: Train Unigram Tokenizer (SentencePiece Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Metaspace
from tokenizers.decoders import Metaspace as MetaspaceDecoder
# 1. Initialize
print("\nπ¦ Initializing Unigram tokenizer...")
tokenizer = Tokenizer(Unigram())
# 2. Normalizer
tokenizer.normalizer = NFKC()
# 3. Pre-tokenizer (Metaspace for SentencePiece compatibility)
tokenizer.pre_tokenizer = Metaspace()
# 4. Decoder
tokenizer.decoder = MetaspaceDecoder()
# 5. Trainer
trainer = UnigramTrainer(
vocab_size=8000,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
unk_token="<unk>",
show_progress=True
)
# 6. Multilingual training data
training_data = [
# English
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming the world.",
# Spanish
"El rΓ‘pido zorro marrΓ³n salta sobre el perro perezoso.",
"El aprendizaje automΓ‘tico estΓ‘ transformando el mundo.",
# French
"Le rapide renard brun saute par-dessus le chien paresseux.",
"L'apprentissage automatique transforme le monde.",
# German
"Der schnelle braune Fuchs springt ΓΌber den faulen Hund.",
"Maschinelles Lernen verΓ€ndert die Welt.",
] * 300
print("π Training Unigram tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 7. Test on multiple languages
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
test_cases = [
("English", "Machine learning is amazing"),
("Spanish", "El aprendizaje es increΓble"),
("French", "L'apprentissage est incroyable"),
]
print("\nπ Testing on multiple languages:")
for lang, text in test_cases:
output = tokenizer.encode(text)
print(f"\n {lang}: '{text}'")
print(f" Tokens: {output.tokens[:10]}... ({len(output.tokens)} total)")
# 8. Save
save_path = "tokenizers/unigram_multilingual.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
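After training, the Metaspace convention is easy to verify: word-initial tokens carry the ▁ marker, and decoding turns it back into spaces. A short check, assuming the tokenizer above has been saved:
from tokenizers import Tokenizer

uni = Tokenizer.from_file("tokenizers/unigram_multilingual.json")
enc = uni.encode("Machine learning is amazing")
print(enc.tokens)           # expect ▁-prefixed pieces such as '▁Machine' or '▁learning'
print(uni.decode(enc.ids))  # the Metaspace decoder restores normal spaces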
# =============================================================================
# Example 4: Train Domain-Specific Tokenizer (Code)
# =============================================================================
Train a Code-Specific Tokenizer¶
What: Train a BPE tokenizer on a corpus of Python and JavaScript code, with custom special tokens for code structure like <INDENT>, <DEDENT>, and <NEWLINE>.
Why: Code has very different statistical patterns from natural language: identifiers use snake_case or camelCase, operators and brackets are frequent, and indentation carries semantic meaning (especially in Python). A general-purpose tokenizer trained on English prose will waste vocabulary slots on common English words while poorly handling code constructs. Training a code-specific tokenizer produces shorter token sequences for code, which means lower API costs and more code fitting in a context window.
Connection: Models like Codex, CodeLlama, and StarCoder all use tokenizers trained on code corpora. The custom special tokens let you explicitly mark structural elements that would otherwise be ambiguous in the token stream.
def train_code_tokenizer():
"""Train a tokenizer optimized for code"""
print("\n" + "=" * 70)
print("EXAMPLE 4: Train Code-Specific Tokenizer")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Initialize
print("\nπ¦ Initializing code tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
# 2. Pre-tokenizer (simple whitespace, preserve case)
tokenizer.pre_tokenizer = Whitespace()
# 3. Trainer with code-specific tokens
trainer = BpeTrainer(
vocab_size=15000,
special_tokens=[
"<UNK>", "<PAD>", "<BOS>", "<EOS>",
"<INDENT>", "<DEDENT>", "<NEWLINE>",
"<COMMENT>", "<STRING>", "<NUMBER>"
],
show_progress=True
)
# 4. Code training data
training_data = [
# Python
"def hello_world():\n print('Hello, world!')\n return True",
"class MyClass:\n def __init__(self):\n self.value = 42",
"for i in range(10):\n print(i)",
"import numpy as np\nimport pandas as pd",
# JavaScript
"function hello() {\n console.log('Hello');\n return true;\n}",
"const myArray = [1, 2, 3, 4, 5];",
"for (let i = 0; i < 10; i++) {\n console.log(i);\n}",
# General patterns
"if (condition) { doSomething(); }",
"var x = 10;\nvar y = 20;\nvar sum = x + y;",
] * 300
print("π Training on code samples...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 5. Test on code snippets
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
code_examples = [
("Python", "def factorial(n):\n return 1 if n <= 1 else n * factorial(n-1)"),
("JavaScript", "const sum = (a, b) => a + b;"),
("General", "for i in range(10): print(i)")
]
print("\nπ Testing on code:")
for lang, code in code_examples:
output = tokenizer.encode(code)
print(f"\n {lang}:")
print(f" Code: {code[:50]}...")
print(f" Tokens: {output.tokens[:15]}...")
print(f" Total: {len(output.tokens)} tokens")
# 6. Save
save_path = "tokenizers/code_tokenizer.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
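One quick way to see the sequence-length benefit claimed earlier is to measure characters per token on a snippet (higher means better compression). A rough sketch, assuming the code tokenizer above has been saved:
from tokenizers import Tokenizer

code_tok = Tokenizer.from_file("tokenizers/code_tokenizer.json")
snippet = "def add(a, b):\n    return a + b"
enc = code_tok.encode(snippet)
print(f"{len(snippet)} chars -> {len(enc.tokens)} tokens "
      f"({len(snippet) / len(enc.tokens):.1f} chars/token)")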
# =============================================================================
# Example 5: Train with Real Files
# =============================================================================
Train From Files¶
What: Train a tokenizer directly from text files on disk using tokenizer.train(files, trainer) instead of train_from_iterator().
Why: Real-world tokenizer training operates on gigabytes of text stored in files: Wikipedia dumps, Common Crawl extracts, or domain-specific corpora. The train() method reads files lazily and streams data through the Rust backend, making it memory-efficient for large datasets. This is the approach you would use when preparing a production tokenizer for a new language or domain.
Connection: When teams at Google or Meta train tokenizers for models like T5 or LLaMA, they point the trainer at massive file collections. Understanding the file-based API bridges the gap between toy examples and real model development.
def train_from_files():
"""Train tokenizer from actual text files"""
print("\n" + "=" * 70)
print("EXAMPLE 5: Train from Files")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Create sample training files
print("\nπ Creating sample training files...")
os.makedirs("training_data", exist_ok=True)
# Create 3 sample files
file_contents = [
("train.txt", "The quick brown fox jumps over the lazy dog.\n" * 100),
("valid.txt", "Machine learning is transforming AI.\n" * 100),
("test.txt", "Natural language processing is amazing.\n" * 100),
]
files = []
for filename, content in file_contents:
filepath = f"training_data/{filename}"
with open(filepath, "w", encoding="utf-8") as f:
f.write(content)
files.append(filepath)
print(f" Created: {filepath}")
# 2. Initialize tokenizer
print("\nπ¦ Initializing tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# 3. Trainer
trainer = BpeTrainer(
vocab_size=5000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
show_progress=True
)
# 4. Train from files
print("\nπ Training from files...")
tokenizer.train(files, trainer)
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
# 5. Test
test_text = "The fox is learning machine learning"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
# 6. Save
save_path = "tokenizers/file_trained.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
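For corpora too large to hold in memory, the same files can instead be streamed through train_from_iterator() with a generator that yields one line at a time. A sketch under the assumption that the training_data/ files created above exist:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def iter_lines(paths):
    """Yield lines lazily so the whole corpus never sits in memory at once."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                yield line

streamed = Tokenizer(BPE(unk_token="[UNK]"))
streamed.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
streamed.train_from_iterator(iter_lines(["training_data/train.txt",
                                         "training_data/valid.txt"]), trainer=trainer)
print(streamed.get_vocab_size())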
# =============================================================================
# Example 6: Compare Different Tokenizers
# =============================================================================
Compare Tokenizers¶
What: Load the BPE, WordPiece, Unigram, and Code tokenizers trained in earlier examples and compare how each one segments the same input sentences.
Why: Different algorithms make different trade-offs. BPE is greedy and deterministic; WordPiece optimizes corpus likelihood; Unigram is probabilistic; and a code tokenizer is domain-tuned. Seeing side-by-side comparisons on the same text reveals these differences concretely: one tokenizer might produce 8 tokens where another produces 12, and the token boundaries will differ. Fewer tokens generally means faster inference and lower API cost, but the "right" tokenizer depends on the model architecture and training data.
Connection: Choosing or training the right tokenizer is one of the most impactful decisions in an LLM project. A mismatch between tokenizer and domain can silently degrade model quality.
def compare_tokenizers():
"""Compare tokenization results across different models"""
print("\n" + "=" * 70)
print("EXAMPLE 6: Compare Different Tokenizers")
print("=" * 70)
from tokenizers import Tokenizer
# Load previously trained tokenizers
tokenizer_paths = {
"BPE (GPT-2)": "tokenizers/bpe_gpt2_style.json",
"WordPiece (BERT)": "tokenizers/wordpiece_bert_style.json",
"Unigram": "tokenizers/unigram_multilingual.json",
"Code": "tokenizers/code_tokenizer.json",
}
tokenizers = {}
for name, path in tokenizer_paths.items():
if os.path.exists(path):
tokenizers[name] = Tokenizer.from_file(path)
if not tokenizers:
print("β οΈ No trained tokenizers found. Run other examples first.")
return
# Test sentences
test_sentences = [
"Machine learning is transforming the world.",
"The quick brown fox jumps over the lazy dog.",
"Understanding natural language processing.",
]
print("\nπ Comparing tokenization results:\n")
for sentence in test_sentences:
print(f"π Sentence: '{sentence}'")
print("-" * 70)
for name, tokenizer in tokenizers.items():
output = tokenizer.encode(sentence)
print(f"\n {name}:")
print(f" Tokens: {output.tokens[:15]}")
print(f" Count: {len(output.tokens)} tokens")
print(f" IDs: {output.ids[:10]}...")
print("\n")
# =============================================================================
# Example 7: Fine-tune Existing Tokenizer
# =============================================================================
Fine-Tune an Existing Tokenizer¶
What: Load a previously trained WordPiece tokenizer and extend its vocabulary with domain-specific tokens using tokenizer.add_tokens().
Why: Training a tokenizer from scratch requires a large corpus, but you can often get 80% of the benefit by adding new tokens to an existing vocabulary. If your application deals with medical terms like "covid19" or ML framework names like "pytorch", adding them as single tokens prevents the tokenizer from splitting them into unhelpful subwords. This is the approach used when fine-tuning models for specialized domains: you extend the tokenizer vocabulary and resize the model's embedding matrix to match.
Connection: HuggingFace Transformers' tokenizer.add_tokens() and model.resize_token_embeddings() work together to let you adapt any pretrained model to new vocabulary without retraining from scratch.
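The Transformers-side pattern mentioned in the Connection note looks like this (a hedged sketch: it requires the transformers package and downloads bert-base-uncased; the function below instead extends the locally trained tokenizer with the raw tokenizers API):
from transformers import AutoTokenizer, AutoModelForMaskedLM

hf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

added = hf_tok.add_tokens(["covid19", "pytorch", "neural_network"])
model.resize_token_embeddings(len(hf_tok))  # newly added rows start randomly initialized
print(f"added {added} tokens; vocabulary is now {len(hf_tok)} entries")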
def finetune_tokenizer():
"""Add domain-specific vocabulary to existing tokenizer"""
print("\n" + "=" * 70)
print("EXAMPLE 7: Fine-tune Existing Tokenizer")
print("=" * 70)
from tokenizers import Tokenizer
# Load base tokenizer
if not os.path.exists("tokenizers/wordpiece_bert_style.json"):
print("β οΈ Base tokenizer not found. Train WordPiece tokenizer first.")
return
print("\nπ₯ Loading base tokenizer...")
tokenizer = Tokenizer.from_file("tokenizers/wordpiece_bert_style.json")
vocab_before = tokenizer.get_vocab_size()
print(f"π Vocabulary size before: {vocab_before}")
# Add domain-specific tokens
print("\nβ Adding domain-specific tokens...")
new_tokens = [
# Medical terms
"covid19", "vaccine", "antibody", "pandemic",
# Tech terms
"tensorflow", "pytorch", "neural_network", "transformer",
# Custom tokens
"[ENTITY]", "[DATE]", "[NUMBER]", "[URL]"
]
num_added = tokenizer.add_tokens(new_tokens)
vocab_after = tokenizer.get_vocab_size()
print(f"β
Added {num_added} new tokens")
print(f"π Vocabulary size after: {vocab_after}")
# Test with domain-specific text
test_text = "The covid19 vaccine uses tensorflow for neural_network prediction"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
# Save fine-tuned tokenizer
save_path = "tokenizers/finetuned_tokenizer.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
# =============================================================================
# Main Function
# =============================================================================
Main: Run All Training Examples¶
Execute every training example in sequence: BPE, WordPiece, Unigram, Code, file-based training, comparison, and fine-tuning. The trained tokenizers are saved to the ./tokenizers/ directory as JSON files that can be loaded later with Tokenizer.from_file().
def main():
"""Run all training examples"""
print("\n" + "=" * 70)
print("HUGGINGFACE TOKENIZERS - TRAINING EXAMPLES")
print("=" * 70)
print("\nThis script demonstrates:")
print(" 1. Training BPE tokenizer (GPT-2 style)")
print(" 2. Training WordPiece tokenizer (BERT style)")
print(" 3. Training Unigram tokenizer (multilingual)")
print(" 4. Training domain-specific tokenizer (code)")
print(" 5. Training from files")
print(" 6. Comparing different tokenizers")
print(" 7. Fine-tuning existing tokenizers")
try:
# Train different types
train_bpe_tokenizer()
train_wordpiece_tokenizer()
train_unigram_tokenizer()
train_code_tokenizer()
train_from_files()
# Compare and fine-tune
compare_tokenizers()
finetune_tokenizer()
print("\n" + "=" * 70)
print("β
ALL TRAINING EXAMPLES COMPLETED!")
print("=" * 70)
print(f"\nπ Trained tokenizers saved in: ./tokenizers/")
print("π Next steps:")
print(" - Load and use these tokenizers in your projects")
print(" - Experiment with different hyperparameters")
print(" - Train on your own domain-specific data")
print("\n")
except Exception as e:
print(f"\nβ Error: {e}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
main()