02 Tokenizers Training¶
HuggingFace Tokenizers - Training Examples¶
Complete examples for training different types of tokenizers.
Installation: pip install tokenizers
Time: 30-45 minutes
Setup¶
Import os and pathlib for file-system operations. Training a tokenizer often involves reading corpus files from disk and saving the resulting model to JSON, so path handling is essential throughout this notebook.
import os
from pathlib import Path
# =============================================================================
# Example 1: Train BPE Tokenizer (GPT-2 Style)
# =============================================================================
Train a BPE Tokenizer (GPT-2 Style)¶
What: Build and train a Byte Pair Encoding tokenizer that mirrors the architecture used by GPT-2 and GPT-3.
Why: BPE is the most widely used subword algorithm in modern LLMs. By training one from scratch you will see every component: the ByteLevel pre-tokenizer that converts raw Unicode into bytes (so the vocabulary never encounters an unknown character), the BpeTrainer that controls vocab size and merge frequency, and the ByteLevel decoder that reassembles bytes back into readable text.
How: BPE iteratively merges the most frequent byte-pair in the corpus. After \(V - 256\) merges (where \(V\) is the target vocab size and 256 is the base byte alphabet), the vocabulary contains both single-byte and multi-byte tokens. The add_prefix_space=True setting prepends a space to the input when it does not already start with one, so a word at the beginning of a sentence is tokenized identically to the same word mid-sentence, a GPT-2 design choice that simplifies generation.
def train_bpe_tokenizer():
"""Train a BPE tokenizer from scratch"""
print("\n" + "=" * 70)
print("EXAMPLE 1: Train BPE Tokenizer (GPT-2 Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.processors import ByteLevel as ByteLevelProcessor
# 1. Initialize
print("\nπ¦ Initializing BPE tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
# 2. Pre-tokenizer (GPT-2 style byte-level)
tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=True)
# 3. Decoder and byte-level post-processor (uses the processor imported above, as in GPT-2)
tokenizer.decoder = ByteLevelDecoder()
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=False)
# 4. Trainer configuration
trainer = BpeTrainer(
vocab_size=5000,
min_frequency=2,
special_tokens=["<|endoftext|>"],
show_progress=True
)
# 5. Training data
training_data = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming the world of AI.",
"Natural language processing helps computers understand text.",
"Tokenizers split text into manageable pieces.",
"Deep learning models require proper tokenization.",
] * 500 # Repeat for better vocab
print("π Training tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 6. Test
test_text = "The tokenizer learned to split words!"
output = tokenizer.encode(test_text)
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
print(f"π IDs: {output.ids[:10]}... (showing first 10)")
# 7. Save
save_path = "tokenizers/bpe_gpt2_style.json"
os.makedirs("tokenizers", exist_ok=True)
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
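Once the cell above has been run, the saved file can be reloaded to check the add_prefix_space behaviour described earlier. A minimal sketch (assumes tokenizers/bpe_gpt2_style.json exists from Example 1):
from tokenizers import Tokenizer

bpe = Tokenizer.from_file("tokenizers/bpe_gpt2_style.json")

# With add_prefix_space=True, a word at the start of the input should receive the
# same space-prefixed byte-level tokens as the same word mid-sentence.
print(bpe.encode("Machine learning").tokens)
print(bpe.encode("about Machine learning").tokens)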
# =============================================================================
# Example 2: Train WordPiece Tokenizer (BERT Style)
# =============================================================================
Train a WordPiece Tokenizer (BERT Style)¶
What: Build and train a WordPiece tokenizer with BERT-style normalization, [CLS]/[SEP] post-processing, and the ## continuing-subword prefix.
Why: WordPiece is BERT's tokenization algorithm and differs from BPE in a subtle but important way: instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training corpus under a unigram language model. This tends to produce slightly different vocabularies than BPE, often with better coverage of rare morphological variants.
How: The BertNormalizer lowercases text, strips accents, and handles Chinese characters. After training, TemplateProcessing automatically wraps every input in [CLS] ... [SEP] (single) or [CLS] A [SEP] B [SEP] (pair), and the WordPieceDecoder re-joins subwords by removing the ## prefix during decoding.
Connection: If you fine-tune BERT, DistilBERT, or any BERT variant, the tokenizer you load is exactly this architecture. Training your own lets you adapt the vocabulary to a specialized domain like biomedical text or legal documents.
def train_wordpiece_tokenizer():
"""Train a WordPiece tokenizer like BERT"""
print("\n" + "=" * 70)
print("EXAMPLE 2: Train WordPiece Tokenizer (BERT Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import WordPiece as WordPieceDecoder
# 1. Initialize
print("\nπ¦ Initializing WordPiece tokenizer...")
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# 2. Normalizer (BERT-style)
tokenizer.normalizer = BertNormalizer(
clean_text=True,
handle_chinese_chars=True,
strip_accents=True,
lowercase=True
)
# 3. Pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()
# 4. Trainer
trainer = WordPieceTrainer(
vocab_size=10000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
continuing_subword_prefix="##",
show_progress=True
)
# 5. Training data (simulate diverse corpus)
training_data = [
"The quick brown fox jumps over the lazy dog.",
"BERT uses WordPiece tokenization.",
"Natural language understanding is improving rapidly.",
"Machine learning models need quality data.",
"Tokenization is the first step in NLP.",
"Understanding context is crucial for language models.",
] * 500
print("π Training WordPiece tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 6. Post-processor (BERT-style)
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
# 7. Decoder
tokenizer.decoder = WordPieceDecoder(prefix="##")
# 8. Test
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
test_text = "Understanding tokenization"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
decoded = tokenizer.decode(output.ids)
print(f"π Decoded: '{decoded}'")
# 9. Save
save_path = "tokenizers/wordpiece_bert_style.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
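Because the template also defines a pair form, encoding two segments shows the [CLS]/[SEP] wrapping and the segment ids directly. A quick check, assuming the WordPiece tokenizer above has been trained and saved:
from tokenizers import Tokenizer

wp = Tokenizer.from_file("tokenizers/wordpiece_bert_style.json")

# Pair encoding: [CLS] A [SEP] B [SEP], with type id 0 for the A segment and 1 for B.
pair = wp.encode("tokenization is the first step", "context is crucial")
print(pair.tokens)
print(pair.type_ids)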
# =============================================================================
# Example 3: Train Unigram Tokenizer (SentencePiece Style)
# =============================================================================
Train a Unigram Tokenizer (SentencePiece Style)¶
What: Train a Unigram language-model tokenizer with NFKC normalization and Metaspace pre-tokenization, suitable for multilingual corpora.
Why: Unlike BPE (which builds up from characters) or WordPiece (which also builds up), the Unigram algorithm works top-down: it starts with a large candidate vocabulary and iteratively removes tokens whose removal causes the smallest increase in corpus loss. This produces a probabilistic model where each segmentation has a well-defined likelihood, enabling features like n-best segmentation and subword regularization (randomly sampling different segmentations during training to improve robustness).
How: The loss function is the negative log-likelihood of the corpus: \(\mathcal{L} = -\sum_{s \in \text{corpus}} \log P(s)\), where \(P(s)\) is the product of the individual token probabilities under the unigram assumption. At each pruning step, the token whose removal increases \(\mathcal{L}\) the least is dropped.
Connection: T5, ALBERT, and XLM-R use Unigram tokenizers trained with SentencePiece; LLaMA also ships a SentencePiece tokenizer, though it uses a BPE model rather than Unigram. The Metaspace pre-tokenizer replaces spaces with the ▁ character, which is the hallmark of SentencePiece-style models.
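To make the pruning criterion concrete, here is a toy evaluation of the loss above with made-up token probabilities (illustrative numbers only, not trainer output):
import math

# Hypothetical unigram probabilities and one segmentation of a one-sentence "corpus".
probs = {"▁machine": 0.02, "▁learn": 0.015, "ing": 0.05}
segmentation = ["▁machine", "▁learn", "ing"]

# P(s) is the product of token probabilities; the corpus loss is -log P(s) summed over sentences.
loss = -sum(math.log(probs[token]) for token in segmentation)
print(f"corpus loss: {loss:.3f} nats")
# Pruning removes the candidate token whose deletion raises this loss the least.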
def train_unigram_tokenizer():
"""Train a Unigram tokenizer for multilingual use"""
print("\n" + "=" * 70)
print("EXAMPLE 3: Train Unigram Tokenizer (SentencePiece Style)")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Metaspace
from tokenizers.decoders import Metaspace as MetaspaceDecoder
# 1. Initialize
print("\nπ¦ Initializing Unigram tokenizer...")
tokenizer = Tokenizer(Unigram())
# 2. Normalizer
tokenizer.normalizer = NFKC()
# 3. Pre-tokenizer (Metaspace for SentencePiece compatibility)
tokenizer.pre_tokenizer = Metaspace()
# 4. Decoder
tokenizer.decoder = MetaspaceDecoder()
# 5. Trainer
trainer = UnigramTrainer(
vocab_size=8000,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
unk_token="<unk>",
show_progress=True
)
# 6. Multilingual training data
training_data = [
# English
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming the world.",
# Spanish
"El rΓ‘pido zorro marrΓ³n salta sobre el perro perezoso.",
"El aprendizaje automΓ‘tico estΓ‘ transformando el mundo.",
# French
"Le rapide renard brun saute par-dessus le chien paresseux.",
"L'apprentissage automatique transforme le monde.",
# German
"Der schnelle braune Fuchs springt ΓΌber den faulen Hund.",
"Maschinelles Lernen verΓ€ndert die Welt.",
] * 300
print("π Training Unigram tokenizer...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 7. Test on multiple languages
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
test_cases = [
("English", "Machine learning is amazing"),
("Spanish", "El aprendizaje es increΓble"),
("French", "L'apprentissage est incroyable"),
]
print("\nπ Testing on multiple languages:")
for lang, text in test_cases:
output = tokenizer.encode(text)
print(f"\n {lang}: '{text}'")
print(f" Tokens: {output.tokens[:10]}... ({len(output.tokens)} total)")
# 8. Save
save_path = "tokenizers/unigram_multilingual.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
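After training, the Metaspace convention is easy to verify: word-initial tokens carry the ▁ marker, and decoding turns it back into spaces. A short check, assuming the tokenizer above has been saved:
from tokenizers import Tokenizer

uni = Tokenizer.from_file("tokenizers/unigram_multilingual.json")
enc = uni.encode("Machine learning is amazing")
print(enc.tokens)           # expect ▁-prefixed pieces such as '▁Machine' or '▁learning'
print(uni.decode(enc.ids))  # the Metaspace decoder restores normal spaces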
# =============================================================================
# Example 4: Train Domain-Specific Tokenizer (Code)
# =============================================================================
Train a Code-Specific Tokenizer¶
What: Train a BPE tokenizer on a corpus of Python and JavaScript code, with custom special tokens for code structure like <INDENT>, <DEDENT>, and <NEWLINE>.
Why: Code has very different statistical patterns from natural language: identifiers use snake_case or camelCase, operators and brackets are frequent, and indentation carries semantic meaning (especially in Python). A general-purpose tokenizer trained on English prose will waste vocabulary slots on common English words while poorly handling code constructs. Training a code-specific tokenizer produces shorter token sequences for code, which means lower API costs and more code fitting in a context window.
Connection: Models like Codex, CodeLlama, and StarCoder all use tokenizers trained on code corpora. The custom special tokens let you explicitly mark structural elements that would otherwise be ambiguous in the token stream.
def train_code_tokenizer():
"""Train a tokenizer optimized for code"""
print("\n" + "=" * 70)
print("EXAMPLE 4: Train Code-Specific Tokenizer")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Initialize
print("\nπ¦ Initializing code tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
# 2. Pre-tokenizer (simple whitespace, preserve case)
tokenizer.pre_tokenizer = Whitespace()
# 3. Trainer with code-specific tokens
trainer = BpeTrainer(
vocab_size=15000,
special_tokens=[
"<UNK>", "<PAD>", "<BOS>", "<EOS>",
"<INDENT>", "<DEDENT>", "<NEWLINE>",
"<COMMENT>", "<STRING>", "<NUMBER>"
],
show_progress=True
)
# 4. Code training data
training_data = [
# Python
"def hello_world():\n print('Hello, world!')\n return True",
"class MyClass:\n def __init__(self):\n self.value = 42",
"for i in range(10):\n print(i)",
"import numpy as np\nimport pandas as pd",
# JavaScript
"function hello() {\n console.log('Hello');\n return true;\n}",
"const myArray = [1, 2, 3, 4, 5];",
"for (let i = 0; i < 10; i++) {\n console.log(i);\n}",
# General patterns
"if (condition) { doSomething(); }",
"var x = 10;\nvar y = 20;\nvar sum = x + y;",
] * 300
print("π Training on code samples...")
tokenizer.train_from_iterator(training_data, trainer=trainer)
# 5. Test on code snippets
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
code_examples = [
("Python", "def factorial(n):\n return 1 if n <= 1 else n * factorial(n-1)"),
("JavaScript", "const sum = (a, b) => a + b;"),
("General", "for i in range(10): print(i)")
]
print("\nπ Testing on code:")
for lang, code in code_examples:
output = tokenizer.encode(code)
print(f"\n {lang}:")
print(f" Code: {code[:50]}...")
print(f" Tokens: {output.tokens[:15]}...")
print(f" Total: {len(output.tokens)} tokens")
# 6. Save
save_path = "tokenizers/code_tokenizer.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
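One quick way to see the sequence-length benefit claimed earlier is to measure characters per token on a snippet (higher means better compression). A rough sketch, assuming the code tokenizer above has been saved:
from tokenizers import Tokenizer

code_tok = Tokenizer.from_file("tokenizers/code_tokenizer.json")
snippet = "def add(a, b):\n    return a + b"
enc = code_tok.encode(snippet)
print(f"{len(snippet)} chars -> {len(enc.tokens)} tokens "
      f"({len(snippet) / len(enc.tokens):.1f} chars/token)")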
# =============================================================================
# Example 5: Train with Real Files
# =============================================================================
Train From Files¶
What: Train a tokenizer directly from text files on disk using tokenizer.train(files, trainer) instead of train_from_iterator().
Why: Real-world tokenizer training operates on gigabytes of text stored in files: Wikipedia dumps, Common Crawl extracts, or domain-specific corpora. The train() method reads files lazily and streams data through the Rust backend, making it memory-efficient for large datasets. This is the approach you would use when preparing a production tokenizer for a new language or domain.
Connection: When teams at Google or Meta train tokenizers for models like T5 or LLaMA, they point the trainer at massive file collections. Understanding the file-based API bridges the gap between toy examples and real model development.
def train_from_files():
"""Train tokenizer from actual text files"""
print("\n" + "=" * 70)
print("EXAMPLE 5: Train from Files")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Create sample training files
print("\nπ Creating sample training files...")
os.makedirs("training_data", exist_ok=True)
# Create 3 sample files
file_contents = [
("train.txt", "The quick brown fox jumps over the lazy dog.\n" * 100),
("valid.txt", "Machine learning is transforming AI.\n" * 100),
("test.txt", "Natural language processing is amazing.\n" * 100),
]
files = []
for filename, content in file_contents:
filepath = f"training_data/{filename}"
with open(filepath, "w", encoding="utf-8") as f:
f.write(content)
files.append(filepath)
print(f" Created: {filepath}")
# 2. Initialize tokenizer
print("\nπ¦ Initializing tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# 3. Trainer
trainer = BpeTrainer(
vocab_size=5000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
show_progress=True
)
# 4. Train from files
print("\nπ Training from files...")
tokenizer.train(files, trainer)
print(f"\nβ
Training complete!")
print(f"π Vocabulary size: {tokenizer.get_vocab_size()}")
# 5. Test
test_text = "The fox is learning machine learning"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
# 6. Save
save_path = "tokenizers/file_trained.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
return tokenizer
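For corpora too large to hold in memory, the same files can instead be streamed through train_from_iterator() with a generator that yields one line at a time. A sketch under the assumption that the training_data/ files created above exist:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def iter_lines(paths):
    """Yield lines lazily so the whole corpus never sits in memory at once."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                yield line

streamed = Tokenizer(BPE(unk_token="[UNK]"))
streamed.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
streamed.train_from_iterator(iter_lines(["training_data/train.txt",
                                         "training_data/valid.txt"]), trainer=trainer)
print(streamed.get_vocab_size())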
# =============================================================================
# Example 6: Compare Different Tokenizers
# =============================================================================
Compare Tokenizers¶
What: Load the BPE, WordPiece, Unigram, and Code tokenizers trained in earlier examples and compare how each one segments the same input sentences.
Why: Different algorithms make different trade-offs. BPE is greedy and deterministic; WordPiece optimizes corpus likelihood; Unigram is probabilistic; and a code tokenizer is domain-tuned. Seeing side-by-side comparisons on the same text reveals these differences concretely: one tokenizer might produce 8 tokens where another produces 12, and the token boundaries will differ. Fewer tokens generally means faster inference and lower API cost, but the "right" tokenizer depends on the model architecture and training data.
Connection: Choosing or training the right tokenizer is one of the most impactful decisions in an LLM project. A mismatch between tokenizer and domain can silently degrade model quality.
def compare_tokenizers():
"""Compare tokenization results across different models"""
print("\n" + "=" * 70)
print("EXAMPLE 6: Compare Different Tokenizers")
print("=" * 70)
from tokenizers import Tokenizer
# Load previously trained tokenizers
tokenizer_paths = {
"BPE (GPT-2)": "tokenizers/bpe_gpt2_style.json",
"WordPiece (BERT)": "tokenizers/wordpiece_bert_style.json",
"Unigram": "tokenizers/unigram_multilingual.json",
"Code": "tokenizers/code_tokenizer.json",
}
tokenizers = {}
for name, path in tokenizer_paths.items():
if os.path.exists(path):
tokenizers[name] = Tokenizer.from_file(path)
if not tokenizers:
print("β οΈ No trained tokenizers found. Run other examples first.")
return
# Test sentences
test_sentences = [
"Machine learning is transforming the world.",
"The quick brown fox jumps over the lazy dog.",
"Understanding natural language processing.",
]
print("\nπ Comparing tokenization results:\n")
for sentence in test_sentences:
print(f"π Sentence: '{sentence}'")
print("-" * 70)
for name, tokenizer in tokenizers.items():
output = tokenizer.encode(sentence)
print(f"\n {name}:")
print(f" Tokens: {output.tokens[:15]}")
print(f" Count: {len(output.tokens)} tokens")
print(f" IDs: {output.ids[:10]}...")
print("\n")
# =============================================================================
# Example 7: Fine-tune Existing Tokenizer
# =============================================================================
Fine-Tune an Existing Tokenizer¶
What: Load a previously trained WordPiece tokenizer and extend its vocabulary with domain-specific tokens using tokenizer.add_tokens().
Why: Training a tokenizer from scratch requires a large corpus, but you can often get 80% of the benefit by adding new tokens to an existing vocabulary. If your application deals with medical terms like "covid19" or ML framework names like "pytorch", adding them as single tokens prevents the tokenizer from splitting them into unhelpful subwords. This is the approach used when fine-tuning models for specialized domains: you extend the tokenizer vocabulary and resize the model's embedding matrix to match.
Connection: HuggingFace Transformers' tokenizer.add_tokens() and model.resize_token_embeddings() work together to let you adapt any pretrained model to new vocabulary without retraining from scratch.
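The Transformers-side pattern mentioned in the Connection note looks like this (a hedged sketch: it requires the transformers package and downloads bert-base-uncased; the function below instead extends the locally trained tokenizer with the raw tokenizers API):
from transformers import AutoTokenizer, AutoModelForMaskedLM

hf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

added = hf_tok.add_tokens(["covid19", "pytorch", "neural_network"])
model.resize_token_embeddings(len(hf_tok))  # newly added rows start randomly initialized
print(f"added {added} tokens; vocabulary is now {len(hf_tok)} entries")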
def finetune_tokenizer():
"""Add domain-specific vocabulary to existing tokenizer"""
print("\n" + "=" * 70)
print("EXAMPLE 7: Fine-tune Existing Tokenizer")
print("=" * 70)
from tokenizers import Tokenizer
# Load base tokenizer
if not os.path.exists("tokenizers/wordpiece_bert_style.json"):
print("β οΈ Base tokenizer not found. Train WordPiece tokenizer first.")
return
print("\nπ₯ Loading base tokenizer...")
tokenizer = Tokenizer.from_file("tokenizers/wordpiece_bert_style.json")
vocab_before = tokenizer.get_vocab_size()
print(f"π Vocabulary size before: {vocab_before}")
# Add domain-specific tokens
print("\nβ Adding domain-specific tokens...")
new_tokens = [
# Medical terms
"covid19", "vaccine", "antibody", "pandemic",
# Tech terms
"tensorflow", "pytorch", "neural_network", "transformer",
# Custom tokens
"[ENTITY]", "[DATE]", "[NUMBER]", "[URL]"
]
num_added = tokenizer.add_tokens(new_tokens)
vocab_after = tokenizer.get_vocab_size()
print(f"β
Added {num_added} new tokens")
print(f"π Vocabulary size after: {vocab_after}")
# Test with domain-specific text
test_text = "The covid19 vaccine uses tensorflow for neural_network prediction"
output = tokenizer.encode(test_text)
print(f"\nπ Test: '{test_text}'")
print(f"π’ Tokens: {output.tokens}")
# Save fine-tuned tokenizer
save_path = "tokenizers/finetuned_tokenizer.json"
tokenizer.save(save_path)
print(f"\nπΎ Saved to: {save_path}")
# =============================================================================
# Main Function
# =============================================================================
Main: Run All Training Examples¶
Execute every training example in sequence: BPE, WordPiece, Unigram, Code, file-based training, comparison, and fine-tuning. The trained tokenizers are saved to the ./tokenizers/ directory as JSON files that can be loaded later with Tokenizer.from_file().
def main():
"""Run all training examples"""
print("\n" + "=" * 70)
print("HUGGINGFACE TOKENIZERS - TRAINING EXAMPLES")
print("=" * 70)
print("\nThis script demonstrates:")
print(" 1. Training BPE tokenizer (GPT-2 style)")
print(" 2. Training WordPiece tokenizer (BERT style)")
print(" 3. Training Unigram tokenizer (multilingual)")
print(" 4. Training domain-specific tokenizer (code)")
print(" 5. Training from files")
print(" 6. Comparing different tokenizers")
print(" 7. Fine-tuning existing tokenizers")
try:
# Train different types
train_bpe_tokenizer()
train_wordpiece_tokenizer()
train_unigram_tokenizer()
train_code_tokenizer()
train_from_files()
# Compare and fine-tune
compare_tokenizers()
finetune_tokenizer()
print("\n" + "=" * 70)
print("β
ALL TRAINING EXAMPLES COMPLETED!")
print("=" * 70)
print(f"\nπ Trained tokenizers saved in: ./tokenizers/")
print("π Next steps:")
print(" - Load and use these tokenizers in your projects")
print(" - Experiment with different hyperparameters")
print(" - Train on your own domain-specific data")
print("\n")
except Exception as e:
print(f"\nβ Error: {e}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
main()