01 Tokenizers Quickstart
HuggingFace Tokenizers - Quick Start Examples
This notebook demonstrates the basics of the tokenizers library. Run each example to understand core concepts.
Installation: pip install tokenizers
Time: 15 minutes
Setup
Import the core modules we will use throughout this notebook. The tokenizers library provides a fast, flexible Python API backed by a Rust implementation, making it orders of magnitude faster than pure-Python alternatives. We also use standard-library utilities for subprocess calls and version introspection.
Install Packages
Before working with tokenizers, we need to install the tokenizers library (HuggingFace's Rust-backed tokenizer engine) and tiktoken (OpenAI's fast BPE implementation). The cell below also prints your current Python environment so you can verify versions and troubleshoot any conflicts.
import sys
import subprocess
# Check Python location and version
print("Python Executable:")
print(f" Location: {sys.executable}")
print(f" Version: {sys.version}")
print(f" Version Info: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
# List installed packages
print("\n" + "=" * 70)
print("Installed Packages:")
print("=" * 70)
result = subprocess.run(
    [sys.executable, "-m", "pip", "list"],
    capture_output=True,
    text=True,
)
print(result.stdout)
import sys
import subprocess
# Install required libraries
print("Installing tokenizers and tiktoken...")
print("=" * 70)
# Install tokenizers
print("\nInstalling tokenizers...")
result1 = subprocess.run(
    [sys.executable, "-m", "pip", "install", "tokenizers"],
    capture_output=True,
    text=True,
)
print(result1.stdout)
if result1.returncode != 0:
    print("Error:", result1.stderr)
# Install tiktoken
print("\nInstalling tiktoken...")
result2 = subprocess.run(
    [sys.executable, "-m", "pip", "install", "tiktoken"],
    capture_output=True,
    text=True,
)
print(result2.stdout)
if result2.returncode != 0:
    print("Error:", result2.stderr)
print("\nInstallation complete!")
print("=" * 70)
Example 1: Loading a Pretrained Tokenizer
What: Load BERT's pretrained tokenizer from the HuggingFace Hub and use it to encode a sentence into token IDs.
Why: Most NLP workflows start by loading an existing tokenizer that was trained alongside a model. Using the same tokenizer the model was trained with is critical: a mismatched vocabulary will produce garbage embeddings. Tokenizer.from_pretrained() downloads the vocabulary, merge rules, and configuration in one call, so the encoding is guaranteed to be consistent with the model.
How: The tokenizer splits the input text into subword tokens using the WordPiece algorithm, maps each token to an integer ID from a fixed vocabulary of ~30,000 entries, and returns an Encoding object containing the token strings, IDs, and offset mappings back to the original text.
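The greedy longest-match idea behind WordPiece can be sketched in a few lines of plain Python. The toy vocabulary below is invented for illustration; BERT's real vocabulary has roughly 30,000 entries, but the matching loop works the same way.

```python
# Illustrative greedy longest-match WordPiece with a toy vocabulary.
# Non-initial subwords carry the "##" continuation prefix, as in BERT.
def wordpiece(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no prefix matched: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"token", "##izer", "##s", "un", "##believ", "##able"}
print(wordpiece("tokenizers", toy_vocab))    # ['token', '##izer', '##s']
print(wordpiece("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
```

This is only the subword-matching step; the real tokenizer also normalizes text, pre-tokenizes on whitespace/punctuation, and tracks character offsets.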
# =============================================================================
# Example 1: Load a Pretrained Tokenizer (Easiest Start)
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 1: Load Pretrained Tokenizer")
print("=" * 70)
from tokenizers import Tokenizer
# Load BERT tokenizer from Hugging Face Hub
print("\nLoading bert-base-uncased tokenizer...")
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode a simple sentence
text = "Hello, world! How are you?"
output = tokenizer.encode(text)
print(f"\nInput text: {text}")
print(f"Tokens: {output.tokens}")
print(f"Token IDs: {output.ids}")
print(f"Number of tokens: {len(output.tokens)}")
# Decode back to text
decoded = tokenizer.decode(output.ids)
print(f"Decoded text: {decoded}")
Example 2: Build a BPE Tokenizer From Scratch
What: Initialize a Byte Pair Encoding (BPE) tokenizer, configure its pre-tokenizer and trainer, then train it on a small corpus.
Why: Understanding how to build a tokenizer from scratch demystifies the process that GPT-2, GPT-3, and LLaMA use under the hood. BPE starts with individual characters and iteratively merges the most frequent adjacent pair, producing a vocabulary that balances between whole words (efficient) and single characters (flexible). The merge count is controlled by vocab_size, and min_frequency filters out rare pairs.
How: BPE greedily selects the pair \((a, b)\) with the highest corpus frequency at each step, replaces every occurrence of \(a\,b\) with the new token \(ab\), and repeats until the desired vocabulary size is reached. The Whitespace pre-tokenizer ensures words are split on spaces before BPE sees them, which prevents merges across word boundaries.
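To make the merge loop concrete, here is a minimal pure-Python sketch of BPE training on a toy three-word corpus. This is the algorithm in miniature, not how the Rust implementation works internally.

```python
# Minimal BPE training sketch: words are tuples of symbols with a frequency;
# each step merges the most frequent adjacent pair into a new symbol.
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)  # replace the pair with the merged symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus, pre-split into characters (the pre-tokenizer's job)
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    print("merging", pair)
    words = merge_pair(words, pair)
print(list(words))  # 'low' and 'lowe' have emerged as single tokens
```

Three merges are enough here to fuse "low" into one token; with real data you run the loop until vocab_size is reached, exactly what BpeTrainer does below.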
# =============================================================================
# Example 2: Build a Simple Tokenizer from Scratch
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 2: Build Tokenizer from Scratch")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Step 1: Initialize with BPE model
print("\nCreating BPE tokenizer...")
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Step 2: Configure pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()
# Step 3: Create trainer
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=1000,
    min_frequency=2,
)
# Step 4: Train on sample data
print("Training on sample sentences...")
training_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is awesome!",
    "Tokenizers make NLP easier and faster.",
    "Natural language processing helps computers understand text.",
    "The tokenizer library is very fast and easy to use.",
] * 100  # Repeat for better training
tokenizer.train_from_iterator(training_data, trainer=trainer)
# Step 5: Test the tokenizer
test_text = "The tokenizer is learning!"
output = tokenizer.encode(test_text)
print(f"\nTest text: {test_text}")
print(f"Tokens: {output.tokens}")
print(f"Token IDs: {output.ids}")
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")
Example 3: The Encoding Object
What: Inspect every property of the Encoding object returned by tokenizer.encode(): tokens, IDs, offsets, attention masks, type IDs, and word IDs.
Why: When you feed tokenized text into a transformer, the model consumes not just the token IDs but also an attention mask (which positions are real tokens vs. padding) and type IDs (which sentence each token belongs to in pair-classification tasks). The offsets let you map each token back to the exact character span in the original string, which is essential for tasks like named-entity recognition where you need to highlight source text.
Connection: Understanding alignment tracking is the foundation of explainability in NLP: you need offsets to draw bounding boxes around entities, highlight relevant passages in search results, or build interactive tokenizer visualizations.
# =============================================================================
# Example 3: Understanding the Encoding Object
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 3: Understanding the Encoding Object")
print("=" * 70)
from tokenizers import Tokenizer
# Load a tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode with special tokens
text = "Hello, world!"
output = tokenizer.encode(text)
print(f"\nInput: '{text}'")
print(f"\nTokens: {output.tokens}")
print(f"IDs: {output.ids}")
print(f"Offsets: {output.offsets}")
print(f"Attention mask: {output.attention_mask}")
print(f"Type IDs: {output.type_ids}")
print(f"Word IDs: {output.word_ids}")
# Demonstrate alignment tracking
print("\n" + "-" * 70)
print("ALIGNMENT TRACKING:")
print("-" * 70)
for i, token in enumerate(output.tokens):
    start, end = output.offsets[i]
    original = text[start:end]
    print(f"Token '{token}' (ID: {output.ids[i]}) -> Original: '{original}'")
Example 4: Batch Encoding
What: Encode multiple sequences at once using encode_batch().
Why: In production, you rarely encode one sentence at a time. Batch encoding leverages the Rust backend to tokenize many sequences in parallel, often achieving 5-20x speedups over looping through individual calls. This matches how deep-learning dataloaders work: they collate batches of tokenized inputs before sending them to the GPU.
Connection: Every training loop and inference pipeline in frameworks like HuggingFace Transformers relies on batch encoding under the hood. Getting comfortable with it here prepares you for efficient data preprocessing at scale.
# =============================================================================
# Example 4: Batch Encoding
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 4: Batch Encoding")
print("=" * 70)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode multiple sequences
sequences = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "Machine learning is fun!",
]
print("Encoding batch of sequences...")
outputs = tokenizer.encode_batch(sequences)
print(f"\nEncoded {len(outputs)} sequences:")
for i, (seq, output) in enumerate(zip(sequences, outputs)):
    print(f"\n [{i+1}] '{seq}'")
    print(f" Tokens: {output.tokens}")
    print(f" Length: {len(output.tokens)} tokens")
Example 5: Padding and Truncation
What: Enable padding (so all sequences in a batch share the same length) and truncation (so no sequence exceeds a maximum length).
Why: Transformers require fixed-size tensor inputs. If sentences in a batch have different lengths, shorter ones must be padded with a special [PAD] token and the attention mask set to 0 at those positions so the model ignores them. Conversely, inputs longer than the model's context window (e.g., 512 tokens for BERT) must be truncated to fit. Together, padding and truncation convert ragged text data into the uniform rectangular tensors that GPUs need.
How: enable_padding(pad_id, pad_token) appends [PAD] tokens to the right of shorter sequences. enable_truncation(max_length) clips from the right. The attention mask is automatically updated: 1 for real tokens, 0 for padding.
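What enable_padding does per batch can be sketched by hand: pad every ID list on the right to the longest length and build the matching attention mask. The ID values below are illustrative stand-ins, not real tokenizer output.

```python
# Hand-rolled right-padding with attention masks (1 = real token, 0 = pad),
# mimicking what enable_padding sets up inside the Rust tokenizer.
def pad_batch(batches, pad_id=0):
    max_len = max(len(ids) for ids in batches)
    padded, masks = [], []
    for ids in batches:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

ids, masks = pad_batch([[101, 7592, 102], [101, 7592, 1010, 2088, 102]])
print(ids)    # [[101, 7592, 102, 0, 0], [101, 7592, 1010, 2088, 102]]
print(masks)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The library does this for you (and keeps tokens, offsets, and masks in sync), but the tensor shapes it produces are exactly these rectangles.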
# =============================================================================
# Example 5: Padding and Truncation
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 5: Padding and Truncation")
print("=" * 70)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Enable padding
print("\nEnabling padding...")
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
sequences = [
    "Short",
    "A bit longer sentence",
    "This is an even longer sentence with more words",
]
outputs = tokenizer.encode_batch(sequences)
print("\nPadded sequences (all same length):")
for i, output in enumerate(outputs):
    print(f"\n [{i+1}] Tokens: {output.tokens}")
    print(f" Length: {len(output.tokens)}")
    print(f" Attention mask: {output.attention_mask}")
# Enable truncation
print("\n" + "-" * 70)
print("Enabling truncation (max_length=10)...")
tokenizer.enable_truncation(max_length=10)
long_text = "This is a very long sentence that will definitely be truncated because it exceeds the maximum length"
output = tokenizer.encode(long_text)
print(f"\nOriginal: {long_text}")
print(f"Truncated tokens: {output.tokens}")
print(f"Length: {len(output.tokens)} (max: 10)")
Example 6: Encode and Decode (Round-Trip)
What: Encode text into token IDs, then decode the IDs back to text to verify round-trip fidelity.
Why: A good tokenizer should be lossless: you should be able to recover the original text from the token IDs (modulo normalization like lowercasing). Testing this round-trip is a quick sanity check when training or loading a new tokenizer. If the decoded text diverges from the input, it usually means the vocabulary is too small (producing too many [UNK] tokens) or the decoder is misconfigured.
Connection: Decode fidelity matters in generative models: when GPT produces token IDs, the decoder converts them back to the readable text you see in ChatGPT.
# =============================================================================
# Example 6: Encode and Decode
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 6: Encode and Decode")
print("=" * 70)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
original_texts = [
    "Hello, world!",
    "Machine learning is awesome.",
    "Tokenizers are fast!",
]
print("\nRound-trip encoding/decoding:")
for text in original_texts:
    # Encode
    output = tokenizer.encode(text)
    # Decode
    decoded = tokenizer.decode(output.ids)
    print(f"\n Original: {text}")
    print(f" Tokens: {output.tokens}")
    print(f" Decoded: {decoded}")
    print(f" Match: {'✓' if text.lower() == decoded else '✗'}")
Example 7: Encoding Sentence Pairs
What: Encode two sentences together as a pair, producing combined tokens with type IDs that distinguish which sentence each token belongs to.
Why: Many NLP tasks require processing two pieces of text jointly. Natural Language Inference (NLI) compares a premise and hypothesis; question answering pairs a question with a context passage; and semantic similarity compares two sentences. BERT-style models use a [CLS] Sentence_A [SEP] Sentence_B [SEP] format, with type ID 0 for Sentence A tokens and type ID 1 for Sentence B. The model's self-attention can then learn relationships between the two sequences.
Connection: Sentence-pair encoding is the backbone of tasks like textual entailment, duplicate question detection, and retrieval re-ranking, all common in production search and recommendation systems.
# =============================================================================
# Example 7: Working with Sentence Pairs
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 7: Sentence Pairs")
print("=" * 70)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode a pair
sentence_a = "The cat sat on the mat"
sentence_b = "A feline rested on the rug"
output = tokenizer.encode(sentence_a, sentence_b)
print(f"\nSentence A: {sentence_a}")
print(f"Sentence B: {sentence_b}")
print(f"\nCombined tokens: {output.tokens}")
print(f"Type IDs: {output.type_ids}")
print("  (0 = sentence A, 1 = sentence B)")
# Show which tokens belong to which sentence
print("\n" + "-" * 70)
print("TOKEN BREAKDOWN:")
print("-" * 70)
for token, type_id in zip(output.tokens, output.type_ids):
    sentence = "Special/A" if type_id == 0 else "B"
    print(f" {token:20s} -> Sentence {sentence}")
Example 8: Vocabulary Inspection
What: Retrieve the full vocabulary dictionary and look up individual tokens by name or by ID.
Why: Inspecting the vocabulary helps you understand what the tokenizer has learned. You can check whether domain-specific terms exist as single tokens (efficient) or get split into subwords (less efficient, potentially lossy). For BERT, the vocabulary contains ~30,522 tokens including whole words, subword prefixes like ##ing, and special tokens like [MASK]. Knowing a token's ID is also useful for manual debugging, for example verifying that your custom special tokens were assigned the correct IDs after training.
Connection: Vocabulary analysis is a practical first step when deciding whether to use an off-the-shelf tokenizer or train a domain-specific one for fields like medicine, law, or code.
# =============================================================================
# Example 8: Vocabulary Inspection
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 8: Vocabulary Inspection")
print("=" * 70)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Get vocabulary
vocab = tokenizer.get_vocab()
vocab_size = tokenizer.get_vocab_size()
print(f"\nVocabulary size: {vocab_size:,} tokens")
# Show first 20 tokens
print("\nFirst 20 tokens in vocabulary:")
for token, token_id in list(vocab.items())[:20]:
    print(f" ID {token_id:5d}: '{token}'")
# Test token lookup
print("\n" + "-" * 70)
print("TOKEN LOOKUP:")
print("-" * 70)
test_tokens = ["hello", "world", "[CLS]", "[SEP]", "[MASK]", "##ing"]
for token in test_tokens:
    token_id = tokenizer.token_to_id(token)
    if token_id is not None:
        print(f" '{token}' -> ID: {token_id}")
    else:
        print(f" '{token}' -> Not in vocabulary")
Example 9: Special Tokens
What: Define and register custom special tokens: reserved vocabulary entries like [CLS], [SEP], [MASK], and application-specific markers like [USER] and [BOT].
Why: Special tokens serve as control signals for the model. [CLS] marks the start of a classification head's input, [SEP] separates sentence pairs, [MASK] indicates positions for masked-language-model pretraining, and [PAD] fills empty positions. By convention, special tokens are assigned the lowest IDs (0, 1, 2, ...) and are never split by BPE merges. Adding custom tokens like [USER]/[BOT] lets you encode conversational structure directly in the token stream, which is how chat-tuned LLMs distinguish between user messages and assistant responses.
Connection: Every modern chat model (ChatGPT, Claude, LLaMA-Chat) relies on special tokens to delineate system prompts, user turns, and assistant turns.
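As a sketch of the idea (string formatting only; a real chat template would then be encoded by a tokenizer that registered [USER] and [BOT] as special tokens, so they survive as single unsplit tokens):

```python
# Illustrative only: delimit conversation turns with custom special-token
# markers so a model can tell speakers apart in one token stream.
def format_conversation(turns):
    # turns: list of (speaker, text) pairs; speaker is "user" or "bot"
    parts = []
    for speaker, text in turns:
        marker = "[USER]" if speaker == "user" else "[BOT]"
        parts.append(f"{marker} {text}")
    return " ".join(parts)

chat = format_conversation([
    ("user", "What is a tokenizer?"),
    ("bot", "It converts text into token IDs."),
])
print(chat)
# [USER] What is a tokenizer? [BOT] It converts text into token IDs.
```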
# =============================================================================
# Example 9: Special Tokens
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 9: Special Tokens")
print("=" * 70)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
# Create tokenizer with custom special tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    special_tokens=[
        "[UNK]",   # Unknown token
        "[CLS]",   # Start of sequence
        "[SEP]",   # Separator
        "[PAD]",   # Padding
        "[MASK]",  # Masked token
        "[USER]",  # Custom: user message
        "[BOT]",   # Custom: bot response
    ],
    vocab_size=1000,
)
# Train on simple data
training_data = ["Hello world", "How are you"] * 100
tokenizer.train_from_iterator(training_data, trainer=trainer)
print("\nSpecial tokens and their IDs:")
special_token_names = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "[USER]", "[BOT]"]
for token in special_token_names:
    token_id = tokenizer.token_to_id(token)
    print(f" {token:10s} -> ID: {token_id}")
# Test encoding with special tokens
print("\n" + "-" * 70)
text = "Hello world"
output = tokenizer.encode(text)
print(f"\nText: '{text}'")
print(f"Tokens: {output.tokens}")
Example 10: Performance - Single vs. Batch Encoding
What: Time the difference between encoding 1,000 sequences one at a time versus encoding them all in a single encode_batch() call.
Why: The HuggingFace tokenizers library is written in Rust with Python bindings. Each Python-to-Rust call has fixed overhead, so calling encode() in a loop pays that overhead 1,000 times. encode_batch() crosses the FFI boundary once and parallelizes the work across CPU cores, yielding dramatic speedups. In data-processing pipelines where tokenization is the bottleneck (e.g., preparing millions of training examples), batch encoding can cut preprocessing time from hours to minutes.
Connection: The same principle applies across ML tooling: batching operations (whether tokenization, embedding generation, or GPU inference) almost always outperforms sequential processing due to reduced overhead and better hardware utilization.
# =============================================================================
# Example 10: Performance Comparison
# =============================================================================
print("\n" + "=" * 70)
print("EXAMPLE 10: Performance Comparison")
print("=" * 70)
import time
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Prepare test data
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming the world.",
    "Natural language processing enables computers to understand text.",
    "Tokenizers are essential for NLP applications.",
] * 250  # 1000 sequences
print(f"\nEncoding {len(texts)} sequences...")
# Method 1: One by one
start = time.time()
for text in texts:
    _ = tokenizer.encode(text)
time_single = time.time() - start
# Method 2: Batch encoding
start = time.time()
_ = tokenizer.encode_batch(texts)
time_batch = time.time() - start
print("\nResults:")
print(f" Single encoding: {time_single:.3f}s")
print(f" Batch encoding: {time_batch:.3f}s")
print(f" Speedup: {time_single/time_batch:.1f}x faster!")