Setup

Import every component of the HuggingFace tokenizer pipeline: models (BPE, WordPiece, Unigram), normalizers, pre-tokenizers, post-processors, and decoders. This notebook dissects each stage individually so you can understand how they compose into a full tokenization pipeline.

# !pip install tokenizers

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, decoders, trainers
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.normalizers import NFD, NFKC, Lowercase, StripAccents, Replace, Strip, Sequence as NormSeq
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit, ByteLevel, CharDelimiterSplit, Digits, Punctuation, Sequence as PreSeq
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import ByteLevel as ByteLevelDecoder, WordPiece as WordPieceDecoder

print("✅ Imports successful")

Part 1: Normalization

Normalization cleans and standardizes text before tokenization.

1.1 Unicode Normalization (NFD, NFC, NFKC, NFKD)

NFD (Canonical Decomposition): breaks characters into base characters plus combining marks.
NFC (Canonical Composition): recombines decomposed sequences into precomposed characters.
NFKC (Compatibility Composition): composes and also folds compatibility characters (ligatures, width variants) into their standard forms.
NFKD (Compatibility Decomposition): decomposes and applies the same compatibility folding.

# Test text with special characters
test_text = "Héllo Wörld! café résumé"

# NFD - Decompose
nfd = normalizers.NFD()
print(f"Original:  {test_text}")
print(f"NFD:       {nfd.normalize_str(test_text)}")
print(f"Chars:     {[c for c in nfd.normalize_str(test_text)]}")

# NFKC - Compose and compatibility normalize
nfkc = normalizers.NFKC()
print(f"\nNFKC:      {nfkc.normalize_str(test_text)}")
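The same normal forms are available in Python's standard library, so you can verify the behavior independently of the tokenizers package. A quick sketch with unicodedata:

```python
import unicodedata

s = "café"                                   # precomposed é (one code point)
nfd = unicodedata.normalize("NFD", s)        # 'e' + combining acute accent
nfc = unicodedata.normalize("NFC", s)        # back to the precomposed form

print(len(s), len(nfd), len(nfc))            # NFD is one code point longer

# NFKC also folds compatibility characters, e.g. the 'ﬁ' ligature
print(unicodedata.normalize("NFKC", "ﬁle"))  # file
```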

1.2 Lowercase Normalization

What: Convert all characters to lowercase using normalizers.Lowercase().

Why: Lowercasing reduces vocabulary size by collapsing “Hello”, “hello”, and “HELLO” into a single token. BERT’s “uncased” models use this normalization, which improves generalization for tasks where case does not carry important meaning (e.g., topic classification). However, for tasks like named-entity recognition or code processing, preserving case is critical – which is why models like RoBERTa and GPT skip this step.

lowercase = normalizers.Lowercase()

examples = [
    "Hello World",
    "MACHINE LEARNING",
    "NaTuRaL LaNgUaGe"
]

print("Lowercase Normalization:")
print("=" * 60)
for text in examples:
    normalized = lowercase.normalize_str(text)
    print(f"{text:25s} → {normalized}")

1.3 Strip Accents

What: Remove diacritical marks (accents) from characters using normalizers.StripAccents(), so “café” and “cafe” map to the same form.

Why: Stripping accents further reduces vocabulary fragmentation for languages with accented characters (French, Spanish, German). Combined with NFD decomposition (which separates base characters from their combining marks), this normalizer collapses “é” into “e”. Note that this is destructive – it discards information that may matter for some languages – so use it deliberately.

strip_accents = normalizers.StripAccents()

examples = [
    "café résumé",
    "naïve señor",
    "Москва",  # Moscow in Russian
    "北京"      # Beijing in Chinese
]

print("Strip Accents:")
print("=" * 60)
for text in examples:
    normalized = strip_accents.normalize_str(text)
    print(f"{text:25s} → {normalized}")
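Under the hood, accent stripping is essentially NFD decomposition followed by dropping combining marks (Unicode category Mn). A stdlib sketch of the same idea, not the library's actual implementation:

```python
import unicodedata

def strip_accents_py(text: str) -> str:
    # Decompose, then drop combining marks (category Mn)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents_py("café résumé"))  # cafe resume
print(strip_accents_py("naïve señor"))  # naive senor
print(strip_accents_py("北京"))          # unchanged: CJK has no combining marks
```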

1.4 Replace Normalizer (Pattern Replacement)

What: Use normalizers.Replace() to apply regex-based substitutions – for example, collapsing multiple spaces or replacing URLs with a [URL] placeholder.

Why: Pattern replacement is the most flexible normalizer. Common uses include: removing or replacing URLs, email addresses, and phone numbers before training (so the vocabulary does not waste capacity on them); normalizing whitespace; and masking personally identifiable information (PII). Replacements happen before tokenization, so they affect what the model sees.

Connection: In production NLP pipelines, PII masking and URL normalization are often the first preprocessing steps, both for privacy compliance and to improve model robustness.

# Replace multiple spaces with single space
replace_spaces = normalizers.Replace("  +", " ")

examples = [
    "Hello    world",
    "Too     many       spaces",
    "Normal text"
]

print("Replace Multiple Spaces:")
print("=" * 60)
for text in examples:
    normalized = replace_spaces.normalize_str(text)
    print(f"'{text:30s}' → '{normalized}'")

# Remove URLs
remove_urls = normalizers.Replace(r"https?://\S+", "[URL]")
text_with_url = "Check out https://huggingface.co for more info"
print(f"\nRemove URLs:")
print(f"Original:   {text_with_url}")
print(f"Normalized: {remove_urls.normalize_str(text_with_url)}")
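The same substitutions are easy to prototype with re.sub before wiring them into a normalizer chain. A sketch with illustrative patterns (not production-grade PII detection):

```python
import re

PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),               # URLs first, so they
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),  # aren't matched as emails
    (re.compile(r"  +"), " "),                            # collapse runs of spaces
]

def clean(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(clean("Mail me at alice@example.com   or see https://example.com"))
# Mail me at [EMAIL] or see [URL]
```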

1.5 Strip Normalizer (Remove Leading/Trailing Whitespace)

What: Remove leading and trailing whitespace (spaces, tabs, newlines) from the input text using normalizers.Strip().

Why: Stray whitespace is a common artifact of web scraping and data extraction. If left in place, it can produce unexpected tokens (a leading tab becomes a separate token) and misalign offset mappings. Stripping ensures the tokenizer sees clean input boundaries.

strip = normalizers.Strip()

examples = [
    "  leading spaces",
    "trailing spaces  ",
    "  both sides  ",
    "\ttabs and newlines\n"
]

print("Strip Whitespace:")
print("=" * 60)
for text in examples:
    normalized = strip.normalize_str(text)
    print(f"'{text}' → '{normalized}'")

1.6 Combining Normalizers (Sequence)

What: Chain multiple normalizers into a single pipeline using normalizers.Sequence(). The normalizers execute in order, left to right.

Why: Real tokenizers rarely use a single normalizer in isolation. BERT, for example, applies NFD decomposition, then lowercasing, then accent stripping – in that specific order (NFD must come before StripAccents so that combining marks are separated before removal). The Sequence combinator lets you compose arbitrary normalization pipelines declaratively.

Connection: The order of normalizers matters. Lowercasing before NFD decomposition can produce different results than the reverse, so always reason about the pipeline as a sequential transformation.

# BERT-style normalization
bert_normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])

test_texts = [
    "HELLO WORLD!",
    "Café Résumé",
    "   Spaced   Out   "
]

print("BERT-style Normalization (NFD + Lowercase + StripAccents):")
print("=" * 60)
for text in test_texts:
    normalized = bert_normalizer.normalize_str(text)
    print(f"{text:30s} → {normalized}")
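The order sensitivity is easy to demonstrate with a stdlib sketch of accent stripping: removing combining marks without decomposing first leaves precomposed characters untouched (illustrative only, not the library's implementation):

```python
import unicodedata

def drop_marks(text: str) -> str:
    # Remove combining marks (category Mn) WITHOUT decomposing first
    return "".join(c for c in text if unicodedata.category(c) != "Mn")

composed = unicodedata.normalize("NFC", "café")    # precomposed é
decomposed = unicodedata.normalize("NFD", "café")  # e + combining accent

print(drop_marks(composed))    # café  (é survives: it is a single code point)
print(drop_marks(decomposed))  # cafe  (the combining accent is removed)
```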

1.7 Real-World Example: Clean Social Media Text

What: Build a normalizer chain that replaces @mentions, #hashtags, and URLs with placeholder tokens, collapses whitespace, strips edges, and lowercases – all in one pipeline.

Why: Social media text is among the noisiest inputs an NLP model encounters: mentions, hashtags, shortened URLs, excessive punctuation, and mixed case. A well-designed normalizer chain turns this chaos into clean, consistent input before the tokenizer ever sees it, improving both vocabulary efficiency and downstream model accuracy.

Connection: Sentiment analysis, hate-speech detection, and social media monitoring systems all rely on this kind of preprocessing. The placeholder tokens ([USER], [HASHTAG], [URL]) can later be treated as special tokens by the model.

# Social media text normalizer
social_media_normalizer = normalizers.Sequence([
    normalizers.Replace(r"@\w+", "[USER]"),           # Replace mentions
    normalizers.Replace(r"#\w+", "[HASHTAG]"),        # Replace hashtags
    normalizers.Replace(r"https?://\S+", "[URL]"),    # Replace URLs
    normalizers.Replace(r"  +", " "),                 # Multiple spaces
    normalizers.Strip(),                               # Trim
    normalizers.Lowercase()                            # Lowercase
])

social_texts = [
    "Hey @john check out https://example.com #awesome",
    "RT @user: This    is   cool! #AI #ML",
    "  Follow me @alice for more #content  "
]

print("Social Media Text Normalization:")
print("=" * 60)
for text in social_texts:
    normalized = social_media_normalizer.normalize_str(text)
    print(f"Original:   {text}")
    print(f"Normalized: {normalized}\n")

Part 2: Pre-tokenization

Pre-tokenization splits text into words or subword units before the model processes it.

2.1 Whitespace Pre-tokenizer

What: Split text on whitespace boundaries using pre_tokenizers.Whitespace(), which also separates punctuation from adjacent words.

Why: Pre-tokenization defines the initial word boundaries that the subword model (BPE, WordPiece, Unigram) will then further split. The Whitespace pre-tokenizer handles the common case where words are space-separated and punctuation should be its own token. Crucially, it returns (token, offset) tuples so that every pre-token can be traced back to its exact position in the original string.

whitespace = pre_tokenizers.Whitespace()

text = "Hello world! How are you?"
pre_tokenized = whitespace.pre_tokenize_str(text)

print(f"Text: {text}")
print(f"\nPre-tokenized (word, offset):")
for word, offset in pre_tokenized:
    print(f"  '{word}' at {offset}")
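The Whitespace pre-tokenizer behaves roughly like the regex \w+|[^\w\s]+ applied with offsets. A stdlib approximation (a sketch, not the library's exact implementation):

```python
import re

def whitespace_pretokenize(text: str):
    # Runs of word characters, or runs of non-word/non-space characters,
    # each paired with its (start, end) offsets in the original string
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\w+|[^\w\s]+", text)]

print(whitespace_pretokenize("Hello world! How are you?"))
# [('Hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12)), ...]
```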

2.2 WhitespaceSplit vs Whitespace

Whitespace: splits on whitespace and at word boundaries, so punctuation becomes its own token.
WhitespaceSplit: splits only on whitespace; punctuation stays attached to words.

text = "Hello, world! How are you?"

whitespace = pre_tokenizers.Whitespace()
whitespace_split = pre_tokenizers.WhitespaceSplit()

print(f"Text: {text}\n")

print("Whitespace (punctuation split into separate tokens):")
for word, offset in whitespace.pre_tokenize_str(text):
    print(f"  '{word}'")

print("\nWhitespaceSplit (only splits on spaces):")
for word, offset in whitespace_split.pre_tokenize_str(text):
    print(f"  '{word}'")

2.3 ByteLevel Pre-tokenizer (GPT-2 style)

Maps every character to one or more bytes (each shown as a printable stand-in character) and splits on whitespace boundaries. Because all 256 byte values are representable, any Unicode input can be handled without unknown tokens.

byte_level = pre_tokenizers.ByteLevel(add_prefix_space=True)

examples = [
    "Hello world!",
    "café résumé",
    "日本語テキスト"  # Japanese text
]

print("ByteLevel Pre-tokenization:")
print("=" * 60)
for text in examples:
    pre_tokenized = byte_level.pre_tokenize_str(text)
    print(f"\nText: {text}")
    print("Tokens:")
    for word, offset in pre_tokenized:
        print(f"  '{word}'")
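The odd-looking characters in byte-level tokens (Ġ for a leading space, Ã© for é) come from a reversible byte-to-character table: every byte maps to a printable stand-in. A condensed sketch of that table, adapted from the widely circulated GPT-2 reference code:

```python
def bytes_to_unicode():
    # Printable Latin-1 bytes map to themselves; the remaining bytes are
    # shifted to code points above U+0100 so every byte has a visible form
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
print(table[ord(" ")])                                 # Ġ
print("".join(table[b] for b in "é".encode("utf-8")))  # Ã©
```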

2.4 Punctuation Pre-tokenizer

Isolates punctuation from words.

punctuation = pre_tokenizers.Punctuation(behavior="isolated")

examples = [
    "Hello, world!",
    "What's that?",
    "Cost: $99.99"
]

print("Punctuation Pre-tokenization (isolated):")
print("=" * 60)
for text in examples:
    pre_tokenized = punctuation.pre_tokenize_str(text)
    print(f"\nText: {text}")
    print("Tokens: ", end="")
    print([word for word, _ in pre_tokenized])

2.5 Digits Pre-tokenizer

What: Isolate individual digits from surrounding text using pre_tokenizers.Digits(individual_digits=True).

Why: Numbers behave differently from words in NLP. With individual_digits=True, the number “2024” becomes four separate tokens [“2”, “0”, “2”, “4”] rather than one. This is the approach LLaMA and many modern LLMs use because it keeps the vocabulary small (only 10 digit tokens needed) and gives the model a consistent way to handle numbers of any length. Without digit splitting, every distinct number would need its own vocabulary entry.

digits = pre_tokenizers.Digits(individual_digits=True)

examples = [
    "Room 101",
    "Price: $1234",
    "Year 2024"
]

print("Digits Pre-tokenization:")
print("=" * 60)
for text in examples:
    pre_tokenized = digits.pre_tokenize_str(text)
    print(f"\nText: {text}")
    print("Tokens: ", end="")
    print([word for word, _ in pre_tokenized])
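Splitting out individual digits can be approximated in the stdlib with a capturing re.split (a sketch of the idea; the library additionally tracks offsets):

```python
import re

def split_individual_digits(text: str):
    # re.split with a capturing group keeps each digit as its own piece
    return [piece for piece in re.split(r"(\d)", text) if piece]

print(split_individual_digits("Year 2024"))  # ['Year ', '2', '0', '2', '4']
print(split_individual_digits("Room 101"))   # ['Room ', '1', '0', '1']
```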

2.6 Combining Pre-tokenizers (Sequence)

What: Chain WhitespaceSplit and Punctuation pre-tokenizers into a single pipeline using pre_tokenizers.Sequence().

Why: Just like normalizers, pre-tokenizers can be composed. A BERT-style pipeline first splits on whitespace (keeping punctuation attached) and then isolates punctuation into separate tokens. The sequence combinator lets you build this multi-step logic declaratively. Each pre-tokenizer in the chain receives the output of the previous one, progressively refining the token boundaries.

# BERT-style: Whitespace + Punctuation
bert_pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(behavior="isolated")
])

text = "Hello, world! How are you?"
pre_tokenized = bert_pre_tokenizer.pre_tokenize_str(text)

print(f"Text: {text}\n")
print("BERT-style Pre-tokenization (Whitespace + Punctuation):")
for word, offset in pre_tokenized:
    print(f"  '{word}' at {offset}")

2.7 CharDelimiterSplit (Split on Custom Character)

What: Split text on any single character delimiter using pre_tokenizers.CharDelimiterSplit().

Why: Some data formats use non-standard delimiters – pipe-separated values, tab-delimited logs, or slash-separated paths. This pre-tokenizer lets you define a custom split character, which is useful for domain-specific tokenization where whitespace splitting would be incorrect (e.g., splitting file paths on / or CSV fields on |).

# Split on pipes
pipe_splitter = pre_tokenizers.CharDelimiterSplit('|')

text = "apple|banana|cherry"
pre_tokenized = pipe_splitter.pre_tokenize_str(text)

print(f"Text: {text}\n")
print("Split on '|':")
for word, offset in pre_tokenized:
    print(f"  '{word}'")

Part 3: Post-processing

Post-processing adds special tokens (like [CLS], [SEP]) after tokenization.

3.1 Template Processing (BERT-style)

What: Use TemplateProcessing to automatically wrap encoded sequences with special tokens like [CLS] and [SEP].

Why: BERT expects every input to start with [CLS] and end with [SEP]. For sentence pairs, the format is [CLS] A [SEP] B [SEP] with type ID 1 assigned to the second sentence’s tokens. Rather than manually inserting these tokens, TemplateProcessing handles it declaratively using a template string. The post-processor runs after the subword model, so it operates on the final token sequence.

How: The template syntax uses $A and $B as placeholders for the first and second input sequences. The :1 suffix (e.g., $B:1) assigns type ID 1 to those tokens, which the model uses to distinguish between the two segments.

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing

# Create simple tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Add BERT-style post-processing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# Train on simple data
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train_from_iterator(["hello world", "how are you"] * 50, trainer)

# Test single sequence
output = tokenizer.encode("hello world")
print("Single sequence with [CLS] and [SEP]:")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")

# Test pair
output = tokenizer.encode("hello", "world")
print("\nSentence pair with [CLS] and [SEP]:")
print(f"Tokens: {output.tokens}")
print(f"Type IDs: {output.type_ids}")
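The template logic itself is straightforward to model: walk the template, substitute the encoded sequences for $A/$B, and carry a type ID per slot. A toy sketch of the idea (not the library's implementation):

```python
def apply_template(template, a, b=None):
    tokens, type_ids = [], []
    for piece in template.split():
        name, _, tid = piece.partition(":")  # "$B:1" -> ("$B", "1")
        tid = int(tid) if tid else 0
        seq = {"$A": a, "$B": b}.get(name)
        if seq is not None:
            tokens += seq                    # splice in a whole sequence
            type_ids += [tid] * len(seq)
        else:
            tokens.append(name)              # a literal special token
            type_ids.append(tid)
    return tokens, type_ids

toks, tids = apply_template("[CLS] $A [SEP] $B:1 [SEP]:1", ["hello"], ["world"])
print(toks)  # ['[CLS]', 'hello', '[SEP]', 'world', '[SEP]']
print(tids)  # [0, 0, 0, 1, 1]
```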

Part 4: Decoders

Decoders convert token IDs back to readable text.

4.1 ByteLevel Decoder (GPT-2 Style)

What: Decode byte-level BPE tokens back into readable Unicode text using decoders.ByteLevel().

Why: The ByteLevel pre-tokenizer encodes every character as one or more bytes (e.g., the euro sign becomes three bytes). The decoder reverses this transformation, reassembling bytes into the original Unicode characters. Without the correct decoder, the output of tokenizer.decode() would be a garbled sequence of byte-representation characters like Ġ instead of spaces.

# Create tokenizer with ByteLevel
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# Train
trainer = trainers.BpeTrainer(
    vocab_size=300,
    special_tokens=["<PAD>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()  # cover all 256 bytes so encode/decode round-trips
)
tokenizer.train_from_iterator(["hello world", "machine learning"] * 50, trainer)

# Encode and decode
text = "Hello, world!"
output = tokenizer.encode(text)
decoded = tokenizer.decode(output.ids)

print(f"Original: {text}")
print(f"Tokens:   {output.tokens}")
print(f"Decoded:  {decoded}")

4.2 WordPiece Decoder (BERT style)

Removes ## prefixes and joins subwords.

# Simulate WordPiece tokens
tokens = ["running", "##ly", "fast", "##er"]

# Manual demonstration
print("WordPiece tokens: ", tokens)
print("\nDecoding process:")
decoded_words = []
current_word = ""

for token in tokens:
    if token.startswith("##"):
        current_word += token[2:]  # Remove ##
    else:
        if current_word:
            decoded_words.append(current_word)
        current_word = token

if current_word:
    decoded_words.append(current_word)

print(f"Decoded: {' '.join(decoded_words)}")
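The decoders.WordPiece decoder performs essentially this join; the whole loop above collapses to a single expression:

```python
tokens = ["running", "##ly", "fast", "##er"]
decoded = " ".join(tokens).replace(" ##", "")
print(decoded)  # runningly faster
```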

Part 5: Complete Pipeline Examples

5.1 BERT-style Tokenizer (Complete Pipeline)

What: Assemble a full BERT-style tokenizer from scratch by composing all four pipeline stages: normalization (NFD + lowercase + strip accents), pre-tokenization (whitespace + punctuation), post-processing ([CLS]/[SEP] template), and decoding (WordPiece with ## prefix).

Why: Seeing the complete pipeline in one place shows how the individual components you explored above work together. Each stage transforms the data for the next: the normalizer standardizes the text, the pre-tokenizer splits it into initial words, the WordPiece model breaks words into subwords, the post-processor adds special tokens, and the decoder reverses the process for human-readable output. This is exactly the architecture behind bert-base-uncased.

# Create BERT-style tokenizer from scratch
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# 1. Normalization
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents()
])

# 2. Pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(behavior="isolated")
])

# 3. Post-processing
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)]
)

# 4. Decoder
tokenizer.decoder = decoders.WordPiece(prefix="##")

# Train
trainer = trainers.WordPieceTrainer(
    vocab_size=500,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

training_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is awesome!",
    "Natural language processing is fascinating."
] * 100

tokenizer.train_from_iterator(training_data, trainer)

# Test
text = "HELLO, World! Machine Learning."
output = tokenizer.encode(text)

print("BERT-style Complete Pipeline:")
print("=" * 60)
print(f"Original:   {text}")
print(f"Normalized: {tokenizer.normalizer.normalize_str(text)}")
print(f"Tokens:     {output.tokens}")
print(f"Decoded:    {tokenizer.decode(output.ids)}")

5.2 GPT-2 Style Tokenizer (Complete Pipeline)

What: Assemble a GPT-2 style tokenizer: no normalization (case-sensitive), ByteLevel pre-tokenization, BPE model, and ByteLevel decoder.

Why: GPT-2’s design philosophy differs fundamentally from BERT’s. It preserves case and whitespace exactly as-is (no normalization), uses byte-level encoding to guarantee complete Unicode coverage with zero unknown tokens, and does not add special tokens like [CLS]. Comparing this pipeline side-by-side with the BERT pipeline above highlights how architectural decisions in tokenization directly reflect the model’s design goals – BERT optimizes for understanding (classification, NER), while GPT-2 optimizes for generation (the output must be valid, readable text).

# Create GPT-2 style tokenizer
tokenizer = Tokenizer(BPE())

# 1. No normalizer (GPT-2 is case-sensitive; a fresh Tokenizer has none by default)

# 2. ByteLevel pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

# 3. ByteLevel decoder
tokenizer.decoder = decoders.ByteLevel()

# Train
trainer = trainers.BpeTrainer(
    vocab_size=500,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)

tokenizer.train_from_iterator(training_data, trainer)

# Test
text = "Hello, World! 🌍"
output = tokenizer.encode(text)

print("GPT-2 style Complete Pipeline:")
print("=" * 60)
print(f"Original: {text}")
print(f"Tokens:   {output.tokens}")
print(f"Decoded:  {tokenizer.decode(output.ids)}")

Summary

Pipeline Components

1. Normalization

  • NFD/NFKC: Unicode normalization

  • Lowercase: Convert to lowercase

  • StripAccents: Remove accents

  • Replace: Pattern replacement

  • Strip: Remove whitespace

  • Sequence: Chain normalizers

2. Pre-tokenization

  • Whitespace: Split on spaces (keeps punctuation)

  • WhitespaceSplit: Only split on spaces

  • ByteLevel: Byte-level splitting (GPT-2)

  • Punctuation: Isolate punctuation

  • Digits: Isolate digits

  • CharDelimiterSplit: Custom delimiter

  • Sequence: Chain pre-tokenizers

3. Post-processing

  • TemplateProcessing: Add special tokens ([CLS], [SEP])

  • Configure for single or paired sequences

  • Set type IDs for sentence pairs

4. Decoding

  • ByteLevel: Decode byte-level tokens

  • WordPiece: Join subwords (remove ##)

  • BPE: Merge BPE tokens

Common Patterns

BERT: NFD + Lowercase + StripAccents → WhitespaceSplit + Punctuation → WordPiece → [CLS]/[SEP]

GPT-2: No normalization → ByteLevel → BPE → ByteLevel decoder

RoBERTa: Similar to GPT-2 but with different special tokens

Next Steps

  • Experiment with different combinations

  • Build custom pipelines for your domain

  • Compare performance of different normalizers

  • Understand trade-offs (speed vs accuracy)