SentencePiece Tokenization Examples¶
SentencePiece is a language-agnostic tokenizer that requires no pre-tokenization. It is used by LLaMA, T5, ALBERT, XLM-R, and many other production LLMs.
Install: pip install sentencepiece
Setup¶
Import the sentencepiece module, Google’s language-agnostic tokenizer library written in C++. Unlike HuggingFace tokenizers, SentencePiece treats the input as a raw Unicode stream with no language-specific pre-tokenization – it learns word boundaries directly from the data, making it ideal for multilingual models.
import sentencepiece as spm
import os
Example 1: Train a SentencePiece BPE Model¶
What: Write sample training data to a file and train a SentencePiece BPE tokenizer using SentencePieceTrainer.train().
Why: SentencePiece BPE is the algorithm behind many production LLMs (LLaMA, Mistral, Falcon). Unlike HuggingFace’s BPE, SentencePiece operates directly on raw Unicode – it does not require a separate pre-tokenization step. The character_coverage=1.0 parameter ensures every character in the training data gets represented, and the special token IDs (pad_id, unk_id, bos_id, eos_id) are explicitly assigned so they match the model’s expectations.
Connection: When you download a LLaMA model, the .model file in the repository is a SentencePiece model trained with settings very similar to these.
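The merge loop at the heart of BPE can be sketched in plain Python for intuition. This is a toy illustration only: `bpe_train` is a made-up helper, and real SentencePiece BPE operates on the raw character stream (with ▁ word-boundary markers) in optimized C++.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
    `words` maps a word (tuple of symbols) to its corpus frequency."""
    words = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges, words

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges, segmented = bpe_train(corpus, num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how the learned merges are ordered by frequency: that ordering is exactly what a trained BPE model replays at encoding time.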
def example_1_train_bpe():
    """Train SentencePiece BPE tokenizer"""
    print("\n" + "="*60)
    print("Example 1: Train SentencePiece BPE")
    print("="*60)

    # Create sample training data
    os.makedirs("data", exist_ok=True)
    with open("data/sample.txt", "w", encoding="utf-8") as f:
        f.write("""
Machine learning is the study of computer algorithms.
Deep learning uses neural networks with many layers.
Natural language processing helps computers understand human language.
Tokenization is the process of breaking text into tokens.
SentencePiece is a language-agnostic tokenizer.
""" * 100)

    # Train BPE model
    print("\nTraining SentencePiece BPE model...")
    spm.SentencePieceTrainer.train(
        input='data/sample.txt',
        model_prefix='models/sentencepiece_bpe',
        vocab_size=1000,
        model_type='bpe',
        character_coverage=1.0,  # Cover all characters
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        hard_vocab_limit=False,  # Soft limit: the tiny demo corpus can't fill 1000 merges
    )

    # Load and test
    sp = spm.SentencePieceProcessor()
    sp.load('models/sentencepiece_bpe.model')
    test_text = "Machine learning tokenization"
    pieces = sp.encode_as_pieces(test_text)
    ids = sp.encode_as_ids(test_text)
    print(f"\nTest: '{test_text}'")
    print(f"Pieces: {pieces}")
    print(f"IDs: {ids}")
    print(f"Decoded: {sp.decode_pieces(pieces)}")
Example 2: Train a SentencePiece Unigram Model¶
What: Train a Unigram language-model tokenizer with SentencePiece, then explore its unique ability to provide n-best segmentations.
Why: The Unigram algorithm assigns a probability to every possible segmentation of a string, so it can rank multiple alternatives. nbest_encode_as_pieces(text, nbest_size=3) returns the top-3 most likely segmentations, which is useful for subword regularization – a training technique where you randomly sample different segmentations of the same text to make the model more robust to tokenization noise.
How: The Unigram model maximizes \(P(\text{corpus}) = \prod_s \sum_{\mathbf{x} \in S(s)} P(\mathbf{x})\) where \(S(s)\) is the set of all valid segmentations of sentence \(s\). In practice the Viterbi algorithm finds the single best segmentation, while n-best decoding returns the top \(k\) alternatives.
Connection: T5 and ALBERT use SentencePiece Unigram. The n-best capability is what makes subword regularization possible – a technique shown to improve translation quality by 1-2 BLEU points.
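The Viterbi search over segmentations can be sketched with hypothetical piece log-probabilities. In a real model these values are learned with EM during training; the numbers and the `viterbi_segment` helper below are illustrative only.

```python
import math

# Hypothetical piece log-probabilities -- illustrative values, not from a real model.
logp = {
    "natural": -9.0, "natur": -4.5, "al": -3.2, "at": -3.0, "ur": -3.5,
    "n": -4.0, "a": -2.5, "t": -2.8, "u": -3.0, "r": -3.1, "l": -3.3,
}

def viterbi_segment(text, logp, max_piece_len=8):
    """Find the single best segmentation: maximize the sum of piece log-probs."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per prefix length
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    pieces, pos = [], n  # backtrack from the end
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1], best[n][0]

pieces, score = viterbi_segment("natural", logp)
print(pieces, score)  # → ['natur', 'al'] -7.7
```

Here the split "natur" + "al" (-7.7) beats the whole-word piece "natural" (-9.0); the n-best variant simply keeps the top k paths instead of only the best one.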
def example_2_train_unigram():
    """Train SentencePiece Unigram tokenizer"""
    print("\n" + "="*60)
    print("Example 2: Train SentencePiece Unigram")
    print("="*60)

    # Train Unigram model (better for multilingual)
    print("\nTraining SentencePiece Unigram model...")
    spm.SentencePieceTrainer.train(
        input='data/sample.txt',
        model_prefix='models/sentencepiece_unigram',
        vocab_size=1000,
        model_type='unigram',  # Probabilistic model
        character_coverage=1.0,
        hard_vocab_limit=False,  # Soft limit: the tiny demo corpus can't fill 1000 pieces
    )

    # Load and test
    sp = spm.SentencePieceProcessor()
    sp.load('models/sentencepiece_unigram.model')
    test_text = "Natural language processing"

    # Get multiple segmentation options
    pieces = sp.encode_as_pieces(test_text)
    ids = sp.encode_as_ids(test_text)
    print(f"\nTest: '{test_text}'")
    print(f"Pieces: {pieces}")
    print(f"IDs: {ids}")

    # N-best segmentation (Unigram can provide alternatives)
    nbest = sp.nbest_encode_as_pieces(test_text, nbest_size=3)
    print("\nTop 3 segmentations:")
    for i, segmentation in enumerate(nbest, 1):
        print(f"  {i}. {segmentation}")
Example 3: Multilingual SentencePiece¶
What: Train a single SentencePiece model on text from seven languages (English, Chinese, Japanese, French, Spanish, Russian, Arabic) and test how it handles each.
Why: SentencePiece’s language-agnostic design makes it the go-to choice for multilingual models. Because it works on the raw Unicode character stream without assuming that spaces separate words, it handles logographic scripts (Chinese, Japanese) just as naturally as alphabetic ones. The character_coverage=0.9995 setting covers 99.95% of the characters in the training data – slightly below 100% to avoid allocating vocabulary slots to extremely rare characters.
Connection: Models like XLM-R, mBART, and multilingual T5 all use SentencePiece trained on 100+ languages. The tokenizer’s ability to share subword units across languages (e.g., Latin-script prefixes) enables effective cross-lingual transfer learning.
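What character_coverage does can be approximated in a few lines: sort characters by frequency and keep the smallest set that covers the requested fraction of occurrences. A rough sketch on made-up data (`coverage_vocab` is an illustrative helper, not a SentencePiece API):

```python
from collections import Counter

def coverage_vocab(corpus, coverage=0.9995):
    """Smallest set of characters (most frequent first) that covers `coverage`
    of all character occurrences -- a sketch of what character_coverage does."""
    counts = Counter(corpus)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, c in counts.most_common():
        kept.add(ch)
        covered += c
        if covered / total >= coverage:
            break
    return kept

# One rare character in an otherwise repetitive corpus (hypothetical data).
corpus = "hello world " * 1000 + "☃"
vocab = coverage_vocab(corpus)
print("☃" in vocab)  # → False: the rare character is left out and would map to <unk>
```

Characters outside the covered set are exactly the ones that end up as `<unk>` (or, with byte fallback enabled, as raw byte tokens).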
def example_3_multilingual():
    """Train multilingual SentencePiece"""
    print("\n" + "="*60)
    print("Example 3: Multilingual SentencePiece")
    print("="*60)

    # Create multilingual data
    with open("data/multilingual.txt", "w", encoding="utf-8") as f:
        f.write("""
Hello, how are you?
你好吗?
こんにちは
Bonjour, comment allez-vous?
Hola, ¿cómo estás?
Привет, как дела?
مرحبا، كيف حالك؟
""" * 50)

    # Train with language-agnostic settings
    print("\nTraining multilingual SentencePiece...")
    spm.SentencePieceTrainer.train(
        input='data/multilingual.txt',
        model_prefix='models/sentencepiece_multilingual',
        vocab_size=2000,
        model_type='unigram',
        character_coverage=0.9995,  # Cover 99.95% of characters
        normalization_rule_name='nmt_nfkc',  # Normalize text
        hard_vocab_limit=False,  # Soft limit: the tiny demo corpus can't fill 2000 pieces
    )

    # Load and test on different languages
    sp = spm.SentencePieceProcessor()
    sp.load('models/sentencepiece_multilingual.model')
    test_texts = {
        'English': 'Hello, how are you?',
        'Chinese': '你好吗?',
        'Japanese': 'こんにちは',
        'French': 'Bonjour!',
        'Spanish': '¿Cómo estás?',
        'Russian': 'Привет!',
        'Arabic': 'مرحبا',
    }
    print("\nMultilingual tokenization:")
    for lang, text in test_texts.items():
        pieces = sp.encode_as_pieces(text)
        print(f"  {lang:10s}: {text:20s} → {len(pieces):2d} tokens → {pieces}")
Example 4: LLaMA-Style Tokenizer¶
What: Train a SentencePiece BPE tokenizer with the exact settings used by Meta’s LLaMA models: byte_fallback=True, split_digits=True, normalization_rule_name='identity', and a 32K vocabulary.
Why: LLaMA’s tokenizer design embodies several deliberate choices. Byte fallback means any character not in the vocabulary is decomposed into raw UTF-8 bytes, ensuring zero <unk> tokens – the model can encode literally anything. Digit splitting breaks numbers into individual digits for consistent handling. Identity normalization (no normalization at all) preserves the input exactly, which is critical for code generation and case-sensitive tasks.
Connection: These settings have become the de facto standard for open-source LLMs. Mistral, Qwen, and DeepSeek all use similar configurations. Understanding them prepares you to evaluate and customize tokenizers for your own models.
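Byte fallback is easy to demystify: an out-of-vocabulary character is emitted as one token per UTF-8 byte, written like <0xE2>. A minimal sketch of the round trip (the helper names are illustrative, not SentencePiece APIs):

```python
def byte_fallback(ch):
    """Decompose a character into SentencePiece-style byte tokens such as '<0xE2>'."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

def byte_restore(tokens):
    """Reassemble byte tokens into the original text."""
    return bytes(int(t[3:5], 16) for t in tokens).decode("utf-8")

tokens = byte_fallback("☃")  # U+2603 SNOWMAN is three bytes in UTF-8
print(tokens)                # → ['<0xE2>', '<0x98>', '<0x83>']
print(byte_restore(tokens))  # → ☃
```

Because every possible byte has a token, the tokenizer can never produce <unk> – at worst a rare character costs a few byte tokens.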
def example_4_llama_style():
    """Train LLaMA-style SentencePiece tokenizer"""
    print("\n" + "="*60)
    print("Example 4: LLaMA-Style Tokenizer")
    print("="*60)

    # LLaMA uses BPE with specific settings
    print("\nTraining LLaMA-style tokenizer...")
    spm.SentencePieceTrainer.train(
        input='data/multilingual.txt',
        model_prefix='models/llama_style',
        vocab_size=32000,  # LLaMA uses 32K
        model_type='bpe',
        character_coverage=0.9995,
        byte_fallback=True,  # Decompose unknown characters into UTF-8 bytes
        split_digits=True,  # Split numbers digit-by-digit
        normalization_rule_name='identity',  # No normalization
        num_threads=16,
        hard_vocab_limit=False,  # The demo corpus is far too small for a real 32K vocab
    )

    sp = spm.SentencePieceProcessor()
    sp.load('models/llama_style.model')
    print(f"\nVocabulary size: {sp.vocab_size()}")
    print(f"BOS ID: {sp.bos_id()}")
    print(f"EOS ID: {sp.eos_id()}")
    print(f"PAD ID: {sp.pad_id()}")
    print(f"UNK ID: {sp.unk_id()}")

    # Test on code (LLaMA handles code well)
    code = "def hello(): print('Hello, world!')"
    pieces = sp.encode_as_pieces(code)
    print(f"\nCode: {code}")
    print(f"Tokens: {pieces}")
Example 5: Compare SentencePiece with HuggingFace Tokenizers¶
What: Tokenize the same text with both SentencePiece and a HuggingFace GPT-2 tokenizer, then compare the token counts and segmentation boundaries.
Why: SentencePiece and HuggingFace tokenizers can both implement BPE, but they differ in pre-tokenization and byte handling. SentencePiece treats text as a raw character stream (with ▁ marking word boundaries), while GPT-2’s HuggingFace tokenizer uses byte-level pre-tokenization (with Ġ for spaces). These differences mean the same BPE algorithm produces different segmentations depending on the framework. Comparing them side-by-side builds intuition for why two BPE tokenizers can disagree on token boundaries.
Connection: When migrating a model from one framework to another (e.g., converting LLaMA weights for use with HuggingFace), ensuring tokenizer compatibility is one of the trickiest steps.
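The two space-marking conventions can be compared with a simplified round trip. This is a sketch: real GPT-2 decoding goes through a byte-to-unicode mapping rather than a plain string replace, and the segmentations below are hypothetical.

```python
def sp_to_text(pieces):
    """Detokenize SentencePiece-style pieces: ▁ (U+2581) stands in for a space."""
    return "".join(pieces).replace("▁", " ").lstrip()

def gpt2_to_text(tokens):
    """Detokenize GPT-2-style tokens: Ġ (U+0120) marks a token preceded by a space."""
    return "".join(tokens).replace("Ġ", " ").lstrip()

# The same text segmented under each convention (hypothetical segmentations).
sp_pieces = ["▁Machine", "▁learn", "ing"]   # SentencePiece: ▁ even on the first word
gpt2_tokens = ["Machine", "Ġlearn", "ing"]  # GPT-2: no marker on the first token
print(sp_to_text(sp_pieces))      # → Machine learning
print(gpt2_to_text(gpt2_tokens))  # → Machine learning
```

Both conventions recover the original spacing losslessly; they just disagree about where the marker goes, which is one reason token boundaries differ between the two frameworks.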
def example_5_compare_with_huggingface():
    """Compare SentencePiece with HuggingFace Tokenizers"""
    print("\n" + "="*60)
    print("Example 5: SentencePiece vs HuggingFace Tokenizers")
    print("="*60)

    # Load SentencePiece
    sp = spm.SentencePieceProcessor()
    sp.load('models/sentencepiece_bpe.model')

    # Load HuggingFace tokenizer (if available)
    try:
        from transformers import GPT2Tokenizer
        hf_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        test_text = "Machine learning is fascinating!"

        # SentencePiece
        sp_pieces = sp.encode_as_pieces(test_text)
        sp_ids = sp.encode_as_ids(test_text)

        # HuggingFace
        hf_tokens = hf_tokenizer.tokenize(test_text)
        hf_ids = hf_tokenizer.encode(test_text)

        print(f"\nTest: '{test_text}'")
        print("\nSentencePiece:")
        print(f"  Tokens: {sp_pieces}")
        print(f"  Count: {len(sp_ids)}")
        print("\nHuggingFace (GPT-2):")
        print(f"  Tokens: {hf_tokens}")
        print(f"  Count: {len(hf_ids)}")
        print("\nKey Differences:")
        print("  - SentencePiece: raw character stream, ▁ marks word boundaries")
        print("  - GPT-2: byte-level pre-tokenization, Ġ marks spaces")
    except ImportError:
        print("Install transformers to compare: pip install transformers")
Example 6: Advanced SentencePiece Features¶
What: Explore advanced capabilities: subword sampling (alpha-based regularization), vocabulary inspection, score retrieval, and multiple decode methods.
Why: Subword sampling (controlled by the smoothing parameter alpha) randomly perturbs the segmentation during encoding, which acts as a form of data augmentation. When used during model training, it forces the model to be robust to different tokenizations of the same text, improving generalization. Token scores are the log-probabilities of each piece under the Unigram model and can be used for debugging: low-score pieces are rare and may indicate vocabulary gaps.
Connection: Subword regularization was introduced in the SentencePiece paper (Kudo, 2018) and has since been adopted in machine translation, speech recognition, and multilingual NLP to improve robustness.
def example_6_advanced_features():
    """Advanced SentencePiece features"""
    print("\n" + "="*60)
    print("Example 6: Advanced SentencePiece Features")
    print("="*60)

    sp = spm.SentencePieceProcessor()
    sp.load('models/sentencepiece_unigram.model')
    test_text = "Natural language processing"

    # 1. Regular encoding
    print("\n1. Regular Encoding:")
    pieces = sp.encode_as_pieces(test_text)
    print(f"   {pieces}")

    # 2. With sampling (for data augmentation)
    print("\n2. Sampling (adds randomness for augmentation):")
    for i in range(3):
        # nbest_size=-1 samples from all segmentations; alpha is the smoothing parameter
        sampled = sp.sample_encode_as_pieces(test_text, -1, 0.1)
        print(f"   Sample {i+1}: {sampled}")

    # 3. Get vocabulary
    print("\n3. Vocabulary info:")
    print(f"   Vocab size: {sp.vocab_size()}")
    print("   First 10 tokens:")
    for i in range(10):
        print(f"     {i}: {sp.id_to_piece(i)}")

    # 4. Decode with control
    ids = sp.encode_as_ids(test_text)
    print("\n4. Decoding:")
    print(f"   IDs: {ids}")
    print(f"   Decoded: {sp.decode_ids(ids)}")
    print(f"   Decoded (pieces): {sp.decode_pieces(pieces)}")

    # 5. Get scores
    print("\n5. Token scores (log-probabilities):")
    for piece in pieces[:5]:
        score = sp.get_score(sp.piece_to_id(piece))
        print(f"   {piece}: {score:.4f}")
Example 7: Integration With HuggingFace Transformers¶
What: Load SentencePiece-based tokenizers through the HuggingFace Transformers library using T5Tokenizer and LlamaTokenizer.
Why: In practice, you rarely interact with SentencePiece directly. The HuggingFace Transformers library wraps SentencePiece models behind the familiar AutoTokenizer interface, so you get the same .encode(), .decode(), and .batch_encode_plus() API regardless of whether the underlying algorithm is BPE, WordPiece, or Unigram. Understanding that T5 and LLaMA use SentencePiece under the hood helps you debug tokenization issues (like the ▁ space character) when working with these models.
Connection: The ▁ (Unicode U+2581) character you see in T5 token outputs is the SentencePiece word-boundary marker. It replaces whitespace so that the model can reconstruct the original spacing during decoding.
def example_7_integration_with_transformers():
    """Use SentencePiece with Transformers library"""
    print("\n" + "="*60)
    print("Example 7: SentencePiece + Transformers")
    print("="*60)

    try:
        from transformers import T5Tokenizer

        # T5 uses SentencePiece Unigram
        print("\n1. T5 Tokenizer (SentencePiece Unigram):")
        t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
        text = "translate English to French: Hello, how are you?"
        tokens = t5_tokenizer.tokenize(text)
        print(f"   Text: {text}")
        print(f"   Tokens: {tokens}")
        print("   Note: ▁ represents spaces in SentencePiece")

        # LLaMA uses SentencePiece BPE (gated model, so just show the call)
        print("\n2. LLaMA Tokenizer (SentencePiece BPE):")
        print("   LLaMA tokenizer requires authentication")
        print("   Example: tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')")
        print("   Both T5 and LLaMA use SentencePiece under the hood")
    except ImportError:
        print("Install transformers: pip install transformers")
Main – Run All Examples¶
Execute all seven SentencePiece examples in sequence. The trained models are saved to the ./models/ directory as .model and .vocab files. The .model file is a binary protobuf that can be loaded with SentencePieceProcessor.load().
def main():
    """Run all examples"""
    print("\n" + "="*70)
    print("SENTENCEPIECE TOKENIZATION EXAMPLES")
    print("="*70)

    # Create directories
    os.makedirs("models", exist_ok=True)
    os.makedirs("data", exist_ok=True)

    # Run examples
    example_1_train_bpe()
    example_2_train_unigram()
    example_3_multilingual()
    example_4_llama_style()
    example_5_compare_with_huggingface()
    example_6_advanced_features()
    example_7_integration_with_transformers()

    print("\n" + "="*70)
    print("KEY TAKEAWAYS")
    print("="*70)
    print("""
1. SentencePiece is language-agnostic (no pre-tokenization needed)
2. Supports both BPE and Unigram algorithms
3. Used by production models: LLaMA, T5, ALBERT, XLM-R
4. Treats text as raw Unicode characters
5. Excellent for multilingual applications
6. Integrates with HuggingFace Transformers
7. Fast C++ implementation

When to use SentencePiece:
✅ Multilingual applications
✅ Training LLMs from scratch
✅ Language-agnostic systems
✅ When you need both BPE and Unigram options

Install: pip install sentencepiece
""")


if __name__ == "__main__":
    main()