HuggingFace Embeddings¶
HuggingFace Transformers Embeddings Guide
This module shows how to generate embeddings with the HuggingFace Transformers library, covering BERT, RoBERTa, DistilBERT, and other transformer models.
Install: pip install transformers torch sentence-transformers
Setup¶
Import PyTorch, HuggingFace Transformers (AutoTokenizer, AutoModel), NumPy, and SciPy’s cosine distance function. This notebook works directly with transformer models at the lowest level – tokenizing text, running it through the model, and extracting embeddings from hidden states – giving you full control over every step of the embedding pipeline.
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
Example 1: Basic BERT Embeddings¶
What: Load bert-base-uncased, tokenize input texts, run them through the model, and extract the [CLS] token embedding as the sentence representation.
Why: BERT produces a contextualized embedding for every token in the input. The [CLS] (classification) token is a special token prepended to every input that is designed to aggregate sentence-level information during pre-training. Using it as the sentence embedding is the simplest approach, though not always the best (see Example 2 for mean pooling). The model.eval() call disables dropout so embeddings are deterministic, and torch.no_grad() saves memory by skipping gradient computation.
How: The model’s output last_hidden_state has shape (batch_size, sequence_length, hidden_size). Extracting [:, 0, :] takes the first token (CLS) across all sequences, producing a (batch_size, 768) tensor for BERT-base.
Connection: Understanding how to extract embeddings from raw transformer models is essential when you need custom pooling strategies, when working with models not supported by sentence-transformers, or when building production inference pipelines with ONNX or TensorRT.
def example_1_basic_bert_embeddings():
    """Generate embeddings using BERT"""
    print("\n" + "=" * 70)
    print("Example 1: Basic BERT Embeddings")
    print("=" * 70)

    # Load pre-trained BERT model and tokenizer
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Put model in evaluation mode
    model.eval()

    # Example texts
    texts = [
        "Machine learning is fascinating",
        "AI and machine learning are related",
        "I love pizza and pasta"
    ]

    print("\nGenerating BERT embeddings...")
    print(f"Model: {model_name}")
    print(f"Hidden size: {model.config.hidden_size}")

    embeddings = []
    for text in texts:
        # Tokenize
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)

        # Get the [CLS] token embedding (first token);
        # it represents the entire sentence
        cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.append(cls_embedding[0])

        print(f"\nText: '{text}'")
        print(f"  Shape: {cls_embedding.shape}")
        print(f"  First 5 values: {cls_embedding[0][:5]}")

    # Calculate similarities
    print("\n" + "-" * 70)
    print("Similarity Scores:")
    print("-" * 70)

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = 1 - cosine(embeddings[i], embeddings[j])
            print(f"\n'{texts[i]}'")
            print("  vs")
            print(f"'{texts[j]}'")
            print(f"  Similarity: {similarity:.4f}")
Example 2: Mean Pooling for Better Sentence Embeddings¶
What: Implement mean pooling – averaging all token embeddings (weighted by the attention mask) – and compare it to the CLS token approach.
Why: The CLS token was designed for classification tasks during BERT’s pre-training, but research has shown that mean pooling over all token embeddings often produces better sentence representations for similarity tasks. Mean pooling captures information from every token in the sentence rather than relying on a single aggregation point. The attention mask weighting ensures that padding tokens (which carry no meaning) are excluded from the average.
How: Mean pooling computes \(\mathbf{e}_{\text{sentence}} = \frac{\sum_{i=1}^{n} \mathbf{h}_i \, m_i}{\sum_{i=1}^{n} m_i}\), where \(\mathbf{h}_i\) is the hidden state of token \(i\) and \(m_i\) is its attention mask (1 for real tokens, 0 for padding).
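As a sanity check, the formula can be verified on a toy array, independent of any model (the numbers below are made up for illustration):

```python
import numpy as np

# Toy "hidden states": 1 sequence of 4 tokens, hidden size 3.
# The last token is padding, so its mask entry is 0.
hidden = np.array([[[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0],
                    [99.0, 99.0, 99.0]]])  # padding row, must be ignored
mask = np.array([[1, 1, 1, 0]])

# e_sentence = sum_i(h_i * m_i) / sum_i(m_i)
masked_sum = (hidden * mask[..., None]).sum(axis=1)
sentence_emb = masked_sum / mask.sum(axis=1, keepdims=True)

print(sentence_emb)  # → [[4. 5. 6.]], the mean of the three real tokens
```

The padding row's large values never leak into the result, which is exactly what the attention-mask weighting guarantees.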
Connection: The sentence-transformers library uses mean pooling by default for most of its models. Understanding how it works under the hood helps you implement custom pooling when the library does not support your model.
def example_2_mean_pooling():
    """Use mean pooling instead of CLS token"""
    print("\n" + "=" * 70)
    print("Example 2: Mean Pooling for Better Sentence Embeddings")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def mean_pooling(model_output, attention_mask):
        """
        Mean pooling - average all token embeddings,
        weighted by the attention mask
        """
        token_embeddings = model_output[0]  # First element = token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    texts = [
        "The cat sits on the mat",
        "A feline rests on the carpet",
        "Pizza is my favorite food"
    ]

    print("\nComparing CLS token vs Mean Pooling:")

    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

        with torch.no_grad():
            outputs = model(**inputs)

        # Method 1: CLS token
        cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()[0]

        # Method 2: Mean pooling
        mean_embedding = mean_pooling(outputs, inputs['attention_mask']).numpy()[0]

        print(f"\nText: '{text}'")
        print(f"  CLS embedding norm:  {np.linalg.norm(cls_embedding):.4f}")
        print(f"  Mean embedding norm: {np.linalg.norm(mean_embedding):.4f}")

    print("\n💡 Mean pooling often works better for sentence similarity!")
Example 3: Comparing Different Transformer Models¶
What: Load BERT, DistilBERT, and RoBERTa, then compare the embedding dimensions, norms, and values they produce for the same input text.
Why: Not all transformers are equal for embedding tasks. BERT (12 layers, 110M params) is the baseline. DistilBERT (6 layers, 66M params) is 40% smaller and 60% faster while retaining 97% of BERT’s performance – ideal for latency-sensitive applications. RoBERTa uses the same architecture as BERT but was trained with better hyperparameters and more data, often producing higher-quality representations. Comparing them side by side helps you choose the right model for your accuracy/speed trade-off.
Connection: In production, model selection depends on your constraints: DistilBERT for edge devices, RoBERTa for quality-critical applications, and specialized models like BGE or E5 for state-of-the-art retrieval.
def example_3_different_models():
    """Compare different transformer models"""
    print("\n" + "=" * 70)
    print("Example 3: Comparing Different Transformer Models")
    print("=" * 70)

    models = {
        "BERT Base": "bert-base-uncased",
        "DistilBERT": "distilbert-base-uncased",
        "RoBERTa": "roberta-base",
    }

    text = "Natural language processing with transformers"
    print(f"\nText: '{text}'")
    print("\nModel Comparison:")
    print("-" * 70)

    for model_name, model_id in models.items():
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModel.from_pretrained(model_id)
        model.eval()

        inputs = tokenizer(text, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**inputs)

        embedding = outputs.last_hidden_state[:, 0, :].numpy()[0]

        print(f"\n{model_name}:")
        print(f"  Model ID: {model_id}")
        print(f"  Embedding dim: {len(embedding)}")
        print(f"  Norm: {np.linalg.norm(embedding):.4f}")
        print(f"  First 5 values: {embedding[:5]}")
Example 4: Efficient Batch Processing¶
What: Tokenize and encode multiple texts in a single forward pass through the model, then find the most similar pair.
Why: Processing texts one at a time is wasteful because each forward pass has fixed overhead (model loading, GPU kernel launches). Batching fills the GPU’s parallel processing units, achieving near-linear throughput scaling. The tokenizer’s padding=True and truncation=True flags ensure all sequences in the batch have the same length, which is required for tensor operations. For BERT-base with a batch of 5 texts, the input tensor has shape (5, seq_len) and the output has shape (5, seq_len, 768).
Connection: In production embedding services, batch processing is critical. Systems like NVIDIA Triton Inference Server batch incoming requests to maximize GPU utilization, reducing per-embedding latency from milliseconds to microseconds.
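The effect of padding=True can be sketched without downloading a tokenizer: each sequence is padded to the length of the longest one, and the attention mask records which positions are real. The token IDs below are invented for illustration; only the padding logic matters.

```python
# Invented token-ID sequences of different lengths
batch = [[101, 2023, 102],
         [101, 2023, 2003, 1037, 102]]

max_len = max(len(seq) for seq in batch)
pad_id = 0  # BERT's [PAD] token id

# Pad every sequence to max_len; mask 1 = real token, 0 = padding
input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # every row now has length 5
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

This rectangular layout is what lets the whole batch be stacked into one (batch_size, seq_len) tensor for a single forward pass.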
def example_4_batch_processing():
    """Process multiple texts efficiently in batches"""
    print("\n" + "=" * 70)
    print("Example 4: Efficient Batch Processing")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    texts = [
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Natural language processing handles text",
        "Computer vision processes images",
        "Robotics combines hardware and software"
    ]

    print(f"\nProcessing {len(texts)} texts in a batch...")

    # Batch tokenization
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=128
    )

    print(f"Batch shape: {inputs['input_ids'].shape}")
    print(f"  (batch_size={inputs['input_ids'].shape[0]}, seq_length={inputs['input_ids'].shape[1]})")

    # Generate embeddings for all texts at once
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract CLS token embeddings
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

    print(f"\nEmbeddings shape: {embeddings.shape}")
    print(f"  (num_texts={embeddings.shape[0]}, embedding_dim={embeddings.shape[1]})")

    # Find most similar pair
    print("\nFinding most similar pair:")
    max_similarity = -1
    max_pair = (0, 0)

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = 1 - cosine(embeddings[i], embeddings[j])
            if similarity > max_similarity:
                max_similarity = similarity
                max_pair = (i, j)

    print("\nMost similar texts:")
    print(f"  '{texts[max_pair[0]]}'")
    print(f"  '{texts[max_pair[1]]}'")
    print(f"  Similarity: {max_similarity:.4f}")
Example 5: Sentence Transformers (The Easy Way)¶
What: Use the sentence-transformers library’s high-level API (model.encode()) to generate embeddings in a single line, and compare similarity scores.
Why: While Examples 1-4 showed the manual approach (tokenize, forward pass, pool), the sentence-transformers library encapsulates all of this behind a simple encode() call. The all-MiniLM-L6-v2 model is specifically fine-tuned for sentence similarity using a contrastive learning objective – it was trained on over 1 billion sentence pairs to push similar sentences closer and dissimilar ones apart in the embedding space. This produces dramatically better similarity scores than raw BERT embeddings.
Connection: For most practical applications (semantic search, RAG, clustering), sentence-transformers models are the recommended starting point. You only need the low-level HuggingFace approach when working with unsupported models or custom architectures.
def example_5_sentence_transformers():
    """Use sentence-transformers for optimized embeddings"""
    print("\n" + "=" * 70)
    print("Example 5: Sentence Transformers (Optimized)")
    print("=" * 70)

    from sentence_transformers import SentenceTransformer

    # This model is specifically fine-tuned for sentence embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')

    texts = [
        "A man is eating food",
        "A person is consuming a meal",
        "A dog is playing in the park",
        "The weather is sunny today"
    ]

    print("\nUsing sentence-transformers library:")
    print("Model: all-MiniLM-L6-v2 (optimized for speed & quality)")

    # Generate embeddings (much simpler API!)
    embeddings = model.encode(texts)

    print(f"\nEmbeddings shape: {embeddings.shape}")
    print(f"  Dimension: {embeddings.shape[1]}")

    # Calculate similarity matrix
    print("\nSimilarity Matrix:")
    print("-" * 70)

    for i, text_i in enumerate(texts):
        print(f"\n{i + 1}. '{text_i}'")
        for j, text_j in enumerate(texts):
            if i != j:
                similarity = 1 - cosine(embeddings[i], embeddings[j])
                print(f"   vs {j + 1}: {similarity:.4f}")
Example 6: Multilingual Embeddings¶
What: Load paraphrase-multilingual-MiniLM-L12-v2 and generate embeddings for the same greeting in six languages, then measure cross-language similarity.
Why: Multilingual embedding models map text from different languages into a shared vector space, so “Hello, how are you?” in English and its Chinese translation occupy nearby points. This enables cross-lingual search (query in English, find results in Japanese), multilingual clustering, and zero-shot transfer (train a classifier on English data, apply it to French). The model supports 50+ languages and was trained on parallel sentence pairs from multiple languages.
Connection: Cross-lingual embeddings power multilingual customer support systems, global search engines, and translation quality estimation. They are the foundation of multilingual RAG systems where the knowledge base spans multiple languages.
def example_6_multilingual_embeddings():
    """Generate embeddings for multiple languages"""
    print("\n" + "=" * 70)
    print("Example 6: Multilingual Embeddings")
    print("=" * 70)

    from sentence_transformers import SentenceTransformer

    # Multilingual model
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    texts = {
        'English': 'Hello, how are you?',
        'Spanish': 'Hola, ¿cómo estás?',
        'French': 'Bonjour, comment allez-vous?',
        'German': 'Hallo, wie geht es dir?',
        'Chinese': '你好吗?',
        'Japanese': 'こんにちは、お元気ですか?'
    }

    print("\nMultilingual Model: paraphrase-multilingual-MiniLM-L12-v2")
    print("Supports 50+ languages")

    # Generate embeddings
    text_list = list(texts.values())
    embeddings = model.encode(text_list)

    print(f"\nEmbeddings shape: {embeddings.shape}")

    # Compare every other language against English
    print("\nCross-language Similarity Scores:")
    print("-" * 70)

    english_idx = 0
    for i, lang in enumerate(texts.keys()):
        if i != english_idx:
            similarity = 1 - cosine(embeddings[english_idx], embeddings[i])
            print(f"English <-> {lang:10s}: {similarity:.4f}")

    print("\n💡 All mean 'Hello, how are you?' - high similarity!")
Example 7: Custom Pooling Strategies¶
What: Compare four different pooling strategies – CLS token, mean pooling, max pooling, and last-token pooling – by examining the resulting embedding norms, means, and standard deviations.
Why: The choice of pooling strategy affects embedding quality for different tasks. CLS token works well for classification but poorly for similarity. Mean pooling averages all token representations and is the most robust general-purpose choice. Max pooling takes the element-wise maximum across tokens, capturing the “strongest” activation in each dimension – useful for tasks where specific keywords matter. Last-token pooling is used by some GPT-style models where the final token aggregates context from the entire sequence.
How: Given token embeddings \(\mathbf{H} \in \mathbb{R}^{n \times d}\) (where \(n\) is sequence length and \(d\) is hidden size), the strategies compute: CLS = \(\mathbf{H}[0]\), Mean = \(\frac{1}{n}\sum_i \mathbf{H}[i]\), Max = \(\max_i \mathbf{H}[i]\) (element-wise), Last = \(\mathbf{H}[n-1]\).
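These four formulas can be checked on a tiny matrix with made-up values, keeping the shapes explicit:

```python
import numpy as np

# Toy token embeddings H: n = 4 tokens, hidden size d = 3 (illustrative values)
H = np.array([[1.0, 0.0, 2.0],
              [3.0, 1.0, 0.0],
              [0.0, 5.0, 1.0],
              [2.0, 2.0, 2.0]])

cls_emb  = H[0]            # CLS  = H[0]
mean_emb = H.mean(axis=0)  # Mean = (1/n) * sum_i H[i]
max_emb  = H.max(axis=0)   # Max  = element-wise max_i H[i]
last_emb = H[-1]           # Last = H[n-1]

print(mean_emb)  # → [1.5  2.   1.25]
print(max_emb)   # → [3. 5. 2.]
```

All four strategies collapse the (n, d) matrix into a single d-dimensional vector; they differ only in which rows contribute and how.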
def example_7_custom_pooling_strategies():
    """Different ways to pool token embeddings"""
    print("\n" + "=" * 70)
    print("Example 7: Custom Pooling Strategies")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    text = "Machine learning transforms data into insights"
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    token_embeddings = outputs.last_hidden_state[0]  # [seq_len, hidden_size]

    # Different pooling strategies
    strategies = {
        "CLS Token": token_embeddings[0],
        "Mean Pooling": torch.mean(token_embeddings, dim=0),
        "Max Pooling": torch.max(token_embeddings, dim=0)[0],
        "Last Token": token_embeddings[-1],
    }

    print(f"\nText: '{text}'")
    print(f"Token embeddings shape: {token_embeddings.shape}")
    print(f"  ({token_embeddings.shape[0]} tokens × {token_embeddings.shape[1]} dimensions)")

    print("\nPooling Strategy Comparison:")
    print("-" * 70)

    for strategy, embedding in strategies.items():
        embedding_np = embedding.numpy()
        print(f"\n{strategy}:")
        print(f"  Shape: {embedding_np.shape}")
        print(f"  Norm: {np.linalg.norm(embedding_np):.4f}")
        print(f"  Mean: {embedding_np.mean():.4f}")
        print(f"  Std: {embedding_np.std():.4f}")
Example 8: Token-Level Embeddings¶
What: Extract and inspect the embedding vector for each individual token in a sentence, rather than pooling them into a single sentence embedding.
Why: Token-level embeddings are essential for tasks that operate at the word or subword level: Named Entity Recognition (NER) classifies each token as a person, organization, location, etc.; Part-of-Speech tagging labels each token’s grammatical role; and token classification tasks in general require per-token representations. Understanding the shape and properties of token embeddings helps you build custom heads on top of transformer models.
Connection: When you use spaCy’s transformer pipeline or HuggingFace’s TokenClassificationPipeline, they are extracting and classifying these exact token-level embeddings under the hood.
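A custom token-classification head is just a linear layer applied position-wise on top of the token embeddings. The sketch below uses random tensors in place of a real last_hidden_state; the hidden size follows BERT-base, while the sequence length (7) and label count (5) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 5
# Stand-in for outputs.last_hidden_state: (batch, seq_len, hidden_size)
token_embeddings = torch.randn(1, 7, hidden_size)

# One logit vector per token; nn.Linear broadcasts over the first two dims
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(token_embeddings)
predictions = logits.argmax(dim=-1)  # one label id per token

print(logits.shape)       # torch.Size([1, 7, 5])
print(predictions.shape)  # torch.Size([1, 7])
```

This is the same structure HuggingFace's `*ForTokenClassification` models use internally: a transformer body producing per-token states, plus a small linear head.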
def example_8_token_level_embeddings():
    """Get embeddings for individual tokens"""
    print("\n" + "=" * 70)
    print("Example 8: Token-Level Embeddings")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    text = "Natural language processing"
    inputs = tokenizer(text, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    with torch.no_grad():
        outputs = model(**inputs)

    token_embeddings = outputs.last_hidden_state[0]  # [seq_len, hidden_size]

    print(f"\nText: '{text}'")
    print("\nToken-by-Token Embeddings:")
    print("-" * 70)

    for i, (token, embedding) in enumerate(zip(tokens, token_embeddings)):
        embedding_np = embedding.numpy()
        print(f"\nToken {i}: '{token}'")
        print(f"  Embedding dim: {len(embedding_np)}")
        print(f"  Norm: {np.linalg.norm(embedding_np):.4f}")
        print(f"  First 5 values: {embedding_np[:5]}")
Example 9: Embeddings From Different Layers¶
What: Extract the CLS token embedding from every layer of BERT (layers 0-12) and compare their norms and means to see how representations evolve through the network.
Why: Different layers capture different levels of abstraction. Lower layers (0-3) capture syntactic features like part-of-speech and word order. Middle layers (4-8) capture more semantic features like word sense and entity types. Upper layers (9-12) produce the most task-specific representations. For some tasks, using a weighted combination of multiple layers (or a specific intermediate layer) outperforms using only the final layer.
How: Passing output_hidden_states=True when loading the model with from_pretrained makes it return all intermediate layer outputs as a tuple. Entry 0 is the raw input embedding (before any transformer block), and entry 12 is the final output.
Connection: The idea of layer-wise representation probing comes from the NLP interpretability literature (e.g., “BERTology” research). Practical systems sometimes concatenate or average the last 4 layers for improved performance.
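The last-4-layers recipe reduces to a stack-and-mean over the tail of the hidden_states tuple. The sketch below substitutes random arrays for real hidden states; the shapes (13 entries of (1, seq_len, 768)) match BERT-base, while seq_len = 6 is arbitrary.

```python
import numpy as np

# Stand-in for outputs.hidden_states: layer 0 = input embeddings,
# layers 1-12 = transformer block outputs (random values for illustration)
rng = np.random.default_rng(0)
hidden_states = tuple(rng.normal(size=(1, 6, 768)) for _ in range(13))

# Average the last 4 layers, then mean-pool over tokens
last4 = np.stack(hidden_states[-4:], axis=0).mean(axis=0)  # (1, 6, 768)
sentence_emb = last4.mean(axis=1)                          # (1, 768)

print(last4.shape, sentence_emb.shape)
```

Concatenating instead of averaging would give a (1, 6, 3072) tensor per token; both variants appear in the BERTology literature.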
def example_9_comparing_layers():
    """Compare embeddings from different layers"""
    print("\n" + "=" * 70)
    print("Example 9: Embeddings from Different Layers")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    text = "Transformers are powerful"
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Get embeddings from all layers
    all_hidden_states = outputs.hidden_states  # Tuple of layer outputs

    print(f"\nText: '{text}'")
    print(f"Number of layers: {len(all_hidden_states)}")
    print("  (Layer 0 = input embeddings, Layer 12 = final output)")

    # Compare CLS token embedding across layers
    print("\nCLS Token Embedding Across Layers:")
    print("-" * 70)

    for i, hidden_state in enumerate(all_hidden_states):
        cls_embedding = hidden_state[0, 0, :].numpy()  # CLS token
        norm = np.linalg.norm(cls_embedding)
        mean = cls_embedding.mean()
        print(f"Layer {i:2d}: norm={norm:.4f}, mean={mean:.6f}")

    print("\n💡 Higher layers capture more semantic meaning!")
Example 10: Contextual Embeddings¶
What: Show that the same word (“bank”) receives different embedding vectors depending on the surrounding context (financial bank vs. river bank).
Why: This is the defining feature of transformer-based embeddings and what makes them fundamentally different from static word embeddings like Word2Vec. In Word2Vec, “bank” always has the same vector regardless of context. In BERT, each occurrence of “bank” is contextualized by the entire sentence through self-attention, so the financial sense and the geographical sense produce distinct vectors. This context-sensitivity is what makes transformers so powerful for understanding language.
How: The self-attention mechanism computes: \(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\), which means each token’s representation is a weighted combination of all other tokens in the sequence. The word “bank” next to “deposit money” attends to financial terms, producing a different output than “bank” next to “river.”
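The attention formula can be reproduced with plain NumPy on toy matrices (random values, single head, no masking) to confirm that each output row is a convex combination of the value rows:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(42)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)             # (3, 4): one contextualized vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Because the weights in each row sum to 1, every token's output vector mixes information from all other tokens, which is precisely what makes "bank" context-dependent.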
Connection: Contextual embeddings are what enable modern NLP to handle polysemy (words with multiple meanings), coreference resolution, and nuanced language understanding – capabilities that were impossible with pre-transformer approaches.
def example_10_contextual_embeddings():
    """Show how context affects word embeddings"""
    print("\n" + "=" * 70)
    print("Example 10: Contextual Embeddings")
    print("=" * 70)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Same word "bank" in different contexts
    sentences = [
        "I need to go to the bank to deposit money",
        "The river bank was covered with flowers",
        "The bank announced new interest rates"
    ]

    print("\nWord: 'bank' in different contexts")
    print("-" * 70)

    bank_embeddings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

        with torch.no_grad():
            outputs = model(**inputs)

        token_embeddings = outputs.last_hidden_state[0]

        # Find the "bank" token
        bank_idx = [i for i, token in enumerate(tokens) if 'bank' in token.lower()][0]
        bank_embedding = token_embeddings[bank_idx].numpy()
        bank_embeddings.append(bank_embedding)

        print(f"\n'{sentence}'")
        print(f"  'bank' at position {bank_idx}")
        print(f"  Embedding norm: {np.linalg.norm(bank_embedding):.4f}")

    # Compare similarity
    print("\n" + "-" * 70)
    print("Contextual Similarity:")
    print("-" * 70)

    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            similarity = 1 - cosine(bank_embeddings[i], bank_embeddings[j])
            print(f"\nContext {i + 1} vs Context {j + 1}: {similarity:.4f}")

    print("\n💡 Same word, different embeddings based on context!")
Main – Run All Examples¶
Execute all ten examples in sequence, covering the full spectrum from basic BERT embeddings through contextual representations. Each example builds on the previous ones, progressively deepening your understanding of how transformer models represent text.
def main():
    """Run all examples"""
    print("\n" + "=" * 70)
    print("HUGGINGFACE TRANSFORMERS EMBEDDINGS GUIDE")
    print("=" * 70)

    print("\nThis guide covers:")
    print("  1. Basic BERT embeddings")
    print("  2. Mean pooling vs CLS token")
    print("  3. Different transformer models")
    print("  4. Batch processing")
    print("  5. Sentence transformers (optimized)")
    print("  6. Multilingual embeddings")
    print("  7. Custom pooling strategies")
    print("  8. Token-level embeddings")
    print("  9. Embeddings from different layers")
    print(" 10. Contextual embeddings")

    try:
        example_1_basic_bert_embeddings()
        example_2_mean_pooling()
        example_3_different_models()
        example_4_batch_processing()
        example_5_sentence_transformers()
        example_6_multilingual_embeddings()
        example_7_custom_pooling_strategies()
        example_8_token_level_embeddings()
        example_9_comparing_layers()
        example_10_contextual_embeddings()

        print("\n" + "=" * 70)
        print("KEY TAKEAWAYS")
        print("=" * 70)
        print("""
1. **HuggingFace Transformers**: Flexible but requires more code
   - Use AutoModel and AutoTokenizer
   - Choose a pooling strategy (CLS, mean, max)

2. **Sentence Transformers**: Optimized for sentence embeddings
   - Simpler API: model.encode(texts)
   - Pre-trained for similarity tasks

3. **Pooling Matters**: Different strategies for different tasks
   - CLS token: Traditional BERT approach
   - Mean pooling: Often better for similarity
   - Max pooling: Captures strongest features

4. **Context is Key**: Transformer embeddings are contextual
   - Same word → different embeddings in different contexts
   - This is the power of transformers!

5. **Choose the Right Model**:
   - English only: BERT, RoBERTa, DistilBERT
   - Multilingual: multilingual-BERT, XLM-RoBERTa
   - Optimized: sentence-transformers models

6. **Batch Processing**: Always process multiple texts together
   - Much faster than one-by-one
   - GPU utilization improves
        """)
    except Exception as e:
        print(f"\n❌ Error: {e}")
        print("\nMake sure you have installed:")
        print("  pip install transformers torch sentence-transformers")


if __name__ == "__main__":
    main()