Embeddings Introduction

Learn what embeddings are and how to generate them.

Key Concepts:

  • Embeddings convert text into dense vectors (arrays of numbers)

  • Similar meanings = similar vectors

  • Vector dimensions represent features learned from data

  • Common models: sentence-transformers, OpenAI embeddings

Installation: pip install sentence-transformers numpy

Setup

Import SentenceTransformer from the sentence-transformers library and numpy for vector operations. Sentence Transformers provides pretrained models that convert text into fixed-size dense vectors (embeddings) optimized for semantic similarity tasks. NumPy will be used for cosine similarity calculations and vector arithmetic.

from sentence_transformers import SentenceTransformer
import numpy as np

Part 1: Generating Your First Embeddings

What: Load the all-MiniLM-L6-v2 model and encode four sentences into 384-dimensional embedding vectors using model.encode().

Why: Embeddings are the bridge between human-readable text and the numerical representations that ML models operate on. Unlike bag-of-words or TF-IDF (which treat words as independent features), embedding models produce dense vectors where each dimension captures a learned semantic feature. Two sentences with similar meaning will have nearby vectors in this 384-dimensional space, even if they share no words at all.

How: The model passes each sentence through a 6-layer transformer (MiniLM), then applies mean pooling over the token-level outputs to produce a single fixed-size vector. The result is a NumPy array of shape (num_sentences, 384).
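
The pooling step can be sketched in plain NumPy (toy shapes and random numbers stand in for the real contextualized token vectors, which have 384 dimensions):

```python
import numpy as np

# Mean pooling: collapse a (num_tokens, dim) matrix of token-level vectors
# into a single fixed-size sentence vector by averaging over tokens.
# Toy shapes: 5 tokens, 4 dimensions (all-MiniLM-L6-v2 uses 384).
token_embeddings = np.random.rand(5, 4)
sentence_embedding = token_embeddings.mean(axis=0)

print(sentence_embedding.shape)  # (4,)
```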

Connection: This is the same operation that powers semantic search, recommendation systems, and RAG (Retrieval Augmented Generation) – you encode your knowledge base once, then encode queries at runtime and find the nearest neighbors.
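
That retrieval loop can be mimicked with invented 4-dimensional vectors (a real system would use model-generated embeddings and a vector index):

```python
import numpy as np

# Hypothetical "knowledge base" embeddings -- values invented for the sketch.
knowledge_base = {
    "cats and dogs":    np.array([0.9, 0.1, 0.0, 0.2]),
    "python tutorials": np.array([0.0, 0.8, 0.6, 0.1]),
    "weather report":   np.array([0.1, 0.0, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend this vector encodes the query "pets".
query = np.array([0.8, 0.2, 0.1, 0.1])

# Nearest neighbor = the document whose embedding scores highest.
best = max(knowledge_base, key=lambda k: cosine(query, knowledge_base[k]))
print(best)  # cats and dogs
```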

def introduction_to_embeddings():
    """Generate your first embeddings and explore their properties."""
    
    print("=" * 70)
    print("PART 1: GENERATING EMBEDDINGS")
    print("=" * 70)
    
    # Load a pre-trained model (downloads ~90MB on first run)
    # This model creates 384-dimensional vectors
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Simple examples
    texts = [
        "The cat sits on the mat",
        "A feline rests on the carpet",  # Similar meaning, different words
        "Dogs are playing in the park",   # Different topic
        "Python is a programming language" # Completely different
    ]
    
    # Generate embeddings
    embeddings = model.encode(texts)
    
    print(f"\nGenerated embeddings for {len(texts)} sentences")
    print(f"Each embedding has {embeddings.shape[1]} dimensions")
    print(f"Shape: {embeddings.shape}")
    
    # Show first embedding (truncated)
    print(f"\nFirst 10 values of embedding for: '{texts[0]}'")
    print(embeddings[0][:10])
    print("... (374 more dimensions)")
    
    return model, texts, embeddings

Part 2: Measuring Semantic Similarity

What: Compute pairwise cosine similarity between the embeddings to quantify how semantically related each pair of sentences is.

Why: Cosine similarity is the standard metric for comparing embeddings. It measures the angle between two vectors, ignoring their magnitudes, so it captures direction (meaning) rather than length. Values range from -1 (opposite directions) to 1 (identical direction), with 0 indicating orthogonal vectors, i.e. no semantic relationship; in practice, scores for unrelated sentences tend to sit near 0 rather than -1.

How: For two vectors \(\mathbf{A}\) and \(\mathbf{B}\), cosine similarity is computed as:

\[\text{cos}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|}\]

The numerator is the dot product (sum of element-wise products), and the denominator normalizes by each vector’s magnitude. When embeddings are already unit-normalized (as with many models), cosine similarity reduces to a simple dot product.
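
A quick NumPy check of that equivalence, using two arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# Full formula: dot product divided by the product of the magnitudes.
full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Unit-normalize first; cosine similarity is then just a dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

assert np.isclose(full_cosine, dot_of_units)
```

With sentence-transformers you can request unit-length vectors directly via `model.encode(texts, normalize_embeddings=True)`, after which a plain dot product suffices.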

Connection: Search engines, duplicate detectors, and recommendation systems all rely on cosine similarity (or its equivalent, cosine distance = 1 - similarity) to rank results by semantic relevance.

def explore_similarity(texts, embeddings):
    """Understand how similar meanings produce similar vectors."""
    
    print("\n" + "=" * 70)
    print("PART 2: MEASURING SEMANTIC SIMILARITY")
    print("=" * 70)
    
    # Cosine similarity: measures angle between vectors
    # Range: -1 (opposite) to 1 (identical)
    # Formula: cos(θ) = (A · B) / (||A|| × ||B||)
    
    def cosine_similarity(vec1, vec2):
        """Calculate cosine similarity between two vectors."""
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        return dot_product / (norm1 * norm2)
    
    print("\nComparing all sentences:\n")
    
    # Compare each pair
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity: {similarity:.4f}")
            print(f"  '{texts[i]}'")
            print(f"  '{texts[j]}'")
            print()
    
    print("\nπŸ” OBSERVATION:")
    print("The first two sentences (cat/feline) should have HIGH similarity (~0.7+)")
    print("because they mean the same thing, even with different words!")

Part 3: Vector Arithmetic – The Magic of Embeddings

What: Demonstrate the famous analogy β€œking - man + woman = queen” using embedding vector arithmetic.

Why: One of the most remarkable properties of well-trained embeddings is that semantic relationships are encoded as linear directions in the vector space. Subtracting the β€œman” vector from β€œking” removes the male-gender direction, and adding β€œwoman” inserts the female-gender direction, landing near β€œqueen.” This works because the training process organizes the vector space so that analogous relationships occupy parallel directions.

How: The vector arithmetic is simply: \(\mathbf{v}_{\text{result}} = \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}\). We then compute cosine similarity between \(\mathbf{v}_{\text{result}}\) and every candidate word to find the closest match.

Connection: This linear structure underlies many practical applications: bias detection (measuring gender/racial directions in embedding spaces), analogy completion, and even creative applications like β€œstyle transfer” in text by manipulating embedding directions.
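
Before running the real model, the arithmetic is easiest to see in a hand-built 2-D space where one axis is deliberately assigned to gender and the other to royalty (idealized values, not real embeddings):

```python
import numpy as np

# Axis 0: gender (+1 male, -1 female). Axis 1: royalty (0 or 1).
vectors = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# king - man + woman: remove the male direction, add the female direction.
result = vectors["king"] - vectors["man"] + vectors["woman"]

print(result)  # [-1.  1.] -- exactly the "queen" vector
```

Real embedding spaces are noisier, so the result lands near "queen" rather than exactly on it.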

def vector_operations(model):
    """Demonstrate that embeddings capture semantic relationships."""
    
    print("\n" + "=" * 70)
    print("PART 3: VECTOR MATH (THE MAGIC!)")
    print("=" * 70)
    
    # Famous example: king - man + woman ≈ queen
    words = ["king", "man", "woman", "queen", "prince", "princess"]
    word_embeddings = model.encode(words)
    
    # Create a dictionary for easy access
    word_vectors = {word: emb for word, emb in zip(words, word_embeddings)}
    
    # Vector arithmetic
    result_vector = (word_vectors["king"] 
                    - word_vectors["man"] 
                    + word_vectors["woman"])
    
    print("\nVector equation: king - man + woman = ?")
    print("\nComparing result to known words:")
    
    for word in words:
        similarity = np.dot(result_vector, word_vectors[word]) / (
            np.linalg.norm(result_vector) * np.linalg.norm(word_vectors[word])
        )
        print(f"  {word:10s}: {similarity:.4f}")
    
    print("\n✨ Excluding the input words themselves, the result should be closest to 'queen'!")
    print("This shows embeddings capture gender relationships!")

Part 4: The Math Explained

What: A detailed walkthrough of the mathematical foundations behind embeddings: how they represent text as vectors, how cosine similarity and Euclidean distance work, why 384 dimensions is a practical sweet spot, and how these representations enable downstream ML tasks.

Why: Building strong intuition for the math behind embeddings lets you make informed decisions about model selection (how many dimensions?), distance metrics (cosine vs. Euclidean vs. dot product), and debugging (why are two unrelated texts scoring as similar?). The concepts here – dot products, norms, and distance metrics – are the same ones used in vector databases, nearest-neighbor search, and attention mechanisms in transformers.

def explain_the_math():
    """Explain what's happening under the hood."""
    
    print("\n" + "=" * 70)
    print("PART 4: THE MATH EXPLAINED")
    print("=" * 70)
    
    explanation = """
    What Are Embeddings?
    ====================
    
    1. REPRESENTATION:
       - Text: "cat" → Embedding: [0.23, -0.45, 0.67, ..., 0.12] (384 numbers)
       - Each number represents a learned feature
       - Dimensions might capture: animal?, size?, domestic?, etc.
    
    2. HOW THEY'RE CREATED:
       - Neural networks trained on massive text data
       - Learn to predict context (words around a target word)
       - Similar contexts β†’ similar vectors
       - Example: "cat" and "dog" appear in similar contexts
    
    3. COSINE SIMILARITY:
       - Measures angle between vectors (not distance!)
       - Formula: cos(θ) = (A·B) / (||A|| × ||B||)
       
       Where:
         A·B = dot product (sum of element-wise multiplication)
         ||A|| = magnitude/length of vector A
         ||B|| = magnitude/length of vector B
       
       Example:
         A = [1, 2, 3]
         B = [2, 4, 6]  (B is 2×A, same direction)
         
         A·B = (1×2) + (2×4) + (3×6) = 2 + 8 + 18 = 28
         ||A|| = √(1² + 2² + 3²) = √14 ≈ 3.74
         ||B|| = √(2² + 4² + 6²) = √56 ≈ 7.48
         
         cos(θ) = 28 / (3.74 × 7.48) ≈ 1.0 (identical direction!)
    
    4. WHY 384 DIMENSIONS?
       - More dimensions = more capacity to encode meaning
       - 384 is a sweet spot: fast, accurate, efficient
       - Larger models use 768, 1536, or even 4096 dimensions
    
    5. EUCLIDEAN DISTANCE (alternative metric):
       - Measures straight-line distance between vectors
       - Formula: d = √(Σ(A[i] - B[i])²)
       - Smaller distance = more similar
       - Less commonly used for text (cosine, which ignores magnitude,
         usually matches semantic similarity better)
    
    6. WHY THIS MATTERS:
       - Once text is a vector, we can:
         * Search by meaning (not keywords)
         * Find similar documents
         * Cluster related content
         * Build recommendation systems
         * Power RAG (Retrieval Augmented Generation)
    """
    
    print(explanation)
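
The worked cosine example in the printout is easy to verify numerically, with the Euclidean distance shown for contrast:

```python
import numpy as np

# B = 2*A points in the same direction as A, so the cosine similarity is
# exactly 1 -- even though the Euclidean distance between them is not 0.
A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])

cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
euclid = np.linalg.norm(A - B)

print(round(cos_sim, 4))  # 1.0
print(round(euclid, 4))   # 3.7417 (= sqrt(14))
```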

Main – Run All Demonstrations

Execute all five parts in sequence: generate embeddings, measure similarity, perform vector arithmetic, review the math, and build a practical semantic search example. Each part builds on the concepts from the previous one.

def main():
    """Run all demonstrations."""
    
    print("\n")
    print("🚀 " + "=" * 66 + " 🚀")
    print("   EMBEDDINGS: TURNING TEXT INTO MEANINGFUL NUMBERS")
    print("🚀 " + "=" * 66 + " 🚀")
    
    # Part 1: Generate embeddings
    model, texts, embeddings = introduction_to_embeddings()
    
    # Part 2: Explore similarity
    explore_similarity(texts, embeddings)
    
    # Part 3: Vector operations
    vector_operations(model)
    
    # Part 4: Explain the math
    explain_the_math()
    
    # Part 5: Practical example
    practical_example()
    
    print("\n" + "=" * 70)
    print("🎓 NEXT STEPS:")
    print("=" * 70)
    print("""
    1. Run: semantic_similarity.py - Compare different texts
    2. Run: vector_database_demo.py - Store & search embeddings at scale
    3. Experiment: Try different models from sentence-transformers
    4. Build: Create semantic search for YOUR documents
    
    Key Takeaway:
    Embeddings are the bridge between human language and machine understanding.
    Master this, and you'll understand how modern AI really works!
    """)


if __name__ == "__main__":
    main()