Embeddings Introduction

Learn what embeddings are and how to generate them.

Key Concepts:

  • Embeddings convert text into dense vectors (arrays of numbers)

  • Similar meanings = similar vectors

  • Vector dimensions represent features learned from data

  • Common models: sentence-transformers, OpenAI embeddings

Installation: pip install sentence-transformers numpy

Setup

Import SentenceTransformer from the sentence-transformers library and numpy for vector operations. Sentence Transformers provides pretrained models that convert text into fixed-size dense vectors (embeddings) optimized for semantic similarity tasks. NumPy will be used for cosine similarity calculations and vector arithmetic.

from sentence_transformers import SentenceTransformer
import numpy as np

Part 1: Generating Your First Embeddings

What: Load the all-MiniLM-L6-v2 model and encode four sentences into 384-dimensional embedding vectors using model.encode().

Why: Embeddings are the bridge between human-readable text and the numerical representations that ML models operate on. Unlike bag-of-words or TF-IDF (which treat words as independent features), embedding models produce dense vectors where each dimension captures a learned semantic feature. Two sentences with similar meaning will have nearby vectors in this 384-dimensional space, even if they share no words at all.

How: The model passes each sentence through a 6-layer transformer (MiniLM), then applies mean pooling over the token-level outputs to produce a single fixed-size vector. The result is a NumPy array of shape (num_sentences, 384).
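
The pooling step can be sketched in plain NumPy (toy shapes and random numbers stand in for the real contextualized token vectors, which have 384 dimensions):

```python
import numpy as np

# Mean pooling: collapse a (num_tokens, dim) matrix of token-level vectors
# into a single fixed-size sentence vector by averaging over tokens.
# Toy shapes: 5 tokens, 4 dimensions (all-MiniLM-L6-v2 uses 384).
token_embeddings = np.random.rand(5, 4)
sentence_embedding = token_embeddings.mean(axis=0)

print(sentence_embedding.shape)  # (4,)
```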

Connection: This is the same operation that powers semantic search, recommendation systems, and RAG (Retrieval Augmented Generation) – you encode your knowledge base once, then encode queries at runtime and find the nearest neighbors.
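
That retrieval loop can be mimicked with invented 4-dimensional vectors (a real system would use model-generated embeddings and a vector index):

```python
import numpy as np

# Hypothetical "knowledge base" embeddings -- values invented for the sketch.
knowledge_base = {
    "cats and dogs":    np.array([0.9, 0.1, 0.0, 0.2]),
    "python tutorials": np.array([0.0, 0.8, 0.6, 0.1]),
    "weather report":   np.array([0.1, 0.0, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend this vector encodes the query "pets".
query = np.array([0.8, 0.2, 0.1, 0.1])

# Nearest neighbor = the document whose embedding scores highest.
best = max(knowledge_base, key=lambda k: cosine(query, knowledge_base[k]))
print(best)  # cats and dogs
```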

def introduction_to_embeddings():
    """Generate your first embeddings and explore their properties."""
    
    print("=" * 70)
    print("PART 1: GENERATING EMBEDDINGS")
    print("=" * 70)
    
    # Load a pre-trained model (downloads ~90MB on first run)
    # This model creates 384-dimensional vectors
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Simple examples
    texts = [
        "The cat sits on the mat",
        "A feline rests on the carpet",  # Similar meaning, different words
        "Dogs are playing in the park",   # Different topic
        "Python is a programming language" # Completely different
    ]
    
    # Generate embeddings
    embeddings = model.encode(texts)
    
    print(f"\nGenerated embeddings for {len(texts)} sentences")
    print(f"Each embedding has {embeddings.shape[1]} dimensions")
    print(f"Shape: {embeddings.shape}")
    
    # Show first embedding (truncated)
    print(f"\nFirst 10 values of embedding for: '{texts[0]}'")
    print(embeddings[0][:10])
    print("... (374 more dimensions)")
    
    return model, texts, embeddings

Part 2: Measuring Semantic Similarity

What: Compute pairwise cosine similarity between the embeddings to quantify how semantically related each pair of sentences is.

Why: Cosine similarity is the standard metric for comparing embeddings. It measures the angle between two vectors, ignoring their magnitudes, so it captures direction (meaning) rather than length. Values range from -1 (opposite directions) to 1 (identical direction), with 0 indicating orthogonal vectors, i.e. no semantic relationship; in practice, scores for unrelated sentences tend to sit near 0 rather than -1.

How: For two vectors \(\mathbf{A}\) and \(\mathbf{B}\), cosine similarity is computed as:

\[\text{cos}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|}\]

The numerator is the dot product (sum of element-wise products), and the denominator normalizes by each vector’s magnitude. When embeddings are already unit-normalized (as with many models), cosine similarity reduces to a simple dot product.
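
A quick NumPy check of that equivalence, using two arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# Full formula: dot product divided by the product of the magnitudes.
full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Unit-normalize first; cosine similarity is then just a dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

assert np.isclose(full_cosine, dot_of_units)
```

With sentence-transformers you can request unit-length vectors directly via `model.encode(texts, normalize_embeddings=True)`, after which a plain dot product suffices.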

Connection: Search engines, duplicate detectors, and recommendation systems all rely on cosine similarity (or its equivalent, cosine distance = 1 - similarity) to rank results by semantic relevance.

def explore_similarity(texts, embeddings):
    """Understand how similar meanings produce similar vectors."""
    
    print("\n" + "=" * 70)
    print("PART 2: MEASURING SEMANTIC SIMILARITY")
    print("=" * 70)
    
    # Cosine similarity: measures angle between vectors
    # Range: -1 (opposite) to 1 (identical)
    # Formula: cos(θ) = (A · B) / (||A|| × ||B||)
    
    def cosine_similarity(vec1, vec2):
        """Calculate cosine similarity between two vectors."""
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        return dot_product / (norm1 * norm2)
    
    print("\nComparing all sentences:\n")
    
    # Compare each pair
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity: {similarity:.4f}")
            print(f"  '{texts[i]}'")
            print(f"  '{texts[j]}'")
            print()
    
    print("\nπŸ” OBSERVATION:")
    print("The first two sentences (cat/feline) should have HIGH similarity (~0.7+)")
    print("because they mean the same thing, even with different words!")

Part 3: Vector Arithmetic – The Magic of Embeddings

What: Demonstrate the famous analogy β€œking - man + woman = queen” using embedding vector arithmetic.

Why: One of the most remarkable properties of well-trained embeddings is that semantic relationships are encoded as linear directions in the vector space. Subtracting the β€œman” vector from β€œking” removes the male-gender direction, and adding β€œwoman” inserts the female-gender direction, landing near β€œqueen.” This works because the training process organizes the vector space so that analogous relationships occupy parallel directions.

How: The vector arithmetic is simply: \(\mathbf{v}_{\text{result}} = \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}\). We then compute cosine similarity between \(\mathbf{v}_{\text{result}}\) and every candidate word to find the closest match.

Connection: This linear structure underlies many practical applications: bias detection (measuring gender/racial directions in embedding spaces), analogy completion, and even creative applications like β€œstyle transfer” in text by manipulating embedding directions.
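
Before running the real model, the arithmetic is easiest to see in a hand-built 2-D space where one axis is deliberately assigned to gender and the other to royalty (idealized values, not real embeddings):

```python
import numpy as np

# Axis 0: gender (+1 male, -1 female). Axis 1: royalty (0 or 1).
vectors = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# king - man + woman: remove the male direction, add the female direction.
result = vectors["king"] - vectors["man"] + vectors["woman"]

print(result)  # [-1.  1.] -- exactly the "queen" vector
```

Real embedding spaces are noisier, so the result lands near "queen" rather than exactly on it.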

def vector_operations(model):
    """Demonstrate that embeddings capture semantic relationships."""
    
    print("\n" + "=" * 70)
    print("PART 3: VECTOR MATH (THE MAGIC!)")
    print("=" * 70)
    
    # Famous example: king - man + woman ≈ queen
    words = ["king", "man", "woman", "queen", "prince", "princess"]
    word_embeddings = model.encode(words)
    
    # Create a dictionary for easy access
    word_vectors = {word: emb for word, emb in zip(words, word_embeddings)}
    
    # Vector arithmetic
    result_vector = (word_vectors["king"] 
                    - word_vectors["man"] 
                    + word_vectors["woman"])
    
    print("\nVector equation: king - man + woman = ?")
    print("\nComparing result to known words:")
    
    for word in words:
        similarity = np.dot(result_vector, word_vectors[word]) / (
            np.linalg.norm(result_vector) * np.linalg.norm(word_vectors[word])
        )
        print(f"  {word:10s}: {similarity:.4f}")
    
    print("\n✨ Excluding the input words themselves, the result should be closest to 'queen'!")
    print("This shows embeddings capture gender relationships!")

Part 4: The Math Explained

What: A detailed walkthrough of the mathematical foundations behind embeddings: how they represent text as vectors, how cosine similarity and Euclidean distance work, why 384 dimensions is a practical sweet spot, and how these representations enable downstream ML tasks.

Why: Building strong intuition for the math behind embeddings lets you make informed decisions about model selection (how many dimensions?), distance metrics (cosine vs. Euclidean vs. dot product), and debugging (why are two unrelated texts scoring as similar?). The concepts here – dot products, norms, and distance metrics – are the same ones used in vector databases, nearest-neighbor search, and attention mechanisms in transformers.

def explain_the_math():
    """Explain what's happening under the hood."""
    
    print("\n" + "=" * 70)
    print("PART 4: THE MATH EXPLAINED")
    print("=" * 70)
    
    explanation = """
    What Are Embeddings?
    ====================
    
    1. REPRESENTATION:
       - Text: "cat" → Embedding: [0.23, -0.45, 0.67, ..., 0.12] (384 numbers)
       - Each number represents a learned feature
       - Dimensions might capture: animal?, size?, domestic?, etc.
    
    2. HOW THEY'RE CREATED:
       - Neural networks trained on massive text data
       - Learn to predict context (words around a target word)
       - Similar contexts β†’ similar vectors
       - Example: "cat" and "dog" appear in similar contexts
    
    3. COSINE SIMILARITY:
       - Measures angle between vectors (not distance!)
       - Formula: cos(θ) = (A·B) / (||A|| × ||B||)
       
       Where:
         A·B = dot product (sum of element-wise multiplication)
         ||A|| = magnitude/length of vector A
         ||B|| = magnitude/length of vector B
       
       Example:
         A = [1, 2, 3]
         B = [2, 4, 6]  (B is 2×A, same direction)
         
         A·B = (1×2) + (2×4) + (3×6) = 2 + 8 + 18 = 28
         ||A|| = √(1² + 2² + 3²) = √14 ≈ 3.74
         ||B|| = √(2² + 4² + 6²) = √56 ≈ 7.48
         
         cos(θ) = 28 / (3.74 × 7.48) ≈ 1.0 (identical direction!)
    
    4. WHY 384 DIMENSIONS?
       - More dimensions = more capacity to encode meaning
       - 384 is a sweet spot: fast, accurate, efficient
       - Larger models use 768, 1536, or even 4096 dimensions
    
    5. EUCLIDEAN DISTANCE (alternative metric):
       - Measures straight-line distance between vectors
       - Formula: d = √(Σ(A[i] - B[i])²)
       - Smaller distance = more similar
       - Less commonly used for text (cosine, which ignores magnitude,
         usually matches semantic similarity better)
    
    6. WHY THIS MATTERS:
       - Once text is a vector, we can:
         * Search by meaning (not keywords)
         * Find similar documents
         * Cluster related content
         * Build recommendation systems
         * Power RAG (Retrieval Augmented Generation)
    """
    
    print(explanation)
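
The worked cosine example in the printout is easy to verify numerically, with the Euclidean distance shown for contrast:

```python
import numpy as np

# B = 2*A points in the same direction as A, so the cosine similarity is
# exactly 1 -- even though the Euclidean distance between them is not 0.
A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])

cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
euclid = np.linalg.norm(A - B)

print(round(cos_sim, 4))  # 1.0
print(round(euclid, 4))   # 3.7417 (= sqrt(14))
```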

Main – Run All Demonstrations

Execute all five parts in sequence: generate embeddings, measure similarity, perform vector arithmetic, review the math, and build a practical semantic search example. Each part builds on the concepts from the previous one.

def main():
    """Run all demonstrations."""
    
    print("\n")
    print("🚀 " + "=" * 66 + " 🚀")
    print("   EMBEDDINGS: TURNING TEXT INTO MEANINGFUL NUMBERS")
    print("🚀 " + "=" * 66 + " 🚀")
    
    # Part 1: Generate embeddings
    model, texts, embeddings = introduction_to_embeddings()
    
    # Part 2: Explore similarity
    explore_similarity(texts, embeddings)
    
    # Part 3: Vector operations
    vector_operations(model)
    
    # Part 4: Explain the math
    explain_the_math()
    
    # Part 5: Practical example
    practical_example()
    
    print("\n" + "=" * 70)
    print("🎓 NEXT STEPS:")
    print("=" * 70)
    print("""
    1. Run: semantic_similarity.py - Compare different texts
    2. Run: vector_database_demo.py - Store & search embeddings at scale
    3. Experiment: Try different models from sentence-transformers
    4. Build: Create semantic search for YOUR documents
    
    Key Takeaway:
    Embeddings are the bridge between human language and machine understanding.
    Master this, and you'll understand how modern AI really works!
    """)


if __name__ == "__main__":
    main()