Embeddings Introduction
Learn what embeddings are and how to generate them.
Key Concepts:
- Embeddings convert text into dense vectors (arrays of numbers)
- Similar meanings = similar vectors
- Vector dimensions represent features learned from data
- Common models: sentence-transformers, OpenAI embeddings

Installation: pip install sentence-transformers numpy
Setup
Import SentenceTransformer from the sentence-transformers library and numpy for vector operations. Sentence Transformers provides pretrained models that convert text into fixed-size dense vectors (embeddings) optimized for semantic similarity tasks. NumPy will be used for cosine similarity calculations and vector arithmetic.
from sentence_transformers import SentenceTransformer
import numpy as np
Part 1: Generating Your First Embeddings
What: Load the all-MiniLM-L6-v2 model and encode four sentences into 384-dimensional embedding vectors using model.encode().
Why: Embeddings are the bridge between human-readable text and the numerical representations that ML models operate on. Unlike bag-of-words or TF-IDF (which treat words as independent features), embedding models produce dense vectors where each dimension captures a learned semantic feature. Two sentences with similar meaning will have nearby vectors in this 384-dimensional space, even if they share no words at all.
How: The model passes each sentence through a 6-layer transformer (MiniLM), then applies mean pooling over the token-level outputs to produce a single fixed-size vector. The result is a NumPy array of shape (num_sentences, 384).
Connection: This is the same operation that powers semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation): you encode your knowledge base once, then encode queries at runtime and find the nearest neighbors.
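The mean-pooling step can be sketched in a few lines of NumPy. This is a simplified illustration with random stand-in values, not the model's actual outputs: the real model also applies attention masking before averaging.

```python
import numpy as np

# Hypothetical token-level transformer outputs for one sentence:
# 6 tokens, each a 384-dimensional vector (random stand-ins here).
rng = np.random.default_rng(0)
token_outputs = rng.normal(size=(6, 384))

# Mean pooling: average over the token axis to get one fixed-size
# sentence vector, regardless of how many tokens the sentence had.
sentence_embedding = token_outputs.mean(axis=0)
print(sentence_embedding.shape)  # (384,)
```

However long the input sentence, the pooled result always has the same shape, which is what makes embeddings directly comparable.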
def introduction_to_embeddings():
    """Generate your first embeddings and explore their properties."""
    print("=" * 70)
    print("PART 1: GENERATING EMBEDDINGS")
    print("=" * 70)

    # Load a pre-trained model (downloads ~90MB on first run)
    # This model creates 384-dimensional vectors
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Simple examples
    texts = [
        "The cat sits on the mat",
        "A feline rests on the carpet",      # Similar meaning, different words
        "Dogs are playing in the park",      # Different topic
        "Python is a programming language",  # Completely different
    ]

    # Generate embeddings
    embeddings = model.encode(texts)
    print(f"\nGenerated embeddings for {len(texts)} sentences")
    print(f"Each embedding has {embeddings.shape[1]} dimensions")
    print(f"Shape: {embeddings.shape}")

    # Show first embedding (truncated)
    print(f"\nFirst 10 values of embedding for: '{texts[0]}'")
    print(embeddings[0][:10])
    print("... (374 more dimensions)")

    return model, texts, embeddings
Part 2: Measuring Semantic Similarity
What: Compute pairwise cosine similarity between the embeddings to quantify how semantically related each pair of sentences is.
Why: Cosine similarity is the standard metric for comparing embeddings. It measures the angle between two vectors, ignoring their magnitudes, so it captures direction (meaning) rather than length. Values range from -1 (opposite meaning) to 1 (identical meaning), with 0 indicating no semantic relationship.
How: For two vectors \(\mathbf{A}\) and \(\mathbf{B}\), cosine similarity is computed as:

\[\cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}\]

The numerator is the dot product (sum of element-wise products), and the denominator normalizes by each vector's magnitude. When embeddings are already unit-normalized (as with many models), cosine similarity reduces to a simple dot product.
Connection: Search engines, duplicate detectors, and recommendation systems all rely on cosine similarity (or its equivalent, cosine distance = 1 - similarity) to rank results by semantic relevance.
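That last point, that cosine similarity collapses to a dot product for unit-length vectors, is easy to verify with NumPy. The vectors below are arbitrary toy values, not real embeddings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# Full cosine formula: dot product divided by both magnitudes
cos_full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize each vector to unit length first; then a plain dot
# product gives exactly the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
cos_dot = np.dot(a_unit, b_unit)

print(np.isclose(cos_full, cos_dot))  # True
```

This is why vector databases often store pre-normalized embeddings: at query time, ranking only needs dot products.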
def explore_similarity(texts, embeddings):
    """Understand how similar meanings produce similar vectors."""
    print("\n" + "=" * 70)
    print("PART 2: MEASURING SEMANTIC SIMILARITY")
    print("=" * 70)

    # Cosine similarity: measures angle between vectors
    # Range: -1 (opposite) to 1 (identical)
    # Formula: cos(θ) = (A · B) / (||A|| × ||B||)
    def cosine_similarity(vec1, vec2):
        """Calculate cosine similarity between two vectors."""
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        return dot_product / (norm1 * norm2)

    print("\nComparing all sentences:\n")

    # Compare each pair
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity: {similarity:.4f}")
            print(f"  '{texts[i]}'")
            print(f"  '{texts[j]}'")
            print()

    print("\nOBSERVATION:")
    print("The first two sentences (cat/feline) should have HIGH similarity (~0.7+)")
    print("because they mean the same thing, even with different words!")
Part 3: Vector Arithmetic (The Magic of Embeddings)
What: Demonstrate the famous analogy "king - man + woman = queen" using embedding vector arithmetic.
Why: One of the most remarkable properties of well-trained embeddings is that semantic relationships are encoded as linear directions in the vector space. Subtracting the "man" vector from "king" removes the male-gender direction, and adding "woman" inserts the female-gender direction, landing near "queen." This works because the training process organizes the vector space so that analogous relationships occupy parallel directions.
How: The vector arithmetic is simply: \(\mathbf{v}_{\text{result}} = \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}\). We then compute cosine similarity between \(\mathbf{v}_{\text{result}}\) and every candidate word to find the closest match.
Connection: This linear structure underlies many practical applications: bias detection (measuring gender/racial directions in embedding spaces), analogy completion, and even creative applications like "style transfer" in text by manipulating embedding directions.
def vector_operations(model):
    """Demonstrate that embeddings capture semantic relationships."""
    print("\n" + "=" * 70)
    print("PART 3: VECTOR MATH (THE MAGIC!)")
    print("=" * 70)

    # Famous example: king - man + woman ≈ queen
    words = ["king", "man", "woman", "queen", "prince", "princess"]
    word_embeddings = model.encode(words)

    # Create a dictionary for easy access
    word_vectors = {word: emb for word, emb in zip(words, word_embeddings)}

    # Vector arithmetic
    result_vector = (word_vectors["king"]
                     - word_vectors["man"]
                     + word_vectors["woman"])

    print("\nVector equation: king - man + woman = ?")
    print("\nComparing result to known words:")
    for word in words:
        similarity = np.dot(result_vector, word_vectors[word]) / (
            np.linalg.norm(result_vector) * np.linalg.norm(word_vectors[word])
        )
        print(f"  {word:10s}: {similarity:.4f}")

    print("\nThe result should be closest to 'queen'!")
    print("This shows embeddings capture gender relationships!")
Part 4: The Math Explained
What: A detailed walkthrough of the mathematical foundations behind embeddings: how they represent text as vectors, how cosine similarity and Euclidean distance work, why 384 dimensions is a practical sweet spot, and how these representations enable downstream ML tasks.
Why: Building strong intuition for the math behind embeddings lets you make informed decisions about model selection (how many dimensions?), distance metrics (cosine vs. Euclidean vs. dot product), and debugging (why are two unrelated texts scoring as similar?). The concepts here (dot products, norms, and distance metrics) are the same ones used in vector databases, nearest-neighbor search, and attention mechanisms in transformers.
def explain_the_math():
    """Explain what's happening under the hood."""
    print("\n" + "=" * 70)
    print("PART 4: THE MATH EXPLAINED")
    print("=" * 70)

    explanation = """
    What Are Embeddings?
    ====================

    1. REPRESENTATION:
       - Text: "cat" → Embedding: [0.23, -0.45, 0.67, ..., 0.12] (384 numbers)
       - Each number represents a learned feature
       - Dimensions might capture: animal?, size?, domestic?, etc.

    2. HOW THEY'RE CREATED:
       - Neural networks trained on massive text data
       - Learn to predict context (words around a target word)
       - Similar contexts → similar vectors
       - Example: "cat" and "dog" appear in similar contexts

    3. COSINE SIMILARITY:
       - Measures angle between vectors (not distance!)
       - Formula: cos(θ) = (A·B) / (||A|| × ||B||)
         Where:
           A·B   = dot product (sum of element-wise multiplication)
           ||A|| = magnitude/length of vector A
           ||B|| = magnitude/length of vector B

       Example:
         A = [1, 2, 3]
         B = [2, 4, 6]   (B is 2×A, same direction)
         A·B    = (1×2) + (2×4) + (3×6) = 2 + 8 + 18 = 28
         ||A||  = √(1² + 2² + 3²) = √14 ≈ 3.74
         ||B||  = √(2² + 4² + 6²) = √56 ≈ 7.48
         cos(θ) = 28 / (3.74 × 7.48) ≈ 1.0  (identical direction!)

    4. WHY 384 DIMENSIONS?
       - More dimensions = more capacity to encode meaning
       - 384 is a sweet spot: fast, accurate, efficient
       - Larger models use 768, 1536, or even 4096 dimensions

    5. EUCLIDEAN DISTANCE (alternative metric):
       - Measures straight-line distance between vectors
       - Formula: d = √(Σ(A[i] - B[i])²)
       - Smaller distance = more similar
       - Less commonly used for text (cosine is the usual choice)

    6. WHY THIS MATTERS:
       - Once text is a vector, we can:
         * Search by meaning (not keywords)
         * Find similar documents
         * Cluster related content
         * Build recommendation systems
         * Power RAG (Retrieval-Augmented Generation)
    """
    print(explanation)
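The worked example in the text can be checked directly. A few NumPy lines reproduce the cosine value for A = [1, 2, 3] and B = [2, 4, 6], and also compute the Euclidean distance between the same vectors to highlight the difference between the two metrics:

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])  # 2×A: same direction, twice the length

# Cosine similarity: A·B = 28, ||A||·||B|| = √14·√56 = √784 = 28
cos = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Euclidean distance: √((1-2)² + (2-4)² + (3-6)²) = √14
dist = np.linalg.norm(A - B)

print(round(cos, 4))   # 1.0 -- identical direction despite different lengths
print(round(dist, 4))  # 3.7417 -- Euclidean distance still sees a gap
```

The contrast is the whole point: cosine similarity says these vectors are identical in meaning (same direction), while Euclidean distance reports a sizable gap (different magnitudes). For text embeddings, direction is what carries meaning, which is why cosine is the usual choice.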
Part 5: Practical Example - Semantic Document Search
What: Build a minimal semantic search engine: encode a knowledge base of documents and a user query into embeddings, rank documents by cosine similarity, and return the top 3 most relevant results.
Why: Semantic search is the single most important application of embeddings. Unlike keyword search (which requires exact word matches), semantic search finds documents by meaning. A query like "How do computers learn from data?" will match a document about "Machine learning is a subset of AI" even though they share almost no words. This is the core mechanism behind RAG systems, where an LLM retrieves relevant context from a knowledge base before generating an answer.
Connection: Every modern AI-powered search system, from ChatGPT's retrieval plugin to enterprise knowledge bases, uses this exact pattern: encode, compare, rank. The only difference at scale is using a vector database (FAISS, Pinecone, ChromaDB) instead of brute-force comparison.
def practical_example():
    """A practical use case: finding similar sentences."""
    print("\n" + "=" * 70)
    print("PART 5: PRACTICAL EXAMPLE - FIND SIMILAR CONTENT")
    print("=" * 70)

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Your knowledge base
    documents = [
        "Machine learning is a subset of artificial intelligence",
        "Deep learning uses neural networks with many layers",
        "Python is the most popular programming language for ML",
        "Natural language processing helps computers understand text",
        "Computer vision enables machines to interpret images",
        "Data preprocessing is crucial for model accuracy",
        "Gradient descent optimizes neural network weights",
    ]

    # User query
    query = "How do computers learn from data?"
    print(f"\nQuery: '{query}'")
    print("\nFinding most relevant documents...\n")

    # Encode everything
    doc_embeddings = model.encode(documents)
    query_embedding = model.encode([query])[0]

    # Calculate similarities
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        similarity = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        similarities.append((similarity, i, documents[i]))

    # Sort by similarity (highest first)
    similarities.sort(reverse=True)

    # Show top 3 results
    print("Top 3 Most Relevant Documents:")
    print("-" * 70)
    for rank, (score, idx, doc) in enumerate(similarities[:3], 1):
        print(f"\n{rank}. Score: {score:.4f}")
        print(f"   {doc}")

    print("\n\nINSIGHT:")
    print("Notice how results are ranked by MEANING, not keyword matching!")
    print("The query doesn't contain 'machine learning' but ML doc ranks high.")
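Stripped of the model, the ranking step is just a score computation plus an argsort. A model-free sketch with hand-made unit vectors (the values are hypothetical stand-ins for real embeddings):

```python
import numpy as np

# Toy unit-length "document" embeddings and a "query" embedding
docs = np.array([
    [1.0, 0.0],   # points exactly along the query
    [0.8, 0.6],   # partially aligned with the query
    [0.0, 1.0],   # orthogonal to the query
])
query = np.array([1.0, 0.0])

# For unit vectors, the dot product IS the cosine similarity
scores = docs @ query

# argsort ascending, reverse for descending, slice for top-k
top_k = np.argsort(scores)[::-1][:2]
print(top_k)          # [0 1]
print(scores[top_k])  # [1.  0.8]
```

Vector databases like FAISS perform this same ranking with approximate nearest-neighbor indexes so it stays fast over millions of documents.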
Main - Run All Demonstrations
Execute all five parts in sequence: generate embeddings, measure similarity, perform vector arithmetic, review the math, and build a practical semantic search example. Each part builds on the concepts from the previous one.
def main():
    """Run all demonstrations."""
    print("\n")
    print("=" * 70)
    print("  EMBEDDINGS: TURNING TEXT INTO MEANINGFUL NUMBERS")
    print("=" * 70)

    # Part 1: Generate embeddings
    model, texts, embeddings = introduction_to_embeddings()

    # Part 2: Explore similarity
    explore_similarity(texts, embeddings)

    # Part 3: Vector operations
    vector_operations(model)

    # Part 4: Explain the math
    explain_the_math()

    # Part 5: Practical example
    practical_example()

    print("\n" + "=" * 70)
    print("NEXT STEPS:")
    print("=" * 70)
    print("""
1. Run: semantic_similarity.py - Compare different texts
2. Run: vector_database_demo.py - Store & search embeddings at scale
3. Experiment: Try different models from sentence-transformers
4. Build: Create semantic search for YOUR documents

Key Takeaway:
Embeddings are the bridge between human language and machine understanding.
Master this, and you'll understand how modern AI really works!
""")


if __name__ == "__main__":
    main()