1. Understanding Vectors and Embeddings

Vectors are numerical representations of data. In AI/ML, we convert text, images, and other data into dense vectors (embeddings).

import numpy as np

# Example: Simple word embeddings (in practice, these come from models)
embeddings = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "man": np.array([0.7, 0.3, 0.2]),
    "woman": np.array([0.65, 0.35, 0.25]),
    "cat": np.array([0.1, 0.2, 0.9]),
    "dog": np.array([0.15, 0.25, 0.85])
}

print("Word Embeddings (3D vectors):")
for word, vec in embeddings.items():
    print(f"{word:8} -> {vec}")

2. Similarity Metrics

Measuring Distance in Embedding Space

Storing vectors is only useful if we can measure how close they are. Similarity metrics are the mathematical functions that quantify the “closeness” of two vectors, and choosing the right metric has a direct impact on search quality. The three most common metrics in vector databases are cosine similarity, Euclidean (L2) distance, and dot product. Each captures a different notion of similarity: cosine focuses purely on direction, Euclidean considers both direction and magnitude, and dot product blends alignment with magnitude. Understanding these trade-offs is essential because most vector database APIs require you to choose a metric when creating a collection.

2.1 Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes entirely. It is defined as:

\[\text{cosine\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}\]

The result ranges from \(-1\) (opposite directions) through \(0\) (orthogonal) to \(1\) (identical direction). Because it normalizes by magnitude, cosine similarity is the safe default for text embeddings: it works whether or not the model emits unit-length vectors, and it is the metric most embedding-model documentation recommends. In semantic search, a cosine score near 1 means two texts share strong topical or contextual overlap.

def cosine_sim(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compare similar words
print("Cosine Similarity Examples:")
print(f"king vs queen:  {cosine_sim(embeddings['king'], embeddings['queen']):.4f}")
print(f"man vs woman:   {cosine_sim(embeddings['man'], embeddings['woman']):.4f}")
print(f"cat vs dog:     {cosine_sim(embeddings['cat'], embeddings['dog']):.4f}")
print(f"king vs cat:    {cosine_sim(embeddings['king'], embeddings['cat']):.4f}")
print(f"man vs dog:     {cosine_sim(embeddings['man'], embeddings['dog']):.4f}")

2.2 Euclidean Distance (L2)

Euclidean distance measures the straight-line (L2) distance between two points in the embedding space:

\[d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}\]

Lower values mean higher similarity. Unlike cosine, Euclidean distance is sensitive to the magnitude of the vectors, so two vectors pointing in the same direction but with different lengths will register a non-zero distance. When embeddings are unit-normalized, Euclidean distance and cosine similarity produce the same ranking, since \(d^2 = 2 - 2\,\text{cosine\_sim}\). It is also a widely supported metric in approximate-nearest-neighbor indexes such as HNSW and IVF-Flat.

def euclidean_dist(vec1, vec2):
    """Calculate Euclidean (L2) distance between two vectors"""
    return np.linalg.norm(vec1 - vec2)

print("\nEuclidean Distance Examples:")
print(f"king vs queen:  {euclidean_dist(embeddings['king'], embeddings['queen']):.4f}")
print(f"man vs woman:   {euclidean_dist(embeddings['man'], embeddings['woman']):.4f}")
print(f"cat vs dog:     {euclidean_dist(embeddings['cat'], embeddings['dog']):.4f}")
print(f"king vs cat:    {euclidean_dist(embeddings['king'], embeddings['cat']):.4f}")
print(f"man vs dog:     {euclidean_dist(embeddings['man'], embeddings['dog']):.4f}")

2.3 Dot Product (Inner Product)

The dot product (inner product) combines both directional alignment and vector magnitude into a single score:

\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i \, b_i = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta\]

Higher values indicate greater similarity. When vectors are unit-normalized, the dot product is numerically identical to cosine similarity. The dot product is favored in retrieval systems where the embedding model intentionally encodes “importance” or “relevance strength” in the vector magnitude, such as some dense passage retrieval models. It is also the fastest metric to compute because it requires no square-root or division operations.

print("\nDot Product Examples:")
print(f"king vs queen:  {np.dot(embeddings['king'], embeddings['queen']):.4f}")
print(f"man vs woman:   {np.dot(embeddings['man'], embeddings['woman']):.4f}")
print(f"cat vs dog:     {np.dot(embeddings['cat'], embeddings['dog']):.4f}")
print(f"king vs cat:    {np.dot(embeddings['king'], embeddings['cat']):.4f}")
print(f"man vs dog:     {np.dot(embeddings['man'], embeddings['dog']):.4f}")
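The claim that the dot product of unit-normalized vectors equals cosine similarity is easy to verify numerically. A minimal check, reusing two of the toy embeddings from above:

```python
import numpy as np

# Toy embeddings (same values as the dictionary above)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.82, 0.15])

# Unit-normalize both vectors
king_n = king / np.linalg.norm(king)
queen_n = queen / np.linalg.norm(queen)

# Cosine similarity of the original vectors
cos = np.dot(king, queen) / (np.linalg.norm(king) * np.linalg.norm(queen))

# Dot product of the normalized vectors matches cosine similarity
print(np.dot(king_n, queen_n), cos)
```

This is why many systems normalize embeddings at ingestion time and then use the cheaper dot product at query time.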

3. Why Vector Databases?

Traditional databases are optimized for exact matches. Vector databases are optimized for similarity search.

Use Cases:

  • Semantic search (“find similar documents”)

  • Recommendation systems

  • RAG (Retrieval-Augmented Generation)

  • Image/video search

  • Anomaly detection
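The exact-match vs. similarity-search contrast can be sketched in a few lines (the vectors and query below are illustrative stand-ins, not real model outputs): a key-value store returns nothing for an unseen key, while a vector store always returns the nearest neighbor.

```python
import numpy as np

# Exact-match store: a key either exists or it doesn't
kv_store = {"machine learning": "doc1", "cooking recipes": "doc2"}
print(kv_store.get("AI"))  # no exact key match -> None

# Vector store: every query has a nearest neighbor
vectors = {"doc1": np.array([0.9, 0.1]),   # "machine learning"
           "doc2": np.array([0.1, 0.9])}   # "cooking recipes"
query = np.array([0.8, 0.2])               # "AI" would embed near doc1

def nearest(q, store):
    # Brute-force: pick the id with the highest cosine similarity
    return max(store, key=lambda k: np.dot(q, store[k]) /
               (np.linalg.norm(q) * np.linalg.norm(store[k])))

print(nearest(query, vectors))  # -> doc1
```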

4. Simple Vector Database Implementation

Building a Naive Vector Store from Scratch

Before reaching for production tools, it is instructive to build a minimal vector database in pure Python. The implementation below stores vectors in a dictionary, supports CRUD operations (add, search, update, delete), and performs brute-force search by computing the similarity between the query and every stored vector. Brute-force search has \(O(n)\) complexity per query, which is fine for small collections but becomes prohibitive at scale – this is exactly the problem that production vector databases solve with approximate nearest-neighbor (ANN) indexing structures like HNSW and IVF.

class SimpleVectorDB:
    def __init__(self, metric='cosine'):
        self.vectors = {}
        self.metadata = {}
        self.metric = metric
    
    def add(self, id, vector, metadata=None):
        """Add a vector to the database"""
        self.vectors[id] = np.array(vector)
        if metadata:
            self.metadata[id] = metadata
    
    def search(self, query_vector, top_k=5):
        """Find the most similar vectors"""
        query_vector = np.array(query_vector)
        similarities = []
        
        for id, vector in self.vectors.items():
            if self.metric == 'cosine':
                score = cosine_sim(query_vector, vector)
            elif self.metric == 'euclidean':
                score = -euclidean_dist(query_vector, vector)  # Negative for sorting
            else:  # dot product
                score = np.dot(query_vector, vector)
            
            similarities.append((id, score, self.metadata.get(id)))
        
        # Sort by score (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
    
    def delete(self, id):
        """Delete a vector from the database"""
        if id in self.vectors:
            del self.vectors[id]
            if id in self.metadata:
                del self.metadata[id]
    
    def update(self, id, vector=None, metadata=None):
        """Update a vector or its metadata"""
        if id in self.vectors:
            if vector is not None:
                self.vectors[id] = np.array(vector)
            if metadata is not None:
                self.metadata[id] = metadata

print("✅ SimpleVectorDB class created")
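The per-item Python loop inside search is the simplest form of brute force. As a sketch (separate from the class above), the same \(O(n)\) scan can be vectorized with a single NumPy matrix multiplication, which is typically far faster in wall-clock terms even though the asymptotic complexity is unchanged:

```python
import numpy as np

def batch_cosine_search(query, matrix, ids, top_k=3):
    """Vectorized brute-force cosine search.

    matrix: (n, d) array with one stored vector per row.
    Returns the top_k (id, score) pairs, highest score first.
    """
    # Normalize the query and every row, then one matmul yields all scores
    q = query / np.linalg.norm(query)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = (matrix / norms) @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(ids[i], float(scores[i])) for i in order]

# Tiny demo with the toy word embeddings from earlier
ids = ["king", "queen", "cat"]
matrix = np.array([[0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.2, 0.9]])
print(batch_cosine_search(np.array([0.9, 0.8, 0.1]), matrix, ids, top_k=2))
```

Production vector databases take this further by avoiding the full scan entirely via ANN indexes.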

5. Using Our Simple Vector Database

Adding Documents and Running Queries

With the SimpleVectorDB class defined, we can now populate it with documents and test similarity search. Each document is represented by its embedding vector plus optional metadata (the original text, a category label, etc.). When we search, the database computes the chosen similarity metric between the query vector and every stored vector, sorts by score, and returns the top-\(k\) results. Metadata is returned alongside scores so the caller can display human-readable results. In a production system, metadata filtering (e.g., “only documents in category ML”) would be applied before or after the vector similarity ranking.

# Create database
db = SimpleVectorDB(metric='cosine')

# Add vectors with metadata
documents = [
    {"id": "doc1", "text": "Machine learning is a subset of AI", "category": "ML"},
    {"id": "doc2", "text": "Deep learning uses neural networks", "category": "DL"},
    {"id": "doc3", "text": "Natural language processing handles text", "category": "NLP"},
    {"id": "doc4", "text": "Computer vision analyzes images", "category": "CV"},
    {"id": "doc5", "text": "Reinforcement learning learns from rewards", "category": "RL"},
]

# Simulate embeddings (in practice, use a real embedding model)
np.random.seed(42)
for doc in documents:
    # Fake embeddings for demo
    vector = np.random.random(128)
    db.add(doc["id"], vector, {"text": doc["text"], "category": doc["category"]})

print(f"✅ Added {len(documents)} documents to database")
# Search for similar documents
query_vector = np.random.random(128)  # In practice, embed your query

results = db.search(query_vector, top_k=3)

print("Search Results:")
print("=" * 70)
for i, (doc_id, score, metadata) in enumerate(results, 1):
    print(f"{i}. Score: {score:.4f}")
    print(f"   ID: {doc_id}")
    print(f"   Text: {metadata['text']}")
    print(f"   Category: {metadata['category']}")
    print()
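The metadata filtering mentioned above can be sketched as a thin post-filter on top of any search function that returns (id, score, metadata) tuples. The stubbed results and the overfetch factor below are illustrative assumptions; post-filtering must over-fetch because filtering after ranking can discard results.

```python
# Post-filter: over-fetch from the vector search, then keep only
# results whose metadata satisfies the predicate.
def filtered_search(search_fn, query_vector, predicate, top_k=3, overfetch=10):
    """search_fn returns (id, score, metadata) tuples sorted by score."""
    candidates = search_fn(query_vector, top_k=top_k * overfetch)
    kept = [r for r in candidates if r[2] is not None and predicate(r[2])]
    return kept[:top_k]

# Demo against a stubbed search function (scores are made up)
def fake_search(query_vector, top_k=5):
    results = [
        ("doc1", 0.92, {"category": "ML"}),
        ("doc2", 0.88, {"category": "DL"}),
        ("doc3", 0.75, {"category": "ML"}),
    ]
    return results[:top_k]

print(filtered_search(fake_search, None, lambda m: m["category"] == "ML", top_k=2))
# Keeps only the ML documents, still ordered by score
```

Pre-filtering (restricting the candidate set before the vector scan) avoids the over-fetch problem and is what engines like Qdrant do natively.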

6. Update and Delete Operations

Maintaining a Living Database

Real-world vector databases are not write-once stores. Documents get corrected, enriched, or removed over time, so the database must support update (replace a vector or its metadata in place) and delete (remove a vector entirely). Our simple implementation mutates the in-memory dictionaries directly. Production databases like Qdrant and Milvus handle these operations while also updating their ANN index structures, which adds complexity but keeps search performance stable.

# Update metadata
db.update("doc1", metadata={"text": "Machine learning is AI", "category": "ML", "updated": True})
print("✅ Updated doc1")

# Delete a document
db.delete("doc5")
print("✅ Deleted doc5")

print(f"\nRemaining documents: {len(db.vectors)}")

7. Real-World Example with Sentence Embeddings

From Toy Vectors to Semantic Understanding

The random vectors used earlier do not carry any meaning. To perform genuine semantic search – where “artificial intelligence” matches “machine learning” – we need embeddings generated by a pre-trained language model. The SentenceTransformer model (all-MiniLM-L6-v2) maps each sentence to a 384-dimensional vector that places semantically similar texts close together in the embedding space. By storing these model-generated embeddings in our vector database, the same brute-force search now retrieves documents by meaning rather than by keyword overlap.

# Install sentence-transformers if needed
# !pip install sentence-transformers

from sentence_transformers import SentenceTransformer

# Load a real embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("✅ Model loaded")
# Create database with real embeddings
db_real = SimpleVectorDB(metric='cosine')

documents = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning trains models on data",
    "Deep learning uses neural networks",
    "Dogs are loyal pets",
    "Cats are independent animals",
    "The sun rises in the east",
    "Water boils at 100 degrees Celsius"
]

# Generate embeddings and add to database
for i, doc in enumerate(documents):
    embedding = model.encode(doc)
    db_real.add(f"doc{i}", embedding, {"text": doc})

print(f"✅ Added {len(documents)} documents with real embeddings")
# Semantic search with real query
queries = [
    "What is artificial intelligence?",
    "Tell me about programming languages",
    "What are some common pets?"
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("=" * 70)
    
    query_embedding = model.encode(query)
    results = db_real.search(query_embedding, top_k=3)
    
    for i, (doc_id, score, metadata) in enumerate(results, 1):
        print(f"{i}. Score: {score:.4f} - {metadata['text']}")

8. Performance Comparison: Different Metrics

Choosing the Right Metric for Your Data

Different similarity metrics can produce different rankings for the same query, especially when vectors vary in magnitude. The comparison below runs the same semantic search query using cosine similarity, Euclidean distance, and dot product, letting you see how the top results and their scores change. For normalized embeddings (which all-MiniLM-L6-v2 produces), cosine and dot product will agree closely. In practice, always use the metric recommended by your embedding model’s documentation to get the best retrieval quality.

import time

# Test different metrics
query = "neural networks for deep learning"
query_embedding = model.encode(query)

metrics = ['cosine', 'euclidean', 'dot']

print(f"Query: '{query}'")
print("\nResults by metric:")
print("=" * 70)

# Encode each document once; re-encoding inside the metric loop wastes time
doc_embeddings = [model.encode(doc) for doc in documents]

for metric in metrics:
    db_test = SimpleVectorDB(metric=metric)
    
    # Add the pre-computed embeddings
    for i, embedding in enumerate(doc_embeddings):
        db_test.add(f"doc{i}", embedding, {"text": documents[i]})
    
    # Search
    start = time.time()
    results = db_test.search(query_embedding, top_k=3)
    elapsed = time.time() - start
    
    print(f"\n{metric.upper()} (took {elapsed*1000:.2f}ms):")
    for i, (doc_id, score, metadata) in enumerate(results, 1):
        print(f"  {i}. Score: {score:.4f} - {metadata['text']}")

Key Takeaways

  1. Vectors represent data as numerical arrays

  2. Similarity metrics measure how close vectors are:

    • Cosine: angle between vectors (direction only)

    • Euclidean: straight-line distance (direction and magnitude)

    • Dot product: alignment scaled by magnitude

  3. Vector databases optimize similarity search at scale

  4. Real embeddings from models like Sentence Transformers capture semantic meaning

  5. CRUD operations: Create (add), Read (search), Update, Delete

Next Steps

  • Try production vector databases (Pinecone, Chroma, Qdrant)

  • Learn about indexing (HNSW, IVF)

  • Build RAG applications

  • Explore hybrid search (vector + keyword)