OpenAI Embeddings

OpenAI Embeddings Integration Guide

Learn how to use OpenAI’s embedding models (text-embedding-3-small, text-embedding-3-large). They are best suited for production applications that require the highest-quality embeddings.

Install: pip install openai numpy scipy

Setup

Import the OpenAI client, NumPy for vector operations, and SciPy’s cosine distance function. The OpenAI API provides cloud-hosted embedding models that require no local GPU or model downloads – you send text over HTTPS and receive embedding vectors back. An OPENAI_API_KEY environment variable is required for authentication.

import os
from openai import OpenAI
import numpy as np
from scipy.spatial.distance import cosine

Example 1: Basic OpenAI Embeddings

What: Generate an embedding for a single text string using text-embedding-3-small and inspect its properties.

Why: OpenAI’s embedding models are among the highest quality available, consistently ranking near the top of the MTEB (Massive Text Embedding Benchmark). The text-embedding-3-small model produces 1,536-dimensional vectors at a cost of $0.02 per million tokens, an excellent quality-to-cost ratio. Unlike local models, OpenAI embeddings are **pre-normalized** to unit length (\(\|\mathbf{v}\| = 1\)), which means cosine similarity reduces to a simple dot product, speeding up similarity calculations.

Connection: If you are building on the OpenAI ecosystem (GPT-4 for generation, embeddings for retrieval), using the same provider for both simplifies your infrastructure and ensures compatible tokenization.

def example_1_basic_openai_embeddings():
    """Generate embeddings using OpenAI API"""
    print("\n" + "="*70)
    print("Example 1: Basic OpenAI Embeddings")
    print("="*70)
    
    # Initialize client
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    # Single text embedding
    text = "Machine learning is transforming the world"
    
    print(f"\nText: '{text}'")
    print("Generating embedding with OpenAI...")
    
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    
    embedding = response.data[0].embedding
    
    print(f"\nModel: text-embedding-3-small")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")
    print(f"Total tokens used: {response.usage.total_tokens}")
    
    print("\n💡 OpenAI embeddings are normalized (unit vectors)")
    print(f"Embedding norm: {np.linalg.norm(embedding):.6f}")
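
Because the vectors are unit-normalized, the dot product and the full cosine similarity agree. A quick sanity check with synthetic unit vectors standing in for API output (no API call needed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic unit vectors standing in for API-returned embeddings
a = rng.normal(size=1536)
a /= np.linalg.norm(a)
b = rng.normal(size=1536)
b /= np.linalg.norm(b)

dot_sim = float(a @ b)  # fast path, valid only for unit vectors
cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # general cosine

print(f"dot: {dot_sim:.6f}  cosine: {cos_sim:.6f}")
```

For unit vectors the denominator is 1, so both numbers are identical and the cheaper dot product suffices.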

Example 2: Batch Embeddings

What: Embed multiple texts in a single API call by passing a list to the input parameter.

Why: Batch processing is essential for cost and latency optimization. Each API call has network round-trip overhead (typically 100-300ms), so embedding 5 texts in one call is much faster than making 5 separate calls. The OpenAI API supports up to 2,048 texts per batch request, and the total input cannot exceed the model’s token limit. Batching also simplifies rate-limit management since you make fewer requests per second.

Connection: When indexing a large document collection for RAG, you would chunk all documents, batch them into groups of 100-2,000, and embed each batch in a single API call. This can reduce indexing time from hours to minutes.

def example_2_batch_embeddings():
    """Process multiple texts in one API call"""
    print("\n" + "="*70)
    print("Example 2: Batch Processing")
    print("="*70)
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    texts = [
        "Natural language processing",
        "Computer vision and image recognition",
        "Speech recognition systems",
        "Autonomous vehicles",
        "Recommendation algorithms"
    ]
    
    print(f"\nProcessing {len(texts)} texts in one API call...")
    
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    
    embeddings = [item.embedding for item in response.data]
    
    print(f"\nSuccess!")
    print(f"Number of embeddings: {len(embeddings)}")
    print(f"Each embedding dimension: {len(embeddings[0])}")
    print(f"Total tokens used: {response.usage.total_tokens}")
    
    # Calculate all pairwise similarities
    print("\n" + "-"*70)
    print("Similarity Matrix:")
    print("-"*70)
    
    for i in range(len(texts)):
        for j in range(i+1, len(texts)):
            similarity = 1 - cosine(embeddings[i], embeddings[j])
            print(f"\n{i+1}. '{texts[i]}'")
            print(f"   vs")
            print(f"{j+1}. '{texts[j]}'")
            print(f"   Similarity: {similarity:.4f}")
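
For collections larger than the 2,048-item batch limit, the input can be split into chunks before calling the API. A minimal sketch under the same client assumptions as above; `chunked` and `embed_all` are hypothetical helper names:

```python
def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(client, texts, model="text-embedding-3-small", batch_size=2048):
    """Embed an arbitrarily long list of texts, one batch per API call."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # response.data preserves input order within the batch
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```

Keeping batches below the limit also makes it easier to stay under the model’s per-request token cap.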

Example 3: Comparing OpenAI Embedding Models

What: Generate embeddings for the same text using text-embedding-3-small (1,536d), text-embedding-3-large (3,072d), and the legacy text-embedding-ada-002 (1,536d), then compare their dimensions, norms, and cost profiles.

Why: The three models occupy different points on the quality/cost spectrum. text-embedding-3-small offers the best value for most applications. text-embedding-3-large provides the highest quality at 6.5x the cost – worth it for applications where retrieval accuracy directly impacts revenue (e.g., legal search, medical literature). The legacy ada-002 is 5x more expensive than 3-small with lower quality and should be migrated away from.

Connection: The newer 3-small and 3-large models also support the dimensions parameter for Matryoshka Representation Learning, which lets you reduce embedding size while preserving most of the quality – a feature explored in Example 5.

def example_3_comparing_models():
    """Compare different OpenAI embedding models"""
    print("\n" + "="*70)
    print("Example 3: Comparing OpenAI Models")
    print("="*70)
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    text = "Artificial intelligence and machine learning"
    
    models = [
        "text-embedding-3-small",  # 1536 dimensions, $0.02/1M tokens
        "text-embedding-3-large",  # 3072 dimensions, $0.13/1M tokens
        "text-embedding-ada-002",  # 1536 dimensions, $0.10/1M tokens (legacy)
    ]
    
    print(f"\nText: '{text}'")
    print("\nComparing models:")
    print("-"*70)
    
    results = {}
    
    for model in models:
        response = client.embeddings.create(
            model=model,
            input=text
        )
        
        embedding = response.data[0].embedding
        results[model] = embedding
        
        print(f"\n{model}:")
        print(f"  Dimension: {len(embedding)}")
        print(f"  Tokens used: {response.usage.total_tokens}")
        print(f"  Norm: {np.linalg.norm(embedding):.6f}")
        print(f"  Mean: {np.mean(embedding):.6f}")
        print(f"  Std: {np.std(embedding):.6f}")
    
    print("\n" + "-"*70)
    print("Model Recommendations:")
    print("-"*70)
    print("""
- text-embedding-3-small: Best for cost-effective production
- text-embedding-3-large: Best for highest quality
- text-embedding-ada-002: Legacy model, use 3-small instead
    """)

Example 4: Semantic Search With OpenAI Embeddings

What: Build a semantic search engine by embedding a knowledge base and user queries, then ranking documents by cosine similarity.

Why: Semantic search with OpenAI embeddings follows the same encode-compare-rank pattern as local models, but with higher baseline quality. The workflow is: (1) embed all documents once and store the vectors, (2) embed the user query at runtime, (3) compute similarity between the query vector and all document vectors, (4) return the top-\(k\) results. Because OpenAI embeddings are unit-normalized, cosine similarity simplifies to a dot product: \(\text{sim}(\mathbf{q}, \mathbf{d}) = \mathbf{q} \cdot \mathbf{d}\).

Connection: This is the retrieval step in RAG (Retrieval Augmented Generation). In a production RAG system, the document embeddings would be stored in a vector database (Pinecone, Weaviate, Qdrant), and the top-\(k\) retrieved documents would be injected into the LLM prompt as context.

def example_4_semantic_search():
    """Build a semantic search engine with OpenAI embeddings"""
    print("\n" + "="*70)
    print("Example 4: Semantic Search")
    print("="*70)
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    # Knowledge base
    documents = [
        "Python is a high-level programming language",
        "Machine learning models learn from data",
        "Neural networks are inspired by the human brain",
        "JavaScript is used for web development",
        "Deep learning is a subset of machine learning",
        "React is a JavaScript library for building UIs",
        "Transformers revolutionized natural language processing",
        "CSS is used for styling web pages"
    ]
    
    print(f"\nKnowledge base: {len(documents)} documents")
    print("Generating embeddings for all documents...")
    
    # Embed all documents
    doc_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    
    doc_embeddings = [item.embedding for item in doc_response.data]
    
    # Search queries
    queries = [
        "What is machine learning?",
        "Tell me about web development",
        "Explain neural networks"
    ]
    
    # Embed queries
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=queries
    )
    
    query_embeddings = [item.embedding for item in query_response.data]
    
    # Search for each query
    print("\n" + "-"*70)
    print("Search Results:")
    print("-"*70)
    
    for query, query_embedding in zip(queries, query_embeddings):
        print(f"\nQuery: '{query}'")
        print("\nTop 3 Results:")
        
        # Calculate similarities
        similarities = []
        for doc, doc_embedding in zip(documents, doc_embeddings):
            similarity = 1 - cosine(query_embedding, doc_embedding)
            similarities.append((doc, similarity))
        
        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Show top 3
        for i, (doc, sim) in enumerate(similarities[:3], 1):
            print(f"  {i}. [{sim:.4f}] {doc}")
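
With many documents, the per-pair Python loop becomes the bottleneck. Stacking the unit-normalized embeddings into a matrix lets NumPy compute every similarity in one matrix-vector product. A sketch with synthetic vectors in place of API output:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for API embeddings: 8 documents and 1 query, unit-normalized
docs = rng.normal(size=(8, 1536))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=1536)
query /= np.linalg.norm(query)

# One matrix-vector product gives all cosine similarities at once
scores = docs @ query                 # shape (8,)
top3 = np.argsort(scores)[::-1][:3]   # indices of the 3 most similar documents
print("top-3 doc indices:", top3)
```

This is exactly what a vector database does internally, with approximate-nearest-neighbor indexes replacing the exhaustive matrix product at scale.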

Example 5: Reduced Dimensions (Matryoshka Embeddings)

What: Generate embeddings with reduced dimensionality (512 or 256 instead of 1,536) using the dimensions parameter, and compare the storage savings.

Why: The text-embedding-3 models support Matryoshka Representation Learning, where the first \(d\) dimensions of the full embedding contain a valid lower-dimensional representation. This means you can truncate 1,536-d embeddings to 256-d with only a small quality loss, reducing storage by 6x and speeding up similarity search proportionally. For a knowledge base of 1 million documents, this drops storage from ~6.1 GB to ~1.0 GB at float32.

How: Storage per embedding = \(d \times 4\) bytes (float32). For 1M embeddings: 1536d ≈ 6.1 GB, 512d ≈ 2.0 GB, 256d ≈ 1.0 GB.

Connection: Matryoshka embeddings are especially valuable in cost-sensitive production systems. You can use full-dimension embeddings for high-accuracy reranking and reduced dimensions for fast first-pass retrieval – a two-stage approach common in large-scale search.

def example_5_reduced_dimensions():
    """Use reduced dimensions to save storage"""
    print("\n" + "="*70)
    print("Example 5: Reduced Dimensions")
    print("="*70)
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    text = "Reducing embedding dimensions saves storage"
    
    # Standard dimensions
    print("\nGenerating embeddings with different dimensions:")
    
    dimensions_to_test = [1536, 512, 256]
    
    for dim in dimensions_to_test:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
            dimensions=dim  # Reduce dimensions
        )
        
        embedding = response.data[0].embedding
        
        # Storage calculation (decimal units: 1 GB = 1e9 bytes)
        storage_bytes = len(embedding) * 4  # 4 bytes per float32
        storage_gb_per_million = (storage_bytes * 1_000_000) / 1e9
        
        print(f"\nDimensions: {dim}")
        print(f"  Actual length: {len(embedding)}")
        print(f"  Storage per embedding: {storage_bytes} bytes")
        print(f"  Storage per 1M embeddings: {storage_gb_per_million:.2f} GB")
    
    print("\n" + "-"*70)
    print("Trade-off:")
    print("-"*70)
    print("""
- Higher dimensions = Better quality, more storage
- Lower dimensions = Faster search, less storage
- 256-512 dimensions often sufficient for most tasks
- Test with your data to find optimal dimension
    """)
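
The same Matryoshka property means a full-length embedding can also be truncated client-side. A truncated slice of a unit vector is shorter than unit length, so it must be renormalized before dot-product similarity is meaningful. A sketch with a synthetic vector standing in for an API embedding:

```python
import numpy as np

rng = np.random.default_rng(7)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)  # stand-in for a unit-norm API embedding

# Keep only the first 256 dimensions, then renormalize to unit length
truncated = full[:256]
truncated = truncated / np.linalg.norm(truncated)

print(len(truncated), round(float(np.linalg.norm(truncated)), 6))
```

This lets you store full embeddings once and derive cheaper low-dimensional copies later, instead of re-calling the API with a smaller dimensions value.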

Example 6: Caching Strategy

What: Implement a hash-based in-memory cache that avoids redundant API calls for previously embedded texts.

Why: Embedding APIs charge per token, and many workloads contain duplicate or repeated texts (e.g., the same FAQ question asked by different users, or re-indexing unchanged documents). A simple MD5-hash cache can eliminate 30-60% of API calls in typical workloads. The cache key combines the model name and input text, so switching models does not cause incorrect cache hits. In production, you would persist this cache to Redis or a database.

Connection: Caching is a fundamental cost-optimization technique for any API-dependent system. Combined with batching (Example 2) and reduced dimensions (Example 5), caching can reduce embedding costs by 80%+ compared to naive implementations.

def example_6_caching_strategy():
    """Implement caching to reduce API costs"""
    print("\n" + "="*70)
    print("Example 6: Caching Strategy")
    print("="*70)
    
    import hashlib
    import json
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    # Simple file-based cache
    cache = {}
    
    def get_cached_embedding(text, model="text-embedding-3-small"):
        """Get embedding with caching"""
        # Create cache key
        cache_key = hashlib.md5(f"{model}:{text}".encode()).hexdigest()
        
        # Check cache
        if cache_key in cache:
            print(f"  ✅ Cache hit for: '{text[:50]}...'")
            return cache[cache_key]
        
        # Generate embedding
        print(f"  🔄 API call for: '{text[:50]}...'")
        response = client.embeddings.create(
            model=model,
            input=text
        )
        
        embedding = response.data[0].embedding
        cache[cache_key] = embedding
        
        return embedding
    
    # Test caching
    texts = [
        "Machine learning is amazing",
        "Deep learning uses neural networks",
        "Machine learning is amazing",  # Duplicate!
        "Python is great for ML",
        "Machine learning is amazing",  # Duplicate again!
    ]
    
    print("\nProcessing texts with caching:")
    print("-"*70)
    
    for text in texts:
        embedding = get_cached_embedding(text)
    
    print(f"\n" + "-"*70)
    print("Cache Statistics:")
    print("-"*70)
    print(f"Total texts processed: {len(texts)}")
    print(f"Unique texts in cache: {len(cache)}")
    print(f"API calls saved: {len(texts) - len(cache)}")
    print(f"Cost reduction: {((len(texts) - len(cache)) / len(texts) * 100):.1f}%")
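
To survive process restarts, the same cache can be persisted. A minimal stdlib-only sketch (the `DiskCache` name and JSON layout are illustrative, not production code; Redis or a database would replace this in a real system):

```python
import hashlib
import json
import os
import tempfile

class DiskCache:
    """Tiny JSON-backed embedding cache keyed by md5(model:text)."""
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def _key(self, model, text):
        return hashlib.md5(f"{model}:{text}".encode()).hexdigest()

    def get(self, model, text):
        return self.data.get(self._key(model, text))

    def put(self, model, text, embedding):
        self.data[self._key(model, text)] = embedding
        with open(self.path, "w") as f:
            json.dump(self.data, f)

# Round-trip demo with a fake embedding instead of an API call
path = os.path.join(tempfile.mkdtemp(), "emb_cache.json")
cache = DiskCache(path)
cache.put("text-embedding-3-small", "hello", [0.1, 0.2, 0.3])
reloaded = DiskCache(path)  # fresh instance reads the file from disk
print(reloaded.get("text-embedding-3-small", "hello"))
```

Writing the whole file on every put is fine for small caches; at scale you would batch writes or use a key-value store.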

Example 7: Cost Estimation

What: Calculate estimated yearly costs for three workload scenarios (small, medium, large) across all three OpenAI embedding models.

Why: Embedding costs can grow quickly at scale. A β€œsmall app” with 10K documents and 100K monthly queries costs pennies, but a β€œlarge app” with 1M documents and 10M monthly queries can cost hundreds of dollars per year. Understanding these numbers upfront lets you make informed decisions about model selection, caching strategy, and whether to use a local model instead. The key variables are: number of documents, average tokens per document, monthly query volume, and price per token.

Connection: For high-volume applications (>10M queries/month), local models like all-MiniLM-L6-v2 or BGE-base often provide better economics despite lower quality, because inference cost is fixed (hardware) rather than per-query (API).

def example_7_cost_estimation():
    """Estimate costs for different workloads"""
    print("\n" + "="*70)
    print("Example 7: Cost Estimation")
    print("="*70)
    
    # Pricing (as of 2024)
    pricing = {
        "text-embedding-3-small": 0.02 / 1_000_000,  # per token
        "text-embedding-3-large": 0.13 / 1_000_000,  # per token
        "text-embedding-ada-002": 0.10 / 1_000_000,  # per token
    }
    
    # Workload scenarios
    scenarios = [
        {
            "name": "Small App",
            "documents": 10_000,
            "avg_tokens": 100,
            "monthly_queries": 100_000,
        },
        {
            "name": "Medium App",
            "documents": 100_000,
            "avg_tokens": 200,
            "monthly_queries": 1_000_000,
        },
        {
            "name": "Large App",
            "documents": 1_000_000,
            "avg_tokens": 300,
            "monthly_queries": 10_000_000,
        }
    ]
    
    print("\nCost Estimates:")
    print("="*70)
    
    for scenario in scenarios:
        print(f"\n{scenario['name']}:")
        print(f"  Documents: {scenario['documents']:,}")
        print(f"  Avg tokens per doc: {scenario['avg_tokens']}")
        print(f"  Monthly queries: {scenario['monthly_queries']:,}")
        
        # One-time document embedding cost
        doc_tokens = scenario['documents'] * scenario['avg_tokens']
        
        # Monthly query cost
        query_tokens = scenario['monthly_queries'] * scenario['avg_tokens']
        
        print("\n  Monthly Costs:")
        for model, price_per_token in pricing.items():
            doc_cost = doc_tokens * price_per_token
            query_cost = query_tokens * price_per_token
            # Year 1 = one-time document embedding + 12 months of query traffic
            yearly_cost = doc_cost + (query_cost * 12)
            
            print(f"\n    {model}:")
            print(f"      One-time docs: ${doc_cost:.2f}")
            print(f"      Monthly queries: ${query_cost:.2f}")
            print(f"      Year 1 total: ${yearly_cost:.2f}")

Example 8: Error Handling

What: Implement a robust get_embedding_with_retry() function that handles rate limits with exponential backoff and gracefully catches API errors.

Why: Production systems must handle API failures gracefully. The most common errors are: RateLimitError (you are sending too many requests per minute – back off and retry), APIError (transient server-side issue – retry after a pause), and AuthenticationError (invalid API key – do not retry). Exponential backoff (\(2^{\text{attempt}}\) seconds) prevents thundering-herd problems where all retries fire simultaneously.

Connection: The retry-with-backoff pattern is universal across cloud APIs (AWS, GCP, OpenAI, etc.). Libraries like tenacity or backoff provide decorators that implement this pattern declaratively, reducing boilerplate in production code.

def example_8_error_handling():
    """Handle API errors gracefully"""
    print("\n" + "="*70)
    print("Example 8: Error Handling")
    print("="*70)
    
    from openai import OpenAIError, RateLimitError, APIError
    import time
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    def get_embedding_with_retry(text, max_retries=3):
        """Get embedding with automatic retry on errors"""
        for attempt in range(max_retries):
            try:
                response = client.embeddings.create(
                    model="text-embedding-3-small",
                    input=text
                )
                return response.data[0].embedding
            
            except RateLimitError:
                if attempt < max_retries - 1:
                    wait_time = (2 ** attempt)  # Exponential backoff
                    print(f"  ⚠️  Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    print("  ❌ Rate limit exceeded, max retries reached")
                    raise
            
            except APIError as e:
                print(f"  ❌ API error: {e}")
                if attempt < max_retries - 1:
                    time.sleep(1)
                else:
                    raise
            
            except OpenAIError as e:
                print(f"  ❌ OpenAI error: {e}")
                raise
    
    print("\nTesting error handling:")
    
    # Test with valid text
    text = "This should work fine"
    print(f"\nText: '{text}'")
    embedding = get_embedding_with_retry(text)
    print(f"✅ Success! Embedding dimension: {len(embedding)}")
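
The retry logic above can be factored into a reusable decorator. A stdlib-only sketch of the same exponential-backoff pattern (in real code, a library like tenacity plays this role; `with_backoff` is a hypothetical name):

```python
import functools
import time

def with_backoff(max_retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Retry a function with exponential backoff on the given exceptions."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of retries: propagate the error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

# Demo: a function that fails twice, then succeeds on the third attempt
calls = {"n": 0}

@with_backoff(max_retries=3, base_delay=0.01, retry_on=(RuntimeError,))
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()
print(result, "after", calls["n"], "attempts")
```

Keeping the retriable exception types explicit (e.g. RateLimitError but not AuthenticationError) preserves the fail-fast behavior of the hand-written version.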

Main – Run All Examples

Execute all eight examples in sequence. Requires a valid OPENAI_API_KEY environment variable. The examples progress from basic API usage through production best practices (batching, caching, cost estimation, error handling).

def main():
    """Run all examples"""
    print("\n" + "="*70)
    print("OPENAI EMBEDDINGS GUIDE")
    print("="*70)
    
    # Check for API key
    if not os.getenv("OPENAI_API_KEY"):
        print("\n❌ ERROR: OPENAI_API_KEY environment variable not set")
        print("\nTo use this guide:")
        print("  1. Get an API key from https://platform.openai.com/api-keys")
        print("  2. Set environment variable:")
        print("     export OPENAI_API_KEY='your-key-here'")
        print("  3. Run this script again")
        return
    
    print("\nThis guide covers:")
    print("  1. Basic OpenAI embeddings")
    print("  2. Batch processing")
    print("  3. Comparing models")
    print("  4. Semantic search")
    print("  5. Reduced dimensions")
    print("  6. Caching strategy")
    print("  7. Cost estimation")
    print("  8. Error handling")
    
    try:
        example_1_basic_openai_embeddings()
        example_2_batch_embeddings()
        example_3_comparing_models()
        example_4_semantic_search()
        example_5_reduced_dimensions()
        example_6_caching_strategy()
        example_7_cost_estimation()
        example_8_error_handling()
        
        print("\n" + "="*70)
        print("KEY TAKEAWAYS")
        print("="*70)
        print("""
1. **OpenAI Advantages**:
   - Highest quality embeddings
   - No infrastructure needed
   - Always up-to-date models
   - Simple API

2. **Best Practices**:
   - Use batch processing (up to 2048 texts per call)
   - Implement caching for repeated texts
   - Handle rate limits with exponential backoff
   - Monitor costs with usage tracking

3. **Model Selection**:
   - text-embedding-3-small: Best cost/quality balance
   - text-embedding-3-large: When quality is critical
   - Consider reduced dimensions to save storage

4. **Cost Optimization**:
   - Cache embeddings for repeated texts
   - Use smaller dimensions when possible
   - Batch API calls efficiently
   - Monitor usage regularly

5. **When to Use OpenAI**:
   ✅ Need highest quality
   ✅ No ML infrastructure
   ✅ Rapid prototyping
   ✅ Moderate volume (<10M/month)
   
   ❌ High volume (expensive)
   ❌ Sensitive data (privacy concerns)
   ❌ Need offline processing
   ❌ Tight budget

6. **Production Tips**:
   - Implement robust error handling
   - Set up monitoring and alerting
   - Have fallback to local models
   - Track costs per feature
        """)
        
    except Exception as e:
        print(f"\n❌ Error: {e}")
        print("\nMake sure you have:")
        print("  1. Valid OPENAI_API_KEY set")
        print("  2. Installed: pip install openai numpy scipy")


if __name__ == "__main__":
    main()