LLM Production Optimization
Overview
Master the art of deploying LLMs cost-effectively and efficiently in production. Learn caching, batching, monitoring, and optimization strategies.
Part of: Phase 8 - MLOps
Prerequisites:
Prompt Engineering (Phase 10)
Local LLMs (Phase 13)
Basic MLOps
Outcome: Deploy optimized, cost-effective LLM applications
What You'll Learn
Cost Optimization
Token usage tracking and reduction
Semantic caching strategies
Batch processing for efficiency
Model selection (cost vs. capability)
Prompt compression techniques
Fallback strategies (cheap → expensive)
Performance Optimization
Response time optimization
Streaming responses
Parallel processing
Prefetching and speculation
Edge deployment
CDN for static responses
Monitoring & Observability
LLM metrics (latency, tokens, cost)
Quality monitoring
Error tracking
User feedback loops
A/B testing
Cost alerts
Infrastructure
Load balancing
Auto-scaling
Rate limiting
Circuit breakers
Retry strategies
Fallback models
# Required packages (uncomment to install)
# !pip install openai redis tiktoken
1. Semantic Caching
Semantic caching uses embedding similarity to serve cached responses for queries that are semantically equivalent, even if the wording differs. This is distinct from provider-level prompt caching (covered in 09_llm_infrastructure.ipynb), which reuses the KV cache at the token level.
How it works: Hash the first few dimensions of the query embedding to create a cache key. Similar queries map to the same bucket and get a cache hit.
import hashlib

def get_embedding_hash(text: str, dimensions: int = 10) -> str:
    """Create a semantic hash for similar queries.

    Uses the first *dimensions* components of the embedding vector.
    In production, use a vector DB similarity search instead of hashing.
    """
    # Placeholder: simulate an embedding with a deterministic hash.
    # In production, replace with:
    #   from openai import OpenAI
    #   client = OpenAI()
    #   response = client.embeddings.create(
    #       model="text-embedding-3-small", input=text
    #   )
    #   embedding = response.data[0].embedding
    return hashlib.sha256(text.lower().strip().encode()).hexdigest()[:16]
# --- In-memory cache for demonstration (swap for Redis in production) ---
_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> tuple[str, int]:
    """Check cache before calling the LLM.

    Returns (response_text, token_count). token_count == 0 on cache hit.
    """
    cache_key = get_embedding_hash(prompt)
    # Check cache
    cached = _cache.get(cache_key)
    if cached:
        return cached, 0  # $0 cost!
    # Simulate an LLM call (replace with real client.chat.completions.create)
    result = f"[LLM response for: {prompt[:60]}]"
    tokens = len(prompt.split()) + len(result.split())
    # Cache the result
    _cache[cache_key] = result
    return result, tokens
# --- Demo ---
result1, cost1 = cached_completion("What is Python?")
print(f"Query 1 → tokens used: {cost1}, response: {result1}")
result2, cost2 = cached_completion("What is Python?")  # exact match → cache hit
print(f"Query 2 → tokens used: {cost2}, response: {result2}")
print(f"\nCache entries: {len(_cache)}")
Production semantic caching with Redis
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_completion(prompt: str, ttl_seconds: int = 3600) -> tuple[str, int]:
    cache_key = get_embedding_hash(prompt)
    cached = cache.get(cache_key)
    if cached:
        return cached.decode(), 0
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    tokens = response.usage.total_tokens
    cache.setex(cache_key, ttl_seconds, result)
    return result, tokens
2. Batch Processing
Process many prompts concurrently rather than one at a time. This can deliver 5-20x throughput improvement depending on API rate limits.
import asyncio
import time

async def _mock_llm_call(prompt: str) -> str:
    """Simulate an LLM API call with ~100ms latency."""
    await asyncio.sleep(0.1)
    return f"[Response for: {prompt[:40]}]"

async def process_batch(prompts: list[str], batch_size: int = 10) -> list[str]:
    """Process prompts in concurrent batches."""
    results: list[str] = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i : i + batch_size]
        tasks = [_mock_llm_call(p) for p in batch]
        responses = await asyncio.gather(*tasks)
        results.extend(responses)
    return results
# --- Demo ---
prompts = [f"Explain concept #{i}" for i in range(20)]

async def _benchmark() -> None:
    # Sequential baseline: one request at a time
    start = time.perf_counter()
    for p in prompts:
        await _mock_llm_call(p)
    seq_time = time.perf_counter() - start
    # Batched: up to 10 requests in flight
    start = time.perf_counter()
    await process_batch(prompts, batch_size=10)
    batch_time = time.perf_counter() - start
    print(f"Sequential: {seq_time:.2f}s ({len(prompts)} prompts)")
    print(f"Batched: {batch_time:.2f}s ({len(prompts)} prompts)")
    print(f"Speedup: {seq_time / batch_time:.1f}x")

asyncio.run(_benchmark())
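Unbounded `gather` can blow past provider rate limits. A semaphore enforces a concurrency cap without giving up parallelism; a sketch with a mock call (the limit of 5 is an arbitrary example, not a real provider quota):

```python
import asyncio

async def _mock_call(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for an API round trip
    return f"[Response for: {prompt}]"

async def bounded_gather(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    """Run all prompts concurrently, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _limited(prompt: str) -> str:
        async with semaphore:
            return await _mock_call(prompt)

    return await asyncio.gather(*(_limited(p) for p in prompts))

results = asyncio.run(bounded_gather([f"q{i}" for i in range(12)]))
print(len(results))
```

Unlike the fixed-size batches above, the semaphore keeps the pipeline full: a new request starts the moment any slot frees up.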
3. Prompt Compression
Reducing token count directly reduces cost. Concise prompts also tend to reduce latency.
def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 chars per token for English)."""
    return max(1, len(text) // 4)
# --- Verbose vs concise ---
verbose_prompt = """
I would like you to please help me understand the following concept.
Could you please explain it in a way that is easy to understand?
I am particularly interested in learning about the main points.
Please make sure to cover all the important aspects.
"""
concise_prompt = "Explain [concept] clearly. Cover main points."
verbose_tokens = estimate_tokens(verbose_prompt)
concise_tokens = estimate_tokens(concise_prompt)
savings_pct = (1 - concise_tokens / verbose_tokens) * 100
print(f"Verbose: ~{verbose_tokens} tokens")
print(f"Concise: ~{concise_tokens} tokens")
print(f"Savings: ~{savings_pct:.0f}%")
print("\n--- Compression techniques ---")
techniques = [
    ("Remove filler words", '"Please kindly help me" → "Help me"'),
    ("Use abbreviations", '"Natural Language Processing" → "NLP"'),
    ("Remove redundancy", '"explain and describe" → "explain"'),
    ("Use structured format", '"In bullet points" (let format do the work)'),
    ("System prompt for style", "Set tone once, don't repeat per message"),
]
for name, example in techniques:
    print(f"  • {name}: {example}")
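The first two techniques can be automated with a simple rewrite pass. A rough sketch (the filler list is illustrative, not exhaustive):

```python
import re

FILLERS = [
    "please kindly", "i would like you to", "could you please",
    "please make sure to", "kindly", "please",
]

def compress_prompt(prompt: str) -> str:
    """Strip common filler phrases and collapse whitespace."""
    result = prompt
    for filler in FILLERS:
        result = re.sub(re.escape(filler), "", result, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", result).strip()

before = "Could you please explain decorators? Please make sure to cover closures."
after = compress_prompt(before)
print(f"{len(before)} chars -> {len(after)} chars")
```

A crude pass like this is safe for boilerplate courtesy phrases; anything more aggressive (dropping examples, summarizing context) should be validated against output quality first.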
4. Model Fallback (Cheap → Expensive)
Route simple queries to cheap models and reserve expensive models for complex tasks.
See also:
09_llm_infrastructure.ipynb for production-grade fallback routing with LiteLLM.
# Pricing reference (per 1M input tokens, approximate, April 2026)
MODEL_TIERS = {
    "simple": {"model": "gpt-4.1-nano", "cost_per_1m": 0.10},
    "medium": {"model": "gpt-4.1-mini", "cost_per_1m": 0.40},
    "complex": {"model": "gpt-4.1", "cost_per_1m": 2.00},
}
COMPLEX_KEYWORDS = {"analyze", "compare", "design", "code", "debug", "architect"}

def estimate_complexity(prompt: str) -> str:
    """Heuristic complexity estimation."""
    prompt_lower = prompt.lower()
    if any(kw in prompt_lower for kw in COMPLEX_KEYWORDS):
        return "complex"
    if len(prompt.split()) > 100:
        return "medium"
    return "simple"
def smart_completion(prompt: str, complexity: str = "auto") -> dict:
    """Use the cheapest model that can handle the task."""
    if complexity == "auto":
        complexity = estimate_complexity(prompt)
    tier = MODEL_TIERS[complexity]
    # In production: client.chat.completions.create(model=tier["model"], ...)
    return {
        "model": tier["model"],
        "cost_per_1m": tier["cost_per_1m"],
        "complexity": complexity,
        "response": f"[{tier['model']} response]",
    }
# --- Demo ---
queries = [
    "What is Python?",
    "Summarize this 500-word article about climate change",
    "Design a microservice architecture for a real-time fraud detection system",
]
for q in queries:
    result = smart_completion(q)
    print(f"[{result['complexity']:>7}] {result['model']:<16} ${result['cost_per_1m']}/1M ← {q[:50]}")
print("\n--- Potential savings ---")
print("If 80% of queries are simple and 20% complex:")
naive_cost = 1_000_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
smart_cost = (
    800_000 * MODEL_TIERS["simple"]["cost_per_1m"] / 1_000_000
    + 200_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
)
print(f"  Naive (all gpt-4.1): ${naive_cost:.2f} per 1M tokens")
print(f"  Smart routing:       ${smart_cost:.2f} per 1M tokens")
print(f"  Savings: {(1 - smart_cost / naive_cost) * 100:.0f}%")
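Routing can be combined with escalation: if the cheap model fails, retry on the next tier instead of failing the request. A sketch with a hypothetical `call_model` stub standing in for a real API call (the "too hard for nano" rule is a made-up failure mode for the demo):

```python
FALLBACK_CHAIN = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]  # cheap -> expensive

def call_model(model: str, prompt: str) -> str:
    """Stub: pretend the cheapest model cannot handle long prompts."""
    if model == "gpt-4.1-nano" and len(prompt.split()) > 20:
        raise RuntimeError("task too hard for nano")
    return f"[{model}] ok"

def completion_with_fallback(prompt: str) -> str:
    """Try each model in the chain, escalating on failure."""
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except Exception as exc:
            last_error = exc  # log, then escalate to the next tier
    raise RuntimeError(f"all models failed: {last_error}")

print(completion_with_fallback("short question"))
print(completion_with_fallback("long " * 30))
```

In production the failure signal is usually an API error, a timeout, or a low-confidence/guardrail check on the cheap model's output, and each escalation should be recorded so routing heuristics can be tuned.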
5. Cost Optimization Reference
Token Reduction Techniques

| Technique | Savings | Effort |
|---|---|---|
| Remove filler words | 10-20% | Low |
| Use abbreviations | 5-15% | Low |
| Compress examples | 20-40% | Medium |
| Smaller model | 50-95% | Medium |
| Semantic caching | 60-90% | High |
| Fine-tuned model | 50-80% | High |
Model Selection Guide (2026)
Simple tasks (FAQ, classification):
└─ GPT-4.1-nano, Claude Haiku 4.5, or local Qwen 3 4B
Medium tasks (summarization, extraction):
└─ GPT-4.1-mini, Gemini Flash, or local Qwen 3 8B
Complex tasks (reasoning, analysis):
└─ Claude Sonnet 4.6, GPT-5.4, or Gemini 3.1 Pro
Very complex (code, math, research):
└─ Claude Opus 4.6, o3, or o4-mini
Caching Strategy Levels
Level 1: Exact match cache (Redis)
├─ Hit rate: 20-30%
├─ Cost savings: High
└─ Latency: <1ms
Level 2: Semantic cache (Vector DB)
├─ Hit rate: 40-60%
├─ Cost savings: Medium
└─ Latency: 10-50ms
Level 3: Prefetch common queries
├─ Hit rate: 10-20%
├─ Cost savings: Medium
└─ Latency: <1ms
6. Production Monitoring
import functools
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    """Metrics for a single LLM invocation."""
    model: str
    tokens_input: int
    tokens_output: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool
    error: str | None = None

# Simple in-memory metrics store (swap for Prometheus / Datadog in prod)
_metrics_log: list[LLMMetrics] = []
def track_llm_call(func):
    """Decorator to track LLM metrics."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result, meta = func(*args, **kwargs)
            _metrics_log.append(
                LLMMetrics(
                    model=meta.get("model", "unknown"),
                    tokens_input=meta.get("tokens_in", 0),
                    tokens_output=meta.get("tokens_out", 0),
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=meta.get("cost", 0.0),
                    cache_hit=meta.get("cache_hit", False),
                )
            )
            return result
        except Exception as exc:
            _metrics_log.append(
                LLMMetrics(
                    model="unknown",
                    tokens_input=0,
                    tokens_output=0,
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=0.0,
                    cache_hit=False,
                    error=str(exc),
                )
            )
            raise
    return wrapper
# --- Demo usage ---
@track_llm_call
def my_llm_call(prompt: str):
    """Example tracked function."""
    time.sleep(0.05)  # simulate latency
    return f"Answer to: {prompt[:30]}", {
        "model": "gpt-4.1-mini",
        "tokens_in": len(prompt.split()),
        "tokens_out": 25,
        "cost": 0.00003,
        "cache_hit": False,
    }

# Fire a few calls
for i in range(5):
    my_llm_call(f"Question number {i}")
# Report
total_cost = sum(m.cost_usd for m in _metrics_log)
avg_latency = sum(m.latency_ms for m in _metrics_log) / len(_metrics_log)
total_tokens = sum(m.tokens_input + m.tokens_output for m in _metrics_log)
print(f"Calls: {len(_metrics_log)}")
print(f"Total tokens: {total_tokens}")
print(f"Avg latency: {avg_latency:.1f}ms")
print(f"Total cost: ${total_cost:.5f}")
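The same per-call cost records can drive the cost alerts listed earlier. A minimal sketch over a list of call costs (the $10 daily budget and 80% warning threshold are arbitrary example values):

```python
def check_budget(costs_usd: list[float], daily_budget: float = 10.0,
                 warn_at: float = 0.8) -> str:
    """Return an alert level based on spend so far today."""
    spent = sum(costs_usd)
    if spent >= daily_budget:
        return "CRITICAL: budget exceeded"
    if spent >= warn_at * daily_budget:
        return "WARNING: approaching budget"
    return "OK"

print(check_budget([0.5, 1.2]))   # well under budget
print(check_budget([4.0, 4.5]))   # past the 80% warning line
print(check_budget([6.0, 5.0]))   # over budget
```

In production this check runs on aggregated metrics (e.g. a Prometheus query) and the CRITICAL branch would page someone or throttle non-essential traffic rather than just print.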
7. Best Practices
Cost Optimization
DO
Track all token usage
Cache aggressively
Use cheapest capable model
Compress prompts
Batch similar requests
Set budget alerts
Review costs weekly
DON'T
Use the most expensive model for everything
Ignore caching
Process one-by-one
Keep verbose prompts
Skip monitoring
Forget rate limits
Performance Optimization
DO
Stream responses
Use async/parallel
Implement timeouts
Add retries with backoff
Monitor latency
Use CDN for static content
Prefetch common queries
DON'T
Block on LLM calls
Ignore streaming
Skip error handling
Use synchronous code
Forget timeout limits
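Streaming does not reduce total generation time; it slashes time-to-first-token, which is what users actually perceive. A mock sketch of the difference, using a generator as a stand-in for a streaming API:

```python
import time
from collections.abc import Iterator

def generate_stream(n_tokens: int = 20, per_token_s: float = 0.01) -> Iterator[str]:
    """Mock token stream: yields tokens as they are 'generated'."""
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i} "

# Non-streaming: the user waits for the full response
start = time.perf_counter()
full = "".join(generate_stream())
full_wait = time.perf_counter() - start

# Streaming: the user sees the first token almost immediately
start = time.perf_counter()
stream = generate_stream()
first = next(stream)
first_token_wait = time.perf_counter() - start
rest = "".join(stream)  # remaining tokens arrive while the user reads

print(f"Full response after: {full_wait * 1000:.0f}ms")
print(f"First token after:   {first_token_wait * 1000:.0f}ms")
```

With a real provider SDK the pattern is the same: request a streamed response and forward each chunk to the client as it arrives instead of buffering the whole completion.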
Resources
Tools
Caching Solutions
Redis (in-memory)
GPTCache (semantic)
Momento (managed)
Upstash (serverless)
Optimization Checklist
Track all token usage
Implement semantic caching
Use appropriate model for task
Compress prompts (remove filler)
Batch similar requests
Set up monitoring dashboard
Configure cost alerts
Test fallback strategies
Implement rate limiting
Monitor cache hit rate
Review costs monthly
Optimize based on metrics
Expected Savings
Well-optimized LLM application:

| Technique | Savings |
|---|---|
| Caching | 60-80% cost reduction |
| Model selection | 50-70% cost reduction |
| Prompt optimization | 10-20% cost reduction |
| Streaming | 3-10x latency improvement |
| Batching | 5-20x throughput improvement |

Total: 80-95% cost reduction possible when all techniques are applied!
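The savings compound multiplicatively, not additively: each technique reduces whatever cost remains after the previous one. A quick check using mid-range figures from the table above:

```python
# Each step multiplies the REMAINING fraction of the original cost
remaining = 1.0
for technique, savings in [("caching", 0.70), ("model selection", 0.60),
                           ("prompt optimization", 0.15)]:
    remaining *= 1 - savings
    print(f"after {technique:<20} {remaining:.3f} of original cost remains")

print(f"Combined reduction: {(1 - remaining) * 100:.0f}%")
```

0.30 x 0.40 x 0.85 ≈ 0.10 of the original cost remains, i.e. roughly a 90% reduction, squarely inside the 80-95% range claimed above.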
Start optimizing: Track your current costs first, then apply techniques incrementally.
Measure everything: You can't optimize what you don't measure.