LLM Production Optimization¶

Overview¶

Master the art of deploying LLMs cost-effectively and efficiently in production. Learn caching, batching, monitoring, and optimization strategies.

Part of: Phase 8 - MLOps

Prerequisites:

  • Prompt Engineering (Phase 10)

  • Local LLMs (Phase 13)

  • Basic MLOps

Outcome: Deploy optimized, cost-effective LLM applications

What You'll Learn¶

Cost Optimization¶

  • Token usage tracking and reduction

  • Semantic caching strategies

  • Batch processing for efficiency

  • Model selection (cost vs. capability)

  • Prompt compression techniques

  • Fallback strategies (cheap → expensive)

Performance Optimization¶

  • Response time optimization

  • Streaming responses

  • Parallel processing

  • Prefetching and speculation

  • Edge deployment

  • CDN for static responses

Monitoring & Observability¶

  • LLM metrics (latency, tokens, cost)

  • Quality monitoring

  • Error tracking

  • User feedback loops

  • A/B testing

  • Cost alerts

Infrastructure¶

  • Load balancing

  • Auto-scaling

  • Rate limiting

  • Circuit breakers

  • Retry strategies

  • Fallback models

# Required packages (uncomment to install)
# !pip install openai redis tiktoken

1. Semantic Caching¶

Semantic caching uses embedding similarity to serve cached responses for queries that are semantically equivalent, even if the wording differs. This is distinct from provider-level prompt caching (covered in 09_llm_infrastructure.ipynb), which reuses the KV-cache at the token level.

How it works: embed the query and look up cached responses whose embeddings are nearby. The demo below approximates this with a hash of the normalized query text (effectively an exact-match cache); production systems run a similarity search in a vector database.

import hashlib
import os


def get_embedding_hash(text: str) -> str:
    """Create a cache key for a query.

    Placeholder: hashes the normalized text, so only near-identical
    queries collide.  In production, embed the query and run a vector-DB
    similarity search instead:

        from openai import OpenAI
        client = OpenAI()
        response = client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        embedding = response.data[0].embedding
    """
    return hashlib.sha256(text.lower().strip().encode()).hexdigest()[:16]


# --- In-memory cache for demonstration (swap for Redis in production) ---
_cache: dict[str, str] = {}


def cached_completion(prompt: str) -> tuple[str, int]:
    """Check cache before calling the LLM.

    Returns (response_text, token_count).  token_count == 0 on cache hit.
    """
    cache_key = get_embedding_hash(prompt)

    # Check cache
    cached = _cache.get(cache_key)
    if cached:
        return cached, 0  # $0 cost!

    # Simulate an LLM call (replace with real client.chat.completions.create)
    result = f"[LLM response for: {prompt[:60]}]"
    tokens = len(prompt.split()) + len(result.split())

    # Cache the result
    _cache[cache_key] = result

    return result, tokens


# --- Demo ---
result1, cost1 = cached_completion("What is Python?")
print(f"Query 1 - tokens used: {cost1}, response: {result1}")

result2, cost2 = cached_completion("What is Python?")  # exact match → cache hit
print(f"Query 2 - tokens used: {cost2}, response: {result2}")

print(f"\nCache entries: {len(_cache)}")

Production semantic caching with Redis¶

import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_completion(prompt: str, ttl_seconds: int = 3600):
    cache_key = get_embedding_hash(prompt)
    cached = cache.get(cache_key)
    if cached:
        return cached.decode(), 0

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    tokens = response.usage.total_tokens
    cache.setex(cache_key, ttl_seconds, result)
    return result, tokens

2. Batch Processing¶

Process many prompts concurrently rather than one at a time. This can deliver 5-20x throughput improvement depending on API rate limits.

import asyncio
import time


async def _mock_llm_call(prompt: str) -> str:
    """Simulate an LLM API call with ~100ms latency."""
    await asyncio.sleep(0.1)
    return f"[Response for: {prompt[:40]}]"


async def process_batch(prompts: list[str], batch_size: int = 10) -> list[str]:
    """Process prompts in concurrent batches."""
    results: list[str] = []

    for i in range(0, len(prompts), batch_size):
        batch = prompts[i : i + batch_size]
        tasks = [_mock_llm_call(p) for p in batch]
        responses = await asyncio.gather(*tasks)
        results.extend(responses)

    return results


# --- Demo ---
prompts = [f"Explain concept #{i}" for i in range(20)]

# Compare sequential vs. batched inside a single event loop
async def _compare() -> tuple[float, float]:
    # Sequential baseline: one call at a time
    start = time.perf_counter()
    for p in prompts:
        await _mock_llm_call(p)
    sequential_time = time.perf_counter() - start

    # Batched: 10 concurrent calls at a time
    start = time.perf_counter()
    await process_batch(prompts, batch_size=10)
    return sequential_time, time.perf_counter() - start


seq_time, batch_time = asyncio.run(_compare())

print(f"Sequential: {seq_time:.2f}s  ({len(prompts)} prompts)")
print(f"Batched:    {batch_time:.2f}s  ({len(prompts)} prompts)")
print(f"Speedup:    {seq_time / batch_time:.1f}x")
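
asyncio.gather fires every task in a batch at once, which can trip provider rate limits. One common refinement, sketched here with a local stub in place of a real API call, caps in-flight requests with an asyncio.Semaphore; the limit of 4 is an arbitrary example value.

```python
import asyncio


async def _stub_call(prompt: str) -> str:
    """Local stand-in for an LLM API call."""
    await asyncio.sleep(0.05)
    return f"[Response for: {prompt}]"


async def process_with_limit(prompts: list[str], max_concurrency: int = 4) -> list[str]:
    """Run all prompts concurrently, but never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(prompt: str) -> str:
        async with semaphore:  # waits while max_concurrency calls are in flight
            return await _stub_call(prompt)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(limited(p) for p in prompts))


limited_results = asyncio.run(process_with_limit([f"q{i}" for i in range(12)]))
print(len(limited_results))  # 12
```

Unlike fixed-size batches, the semaphore starts a new call the moment any slot frees up, so throughput is not gated by the slowest call in each batch.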

3. Prompt Compression¶

Reducing token count directly reduces cost. Concise prompts also tend to reduce latency.

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 chars per token for English).

    For exact counts, use tiktoken:
        enc = tiktoken.get_encoding("cl100k_base")
        num_tokens = len(enc.encode(text))
    """
    return max(1, len(text) // 4)


# --- Verbose vs concise ---
verbose_prompt = """
I would like you to please help me understand the following concept.
Could you please explain it in a way that is easy to understand?
I am particularly interested in learning about the main points.
Please make sure to cover all the important aspects.
"""

concise_prompt = "Explain [concept] clearly. Cover main points."

verbose_tokens = estimate_tokens(verbose_prompt)
concise_tokens = estimate_tokens(concise_prompt)
savings_pct = (1 - concise_tokens / verbose_tokens) * 100

print(f"Verbose: ~{verbose_tokens} tokens")
print(f"Concise: ~{concise_tokens} tokens")
print(f"Savings: ~{savings_pct:.0f}%")

print("\n--- Compression techniques ---")
techniques = [
    ("Remove filler words", '"Please kindly help me" → "Help me"'),
    ("Use abbreviations", '"Natural Language Processing" → "NLP"'),
    ("Remove redundancy", '"explain and describe" → "explain"'),
    ("Use structured format", '"In bullet points" (let format do the work)'),
    ("System prompt for style", "Set tone once, don't repeat per message"),
]
for name, example in techniques:
    print(f"  • {name}: {example}")
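
The first technique can be automated with a small filter. A minimal sketch, with an illustrative filler list that should be tuned against your own prompt traffic:

```python
import re

# Illustrative filler phrases -- tune for your own prompt corpus.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bi would like you to\b",
    r"\bcould you\b",
    r"\bmake sure to\b",
]


def compress_prompt(text: str) -> str:
    """Strip common filler phrases and collapse the leftover whitespace."""
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


original = "Could you please kindly explain decorators? Please make sure to cover closures."
compressed = compress_prompt(original)
print(compressed)  # explain decorators? cover closures.
print(f"~{(1 - len(compressed) / len(original)) * 100:.0f}% shorter")
```

Run compressed prompts through an eval set before rollout: over-aggressive stripping can remove politeness markers that some instruction-tuned models actually use as signal.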

4. Model Fallback (Cheap → Expensive)¶

Route simple queries to cheap models and reserve expensive models for complex tasks.

See also: 09_llm_infrastructure.ipynb for production-grade fallback routing with LiteLLM.

# Pricing reference (per 1M input tokens, approximate, April 2026)
MODEL_TIERS = {
    "simple":  {"model": "gpt-4.1-nano", "cost_per_1m": 0.10},
    "medium":  {"model": "gpt-4.1-mini", "cost_per_1m": 0.40},
    "complex": {"model": "gpt-4.1",      "cost_per_1m": 2.00},
}

COMPLEX_KEYWORDS = {"analyze", "compare", "design", "code", "debug", "architect"}


def estimate_complexity(prompt: str) -> str:
    """Heuristic complexity estimation."""
    prompt_lower = prompt.lower()
    if any(kw in prompt_lower for kw in COMPLEX_KEYWORDS):
        return "complex"
    if len(prompt.split()) > 100:
        return "medium"
    return "simple"


def smart_completion(prompt: str, complexity: str = "auto") -> dict:
    """Use the cheapest model that can handle the task."""
    if complexity == "auto":
        complexity = estimate_complexity(prompt)

    tier = MODEL_TIERS[complexity]
    # In production: client.chat.completions.create(model=tier["model"], ...)
    return {
        "model": tier["model"],
        "cost_per_1m": tier["cost_per_1m"],
        "complexity": complexity,
        "response": f"[{tier['model']} response]",
    }


# --- Demo ---
queries = [
    "What is Python?",
    "Summarize this 500-word article about climate change",
    "Design a microservice architecture for a real-time fraud detection system",
]

for q in queries:
    result = smart_completion(q)
    print(f"[{result['complexity']:>7}] {result['model']:<16} ${result['cost_per_1m']}/1M  ← {q[:50]}")

print("\n--- Potential savings ---")
print("If 80% of queries are simple and 20% complex:")
naive_cost = 1_000_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
smart_cost = (
    800_000 * MODEL_TIERS["simple"]["cost_per_1m"] / 1_000_000
    + 200_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
)
print(f"  Naive (all gpt-4.1):  ${naive_cost:.2f} per 1M tokens")
print(f"  Smart routing:        ${smart_cost:.2f} per 1M tokens")
print(f"  Savings:              {(1 - smart_cost / naive_cost) * 100:.0f}%")
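
Routing by estimated complexity is proactive; the cheap → expensive fallback from the overview can also be reactive: try the cheapest model first and escalate only when a call errors out or fails a quality check. A hedged sketch, where completion_with_fallback is a hypothetical helper and the failure condition inside call_model is simulated:

```python
MODEL_CHAIN = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]  # cheapest first


def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call (replace with client.chat.completions.create)."""
    # Simulated failure: pretend the nano model can't handle design tasks.
    if model == "gpt-4.1-nano" and "design" in prompt.lower():
        raise RuntimeError("low-confidence response")
    return f"[{model}] answer to: {prompt[:40]}"


def completion_with_fallback(prompt: str) -> str:
    """Walk the chain cheapest-first, escalating on any error."""
    last_error: Exception | None = None
    for model in MODEL_CHAIN:
        try:
            return call_model(model, prompt)
        except Exception as exc:
            last_error = exc  # escalate to the next, more capable model
    raise RuntimeError(f"all models failed: {last_error}")


print(completion_with_fallback("What is Python?"))          # served by nano
print(completion_with_fallback("Design a fraud pipeline"))  # escalates to mini
```

In production, the except clause would catch specific API errors (rate limits, timeouts) and a response-quality check, not every exception.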

5. Cost Optimization Reference¶

Token Reduction Techniques¶

Technique             Savings   Effort
Remove filler words   10-20%    Low
Use abbreviations     5-15%     Low
Compress examples     20-40%    Medium
Smaller model         50-95%    Medium
Semantic caching      60-90%    High
Fine-tuned model      50-80%    High

Model Selection Guide (2026)¶

Simple tasks (FAQ, classification):
└─ GPT-4.1-nano, Claude Haiku 4.5, or local Qwen 3 4B

Medium tasks (summarization, extraction):
└─ GPT-4.1-mini, Gemini Flash, or local Qwen 3 8B

Complex tasks (reasoning, analysis):
└─ Claude Sonnet 4.6, GPT-5.4, or Gemini 3.1 Pro

Very complex (code, math, research):
└─ Claude Opus 4.6, o3, or o4-mini

Caching Strategy Levels¶

Level 1: Exact match cache (Redis)
├─ Hit rate: 20-30%
├─ Cost savings: High
└─ Latency: <1ms

Level 2: Semantic cache (Vector DB)
├─ Hit rate: 40-60%
├─ Cost savings: Medium
└─ Latency: 10-50ms

Level 3: Prefetch common queries
├─ Hit rate: 10-20%
├─ Cost savings: Medium
└─ Latency: <1ms

6. Production Monitoring¶

import functools
import time
from dataclasses import dataclass, field


@dataclass
class LLMMetrics:
    """Metrics for a single LLM invocation."""

    model: str
    tokens_input: int
    tokens_output: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool
    error: str | None = None


# Simple in-memory metrics store (swap for Prometheus / Datadog in prod)
_metrics_log: list[LLMMetrics] = []


def track_llm_call(func):
    """Decorator to track LLM metrics."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result, meta = func(*args, **kwargs)
            _metrics_log.append(
                LLMMetrics(
                    model=meta.get("model", "unknown"),
                    tokens_input=meta.get("tokens_in", 0),
                    tokens_output=meta.get("tokens_out", 0),
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=meta.get("cost", 0.0),
                    cache_hit=meta.get("cache_hit", False),
                )
            )
            return result
        except Exception as exc:
            _metrics_log.append(
                LLMMetrics(
                    model="unknown",
                    tokens_input=0,
                    tokens_output=0,
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=0.0,
                    cache_hit=False,
                    error=str(exc),
                )
            )
            raise

    return wrapper


# --- Demo usage ---
@track_llm_call
def my_llm_call(prompt: str):
    """Example tracked function."""
    time.sleep(0.05)  # simulate latency
    return f"Answer to: {prompt[:30]}", {
        "model": "gpt-4.1-mini",
        "tokens_in": len(prompt.split()),
        "tokens_out": 25,
        "cost": 0.00003,
        "cache_hit": False,
    }


# Fire a few calls
for i in range(5):
    my_llm_call(f"Question number {i}")

# Report
total_cost = sum(m.cost_usd for m in _metrics_log)
avg_latency = sum(m.latency_ms for m in _metrics_log) / len(_metrics_log)
total_tokens = sum(m.tokens_input + m.tokens_output for m in _metrics_log)

print(f"Calls:         {len(_metrics_log)}")
print(f"Total tokens:  {total_tokens}")
print(f"Avg latency:   {avg_latency:.1f}ms")
print(f"Total cost:    ${total_cost:.5f}")

7. Best Practices¶

Cost Optimization¶

DO

  • Track all token usage

  • Cache aggressively

  • Use cheapest capable model

  • Compress prompts

  • Batch similar requests

  • Set budget alerts

  • Review costs weekly

DON'T

  • Use the most expensive model for everything

  • Ignore caching

  • Process one-by-one

  • Keep verbose prompts

  • Skip monitoring

  • Forget rate limits

Performance Optimization¶

DO

  • Stream responses

  • Use async/parallel

  • Implement timeouts

  • Add retries with backoff

  • Monitor latency

  • Use CDN for static content

  • Prefetch common queries

DON'T

  • Block on LLM calls

  • Ignore streaming

  • Skip error handling

  • Use synchronous code

  • Forget timeout limits

Resources¶

Tools¶

Caching Solutions¶

  • Redis (in-memory)

  • GPTCache (semantic)

  • Momento (managed)

  • Upstash (serverless)

Optimization Checklist¶

  • Track all token usage

  • Implement semantic caching

  • Use appropriate model for task

  • Compress prompts (remove filler)

  • Batch similar requests

  • Set up monitoring dashboard

  • Configure cost alerts

  • Test fallback strategies

  • Implement rate limiting

  • Monitor cache hit rate

  • Review costs monthly

  • Optimize based on metrics

Expected Savings¶

Well-optimized LLM application:

Technique             Savings
Caching               60-80% cost reduction
Model selection       50-70% cost reduction
Prompt optimization   10-20% cost reduction
Streaming             3-10x latency improvement
Batching              5-20x throughput improvement

Total: 80-95% cost reduction possible when all techniques are applied!

Start optimizing: Track your current costs first, then apply techniques incrementally.

Measure everything: You can't optimize what you don't measure.