LLM Production Optimization
Overview
Master the art of deploying LLMs cost-effectively and efficiently in production. Learn caching, batching, monitoring, and optimization strategies.
Part of: Phase 8 - MLOps
Prerequisites:
Prompt Engineering (Phase 10)
Local LLMs (Phase 13)
Basic MLOps
Outcome: Deploy optimized, cost-effective LLM applications
What You'll Learn
Cost Optimization
Token usage tracking and reduction
Semantic caching strategies
Batch processing for efficiency
Model selection (cost vs. capability)
Prompt compression techniques
Fallback strategies (cheap → expensive)
Performance Optimization
Response time optimization
Streaming responses
Parallel processing
Prefetching and speculation
Edge deployment
CDN for static responses
Monitoring & Observability
LLM metrics (latency, tokens, cost)
Quality monitoring
Error tracking
User feedback loops
A/B testing
Cost alerts
Infrastructure
Load balancing
Auto-scaling
Rate limiting
Circuit breakers
Retry strategies
Fallback models
# Required packages (uncomment to install)
# !pip install openai redis tiktoken
1. Semantic Caching
Semantic caching uses embedding similarity to serve cached responses for queries that are semantically equivalent, even if the wording differs. This is distinct from provider-level prompt caching (covered in 09_llm_infrastructure.ipynb), which reuses the KV cache at the token level.
How it works: Hash the first few dimensions of the query embedding to create a cache key. Similar queries map to the same bucket and get a cache hit.
import hashlib

def get_embedding_hash(text: str, dimensions: int = 10) -> str:
    """Create a semantic hash for similar queries.

    Uses the first *dimensions* components of the embedding vector.
    In production, use a vector DB similarity search instead of hashing.
    """
    # Placeholder: simulate an embedding with a deterministic hash.
    # In production, replace with:
    #   from openai import OpenAI
    #   client = OpenAI()
    #   response = client.embeddings.create(
    #       model="text-embedding-3-small", input=text
    #   )
    #   embedding = response.data[0].embedding
    return hashlib.sha256(text.lower().strip().encode()).hexdigest()[:16]
# --- In-memory cache for demonstration (swap for Redis in production) ---
_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> tuple[str, int]:
    """Check cache before calling the LLM.

    Returns (response_text, token_count). token_count == 0 on cache hit.
    """
    cache_key = get_embedding_hash(prompt)
    # Check cache
    cached = _cache.get(cache_key)
    if cached:
        return cached, 0  # $0 cost!
    # Simulate an LLM call (replace with real client.chat.completions.create)
    result = f"[LLM response for: {prompt[:60]}]"
    tokens = len(prompt.split()) + len(result.split())
    # Cache the result
    _cache[cache_key] = result
    return result, tokens
# --- Demo ---
result1, cost1 = cached_completion("What is Python?")
print(f"Query 1 → tokens used: {cost1}, response: {result1}")
result2, cost2 = cached_completion("What is Python?")  # exact match → cache hit
print(f"Query 2 → tokens used: {cost2}, response: {result2}")
print(f"\nCache entries: {len(_cache)}")
Production semantic caching with Redis
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_completion(prompt: str, ttl_seconds: int = 3600) -> tuple[str, int]:
    cache_key = get_embedding_hash(prompt)
    cached = cache.get(cache_key)
    if cached:
        return cached.decode(), 0
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    tokens = response.usage.total_tokens
    cache.setex(cache_key, ttl_seconds, result)
    return result, tokens
2. Batch Processing
Process many prompts concurrently rather than one at a time. This can deliver 5-20x throughput improvement depending on API rate limits.
import asyncio
import time

async def _mock_llm_call(prompt: str) -> str:
    """Simulate an LLM API call with ~100ms latency."""
    await asyncio.sleep(0.1)
    return f"[Response for: {prompt[:40]}]"

async def process_batch(prompts: list[str], batch_size: int = 10) -> list[str]:
    """Process prompts in concurrent batches."""
    results: list[str] = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i : i + batch_size]
        tasks = [_mock_llm_call(p) for p in batch]
        responses = await asyncio.gather(*tasks)
        results.extend(responses)
    return results
# --- Demo ---
prompts = [f"Explain concept #{i}" for i in range(20)]

async def _benchmark() -> None:
    # Sequential baseline: one request at a time
    start = time.perf_counter()
    for p in prompts:
        await _mock_llm_call(p)
    seq_time = time.perf_counter() - start
    # Batched: up to 10 requests in flight
    start = time.perf_counter()
    await process_batch(prompts, batch_size=10)
    batch_time = time.perf_counter() - start
    print(f"Sequential: {seq_time:.2f}s ({len(prompts)} prompts)")
    print(f"Batched: {batch_time:.2f}s ({len(prompts)} prompts)")
    print(f"Speedup: {seq_time / batch_time:.1f}x")

asyncio.run(_benchmark())
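Unbounded `gather` can blow past provider rate limits. A semaphore enforces a concurrency cap without giving up parallelism; a sketch with a mock call (the limit of 5 is an arbitrary example, not a real provider quota):

```python
import asyncio

async def _mock_call(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for an API round trip
    return f"[Response for: {prompt}]"

async def bounded_gather(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    """Run all prompts concurrently, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _limited(prompt: str) -> str:
        async with semaphore:
            return await _mock_call(prompt)

    return await asyncio.gather(*(_limited(p) for p in prompts))

results = asyncio.run(bounded_gather([f"q{i}" for i in range(12)]))
print(len(results))
```

Unlike the fixed-size batches above, the semaphore keeps the pipeline full: a new request starts the moment any slot frees up.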
3. Prompt Compression
Reducing token count directly reduces cost. Concise prompts also tend to reduce latency.
def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 chars per token for English)."""
    return max(1, len(text) // 4)
# --- Verbose vs concise ---
verbose_prompt = """
I would like you to please help me understand the following concept.
Could you please explain it in a way that is easy to understand?
I am particularly interested in learning about the main points.
Please make sure to cover all the important aspects.
"""
concise_prompt = "Explain [concept] clearly. Cover main points."
verbose_tokens = estimate_tokens(verbose_prompt)
concise_tokens = estimate_tokens(concise_prompt)
savings_pct = (1 - concise_tokens / verbose_tokens) * 100
print(f"Verbose: ~{verbose_tokens} tokens")
print(f"Concise: ~{concise_tokens} tokens")
print(f"Savings: ~{savings_pct:.0f}%")
print("\n--- Compression techniques ---")
techniques = [
    ("Remove filler words", '"Please kindly help me" → "Help me"'),
    ("Use abbreviations", '"Natural Language Processing" → "NLP"'),
    ("Remove redundancy", '"explain and describe" → "explain"'),
    ("Use structured format", '"In bullet points" (let format do the work)'),
    ("System prompt for style", "Set tone once, don't repeat per message"),
]
for name, example in techniques:
    print(f"  • {name}: {example}")
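The first two techniques can be automated with a simple rewrite pass. A rough sketch (the filler list is illustrative, not exhaustive):

```python
import re

FILLERS = [
    "please kindly", "i would like you to", "could you please",
    "please make sure to", "kindly", "please",
]

def compress_prompt(prompt: str) -> str:
    """Strip common filler phrases and collapse whitespace."""
    result = prompt
    for filler in FILLERS:
        result = re.sub(re.escape(filler), "", result, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", result).strip()

before = "Could you please explain decorators? Please make sure to cover closures."
after = compress_prompt(before)
print(f"{len(before)} chars -> {len(after)} chars")
```

A crude pass like this is safe for boilerplate courtesy phrases; anything more aggressive (dropping examples, summarizing context) should be validated against output quality first.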
4. Model Fallback (Cheap → Expensive)
Route simple queries to cheap models and reserve expensive models for complex tasks.
See also:
09_llm_infrastructure.ipynb for production-grade fallback routing with LiteLLM.
# Pricing reference (per 1M input tokens, approximate, April 2026)
MODEL_TIERS = {
    "simple": {"model": "gpt-4.1-nano", "cost_per_1m": 0.10},
    "medium": {"model": "gpt-4.1-mini", "cost_per_1m": 0.40},
    "complex": {"model": "gpt-4.1", "cost_per_1m": 2.00},
}
COMPLEX_KEYWORDS = {"analyze", "compare", "design", "code", "debug", "architect"}

def estimate_complexity(prompt: str) -> str:
    """Heuristic complexity estimation."""
    prompt_lower = prompt.lower()
    if any(kw in prompt_lower for kw in COMPLEX_KEYWORDS):
        return "complex"
    if len(prompt.split()) > 100:
        return "medium"
    return "simple"
def smart_completion(prompt: str, complexity: str = "auto") -> dict:
    """Use the cheapest model that can handle the task."""
    if complexity == "auto":
        complexity = estimate_complexity(prompt)
    tier = MODEL_TIERS[complexity]
    # In production: client.chat.completions.create(model=tier["model"], ...)
    return {
        "model": tier["model"],
        "cost_per_1m": tier["cost_per_1m"],
        "complexity": complexity,
        "response": f"[{tier['model']} response]",
    }
# --- Demo ---
queries = [
    "What is Python?",
    "Summarize this 500-word article about climate change",
    "Design a microservice architecture for a real-time fraud detection system",
]
for q in queries:
    result = smart_completion(q)
    print(f"[{result['complexity']:>7}] {result['model']:<16} ${result['cost_per_1m']}/1M ← {q[:50]}")
print("\n--- Potential savings ---")
print("If 80% of queries are simple and 20% complex:")
naive_cost = 1_000_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
smart_cost = (
    800_000 * MODEL_TIERS["simple"]["cost_per_1m"] / 1_000_000
    + 200_000 * MODEL_TIERS["complex"]["cost_per_1m"] / 1_000_000
)
print(f"  Naive (all gpt-4.1): ${naive_cost:.2f} per 1M tokens")
print(f"  Smart routing:       ${smart_cost:.2f} per 1M tokens")
print(f"  Savings: {(1 - smart_cost / naive_cost) * 100:.0f}%")
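Routing can be combined with escalation: if the cheap model fails, retry on the next tier instead of failing the request. A sketch with a hypothetical `call_model` stub standing in for a real API call (the "too hard for nano" rule is a made-up failure mode for the demo):

```python
FALLBACK_CHAIN = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]  # cheap -> expensive

def call_model(model: str, prompt: str) -> str:
    """Stub: pretend the cheapest model cannot handle long prompts."""
    if model == "gpt-4.1-nano" and len(prompt.split()) > 20:
        raise RuntimeError("task too hard for nano")
    return f"[{model}] ok"

def completion_with_fallback(prompt: str) -> str:
    """Try each model in the chain, escalating on failure."""
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except Exception as exc:
            last_error = exc  # log, then escalate to the next tier
    raise RuntimeError(f"all models failed: {last_error}")

print(completion_with_fallback("short question"))
print(completion_with_fallback("long " * 30))
```

In production the failure signal is usually an API error, a timeout, or a low-confidence/guardrail check on the cheap model's output, and each escalation should be recorded so routing heuristics can be tuned.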
5. Cost Optimization Reference
Token Reduction Techniques

| Technique | Savings | Effort |
|---|---|---|
| Remove filler words | 10-20% | Low |
| Use abbreviations | 5-15% | Low |
| Compress examples | 20-40% | Medium |
| Smaller model | 50-95% | Medium |
| Semantic caching | 60-90% | High |
| Fine-tuned model | 50-80% | High |
Model Selection Guide (2026)
Simple tasks (FAQ, classification):
└─ GPT-4.1-nano, Claude Haiku 4.5, or local Qwen 3 4B
Medium tasks (summarization, extraction):
└─ GPT-4.1-mini, Gemini Flash, or local Qwen 3 8B
Complex tasks (reasoning, analysis):
└─ Claude Sonnet 4.6, GPT-5.4, or Gemini 3.1 Pro
Very complex (code, math, research):
└─ Claude Opus 4.6, o3, or o4-mini
Caching Strategy Levels
Level 1: Exact match cache (Redis)
├─ Hit rate: 20-30%
├─ Cost savings: High
└─ Latency: <1ms
Level 2: Semantic cache (Vector DB)
├─ Hit rate: 40-60%
├─ Cost savings: Medium
└─ Latency: 10-50ms
Level 3: Prefetch common queries
├─ Hit rate: 10-20%
├─ Cost savings: Medium
└─ Latency: <1ms
6. Production Monitoring
import functools
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    """Metrics for a single LLM invocation."""
    model: str
    tokens_input: int
    tokens_output: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool
    error: str | None = None

# Simple in-memory metrics store (swap for Prometheus / Datadog in prod)
_metrics_log: list[LLMMetrics] = []
def track_llm_call(func):
    """Decorator to track LLM metrics."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result, meta = func(*args, **kwargs)
            _metrics_log.append(
                LLMMetrics(
                    model=meta.get("model", "unknown"),
                    tokens_input=meta.get("tokens_in", 0),
                    tokens_output=meta.get("tokens_out", 0),
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=meta.get("cost", 0.0),
                    cache_hit=meta.get("cache_hit", False),
                )
            )
            return result
        except Exception as exc:
            _metrics_log.append(
                LLMMetrics(
                    model="unknown",
                    tokens_input=0,
                    tokens_output=0,
                    latency_ms=(time.perf_counter() - start) * 1000,
                    cost_usd=0.0,
                    cache_hit=False,
                    error=str(exc),
                )
            )
            raise
    return wrapper
# --- Demo usage ---
@track_llm_call
def my_llm_call(prompt: str):
    """Example tracked function."""
    time.sleep(0.05)  # simulate latency
    return f"Answer to: {prompt[:30]}", {
        "model": "gpt-4.1-mini",
        "tokens_in": len(prompt.split()),
        "tokens_out": 25,
        "cost": 0.00003,
        "cache_hit": False,
    }

# Fire a few calls
for i in range(5):
    my_llm_call(f"Question number {i}")
# Report
total_cost = sum(m.cost_usd for m in _metrics_log)
avg_latency = sum(m.latency_ms for m in _metrics_log) / len(_metrics_log)
total_tokens = sum(m.tokens_input + m.tokens_output for m in _metrics_log)
print(f"Calls: {len(_metrics_log)}")
print(f"Total tokens: {total_tokens}")
print(f"Avg latency: {avg_latency:.1f}ms")
print(f"Total cost: ${total_cost:.5f}")
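The same per-call cost records can drive the cost alerts listed earlier. A minimal sketch over a list of call costs (the $10 daily budget and 80% warning threshold are arbitrary example values):

```python
def check_budget(costs_usd: list[float], daily_budget: float = 10.0,
                 warn_at: float = 0.8) -> str:
    """Return an alert level based on spend so far today."""
    spent = sum(costs_usd)
    if spent >= daily_budget:
        return "CRITICAL: budget exceeded"
    if spent >= warn_at * daily_budget:
        return "WARNING: approaching budget"
    return "OK"

print(check_budget([0.5, 1.2]))   # well under budget
print(check_budget([4.0, 4.5]))   # past the 80% warning line
print(check_budget([6.0, 5.0]))   # over budget
```

In production this check runs on aggregated metrics (e.g. a Prometheus query) and the CRITICAL branch would page someone or throttle non-essential traffic rather than just print.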
7. Best Practices
Cost Optimization
DO
Track all token usage
Cache aggressively
Use cheapest capable model
Compress prompts
Batch similar requests
Set budget alerts
Review costs weekly
DON'T
Use the most expensive model for everything
Ignore caching
Process one-by-one
Keep verbose prompts
Skip monitoring
Forget rate limits
Performance Optimization
DO
Stream responses
Use async/parallel
Implement timeouts
Add retries with backoff
Monitor latency
Use CDN for static content
Prefetch common queries
DON'T
Block on LLM calls
Ignore streaming
Skip error handling
Use synchronous code
Forget timeout limits
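Streaming does not reduce total generation time; it slashes time-to-first-token, which is what users actually perceive. A mock sketch of the difference, using a generator as a stand-in for a streaming API:

```python
import time
from collections.abc import Iterator

def generate_stream(n_tokens: int = 20, per_token_s: float = 0.01) -> Iterator[str]:
    """Mock token stream: yields tokens as they are 'generated'."""
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i} "

# Non-streaming: the user waits for the full response
start = time.perf_counter()
full = "".join(generate_stream())
full_wait = time.perf_counter() - start

# Streaming: the user sees the first token almost immediately
start = time.perf_counter()
stream = generate_stream()
first = next(stream)
first_token_wait = time.perf_counter() - start
rest = "".join(stream)  # remaining tokens arrive while the user reads

print(f"Full response after: {full_wait * 1000:.0f}ms")
print(f"First token after:   {first_token_wait * 1000:.0f}ms")
```

With a real provider SDK the pattern is the same: request a streamed response and forward each chunk to the client as it arrives instead of buffering the whole completion.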
Resources
Tools
Caching Solutions
Redis (in-memory)
GPTCache (semantic)
Momento (managed)
Upstash (serverless)
Optimization Checklist
Track all token usage
Implement semantic caching
Use appropriate model for task
Compress prompts (remove filler)
Batch similar requests
Set up monitoring dashboard
Configure cost alerts
Test fallback strategies
Implement rate limiting
Monitor cache hit rate
Review costs monthly
Optimize based on metrics
Expected Savings
Well-optimized LLM application:

| Technique | Savings |
|---|---|
| Caching | 60-80% cost reduction |
| Model selection | 50-70% cost reduction |
| Prompt optimization | 10-20% cost reduction |
| Streaming | 3-10x latency improvement |
| Batching | 5-20x throughput improvement |

Total: 80-95% cost reduction possible when all techniques are applied!
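The savings compound multiplicatively, not additively: each technique reduces whatever cost remains after the previous one. A quick check using mid-range figures from the table above:

```python
# Each step multiplies the REMAINING fraction of the original cost
remaining = 1.0
for technique, savings in [("caching", 0.70), ("model selection", 0.60),
                           ("prompt optimization", 0.15)]:
    remaining *= 1 - savings
    print(f"after {technique:<20} {remaining:.3f} of original cost remains")

print(f"Combined reduction: {(1 - remaining) * 100:.0f}%")
```

0.30 x 0.40 x 0.85 ≈ 0.10 of the original cost remains, i.e. roughly a 90% reduction, squarely inside the 80-95% range claimed above.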
Start optimizing: Track your current costs first, then apply techniques incrementally.
Measure everything: You can't optimize what you don't measure.