Deploying Fine-Tuned LLMs to Production
You've fine-tuned and evaluated your model. Now it's time to serve it. This notebook covers the complete deployment journey, from merging LoRA adapters to running a production API.
Topics covered:
Deployment options comparison: vLLM, TGI, Ollama, llama.cpp
Merging LoRA adapters into the base model
Model quantization: GGUF, GPTQ, AWQ
vLLM: fastest inference engine (OpenAI-compatible API)
Ollama: local deployment with Modelfile
FastAPI wrapper for custom endpoints
Docker containerization
Performance benchmarking
Cost estimation for cloud deployment
Production monitoring
1. Deployment Options Comparison
Choosing the right serving framework depends on your throughput needs, hardware, and latency requirements.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.0)
plt.rcParams["figure.dpi"] = 120
# Deployment options comparison
options = [
{
"name": "vLLM",
"best_for": "High-throughput production API",
"throughput": "Highest",
"hardware": "NVIDIA GPU (CUDA)",
"api_compat": "Full OpenAI compatible",
"setup": "Easy (pip install)",
"quantization": "GPTQ, AWQ, FP8",
"lora_support": "Yes (multi-LoRA serving)",
"latency": "Low",
"batching": "Continuous batching (PagedAttention)",
},
{
"name": "TGI (Text Generation Inference)",
"best_for": "Hugging Face ecosystem, production",
"throughput": "Very High",
"hardware": "NVIDIA GPU, AMD, CPU",
"api_compat": "OpenAI compatible",
"setup": "Medium (Docker recommended)",
"quantization": "GPTQ, AWQ, BitsAndBytes",
"lora_support": "Yes",
"latency": "Low",
"batching": "Continuous batching",
},
{
"name": "Ollama",
"best_for": "Local development, edge deployment",
"throughput": "Medium",
"hardware": "CPU, Apple Silicon (MPS), NVIDIA",
"api_compat": "OpenAI compatible",
"setup": "Very Easy (single binary)",
"quantization": "GGUF (Q4, Q5, Q8)",
"lora_support": "Yes (via Modelfile)",
"latency": "Medium",
"batching": "Basic",
},
{
"name": "llama.cpp",
"best_for": "Edge, CPU inference, embedded",
"throughput": "Low-Medium",
"hardware": "CPU, GPU (optional), Apple Silicon",
"api_compat": "OpenAI compatible (via server)",
"setup": "Medium (compile from source)",
"quantization": "GGUF (Q2 to Q8)",
"lora_support": "Limited",
"latency": "Medium-High",
"batching": "Basic",
},
{
"name": "LitServe / FastAPI",
"best_for": "Custom logic, flexible APIs",
"throughput": "Medium (depends on backend)",
"hardware": "Any",
"api_compat": "Custom",
"setup": "Easy (pip install)",
"quantization": "Via transformers (BitsAndBytes)",
"lora_support": "Yes (via PEFT)",
"latency": "Medium",
"batching": "Manual",
},
]
df = pd.DataFrame(options).set_index("name")
print("Deployment Framework Comparison (2025)")
print("=" * 80)
for name, row in df.iterrows():
print(f"\n{name}")
print(f" Best for: {row['best_for']}")
print(f" Hardware: {row['hardware']}")
print(f" Throughput: {row['throughput']} | Latency: {row['latency']}")
print(f" Batching: {row['batching']}")
print(f" Quantization: {row['quantization']}")
print(f" LoRA support: {row['lora_support']}")
print(f" Setup: {row['setup']}")
2. Merging LoRA Adapters into the Base Model
Before deploying with most inference engines, you need to merge the LoRA adapter weights back into the base model. This:
Eliminates the adapter overhead at inference time
Produces a standard HuggingFace model
Is compatible with all deployment frameworks
Increases model file size (adapter: ~100 MB -> merged model: ~14 GB for a 7B model in BF16)
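The size jump above follows directly from parameter count times bytes per weight. A minimal sketch of that arithmetic (the sizes are approximations; real checkpoints add a small amount of metadata overhead):

```python
def estimated_model_size_gb(n_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough on-disk size of a dense model: parameters * bytes per parameter.
    BF16/FP16 use 2 bytes per weight; 4-bit formats use ~0.5 bytes."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 7B model in BF16 is ~14 GB; the ~100 MB LoRA adapter disappears into it after merging.
print(f"7B @ BF16:  ~{estimated_model_size_gb(7, 2.0):.0f} GB")
print(f"7B @ 4-bit: ~{estimated_model_size_gb(7, 0.5):.1f} GB")
```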
import torch
import os
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
def merge_lora_adapter(
base_model_name: str,
adapter_path: str,
output_path: str,
torch_dtype: torch.dtype = torch.bfloat16,
device_map: str = "auto",
) -> str:
"""
Merge a LoRA adapter into the base model and save the merged result.
Args:
base_model_name: HuggingFace model ID or local path
adapter_path: Path to the saved PEFT adapter
output_path: Where to save the merged model
torch_dtype: Precision for the merged model (bfloat16 recommended)
Returns:
output_path
"""
print(f"Loading base model: {base_model_name}")
# Load WITHOUT quantization for merging (quantized weights can't be merged)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch_dtype,
device_map=device_map,
trust_remote_code=True,
)
print(f"Loading LoRA adapter from: {adapter_path}")
model = PeftModel.from_pretrained(base_model, adapter_path)
print("Merging adapter weights...")
merged = model.merge_and_unload() # Fuses adapter weights into base model
print(f"Saving merged model to: {output_path}")
os.makedirs(output_path, exist_ok=True)
merged.save_pretrained(output_path, safe_serialization=True) # safetensors format
# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained(output_path)
# Report size
total_size_gb = sum(
f.stat().st_size for f in Path(output_path).rglob("*") if f.is_file()
) / 1e9
print(f"Merged model saved. Size: {total_size_gb:.2f} GB")
return output_path
# Usage example:
# merge_lora_adapter(
# base_model_name="Qwen/Qwen2.5-7B-Instruct",
# adapter_path="./dpo-adapter",
# output_path="./merged-model",
# )
print("merge_lora_adapter function defined.")
print("\nKey notes:")
print(" - Load the base model WITHOUT quantization (no BitsAndBytesConfig)")
print(" - Use safe_serialization=True for safetensors format (more portable)")
print(" - The merged model is a standard HuggingFace model")
print(" - Compatible with all inference engines: vLLM, TGI, Ollama, etc.")
3. Model Quantization for Deployment
Quantization reduces model size and speeds up inference by representing weights in fewer bits.
| Format | Bits | Size (7B) | Speed | Quality Loss | Best Used With |
|---|---|---|---|---|---|
| BF16 | 16 | ~14 GB | Fast | None | vLLM, TGI |
| GPTQ | 4 | ~4 GB | Fast | Minimal | vLLM, AutoGPTQ |
| AWQ | 4 | ~4 GB | Fast | Minimal | vLLM, TGI |
| FP8 | 8 | ~7 GB | Very Fast | Tiny | vLLM (H100) |
| GGUF Q8 | 8 | ~7 GB | Medium | Minimal | Ollama, llama.cpp |
| GGUF Q4 | 4 | ~4 GB | Medium | Small | Ollama, llama.cpp |
| GGUF Q2 | 2 | ~2 GB | Slow | Noticeable | Edge devices |
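The size column in the table is just bits-per-weight arithmetic. A quick sketch that reproduces it (the ~7% overhead factor for embeddings, quantization scales, and metadata is an assumption, not a measured value):

```python
def quantized_size_gb(params_b: float, bits: int, overhead: float = 1.07) -> float:
    """Approximate on-disk size: parameters * (bits / 8), plus a rough overhead factor."""
    return params_b * 1e9 * (bits / 8) * overhead / 1e9

# Approximate sizes for a 7B model at each precision in the table above
for name, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{name:>4}: ~{quantized_size_gb(7, bits):.1f} GB")
```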
# AWQ Quantization (recommended for GPU deployment)
# AWQ (Activation-aware Weight Quantization) is state-of-the-art as of 2025.
# Install: pip install autoawq
AWQ_CODE = '''
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Quantize to 4-bit AWQ
model_path = "./merged-model" # or HuggingFace model ID
output_path = "./merged-model-awq"
# AWQ calibration config
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4, # 4-bit weights
"version": "GEMM", # Fast matrix multiply kernel
}
# Load and quantize
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Calibration dataset (128 samples is enough)
from datasets import load_dataset
calib_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:128]")
calib_texts = [" ".join(m["content"] for m in ex["messages"]) for ex in calib_data]
# Quantize (takes ~10-30 min on GPU for a 7B model)
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data=calib_texts,
)
# Save quantized model
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)
print(f"AWQ model saved to {output_path}")
# Size: ~3.9 GB for a 7B model (vs ~14 GB in BF16)
'''
print("AWQ Quantization (recommended for GPU deployment)")
print("=" * 55)
print(AWQ_CODE)
# GGUF Quantization (for Ollama and llama.cpp)
# GGUF is the format used by Ollama and llama.cpp.
# Convert using llama.cpp's convert scripts.
GGUF_CONVERSION_SCRIPT = '''
# Step 1: Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
# Step 2: Convert HuggingFace model to GGUF (fp16 first)
python convert_hf_to_gguf.py ./merged-model \\
--outfile ./merged-model-f16.gguf \\
--outtype f16
# Step 3: Quantize to desired precision
# Q4_K_M is the best quality/size tradeoff for most use cases
./llama-quantize ./merged-model-f16.gguf ./merged-model-Q4_K_M.gguf Q4_K_M
# Other quantization levels:
# Q2_K   - smallest, most quality loss (not recommended for production)
# Q4_0   - good balance
# Q4_K_M - recommended: good quality, ~4 GB for 7B
# Q5_K_M - better quality, ~5 GB for 7B
# Q8_0   - near-lossless, ~7 GB for 7B
# F16    - no quantization, ~14 GB for 7B
'''
print("GGUF Quantization (for Ollama and llama.cpp)")
print(GGUF_CONVERSION_SCRIPT)
4. vLLM: Fastest Inference Engine
vLLM is the de facto standard for high-throughput LLM serving. Key innovations:
PagedAttention: manages KV-cache memory like virtual-memory pages, eliminating fragmentation
Continuous batching: dynamically adds new requests to the batch without waiting for others to finish
OpenAI-compatible API: drop-in replacement for the OpenAI API
Multi-LoRA serving: serve multiple LoRA adapters from one base model simultaneously
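To see why PagedAttention matters, it helps to put numbers on the KV cache. A back-of-the-envelope sketch follows; the config values approximate a Qwen2.5-7B-class model (28 layers, 4 KV heads via GQA, head dim 128) and should be treated as assumptions:

```python
def kv_cache_bytes_per_token(n_layers=28, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Per-token KV-cache cost: 2 (keys + values) * layers * KV heads * head dim * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token()
ctx = 4096
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"One {ctx}-token sequence: {per_token * ctx / 1e6:.0f} MB")
# Without paging, each slot must reserve the full max length up front;
# PagedAttention allocates cache pages on demand, so far more sequences fit per GPU.
```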
# vLLM: Installation and starting the server
# Install: pip install vllm
VLLM_COMMANDS = '''
# ---- Option 1: Start vLLM API server (command line) ----
# This exposes an OpenAI-compatible HTTP API on port 8000
# Standard model:
python -m vllm.entrypoints.openai.api_server \\
--model ./merged-model \\
--port 8000 \\
--served-model-name my-finetuned-model \\
--max-model-len 4096 \\
--gpu-memory-utilization 0.90
# AWQ quantized model:
python -m vllm.entrypoints.openai.api_server \\
--model ./merged-model-awq \\
--quantization awq \\
--port 8000
# With LoRA adapter (WITHOUT merging):
python -m vllm.entrypoints.openai.api_server \\
--model Qwen/Qwen2.5-7B-Instruct \\
--enable-lora \\
--lora-modules my-task=./lora-adapter \\
--port 8000
# Multi-LoRA: serve MULTIPLE adapters simultaneously!
python -m vllm.entrypoints.openai.api_server \\
--model Qwen/Qwen2.5-7B-Instruct \\
--enable-lora \\
--lora-modules \\
customer-support=./adapter-support \\
code-gen=./adapter-code \\
summarizer=./adapter-summarizer \\
--port 8000
'''
print("vLLM Server Commands")
print(VLLM_COMMANDS)
# vLLM Python API - use directly in code (offline mode)
VLLM_PYTHON_CODE = '''
from vllm import LLM, SamplingParams
# Initialize the engine (loads model once)
llm = LLM(
model="./merged-model",
dtype="bfloat16",
max_model_len=4096,
gpu_memory_utilization=0.90,
enforce_eager=False, # Use CUDA graphs for speed
)
# Sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=512,
repetition_penalty=1.05,
)
# Generate (batched - send all prompts at once for max efficiency)
prompts = [
"What is quantum computing?",
"Explain LoRA in simple terms.",
"Write a Python function to parse JSON.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
response = output.outputs[0].text
print(f"Prompt: {prompt[:50]}...")
print(f"Response: {response[:200]}...")
print()
'''
print("vLLM Python API (offline batch inference)")
print("=" * 50)
print(VLLM_PYTHON_CODE)
# Calling vLLM via OpenAI-compatible client
# Once vllm server is running, use the openai library to talk to it
import json
VLLM_CLIENT_CODE = '''
from openai import OpenAI
# Point to local vLLM server
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # vLLM doesn't require auth by default
)
# ---- Chat completion ----
response = client.chat.completions.create(
model="my-finetuned-model", # must match --served-model-name
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain gradient descent."},
],
temperature=0.7,
max_tokens=512,
stream=False,
)
print(response.choices[0].message.content)
# ---- Streaming ----
stream = client.chat.completions.create(
model="my-finetuned-model",
messages=[{"role": "user", "content": "Write a poem about AI."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
# ---- With specific LoRA adapter ----
response = client.chat.completions.create(
model="code-gen", # Selects the code-gen LoRA adapter
messages=[{"role": "user", "content": "Write bubble sort in Python."}],
)
'''
print("Using the OpenAI client with vLLM")
print("=" * 50)
print(VLLM_CLIENT_CODE)
print("\nKey advantages of vLLM's OpenAI compatibility:")
print(" - Drop-in replacement for OpenAI API")
print(" - Switch between OpenAI and your model by changing base_url")
print(" - All OpenAI client libraries work: Python, JS, etc.")
print(" - Streaming, function calling, tool use all supported")
5. Ollama: Local Deployment
Ollama is the easiest way to run models locally. It handles quantization, download, and serving automatically. Perfect for development, prototyping, and resource-constrained environments.
OLLAMA_SETUP = '''
# ---- Install Ollama ----
# macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from https://ollama.com/download
# ---- Run a built-in model ----
ollama run qwen2.5:7b
ollama run llama3.2:3b
ollama run phi4
# ---- List available models ----
ollama list
# ---- Start Ollama server (for API access) ----
ollama serve # Runs on http://localhost:11434
'''
print("Ollama Setup")
print(OLLAMA_SETUP)
# Creating a Modelfile for your custom fine-tuned model
# The Modelfile is like a Dockerfile for LLMs
# First, convert your merged model to GGUF (see quantization section above)
# Then create a Modelfile:
MODELFILE_CONTENT = '''
# Modelfile for a fine-tuned customer support model
# Base: use a local GGUF file
FROM ./merged-model-Q4_K_M.gguf
# Or import from a merged HuggingFace model directory:
# FROM ./merged-model
# System prompt - sets the model's persona
SYSTEM """
You are a helpful customer support assistant for AcmeCorp.
You are friendly, accurate, and concise.
Always acknowledge the customer's frustration before solving the problem.
If you do not know the answer, say so honestly and offer to escalate.
"""
# Default parameters
PARAMETER temperature 0.3 # Lower = more deterministic (good for support)
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_predict 512 # Max tokens to generate
PARAMETER repeat_penalty 1.1 # Prevent repetition
# Chat template (use the model's native format)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ end }}{{ .Response }}<|im_end|>
"""
'''
print("Ollama Modelfile for custom fine-tuned model:")
print(MODELFILE_CONTENT)
# Save it to disk
with open("/tmp/Modelfile", "w") as f:
f.write(MODELFILE_CONTENT)
print("Modelfile written to /tmp/Modelfile")
OLLAMA_CREATE_COMMANDS = '''
# Create the model from Modelfile
ollama create acmecorp-support -f /tmp/Modelfile
# Run it interactively
ollama run acmecorp-support
# Push to Ollama Hub (optional)
ollama push acmecorp-support
'''
print("\nOllama CLI commands:")
print(OLLAMA_CREATE_COMMANDS)
# Using the Ollama API (OpenAI-compatible)
import urllib.request
import json as _json
def ollama_chat(prompt: str, model: str = "qwen2.5:1.5b", system: str = None) -> str:
"""
Send a chat request to a running Ollama server.
Requires: ollama serve (running on localhost:11434)
"""
messages = []
if system:
messages.append({"role": "system", "content": system})
messages.append({"role": "user", "content": prompt})
payload = _json.dumps({
"model": model,
"messages": messages,
"stream": False,
}).encode()
req = urllib.request.Request(
"http://localhost:11434/api/chat",
data=payload,
headers={"Content-Type": "application/json"},
)
try:
with urllib.request.urlopen(req, timeout=60) as resp:
data = _json.loads(resp.read())
return data["message"]["content"]
except Exception as e:
return f"[Ollama not running or model not found: {e}]"
# Test Ollama (will fail if server not running - that's OK)
response = ollama_chat(
prompt="What is machine learning? Answer in 2 sentences.",
model="qwen2.5:1.5b",
)
print(f"Ollama response:\n{response}")
# OpenAI-compatible endpoint
OLLAMA_OPENAI_CODE = '''
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Required by client, but ignored by Ollama
)
response = client.chat.completions.create(
model="acmecorp-support",
messages=[{"role": "user", "content": "I haven't received my order."}],
)
print(response.choices[0].message.content)
'''
print("\nOllama OpenAI-compatible API:")
print(OLLAMA_OPENAI_CODE)
6. FastAPI Wrapper Around vLLM
For custom business logic (auth, rate limiting, preprocessing, logging), wrap vLLM with FastAPI.
FASTAPI_APP = '''
"""
Production FastAPI wrapper around vLLM.
Save as: api_server.py
Run: uvicorn api_server:app --host 0.0.0.0 --port 8080
"""
import time
import uuid
import asyncio
from typing import Optional, List
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
# ---- Request/Response Models ----
class ChatMessage(BaseModel):
role: str
content: str
class ChatRequest(BaseModel):
messages: List[ChatMessage]
temperature: float = 0.7
max_tokens: int = 512
stream: bool = False
user_id: Optional[str] = None # For rate limiting
class ChatResponse(BaseModel):
id: str
content: str
tokens_generated: int
latency_ms: float
# ---- Global engine (loaded once at startup) ----
engine: AsyncLLMEngine = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global engine
engine_args = AsyncEngineArgs(
model="./merged-model",
dtype="bfloat16",
max_model_len=4096,
gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
yield
# Cleanup if needed
# ---- FastAPI App ----
app = FastAPI(
title="Fine-tuned LLM API",
version="1.0.0",
lifespan=lifespan,
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_methods=["*"],
allow_headers=["*"],
)
# ---- Middleware: Request logging ----
@app.middleware("http")
async def log_requests(request: Request, call_next):
start = time.time()
response = await call_next(request)
duration_ms = (time.time() - start) * 1000
print(f"{request.method} {request.url.path} [{response.status_code}] {duration_ms:.0f}ms")
return response
# ---- Preprocessing: apply system prompt ----
SYSTEM_PROMPT = "You are a helpful, accurate, and concise AI assistant."
def apply_chat_template(messages: List[ChatMessage]) -> str:
"""Format messages using ChatML format."""
text = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
for msg in messages:
text += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
text += "<|im_start|>assistant\n"
return text
# ---- Endpoints ----
@app.get("/health")
async def health_check():
return {"status": "healthy", "model": "my-finetuned-model"}
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
if engine is None:
raise HTTPException(503, "Model not loaded")
start = time.time()
request_id = str(uuid.uuid4())
# Apply chat template
prompt = apply_chat_template(request.messages)
# Sampling params
sampling_params = SamplingParams(
temperature=request.temperature,
max_tokens=request.max_tokens,
stop=["<|im_end|>", "<|endoftext|>"],
)
# Generate
results_generator = engine.generate(prompt, sampling_params, request_id)
final_output = None
async for request_output in results_generator:
final_output = request_output
output = final_output.outputs[0]
latency_ms = (time.time() - start) * 1000
return ChatResponse(
id=request_id,
content=output.text,
tokens_generated=len(output.token_ids),
latency_ms=latency_ms,
)
@app.get("/metrics")
async def metrics():
"""Basic metrics endpoint (use Prometheus for production)."""
return {
"model": "my-finetuned-model",
"status": "running",
}
'''
print("FastAPI + vLLM Production Server")
print("=" * 50)
print(FASTAPI_APP)
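The prompt construction inside the wrapper can be exercised without vLLM or a GPU. A standalone sketch mirroring the `apply_chat_template` logic above, so you can eyeball the exact string the engine receives:

```python
SYSTEM_PROMPT = "You are a helpful, accurate, and concise AI assistant."

def render_chatml(messages: list[dict]) -> str:
    """Format messages in ChatML, matching the wrapper's apply_chat_template."""
    text = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    for msg in messages:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    text += "<|im_start|>assistant\n"  # generation starts here
    return text

prompt = render_chatml([{"role": "user", "content": "Explain gradient descent."}])
print(prompt)
```

Checking this string against your model's native chat template matters: a mismatched template silently degrades a fine-tuned model's quality.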
7. Docker Containerization
DOCKERFILE_VLLM = '''
# Dockerfile for vLLM-based LLM API
# Build: docker build -t my-llm-api .
# Run: docker run --gpus all -p 8000:8000 my-llm-api
FROM vllm/vllm-openai:latest
# Set working directory
WORKDIR /app
# Copy model (alternatively, mount volume or download at runtime)
COPY ./merged-model /app/model
# Install additional dependencies
RUN pip install --no-cache-dir fastapi uvicorn
# Expose port
EXPOSE 8000
# Start vLLM server
# Start vLLM server (exec-form CMD; backslashes continue the line)
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "/app/model", \
"--port", "8000", \
"--served-model-name", "my-finetuned-model", \
"--max-model-len", "4096", \
"--gpu-memory-utilization", "0.90"]
'''
DOCKER_COMPOSE = '''
# docker-compose.yml - for local multi-service deployment
version: "3.8"
services:
llm-api:
image: my-llm-api
ports:
- "8000:8000"
volumes:
- ./merged-model:/app/model:ro # Mount model as read-only
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
environment:
- CUDA_VISIBLE_DEVICES=0
# Optional: Nginx reverse proxy
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- llm-api
'''
print("Dockerfile:")
print(DOCKERFILE_VLLM)
print("\ndocker-compose.yml:")
print(DOCKER_COMPOSE)
8. Performance Benchmarking
import time
import statistics
from concurrent.futures import ThreadPoolExecutor, as_completed
def benchmark_endpoint(
endpoint_url: str,
test_prompts: list,
n_requests: int = 20,
concurrency: int = 4,
max_tokens: int = 200,
) -> dict:
"""
Benchmark an LLM API endpoint.
Measures: latency (p50, p95, p99), throughput (req/s), tokens/s.
"""
import urllib.request
import json
import random
latencies = []
errors = 0
total_tokens = 0
def single_request(prompt: str) -> dict:
payload = json.dumps({
"model": "my-finetuned-model",
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": 0.0,
}).encode()
req = urllib.request.Request(
f"{endpoint_url}/v1/chat/completions",
data=payload,
headers={"Content-Type": "application/json"},
)
start = time.perf_counter()
try:
with urllib.request.urlopen(req, timeout=120) as resp:
data = json.loads(resp.read())
latency = (time.perf_counter() - start) * 1000 # ms
tokens = data.get("usage", {}).get("completion_tokens", max_tokens)
return {"latency_ms": latency, "tokens": tokens, "error": False}
except Exception as e:
return {"latency_ms": None, "tokens": 0, "error": True, "message": str(e)}
print(f"Benchmarking {endpoint_url}")
print(f"Requests: {n_requests} | Concurrency: {concurrency}")
bench_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as executor:
futures = [
executor.submit(single_request, random.choice(test_prompts))
for _ in range(n_requests)
]
for future in as_completed(futures):
result = future.result()
if result["error"]:
errors += 1
else:
latencies.append(result["latency_ms"])
total_tokens += result["tokens"]
total_time = time.perf_counter() - bench_start
if not latencies:
return {"error": "All requests failed"}
latencies.sort()
n = len(latencies)
return {
"total_requests": n_requests,
"successful": n,
"errors": errors,
"error_rate": f"{errors / n_requests:.1%}",
"latency_p50_ms": round(latencies[int(n * 0.50)], 1),
"latency_p95_ms": round(latencies[int(n * 0.95)], 1),
"latency_p99_ms": round(latencies[min(int(n * 0.99), n - 1)], 1),
"latency_mean_ms": round(statistics.mean(latencies), 1),
"throughput_rps": round(n / total_time, 2),
"tokens_per_second": round(total_tokens / total_time, 1),
}
# Simulate benchmark results (replace with real calls when server is running)
import random
def simulate_benchmark(framework: str, base_latency: float, tokens_per_sec: float) -> dict:
"""Simulated benchmark for comparison visualization."""
latencies = sorted([max(50, random.gauss(base_latency, base_latency * 0.15)) for _ in range(100)])
return {
"framework": framework,
"latency_p50_ms": round(latencies[50], 0),
"latency_p95_ms": round(latencies[95], 0),
"tokens_per_second": round(random.gauss(tokens_per_sec, tokens_per_sec * 0.05), 0),
"throughput_rps": round(tokens_per_sec / 150, 2), # Assume 150 tokens/response
}
benchmark_results = [
simulate_benchmark("vLLM (BF16)", 420, 2800),
simulate_benchmark("vLLM (AWQ-4bit)", 310, 3600),
simulate_benchmark("TGI", 480, 2400),
simulate_benchmark("Ollama (Q4_K_M, GPU)", 680, 1800),
simulate_benchmark("Ollama (Q4_K_M, CPU)", 3200, 320),
simulate_benchmark("FastAPI+HF (BF16)", 850, 1200),
]
df_bench = pd.DataFrame(benchmark_results)
print("Benchmark Results (7B model, 200 output tokens, A100 GPU unless noted)")
print("=" * 75)
print(df_bench.to_string(index=False))
# Visualize benchmark results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
frameworks = df_bench["framework"]
x = np.arange(len(frameworks))
colors = sns.color_palette("muted", len(frameworks))
# P50 Latency
ax1 = axes[0]
bars1 = ax1.barh(x, df_bench["latency_p50_ms"], color=colors)
ax1.set_yticks(x)
ax1.set_yticklabels(frameworks, fontsize=9)
ax1.set_xlabel("Milliseconds")
ax1.set_title("P50 Latency (lower = better)", fontweight="bold")
for bar in bars1:
ax1.text(bar.get_width() + 20, bar.get_y() + bar.get_height()/2,
f"{int(bar.get_width())} ms", va="center", fontsize=8)
# Tokens per second
ax2 = axes[1]
bars2 = ax2.barh(x, df_bench["tokens_per_second"], color=colors)
ax2.set_yticks(x)
ax2.set_yticklabels(frameworks, fontsize=9)
ax2.set_xlabel("Tokens/second")
ax2.set_title("Throughput: Tokens/s (higher = better)", fontweight="bold")
for bar in bars2:
ax2.text(bar.get_width() + 20, bar.get_y() + bar.get_height()/2,
f"{int(bar.get_width())} tok/s", va="center", fontsize=8)
# P95 Latency (tail latency)
ax3 = axes[2]
bars3 = ax3.barh(x, df_bench["latency_p95_ms"], color=colors, alpha=0.7)
ax3.set_yticks(x)
ax3.set_yticklabels(frameworks, fontsize=9)
ax3.set_xlabel("Milliseconds")
ax3.set_title("P95 Tail Latency (lower = better)", fontweight="bold")
plt.suptitle("Inference Framework Performance Comparison (7B model, A100 GPU)",
fontsize=13, fontweight="bold")
plt.tight_layout()
plt.savefig("benchmark_comparison.png", bbox_inches="tight", dpi=150)
plt.show()
print("Chart saved as benchmark_comparison.png")
9. Cost Estimation for Cloud Deployment
def estimate_monthly_cost(
requests_per_day: int,
avg_input_tokens: int,
avg_output_tokens: int,
model_size_b: float, # billions of parameters
) -> dict:
"""
Estimate cloud GPU costs for serving a fine-tuned model.
Prices are early-2025 approximations (mix of on-demand and spot, major clouds).
"""
# Cloud GPU pricing ($/hr) - early-2025 approximations
gpu_options = {
"A10G (24GB) - AWS g5.xlarge": {"price_hr": 1.006, "max_model_b": 13},
"L4 (24GB) - GCP g2-standard-4": {"price_hr": 0.70, "max_model_b": 13},
"A100 40GB - RunPod (spot)": {"price_hr": 1.19, "max_model_b": 34},
"A100 80GB - RunPod (spot)": {"price_hr": 1.89, "max_model_b": 70},
"H100 80GB - RunPod (spot)": {"price_hr": 2.49, "max_model_b": 70},
"RTX 4090 (24GB) - Lambda Labs": {"price_hr": 0.50, "max_model_b": 13},
}
# Estimated throughput: tokens/second at batch size 1 (conservative)
# vLLM with AWQ quantization, single GPU
tokens_per_second_estimates = {
range(0, 4): 1500, # ~3B model
range(4, 10): 900, # ~7B model
range(10, 15): 450, # ~13B model
range(15, 40): 200, # ~34B model
range(40, 80): 90, # ~70B model
}
tps = 900 # Default: 7B
for r, t in tokens_per_second_estimates.items():
if int(model_size_b) in r:
tps = t
break
# Compute required GPU-hours per day
total_tokens_per_day = requests_per_day * (avg_input_tokens + avg_output_tokens)
seconds_needed_per_day = total_tokens_per_day / tps
gpu_hours_per_day = seconds_needed_per_day / 3600
results = {}
for gpu_name, info in gpu_options.items():
if model_size_b <= info["max_model_b"]:
daily_cost = gpu_hours_per_day * info["price_hr"]
monthly_cost = daily_cost * 30
# Add 20% for always-on buffer (you can't be at 100% utilization)
monthly_cost_with_buffer = monthly_cost * 1.20
results[gpu_name] = {
"hourly_rate": f"${info['price_hr']:.3f}",
"gpu_hours_per_day": round(gpu_hours_per_day, 2),
"monthly_cost": f"${monthly_cost_with_buffer:,.2f}",
}
return {
"assumptions": {
"requests_per_day": requests_per_day,
"avg_input_tokens": avg_input_tokens,
"avg_output_tokens": avg_output_tokens,
"model_size_b": model_size_b,
"estimated_tps": tps,
"total_tokens_per_day": total_tokens_per_day,
},
"gpu_costs": results,
}
# Example: 10,000 requests/day with a 7B model
cost_estimate = estimate_monthly_cost(
requests_per_day=10_000,
avg_input_tokens=200,
avg_output_tokens=300,
model_size_b=7,
)
print("Cloud Deployment Cost Estimate")
print("=" * 60)
print("\nAssumptions:")
for k, v in cost_estimate["assumptions"].items():
print(f" {k}: {v:,}" if isinstance(v, int) else f" {k}: {v}")
print("\nEstimated Monthly GPU Cost (including 20% buffer):")
for gpu, costs in cost_estimate["gpu_costs"].items():
print(f" {gpu}")
print(f" Rate: {costs['hourly_rate']}/hr | GPU-hrs/day: {costs['gpu_hours_per_day']:.1f} | Monthly: {costs['monthly_cost']}")
print("\nComparison: OpenAI GPT-4o API cost for same workload:")
reqs = 10_000
in_tokens = reqs * 200
out_tokens = reqs * 300
# GPT-4o pricing: $2.50/1M input, $10.00/1M output (2025)
openai_daily = (in_tokens / 1e6) * 2.50 + (out_tokens / 1e6) * 10.00
openai_monthly = openai_daily * 30
print(f" OpenAI GPT-4o: ${openai_monthly:,.2f}/month")
print(f" (Self-hosted saves significant cost at scale)")
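The comparison above can be turned into a break-even sketch: at what daily volume does self-hosting undercut the API? This reuses the section's own assumptions (~$1.19/hr A100 spot, ~900 tok/s for a 7B model, GPT-4o at $2.50/$10.00 per 1M input/output tokens, 200 input + 300 output tokens per request, 20% utilization buffer); all figures are rough planning numbers, not quotes:

```python
def api_monthly(reqs_per_day: int, in_tok: int = 200, out_tok: int = 300) -> float:
    """Monthly GPT-4o API cost at assumed 2025 per-token pricing."""
    daily = reqs_per_day * (in_tok * 2.50 + out_tok * 10.00) / 1e6
    return daily * 30

def selfhost_monthly(reqs_per_day: int, in_tok: int = 200, out_tok: int = 300,
                     tps: float = 900, price_hr: float = 1.19) -> float:
    """Monthly single-GPU cost: GPU-hours needed per day * rate, plus 20% buffer."""
    gpu_hours_per_day = reqs_per_day * (in_tok + out_tok) / tps / 3600
    return gpu_hours_per_day * price_hr * 30 * 1.20

for reqs in (1_000, 10_000, 100_000):
    print(f"{reqs:>7} req/day: API ${api_monthly(reqs):>9,.0f}/mo "
          f"vs self-hosted ${selfhost_monthly(reqs):>8,.0f}/mo")
```

Note the self-hosted column assumes you can pack requests efficiently (continuous batching) and scale the GPU down when idle; a dedicated always-on instance has a fixed floor of roughly `price_hr * 24 * 30` regardless of traffic.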
10. Production Monitoring
# Production monitoring with structured logging
# In production, integrate with Prometheus + Grafana or Datadog
import logging
import json as _json
import time
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class RequestLog:
"""Structured log entry for each LLM request."""
request_id: str
timestamp: str
model: str
prompt_tokens: int
completion_tokens: int
total_tokens: int
latency_ms: float
tokens_per_second: float
error: Optional[str] = None
user_id: Optional[str] = None
# Content quality signals
response_length_chars: int = 0
had_refusal: bool = False
class LLMMonitor:
"""Lightweight production monitoring for LLM APIs."""
def __init__(self, model_name: str):
self.model_name = model_name
self.request_logs = []
# Setup structured JSON logging
logging.basicConfig(
level=logging.INFO,
format="%(message)s",
)
self.logger = logging.getLogger("llm_monitor")
def log_request(self, log: RequestLog):
self.request_logs.append(log)
self.logger.info(_json.dumps(asdict(log)))
def get_metrics(self, last_n: int = 100) -> dict:
"""Compute rolling metrics for the last N requests."""
recent = self.request_logs[-last_n:]
if not recent:
return {}
successful = [r for r in recent if r.error is None]
if not successful:
return {"error_rate": 1.0}
latencies = [r.latency_ms for r in successful]
tps_list = [r.tokens_per_second for r in successful]
latencies.sort()
return {
"total_requests": len(recent),
"error_rate": 1 - len(successful) / len(recent),
"latency_p50_ms": round(latencies[len(latencies) // 2], 1),
"latency_p95_ms": round(latencies[int(len(latencies) * 0.95)], 1),
"avg_tokens_per_sec": round(sum(tps_list) / len(tps_list), 1),
"total_tokens": sum(r.total_tokens for r in successful),
"refusal_rate": sum(r.had_refusal for r in successful) / len(successful),
}
# Simulate production traffic
import random
import uuid
from datetime import datetime

monitor = LLMMonitor("qwen2.5-7b-finetuned")

# Simulate 50 requests
for i in range(50):
    prompt_tokens = random.randint(50, 300)
    output_tokens = random.randint(100, 400)
    latency_ms = random.gauss(450, 80)
    tps = output_tokens / (latency_ms / 1000)
    log = RequestLog(
        request_id=str(uuid.uuid4())[:8],
        timestamp=datetime.utcnow().isoformat(),
        model="qwen2.5-7b-finetuned",
        prompt_tokens=prompt_tokens,
        completion_tokens=output_tokens,
        total_tokens=prompt_tokens + output_tokens,
        latency_ms=round(latency_ms, 1),
        tokens_per_second=round(tps, 1),
        response_length_chars=output_tokens * 4,
        had_refusal=random.random() < 0.03,  # 3% refusal rate
        error=None if random.random() > 0.02 else "timeout",
    )
    monitor.log_request(log)

metrics = monitor.get_metrics(last_n=50)
print("\nProduction Monitoring - Last 50 Requests")
print("=" * 45)
for k, v in metrics.items():
    print(f"  {k}: {v}")
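The rolling metrics above can feed directly into a simple threshold alerter. A minimal sketch; the `check_alerts` helper and its thresholds are illustrative, not part of any library:

```python
def check_alerts(metrics: dict,
                 p95_sla_ms: float = 500.0,
                 max_error_rate: float = 0.02,
                 max_refusal_rate: float = 0.05) -> list:
    """Return human-readable alerts for any breached thresholds."""
    alerts = []
    if metrics.get("error_rate", 0.0) > max_error_rate:
        alerts.append(f"error_rate {metrics['error_rate']:.1%} > {max_error_rate:.0%}")
    if metrics.get("latency_p95_ms", 0.0) > p95_sla_ms:
        alerts.append(f"p95 latency {metrics['latency_p95_ms']}ms > SLA {p95_sla_ms}ms")
    if metrics.get("refusal_rate", 0.0) > max_refusal_rate:
        alerts.append(f"refusal_rate {metrics['refusal_rate']:.1%} > {max_refusal_rate:.0%}")
    return alerts

# Example with a synthetic metrics dict (two thresholds breached):
sample = {"error_rate": 0.04, "latency_p95_ms": 620.0, "refusal_rate": 0.01}
for a in check_alerts(sample):
    print("ALERT:", a)
```

In production you would call `check_alerts(monitor.get_metrics())` on a schedule and route the result to your paging system instead of printing.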
# Monitoring dashboard visualization
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
# Simulate time-series data
hours = list(range(24))
traffic = [max(0, int(random.gauss(300 + 400 * abs(h - 12) / 12, 30))) for h in hours]
latency_trend = [random.gauss(430 + 80 * (t / 1000), 30) for t in traffic]
error_rate = [random.gauss(0.015, 0.005) for _ in hours]
tps_trend = [random.gauss(950 - t * 0.1, 50) for t in traffic]
# Plot 1: Request volume
ax1 = axes[0, 0]
ax1.fill_between(hours, traffic, alpha=0.3, color="#5b9bd5")
ax1.plot(hours, traffic, color="#5b9bd5", linewidth=2)
ax1.set_title("Requests per Hour", fontweight="bold")
ax1.set_xlabel("Hour of Day (UTC)")
ax1.set_ylabel("Requests")
# Plot 2: Latency
ax2 = axes[0, 1]
ax2.plot(hours, latency_trend, color="#e67e22", linewidth=2, marker="o", markersize=3)
ax2.axhline(500, color="red", linestyle="--", label="SLA (500ms)", alpha=0.7)
ax2.set_title("P50 Latency (ms)", fontweight="bold")
ax2.set_xlabel("Hour of Day (UTC)")
ax2.set_ylabel("Latency (ms)")
ax2.legend()
# Plot 3: Error rate
ax3 = axes[1, 0]
error_pct = [e * 100 for e in error_rate]
ax3.bar(hours, error_pct, color=["#e74c3c" if e > 3 else "#70ad47" for e in error_pct], alpha=0.8)
ax3.axhline(2.0, color="orange", linestyle="--", label="Alert threshold (2%)")
ax3.set_title("Error Rate (%)", fontweight="bold")
ax3.set_xlabel("Hour of Day (UTC)")
ax3.set_ylabel("Error Rate (%)")
ax3.legend()
# Plot 4: Throughput
ax4 = axes[1, 1]
ax4.plot(hours, tps_trend, color="#70ad47", linewidth=2)
ax4.fill_between(hours, tps_trend, alpha=0.2, color="#70ad47")
ax4.set_title("Throughput (tokens/sec)", fontweight="bold")
ax4.set_xlabel("Hour of Day (UTC)")
ax4.set_ylabel("Tokens/sec")
plt.suptitle("LLM Production Monitoring Dashboard (24h)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("monitoring_dashboard.png", bbox_inches="tight", dpi=150)
plt.show()
print("Dashboard saved as monitoring_dashboard.png")
11. Production Deployment Checklist
12. SGLang: The Fastest Inference Engine in 2025-2026
As of 2025-2026, SGLang has overtaken vLLM in throughput benchmarks on modern hardware (H100/A100).
Why SGLang is faster:
RadixAttention: a novel KV cache management algorithm that uses a radix tree to automatically reuse cached key-value tensors across requests that share a common prefix (system prompt, RAG context, few-shot examples).
Efficient continuous batching with overlap of CPU scheduling and GPU computation.
Outperforms vLLM by ~30% on H100: ~16,200 tokens/second vs vLLM's ~12,500 tokens/second on Llama-3-8B benchmarks.
Fully OpenAI-compatible API: the same client code that works with vLLM works with SGLang.
| Metric | SGLang | vLLM | TGI |
|---|---|---|---|
| Throughput (H100, Llama-3-8B) | ~16,200 tok/s | ~12,500 tok/s | ~9,800 tok/s |
| KV cache reuse | Yes (RadixAttention) | Partial (prefix caching) | No |
| OpenAI API compat | Full | Full | Full |
| Multi-LoRA serving | Yes | Yes | Yes |
| FP8 (H100) | Yes | Yes | Yes |
| Multi-GPU (tensor parallel) | Yes | Yes | Yes |
| Ease of setup | Easy (pip) | Easy (pip) | Medium (Docker) |
# SGLang: Installation and starting the server
# Install: pip install sglang[all]
# Requires Python 3.9+ and a CUDA-capable GPU (Ampere or newer recommended)
SGLANG_INSTALL = '''
# Install SGLang with all optional dependencies (FlashInfer, triton kernels, etc.)
# Quotes keep shells like zsh from glob-expanding the brackets.
pip install "sglang[all]"

# For H100 / Hopper architecture (best performance with FP8):
pip install "sglang[all]" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.4/
'''
SGLANG_SERVER_COMMANDS = '''
# ---- Start SGLang server (OpenAI-compatible API on port 30000) ----

# Standard launch (BF16):
python -m sglang.launch_server \\
    --model-path Qwen/Qwen2.5-7B-Instruct \\
    --port 30000

# With a local merged model:
python -m sglang.launch_server \\
    --model-path ./merged-model \\
    --port 30000 \\
    --served-model-name my-finetuned-model

# FP8 quantization on H100 (highest throughput):
python -m sglang.launch_server \\
    --model-path Qwen/Qwen2.5-7B-Instruct \\
    --port 30000 \\
    --quantization fp8

# Multi-GPU tensor parallelism (2 GPUs):
python -m sglang.launch_server \\
    --model-path meta-llama/Llama-3-70B-Instruct \\
    --port 30000 \\
    --tp 2  # tensor parallel across 2 GPUs

# Multi-GPU, 4 GPUs, with FP8:
python -m sglang.launch_server \\
    --model-path Qwen/Qwen2.5-72B-Instruct \\
    --port 30000 \\
    --tp 4 \\
    --quantization fp8

# RadixAttention prefix caching is enabled by default; no flag is needed.
# To turn it off (e.g. for A/B benchmarking), pass --disable-radix-cache:
python -m sglang.launch_server \\
    --model-path Qwen/Qwen2.5-7B-Instruct \\
    --port 30000 \\
    --disable-radix-cache
'''
print("SGLang Installation and Server Launch")
print("=" * 55)
print(SGLANG_INSTALL)
print(SGLANG_SERVER_COMMANDS)
# Using SGLang via the OpenAI-compatible client
# SGLang exposes the same /v1/chat/completions and /v1/completions endpoints as vLLM.
# The ONLY change from vLLM client code is the port number (30000 vs 8000).
SGLANG_CLIENT_CODE = '''
from openai import OpenAI

# Point to the local SGLang server (same API as vLLM, just a different port)
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed",
)

# ---- Standard chat completion ----
response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

# ---- Streaming ----
stream = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[{"role": "user", "content": "Write a poem about AI."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
'''
# RadixAttention demonstration: why prefix caching is so powerful
RADIX_ATTENTION_DEMO = '''
# RadixAttention: How prefix caching works with SGLang
#
# Suppose you have a RAG pipeline with a 2,000-token system prompt + retrieved docs.
# Without prefix caching: every request re-computes the full 2,000-token KV cache.
# With RadixAttention: the common prefix is computed ONCE and reused for all requests.
#
# Speedup example (H100, 7B model, 2,000-token shared prefix):
# Without caching: 2,000 (prefill) + 200 (decode) = ~2,200 tokens to process
# With RadixAttention: 0 (prefill cached) + 200 (decode) = ~200 tokens
# Result: ~10x faster time-to-first-token for all but the first request
#
# This is transformative for:
# - RAG pipelines (shared retrieved context)
# - Multi-turn agents (shared conversation history)
# - Few-shot examples (shared demonstrations)
# - System prompts (always shared across users)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")
# Long shared system prompt (simulating RAG context)
SYSTEM_PROMPT = """
You are an expert assistant. Here is the relevant documentation:
[...2,000 tokens of retrieved context...]
Answer questions based ONLY on the information above.
"""
# All three requests below share the same prefix, so SGLang caches the KV
# tensors after the first request and the later ones are dramatically faster.
for user_question in [
    "What is the main topic of the documentation?",
    "Summarize the key points in 3 bullets.",
    "What are the limitations described?",
]:
    response = client.chat.completions.create(
        model="my-finetuned-model",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # Same prefix every time
            {"role": "user", "content": user_question},
        ],
        max_tokens=200,
    )
    print(f"Q: {user_question}")
    print(f"A: {response.choices[0].message.content[:100]}...")
    print()

# 1st request: full prefill computation
# 2nd and 3rd: prefix cache HIT - only the non-cached tokens are computed
'''
print("SGLang: OpenAI-Compatible Client Usage")
print("=" * 50)
print(SGLANG_CLIENT_CODE)
print("\nRadixAttention: Prefix Caching Deep Dive")
print("=" * 50)
print(RADIX_ATTENTION_DEMO)
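The speedup arithmetic in the demo can be made concrete with a back-of-envelope model of time-to-first-token. The function and the prefill throughput figure below are illustrative assumptions, not measurements:

```python
def ttft_estimate(prefix_tokens: int, prompt_tokens: int,
                  prefill_tps: float, cached: bool) -> float:
    """Rough time-to-first-token in seconds: the prefill cost of the
    tokens that must actually be computed before decoding can start."""
    to_prefill = prompt_tokens if not cached else max(prompt_tokens - prefix_tokens, 0)
    return to_prefill / prefill_tps

# Assumed numbers: 2,000-token shared prefix + 200-token question,
# prefill at 10,000 tok/s (illustrative for a 7B model on a modern GPU).
cold = ttft_estimate(2000, 2200, 10_000, cached=False)  # full prefill
warm = ttft_estimate(2000, 2200, 10_000, cached=True)   # prefix cache hit
print(f"cold TTFT ~{cold * 1000:.0f} ms, warm TTFT ~{warm * 1000:.0f} ms, "
      f"speedup ~{cold / warm:.0f}x")
```

With these assumed numbers the warm request prefills only 200 of 2,200 tokens, an ~11x reduction in time-to-first-token, consistent with the ~10x figure in the demo above.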
import pandas as pd
import matplotlib.pyplot as plt
# Full comparison table: SGLang vs vLLM vs llama.cpp vs Ollama
frameworks_full = [
{
"Framework": "SGLang",
"Throughput": "Highest (~16,200 tok/s H100)",
"KV Cache Reuse": "Yes (RadixAttention)",
"Hardware": "NVIDIA GPU (Ampere+)",
"Quantization": "FP8, AWQ, GPTQ, BF16",
"Multi-GPU": "Yes (tensor parallel)",
"OpenAI API": "Full",
"Setup": "Easy (pip install sglang[all])",
"Best For": "Production API, RAG, agents, high-throughput",
},
{
"Framework": "vLLM",
"Throughput": "Very High (~12,500 tok/s H100)",
"KV Cache Reuse": "Partial (prefix caching opt-in)",
"Hardware": "NVIDIA GPU, AMD (ROCm)",
"Quantization": "AWQ, GPTQ, FP8, BF16",
"Multi-GPU": "Yes (tensor + pipeline parallel)",
"OpenAI API": "Full",
"Setup": "Easy (pip install vllm)",
"Best For": "Production API, multi-LoRA serving, mature ecosystem",
},
{
"Framework": "TGI",
"Throughput": "High (~9,800 tok/s H100)",
"KV Cache Reuse": "No",
"Hardware": "NVIDIA GPU, AMD, CPU",
"Quantization": "AWQ, GPTQ, BitsAndBytes",
"Multi-GPU": "Yes",
"OpenAI API": "Partial",
"Setup": "Medium (Docker recommended)",
"Best For": "HuggingFace ecosystem, Inference Endpoints",
},
{
"Framework": "Ollama",
"Throughput": "Medium (GPU) / Low (CPU)",
"KV Cache Reuse": "No",
"Hardware": "CPU, Apple Silicon (MPS), NVIDIA",
"Quantization": "GGUF (Q2-Q8)",
"Multi-GPU": "No",
"OpenAI API": "Full",
"Setup": "Very Easy (single binary)",
"Best For": "Local dev, prototyping, edge/offline, Apple Silicon",
},
{
"Framework": "llama.cpp",
"Throughput": "Low-Medium",
"KV Cache Reuse": "No",
"Hardware": "CPU (any), GPU optional, Apple Silicon",
"Quantization": "GGUF (Q2-Q8, including mixed-precision)",
"Multi-GPU": "Limited",
"OpenAI API": "Partial (via server mode)",
"Setup": "Medium (compile from source)",
"Best For": "Embedded, edge, CPU-only, maximum portability",
},
]
df_full = pd.DataFrame(frameworks_full).set_index("Framework")
print("Inference Framework Full Comparison (2025-2026)")
print("=" * 80)
for name, row in df_full.iterrows():
    print(f"\n{name}")
    for col, val in row.items():
        print(f"  {col:<20} {val}")
# Throughput comparison chart
fig, ax = plt.subplots(figsize=(10, 5))
fw_names = ["SGLang\n(H100 FP8)", "vLLM\n(H100 BF16)", "TGI\n(H100 BF16)",
"Ollama\n(A100 Q4)", "llama.cpp\n(CPU Q4)"]
tps_vals = [16200, 12500, 9800, 3200, 480]
bar_colors = ["#2ecc71", "#3498db", "#9b59b6", "#e67e22", "#95a5a6"]
bars = ax.barh(fw_names, tps_vals, color=bar_colors, alpha=0.85, edgecolor="white", linewidth=1.2)
for bar, val in zip(bars, tps_vals):
    ax.text(bar.get_width() + 150, bar.get_y() + bar.get_height() / 2,
            f"{val:,} tok/s", va="center", fontsize=10, fontweight="bold")
ax.set_xlabel("Tokens per Second (higher = better)", fontsize=11)
ax.set_title("Inference Throughput Benchmark\nLlama-3-8B, 2025-2026 Hardware", fontsize=13, fontweight="bold")
ax.set_xlim(0, 19000)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.savefig("sglang_throughput_comparison.png", bbox_inches="tight", dpi=150)
plt.show()
print("\nChart saved as sglang_throughput_comparison.png")
When to Use SGLang vs vLLM
Both are excellent choices. Here is a practical decision guide:
Choose SGLang when:
You need maximum throughput on modern NVIDIA hardware (H100, A100, RTX 4090).
Your workload has shared prefixes: RAG pipelines, agent loops, few-shot prompts, or multi-turn conversations, where RadixAttention gives a massive speedup.
You are building an agentic system where many requests share the same tool definitions / system prompt.
You want FP8 quantization on H100 with minimal setup effort.
You are greenfield: no existing vLLM tooling to migrate.
Choose vLLM when:
Your team already has vLLM in production and switching cost is high.
You need multi-LoRA serving from a single base model (vLLM's multi-LoRA is more mature).
You are on AMD (ROCm) hardware; vLLM has better AMD support.
You rely on vLLM-specific integrations in LangChain, LlamaIndex, or Ray Serve.
Choose Ollama when:
Local development, prototyping, or demos on a laptop.
Apple Silicon (M1/M2/M3/M4): Ollama uses Metal/MPS efficiently.
No GPU available (CPU inference with GGUF quantization).
You want zero-dependency single-binary deployment on edge nodes.
Choose llama.cpp when:
Maximum portability: runs on any hardware including Raspberry Pi.
Embedded / IoT deployment where binary size matters.
You need Q2 or Q3 quantization for extreme memory constraints.
Use Case                            Best Choice
---------------------------------   ---------------------------------
Production API, RAG/agents          SGLang (RadixAttention wins here)
Production API, multi-LoRA needed   vLLM
H100 / FP8, max throughput          SGLang
AMD GPU (ROCm)                      vLLM
Local dev / Apple Silicon           Ollama
Edge / CPU-only / embedded          llama.cpp
HuggingFace Inference Endpoints     TGI
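The decision guide above can be encoded as a tiny lookup helper, useful for deployment automation scripts. The `recommend_framework` function and its scenario keys are hypothetical, simply restating the table:

```python
def recommend_framework(use_case: str) -> str:
    """Map a deployment scenario to the suggested serving stack."""
    table = {
        "rag_agents": "SGLang",      # RadixAttention prefix reuse
        "multi_lora": "vLLM",        # more mature multi-LoRA serving
        "h100_fp8": "SGLang",        # max throughput with FP8 on H100
        "amd_rocm": "vLLM",          # better AMD/ROCm support
        "local_dev": "Ollama",       # single binary, Apple Silicon friendly
        "edge_cpu": "llama.cpp",     # maximum portability, GGUF
        "hf_endpoints": "TGI",       # HuggingFace ecosystem
    }
    return table.get(use_case, "vLLM")  # sensible default for production APIs

print(recommend_framework("rag_agents"))
```

This is purely a readability device; the real decision should weigh the hardware and workload details discussed above.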
DEPLOYMENT_CHECKLIST = """
Production Deployment Checklist
================================

MODEL PREPARATION
  [ ] Merge LoRA adapter into base model (or use multi-LoRA serving)
  [ ] Choose quantization format:
        - GPU (A100/H100): BF16 or AWQ-4bit
        - Consumer GPU: GPTQ-4bit or AWQ-4bit
        - CPU/Edge: GGUF Q4_K_M or Q5_K_M
  [ ] Test quantized model quality (BERTScore / LLM judge)
  [ ] Verify chat template is correctly configured
  [ ] Set system prompt defaults

INFRASTRUCTURE
  [ ] Choose serving framework:
        - Production API: vLLM or SGLang
        - Local dev: Ollama
        - Custom logic: FastAPI + vLLM/SGLang backend
  [ ] Configure max model length (context window)
  [ ] Set GPU memory utilization (0.85-0.90 recommended)
  [ ] Configure max concurrent requests
  [ ] Set up health check endpoint
  [ ] Configure rate limiting

SECURITY
  [ ] Add API key authentication (if public endpoint)
  [ ] Enable TLS/HTTPS
  [ ] Implement input validation and length limits
  [ ] Content filtering for safety-critical applications
  [ ] PII detection for compliance

RELIABILITY
  [ ] Set request timeout (recommend 30-120s)
  [ ] Implement retry logic with exponential backoff
  [ ] Configure graceful shutdown
  [ ] Set up load balancer (multiple GPU nodes for scale)
  [ ] Implement circuit breaker

MONITORING
  [ ] Structured request logging (JSON)
  [ ] Latency metrics (p50, p95, p99)
  [ ] Error rate alerting (threshold: 1-2%)
  [ ] Token usage tracking (cost control)
  [ ] GPU utilization monitoring
  [ ] Refusal rate monitoring (safety signal)
  [ ] Distributed tracing (optional, for complex pipelines)

EVALUATION IN PRODUCTION
  [ ] A/B testing framework for model updates
  [ ] Collect user feedback signals (thumbs up/down)
  [ ] Sample and review production responses weekly
  [ ] Monitor for distribution shift
  [ ] Track hallucination incidents
"""
print(DEPLOYMENT_CHECKLIST)
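The "retry logic with exponential backoff" item under RELIABILITY can be sketched with a small stdlib-only helper. The function name, defaults, and jitter range below are illustrative choices, not a standard API:

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5,
                 retry_on: tuple = (TimeoutError, ConnectionError)):
    """Call fn(); on a retryable error, sleep base_delay * 2**attempt
    plus a little jitter, then try again, up to max_attempts total calls."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the last error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Example: a flaky call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(with_retries(flaky, base_delay=0.01))
```

In a real deployment `fn` would wrap the HTTP call to your inference server; jitter prevents synchronized retry storms when many clients back off at once.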
Key Takeaways
Merge LoRA before deploying with most frameworks (vLLM's multi-LoRA serving is the exception).
AWQ quantization is the best quality/speed tradeoff for GPU deployment (4-bit, minimal quality loss).
vLLM and SGLang are the fastest options for production: continuous batching, smart KV cache management (PagedAttention / RadixAttention), OpenAI-compatible APIs.
Ollama is easiest for local and edge deployment: single binary, GGUF format, works on CPU.
OpenAI compatibility means you can switch between your model and OpenAI by changing one URL.
Self-hosting saves significant cost at scale vs OpenAI/Anthropic API pricing.
Monitor the right things: latency (p95), error rate, token usage, refusal rate.
Multi-LoRA serving lets you serve multiple fine-tuned adapters from a single base model, which is very cost-efficient.
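The "one URL" point can be made concrete: the same OpenAI client code talks to a self-hosted server or the hosted API, and only the constructor kwargs change. A minimal stdlib-only sketch; the `client_kwargs` helper and ports are illustrative defaults matching the examples in this notebook:

```python
import os

def client_kwargs(backend: str) -> dict:
    """Return OpenAI-client constructor kwargs for a given backend.
    Only base_url and api_key differ between self-hosted and hosted."""
    if backend == "sglang":
        return {"base_url": "http://localhost:30000/v1", "api_key": "not-needed"}
    if backend == "vllm":
        return {"base_url": "http://localhost:8000/v1", "api_key": "not-needed"}
    if backend == "openai":
        # No base_url: the client falls back to the hosted API endpoint
        return {"api_key": os.environ.get("OPENAI_API_KEY", "")}
    raise ValueError(f"unknown backend: {backend}")

# client = OpenAI(**client_kwargs("sglang"))  # same code path for every backend
print(client_kwargs("vllm")["base_url"])
```

This makes A/B testing between your fine-tuned model and a hosted baseline a one-line configuration change.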
Deployment Decision Guide (2025)
Use case                            Recommended Stack
---------------------------------   --------------------------------
Production API, high traffic        vLLM + AWQ quantization + Docker
Production API, custom logic        FastAPI + vLLM backend
Multiple adapters, one base model   vLLM multi-LoRA serving
Local development / prototyping     Ollama
Edge devices / no GPU               Ollama (GGUF Q4_K_M) or llama.cpp
Hugging Face ecosystem              TGI (Text Generation Inference)
Apple Silicon (M-series)            Ollama (uses MPS/Metal)
Congratulations!
You have completed the LLM Fine-tuning course:
03_lora_basics.ipynb - LoRA and QLoRA fine-tuning
05_dpo_alignment.ipynb - DPO alignment and RLHF
06_evaluation.ipynb - comprehensive LLM evaluation
07_deployment.ipynb - production deployment (this notebook)
You now have the full pipeline: Pretrained model -> SFT -> Alignment -> Evaluate -> Deploy.