Section 4: Model Performance Validation
LLMs, Computer Vision & Speech Models
Duration: 5 hours
Difficulty: Intermediate-Advanced
4.1 Why Model Performance Validation Matters
Hardware vendors must prove their accelerators run real models correctly and fast. This goes beyond kernel-level testing; it validates:
Correctness: Model produces the right output (text, classification, transcription)
Performance: Throughput (tokens/sec, images/sec) meets targets
Efficiency: GPU utilization, memory efficiency, power efficiency
4.2 LLM Validation
LLM Correctness: Text Generation Quality
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def validate_llm_generation(model_name, prompts, device='cuda', dtype=torch.float16):
    """Validate LLM generates coherent, correct text on target device."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    )
    model.eval()
    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                do_sample=False,  # Greedy decoding -> deterministic output
            )
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": generated})
        print(f"Prompt: {prompt[:50]}...")
        print(f"Output: {generated[:200]}...")
        print()
    return results

# Validation prompts (known-answer tests)
VALIDATION_PROMPTS = [
    "The capital of France is",
    "2 + 2 =",
    "def fibonacci(n):\n",
    "Translate to French: Hello, how are you?",
]
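The "known-answer" idea above can be made executable by pairing each prompt with a substring the greedy-decoded output is expected to contain. A minimal sketch; the `EXPECTED_ANSWERS` table and the `check_known_answers` helper are illustrative, and a real suite would curate expected answers per model:

```python
# Hypothetical known-answer map: prompt -> substring the output must contain.
EXPECTED_ANSWERS = {
    "The capital of France is": "Paris",
    "2 + 2 =": "4",
}

def check_known_answers(results, expected=EXPECTED_ANSWERS):
    """Return (prompt, passed) pairs for every prompt with a known answer.

    `results` is the list of {"prompt", "output"} dicts produced by
    validate_llm_generation above.
    """
    report = []
    for r in results:
        if r["prompt"] in expected:
            report.append((r["prompt"], expected[r["prompt"]] in r["output"]))
    return report
```

Substring matching is deliberately loose: greedy decoding is deterministic per platform but not identical across platforms, so exact string comparison would flag harmless phrasing differences as failures.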
LLM Numerical Parity (Cross-Platform)
Cross-platform logit comparison is the gold standard for LLM validation: run the same model with identical weights on the CPU (FP32 reference) and the target device (FP16/BF16), then compare the raw logit tensors element by element. Beyond the raw numerical difference, the test checks top-k prediction overlap: if the top-1 token differs between platforms, the model would generate different text, which is a user-visible bug. A top-10 overlap of less than 8/10 typically indicates a kernel-level issue in GEMM, attention, or LayerNorm. This test is especially important when validating new hardware because LLMs amplify small per-layer errors through dozens of sequential transformer blocks.
def validate_llm_logits_parity(model_name, prompt, device='cuda',
                               dtype=torch.float16):
    """Compare logits between CPU (FP32) and device under test."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    # CPU reference (FP32)
    model_cpu = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float32
    ).eval()
    with torch.no_grad():
        logits_ref = model_cpu(**inputs).logits
    # Device under test
    model_dev = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()
    inputs_dev = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits_dev = model_dev(**inputs_dev).logits.float().cpu()
    # Compare raw logits
    max_diff = (logits_dev - logits_ref).abs().max().item()
    mean_diff = (logits_dev - logits_ref).abs().mean().item()
    # Check top-k predictions at the last position match
    topk_ref = logits_ref[0, -1].topk(10).indices
    topk_dev = logits_dev[0, -1].topk(10).indices
    top1_match = (topk_ref[0] == topk_dev[0]).item()
    topk_match = len(set(topk_ref.tolist()) & set(topk_dev.tolist()))
    print(f"Logits max_diff: {max_diff:.6f}, mean_diff: {mean_diff:.6f}")
    print(f"Top-1 match: {top1_match}, Top-10 overlap: {topk_match}/10")
    return {
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "top1_match": top1_match,
        "topk_overlap": topk_match,
    }
LLM Performance Benchmarking
LLM throughput benchmarking measures the two key performance metrics that customers care about: tokens per second (overall throughput) and time to first token (TTFT, user-perceived latency). The benchmark sweeps across batch sizes because LLM serving systems like vLLM and TGI dynamically batch requests; a hardware platform must show near-linear throughput scaling with batch size until it hits the memory wall. The test generates a fixed number of output tokens with greedy decoding (do_sample=False) to ensure deterministic timing. Hardware vendors use these numbers to calculate cost per 1M tokens and compare against competitors, making throughput benchmarking a strategic priority.
import time

def benchmark_llm_throughput(model_name, device='cuda', dtype=torch.float16,
                             batch_sizes=(1, 4, 8, 16),
                             input_len=512, output_len=128):
    """Benchmark LLM throughput: tokens/second."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()
    results = []
    for batch_size in batch_sizes:
        # Create a batch of identical prompts
        prompt = "The quick brown fox " * (input_len // 5)
        inputs = tokenizer(
            [prompt] * batch_size,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=input_len
        ).to(device)
        # Warmup
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=10, do_sample=False)
        torch.cuda.synchronize()
        # Benchmark
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **inputs, max_new_tokens=output_len, do_sample=False
            )
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        total_tokens = batch_size * output_len
        tokens_per_sec = total_tokens / elapsed
        # Average time per output token; true TTFT requires a separate
        # 1-token measurement of the prefill phase
        time_per_token = elapsed / output_len
        result = {
            "batch_size": batch_size,
            "tokens_per_sec": tokens_per_sec,
            "latency_ms": elapsed * 1000,
            "avg_time_per_token_ms": time_per_token * 1000,
        }
        results.append(result)
        print(f"Batch {batch_size}: {tokens_per_sec:.1f} tok/s, "
              f"latency={elapsed*1000:.0f}ms")
    return results
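The cost-per-1M-tokens figure mentioned above follows directly from a measured throughput and an hourly instance price. A minimal sketch; the function name and the price input are illustrative, not numbers from this course:

```python
def cost_per_million_tokens(tokens_per_sec, instance_cost_per_hour):
    """Convert measured throughput into dollars per 1M generated tokens.

    Assumes the instance is fully utilized for the whole hour;
    instance_cost_per_hour is an assumed input.
    """
    tokens_per_hour = tokens_per_sec * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

# e.g. 1000 tok/s on a $2.00/hour instance:
print(cost_per_million_tokens(1000, 2.00))  # ~0.556 $/1M tokens
```

Doubling throughput at the same instance price halves the cost per token, which is why the batch-size sweep above is the number vendors optimize hardest.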
LLM Key Metrics
The metrics table below defines the standard KPIs used across the industry for LLM hardware validation.

| Metric | Description | Target (Example) |
|---|---|---|
| TTFT | Time to first token | < 100 ms (batch=1) |
| TPS | Tokens per second (generation) | > 50 tok/s per user |
| Throughput | Total tokens/sec across batches | > 1000 tok/s |
| Memory | Peak GPU memory usage | < 80% of HBM |
| Perplexity | Language model quality | Must match reference ±0.5 |
TTFT is dominated by the prefill phase (processing the entire prompt in one forward pass), which is compute-bound and benefits from high TFLOPS. TPS during generation is dominated by the decode phase (one token at a time), which is memory-bandwidth-bound because the model weights must be loaded from HBM for each token. This fundamental difference means hardware validation must benchmark both phases independently to identify whether a platform is compute-limited or bandwidth-limited for a given model size.
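Because prefill and decode have different bottlenecks, the two phases should be timed separately. A minimal sketch of the idea: the `split_prefill_decode` name and its callable-based interface are illustrative; in practice `generate_fn(n)` would wrap `model.generate(..., max_new_tokens=n)` followed by `torch.cuda.synchronize()`:

```python
import time

def split_prefill_decode(generate_fn, output_len=128):
    """Derive TTFT and per-token decode time from two timed generate calls.

    generate_fn(n) must run prefill plus n decode steps and block until
    done. Assumes prefill cost is identical across the two calls.
    """
    generate_fn(1)  # warmup (kernel compilation, memory allocation)
    t0 = time.perf_counter()
    generate_fn(1)  # prefill + one decode step ~= TTFT
    ttft = time.perf_counter() - t0
    t0 = time.perf_counter()
    generate_fn(output_len)  # prefill + output_len decode steps
    total = time.perf_counter() - t0
    # Subtracting the TTFT run removes the (compute-bound) prefill cost,
    # isolating the (bandwidth-bound) decode cost per token
    decode_per_token = (total - ttft) / (output_len - 1)
    return {"ttft_ms": ttft * 1e3, "decode_ms_per_token": decode_per_token * 1e3}
```

A platform with high TFLOPS but weak HBM bandwidth shows a good `ttft_ms` and a poor `decode_ms_per_token`; the reverse pattern points at compute limits in prefill.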
4.3 Computer Vision Model Validation
Classification Accuracy Validation
import copy
import torch
import torchvision.transforms as T
from torchvision import models, datasets

def validate_cv_accuracy(model_name, dataset_path, device='cuda',
                         dtype=torch.float16, num_samples=1000):
    """Validate CV model accuracy on device matches CPU reference."""
    # Load model
    model_fn = getattr(models, model_name)
    model = model_fn(weights="DEFAULT").eval()
    # Module.to() moves a model in place, so deep-copy before moving to
    # keep an independent CPU reference
    model_dev = copy.deepcopy(model).to(device, dtype)
    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    # Load validation set
    dataset = datasets.ImageFolder(dataset_path, transform=transform)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, shuffle=False, num_workers=4
    )
    correct_cpu = 0
    correct_dev = 0
    total = 0
    for images, labels in loader:
        if total >= num_samples:
            break
        with torch.no_grad():
            # CPU reference (FP32)
            preds_cpu = model(images).argmax(dim=1)
            # Device under test
            preds_dev = model_dev(images.to(device, dtype)).argmax(dim=1).cpu()
        correct_cpu += (preds_cpu == labels).sum().item()
        correct_dev += (preds_dev == labels).sum().item()
        total += labels.size(0)
    acc_cpu = correct_cpu / total * 100
    acc_dev = correct_dev / total * 100
    print(f"{model_name}: CPU={acc_cpu:.1f}%, Device={acc_dev:.1f}%, "
          f"diff={abs(acc_cpu - acc_dev):.2f}%")
    return abs(acc_cpu - acc_dev) < 1.0  # Accuracy within 1 percentage point
CV Throughput Benchmarking
CV throughput is measured in images per second across a sweep of batch sizes, which reveals the hardware's compute efficiency curve. At batch=1, latency is dominated by kernel launch overhead and memory access; at large batch sizes, throughput saturates when compute units are fully utilized. The benchmark covers ResNet-50 (convolution-heavy, established baseline), ViT-B/16 (attention-heavy, tests GEMM throughput), and EfficientNet-V2 (depthwise conv + squeeze-excite, tests irregular memory access). Hardware vendors publish these numbers in their MLPerf submissions, and customers use them to select hardware for deployment: a 10% throughput difference at batch=32 can determine a multi-million-dollar procurement decision.
def benchmark_cv_throughput(model_name, batch_sizes=(1, 8, 32, 64, 128),
                            device='cuda', dtype=torch.float16):
    """Benchmark CV model throughput: images/second."""
    model_fn = getattr(models, model_name)
    model = model_fn(weights=None).eval().to(device, dtype)
    results = []
    for batch_size in batch_sizes:
        x = torch.randn(batch_size, 3, 224, 224, device=device, dtype=dtype)
        # Warmup
        for _ in range(5):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()
        # Benchmark
        iterations = 50
        start = time.perf_counter()
        for _ in range(iterations):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        images_per_sec = batch_size * iterations / elapsed
        results.append({"batch_size": batch_size, "images_per_sec": images_per_sec})
        print(f"Batch {batch_size}: {images_per_sec:.1f} img/s")
    return results

# Benchmark common models
for model_name in ["resnet50", "vit_b_16", "efficientnet_v2_s"]:
    print(f"\n--- {model_name} ---")
    benchmark_cv_throughput(model_name)
Object Detection Validation
Object detection models like Faster R-CNN produce structured outputs (bounding boxes, confidence scores, class labels) rather than simple logits, requiring a different validation approach. The test compares bounding box coordinates and confidence scores between the CPU reference and the target device, with separate tolerances: boxes must match within 1.0 pixels (accounting for floating-point coordinate rounding), and scores within 0.01. Detection models are particularly sensitive to numerical precision because non-maximum suppression (NMS) uses score thresholds to filter overlapping boxes: a small change in scores can cause different boxes to survive NMS, producing visibly different detection results.
def validate_detection_model(model_name="fasterrcnn_resnet50_fpn_v2",
                             device='cuda'):
    """Validate detection model produces consistent bounding boxes."""
    from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
    # Detection models often need FP32; FP16 can destabilize NMS score thresholds
    model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval()
    torch.manual_seed(42)
    x = [torch.randn(3, 640, 640)]
    # CPU reference (run before moving the model, since .to() is in place)
    with torch.no_grad():
        ref = model(x)
    ref_boxes = ref[0]["boxes"]
    ref_scores = ref[0]["scores"]
    # Device under test
    model_dev = model.to(device)
    x_dev = [t.to(device) for t in x]
    with torch.no_grad():
        dev_result = model_dev(x_dev)
    dev_boxes = dev_result[0]["boxes"].cpu()
    dev_scores = dev_result[0]["scores"].cpu()
    # Compare the top detections (guard against short detection lists)
    n = min(10, len(ref_boxes), len(dev_boxes))
    if n == 0:
        print("Detection: no detections to compare")
        return False
    box_diff = (dev_boxes[:n] - ref_boxes[:n]).abs().max().item()
    score_diff = (dev_scores[:n] - ref_scores[:n]).abs().max().item()
    print(f"Detection: box_diff={box_diff:.4f}, score_diff={score_diff:.6f}")
    return box_diff < 1.0 and score_diff < 0.01
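The NMS sensitivity described above is easy to demonstrate in isolation. A minimal pure-Python greedy NMS (a sketch of the standard algorithm, not torchvision's implementation) shows how a tiny score perturbation flips which of two overlapping boxes survives:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns surviving box indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest remaining score survives
        keep.append(i)
        # drop every remaining box that overlaps it too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

# Two heavily overlapping boxes (IoU ~0.68): a 0.01 score perturbation
# flips which box is kept, changing the visible detection output.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11]]
print(nms(boxes, [0.90, 0.89]))  # [0]
print(nms(boxes, [0.89, 0.90]))  # [1]
```

This is why the parity test compares scores to a tight 0.01 tolerance: a score error smaller than any visible box shift can still change the set of boxes that survive suppression.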
4.4 Speech Model Validation
Whisper (Speech-to-Text) Validation
Whisper is OpenAI's speech recognition model that has become a de facto standard for ASR (Automatic Speech Recognition) benchmarking on AI hardware. Validation measures both transcription quality (Word Error Rate on standard corpora like LibriSpeech) and throughput (Real-Time Factor, where RTF < 1 means faster-than-realtime processing). The model uses an encoder-decoder transformer architecture where the encoder processes mel spectrograms and the decoder generates text tokens autoregressively. Hardware validation must test both the encoder (parallel, compute-bound like classification) and the decoder (sequential, memory-bound like LLM generation) to ensure both phases perform correctly on the target platform.
def validate_whisper(model_size="base", device='cuda', dtype=torch.float16):
    """Validate Whisper transcription accuracy on device."""
    import whisper
    import numpy as np
    model = whisper.load_model(model_size, device=device)
    # Placeholder test audio (random noise); in practice, use real
    # audio files from a test corpus
    audio = np.random.randn(16000 * 10).astype(np.float32)  # 10 seconds at 16 kHz
    # Transcribe
    result = model.transcribe(audio, fp16=(dtype == torch.float16))
    print(f"Whisper ({model_size}): '{result['text'][:100]}...'")
    # For real validation, compare WER (Word Error Rate) against a reference transcript
    return result

def benchmark_whisper_throughput(model_size="base", device='cuda'):
    """Benchmark Whisper: seconds of audio processed per second."""
    import whisper
    import numpy as np
    model = whisper.load_model(model_size, device=device)
    audio_duration_sec = 30
    audio = np.random.randn(16000 * audio_duration_sec).astype(np.float32)
    start = time.perf_counter()
    model.transcribe(audio, fp16=True)
    elapsed = time.perf_counter() - start
    rtf = elapsed / audio_duration_sec  # Real-Time Factor (< 1 is faster than realtime)
    speedup = audio_duration_sec / elapsed
    print(f"Whisper {model_size}: RTF={rtf:.2f}, "
          f"Speed={speedup:.1f}x realtime, "
          f"Latency={elapsed*1000:.0f}ms for {audio_duration_sec}s audio")
    return {"rtf": rtf, "speedup": speedup}
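The WER metric referenced above is word-level Levenshtein distance divided by reference length. A minimal self-contained sketch (production ASR suites typically normalize casing and punctuation first, or use a dedicated library such as jiwer):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words.

    Computed via word-level Levenshtein distance with unit costs.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # ~0.333 (1 substitution / 3 words)
```

For hardware validation, the check is typically relative: the target device's WER on a corpus like LibriSpeech must match the CPU reference WER within a small tolerance, rather than hitting an absolute quality bar.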
Text-to-Speech (TTS) Validation
TTS models generate audio waveforms from text, and validation focuses on ensuring the output audio is non-empty, non-corrupted, and at the correct sample rate. Unlike classification tasks with discrete outputs, TTS produces continuous signals where minor numerical differences between platforms are inaudible; the primary concern is detecting catastrophic failures like NaN values in the audio buffer, incorrect sample rates, or silent (all-zero) output segments. Duration validation ensures the output length is proportional to input text length, catching buffer allocation bugs that can cause truncation or silence padding on the target device.
def validate_tts_model(device='cuda'):
    """Validate TTS model produces valid audio output."""
    import numpy as np
    from transformers import pipeline
    # Example with a generic TTS pipeline
    tts = pipeline("text-to-speech", model="microsoft/speecht5_tts", device=device)
    test_texts = [
        "Hello, this is a validation test.",
        "The quick brown fox jumps over the lazy dog.",
        "One two three four five six seven eight nine ten.",
    ]
    for text in test_texts:
        output = tts(text)
        audio = np.asarray(output["audio"]).ravel()
        sample_rate = output["sampling_rate"]
        # Basic validation: non-empty, valid rate, no NaN corruption
        assert len(audio) > 0, f"Empty audio for: {text}"
        assert sample_rate > 0, "Invalid sample rate"
        assert not np.isnan(audio).any(), "NaN in audio"
        duration = len(audio) / sample_rate
        print(f"TTS: '{text[:40]}...' -> {duration:.1f}s audio ({sample_rate}Hz)")
4.5 Model-Level Validation Matrix
Models Every Hardware Vendor Must Validate

| Category | Model | Parameters | Key Validation |
|---|---|---|---|
| LLM | Llama 3 8B | 8B | Perplexity, generation quality |
| LLM | Llama 3 70B | 70B | Multi-GPU, memory management |
| LLM | Mistral 7B | 7B | Sliding window attention |
| LLM | GPT-NeoX 20B | 20B | Rotary embeddings |
| CV | ResNet-50 | 25M | Classification accuracy, efficiency |
| CV | ViT-B/16 | 86M | Attention-based vision |
| CV | YOLO v8 | 11-68M | Real-time detection |
| CV | Stable Diffusion | 1B | Image generation quality |
| Speech | Whisper Large | 1.5B | WER on LibriSpeech |
| Speech | Wav2Vec 2.0 | 317M | ASR accuracy |
| Multi | CLIP | 428M | Vision-language alignment |
| Multi | LLaVA | 7-13B | Visual QA correctness |
4.6 Profiling Model Performance
PyTorch Profiler
def profile_model(model, input_data, device='cuda'):
    """Profile model execution with PyTorch profiler."""
    from torch.profiler import profile, ProfilerActivity, schedule
    model = model.to(device).eval()
    input_data = input_data.to(device)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=3, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for _ in range(10):
            with torch.no_grad():
                model(input_data)
            prof.step()
    # Print summary sorted by total CUDA time
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    return prof
GPU Utilization Monitoring
GPU utilization monitoring during benchmarks provides essential context for interpreting performance numbers: high throughput with low GPU utilization (< 80%) suggests the workload is bottlenecked on data loading, CPU preprocessing, or PCIe transfers rather than GPU compute. The monitor samples nvidia-smi metrics every second, capturing GPU utilization percentage, memory utilization, power draw, and temperature. Peak memory usage reveals whether the model fits comfortably in HBM or is close to OOM. Average power during benchmarking is used to compute performance per watt (TFLOPS/W or tokens/sec/W), a critical metric for datacenter TCO calculations that directly affects hardware procurement decisions.
def monitor_gpu_utilization(duration_seconds=60):
    """Monitor GPU utilization during a benchmark."""
    import subprocess
    samples = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,utilization.memory,"
             "memory.used,memory.total,power.draw,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        ).strip()
        # On multi-GPU systems nvidia-smi prints one line per GPU;
        # this sketch samples only GPU 0
        values = [v.strip() for v in output.splitlines()[0].split(',')]
        samples.append({
            "gpu_util": float(values[0]),
            "mem_util": float(values[1]),
            "mem_used_mb": float(values[2]),
            "mem_total_mb": float(values[3]),
            "power_w": float(values[4]),
            "temp_c": float(values[5]),
        })
        time.sleep(1)
    # Summary
    avg_util = sum(s["gpu_util"] for s in samples) / len(samples)
    peak_mem = max(s["mem_used_mb"] for s in samples)
    avg_power = sum(s["power_w"] for s in samples) / len(samples)
    print(f"Avg GPU util: {avg_util:.0f}%, Peak mem: {peak_mem:.0f}MB, "
          f"Avg power: {avg_power:.0f}W")
    return samples
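The performance-per-watt metric mentioned above combines a throughput measurement with the power samples collected by the monitor. A minimal sketch (the function name is illustrative; the sample values below are made up for demonstration):

```python
def performance_per_watt(tokens_per_sec, samples):
    """Tokens/sec per watt from a throughput number and power samples.

    `samples` is a list of dicts with a "power_w" key, as collected by
    monitor_gpu_utilization above.
    """
    avg_power = sum(s["power_w"] for s in samples) / len(samples)
    return tokens_per_sec / avg_power

# Hypothetical run: 1550 tok/s while drawing 300-320 W
samples = [{"power_w": 300.0}, {"power_w": 320.0}]
print(performance_per_watt(1550.0, samples))  # 5.0 tok/s per watt
```

For TCO comparisons the same ratio is computed with images/sec or TFLOPS in the numerator; whichever unit is used, power must be sampled during the benchmark itself, not at idle.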
4.7 Exercises
1. LLM Logits Parity: Load a small LLM (e.g., GPT-2) on CPU (FP32) and GPU (FP16). Compare logits for 10 different prompts. What is the average and max difference?
2. CV Model Sweep: Benchmark ResNet-50, ViT-B/16, and EfficientNet-B0 throughput (images/sec) across batch sizes {1, 8, 32, 64}. Which model is most efficient on your GPU?
3. Profiling: Use torch.profiler to profile a Llama-like transformer block. Identify the top-3 most time-consuming operations. Are they compute-bound or memory-bound?
4. Quantization Impact: Compare model accuracy and throughput for FP32, FP16, BF16, and INT8 (if available). At what point does accuracy degrade?
5. Cross-Device Parity: If you have access to both NVIDIA and AMD GPUs, run the same model on both and compare outputs. Are the top-5 predictions identical?
Key Takeaways
Model-level validation is the ultimate test: it's what customers care about
Always validate both correctness (right answer) and performance (fast enough)
LLM validation includes perplexity, generation quality, and TTFT/TPS metrics
CV validation requires accuracy parity and throughput benchmarks
Profiling identifies bottlenecks: is the model compute-bound or memory-bound?
Previous: 03_framework_validation.ipynb
Next: 05_e2e_pipeline_validation.ipynb
Back to Overview: README.md