Lab 03: Model Performance Validation
Latency · Throughput · Profiling · Precision Modes · LLM Benchmarking
Role alignment: AMD Principal Staff – AI/ML Performance Validation
Reference: 04_model_performance_validation.ipynb
What you will do:
Benchmark model latency and throughput across batch sizes
Profile with torch.profiler to find hotspot operators
Measure the impact of precision (FP32 → FP16 → BF16 → INT8)
Simulate LLM inference: prefill vs decode throughput
Build a model performance report
Works with any PyTorch model. Uses a simple Transformer as the test vehicle.
Setup
The setup detects the compute device and prints GPU specifications critical for interpreting benchmark results: GPU name, VRAM capacity, and PyTorch version. VRAM determines the maximum batch size and sequence length that can be tested, while the GPU model determines the expected throughput baseline for comparison. All benchmarks in this lab scale their workload parameters based on the available device to ensure meaningful results regardless of hardware.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import profile, record_function, ProfilerActivity
import time
import json
from pathlib import Path
from datetime import datetime
if torch.cuda.is_available():
    DEVICE = 'cuda'
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
else:
    DEVICE = 'cpu'
print(f'Device: {DEVICE}')
if DEVICE == 'cuda':
    print(f'GPU : {torch.cuda.get_device_name(0)}')
    print(f'VRAM : {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
Test Model: Mini Transformer
We use a small Transformer decoder block as the test vehicle – the same architecture as LLMs.
Scale it up or down to match your GPU's memory.
class TransformerBlock(nn.Module):
    """Single Transformer decoder block (attention + FFN)."""
    def __init__(self, hidden_dim=512, num_heads=8, ffn_mult=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * ffn_mult),
            nn.GELU(),
            nn.Linear(hidden_dim * ffn_mult, hidden_dim),
        )

    def forward(self, x):
        # Pre-norm attention
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out
        # Pre-norm FFN
        x = x + self.ffn(self.norm2(x))
        return x
class MiniTransformer(nn.Module):
    """Stack of Transformer blocks – simulates an LLM forward pass."""
    def __init__(self, num_layers=4, hidden_dim=512, num_heads=8, vocab_size=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.blocks = nn.ModuleList([TransformerBlock(hidden_dim, num_heads)
                                     for _ in range(num_layers)])
        self.norm = nn.LayerNorm(hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.norm(x))

    def num_parameters(self):
        return sum(p.numel() for p in self.parameters())
# Model config – scale down on CPU
if DEVICE == 'cpu':
    cfg = dict(num_layers=2, hidden_dim=128, num_heads=4, vocab_size=1000)
elif DEVICE == 'cuda' and torch.cuda.get_device_properties(0).total_memory < 8e9:
    cfg = dict(num_layers=4, hidden_dim=256, num_heads=8, vocab_size=32000)
else:
    cfg = dict(num_layers=8, hidden_dim=512, num_heads=8, vocab_size=32000)
model = MiniTransformer(**cfg).to(DEVICE)
print(f'Model parameters: {model.num_parameters() / 1e6:.1f}M')
print(f'Config: {cfg}')
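As a cross-check on the printed parameter count, the total can be derived analytically from the module layout above (embedding, per-block attention/FFN/LayerNorms, final norm, bias-free lm_head). This is a minimal sketch; the example numbers mirror the CPU config, but any values work:

```python
# Sketch: analytic parameter count for the MiniTransformer layout above.
# nn.MultiheadAttention holds a fused in-projection (3*d*d + 3*d) plus an
# out-projection (d*d + d); each nn.LayerNorm holds 2*d (weight + bias).
def estimate_params(num_layers, hidden_dim, vocab_size, ffn_mult=4):
    d = hidden_dim
    embed = vocab_size * d
    attn = 4 * d * d + 4 * d                       # qkv + output projections
    ffn = 2 * ffn_mult * d * d + ffn_mult * d + d  # two Linears with biases
    norms = 2 * 2 * d                              # norm1 + norm2
    block = attn + ffn + norms
    final_norm = 2 * d
    lm_head = d * vocab_size                       # bias=False
    return embed + num_layers * block + final_norm + lm_head

# Same numbers as the CPU config above
print(f'{estimate_params(num_layers=2, hidden_dim=128, vocab_size=1000) / 1e6:.2f}M')
```

If this formula disagrees with `model.num_parameters()`, the model definition changed – a useful tripwire when scaling the config.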
Exercise 3.1 – Latency & Throughput Benchmark
Measure:
Latency = time for one forward pass (ms)
Throughput = tokens processed per second
Sweep batch sizes from 1 to 8 at a fixed sequence length.
def benchmark_inference(model, batch_size, seq_len, iterations=50, warmup=10):
    """
    Returns: (latency_ms, throughput_tokens_per_sec)
    """
    if DEVICE == 'cpu':
        iterations = 5
        warmup = 2
    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)
    # Warmup
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    # Benchmark
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(iterations):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    latency_ms = elapsed / iterations * 1000
    tokens_per_sec = (batch_size * seq_len * iterations) / elapsed
    return latency_ms, tokens_per_sec
batch_sizes = [1, 2, 4, 8] if DEVICE != 'cpu' else [1, 2]
seq_len = 128 if DEVICE != 'cpu' else 32
print(f'Inference benchmark (seq_len={seq_len}, FP32):')
print(f"{'Batch':>8} {'Latency (ms)':>14} {'Tokens/sec':>12} {'GPU Mem (MB)':>14}")
print('-' * 58)
bench_results = []
for bs in batch_sizes:
    lat, tps = benchmark_inference(model, bs, seq_len)
    mem = torch.cuda.memory_allocated() / 1e6 if DEVICE == 'cuda' else 0
    print(f'{bs:>8} {lat:>14.2f} {tps:>12.0f} {mem:>14.1f}')
    bench_results.append({'batch': bs, 'latency_ms': round(lat, 2), 'tokens_per_sec': int(tps)})
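One derived metric worth adding to the sweep is scaling efficiency: measured throughput divided by perfect linear scaling from the smallest batch. A standalone sketch over a bench_results-style list (the numbers are illustrative, not measured):

```python
# Sketch: scaling efficiency from a bench_results-style list.
# 1.0 = perfect linear scaling from the smallest batch; values below 1.0
# mean the GPU is saturating (or kernel launch overhead no longer amortizes).
def scaling_efficiency(results):
    base = results[0]
    out = []
    for r in results:
        ideal = base['tokens_per_sec'] * (r['batch'] / base['batch'])
        out.append({'batch': r['batch'],
                    'efficiency': round(r['tokens_per_sec'] / ideal, 2)})
    return out

# Illustrative numbers, not measured
sample = [{'batch': 1, 'tokens_per_sec': 10000},
          {'batch': 2, 'tokens_per_sec': 18000},
          {'batch': 4, 'tokens_per_sec': 30000}]
print(scaling_efficiency(sample))
```

Efficiency that collapses at a particular batch size is often the first visible symptom of a memory-bandwidth ceiling.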
Exercise 3.2 – Precision Mode Comparison
This is critical for AMD validation: does FP16/BF16 give the same throughput improvement on ROCm as on CUDA?
Test FP32 → FP16 → BF16 and measure the speedup.
def benchmark_with_dtype(model, batch_size, seq_len, dtype, use_autocast=True):
    """
    Benchmark using torch.autocast for AMP (Automatic Mixed Precision).
    """
    if DEVICE == 'cpu' and dtype != torch.float32:
        return None, None  # skip on CPU
    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)
    iterations = 30 if DEVICE != 'cpu' else 5
    # autocast only accepts reduced-precision dtypes; run FP32 with it disabled
    enabled = use_autocast and dtype != torch.float32
    device_type = DEVICE  # 'cuda', 'mps', or 'cpu'
    # Warmup
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=dtype, enabled=enabled):
        for _ in range(5):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=dtype, enabled=enabled):
        for _ in range(iterations):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    latency_ms = elapsed / iterations * 1000
    tokens_per_sec = (batch_size * seq_len * iterations) / elapsed
    return latency_ms, tokens_per_sec
BS, SEQ = (4, 128) if DEVICE != 'cpu' else (1, 32)
precision_configs = [
    ('FP32 (baseline)', torch.float32),
    ('FP16 AMP', torch.float16),
]
if DEVICE == 'cuda':
    precision_configs.append(('BF16 AMP', torch.bfloat16))
print(f'Precision comparison (batch={BS}, seq={SEQ}):')
print(f"{'Mode':>20} {'Latency (ms)':>14} {'Tokens/sec':>12} {'Speedup':>10}")
print('-' * 65)
baseline_lat = None
precision_results = []
for mode, dtype in precision_configs:
    lat, tps = benchmark_with_dtype(model, BS, SEQ, dtype)
    if lat is None:
        skip_reason = 'SKIP (not supported on CPU)'
        print(f'{mode:>20} {skip_reason:>40}')
        continue
    if baseline_lat is None:
        baseline_lat = lat
    speedup = baseline_lat / lat
    print(f'{mode:>20} {lat:>14.2f} {tps:>12.0f} {speedup:>9.2f}x')
    precision_results.append({'mode': mode, 'latency_ms': round(lat, 2),
                              'tokens_per_sec': int(tps), 'speedup': round(speedup, 2)})
print('\nExpected: FP16/BF16 ~1.5-3x faster than FP32 on modern GPUs')
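Speedup is only half of a precision-mode sign-off; the other half is checking how far the reduced-precision outputs drift from the FP32 reference. This standalone CPU sketch simulates FP16 storage by rounding the inputs to half and computing in FP32 (portable across devices; native CPU half matmul support varies by PyTorch version), rather than running the lab's model:

```python
import torch

# Sketch: quantify the numeric drift that accompanies an FP16 speedup.
# Rounding inputs to half precision models the dominant FP16 storage error.
torch.manual_seed(0)
a = torch.randn(256, 256)
b = torch.randn(256, 256)
ref = a @ b                                   # FP32 reference
approx = a.half().float() @ b.half().float()  # FP16-rounded inputs, FP32 compute
max_abs = (ref - approx).abs().max().item()
rel = max_abs / ref.abs().max().item()
print(f'max abs diff: {max_abs:.4f} (relative {rel:.2e})')
```

A validation report would pair each speedup row with a tolerance like this, since a fast mode that silently diverges is a failure, not a win.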
Exercise 3.3 – PyTorch Profiler
Use torch.profiler to find which operators consume the most GPU time.
This is step 1 when someone asks "why is my model slow?"
def profile_model(model, batch_size=2, seq_len=64):
    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)
    activities = [ProfilerActivity.CPU]
    if DEVICE == 'cuda':
        activities.append(ProfilerActivity.CUDA)
    with profile(
        activities=activities,
        record_shapes=True,
        profile_memory=True,
        with_stack=False,
    ) as prof:
        with record_function('model_inference'):
            with torch.no_grad():
                for _ in range(5):
                    _ = model(input_ids)
        if DEVICE == 'cuda':
            torch.cuda.synchronize()
    return prof
print('Profiling model...')
prof = profile_model(model)
# Top 15 operators, sorted by GPU time on CUDA and CPU time otherwise
sort_key = 'cuda_time_total' if DEVICE == 'cuda' else 'cpu_time_total'
print(f'\n--- Top 15 operators by {sort_key} ---')
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
# Save trace for Chrome tracing (optional)
try:
    prof.export_chrome_trace('lab03_trace.json')
    print('\nChrome trace saved: lab03_trace.json')
    print('View at: chrome://tracing (open the JSON file)')
except Exception as e:
    print(f'Could not save trace: {e}')
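The table is human-readable, but `prof.key_averages()` also returns event records you can sort and filter in code, e.g. to gate a CI job on a hotspot budget. A standalone sketch that profiles a tiny CPU matmul so it runs on any device:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Sketch: consume profiler results programmatically instead of reading the table.
x = torch.randn(128, 128)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        _ = x @ x

# Each entry is an averaged event with .key, .count, .self_cpu_time_total (us)
events = sorted(prof.key_averages(),
                key=lambda e: e.self_cpu_time_total, reverse=True)
for e in events[:3]:
    print(f'{e.key:<30} self CPU {e.self_cpu_time_total / 1000:.3f} ms  calls={e.count}')
```

The same loop applied to the lab's `prof` object would let a regression suite assert that no single operator exceeds its historical share of runtime.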
Exercise 3.4 – LLM Inference: Prefill vs Decode
LLM inference has two distinct phases:
Prefill = process the prompt (parallel, compute-bound)
Decode = generate tokens one by one (sequential, memory-bound)
These have very different performance characteristics – validation engineers test both.
def simulate_llm_phases(model, prompt_len, decode_steps=20, batch_size=1):
    """
    Simulate prefill + decode.
    Prefill: forward pass over the full prompt (seq_len = prompt_len)
    Decode: forward pass one token at a time (seq_len = 1 each step)
    Note: no KV cache here, so each decode step is an independent forward pass.
    """
    model.eval()
    vocab = cfg['vocab_size']
    # ---- Prefill ----
    prompt_ids = torch.randint(0, vocab, (batch_size, prompt_len), device=DEVICE)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(prompt_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    prefill_ms = (time.perf_counter() - t0) * 1000
    # ---- Decode ----
    decode_times = []
    decode_input = torch.randint(0, vocab, (batch_size, 1), device=DEVICE)
    with torch.no_grad():
        for _ in range(decode_steps):
            if DEVICE == 'cuda':
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            _ = model(decode_input)
            if DEVICE == 'cuda':
                torch.cuda.synchronize()
            decode_times.append((time.perf_counter() - t0) * 1000)
    avg_decode_ms = sum(decode_times) / len(decode_times)
    prefill_tps = (batch_size * prompt_len) / (prefill_ms / 1000)
    decode_tps = (batch_size * 1) / (avg_decode_ms / 1000)  # tokens/sec
    return {
        'prompt_len': prompt_len,
        'batch_size': batch_size,
        'prefill_ms': round(prefill_ms, 2),
        'prefill_tokens_per_sec': round(prefill_tps, 1),
        'decode_avg_ms': round(avg_decode_ms, 2),
        'decode_tokens_per_sec': round(decode_tps, 1),
    }
prompt_lengths = [64, 128, 512, 1024] if DEVICE != 'cpu' else [16, 32]
decode_steps = 10 if DEVICE != 'cpu' else 5
print('LLM Inference Phases:')
print(f"{'Prompt Len':>12} {'Prefill (ms)':>14} {'Prefill tok/s':>14} "
      f"{'Decode (ms)':>12} {'Decode tok/s':>13}")
print('-' * 75)
phase_results = []
for plen in prompt_lengths:
    r = simulate_llm_phases(model, plen, decode_steps)
    print(f'{plen:>12} {r["prefill_ms"]:>14.2f} {r["prefill_tokens_per_sec"]:>14.0f} '
          f'{r["decode_avg_ms"]:>12.2f} {r["decode_tokens_per_sec"]:>13.0f}')
    phase_results.append(r)
print('\nKey insight: decode tok/s << prefill tok/s because decode is memory-bound (it reloads the weights for every generated token)')
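The two phase measurements map directly onto user-facing latency: prefill time is the time to first token (TTFT), and each additional generated token costs one decode step. A minimal sketch with illustrative placeholder numbers, not measured values:

```python
# Sketch: fold prefill/decode measurements into user-facing latency.
# TTFT (time to first token) = prefill; each extra token adds one decode step.
def generation_latency_ms(prefill_ms, decode_avg_ms, new_tokens):
    ttft = prefill_ms
    total = prefill_ms + new_tokens * decode_avg_ms
    return ttft, total

# Placeholder inputs, not measured
ttft, total = generation_latency_ms(prefill_ms=45.0, decode_avg_ms=12.0, new_tokens=100)
print(f'TTFT: {ttft:.0f} ms, full 100-token response: {total:.0f} ms')
```

This is why decode throughput dominates perceived chat latency even when prefill throughput looks excellent.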
Exercise 3.5 – Memory Usage Analysis
Track peak GPU memory usage vs batch size and sequence length.
Critical for fitting models into VRAM (the AMD MI300X has 192 GB – and it can still run out).
def measure_peak_memory(model, batch_size, seq_len):
    """Returns peak GPU memory in MB during a forward pass."""
    if DEVICE != 'cuda':
        return 0.0
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)
    with torch.no_grad():
        _ = model(input_ids)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e6
if DEVICE == 'cuda':
    batches = [1, 2, 4, 8, 16]
    seq_lens = [64, 128, 256, 512]
    print('Peak VRAM (MB) – batch × seq_len:')
    header = f"{'':>6}" + ''.join(f' seq={s:>5}' for s in seq_lens)
    print(header)
    print('-' * len(header))
    for bs in batches:
        row = f'bs={bs:>2}'
        for sl in seq_lens:
            try:
                mem = measure_peak_memory(model, bs, sl)
                row += f' {mem:>8.1f}'
            except RuntimeError:
                row += f" {'OOM':>8}"
        print(row)
    print('\nNote: attention memory scales with batch × seq_len²')
else:
    print('Memory tracking only available on CUDA – skipping on', DEVICE)
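The seq_len² note can be made concrete: standard (non-flash) attention materializes a (batch, heads, seq, seq) score matrix, and the size of that tensor alone predicts much of the growth in the table above. A back-of-envelope sketch:

```python
# Sketch: size of the attention score matrix that standard (non-flash)
# attention materializes as (batch, heads, seq, seq). FP32 = 4 bytes/element.
def attn_scores_mb(batch, heads, seq_len, bytes_per_el=4):
    return batch * heads * seq_len * seq_len * bytes_per_el / 1e6

for seq in (128, 512, 2048):
    print(f'seq={seq:>5}: {attn_scores_mb(batch=8, heads=8, seq_len=seq):>9.1f} MB')
```

Each 4x increase in sequence length multiplies this term by 16, which is exactly why the OOM boundary in the table moves diagonally.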
Exercise 3.6 – Performance Report
The performance report consolidates all benchmark results into a structured JSON document suitable for CI/CD pipeline consumption and historical archival. It captures the full test environment (device, GPU model, PyTorch version, model configuration) alongside batch-sweep throughput numbers, precision-mode comparisons, and LLM phase simulation results. This format mirrors the release validation reports generated by AMD and NVIDIA hardware teams, where every benchmark run produces a JSON artifact that is stored in a database for cross-version and cross-platform regression analysis.
report = {
    'report_type': 'Model Performance Validation',
    'generated_at': datetime.now().isoformat(),
    'device': DEVICE,
    'gpu': torch.cuda.get_device_name(0) if DEVICE == 'cuda' else DEVICE,
    'model_params_M': round(model.num_parameters() / 1e6, 1),
    'model_config': cfg,
    'pytorch_version': torch.__version__,
    'batch_sweep': bench_results,
    'precision_comparison': precision_results,
    'llm_phase_simulation': phase_results,
}
Path('lab03_model_perf_report.json').write_text(json.dumps(report, indent=2))
print(json.dumps(report, indent=2))
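Once reports like this accumulate across versions, the natural next step is comparing two of them automatically. A sketch of a regression check over the batch_sweep section; the 5% tolerance and the numbers below are illustrative, not measured:

```python
# Sketch: flag throughput regressions between two report dicts shaped like
# the JSON above. threshold=0.95 means "fail if current < 95% of baseline".
def find_regressions(baseline, current, threshold=0.95):
    base = {r['batch']: r['tokens_per_sec'] for r in baseline['batch_sweep']}
    regressions = []
    for r in current['batch_sweep']:
        ref = base.get(r['batch'])
        if ref and r['tokens_per_sec'] < ref * threshold:
            regressions.append((r['batch'], ref, r['tokens_per_sec']))
    return regressions

# Made-up example data
old = {'batch_sweep': [{'batch': 1, 'tokens_per_sec': 10000},
                       {'batch': 4, 'tokens_per_sec': 30000}]}
new = {'batch_sweep': [{'batch': 1, 'tokens_per_sec': 9200},
                       {'batch': 4, 'tokens_per_sec': 29500}]}
print(find_regressions(old, new))  # batch 1 dropped more than 5%
```

This is the core of what Lab 04 builds into a full cross-version framework.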
Summary

| Concept | What it validates | AMD JD relevance |
|---|---|---|
| Latency/throughput sweep | Model perf across batch sizes | Core validation metric |
| Precision modes (FP16/BF16) | AMP speedup on ROCm | Key AMD differentiator vs NVIDIA |
| torch.profiler | Operator-level hotspot detection | "Use modern tools" per JD |
| Prefill vs decode | LLM inference phases | LLM workload validation |
| Peak memory | VRAM planning, OOM prevention | Datacenter deployment |
Previous: lab_02_kernel_validation.ipynb
Next: lab_04_regression_suite.ipynb – Build a cross-version regression framework
Back to Overview: README.md