Lab 03: Model Performance Validation

Latency · Throughput · Profiling · Precision Modes · LLM Benchmarking

Role alignment: AMD Principal Staff – AI/ML Performance Validation
Reference: 04_model_performance_validation.ipynb

What you will do:

  1. Benchmark model latency and throughput across batch sizes

  2. Profile with torch.profiler — find hotspot operators

  3. Measure impact of precision (FP32 → FP16 → BF16 → INT8)

  4. Simulate LLM inference: prefill vs decode throughput

  5. Build a model performance report

Works with any PyTorch model. Uses a simple Transformer as the test vehicle.

Setup

The setup detects the compute device and prints GPU specifications critical for interpreting benchmark results: GPU name, VRAM capacity, and PyTorch version. VRAM determines the maximum batch size and sequence length that can be tested, while the GPU model determines the expected throughput baseline for comparison. All benchmarks in this lab scale their workload parameters based on the available device to ensure meaningful results regardless of hardware.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import profile, record_function, ProfilerActivity
import time
import json
from pathlib import Path
from datetime import datetime

if torch.cuda.is_available():
    DEVICE = 'cuda'
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
else:
    DEVICE = 'cpu'

print(f'Device: {DEVICE}')
if DEVICE == 'cuda':
    print(f'GPU   : {torch.cuda.get_device_name(0)}')
    print(f'VRAM  : {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')

Test Model: Mini Transformer

We use a small Transformer decoder block as the test vehicle — the same architecture LLMs are built from.
Scale it up or down to match your GPU's memory.

class TransformerBlock(nn.Module):
    """Single Transformer decoder block (attention + FFN)."""

    def __init__(self, hidden_dim=512, num_heads=8, ffn_mult=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.attn  = nn.MultiheadAttention(hidden_dim, num_heads,
                                           dropout=dropout, batch_first=True)
        self.ffn   = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * ffn_mult),
            nn.GELU(),
            nn.Linear(hidden_dim * ffn_mult, hidden_dim),
        )

    def forward(self, x):
        # Pre-norm attention
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out
        # Pre-norm FFN
        x = x + self.ffn(self.norm2(x))
        return x


class MiniTransformer(nn.Module):
    """Stack of Transformer blocks β€” simulates LLM forward pass."""

    def __init__(self, num_layers=4, hidden_dim=512, num_heads=8, vocab_size=32000):
        super().__init__()
        self.embed   = nn.Embedding(vocab_size, hidden_dim)
        self.blocks  = nn.ModuleList([TransformerBlock(hidden_dim, num_heads)
                                       for _ in range(num_layers)])
        self.norm    = nn.LayerNorm(hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.norm(x))

    def num_parameters(self):
        return sum(p.numel() for p in self.parameters())


# Model config β€” scale down on CPU
if DEVICE == 'cpu':
    cfg = dict(num_layers=2, hidden_dim=128, num_heads=4, vocab_size=1000)
elif DEVICE == 'cuda' and torch.cuda.get_device_properties(0).total_memory < 8e9:
    cfg = dict(num_layers=4, hidden_dim=256, num_heads=8, vocab_size=32000)
else:
    cfg = dict(num_layers=8, hidden_dim=512, num_heads=8, vocab_size=32000)

model = MiniTransformer(**cfg).to(DEVICE)
print(f'Model parameters: {model.num_parameters() / 1e6:.1f}M')
print(f'Config: {cfg}')
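The parameter count printed above can also be estimated analytically, which is useful for sanity-checking a config before building the model. The sketch below assumes nn.MultiheadAttention's standard parametrization (fused QKV in-projection plus an output projection, 4h² + 4h parameters) and the untied lm_head used by MiniTransformer:

```python
def estimate_params(num_layers, hidden_dim, num_heads, vocab_size):
    """Rough analytic parameter count for MiniTransformer.

    Per block: attention (4h^2 + 4h), FFN (8h^2 + 5h), two LayerNorms (4h).
    The embedding and the untied lm_head each contribute vocab * h.
    """
    h = hidden_dim
    per_block = (4 * h * h + 4 * h) + (8 * h * h + 5 * h) + 4 * h
    # 2h at the end accounts for the final LayerNorm.
    return 2 * vocab_size * h + num_layers * per_block + 2 * h

# Largest config from above:
print(f'{estimate_params(8, 512, 8, 32000) / 1e6:.1f}M')  # → 58.0M
```

If the printed count from `model.num_parameters()` disagrees with this estimate, a config knob (e.g. `ffn_mult` or bias settings) has probably changed.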

Exercise 3.1 – Latency & Throughput Benchmark

Measure:

  • Latency = time for one forward pass (ms)

  • Throughput = tokens processed per second

Sweep batch sizes (1 → 8 below; extend toward 32 if VRAM allows) at a fixed sequence length.

def benchmark_inference(model, batch_size, seq_len, iterations=50, warmup=10):
    """
    Returns: (latency_ms, throughput_tokens_per_sec)
    """
    if DEVICE == 'cpu':
        iterations = 5
        warmup = 2

    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)

    # Warmup
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()

    # Benchmark
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(iterations):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    latency_ms = elapsed / iterations * 1000
    tokens_per_sec = (batch_size * seq_len * iterations) / elapsed
    return latency_ms, tokens_per_sec


batch_sizes = [1, 2, 4, 8] if DEVICE != 'cpu' else [1, 2]
seq_len = 128 if DEVICE != 'cpu' else 32

print(f'Inference benchmark  (seq_len={seq_len}, FP32):')
print(f"{'Batch':>8}  {'Latency (ms)':>14}  {'Tokens/sec':>12}  {'GPU Mem (MB)':>14}")
print('-' * 58)

bench_results = []
for bs in batch_sizes:
    lat, tps = benchmark_inference(model, bs, seq_len)
    mem = torch.cuda.memory_allocated() / 1e6 if DEVICE == 'cuda' else 0
    print(f'{bs:>8}  {lat:>14.2f}  {tps:>12.0f}  {mem:>14.1f}')
    bench_results.append({'batch': bs, 'latency_ms': round(lat, 2), 'tokens_per_sec': int(tps)})
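The benchmark above reports mean latency, which can hide tail behavior. For validation work, percentile latencies (p50/p95) are often more informative. Below is a minimal sketch with a hypothetical `benchmark_percentiles` helper that times each call individually; on GPU, the callable you pass in should include a `torch.cuda.synchronize()` so each per-call time is accurate:

```python
import time

def benchmark_percentiles(fn, iterations=50, warmup=10):
    """Time each call to fn individually and report p50/p95 latency in ms."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    p50 = times[len(times) // 2]
    p95 = times[min(int(len(times) * 0.95), len(times) - 1)]
    return p50, p95

# Dummy CPU workload for illustration; on GPU you would pass something like
# lambda: (model(input_ids), torch.cuda.synchronize())
p50, p95 = benchmark_percentiles(lambda: sum(i * i for i in range(10_000)))
print(f'p50={p50:.3f} ms  p95={p95:.3f} ms')
```

A large gap between p50 and p95 usually points to clock throttling, background contention, or allocator behavior rather than the model itself.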

Exercise 3.2 – Precision Mode Comparison

This is critical for AMD validation: does FP16/BF16 give the same throughput improvement on ROCm as on CUDA?

Test FP32 → FP16 → BF16 and measure speedup.

def benchmark_with_dtype(model, batch_size, seq_len, dtype, use_autocast=True):
    """
    Benchmark using torch.autocast for AMP (Automatic Mixed Precision).
    """
    if DEVICE == 'cpu' and dtype != torch.float32:
        return None, None  # skip on CPU

    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)
    iterations = 30 if DEVICE != 'cpu' else 5

    amp_dtype = dtype
    device_type = DEVICE  # 'cuda', 'mps', or 'cpu'

    # Warmup
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=amp_dtype, enabled=use_autocast):
        for _ in range(5):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()

    t0 = time.perf_counter()
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=amp_dtype, enabled=use_autocast):
        for _ in range(iterations):
            _ = model(input_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    latency_ms = elapsed / iterations * 1000
    tokens_per_sec = (batch_size * seq_len * iterations) / elapsed
    return latency_ms, tokens_per_sec


BS, SEQ = (4, 128) if DEVICE != 'cpu' else (1, 32)

precision_configs = [
    ('FP32 (baseline)', torch.float32),
    ('FP16 AMP',        torch.float16),
]
if DEVICE == 'cuda':
    precision_configs.append(('BF16 AMP', torch.bfloat16))

print(f'Precision comparison  (batch={BS}, seq={SEQ}):')
print(f"{'Mode':>20}  {'Latency (ms)':>14}  {'Tokens/sec':>12}  {'Speedup':>10}")
print('-' * 65)

baseline_lat = None
precision_results = []
for mode, dtype in precision_configs:
    lat, tps = benchmark_with_dtype(model, BS, SEQ, dtype)
    if lat is None:
        skip_reason = 'SKIP (not supported on CPU)'
        print(f'{mode:>20}  {skip_reason:>40}')
        continue
    if baseline_lat is None:
        baseline_lat = lat
    speedup = baseline_lat / lat
    print(f'{mode:>20}  {lat:>14.2f}  {tps:>12.0f}  {speedup:>9.2f}x')
    precision_results.append({'mode': mode, 'latency_ms': round(lat, 2),
                               'tokens_per_sec': int(tps), 'speedup': round(speedup, 2)})

print('\nExpected: FP16/BF16 ~1.5-3x faster than FP32 on modern GPUs')
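Speedup is only half the story — precision validation also checks that reduced-precision outputs stay numerically close to the FP32 reference. A minimal sketch, using a plain matmul in bfloat16 (supported on CPU in recent PyTorch) as a stand-in for a model forward pass; the tolerance shown is illustrative, not an official threshold:

```python
import torch

# Compare a reduced-precision matmul against an FP32 reference.
torch.manual_seed(0)
a = torch.randn(64, 64)
b = torch.randn(64, 64)

ref = a @ b                                   # FP32 reference
low = (a.bfloat16() @ b.bfloat16()).float()   # BF16 compute, cast back up

abs_err = (ref - low).abs()
rel_err = abs_err / ref.abs().clamp(min=1e-6)
print(f'max abs err : {abs_err.max():.4f}')
print(f'mean rel err: {rel_err.mean():.4%}')
```

For a real model you would run the same inputs through FP32 and autocast paths and compare logits the same way; BF16's 8-bit mantissa typically gives relative errors around a fraction of a percent per operation.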

Exercise 3.3 – PyTorch Profiler

Use torch.profiler to find which operators consume the most GPU time.
This is step 1 when someone says "why is my model slow?"

def profile_model(model, batch_size=2, seq_len=64):
    model.eval()
    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)

    activities = [ProfilerActivity.CPU]
    if DEVICE == 'cuda':
        activities.append(ProfilerActivity.CUDA)

    with profile(
        activities=activities,
        record_shapes=True,
        profile_memory=True,
        with_stack=False,
    ) as prof:
        with record_function('model_inference'):
            with torch.no_grad():
                for _ in range(5):
                    _ = model(input_ids)
            if DEVICE == 'cuda':
                torch.cuda.synchronize()

    return prof


print('Profiling model...')
prof = profile_model(model)

# Top 15 ops by total time (device time on CUDA, CPU time otherwise)
print('\n--- Top 15 operators by total time ---')
print(prof.key_averages().table(
    sort_by='cpu_time_total' if DEVICE != 'cuda' else 'cuda_time_total',
    row_limit=15
))

# Save trace for Chrome tracing (optional)
try:
    prof.export_chrome_trace('lab03_trace.json')
    print('\nChrome trace saved: lab03_trace.json')
    print('View at: chrome://tracing  (open the JSON file)')
except Exception as e:
    print(f'Could not save trace: {e}')
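Because the profiler above ran with `record_shapes=True`, you can also group the averages by input shape — useful when one operator (e.g. `aten::addmm`) appears at several sizes and only one of them is the hotspot. A small self-contained CPU example:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Tiny CPU-only example: group operator stats by input shape
# to see which (op, shape) combinations dominate.
lin = nn.Linear(256, 256)
x = torch.randn(8, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            _ = lin(x)

table = prof.key_averages(group_by_input_shape=True).table(
    sort_by='cpu_time_total', row_limit=5)
print(table)
```

Applied to the MiniTransformer profile, this separates the attention projections from the FFN linears even though both show up as the same underlying matmul kernel.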

Exercise 3.4 – LLM Inference: Prefill vs Decode

LLM inference has two distinct phases:

  • Prefill = process the prompt (parallel, compute-bound)

  • Decode = generate tokens one by one (sequential, memory-bound)

These have very different performance characteristics — validation engineers test both.
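The compute-bound vs memory-bound distinction can be made concrete with a back-of-envelope arithmetic-intensity estimate. A rough model (an assumption, not an exact count): a forward pass costs about 2 × params FLOPs per token, while the weights must be streamed from memory once per forward pass regardless of how many tokens it covers:

```python
def arithmetic_intensity(num_params, tokens_per_forward, bytes_per_param=2):
    """Rough FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 * params FLOPs per token and that weights are read
    once per forward pass (ignores activations and KV cache).
    """
    flops = 2 * num_params * tokens_per_forward
    bytes_moved = num_params * bytes_per_param
    return flops / bytes_moved

params = 58e6  # roughly the largest config in this lab
print(f'prefill (512 tokens): {arithmetic_intensity(params, 512):.0f} FLOPs/byte')
print(f'decode  (1 token)   : {arithmetic_intensity(params, 1):.0f} FLOPs/byte')
```

Prefill lands hundreds of FLOPs per byte (compute-bound on any modern GPU), while single-token decode is near 1 FLOP per byte — far below the roofline knee, hence memory-bound.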

def simulate_llm_phases(model, prompt_len, decode_steps=20, batch_size=1):
    """
    Simulate prefill + decode.
    
    Prefill: forward pass over full prompt (seq_len = prompt_len)
    Decode:  forward pass one token at a time (seq_len = 1 each step)
    """
    model.eval()
    vocab = cfg['vocab_size']

    # ---- Prefill ----
    prompt_ids = torch.randint(0, vocab, (batch_size, prompt_len), device=DEVICE)

    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(prompt_ids)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    prefill_ms = (time.perf_counter() - t0) * 1000

    # ---- Decode ----
    decode_times = []
    decode_input = torch.randint(0, vocab, (batch_size, 1), device=DEVICE)

    with torch.no_grad():
        for _ in range(decode_steps):
            if DEVICE == 'cuda':
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            _ = model(decode_input)
            if DEVICE == 'cuda':
                torch.cuda.synchronize()
            decode_times.append((time.perf_counter() - t0) * 1000)

    avg_decode_ms = sum(decode_times) / len(decode_times)

    prefill_tps  = (batch_size * prompt_len) / (prefill_ms / 1000)
    decode_tps   = (batch_size * 1) / (avg_decode_ms / 1000)  # tokens/sec

    return {
        'prompt_len': prompt_len,
        'batch_size': batch_size,
        'prefill_ms': round(prefill_ms, 2),
        'prefill_tokens_per_sec': round(prefill_tps, 1),
        'decode_avg_ms': round(avg_decode_ms, 2),
        'decode_tokens_per_sec': round(decode_tps, 1),
    }


prompt_lengths = [64, 128, 512, 1024] if DEVICE != 'cpu' else [16, 32]
decode_steps   = 10 if DEVICE != 'cpu' else 5

print('LLM Inference Phases:')
print(f"{'Prompt Len':>12}  {'Prefill (ms)':>14}  {'Prefill tok/s':>14}  "
      f"{'Decode (ms)':>12}  {'Decode tok/s':>13}")
print('-' * 75)

phase_results = []
for plen in prompt_lengths:
    r = simulate_llm_phases(model, plen, decode_steps)
    print(f'{plen:>12}  {r["prefill_ms"]:>14.2f}  {r["prefill_tokens_per_sec"]:>14.0f}  '
          f'{r["decode_avg_ms"]:>12.2f}  {r["decode_tokens_per_sec"]:>13.0f}')
    phase_results.append(r)

print('\nKey insight: decode tok/s << prefill tok/s — decode is memory-bound (loads weights each step)')
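Since decode is memory-bound, the measured per-step decode latency gives a lower-bound estimate of the memory bandwidth actually achieved: each step must stream at least all the weights once. A sketch with hypothetical numbers (substitute your model's parameter count and measured `decode_avg_ms`):

```python
def decode_bandwidth_gbps(num_params, decode_ms, bytes_per_param=4):
    """Lower-bound estimate of achieved memory bandwidth during decode.

    Each decode step must read the full weight set at least once,
    so achieved bandwidth >= model_bytes / step_time.
    """
    model_bytes = num_params * bytes_per_param
    return model_bytes / (decode_ms / 1000) / 1e9

# Hypothetical: a 58M-parameter FP32 model decoding at 5 ms per token.
print(f'{decode_bandwidth_gbps(58e6, 5.0):.1f} GB/s')  # → 46.4 GB/s
```

Comparing this figure against the GPU's spec-sheet bandwidth tells you how much decode headroom remains — a small model like this one nowhere near saturates a datacenter GPU, which is exactly why real deployments batch decode requests.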

Exercise 3.5 – Memory Usage Analysis

Track peak GPU memory usage vs batch size and sequence length.
Critical for fitting models into VRAM (AMD MI300X has 192 GB — but it still runs out).

def measure_peak_memory(model, batch_size, seq_len):
    """Returns peak GPU memory in MB during a forward pass."""
    if DEVICE != 'cuda':
        return 0.0

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    vocab = cfg['vocab_size']
    input_ids = torch.randint(0, vocab, (batch_size, seq_len), device=DEVICE)

    with torch.no_grad():
        _ = model(input_ids)
    torch.cuda.synchronize()

    return torch.cuda.max_memory_allocated() / 1e6


if DEVICE == 'cuda':
    batches  = [1, 2, 4, 8, 16]
    seq_lens = [64, 128, 256, 512]

    print('Peak VRAM (MB) — batch × seq_len:')
    header = f"{'':>6}" + ''.join(f'  seq={s:>5}' for s in seq_lens)
    print(header)
    print('-' * len(header))

    for bs in batches:
        row = f'bs={bs:>2}'
        for sl in seq_lens:
            try:
                mem = measure_peak_memory(model, bs, sl)
                row += f'  {mem:>8.1f}'
            except RuntimeError:
                row += f"  {'OOM':>8}"
        print(row)
    print('\nNote: memory scales with batch × seq_len² for attention')
else:
    print('Memory tracking only available on CUDA β€” skipping on', DEVICE)
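The quadratic term in the note above comes from the attention score matrix, whose size can be estimated directly. A quick sketch (one layer, materialized scores — actual usage depends on whether the backend uses a fused/flash attention kernel that avoids materializing them):

```python
def attn_scores_mb(batch, num_heads, seq_len, bytes_per_el=4):
    """Memory (MB) for one layer's materialized attention scores:
    batch x heads x seq_len x seq_len elements."""
    return batch * num_heads * seq_len * seq_len * bytes_per_el / 1e6

for sl in [64, 128, 256, 512]:
    print(f'seq={sl:>4}: {attn_scores_mb(8, 8, sl):.1f} MB per layer')
```

Doubling seq_len quadruples this term, which is why the rightmost column of the table above grows much faster than the leftmost row.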

Exercise 3.6 – Performance Report

The performance report consolidates all benchmark results into a structured JSON document suitable for CI/CD pipeline consumption and historical archival. It captures the full test environment (device, GPU model, PyTorch version, model configuration) alongside batch-sweep throughput numbers, precision-mode comparisons, and LLM phase simulation results. This format mirrors the release validation reports generated by AMD and NVIDIA hardware teams, where every benchmark run produces a JSON artifact that is stored in a database for cross-version and cross-platform regression analysis.

report = {
    'report_type': 'Model Performance Validation',
    'generated_at': datetime.now().isoformat(),
    'device': DEVICE,
    'gpu': torch.cuda.get_device_name(0) if DEVICE == 'cuda' else DEVICE,
    'model_params_M': round(model.num_parameters() / 1e6, 1),
    'model_config': cfg,
    'pytorch_version': torch.__version__,
    'batch_sweep': bench_results,
    'precision_comparison': precision_results,
    'llm_phase_simulation': phase_results,
}

Path('lab03_model_perf_report.json').write_text(json.dumps(report, indent=2))
print(json.dumps(report, indent=2))
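Once reports accumulate, the natural next step is an automated regression check: compare a fresh report's batch-sweep throughput against a stored baseline and flag drops beyond a tolerance. A minimal sketch — `check_regression` and the 10% threshold are illustrative, not part of this lab's code:

```python
def check_regression(baseline, current, tolerance=0.10):
    """Flag batches whose throughput dropped more than `tolerance`
    relative to the baseline report. Returns (batch, ref, got) tuples."""
    base = {r['batch']: r['tokens_per_sec'] for r in baseline['batch_sweep']}
    failures = []
    for r in current['batch_sweep']:
        ref = base.get(r['batch'])
        if ref and r['tokens_per_sec'] < ref * (1 - tolerance):
            failures.append((r['batch'], ref, r['tokens_per_sec']))
    return failures

# Toy example with made-up numbers: an 15% drop at batch 1 should be flagged.
baseline = {'batch_sweep': [{'batch': 1, 'tokens_per_sec': 1000}]}
current  = {'batch_sweep': [{'batch': 1, 'tokens_per_sec': 850}]}
print(check_regression(baseline, current))  # → [(1, 1000, 850)]
```

In a CI pipeline this would load the previous run's JSON artifact as the baseline and fail the job when the list is non-empty — the subject of the next lab.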

Summary

| Concept | What it validates | AMD JD relevance |
| --- | --- | --- |
| Latency/throughput sweep | Model perf across batch sizes | Core validation metric |
| Precision modes (FP16/BF16) | AMP speedup on ROCm | Key AMD differentiator vs NVIDIA |
| torch.profiler | Operator-level hotspot detection | "Use modern tools" per JD |
| Prefill vs decode | LLM inference phases | LLM workload validation |
| Peak memory | VRAM planning, OOM prevention | Datacenter deployment |

Previous: lab_02_kernel_validation.ipynb
Next: lab_04_regression_suite.ipynb — Build a cross-version regression framework
Back to Overview: README.md