Section 4: Model Performance Validation

LLMs, Computer Vision & Speech Models

Duration: 5 hours
Difficulty: Intermediate–Advanced

4.1 Why Model Performance Validation Matters

Hardware vendors must prove their accelerators run real models correctly and fast. This goes beyond kernel-level testing — it validates:

  1. Correctness: Model produces the right output (text, classification, transcription)

  2. Performance: Throughput (tokens/sec, images/sec) meets targets

  3. Efficiency: GPU utilization, memory efficiency, power efficiency

4.2 LLM Validation

LLM Correctness: Text Generation Quality

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def validate_llm_generation(model_name, prompts, device='cuda', dtype=torch.float16):
    """Validate LLM generates coherent, correct text on target device."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    )
    model.eval()

    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                do_sample=False,  # Greedy — deterministic; temperature is irrelevant without sampling
            )
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": generated})
        print(f"Prompt: {prompt[:50]}...")
        print(f"Output: {generated[:200]}...")
        print()

    return results


# Validation prompts (known-answer tests)
VALIDATION_PROMPTS = [
    "The capital of France is",
    "2 + 2 =",
    "def fibonacci(n):\n",
    "Translate to French: Hello, how are you?",
]
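
The prompts above become an automated known-answer test once each is paired with substrings the output must contain. A minimal sketch (the expected answers here are illustrative and should be tuned per model):

```python
# Map each validation prompt to substrings the output must contain.
# Expected answers are illustrative; adjust per model and tokenizer.
KNOWN_ANSWERS = {
    "The capital of France is": ["Paris"],
    "2 + 2 =": ["4"],
}

def check_known_answers(results, known_answers=KNOWN_ANSWERS):
    """Check generated outputs against expected substrings.

    `results` is the list of {"prompt", "output"} dicts returned by
    validate_llm_generation above. Returns the prompts that failed."""
    failures = []
    for r in results:
        expected = known_answers.get(r["prompt"])
        if expected is None:
            continue  # No known answer registered for this prompt
        if not all(ans in r["output"] for ans in expected):
            failures.append(r["prompt"])
    return failures  # Empty list means all known-answer tests passed
```

An empty return value is the pass condition, which makes this easy to wrap in a CI assertion.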

LLM Numerical Parity (Cross-Platform)

Cross-platform logit comparison is the gold standard for LLM validation: run the same model with identical weights on the CPU (FP32 reference) and the target device (FP16/BF16), then compare the raw logit tensors element-by-element. Beyond raw numerical difference, the test checks top-k prediction overlap – if the top-1 token differs between platforms, the model would generate different text, which is a user-visible bug. A top-10 overlap of less than 8/10 typically indicates a kernel-level issue in GEMM, attention, or LayerNorm. This test is especially important when validating new hardware because LLMs amplify small per-layer errors through hundreds of sequential transformer blocks.

def validate_llm_logits_parity(model_name, prompt, device='cuda',
                                dtype=torch.float16):
    """Compare logits between CPU (FP32) and device under test."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")

    # CPU reference (FP32)
    model_cpu = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float32
    ).eval()
    with torch.no_grad():
        logits_ref = model_cpu(**inputs).logits

    # Device under test
    model_dev = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()
    inputs_dev = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits_dev = model_dev(**inputs_dev).logits.float().cpu()

    # Compare
    max_diff = (logits_dev - logits_ref).abs().max().item()
    mean_diff = (logits_dev - logits_ref).abs().mean().item()

    # Check top-k predictions match
    topk_ref = logits_ref[0, -1].topk(10).indices
    topk_dev = logits_dev[0, -1].topk(10).indices
    top1_match = bool((topk_ref[0] == topk_dev[0]).item())
    topk_match = len(set(topk_ref.tolist()) & set(topk_dev.tolist()))

    print(f"Logits max_diff: {max_diff:.6f}, mean_diff: {mean_diff:.6f}")
    print(f"Top-1 match: {top1_match}, Top-10 overlap: {topk_match}/10")
    return {
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "top1_match": top1_match,
        "topk_overlap": topk_match,
    }
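
Logit parity pairs naturally with a perplexity check (the metrics table later in this section targets parity within ±0.5 of the reference). Given per-token log-probabilities collected on each platform, perplexity is the exponential of the mean negative log-likelihood; a platform-agnostic sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log).

    token_logprobs: log p(token_i | context) for each scored token,
    e.g. gathered from model(...).logits.log_softmax(-1)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def perplexity_parity(ppl_ref, ppl_dev, tol=0.5):
    """Pass if device perplexity is within tol of the FP32 reference."""
    return abs(ppl_ref - ppl_dev) <= tol
```

For example, a model that assigns every token probability 0.25 has perplexity exactly 4, which is a convenient sanity check for the harness itself.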

LLM Performance Benchmarking

LLM throughput benchmarking measures the two key performance metrics that customers care about: tokens per second (overall throughput) and time to first token (TTFT, user-perceived latency). The benchmark sweeps across batch sizes because LLM serving systems like vLLM and TGI dynamically batch requests – a hardware platform must show near-linear throughput scaling with batch size until the memory wall. The test generates a fixed number of output tokens with greedy decoding (do_sample=False) to ensure deterministic timing. Hardware vendors use these numbers to calculate cost per 1M tokens and compare against competitors, making throughput benchmarking a strategic priority.

import time

def benchmark_llm_throughput(model_name, device='cuda', dtype=torch.float16,
                              batch_sizes=[1, 4, 8, 16],
                              input_len=512, output_len=128):
    """Benchmark LLM throughput: tokens/second."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # padding=True needs a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()

    results = []
    for batch_size in batch_sizes:
        # Create batch of identical prompts
        prompt = "The quick brown fox " * (input_len // 5)
        inputs = tokenizer(
            [prompt] * batch_size,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=input_len
        ).to(device)

        # Warmup
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=10, do_sample=False)
        torch.cuda.synchronize()

        # Benchmark
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **inputs, max_new_tokens=output_len, do_sample=False
            )
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        total_tokens = batch_size * output_len
        tokens_per_sec = total_tokens / elapsed
        ms_per_token = elapsed / output_len * 1000  # Mean time per output token (not true TTFT)

        result = {
            "batch_size": batch_size,
            "tokens_per_sec": tokens_per_sec,
            "latency_ms": elapsed * 1000,
            "ms_per_token": ms_per_token,
        }
        results.append(result)
        print(f"Batch {batch_size}: {tokens_per_sec:.1f} tok/s, "
              f"latency={elapsed*1000:.0f}ms")

    return results
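
The per-token figure above is an average over the whole generation, not true TTFT. TTFT can be measured directly by timing a single-token generation (prefill plus one decode step). A generic timing helper; the commented usage assumes the model and tokenizer from the benchmark above:

```python
import time

def measure_ttft_ms(generate_one_token, n_trials=5):
    """Median wall-clock time, in ms, of a callable that generates
    exactly one token (prefill + first decode step)."""
    times = []
    for _ in range(n_trials):
        start = time.perf_counter()
        generate_one_token()
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[len(times) // 2]

# Usage sketch (CUDA): synchronize inside the closure so the timing
# covers actual device execution, not just kernel launch.
# ttft_ms = measure_ttft_ms(lambda: (
#     model.generate(**inputs, max_new_tokens=1, do_sample=False),
#     torch.cuda.synchronize(),
# ))
```

Taking the median rather than the mean makes the measurement robust to one-off stalls such as CUDA graph capture or allocator warmup.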

LLM Key Metrics

The metrics table below defines the standard KPIs used across the industry for LLM hardware validation.

| Metric | Description | Target (Example) |
| --- | --- | --- |
| TTFT | Time to first token | < 100 ms (batch=1) |
| TPS | Tokens per second (generation) | > 50 tok/s per user |
| Throughput | Total tokens/sec across batches | > 1000 tok/s |
| Memory | Peak GPU memory usage | < 80% of HBM |
| Perplexity | Language model quality | Must match reference ± 0.5 |

TTFT is dominated by the prefill phase (processing the entire prompt in one forward pass), which is compute-bound and benefits from high TFLOPS. TPS during generation is dominated by the decode phase (one token at a time), which is memory-bandwidth-bound because the model weights must be loaded from HBM for each token. This fundamental difference means hardware validation must benchmark both phases independently to identify whether a platform is compute-limited or bandwidth-limited for a given model size.
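
The decode-phase bandwidth bound described above reduces to simple arithmetic: each generated token must stream the full weight set from HBM, so single-stream tokens/sec is capped at bandwidth divided by model size. A sketch of both calculations (batch=1, weights only; KV-cache and activation traffic are ignored, so real ceilings are lower):

```python
def decode_tok_per_s_ceiling(params_billion, bytes_per_param, hbm_bw_gb_s):
    """Upper bound on single-stream decode throughput for a
    memory-bandwidth-bound model: bandwidth / weight bytes."""
    model_gb = params_billion * bytes_per_param
    return hbm_bw_gb_s / model_gb

def split_phases(ttft_s, total_s, output_tokens):
    """Separate prefill (approximated by TTFT) from decode time;
    returns prefill latency and per-token decode latency in ms."""
    decode_s = total_s - ttft_s
    tpot_ms = decode_s / max(output_tokens - 1, 1) * 1000
    return {"prefill_ms": ttft_s * 1000, "tpot_ms": tpot_ms}

# Example: a 7B model in FP16 holds 14 GB of weights; on 2 TB/s HBM it
# can decode at most ~143 tok/s per stream, regardless of TFLOPS.
```

Comparing measured decode throughput against this ceiling quickly tells you whether a platform is bandwidth-limited or leaving performance on the table elsewhere.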

4.3 Computer Vision Model Validation

Classification Accuracy Validation

import torch
import torchvision.transforms as T
from torchvision import models, datasets

def validate_cv_accuracy(model_name, dataset_path, device='cuda',
                          dtype=torch.float16, num_samples=1000):
    """Validate CV model accuracy on device matches CPU reference."""
    # Load model
    model_fn = getattr(models, model_name)
    model = model_fn(weights="DEFAULT").eval()

    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Load validation set
    dataset = datasets.ImageFolder(dataset_path, transform=transform)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, shuffle=False, num_workers=4
    )

    # CPU reference
    correct_cpu = 0
    correct_dev = 0
    total = 0

    # Load a second copy for the device: nn.Module.to() moves the model
    # in place, so reusing `model` would leave no CPU reference.
    model_dev = model_fn(weights="DEFAULT").eval().to(device, dtype)

    for images, labels in loader:
        if total >= num_samples:
            break

        with torch.no_grad():
            # CPU
            preds_cpu = model(images).argmax(dim=1)
            # Device
            preds_dev = model_dev(images.to(device, dtype)).argmax(dim=1).cpu()

        correct_cpu += (preds_cpu == labels).sum().item()
        correct_dev += (preds_dev == labels).sum().item()
        total += labels.size(0)

    acc_cpu = correct_cpu / total * 100
    acc_dev = correct_dev / total * 100
    print(f"{model_name}: CPU={acc_cpu:.1f}%, Device={acc_dev:.1f}%, "
          f"diff={abs(acc_cpu - acc_dev):.2f}%")
    return abs(acc_cpu - acc_dev) < 1.0  # Accuracy within 1%

CV Throughput Benchmarking

CV throughput is measured in images per second across a sweep of batch sizes, which reveals the hardware’s compute efficiency curve. At batch=1, latency is dominated by kernel launch overhead and memory access; at large batch sizes, throughput saturates when compute units are fully utilized. The benchmark covers ResNet-50 (convolution-heavy, established baseline), ViT-B/16 (attention-heavy, tests GEMM throughput), and EfficientNet-V2 (depthwise conv + squeeze-excite, tests irregular memory access). Hardware vendors publish these numbers in their MLPerf submissions, and customers use them to select hardware for deployment – a 10% throughput difference at batch=32 can determine a multi-million-dollar procurement decision.

def benchmark_cv_throughput(model_name, batch_sizes=[1, 8, 32, 64, 128],
                             device='cuda', dtype=torch.float16):
    """Benchmark CV model throughput: images/second."""
    model_fn = getattr(models, model_name)
    model = model_fn(weights=None).eval().to(device, dtype)

    results = []
    for batch_size in batch_sizes:
        x = torch.randn(batch_size, 3, 224, 224, device=device, dtype=dtype)

        # Warmup
        for _ in range(5):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()

        # Benchmark
        iterations = 50
        start = time.perf_counter()
        for _ in range(iterations):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        images_per_sec = batch_size * iterations / elapsed
        result = {"batch_size": batch_size, "images_per_sec": images_per_sec}
        results.append(result)
        print(f"Batch {batch_size}: {images_per_sec:.1f} img/s")

    return results

# Benchmark common models
for model_name in ["resnet50", "vit_b_16", "efficientnet_v2_s"]:
    print(f"\n--- {model_name} ---")
    benchmark_cv_throughput(model_name)

Object Detection Validation

Object detection models like Faster R-CNN produce structured outputs (bounding boxes, confidence scores, class labels) rather than simple logits, requiring a different validation approach. The test compares bounding box coordinates and confidence scores between the CPU reference and the target device, with separate tolerances: boxes must match within 1.0 pixels (accounting for floating-point coordinate rounding), and scores within 0.01. Detection models are particularly sensitive to numerical precision because non-maximum suppression (NMS) uses score thresholds to filter overlapping boxes – a small change in scores can cause different boxes to survive NMS, producing visibly different detection results.

def validate_detection_model(model_name="fasterrcnn_resnet50_fpn_v2",
                              device='cuda', dtype=torch.float16):
    """Validate detection model produces consistent bounding boxes."""
    from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

    model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval()
    torch.manual_seed(42)
    x = [torch.randn(3, 640, 640)]

    # CPU reference
    with torch.no_grad():
        ref = model(x)
    ref_boxes = ref[0]["boxes"]
    ref_scores = ref[0]["scores"]

    # Device
    model_dev = model.to(device)  # Detection models often need FP32
    x_dev = [t.to(device) for t in x]
    with torch.no_grad():
        dev_result = model_dev(x_dev)
    dev_boxes = dev_result[0]["boxes"].cpu()
    dev_scores = dev_result[0]["scores"].cpu()

    # Compare top detections (assumes both platforms rank boxes the same)
    n = min(10, len(ref_boxes), len(dev_boxes))
    if n == 0:
        print("Detection: no boxes to compare (use real images in practice)")
        return False
    box_diff = (dev_boxes[:n] - ref_boxes[:n]).abs().max().item()
    score_diff = (dev_scores[:n] - ref_scores[:n]).abs().max().item()
    print(f"Detection: box_diff={box_diff:.4f}, score_diff={score_diff:.6f}")
    return box_diff < 1.0 and score_diff < 0.01
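
The index-aligned comparison above assumes both platforms rank detections identically; when NMS ordering differs between devices, matching boxes by IoU is more robust. A self-contained sketch (boxes in xyxy format; the 0.9 threshold is an illustrative choice):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(ref_boxes, dev_boxes, iou_threshold=0.9):
    """Greedily match each reference box to the best device box by IoU.
    Returns the fraction of reference boxes with a high-IoU match."""
    matched = 0
    remaining = list(dev_boxes)
    for rb in ref_boxes:
        best = max(remaining, key=lambda db: iou(rb, db), default=None)
        if best is not None and iou(rb, best) >= iou_threshold:
            matched += 1
            remaining.remove(best)
    return matched / len(ref_boxes) if ref_boxes else 1.0
```

A match fraction below 1.0 at a high IoU threshold indicates the platforms disagree about which boxes survive NMS, which is exactly the score-sensitivity failure described above.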

4.4 Speech Model Validation

Whisper (Speech-to-Text) Validation

Whisper is OpenAI’s speech recognition model that has become a de facto standard for ASR (Automatic Speech Recognition) benchmarking on AI hardware. Validation measures both transcription quality (Word Error Rate on standard corpora like LibriSpeech) and throughput (Real-Time Factor, where RTF < 1 means faster-than-realtime processing). The model uses an encoder-decoder transformer architecture where the encoder processes mel spectrograms and the decoder generates text tokens autoregressively. Hardware validation must test both the encoder (parallel, compute-bound like classification) and the decoder (sequential, memory-bound like LLM generation) to ensure both phases perform correctly on the target platform.

def validate_whisper(model_size="base", device='cuda', dtype=torch.float16):
    """Validate Whisper transcription accuracy on device."""
    import whisper
    import numpy as np

    model = whisper.load_model(model_size, device=device)

    # Random-noise placeholder; in practice, use real audio files from a
    # test corpus (e.g., LibriSpeech) so WER can be computed
    audio = np.random.randn(16000 * 10).astype(np.float32)  # 10 s at 16 kHz

    # Transcribe
    result = model.transcribe(audio, fp16=(dtype == torch.float16))
    print(f"Whisper ({model_size}): '{result['text'][:100]}...'")

    # For real validation, compare WER (Word Error Rate) against reference
    return result


def benchmark_whisper_throughput(model_size="base", device='cuda'):
    """Benchmark Whisper: seconds of audio processed per second."""
    import whisper
    import numpy as np

    model = whisper.load_model(model_size, device=device)
    audio_duration_sec = 30
    audio = np.random.randn(16000 * audio_duration_sec).astype(np.float32)

    start = time.perf_counter()
    model.transcribe(audio, fp16=True)
    elapsed = time.perf_counter() - start

    rtf = elapsed / audio_duration_sec  # Real-Time Factor
    speedup = audio_duration_sec / elapsed
    print(f"Whisper {model_size}: RTF={rtf:.2f}x, "
          f"Speed={speedup:.1f}x realtime, "
          f"Latency={elapsed*1000:.0f}ms for {audio_duration_sec}s audio")
    return {"rtf": rtf, "speedup": speedup}

Text-to-Speech (TTS) Validation

TTS models generate audio waveforms from text, and validation focuses on ensuring the output audio is non-empty, non-corrupted, and at the correct sample rate. Unlike classification tasks with discrete outputs, TTS produces continuous signals where minor numerical differences between platforms are inaudible – the primary concern is detecting catastrophic failures like NaN values in the audio buffer, incorrect sample rates, or silent (all-zero) output segments. Duration validation ensures the output length is proportional to input text length, catching buffer allocation bugs that can cause truncation or silence padding on the target device.

def validate_tts_model(device='cuda'):
    """Validate TTS model produces valid audio output."""
    # Example with a generic TTS pipeline
    import numpy as np
    from transformers import pipeline

    # Note: SpeechT5 also expects speaker embeddings (passed via
    # forward_params); some pipeline versions require them explicitly.
    tts = pipeline("text-to-speech", model="microsoft/speecht5_tts", device=device)

    test_texts = [
        "Hello, this is a validation test.",
        "The quick brown fox jumps over the lazy dog.",
        "One two three four five six seven eight nine ten.",
    ]

    for text in test_texts:
        output = tts(text)
        audio = np.asarray(output["audio"]).squeeze()
        sample_rate = output["sampling_rate"]

        # Basic validation
        assert audio.size > 0, f"Empty audio for: {text}"
        assert sample_rate > 0, "Invalid sample rate"
        assert not np.isnan(audio).any(), "NaN in audio"

        duration = audio.size / sample_rate
        print(f"TTS: '{text[:40]}...' → {duration:.1f}s audio ({sample_rate}Hz)")

4.5 Model-Level Validation Matrix

Models Every Hardware Vendor Must Validate

| Category | Model | Parameters | Key Validation |
| --- | --- | --- | --- |
| LLM | Llama 3 8B | 8B | Perplexity, generation quality |
| LLM | Llama 3 70B | 70B | Multi-GPU, memory management |
| LLM | Mistral 7B | 7B | Sliding window attention |
| LLM | GPT-NeoX 20B | 20B | Rotary embeddings |
| CV | ResNet-50 | 25M | Classification accuracy, efficiency |
| CV | ViT-B/16 | 86M | Attention-based vision |
| CV | YOLO v8 | 11-68M | Real-time detection |
| CV | Stable Diffusion | 1B | Image generation quality |
| Speech | Whisper Large | 1.5B | WER on LibriSpeech |
| Speech | Wav2Vec 2.0 | 317M | ASR accuracy |
| Multi | CLIP | 428M | Vision-language alignment |
| Multi | LLaVA | 7-13B | Visual QA correctness |
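
In practice, a matrix like this is encoded as a registry that drives the validation suite. A sketch with a subset of the entries (the check names are illustrative labels, not real function names):

```python
# Subset of the validation matrix above, encoded as a test registry.
VALIDATION_MATRIX = [
    {"category": "LLM",    "model": "Llama 3 8B",    "checks": ["perplexity", "generation_quality"]},
    {"category": "LLM",    "model": "Mistral 7B",    "checks": ["sliding_window_attention"]},
    {"category": "CV",     "model": "ResNet-50",     "checks": ["classification_accuracy"]},
    {"category": "CV",     "model": "ViT-B/16",      "checks": ["attention_vision"]},
    {"category": "Speech", "model": "Whisper Large", "checks": ["wer_librispeech"]},
    {"category": "Multi",  "model": "CLIP",          "checks": ["vision_language_alignment"]},
]

def plan_runs(matrix, categories=None):
    """Expand the registry into (model, check) pairs to execute,
    optionally filtered by category."""
    runs = []
    for entry in matrix:
        if categories is not None and entry["category"] not in categories:
            continue
        for check in entry["checks"]:
            runs.append((entry["model"], check))
    return runs
```

Keeping the matrix as data rather than code makes it trivial to run a category subset in pre-silicon bring-up and the full matrix in release qualification.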

4.6 Profiling Model Performance

PyTorch Profiler

def profile_model(model, input_data, device='cuda'):
    """Profile model execution with PyTorch profiler."""
    from torch.profiler import profile, ProfilerActivity, schedule

    model = model.to(device).eval()
    input_data = input_data.to(device)

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=3, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for _ in range(10):
            with torch.no_grad():
                model(input_data)
            prof.step()

    # Print summary
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    return prof

GPU Utilization Monitoring

GPU utilization monitoring during benchmarks provides essential context for interpreting performance numbers: high throughput with low GPU utilization (< 80%) suggests the workload is bottlenecked on data loading, CPU preprocessing, or PCIe transfers rather than GPU compute. The monitor samples nvidia-smi metrics every second, capturing GPU utilization percentage, memory utilization, power draw, and temperature. Peak memory usage reveals whether the model fits comfortably in HBM or is close to OOM. Average power during benchmarking is used to compute performance per watt (TFLOPS/W or tokens/sec/W), a critical metric for datacenter TCO calculations that directly affects hardware procurement decisions.

def monitor_gpu_utilization(duration_seconds=60):
    """Monitor GPU utilization during a benchmark."""
    import subprocess

    samples = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,utilization.memory,"
             "memory.used,memory.total,power.draw,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        ).strip().splitlines()[0]  # First GPU only; extend for multi-GPU
        values = [v.strip() for v in output.split(',')]
        samples.append({
            "gpu_util": float(values[0]),
            "mem_util": float(values[1]),
            "mem_used_mb": float(values[2]),
            "mem_total_mb": float(values[3]),
            "power_w": float(values[4]),
            "temp_c": float(values[5]),
        })
        time.sleep(1)

    # Summary
    avg_util = sum(s["gpu_util"] for s in samples) / len(samples)
    peak_mem = max(s["mem_used_mb"] for s in samples)
    avg_power = sum(s["power_w"] for s in samples) / len(samples)
    print(f"Avg GPU util: {avg_util:.0f}%, Peak mem: {peak_mem:.0f}MB, "
          f"Avg power: {avg_power:.0f}W")
    return samples
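
Because `monitor_gpu_utilization` blocks for its whole duration, it is typically run in a background thread while the benchmark executes on the main thread. A small wrapper sketch:

```python
import threading

def run_with_gpu_monitoring(benchmark_fn, monitor_fn):
    """Execute benchmark_fn while monitor_fn samples in a background
    thread; returns (benchmark_result, samples).

    monitor_fn should run at least as long as the benchmark, e.g.
    lambda: monitor_gpu_utilization(duration_seconds=120)."""
    holder = {}
    monitor = threading.Thread(
        target=lambda: holder.update(samples=monitor_fn()), daemon=True
    )
    monitor.start()
    result = benchmark_fn()
    monitor.join()  # Wait for the monitoring window to finish
    return result, holder.get("samples", [])
```

Choosing a monitor duration slightly longer than the benchmark keeps the utilization trace complete; the join then adds only the leftover tail of the window.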

4.7 Exercises

  1. LLM Logits Parity: Load a small LLM (e.g., GPT-2) on CPU (FP32) and GPU (FP16). Compare logits for 10 different prompts. What is the average and max difference?

  2. CV Model Sweep: Benchmark ResNet-50, ViT-B/16, and EfficientNet-B0 throughput (images/sec) across batch sizes {1, 8, 32, 64}. Which model is most efficient on your GPU?

  3. Profiling: Use torch.profiler to profile a Llama-like transformer block. Identify the top-3 most time-consuming operations. Are they compute-bound or memory-bound?

  4. Quantization Impact: Compare model accuracy and throughput for FP32, FP16, BF16, and INT8 (if available). At what point does accuracy degrade?

  5. Cross-Device Parity: If you have access to both NVIDIA and AMD GPUs, run the same model on both and compare outputs. Are the top-5 predictions identical?

Key Takeaways

  • Model-level validation is the ultimate test — it’s what customers care about

  • Always validate both correctness (right answer) and performance (fast enough)

  • LLM validation includes perplexity, generation quality, and TTFT/TPS metrics

  • CV validation requires accuracy parity and throughput benchmarks

  • Profiling identifies bottlenecks: is the model compute-bound or memory-bound?

Previous: 03_framework_validation.ipynb
Next: 05_e2e_pipeline_validation.ipynb Back to Overview: README.md