Section 4: Model Performance Validation
LLMs, Computer Vision & Speech Models
Duration: 5 hours
Difficulty: Intermediate-Advanced
4.1 Why Model Performance Validation Matters
Hardware vendors must prove their accelerators run real models correctly and fast. This goes beyond kernel-level testing; it validates:
Correctness: Model produces the right output (text, classification, transcription)
Performance: Throughput (tokens/sec, images/sec) meets targets
Efficiency: GPU utilization, memory efficiency, power efficiency
4.2 LLM Validation
LLM Correctness: Text Generation Quality
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def validate_llm_generation(model_name, prompts, device='cuda', dtype=torch.float16):
    """Validate LLM generates coherent, correct text on target device."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    )
    model.eval()
    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                do_sample=False,  # Greedy decoding -> deterministic output
            )
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": generated})
        print(f"Prompt: {prompt[:50]}...")
        print(f"Output: {generated[:200]}...")
        print()
    return results

# Validation prompts (known-answer tests)
VALIDATION_PROMPTS = [
    "The capital of France is",
    "2 + 2 =",
    "def fibonacci(n):\n",
    "Translate to French: Hello, how are you?",
]
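The "known-answer" idea above can be made executable by pairing each prompt with a substring the greedy-decoded output is expected to contain. A minimal sketch; the `EXPECTED_ANSWERS` table and the `check_known_answers` helper are illustrative, and a real suite would curate expected answers per model:

```python
# Hypothetical known-answer map: prompt -> substring the output must contain.
EXPECTED_ANSWERS = {
    "The capital of France is": "Paris",
    "2 + 2 =": "4",
}

def check_known_answers(results, expected=EXPECTED_ANSWERS):
    """Return (prompt, passed) pairs for every prompt with a known answer.

    `results` is the list of {"prompt", "output"} dicts produced by
    validate_llm_generation above.
    """
    report = []
    for r in results:
        if r["prompt"] in expected:
            report.append((r["prompt"], expected[r["prompt"]] in r["output"]))
    return report
```

Substring matching is deliberately loose: greedy decoding is deterministic per platform but not identical across platforms, so exact string comparison would flag harmless phrasing differences as failures.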
LLM Numerical Parity (Cross-Platform)
Cross-platform logit comparison is the gold standard for LLM validation: run the same model with identical weights on the CPU (FP32 reference) and the target device (FP16/BF16), then compare the raw logit tensors element by element. Beyond the raw numerical difference, the test checks top-k prediction overlap: if the top-1 token differs between platforms, the model would generate different text, which is a user-visible bug. A top-10 overlap of less than 8/10 typically indicates a kernel-level issue in GEMM, attention, or LayerNorm. This test is especially important when validating new hardware because LLMs amplify small per-layer errors through dozens of sequential transformer blocks.
def validate_llm_logits_parity(model_name, prompt, device='cuda',
                               dtype=torch.float16):
    """Compare logits between CPU (FP32) and device under test."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    # CPU reference (FP32)
    model_cpu = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float32
    ).eval()
    with torch.no_grad():
        logits_ref = model_cpu(**inputs).logits
    # Device under test
    model_dev = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()
    inputs_dev = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits_dev = model_dev(**inputs_dev).logits.float().cpu()
    # Compare raw logits
    max_diff = (logits_dev - logits_ref).abs().max().item()
    mean_diff = (logits_dev - logits_ref).abs().mean().item()
    # Check top-k predictions at the last position match
    topk_ref = logits_ref[0, -1].topk(10).indices
    topk_dev = logits_dev[0, -1].topk(10).indices
    top1_match = (topk_ref[0] == topk_dev[0]).item()
    topk_match = len(set(topk_ref.tolist()) & set(topk_dev.tolist()))
    print(f"Logits max_diff: {max_diff:.6f}, mean_diff: {mean_diff:.6f}")
    print(f"Top-1 match: {top1_match}, Top-10 overlap: {topk_match}/10")
    return {
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "top1_match": top1_match,
        "topk_overlap": topk_match,
    }
LLM Performance Benchmarking
LLM throughput benchmarking measures the two key performance metrics that customers care about: tokens per second (overall throughput) and time to first token (TTFT, user-perceived latency). The benchmark sweeps across batch sizes because LLM serving systems like vLLM and TGI dynamically batch requests; a hardware platform must show near-linear throughput scaling with batch size until it hits the memory wall. The test generates a fixed number of output tokens with greedy decoding (do_sample=False) to ensure deterministic timing. Hardware vendors use these numbers to calculate cost per 1M tokens and compare against competitors, making throughput benchmarking a strategic priority.
import time

def benchmark_llm_throughput(model_name, device='cuda', dtype=torch.float16,
                             batch_sizes=(1, 4, 8, 16),
                             input_len=512, output_len=128):
    """Benchmark LLM throughput: tokens/second."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=dtype, device_map=device
    ).eval()
    results = []
    for batch_size in batch_sizes:
        # Create a batch of identical prompts
        prompt = "The quick brown fox " * (input_len // 5)
        inputs = tokenizer(
            [prompt] * batch_size,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=input_len
        ).to(device)
        # Warmup
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=10, do_sample=False)
        torch.cuda.synchronize()
        # Benchmark
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **inputs, max_new_tokens=output_len, do_sample=False
            )
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        total_tokens = batch_size * output_len
        tokens_per_sec = total_tokens / elapsed
        # Average time per output token; true TTFT requires a separate
        # 1-token measurement of the prefill phase
        time_per_token = elapsed / output_len
        result = {
            "batch_size": batch_size,
            "tokens_per_sec": tokens_per_sec,
            "latency_ms": elapsed * 1000,
            "avg_time_per_token_ms": time_per_token * 1000,
        }
        results.append(result)
        print(f"Batch {batch_size}: {tokens_per_sec:.1f} tok/s, "
              f"latency={elapsed*1000:.0f}ms")
    return results
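The cost-per-1M-tokens figure mentioned above follows directly from a measured throughput and an hourly instance price. A minimal sketch; the function name and the price input are illustrative, not numbers from this course:

```python
def cost_per_million_tokens(tokens_per_sec, instance_cost_per_hour):
    """Convert measured throughput into dollars per 1M generated tokens.

    Assumes the instance is fully utilized for the whole hour;
    instance_cost_per_hour is an assumed input.
    """
    tokens_per_hour = tokens_per_sec * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

# e.g. 1000 tok/s on a $2.00/hour instance:
print(cost_per_million_tokens(1000, 2.00))  # ~0.556 $/1M tokens
```

Doubling throughput at the same instance price halves the cost per token, which is why the batch-size sweep above is the number vendors optimize hardest.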
LLM Key Metrics
The metrics table below defines the standard KPIs used across the industry for LLM hardware validation.

| Metric | Description | Target (Example) |
|---|---|---|
| TTFT | Time to first token | < 100 ms (batch=1) |
| TPS | Tokens per second (generation) | > 50 tok/s per user |
| Throughput | Total tokens/sec across batches | > 1000 tok/s |
| Memory | Peak GPU memory usage | < 80% of HBM |
| Perplexity | Language model quality | Must match reference ±0.5 |
TTFT is dominated by the prefill phase (processing the entire prompt in one forward pass), which is compute-bound and benefits from high TFLOPS. TPS during generation is dominated by the decode phase (one token at a time), which is memory-bandwidth-bound because the model weights must be loaded from HBM for each token. This fundamental difference means hardware validation must benchmark both phases independently to identify whether a platform is compute-limited or bandwidth-limited for a given model size.
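Because prefill and decode have different bottlenecks, the two phases should be timed separately. A minimal sketch of the idea: the `split_prefill_decode` name and its callable-based interface are illustrative; in practice `generate_fn(n)` would wrap `model.generate(..., max_new_tokens=n)` followed by `torch.cuda.synchronize()`:

```python
import time

def split_prefill_decode(generate_fn, output_len=128):
    """Derive TTFT and per-token decode time from two timed generate calls.

    generate_fn(n) must run prefill plus n decode steps and block until
    done. Assumes prefill cost is identical across the two calls.
    """
    generate_fn(1)  # warmup (kernel compilation, memory allocation)
    t0 = time.perf_counter()
    generate_fn(1)  # prefill + one decode step ~= TTFT
    ttft = time.perf_counter() - t0
    t0 = time.perf_counter()
    generate_fn(output_len)  # prefill + output_len decode steps
    total = time.perf_counter() - t0
    # Subtracting the TTFT run removes the (compute-bound) prefill cost,
    # isolating the (bandwidth-bound) decode cost per token
    decode_per_token = (total - ttft) / (output_len - 1)
    return {"ttft_ms": ttft * 1e3, "decode_ms_per_token": decode_per_token * 1e3}
```

A platform with high TFLOPS but weak HBM bandwidth shows a good `ttft_ms` and a poor `decode_ms_per_token`; the reverse pattern points at compute limits in prefill.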
4.3 Computer Vision Model Validation
Classification Accuracy Validation
import copy
import torch
import torchvision.transforms as T
from torchvision import models, datasets

def validate_cv_accuracy(model_name, dataset_path, device='cuda',
                         dtype=torch.float16, num_samples=1000):
    """Validate CV model accuracy on device matches CPU reference."""
    # Load model
    model_fn = getattr(models, model_name)
    model = model_fn(weights="DEFAULT").eval()
    # Module.to() moves a model in place, so deep-copy before moving to
    # keep an independent CPU reference
    model_dev = copy.deepcopy(model).to(device, dtype)
    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    # Load validation set
    dataset = datasets.ImageFolder(dataset_path, transform=transform)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, shuffle=False, num_workers=4
    )
    correct_cpu = 0
    correct_dev = 0
    total = 0
    for images, labels in loader:
        if total >= num_samples:
            break
        with torch.no_grad():
            # CPU reference (FP32)
            preds_cpu = model(images).argmax(dim=1)
            # Device under test
            preds_dev = model_dev(images.to(device, dtype)).argmax(dim=1).cpu()
        correct_cpu += (preds_cpu == labels).sum().item()
        correct_dev += (preds_dev == labels).sum().item()
        total += labels.size(0)
    acc_cpu = correct_cpu / total * 100
    acc_dev = correct_dev / total * 100
    print(f"{model_name}: CPU={acc_cpu:.1f}%, Device={acc_dev:.1f}%, "
          f"diff={abs(acc_cpu - acc_dev):.2f}%")
    return abs(acc_cpu - acc_dev) < 1.0  # Accuracy within 1 percentage point
CV Throughput Benchmarking
CV throughput is measured in images per second across a sweep of batch sizes, which reveals the hardware's compute efficiency curve. At batch=1, latency is dominated by kernel launch overhead and memory access; at large batch sizes, throughput saturates when compute units are fully utilized. The benchmark covers ResNet-50 (convolution-heavy, established baseline), ViT-B/16 (attention-heavy, tests GEMM throughput), and EfficientNet-V2 (depthwise conv + squeeze-excite, tests irregular memory access). Hardware vendors publish these numbers in their MLPerf submissions, and customers use them to select hardware for deployment: a 10% throughput difference at batch=32 can determine a multi-million-dollar procurement decision.
def benchmark_cv_throughput(model_name, batch_sizes=(1, 8, 32, 64, 128),
                            device='cuda', dtype=torch.float16):
    """Benchmark CV model throughput: images/second."""
    model_fn = getattr(models, model_name)
    model = model_fn(weights=None).eval().to(device, dtype)
    results = []
    for batch_size in batch_sizes:
        x = torch.randn(batch_size, 3, 224, 224, device=device, dtype=dtype)
        # Warmup
        for _ in range(5):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()
        # Benchmark
        iterations = 50
        start = time.perf_counter()
        for _ in range(iterations):
            with torch.no_grad():
                model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        images_per_sec = batch_size * iterations / elapsed
        results.append({"batch_size": batch_size, "images_per_sec": images_per_sec})
        print(f"Batch {batch_size}: {images_per_sec:.1f} img/s")
    return results

# Benchmark common models
for model_name in ["resnet50", "vit_b_16", "efficientnet_v2_s"]:
    print(f"\n--- {model_name} ---")
    benchmark_cv_throughput(model_name)
Object Detection Validation
Object detection models like Faster R-CNN produce structured outputs (bounding boxes, confidence scores, class labels) rather than simple logits, requiring a different validation approach. The test compares bounding box coordinates and confidence scores between the CPU reference and the target device, with separate tolerances: boxes must match within 1.0 pixels (accounting for floating-point coordinate rounding), and scores within 0.01. Detection models are particularly sensitive to numerical precision because non-maximum suppression (NMS) uses score thresholds to filter overlapping boxes: a small change in scores can cause different boxes to survive NMS, producing visibly different detection results.
def validate_detection_model(model_name="fasterrcnn_resnet50_fpn_v2",
                             device='cuda'):
    """Validate detection model produces consistent bounding boxes."""
    from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
    # Detection models often need FP32; FP16 can destabilize NMS score thresholds
    model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval()
    torch.manual_seed(42)
    x = [torch.randn(3, 640, 640)]
    # CPU reference (run before moving the model, since .to() is in place)
    with torch.no_grad():
        ref = model(x)
    ref_boxes = ref[0]["boxes"]
    ref_scores = ref[0]["scores"]
    # Device under test
    model_dev = model.to(device)
    x_dev = [t.to(device) for t in x]
    with torch.no_grad():
        dev_result = model_dev(x_dev)
    dev_boxes = dev_result[0]["boxes"].cpu()
    dev_scores = dev_result[0]["scores"].cpu()
    # Compare the top detections (guard against short detection lists)
    n = min(10, len(ref_boxes), len(dev_boxes))
    if n == 0:
        print("Detection: no detections to compare")
        return False
    box_diff = (dev_boxes[:n] - ref_boxes[:n]).abs().max().item()
    score_diff = (dev_scores[:n] - ref_scores[:n]).abs().max().item()
    print(f"Detection: box_diff={box_diff:.4f}, score_diff={score_diff:.6f}")
    return box_diff < 1.0 and score_diff < 0.01
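The NMS sensitivity described above is easy to demonstrate in isolation. A minimal pure-Python greedy NMS (a sketch of the standard algorithm, not torchvision's implementation) shows how a tiny score perturbation flips which of two overlapping boxes survives:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns surviving box indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest remaining score survives
        keep.append(i)
        # drop every remaining box that overlaps it too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

# Two heavily overlapping boxes (IoU ~0.68): a 0.01 score perturbation
# flips which box is kept, changing the visible detection output.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11]]
print(nms(boxes, [0.90, 0.89]))  # [0]
print(nms(boxes, [0.89, 0.90]))  # [1]
```

This is why the parity test compares scores to a tight 0.01 tolerance: a score error smaller than any visible box shift can still change the set of boxes that survive suppression.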
4.4 Speech Model Validation
Whisper (Speech-to-Text) Validation
Whisper is OpenAI's speech recognition model that has become a de facto standard for ASR (Automatic Speech Recognition) benchmarking on AI hardware. Validation measures both transcription quality (Word Error Rate on standard corpora like LibriSpeech) and throughput (Real-Time Factor, where RTF < 1 means faster-than-realtime processing). The model uses an encoder-decoder transformer architecture where the encoder processes mel spectrograms and the decoder generates text tokens autoregressively. Hardware validation must test both the encoder (parallel, compute-bound like classification) and the decoder (sequential, memory-bound like LLM generation) to ensure both phases perform correctly on the target platform.
def validate_whisper(model_size="base", device='cuda', dtype=torch.float16):
    """Validate Whisper transcription accuracy on device."""
    import whisper
    import numpy as np
    model = whisper.load_model(model_size, device=device)
    # Placeholder test audio (random noise); in practice, use real
    # audio files from a test corpus
    audio = np.random.randn(16000 * 10).astype(np.float32)  # 10 seconds at 16 kHz
    # Transcribe
    result = model.transcribe(audio, fp16=(dtype == torch.float16))
    print(f"Whisper ({model_size}): '{result['text'][:100]}...'")
    # For real validation, compare WER (Word Error Rate) against a reference transcript
    return result

def benchmark_whisper_throughput(model_size="base", device='cuda'):
    """Benchmark Whisper: seconds of audio processed per second."""
    import whisper
    import numpy as np
    model = whisper.load_model(model_size, device=device)
    audio_duration_sec = 30
    audio = np.random.randn(16000 * audio_duration_sec).astype(np.float32)
    start = time.perf_counter()
    model.transcribe(audio, fp16=True)
    elapsed = time.perf_counter() - start
    rtf = elapsed / audio_duration_sec  # Real-Time Factor (< 1 is faster than realtime)
    speedup = audio_duration_sec / elapsed
    print(f"Whisper {model_size}: RTF={rtf:.2f}, "
          f"Speed={speedup:.1f}x realtime, "
          f"Latency={elapsed*1000:.0f}ms for {audio_duration_sec}s audio")
    return {"rtf": rtf, "speedup": speedup}
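The WER metric referenced above is word-level Levenshtein distance divided by reference length. A minimal self-contained sketch (production ASR suites typically normalize casing and punctuation first, or use a dedicated library such as jiwer):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words.

    Computed via word-level Levenshtein distance with unit costs.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # ~0.333 (1 substitution / 3 words)
```

For hardware validation, the check is typically relative: the target device's WER on a corpus like LibriSpeech must match the CPU reference WER within a small tolerance, rather than hitting an absolute quality bar.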
Text-to-Speech (TTS) Validation
TTS models generate audio waveforms from text, and validation focuses on ensuring the output audio is non-empty, non-corrupted, and at the correct sample rate. Unlike classification tasks with discrete outputs, TTS produces continuous signals where minor numerical differences between platforms are inaudible; the primary concern is detecting catastrophic failures like NaN values in the audio buffer, incorrect sample rates, or silent (all-zero) output segments. Duration validation ensures the output length is proportional to input text length, catching buffer allocation bugs that can cause truncation or silence padding on the target device.
def validate_tts_model(device='cuda'):
    """Validate TTS model produces valid audio output."""
    import numpy as np
    from transformers import pipeline
    # Example with a generic TTS pipeline
    tts = pipeline("text-to-speech", model="microsoft/speecht5_tts", device=device)
    test_texts = [
        "Hello, this is a validation test.",
        "The quick brown fox jumps over the lazy dog.",
        "One two three four five six seven eight nine ten.",
    ]
    for text in test_texts:
        output = tts(text)
        audio = np.asarray(output["audio"]).ravel()
        sample_rate = output["sampling_rate"]
        # Basic validation: non-empty, valid rate, no NaN corruption
        assert len(audio) > 0, f"Empty audio for: {text}"
        assert sample_rate > 0, "Invalid sample rate"
        assert not np.isnan(audio).any(), "NaN in audio"
        duration = len(audio) / sample_rate
        print(f"TTS: '{text[:40]}...' -> {duration:.1f}s audio ({sample_rate}Hz)")
4.5 Model-Level Validation Matrix
Models Every Hardware Vendor Must Validate

| Category | Model | Parameters | Key Validation |
|---|---|---|---|
| LLM | Llama 3 8B | 8B | Perplexity, generation quality |
| LLM | Llama 3 70B | 70B | Multi-GPU, memory management |
| LLM | Mistral 7B | 7B | Sliding window attention |
| LLM | GPT-NeoX 20B | 20B | Rotary embeddings |
| CV | ResNet-50 | 25M | Classification accuracy, efficiency |
| CV | ViT-B/16 | 86M | Attention-based vision |
| CV | YOLO v8 | 11-68M | Real-time detection |
| CV | Stable Diffusion | 1B | Image generation quality |
| Speech | Whisper Large | 1.5B | WER on LibriSpeech |
| Speech | Wav2Vec 2.0 | 317M | ASR accuracy |
| Multi | CLIP | 428M | Vision-language alignment |
| Multi | LLaVA | 7-13B | Visual QA correctness |
4.6 Profiling Model Performance
PyTorch Profiler
def profile_model(model, input_data, device='cuda'):
    """Profile model execution with PyTorch profiler."""
    from torch.profiler import profile, ProfilerActivity, schedule
    model = model.to(device).eval()
    input_data = input_data.to(device)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=3, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for _ in range(10):
            with torch.no_grad():
                model(input_data)
            prof.step()
    # Print summary sorted by total CUDA time
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    return prof
GPU Utilization Monitoring
GPU utilization monitoring during benchmarks provides essential context for interpreting performance numbers: high throughput with low GPU utilization (< 80%) suggests the workload is bottlenecked on data loading, CPU preprocessing, or PCIe transfers rather than GPU compute. The monitor samples nvidia-smi metrics every second, capturing GPU utilization percentage, memory utilization, power draw, and temperature. Peak memory usage reveals whether the model fits comfortably in HBM or is close to OOM. Average power during benchmarking is used to compute performance per watt (TFLOPS/W or tokens/sec/W), a critical metric for datacenter TCO calculations that directly affects hardware procurement decisions.
def monitor_gpu_utilization(duration_seconds=60):
    """Monitor GPU utilization during a benchmark."""
    import subprocess
    samples = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,utilization.memory,"
             "memory.used,memory.total,power.draw,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        ).strip()
        # On multi-GPU systems nvidia-smi prints one line per GPU;
        # this sketch samples only GPU 0
        values = [v.strip() for v in output.splitlines()[0].split(',')]
        samples.append({
            "gpu_util": float(values[0]),
            "mem_util": float(values[1]),
            "mem_used_mb": float(values[2]),
            "mem_total_mb": float(values[3]),
            "power_w": float(values[4]),
            "temp_c": float(values[5]),
        })
        time.sleep(1)
    # Summary
    avg_util = sum(s["gpu_util"] for s in samples) / len(samples)
    peak_mem = max(s["mem_used_mb"] for s in samples)
    avg_power = sum(s["power_w"] for s in samples) / len(samples)
    print(f"Avg GPU util: {avg_util:.0f}%, Peak mem: {peak_mem:.0f}MB, "
          f"Avg power: {avg_power:.0f}W")
    return samples
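The performance-per-watt metric mentioned above combines a throughput measurement with the power samples collected by the monitor. A minimal sketch (the function name is illustrative; the sample values below are made up for demonstration):

```python
def performance_per_watt(tokens_per_sec, samples):
    """Tokens/sec per watt from a throughput number and power samples.

    `samples` is a list of dicts with a "power_w" key, as collected by
    monitor_gpu_utilization above.
    """
    avg_power = sum(s["power_w"] for s in samples) / len(samples)
    return tokens_per_sec / avg_power

# Hypothetical run: 1550 tok/s while drawing 300-320 W
samples = [{"power_w": 300.0}, {"power_w": 320.0}]
print(performance_per_watt(1550.0, samples))  # 5.0 tok/s per watt
```

For TCO comparisons the same ratio is computed with images/sec or TFLOPS in the numerator; whichever unit is used, power must be sampled during the benchmark itself, not at idle.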
4.7 Exercises
1. LLM Logits Parity: Load a small LLM (e.g., GPT-2) on CPU (FP32) and GPU (FP16). Compare logits for 10 different prompts. What is the average and max difference?
2. CV Model Sweep: Benchmark ResNet-50, ViT-B/16, and EfficientNet-B0 throughput (images/sec) across batch sizes {1, 8, 32, 64}. Which model is most efficient on your GPU?
3. Profiling: Use torch.profiler to profile a Llama-like transformer block. Identify the top-3 most time-consuming operations. Are they compute-bound or memory-bound?
4. Quantization Impact: Compare model accuracy and throughput for FP32, FP16, BF16, and INT8 (if available). At what point does accuracy degrade?
5. Cross-Device Parity: If you have access to both NVIDIA and AMD GPUs, run the same model on both and compare outputs. Are the top-5 predictions identical?
Key Takeaways
Model-level validation is the ultimate test: it's what customers care about
Always validate both correctness (right answer) and performance (fast enough)
LLM validation includes perplexity, generation quality, and TTFT/TPS metrics
CV validation requires accuracy parity and throughput benchmarks
Profiling identifies bottlenecks: is the model compute-bound or memory-bound?
Previous: 03_framework_validation.ipynb
Next: 05_e2e_pipeline_validation.ipynb
Back to Overview: README.md