Section 1: Hardware Validation

Power, Thermals, Memory & Stability for AI Accelerators

Duration: 5 hours
Difficulty: Intermediate

1.1 Why Hardware Validation Matters

Before any ML model runs on a GPU/NPU/TPU, the silicon itself must be validated under extreme workloads. AI workloads are unique because they:

  • Sustain 100% compute utilization for hours or days (unlike gaming or rendering)

  • Generate massive memory bandwidth demands (HBM3e at 5+ TB/s)

  • Create thermal hotspots from dense tensor cores running FP16/BF16/FP8

  • Require deterministic behavior — a bit-flip causes silent model corruption

Hardware validation engineers ensure the chip works correctly under these conditions.

1.2 Power Validation

Key Concepts

| Metric | Description | Typical Range |
|--------|-------------|---------------|
| TDP (Thermal Design Power) | Maximum sustained power draw | 300W–700W for datacenter GPUs |
| TBP (Total Board Power) | Power for entire accelerator card | 350W–800W |
| Power States (P-states) | Dynamic voltage/frequency scaling | P0 (max) → P8 (idle) |
| Power Capping | Firmware-enforced power limit | Configurable per card |
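These values can be read programmatically. A minimal NVIDIA-specific sketch using nvidia-smi's query interface (`pstate`, `power.draw`, and `power.limit` are standard `--query-gpu` fields):

```python
import subprocess

def parse_power_query(csv_line):
    """Parse a 'pstate, power.draw, power.limit' CSV line from nvidia-smi."""
    pstate, draw, limit = [f.strip() for f in csv_line.strip().split(",")]
    return {"pstate": pstate,
            "power_draw_w": float(draw),
            "power_limit_w": float(limit)}

def query_power_state(gpu_index=0):
    """Query the current P-state, power draw, and power limit of one GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pstate,power.draw,power.limit",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        text=True,
    )
    return parse_power_query(out)
```

At idle this should report P8 and a low draw; under sustained load, P0 near the configured limit.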

Validation Procedures

1. Sustained Load Power Test

# NVIDIA: Monitor power under sustained compute
nvidia-smi dmon -s pucvmet -d 1 > power_log.csv &
# Run a sustained GEMM workload
python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
    torch.cuda.synchronize()
"
kill %1
# AMD: Monitor power on ROCm
rocm-smi --showpower --showtemp --showuse -l 1 > amd_power_log.csv &
# Same workload: PyTorch ROCm builds expose AMD GPUs through the 'cuda' device
python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
    torch.cuda.synchronize()
"
kill %1

2. Power State Transition Test

# Verify GPU transitions between power states correctly
nvidia-smi -q -d PERFORMANCE  # Check current P-state
# Idle → should be P8
# Under load → should be P0

# Validate power cap enforcement
nvidia-smi -pl 400  # Set power limit to 400W
# Run workload and verify power never exceeds 400W ± margin
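A sketch of automating that check, sampling `power.draw` while the capped workload runs (the 10 W margin is an assumption to tune per platform):

```python
import subprocess
import time

def sample_power_draw(duration_s=60, interval_s=0.1, gpu_index=0):
    """Sample power.draw (watts) from nvidia-smi at a fixed interval."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits", "-i", str(gpu_index)],
            text=True,
        )
        samples.append(float(out.strip()))
        time.sleep(interval_s)
    return samples

def check_power_cap(samples, cap_w, margin_w=10.0):
    """Return (ok, worst): ok is True if no sample exceeds cap + margin."""
    worst = max(samples)
    return worst <= cap_w + margin_w, worst
```

Note that out-of-band sampling through nvidia-smi is too coarse to see sub-10ms enforcement latency; it only verifies the steady-state cap.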

3. Power Delivery Validation

  • Voltage droop under transient load spikes

  • VRM (Voltage Regulator Module) stability

  • Power supply ripple under sustained AI workloads

  • PCIe auxiliary power connector integrity

What to Validate

  • Sustained power stays within TDP ± 5%

  • Power capping is enforced within 10ms

  • P-state transitions are clean (no clock glitches)

  • No voltage droop-induced errors under burst loads

  • Power measurement accuracy vs external power meter

1.3 Thermal Validation

Key Concepts

| Component | Max Temp | Throttle Temp | Shutdown Temp |
|-----------|----------|---------------|---------------|
| GPU Die | 83–95°C | 80–90°C | 95–105°C |
| HBM | 95–105°C | 95°C | 110°C |
| VRM | 100–125°C | 110°C | 130°C |
| Junction (Tj) | 95–110°C | Device-specific | Device-specific |

Validation Procedures

1. Thermal Characterization

import subprocess
import time
import csv

def monitor_thermal(duration_seconds=3600, interval=1):
    """Monitor GPU temperatures during a stress test."""
    results = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,temperature.memory,"
             "power.draw,clocks.sm,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        )
        results.append(output.strip())
        time.sleep(interval)
    return results
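The raw rows can be parsed into structured records for plotting or threshold checks. A small helper assuming the field order used in `monitor_thermal()` above (nvidia-smi prints `[N/A]` for fields a given card does not expose):

```python
def _num(field):
    """Convert an nvidia-smi field to float; '[N/A]' becomes None."""
    return None if field.startswith("[") else float(field)

def parse_thermal_sample(csv_line):
    """Parse one CSV row produced by monitor_thermal()."""
    ts, t_gpu, t_mem, power, sm_clock, util = [f.strip() for f in csv_line.split(",")]
    return {
        "timestamp": ts,
        "temp_gpu_c": _num(t_gpu),
        "temp_mem_c": _num(t_mem),
        "power_w": _num(power),
        "sm_clock_mhz": _num(sm_clock),
        "util_pct": _num(util),
    }
```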

2. Thermal Throttling Detection

Modern AI accelerators dynamically reduce clock frequency when the GPU die or HBM approaches its thermal limit, a mechanism known as thermal throttling. During sustained compute workloads like large FP16 GEMM operations, junction temperatures can climb steadily until the firmware intervenes. The detection strategy below monitors per-iteration execution time: if iteration latency increases by more than 20% above the warm-up baseline, the clock has likely been reduced. Identifying the onset time of throttling is critical for datacenter capacity planning – it tells engineers how long a workload can sustain peak performance before cooling becomes the bottleneck, and whether the thermal solution (heatsink, liquid cooling, airflow design) is adequate for 24/7 AI training workloads.

import torch
import time

def detect_thermal_throttle(matrix_size=8192, iterations=5000):
    """Run sustained compute and detect clock frequency drops."""
    a = torch.randn(matrix_size, matrix_size, device='cuda', dtype=torch.float16)
    times = []

    for i in range(iterations):
        start = time.perf_counter()
        c = torch.matmul(a, a)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        times.append(elapsed)

        # Detect throttling: iteration time increases >20%
        if i > 100:
            baseline = sum(times[10:100]) / 90
            if elapsed > baseline * 1.2:
                print(f"Potential throttle at iteration {i}: "
                      f"{elapsed:.4f}s vs baseline {baseline:.4f}s")

    return times

3. Cooling Validation

  • Fan curve verification (RPM vs temperature)

  • Liquid cooling loop integrity (for datacenter cards)

  • Ambient temperature impact (25°C vs 35°C inlet)

  • Sustained workload: temperature must stabilize, not continuously rise
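The last bullet can be checked mechanically: treat the temperature as stabilized once a trailing window of samples stays within a narrow band. A sketch (window length and band width are assumptions to tune per thermal solution):

```python
def temperature_stabilized(temps_c, window=60, band_c=1.0):
    """True if the last `window` samples vary by no more than band_c degrees.

    A continuously rising trace never satisfies this, which is exactly the
    failure mode the cooling validation is looking for.
    """
    if len(temps_c) < window:
        return False  # not enough data yet
    recent = temps_c[-window:]
    return max(recent) - min(recent) <= band_c
```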

What to Validate

  • GPU die temp stabilizes under sustained AI load

  • HBM temperature stays below throttle threshold

  • Thermal throttling activates correctly (no hard shutdowns)

  • Performance is predictable across ambient temperature range

  • Fan/cooling system responds within expected timeframes

1.4 Memory Validation

HBM (High Bandwidth Memory) Validation

Modern AI accelerators use HBM2e/HBM3/HBM3e for massive bandwidth:

| GPU | Memory Type | Capacity | Bandwidth |
|-----|-------------|----------|-----------|
| NVIDIA H100 | HBM3 | 80 GB | 3.35 TB/s |
| NVIDIA H200 | HBM3e | 141 GB | 4.8 TB/s |
| AMD MI300X | HBM3 | 192 GB | 5.3 TB/s |
| AMD MI325X | HBM3e | 256 GB | 6.0 TB/s |

1. Memory Bandwidth Test

import torch
import time

def measure_hbm_bandwidth(size_gb=10, dtype=torch.float16):
    """Measure sustained HBM bandwidth with large tensor copies."""
    numel = int(size_gb * 1e9 / (2 if dtype == torch.float16 else 4))
    src = torch.randn(numel, device='cuda', dtype=dtype)
    dst = torch.empty_like(src)

    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()

    # Benchmark
    start = time.perf_counter()
    iterations = 100
    for _ in range(iterations):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    bytes_total = numel * src.element_size() * 2 * iterations  # read + write
    bandwidth_gbs = bytes_total / elapsed / 1e9
    print(f"Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
    return bandwidth_gbs
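To turn the measurement into a pass/fail gate against the datasheet numbers, note that simple device-to-device copies typically reach only part of the raw peak, so it helps to compare against a derated target. A sketch (the 85% derating here is an assumption to calibrate per part):

```python
def bandwidth_within_spec(measured_gbs, datasheet_gbs, efficiency=0.85, tolerance=0.05):
    """Return (ok, target_gbs): pass if measured bandwidth reaches the
    derated target within tolerance.

    efficiency: assumed achievable fraction of the raw datasheet peak for
    a copy benchmark; calibrate against known-good hardware.
    """
    target = datasheet_gbs * efficiency
    return measured_gbs >= target * (1 - tolerance), target
```

For example, combined with `measure_hbm_bandwidth()` above, an H100 run might be gated with `bandwidth_within_spec(measured, 3350.0)`.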

2. Memory Integrity Test

HBM (High Bandwidth Memory) integrity validation writes known bit patterns to GPU memory and reads them back to verify zero corruption. Unlike ECC, which corrects single-bit errors transparently, this test catches systematic failures such as stuck bits, address decode errors, and refresh-related data loss under thermal stress. The test uses multiple patterns – sequential integers, all-ones, all-zeros, and random data – because different failure modes are sensitive to different data patterns. A single undetected bit flip in a model’s weight tensor can silently corrupt inference outputs or cause training loss to diverge, making memory integrity one of the hard gates in any hardware validation checklist before a GPU is cleared for production AI workloads.

import torch

def memory_integrity_test(size_gb=1):
    """Write known patterns into device memory, read them back, and compare."""
    numel = int(size_gb * 1e9 / 4)  # 4-byte elements

    # Different failure modes (stuck bits, decode errors, refresh loss) are
    # sensitive to different data patterns, so cycle through several.
    patterns = {
        "sequential": torch.arange(numel, dtype=torch.int32),
        "all-ones": torch.ones(numel, dtype=torch.float32),
        "all-zeros": torch.zeros(numel, dtype=torch.float32),
        "random": torch.randn(numel, dtype=torch.float32),
    }

    for name, host_ref in patterns.items():
        device_copy = host_ref.to('cuda')   # write pattern into HBM
        readback = device_copy.cpu()        # read it back to the host
        assert torch.equal(host_ref, readback), f"{name} pattern failed!"

    print(f"Memory integrity test passed for {size_gb} GB")

3. ECC (Error Correcting Code) Validation

# Check ECC status
nvidia-smi -q -d ECC

# Check for correctable and uncorrectable errors
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# Inject ECC error (requires admin privileges, NVIDIA-specific)
# nvidia-smi --ecc-config=1  # Enable ECC
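The error counters can be turned into an automated gate. A sketch that parses the CSV output of the query above and fails on any uncorrectable error (the counters read `[N/A]` when ECC is disabled):

```python
import subprocess

def parse_ecc_counts(csv_line):
    """Parse 'corrected, uncorrected' aggregate totals from nvidia-smi CSV.

    '[N/A]' (ECC disabled or unsupported) maps to None.
    """
    parts = [p.strip() for p in csv_line.strip().split(",")]
    to_count = lambda p: None if p.startswith("[") else int(p)
    return {"corrected": to_count(parts[0]), "uncorrected": to_count(parts[1])}

def check_ecc(gpu_index=0):
    """Fail if any uncorrectable ECC errors have accumulated on the GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        text=True,
    )
    counts = parse_ecc_counts(out)
    assert not counts["uncorrected"], f"Uncorrectable ECC errors: {counts['uncorrected']}"
    return counts
```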

What to Validate

  • HBM bandwidth meets datasheet specification (±5%)

  • No bit errors under sustained read/write patterns

  • ECC correctly detects and corrects single-bit errors

  • Memory allocation and deallocation doesn’t leak

  • Out-of-memory handling is graceful (no kernel panic)

  • PCIe DMA transfers are correct (host ↔ device)
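The PCIe DMA bullet can be exercised directly with a host-to-device-to-host round trip. A sketch using pinned (page-locked) host memory so the copies go through the DMA engine; the `device` parameter exists only so the logic can be smoke-tested on CPU:

```python
import torch

def pcie_roundtrip_test(size_mb=256, device="cuda"):
    """Round-trip a buffer host -> device -> host and compare bit-exactly."""
    numel = size_mb * 1024 * 1024 // 4  # float32 elements
    pin = device == "cuda"  # pinned memory only applies to CUDA transfers
    host_src = torch.randn(numel, dtype=torch.float32, pin_memory=pin)
    device_buf = host_src.to(device, non_blocking=pin)    # host -> device
    host_dst = torch.empty(numel, dtype=torch.float32, pin_memory=pin)
    host_dst.copy_(device_buf, non_blocking=pin)          # device -> host
    if pin:
        torch.cuda.synchronize()
    return torch.equal(host_src, host_dst)
```

Repeating this across a range of sizes (and concurrently with compute) stresses the DMA engines the way real training input pipelines do.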

1.5 Stability Validation

Long-Duration Stress Testing

AI training runs last days to weeks. The hardware must be stable throughout.

1. 24-Hour Stress Test Template

import torch
import time
import json
from pathlib import Path

def run_stability_test(duration_hours=24, log_interval_seconds=60):
    """Run a 24-hour stability test with mixed workloads."""
    log_path = Path("stability_log.jsonl")
    start = time.time()
    end = start + duration_hours * 3600
    iteration = 0

    # Allocate tensors of varying sizes
    sizes = [1024, 2048, 4096, 8192]

    while time.time() < end:
        size = sizes[iteration % len(sizes)]
        try:
            # Mixed precision compute
            a = torch.randn(size, size, device='cuda', dtype=torch.float16)
            b = torch.randn(size, size, device='cuda', dtype=torch.float16)
            c = torch.matmul(a, b)

            # Verify basic numerical properties
            assert not torch.isnan(c).any(), f"NaN detected at iteration {iteration}"
            assert not torch.isinf(c).any(), f"Inf detected at iteration {iteration}"

            torch.cuda.synchronize()

            # Log metrics (iteration count is used as a rough proxy for time)
            if iteration % (log_interval_seconds // 2) == 0:
                mem_used = torch.cuda.memory_allocated() / 1e9
                mem_total = torch.cuda.get_device_properties(0).total_memory / 1e9
                entry = {
                    "iteration": iteration,
                    "elapsed_hours": (time.time() - start) / 3600,
                    "matrix_size": size,
                    "memory_gb": round(mem_used, 2),
                    "memory_total_gb": round(mem_total, 2),
                    "status": "ok"
                }
                with open(log_path, "a") as f:
                    f.write(json.dumps(entry) + "\n")

            del a, b, c

        except Exception as e:
            entry = {
                "iteration": iteration,
                "elapsed_hours": (time.time() - start) / 3600,
                "error": str(e),
                "status": "FAIL"
            }
            with open(log_path, "a") as f:
                f.write(json.dumps(entry) + "\n")
            raise

        iteration += 1

    print(f"Stability test completed: {iteration} iterations over {duration_hours}h")

2. GPU Reset / Recovery Test

# Verify GPU recovers from reset
nvidia-smi --gpu-reset -i 0
sleep 5
nvidia-smi  # Should show healthy GPU

# Verify no zombie processes after reset
nvidia-smi --query-compute-apps=pid --format=csv,noheader
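Scripted recovery checks usually need to poll rather than sleep for a fixed time. A sketch that waits until the GPU answers queries again (the 2-second poll interval is an arbitrary choice):

```python
import subprocess
import time

def wait_for_gpu_healthy(gpu_index=0, timeout_s=60, poll_s=2):
    """Poll nvidia-smi until the GPU responds to queries, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            subprocess.check_output(
                ["nvidia-smi", "-i", str(gpu_index),
                 "--query-gpu=name", "--format=csv,noheader"],
                text=True, stderr=subprocess.DEVNULL,
            )
            return True  # query succeeded: driver and GPU are back
        except (subprocess.CalledProcessError, FileNotFoundError):
            time.sleep(poll_s)
    return False
```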

What to Validate

  • No hangs, crashes, or kernel panics over 24+ hours

  • No NaN/Inf outputs under sustained compute

  • GPU recovers cleanly from software resets

  • Memory doesn’t leak across thousands of allocations

  • Performance doesn’t degrade over time (thermal throttle aside)

  • No XID errors (NVIDIA) or RAS events (AMD) in system logs
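The XID check can be scripted: the NVIDIA driver writes XID events to the kernel ring buffer as lines like `NVRM: Xid (PCI:0000:3b:00): 79, ...`. A sketch (reading the kernel log typically requires elevated privileges):

```python
import re
import subprocess

XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

def parse_xid_events(log_text):
    """Extract (pci_address, xid_code) pairs from kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]

def scan_kernel_log_for_xid():
    """Scan dmesg output for NVIDIA XID events; an empty list is a pass."""
    log = subprocess.check_output(["dmesg"], text=True)
    return parse_xid_events(log)
```

The XID code identifies the failure class (e.g. 79 is "GPU has fallen off the bus"); any occurrence during a soak test warrants investigation.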

1.6 Platform-Specific Tools Reference

NVIDIA Tools

| Tool | Purpose |
|------|---------|
| nvidia-smi | GPU monitoring, power management |
| dcgm-exporter | Datacenter GPU metrics for Prometheus |
| nvidia-bug-report.sh | Collect diagnostic bundle |
| compute-sanitizer (replaces the deprecated cuda-memcheck) | Memory error detection |

AMD Tools

| Tool | Purpose |
|------|---------|
| rocm-smi | GPU monitoring, power management |
| rocm-smi --showras | RAS (Reliability) error reporting |
| amd-smi | Next-gen monitoring tool |
| rocm-bandwidth-test | Memory bandwidth measurement |

Qualcomm Tools

| Tool | Purpose |
|------|---------|
| qaic-util | Cloud AI 100 card management |
| qaic-runner | Inference execution and monitoring |

AWS (Annapurna)

| Tool | Purpose |
|------|---------|
| neuron-top | Real-time Neuron device monitoring |
| neuron-monitor | Detailed performance metrics |
| neuron-ls | List Neuron devices |

1.7 Exercises

  1. Power Monitoring: Write a script that logs GPU power every second during a 10-minute GEMM workload. Plot power vs time. Does it stabilize?

  2. Thermal Characterization: Run a sustained FP16 GEMM for 30 minutes. Log temperature every 5 seconds. At what point does temperature stabilize? Is there throttling?

  3. Memory Bandwidth: Measure actual HBM bandwidth with varying tensor sizes (1 MB → 10 GB). Compare to the datasheet specification. What percentage of peak do you achieve?

  4. ECC Validation: Check your GPU’s ECC status. If enabled, run a memory stress test and verify zero uncorrectable errors.

  5. Stability Soak: Run a 1-hour mixed-precision workload. Verify no NaN/Inf in outputs and no memory leaks.

Key Takeaways

  • Hardware validation is the foundation — all software correctness depends on it

  • AI workloads stress hardware differently than traditional HPC (sustained, memory-bound)

  • Power and thermal validation prevent field failures and datacenter issues

  • Memory integrity is critical — a single bit flip can corrupt an entire training run

  • Stability testing must match real-world durations (24h+ soak tests)

Next: 02_kernel_validation.ipynb — Validate compute kernels (GEMM, conv, attention)
Back to Overview: README.md