Section 1: Hardware Validation

Power, Thermals, Memory & Stability for AI Accelerators

Duration: 5 hours
Difficulty: Intermediate

1.1 Why Hardware Validation Matters

Before any ML model runs on a GPU/NPU/TPU, the silicon itself must be validated under extreme workloads. AI workloads are unique because they:

  • Sustain 100% compute utilization for hours or days (unlike gaming or rendering)

  • Generate massive memory bandwidth demands (HBM3e at 5+ TB/s)

  • Create thermal hotspots from dense tensor cores running FP16/BF16/FP8

  • Require deterministic behavior — a bit-flip causes silent model corruption

Hardware validation engineers ensure the chip works correctly under these conditions.

1.2 Power Validation

Key Concepts

| Metric | Description | Typical Range |
|--------|-------------|---------------|
| TDP (Thermal Design Power) | Maximum sustained power draw | 300W–700W for datacenter GPUs |
| TBP (Total Board Power) | Power for entire accelerator card | 350W–800W |
| Power States (P-states) | Dynamic voltage/frequency scaling | P0 (max) → P8 (idle) |
| Power Capping | Firmware-enforced power limit | Configurable per card |
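These values can be read programmatically. A minimal NVIDIA-specific sketch using nvidia-smi's query interface (`pstate`, `power.draw`, and `power.limit` are standard `--query-gpu` fields):

```python
import subprocess

def parse_power_query(csv_line):
    """Parse a 'pstate, power.draw, power.limit' CSV line from nvidia-smi."""
    pstate, draw, limit = [f.strip() for f in csv_line.strip().split(",")]
    return {"pstate": pstate,
            "power_draw_w": float(draw),
            "power_limit_w": float(limit)}

def query_power_state(gpu_index=0):
    """Query the current P-state, power draw, and power limit of one GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pstate,power.draw,power.limit",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        text=True,
    )
    return parse_power_query(out)
```

At idle this should report P8 and a low draw; under sustained load, P0 near the configured limit.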

Validation Procedures

1. Sustained Load Power Test

# NVIDIA: Monitor power under sustained compute
nvidia-smi dmon -s pucvmet -d 1 > power_log.csv &
# Run a sustained GEMM workload
python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
    torch.cuda.synchronize()
"
kill %1
# AMD: Monitor power on ROCm
rocm-smi --showpower --showtemp --showuse -l 1 > amd_power_log.csv &
# Same workload: PyTorch ROCm builds expose AMD GPUs through the 'cuda' device
python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
    torch.cuda.synchronize()
"
kill %1

2. Power State Transition Test

# Verify GPU transitions between power states correctly
nvidia-smi -q -d PERFORMANCE  # Check current P-state
# Idle → should be P8
# Under load → should be P0

# Validate power cap enforcement
nvidia-smi -pl 400  # Set power limit to 400W
# Run workload and verify power never exceeds 400W ± margin
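A sketch of automating that check, sampling `power.draw` while the capped workload runs (the 10 W margin is an assumption to tune per platform):

```python
import subprocess
import time

def sample_power_draw(duration_s=60, interval_s=0.1, gpu_index=0):
    """Sample power.draw (watts) from nvidia-smi at a fixed interval."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits", "-i", str(gpu_index)],
            text=True,
        )
        samples.append(float(out.strip()))
        time.sleep(interval_s)
    return samples

def check_power_cap(samples, cap_w, margin_w=10.0):
    """Return (ok, worst): ok is True if no sample exceeds cap + margin."""
    worst = max(samples)
    return worst <= cap_w + margin_w, worst
```

Note that out-of-band sampling through nvidia-smi is too coarse to see sub-10ms enforcement latency; it only verifies the steady-state cap.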

3. Power Delivery Validation

  • Voltage droop under transient load spikes

  • VRM (Voltage Regulator Module) stability

  • Power supply ripple under sustained AI workloads

  • PCIe auxiliary power connector integrity

What to Validate

  • Sustained power stays within TDP ± 5%

  • Power capping is enforced within 10ms

  • P-state transitions are clean (no clock glitches)

  • No voltage droop-induced errors under burst loads

  • Power measurement accuracy vs external power meter

1.3 Thermal Validation

Key Concepts

| Component | Max Temp | Throttle Temp | Shutdown Temp |
|-----------|----------|---------------|---------------|
| GPU Die | 83–95°C | 80–90°C | 95–105°C |
| HBM | 95–105°C | 95°C | 110°C |
| VRM | 100–125°C | 110°C | 130°C |
| Junction (Tj) | 95–110°C | Device-specific | Device-specific |

Validation Procedures

1. Thermal Characterization

import subprocess
import time
import csv

def monitor_thermal(duration_seconds=3600, interval=1):
    """Monitor GPU temperatures during a stress test."""
    results = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,temperature.memory,"
             "power.draw,clocks.sm,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        )
        results.append(output.strip())
        time.sleep(interval)
    return results
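The raw rows can be parsed into structured records for plotting or threshold checks. A small helper assuming the field order used in `monitor_thermal()` above (nvidia-smi prints `[N/A]` for fields a given card does not expose):

```python
def _num(field):
    """Convert an nvidia-smi field to float; '[N/A]' becomes None."""
    return None if field.startswith("[") else float(field)

def parse_thermal_sample(csv_line):
    """Parse one CSV row produced by monitor_thermal()."""
    ts, t_gpu, t_mem, power, sm_clock, util = [f.strip() for f in csv_line.split(",")]
    return {
        "timestamp": ts,
        "temp_gpu_c": _num(t_gpu),
        "temp_mem_c": _num(t_mem),
        "power_w": _num(power),
        "sm_clock_mhz": _num(sm_clock),
        "util_pct": _num(util),
    }
```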

2. Thermal Throttling Detection

Modern AI accelerators dynamically reduce clock frequency when the GPU die or HBM approaches its thermal limit, a mechanism known as thermal throttling. During sustained compute workloads like large FP16 GEMM operations, junction temperatures can climb steadily until the firmware intervenes. The detection strategy below monitors per-iteration execution time: if iteration latency increases by more than 20% above the warm-up baseline, the clock has likely been reduced. Identifying the onset time of throttling is critical for datacenter capacity planning – it tells engineers how long a workload can sustain peak performance before cooling becomes the bottleneck, and whether the thermal solution (heatsink, liquid cooling, airflow design) is adequate for 24/7 AI training workloads.

import torch
import time

def detect_thermal_throttle(matrix_size=8192, iterations=5000):
    """Run sustained compute and detect clock frequency drops."""
    a = torch.randn(matrix_size, matrix_size, device='cuda', dtype=torch.float16)
    times = []

    for i in range(iterations):
        start = time.perf_counter()
        c = torch.matmul(a, a)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        times.append(elapsed)

        # Detect throttling: iteration time increases >20%
        if i > 100:
            baseline = sum(times[10:100]) / 90
            if elapsed > baseline * 1.2:
                print(f"Potential throttle at iteration {i}: "
                      f"{elapsed:.4f}s vs baseline {baseline:.4f}s")

    return times

3. Cooling Validation

  • Fan curve verification (RPM vs temperature)

  • Liquid cooling loop integrity (for datacenter cards)

  • Ambient temperature impact (25°C vs 35°C inlet)

  • Sustained workload: temperature must stabilize, not continuously rise
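The last bullet can be checked mechanically: treat the temperature as stabilized once a trailing window of samples stays within a narrow band. A sketch (window length and band width are assumptions to tune per thermal solution):

```python
def temperature_stabilized(temps_c, window=60, band_c=1.0):
    """True if the last `window` samples vary by no more than band_c degrees.

    A continuously rising trace never satisfies this, which is exactly the
    failure mode the cooling validation is looking for.
    """
    if len(temps_c) < window:
        return False  # not enough data yet
    recent = temps_c[-window:]
    return max(recent) - min(recent) <= band_c
```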

What to Validate

  • GPU die temp stabilizes under sustained AI load

  • HBM temperature stays below throttle threshold

  • Thermal throttling activates correctly (no hard shutdowns)

  • Performance is predictable across ambient temperature range

  • Fan/cooling system responds within expected timeframes

1.4 Memory Validation

HBM (High Bandwidth Memory) Validation

Modern AI accelerators use HBM2e/HBM3/HBM3e for massive bandwidth:

| GPU | Memory Type | Capacity | Bandwidth |
|-----|-------------|----------|-----------|
| NVIDIA H100 | HBM3 | 80 GB | 3.35 TB/s |
| NVIDIA H200 | HBM3e | 141 GB | 4.8 TB/s |
| AMD MI300X | HBM3 | 192 GB | 5.3 TB/s |
| AMD MI325X | HBM3e | 256 GB | 6.0 TB/s |

1. Memory Bandwidth Test

import torch
import time

def measure_hbm_bandwidth(size_gb=10, dtype=torch.float16):
    """Measure sustained HBM bandwidth with large tensor copies."""
    numel = int(size_gb * 1e9 / (2 if dtype == torch.float16 else 4))
    src = torch.randn(numel, device='cuda', dtype=dtype)
    dst = torch.empty_like(src)

    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()

    # Benchmark
    start = time.perf_counter()
    iterations = 100
    for _ in range(iterations):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    bytes_total = numel * src.element_size() * 2 * iterations  # read + write
    bandwidth_gbs = bytes_total / elapsed / 1e9
    print(f"Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
    return bandwidth_gbs
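To turn the measurement into a pass/fail gate against the datasheet numbers, note that simple device-to-device copies typically reach only part of the raw peak, so it helps to compare against a derated target. A sketch (the 85% derating here is an assumption to calibrate per part):

```python
def bandwidth_within_spec(measured_gbs, datasheet_gbs, efficiency=0.85, tolerance=0.05):
    """Return (ok, target_gbs): pass if measured bandwidth reaches the
    derated target within tolerance.

    efficiency: assumed achievable fraction of the raw datasheet peak for
    a copy benchmark; calibrate against known-good hardware.
    """
    target = datasheet_gbs * efficiency
    return measured_gbs >= target * (1 - tolerance), target
```

For example, combined with `measure_hbm_bandwidth()` above, an H100 run might be gated with `bandwidth_within_spec(measured, 3350.0)`.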

2. Memory Integrity Test

HBM (High Bandwidth Memory) integrity validation writes known bit patterns to GPU memory and reads them back to verify zero corruption. Unlike ECC, which corrects single-bit errors transparently, this test catches systematic failures such as stuck bits, address decode errors, and refresh-related data loss under thermal stress. The test uses multiple patterns – sequential integers, all-ones, all-zeros, and random data – because different failure modes are sensitive to different data patterns. A single undetected bit flip in a model’s weight tensor can silently corrupt inference outputs or cause training loss to diverge, making memory integrity one of the hard gates in any hardware validation checklist before a GPU is cleared for production AI workloads.

import torch

def memory_integrity_test(size_gb=1):
    """Write known patterns into device memory, read them back, and compare."""
    numel = int(size_gb * 1e9 / 4)  # 4-byte elements

    # Different failure modes (stuck bits, decode errors, refresh loss) are
    # sensitive to different data patterns, so cycle through several.
    patterns = {
        "sequential": torch.arange(numel, dtype=torch.int32),
        "all-ones": torch.ones(numel, dtype=torch.float32),
        "all-zeros": torch.zeros(numel, dtype=torch.float32),
        "random": torch.randn(numel, dtype=torch.float32),
    }

    for name, host_ref in patterns.items():
        device_copy = host_ref.to('cuda')   # write pattern into HBM
        readback = device_copy.cpu()        # read it back to the host
        assert torch.equal(host_ref, readback), f"{name} pattern failed!"

    print(f"Memory integrity test passed for {size_gb} GB")

3. ECC (Error Correcting Code) Validation

# Check ECC status
nvidia-smi -q -d ECC

# Check for correctable and uncorrectable errors
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# Inject ECC error (requires admin privileges, NVIDIA-specific)
# nvidia-smi --ecc-config=1  # Enable ECC
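The error counters can be turned into an automated gate. A sketch that parses the CSV output of the query above and fails on any uncorrectable error (the counters read `[N/A]` when ECC is disabled):

```python
import subprocess

def parse_ecc_counts(csv_line):
    """Parse 'corrected, uncorrected' aggregate totals from nvidia-smi CSV.

    '[N/A]' (ECC disabled or unsupported) maps to None.
    """
    parts = [p.strip() for p in csv_line.strip().split(",")]
    to_count = lambda p: None if p.startswith("[") else int(p)
    return {"corrected": to_count(parts[0]), "uncorrected": to_count(parts[1])}

def check_ecc(gpu_index=0):
    """Fail if any uncorrectable ECC errors have accumulated on the GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        text=True,
    )
    counts = parse_ecc_counts(out)
    assert not counts["uncorrected"], f"Uncorrectable ECC errors: {counts['uncorrected']}"
    return counts
```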

What to Validate

  • HBM bandwidth meets datasheet specification (±5%)

  • No bit errors under sustained read/write patterns

  • ECC correctly detects and corrects single-bit errors

  • Memory allocation and deallocation doesn’t leak

  • Out-of-memory handling is graceful (no kernel panic)

  • PCIe DMA transfers are correct (host ↔ device)
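The PCIe DMA bullet can be exercised directly with a host-to-device-to-host round trip. A sketch using pinned (page-locked) host memory so the copies go through the DMA engine; the `device` parameter exists only so the logic can be smoke-tested on CPU:

```python
import torch

def pcie_roundtrip_test(size_mb=256, device="cuda"):
    """Round-trip a buffer host -> device -> host and compare bit-exactly."""
    numel = size_mb * 1024 * 1024 // 4  # float32 elements
    pin = device == "cuda"  # pinned memory only applies to CUDA transfers
    host_src = torch.randn(numel, dtype=torch.float32, pin_memory=pin)
    device_buf = host_src.to(device, non_blocking=pin)    # host -> device
    host_dst = torch.empty(numel, dtype=torch.float32, pin_memory=pin)
    host_dst.copy_(device_buf, non_blocking=pin)          # device -> host
    if pin:
        torch.cuda.synchronize()
    return torch.equal(host_src, host_dst)
```

Repeating this across a range of sizes (and concurrently with compute) stresses the DMA engines the way real training input pipelines do.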

1.5 Stability Validation

Long-Duration Stress Testing

AI training runs last days to weeks. The hardware must be stable throughout.

1. 24-Hour Stress Test Template

import torch
import time
import json
from pathlib import Path

def run_stability_test(duration_hours=24, log_interval_seconds=60):
    """Run a 24-hour stability test with mixed workloads."""
    log_path = Path("stability_log.jsonl")
    start = time.time()
    end = start + duration_hours * 3600
    iteration = 0

    # Allocate tensors of varying sizes
    sizes = [1024, 2048, 4096, 8192]

    while time.time() < end:
        size = sizes[iteration % len(sizes)]
        try:
            # Mixed precision compute
            a = torch.randn(size, size, device='cuda', dtype=torch.float16)
            b = torch.randn(size, size, device='cuda', dtype=torch.float16)
            c = torch.matmul(a, b)

            # Verify basic numerical properties
            assert not torch.isnan(c).any(), f"NaN detected at iteration {iteration}"
            assert not torch.isinf(c).any(), f"Inf detected at iteration {iteration}"

            torch.cuda.synchronize()

            # Log metrics (iteration count is used as a rough proxy for time)
            if iteration % (log_interval_seconds // 2) == 0:
                mem_used = torch.cuda.memory_allocated() / 1e9
                mem_total = torch.cuda.get_device_properties(0).total_memory / 1e9
                entry = {
                    "iteration": iteration,
                    "elapsed_hours": (time.time() - start) / 3600,
                    "matrix_size": size,
                    "memory_gb": round(mem_used, 2),
                    "memory_total_gb": round(mem_total, 2),
                    "status": "ok"
                }
                with open(log_path, "a") as f:
                    f.write(json.dumps(entry) + "\n")

            del a, b, c

        except Exception as e:
            entry = {
                "iteration": iteration,
                "elapsed_hours": (time.time() - start) / 3600,
                "error": str(e),
                "status": "FAIL"
            }
            with open(log_path, "a") as f:
                f.write(json.dumps(entry) + "\n")
            raise

        iteration += 1

    print(f"Stability test completed: {iteration} iterations over {duration_hours}h")

2. GPU Reset / Recovery Test

# Verify GPU recovers from reset
nvidia-smi --gpu-reset -i 0
sleep 5
nvidia-smi  # Should show healthy GPU

# Verify no zombie processes after reset
nvidia-smi --query-compute-apps=pid --format=csv,noheader
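Scripted recovery checks usually need to poll rather than sleep for a fixed time. A sketch that waits until the GPU answers queries again (the 2-second poll interval is an arbitrary choice):

```python
import subprocess
import time

def wait_for_gpu_healthy(gpu_index=0, timeout_s=60, poll_s=2):
    """Poll nvidia-smi until the GPU responds to queries, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            subprocess.check_output(
                ["nvidia-smi", "-i", str(gpu_index),
                 "--query-gpu=name", "--format=csv,noheader"],
                text=True, stderr=subprocess.DEVNULL,
            )
            return True  # query succeeded: driver and GPU are back
        except (subprocess.CalledProcessError, FileNotFoundError):
            time.sleep(poll_s)
    return False
```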

What to Validate

  • No hangs, crashes, or kernel panics over 24+ hours

  • No NaN/Inf outputs under sustained compute

  • GPU recovers cleanly from software resets

  • Memory doesn’t leak across thousands of allocations

  • Performance doesn’t degrade over time (thermal throttle aside)

  • No XID errors (NVIDIA) or RAS events (AMD) in system logs
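The XID check can be scripted: the NVIDIA driver writes XID events to the kernel ring buffer as lines like `NVRM: Xid (PCI:0000:3b:00): 79, ...`. A sketch (reading the kernel log typically requires elevated privileges):

```python
import re
import subprocess

XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

def parse_xid_events(log_text):
    """Extract (pci_address, xid_code) pairs from kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]

def scan_kernel_log_for_xid():
    """Scan dmesg output for NVIDIA XID events; an empty list is a pass."""
    log = subprocess.check_output(["dmesg"], text=True)
    return parse_xid_events(log)
```

The XID code identifies the failure class (e.g. 79 is "GPU has fallen off the bus"); any occurrence during a soak test warrants investigation.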

1.6 Platform-Specific Tools Reference

NVIDIA Tools

| Tool | Purpose |
|------|---------|
| nvidia-smi | GPU monitoring, power management |
| dcgm-exporter | Datacenter GPU metrics for Prometheus |
| nvidia-bug-report.sh | Collect diagnostic bundle |
| compute-sanitizer (replaces the deprecated cuda-memcheck) | Memory error detection |

AMD Tools

| Tool | Purpose |
|------|---------|
| rocm-smi | GPU monitoring, power management |
| rocm-smi --showras | RAS (Reliability) error reporting |
| amd-smi | Next-gen monitoring tool |
| rocm-bandwidth-test | Memory bandwidth measurement |

Qualcomm Tools

| Tool | Purpose |
|------|---------|
| qaic-util | Cloud AI 100 card management |
| qaic-runner | Inference execution and monitoring |

AWS (Annapurna)

| Tool | Purpose |
|------|---------|
| neuron-top | Real-time Neuron device monitoring |
| neuron-monitor | Detailed performance metrics |
| neuron-ls | List Neuron devices |

1.7 Exercises

  1. Power Monitoring: Write a script that logs GPU power every second during a 10-minute GEMM workload. Plot power vs time. Does it stabilize?

  2. Thermal Characterization: Run a sustained FP16 GEMM for 30 minutes. Log temperature every 5 seconds. At what point does temperature stabilize? Is there throttling?

  3. Memory Bandwidth: Measure actual HBM bandwidth with varying tensor sizes (1 MB → 10 GB). Compare to the datasheet specification. What percentage of peak do you achieve?

  4. ECC Validation: Check your GPU’s ECC status. If enabled, run a memory stress test and verify zero uncorrectable errors.

  5. Stability Soak: Run a 1-hour mixed-precision workload. Verify no NaN/Inf in outputs and no memory leaks.

Key Takeaways

  • Hardware validation is the foundation — all software correctness depends on it

  • AI workloads stress hardware differently than traditional HPC (sustained, memory-bound)

  • Power and thermal validation prevent field failures and datacenter issues

  • Memory integrity is critical — a single bit flip can corrupt an entire training run

  • Stability testing must match real-world durations (24h+ soak tests)

Next: 02_kernel_validation.ipynb — Validate compute kernels (GEMM, conv, attention)
Back to Overview: README.md