Section 1: Hardware Validation¶
Power, Thermals, Memory & Stability for AI Accelerators¶
Duration: 5 hours
Difficulty: Intermediate
1.1 Why Hardware Validation Matters¶
Before any ML model runs on a GPU/NPU/TPU, the silicon itself must be validated under extreme workloads. AI workloads are unique because they:
- Sustain 100% compute utilization for hours or days (unlike gaming or rendering)
- Generate massive memory bandwidth demands (HBM3e at 5+ TB/s)
- Create thermal hotspots from dense tensor cores running FP16/BF16/FP8
- Require deterministic behavior: a single bit flip causes silent model corruption
Hardware validation engineers ensure the chip works correctly under these conditions.
1.2 Power Validation¶
Key Concepts¶
| Metric | Description | Typical Range |
|---|---|---|
| TDP (Thermal Design Power) | Maximum sustained power draw | 300W–700W for datacenter GPUs |
| TBP (Total Board Power) | Power for the entire accelerator card | 350W–800W |
| Power States (P-states) | Dynamic voltage/frequency scaling | P0 (max) → P8 (idle) |
| Power Capping | Firmware-enforced power limit | Configurable per card |
Validation Procedures¶
1. Sustained Load Power Test

```bash
# NVIDIA: Monitor power under sustained compute
# (dmon output is space-aligned columns, not CSV)
nvidia-smi dmon -s pucvmet -d 1 > power_log.txt &

# Run a sustained GEMM workload
python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
torch.cuda.synchronize()
"
kill %1
```

```bash
# AMD: Monitor power on ROCm
# (PyTorch on ROCm still uses the 'cuda' device string)
rocm-smi --showpower --showtemp --showuse -l 1 > amd_power_log.txt &

python -c "
import torch
a = torch.randn(8192, 8192, device='cuda', dtype=torch.float16)
for i in range(10000):
    c = torch.matmul(a, a)
torch.cuda.synchronize()
"
kill %1
```
2. Power State Transition Test

```bash
# Verify the GPU transitions between power states correctly
nvidia-smi -q -d PERFORMANCE   # Check current P-state
# Idle       -> should report P8
# Under load -> should report P0

# Validate power cap enforcement
nvidia-smi -pl 400   # Set the power limit to 400W (requires root)
# Run a workload and verify power never exceeds 400W ± margin
```
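The cap check can be automated by sampling power while the workload runs. A minimal sketch (the sampler shells out to `nvidia-smi`, so it is NVIDIA-specific; the 10 W margin is an assumed tolerance, not a specification value):

```python
import subprocess
import time

def read_power_draw_watts(gpu_index=0):
    """Read instantaneous board power via nvidia-smi (NVIDIA-specific)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"], text=True)
    return float(out.strip())

def cap_violations(samples, cap_watts, margin_watts=10.0):
    """Return the power samples that exceed the configured cap plus margin."""
    return [w for w in samples if w > cap_watts + margin_watts]

def sample_power(duration_s=60, interval_s=0.5):
    """Sample power draw while a workload runs in another process."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(read_power_draw_watts())
        time.sleep(interval_s)
    return samples

if __name__ == "__main__":
    # Run alongside a sustained GEMM workload; cap set via `nvidia-smi -pl 400`
    samples = sample_power(duration_s=60)
    bad = cap_violations(samples, cap_watts=400.0)
    print(f"{len(bad)} of {len(samples)} samples exceeded the cap")
```

The same structure works on AMD by swapping the sampler for a `rocm-smi --showpower` parser.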
3. Power Delivery Validation

- Voltage droop under transient load spikes
- VRM (Voltage Regulator Module) stability
- Power supply ripple under sustained AI workloads
- PCIe auxiliary power connector integrity
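Voltage droop itself must be measured with external instrumentation (scope probes on the 12V rail), but the transient load steps that provoke it can be generated in software. A sketch, assuming a CUDA-capable PyTorch install; the 100 ms period and 50% duty cycle are illustrative choices:

```python
import time

def burst_schedule(period_ms=100, duty_cycle=0.5, cycles=600):
    """Build a list of (busy_s, idle_s) pairs describing a square-wave load."""
    busy = period_ms * duty_cycle / 1000.0
    idle = period_ms * (1.0 - duty_cycle) / 1000.0
    return [(busy, idle)] * cycles

def run_burst_load(schedule, size=8192):
    """Alternate dense GEMM bursts with idle gaps to create load transients."""
    import torch  # imported lazily so the schedule helper works without a GPU
    a = torch.randn(size, size, device='cuda', dtype=torch.float16)
    for busy_s, idle_s in schedule:
        end = time.time() + busy_s
        while time.time() < end:
            torch.matmul(a, a)
        torch.cuda.synchronize()
        time.sleep(idle_s)  # load falls toward idle; watch the rails here

if __name__ == "__main__":
    run_burst_load(burst_schedule())  # ~60 s of 10 Hz load steps
```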
What to Validate¶
- Sustained power stays within TDP ± 5%
- Power capping is enforced within 10 ms
- P-state transitions are clean (no clock glitches)
- No voltage-droop-induced errors under burst loads
- Power measurement accuracy vs. an external power meter
1.3 Thermal Validation¶
Key Concepts¶
| Component | Max Temp | Throttle Temp | Shutdown Temp |
|---|---|---|---|
| GPU Die | 83–95°C | 80–90°C | 95–105°C |
| HBM | 95–105°C | 95°C | 110°C |
| VRM | 100–125°C | 110°C | 130°C |
| Junction (Tj) | 95–110°C | Device-specific | Device-specific |
Validation Procedures¶
1. Thermal Characterization
```python
import subprocess
import time

def monitor_thermal(duration_seconds=3600, interval=1):
    """Monitor GPU temperatures during a stress test."""
    results = []
    start = time.time()
    while time.time() - start < duration_seconds:
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,temperature.memory,"
             "power.draw,clocks.sm,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True
        )
        results.append(output.strip())
        time.sleep(interval)
    return results
```
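The pass criterion in this section is that temperature stabilizes rather than climbing indefinitely. One way to express that check over a series of temperature samples (the 300-sample window and 1°C tolerance are assumed thresholds, not spec values):

```python
def has_stabilized(temps_c, window=300, tolerance_c=1.0):
    """True if the mean of the last window is within tolerance of the prior window."""
    if len(temps_c) < 2 * window:
        return False  # not enough data to compare two windows
    recent = temps_c[-window:]
    prior = temps_c[-2 * window:-window]
    return abs(sum(recent) / window - sum(prior) / window) <= tolerance_c
```

Feed it the `temperature.gpu` column parsed out of the CSV rows collected above; a run that never stabilizes indicates an inadequate thermal solution.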
2. Thermal Throttling Detection¶
Modern AI accelerators dynamically reduce clock frequency when the GPU die or HBM approaches its thermal limit, a mechanism known as thermal throttling. During sustained compute workloads such as large FP16 GEMM operations, junction temperatures climb steadily until the firmware intervenes. The detection strategy below monitors per-iteration execution time: if iteration latency rises more than 20% above the warm-up baseline, the clock has likely been reduced. Identifying when throttling begins is critical for datacenter capacity planning: it tells engineers how long a workload can sustain peak performance before cooling becomes the bottleneck, and whether the thermal solution (heatsink, liquid cooling, airflow design) is adequate for 24/7 AI training workloads.
```python
import torch
import time

def detect_thermal_throttle(matrix_size=8192, iterations=5000):
    """Run sustained compute and detect clock frequency drops."""
    a = torch.randn(matrix_size, matrix_size, device='cuda', dtype=torch.float16)
    times = []
    for i in range(iterations):
        start = time.perf_counter()
        c = torch.matmul(a, a)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        # Detect throttling: iteration time increases >20% over the warm-up baseline
        if i > 100:
            baseline = sum(times[10:100]) / 90
            if elapsed > baseline * 1.2:
                print(f"Potential throttle at iteration {i}: "
                      f"{elapsed:.4f}s vs baseline {baseline:.4f}s")
    return times
```
3. Cooling Validation
- Fan curve verification (RPM vs. temperature)
- Liquid cooling loop integrity (for datacenter cards)
- Ambient temperature impact (25°C vs. 35°C inlet)
- Sustained workload: temperature must stabilize, not continuously rise
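Fan curve verification can be partially automated by sampling fan speed and die temperature together. A sketch using `nvidia-smi` (fan speed is reported in percent and reads `[N/A]` on passively cooled datacenter cards, so this applies to actively cooled boards; the 10°C rise and 5% fan-response thresholds are assumptions):

```python
import subprocess
import time

def sample_fan_and_temp():
    """Read (fan %, die °C) via nvidia-smi; raises if fan speed is unavailable."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=fan.speed,temperature.gpu",
         "--format=csv,noheader,nounits"], text=True)
    fan, temp = [float(x) for x in out.strip().split(",")]
    return fan, temp

def fan_tracks_temperature(samples, temp_rise_c=10.0, min_fan_rise_pct=5.0):
    """Check that fan speed increased when the temperature rose significantly."""
    first_fan, first_temp = samples[0]
    last_fan, last_temp = samples[-1]
    if last_temp - first_temp < temp_rise_c:
        return True  # temperature never rose enough to require a response
    return last_fan - first_fan >= min_fan_rise_pct

if __name__ == "__main__":
    samples = []
    for _ in range(120):  # 10 minutes at 5 s intervals, under sustained load
        samples.append(sample_fan_and_temp())
        time.sleep(5)
    print("fan responded:", fan_tracks_temperature(samples))
```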
What to Validate¶
- GPU die temperature stabilizes under sustained AI load
- HBM temperature stays below the throttle threshold
- Thermal throttling activates correctly (no hard shutdowns)
- Performance is predictable across the ambient temperature range
- Fan/cooling system responds within expected timeframes
1.4 Memory Validation¶
HBM (High Bandwidth Memory) Validation¶
Modern AI accelerators use HBM2e/HBM3/HBM3e for massive bandwidth:
| GPU | Memory Type | Capacity | Bandwidth |
|---|---|---|---|
| NVIDIA H100 | HBM3 | 80 GB | 3.35 TB/s |
| NVIDIA H200 | HBM3e | 141 GB | 4.8 TB/s |
| AMD MI300X | HBM3 | 192 GB | 5.3 TB/s |
| AMD MI325X | HBM3e | 256 GB | 6.0 TB/s |
1. Memory Bandwidth Test
```python
import torch
import time

def measure_hbm_bandwidth(size_gb=10, dtype=torch.float16):
    """Measure sustained HBM bandwidth with large tensor copies."""
    numel = int(size_gb * 1e9 / (2 if dtype == torch.float16 else 4))
    src = torch.randn(numel, device='cuda', dtype=dtype)
    dst = torch.empty_like(src)
    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()
    # Benchmark
    start = time.perf_counter()
    iterations = 100
    for _ in range(iterations):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    bytes_total = numel * src.element_size() * 2 * iterations  # read + write
    bandwidth_gbs = bytes_total / elapsed / 1e9
    print(f"Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
    return bandwidth_gbs
```
2. Memory Integrity Test¶
HBM (High Bandwidth Memory) integrity validation writes known bit patterns to GPU memory and reads them back to verify zero corruption. Unlike ECC, which corrects single-bit errors transparently, this test catches systematic failures such as stuck bits, address decode errors, and refresh-related data loss under thermal stress. The test uses multiple patterns (sequential integers, all-ones, all-zeros, and random data) because different failure modes are sensitive to different data patterns. A single undetected bit flip in a model's weight tensor can silently corrupt inference outputs or cause training loss to diverge, making memory integrity one of the hard gates in any hardware validation checklist before a GPU is cleared for production AI workloads.
```python
import torch

def memory_integrity_test(size_gb=1):
    """Write known patterns, read them back, and verify data integrity."""
    numel = int(size_gb * 1e9 / 4)  # FP32-sized elements

    # Pattern 1: Sequential integers
    expected = torch.arange(numel, device='cuda', dtype=torch.int32)
    actual = expected.clone()  # device-to-device round trip through HBM
    assert torch.equal(expected, actual), "Sequential pattern failed!"

    # Pattern 2: All ones
    expected = torch.ones(numel, device='cuda', dtype=torch.float32)
    actual = expected.clone()
    assert torch.equal(expected, actual), "All-ones pattern failed!"

    # Pattern 3: All zeros
    expected = torch.zeros(numel, device='cuda', dtype=torch.float32)
    actual = expected.clone()
    assert torch.equal(expected, actual), "All-zeros pattern failed!"

    # Pattern 4: Random pattern (write, read back, compare)
    expected = torch.randn(numel, device='cuda', dtype=torch.float32)
    actual = expected.clone()
    assert torch.equal(expected, actual), "Random pattern failed!"

    print(f"Memory integrity test passed for {size_gb} GB")
```
3. ECC (Error Correcting Code) Validation

```bash
# Check ECC status
nvidia-smi -q -d ECC

# Check for correctable and uncorrectable errors
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# Enable ECC if it is disabled (requires admin privileges and a reboot; NVIDIA-specific)
# nvidia-smi --ecc-config=1
```
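The error-count query above can be turned into a pass/fail gate. A sketch that parses the CSV output and fails the card on any uncorrected error (assumes ECC is enabled; disabled or unsupported counters report `[N/A]`, which this treats as a skip):

```python
import subprocess

def parse_ecc_counts(csv_line):
    """Parse a 'corrected, uncorrected' CSV row; None if ECC is unavailable."""
    fields = [f.strip() for f in csv_line.split(",")]
    if any(f in ("[N/A]", "N/A") for f in fields):
        return None
    corrected, uncorrected = (int(f) for f in fields)
    return corrected, uncorrected

def ecc_gate():
    """True if the GPU passes the ECC gate (zero uncorrected errors)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total",
         "--format=csv,noheader"], text=True)
    counts = parse_ecc_counts(out.strip().splitlines()[0])
    if counts is None:
        print("ECC not enabled/supported; skipping gate")
        return True
    corrected, uncorrected = counts
    print(f"corrected={corrected} uncorrected={uncorrected}")
    return uncorrected == 0  # any uncorrected error fails the card
```

Corrected errors are worth trending over time even though they pass the gate: a rising rate can precede an uncorrectable failure.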
What to Validate¶
- HBM bandwidth meets the datasheet specification (±5%)
- No bit errors under sustained read/write patterns
- ECC correctly detects and corrects single-bit errors
- Memory allocation and deallocation don't leak
- Out-of-memory handling is graceful (no kernel panic)
- PCIe DMA transfers are correct (host ↔ device)
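The leak and out-of-memory items above can be checked directly from PyTorch. A sketch (the cycle count, allocation size, and the deliberately absurd 64 TiB over-allocation are illustrative choices; `torch` is imported lazily so the error classifier works anywhere):

```python
def is_graceful_oom(exc):
    """True if an exception looks like a recoverable CUDA out-of-memory error."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)

def alloc_free_cycles(cycles=1000, mb_per_alloc=256):
    """Allocate and free repeatedly; return bytes still held above the baseline."""
    import torch
    baseline = torch.cuda.memory_allocated()
    for _ in range(cycles):
        t = torch.empty(mb_per_alloc * 1024 * 1024, dtype=torch.uint8,
                        device='cuda')
        del t
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() - baseline  # 0 if nothing leaked

def oom_is_graceful():
    """Deliberately over-allocate and confirm the failure is a catchable error."""
    import torch
    try:
        torch.empty(1 << 46, dtype=torch.uint8, device='cuda')  # 64 TiB
    except RuntimeError as e:
        return is_graceful_oom(e)
    return False  # allocation unexpectedly succeeded
```

PyTorch's caching allocator holds freed memory in its pool, so `memory_allocated()` (not driver-level free memory) is the right metric for leak detection here.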
1.5 Stability Validation¶
Long-Duration Stress Testing¶
AI training runs last days to weeks. The hardware must be stable throughout.
1. 24-Hour Stress Test Template
```python
import torch
import time
import json
from pathlib import Path

def run_stability_test(duration_hours=24, log_interval_seconds=60):
    """Run a 24-hour stability test with mixed workloads."""
    log_path = Path("stability_log.jsonl")
    start = time.time()
    end = start + duration_hours * 3600
    iteration = 0
    # Cycle through tensors of varying sizes
    sizes = [1024, 2048, 4096, 8192]
    while time.time() < end:
        size = sizes[iteration % len(sizes)]
        try:
            # Mixed precision compute
            a = torch.randn(size, size, device='cuda', dtype=torch.float16)
            b = torch.randn(size, size, device='cuda', dtype=torch.float16)
            c = torch.matmul(a, b)
            # Verify basic numerical properties
            assert not torch.isnan(c).any(), f"NaN detected at iteration {iteration}"
            assert not torch.isinf(c).any(), f"Inf detected at iteration {iteration}"
            torch.cuda.synchronize()
            # Log metrics (rough cadence: assumes ~2 s per iteration)
            if iteration % max(log_interval_seconds // 2, 1) == 0:
                mem_used = torch.cuda.memory_allocated() / 1e9
                mem_total = torch.cuda.get_device_properties(0).total_memory / 1e9
                entry = {
                    "iteration": iteration,
                    "elapsed_hours": (time.time() - start) / 3600,
                    "matrix_size": size,
                    "memory_gb": round(mem_used, 2),
                    "memory_total_gb": round(mem_total, 2),
                    "status": "ok"
                }
                with open(log_path, "a") as f:
                    f.write(json.dumps(entry) + "\n")
            del a, b, c
        except Exception as e:
            entry = {
                "iteration": iteration,
                "elapsed_hours": (time.time() - start) / 3600,
                "error": str(e),
                "status": "FAIL"
            }
            with open(log_path, "a") as f:
                f.write(json.dumps(entry) + "\n")
            raise
        iteration += 1
    print(f"Stability test completed: {iteration} iterations over {duration_hours}h")
```
2. GPU Reset / Recovery Test
```bash
# Verify the GPU recovers from a reset
nvidia-smi --gpu-reset -i 0
sleep 5
nvidia-smi   # Should show a healthy GPU

# Verify no zombie processes remain after the reset
nvidia-smi --query-compute-apps=pid --format=csv,noheader
```
What to Validate¶
- No hangs, crashes, or kernel panics over 24+ hours
- No NaN/Inf outputs under sustained compute
- GPU recovers cleanly from software resets
- Memory doesn't leak across thousands of allocations
- Performance doesn't degrade over time (thermal throttling aside)
- No XID errors (NVIDIA) or RAS events (AMD) in system logs
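The system-log check can be scripted. A sketch for the NVIDIA side: Xid events appear in the kernel log as lines of the form `NVRM: Xid (PCI:…): <code>, <description>` (reading `dmesg` may require root; the AMD side would need an equivalent scan of RAS counters instead):

```python
import re
import subprocess

# Matches e.g. "NVRM: Xid (PCI:0000:65:00): 79, GPU has fallen off the bus."
XID_PATTERN = re.compile(r"NVRM: Xid \(.*?\): (\d+),")

def find_xid_events(log_text):
    """Return the Xid error codes found in kernel log text (NVIDIA)."""
    return [int(m.group(1)) for m in XID_PATTERN.finditer(log_text)]

def scan_kernel_log():
    """Scan dmesg for Xid events; an empty list means a clean run."""
    log = subprocess.check_output(["dmesg"], text=True,
                                  stderr=subprocess.DEVNULL)
    return find_xid_events(log)
```

Run the scan before and after the soak test and fail the run on any new Xid codes; the code itself indicates the failure class (e.g. memory errors vs. bus errors).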
1.6 Platform-Specific Tools Reference¶
NVIDIA Tools¶
| Tool | Purpose |
|---|---|
| `nvidia-smi` | GPU monitoring, power management |
| `dcgm-exporter` | Datacenter GPU metrics for Prometheus |
| `nvidia-bug-report.sh` | Collect diagnostic bundle |
| `dcgmi diag` | Memory error detection |
AMD Tools¶
| Tool | Purpose |
|---|---|
| `rocm-smi` | GPU monitoring, power management |
| `rocm-smi --showrasinfo` | RAS (Reliability) error reporting |
| `amd-smi` | Next-gen monitoring tool |
| `rocm-bandwidth-test` | Memory bandwidth measurement |
Qualcomm Tools¶
| Tool | Purpose |
|---|---|
| `qaic-util` | Cloud AI 100 card management |
| `qaic-runner` | Inference execution and monitoring |
AWS (Annapurna)¶
| Tool | Purpose |
|---|---|
| `neuron-top` | Real-time Neuron device monitoring |
| `neuron-monitor` | Detailed performance metrics |
| `neuron-ls` | List Neuron devices |
1.7 Exercises¶
1. Power Monitoring: Write a script that logs GPU power every second during a 10-minute GEMM workload. Plot power vs. time. Does it stabilize?
2. Thermal Characterization: Run a sustained FP16 GEMM for 30 minutes. Log temperature every 5 seconds. At what point does temperature stabilize? Is there throttling?
3. Memory Bandwidth: Measure actual HBM bandwidth with varying tensor sizes (1 MB → 10 GB). Compare to the datasheet specification. What percentage of peak do you achieve?
4. ECC Validation: Check your GPU's ECC status. If enabled, run a memory stress test and verify zero uncorrectable errors.
5. Stability Soak: Run a 1-hour mixed-precision workload. Verify no NaN/Inf in outputs and no memory leaks.
Key Takeaways¶
- Hardware validation is the foundation: all software correctness depends on it
- AI workloads stress hardware differently than traditional HPC (sustained, memory-bound)
- Power and thermal validation prevent field failures and datacenter issues
- Memory integrity is critical: a single bit flip can corrupt an entire training run
- Stability testing must match real-world durations (24h+ soak tests)
Next: 02_kernel_validation.ipynb — Validate compute kernels (GEMM, conv, attention)
Back to Overview: README.md