Lab 04: Regression & Release Validation Suite¶
Golden Baselines · Cross-Version Testing · CI/CD for AI Hardware¶
Role alignment: AMD Principal Staff – AI/ML Performance Validation
Reference: 08_regression_release_validation.ipynb
This is the most JD-aligned lab — AMD’s key ask is:
“system-level test plans” and “compatibility test suites”
What you will do:¶
Record a golden baseline (performance + correctness metrics)
Build a regression checker that compares new runs to golden baselines
Create a version matrix: test PyTorch version × driver config
Design a CI/CD-ready test runner with JSON reports
Practice writing a system-level test plan (interview exercise)
Setup¶
The setup creates the baseline storage directory and initializes the test environment. Golden baselines are stored as individual JSON files keyed by test name, device, and dtype, enabling parallel test execution and easy inspection of individual baseline values. The BASELINE_DIR path is configurable for integration with CI/CD systems that mount shared storage for cross-run baseline persistence.
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import json
import platform
import subprocess
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, asdict, field
from typing import Dict, List, Optional, Any
from enum import Enum
if torch.cuda.is_available():
DEVICE = 'cuda'
elif torch.backends.mps.is_available():
DEVICE = 'mps'
else:
DEVICE = 'cpu'
print(f'Device: {DEVICE}')
if DEVICE == 'cuda':
print(f'GPU: {torch.cuda.get_device_name(0)}')
BASELINE_DIR = Path('baselines')
BASELINE_DIR.mkdir(exist_ok=True)
print(f'Baseline dir: {BASELINE_DIR.resolve()}')
Core Concepts¶
Performance regression vs Correctness regression¶
| Type | Definition | Example | Severity |
|---|---|---|---|
| Correctness | Output values changed beyond tolerance | GEMM result differs by >1e-2 | BLOCKER |
| Performance | Throughput dropped or latency increased | 10% slower than golden | HIGH |
| Memory | VRAM usage increased significantly | +500 MB for same workload | MEDIUM |
AMD releases: ROCm version × GPU driver × framework version must all be tested.
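The cost of that requirement grows multiplicatively with each axis. As a rough sketch of how the combinations pile up (the version strings below are illustrative placeholders, not an actual ROCm support matrix), the full set of configurations can be enumerated with `itertools.product`:

```python
from itertools import product

# Illustrative version lists -- placeholders, not AMD's real support matrix
rocm_versions = ['6.0', '6.1', '6.2']
pytorch_versions = ['2.2', '2.3', '2.4']
drivers = ['6.7.0', '6.8.0']

# Every combination is a distinct configuration that must be validated
matrix = [
    {'rocm': r, 'pytorch': p, 'driver': d}
    for r, p, d in product(rocm_versions, pytorch_versions, drivers)
]
print(f'{len(matrix)} configs to validate')  # 3 * 3 * 2 = 18
```

Adding one more GPU generation or framework version multiplies, not adds, to the count, which is why the regression runner itself has to be automated.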
Exercise 4.1 – System Environment Capture¶
Every regression run starts with capturing the exact environment.
Without this, you can’t reproduce or bisect a regression.
def capture_environment() -> Dict[str, Any]:
"""Capture full system environment for a regression run."""
env = {
'timestamp': datetime.now().isoformat(),
'platform': platform.platform(),
'python_version': platform.python_version(),
'pytorch_version': torch.__version__,
'cuda_available': torch.cuda.is_available(),
'device': DEVICE,
}
if torch.cuda.is_available():
env['gpu_name'] = torch.cuda.get_device_name(0)
env['gpu_count'] = torch.cuda.device_count()
env['cuda_version'] = torch.version.cuda
env['vram_gb'] = round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1)
# Try to get driver version
try:
driver = subprocess.check_output(
['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
text=True, stderr=subprocess.DEVNULL
).strip()
env['driver_version'] = driver
except Exception:
pass
# Try ROCm
try:
rocm = subprocess.check_output(
['rocminfo'], text=True, stderr=subprocess.DEVNULL
)
for line in rocm.split('\n'):
if 'Runtime Version' in line:
env['rocm_version'] = line.split(':')[-1].strip()
break
except Exception:
pass
# Check key packages
for pkg in ['onnxruntime', 'transformers', 'deepspeed']:
try:
mod = __import__(pkg.replace('-', '_'))
env[f'{pkg}_version'] = getattr(mod, '__version__', 'unknown')
except ImportError:
env[f'{pkg}_version'] = 'not_installed'
return env
env = capture_environment()
print('System environment:')
for k, v in env.items():
print(f' {k:30s}: {v}')
Exercise 4.2 – Golden Baseline Framework¶
A golden baseline is a recorded set of metrics from a known-good state.
Future runs compare against it. If anything regresses beyond threshold → flag it.
class RegressionStatus(Enum):
PASS = 'PASS'
PERF_REGR = 'PERF_REGRESSION'
CORR_REGR = 'CORRECTNESS_REGRESSION'
MEM_REGR = 'MEMORY_REGRESSION'
NO_BASELINE = 'NO_BASELINE'
@dataclass
class BenchmarkResult:
name: str
latency_ms: float
throughput_items_per_sec: float
max_abs_error: float # vs FP32 CPU reference
peak_memory_mb: float
dtype: str
config: Dict[str, Any] = field(default_factory=dict)
def to_dict(self):
return asdict(self)
class GoldenBaseline:
"""
Records and compares benchmark results against stored baselines.
Baselines are stored as JSON files, keyed by (test_name, device, dtype).
"""
PERF_THRESHOLD = 0.10 # >10% slower = regression
MEM_THRESHOLD = 0.15 # >15% more memory = regression
def __init__(self, baseline_dir: Path = BASELINE_DIR):
self.baseline_dir = baseline_dir
def _key(self, result: BenchmarkResult) -> str:
return f'{result.name}_{DEVICE}_{result.dtype}'.replace(' ', '_')
def _path(self, key: str) -> Path:
return self.baseline_dir / f'{key}.json'
def record(self, result: BenchmarkResult, env: Dict):
"""Save result as the golden baseline."""
key = self._key(result)
data = {'result': result.to_dict(), 'environment': env}
self._path(key).write_text(json.dumps(data, indent=2))
print(f' Golden baseline recorded: {key}')
def compare(self, result: BenchmarkResult, corr_tol: Optional[float] = None) -> RegressionStatus:
"""Compare result to baseline. Returns regression status."""
key = self._key(result)
path = self._path(key)
if not path.exists():
return RegressionStatus.NO_BASELINE
baseline = json.loads(path.read_text())['result']
# 1. Correctness check
tol = corr_tol or self._default_corr_tol(result.dtype)
if result.max_abs_error > tol * 2: # 2x tolerance = regression
return RegressionStatus.CORR_REGR
# 2. Performance check
baseline_tps = baseline['throughput_items_per_sec']
if baseline_tps > 0:
perf_delta = (baseline_tps - result.throughput_items_per_sec) / baseline_tps
if perf_delta > self.PERF_THRESHOLD:
return RegressionStatus.PERF_REGR
# 3. Memory check
if baseline['peak_memory_mb'] > 0 and result.peak_memory_mb > 0:
mem_delta = (result.peak_memory_mb - baseline['peak_memory_mb']) / baseline['peak_memory_mb']
if mem_delta > self.MEM_THRESHOLD:
return RegressionStatus.MEM_REGR
return RegressionStatus.PASS
def _default_corr_tol(self, dtype_str: str) -> float:
return {'float32': 1e-4, 'float16': 1e-2, 'bfloat16': 0.1}.get(dtype_str, 1e-2)
print('GoldenBaseline class defined')
baseline_db = GoldenBaseline()
Exercise 4.3 – Test Suite: Define Benchmarks¶
A test suite is a collection of named benchmark functions.
Each produces a BenchmarkResult that can be recorded or compared.
def run_gemm_benchmark(M=1024, K=1024, N=1024, dtype=torch.float16, iters=50) -> BenchmarkResult:
"""GEMM performance + correctness benchmark."""
if DEVICE == 'cpu':
M = K = N = 256
dtype = torch.float32
iters = 10
# Reference
a_cpu = torch.randn(M, K)
b_cpu = torch.randn(K, N)
ref = torch.matmul(a_cpu, b_cpu)
# GPU
a = a_cpu.to(DEVICE, dtype=dtype)
b = b_cpu.to(DEVICE, dtype=dtype)
for _ in range(5):
_ = torch.matmul(a, b)
if DEVICE == 'cuda':
torch.cuda.synchronize()
if DEVICE == 'cuda':
torch.cuda.reset_peak_memory_stats()
t0 = time.perf_counter()
for _ in range(iters):
out = torch.matmul(a, b)
if DEVICE == 'cuda':
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
max_err = (out.float().cpu() - ref).abs().max().item()
lat_ms = elapsed / iters * 1000
tps = (M * N * iters) / elapsed
peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0
return BenchmarkResult(
name=f'GEMM_{M}x{K}x{N}',
latency_ms=round(lat_ms, 3),
throughput_items_per_sec=round(tps, 1),
max_abs_error=round(max_err, 8),
peak_memory_mb=round(peak_mem, 1),
dtype=str(dtype).split('.')[-1],
config={'M': M, 'K': K, 'N': N},
)
def run_layernorm_benchmark(batch=8, seq=512, hidden=768, dtype=torch.float16, iters=100) -> BenchmarkResult:
"""LayerNorm performance + correctness benchmark."""
if DEVICE == 'cpu':
batch, seq, hidden = 2, 64, 128
dtype = torch.float32
iters = 10
x_cpu = torch.randn(batch, seq, hidden)
ref = F.layer_norm(x_cpu, [hidden])
x_gpu = x_cpu.to(DEVICE, dtype=dtype)
for _ in range(5):
_ = F.layer_norm(x_gpu, [hidden])
if DEVICE == 'cuda':
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
t0 = time.perf_counter()
for _ in range(iters):
out = F.layer_norm(x_gpu, [hidden])
if DEVICE == 'cuda':
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
max_err = (out.float().cpu() - ref).abs().max().item()
lat_ms = elapsed / iters * 1000
tps = (batch * seq * iters) / elapsed
peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0
return BenchmarkResult(
name=f'LayerNorm_{batch}x{seq}x{hidden}',
latency_ms=round(lat_ms, 3),
throughput_items_per_sec=round(tps, 1),
max_abs_error=round(max_err, 8),
peak_memory_mb=round(peak_mem, 1),
dtype=str(dtype).split('.')[-1],
config={'batch': batch, 'seq': seq, 'hidden': hidden},
)
# Define the test suite as a list of callables
TEST_SUITE = [
lambda: run_gemm_benchmark(),
lambda: run_layernorm_benchmark(),
]
print('Test suite defined:', len(TEST_SUITE), 'benchmarks')
Exercise 4.4 – Record Golden Baselines¶
Recording golden baselines captures the known-good performance and correctness metrics from a validated hardware + software configuration. Each baseline includes throughput, latency, maximum numerical error, and peak memory usage, along with the full environment snapshot (driver version, framework version, GPU model). In AMD’s release workflow, golden baselines are recorded on a certified reference system after the previous release has passed all validation gates. Subsequent development builds are compared against these baselines, and any regression beyond the threshold triggers investigation before the code can be merged.
print('Recording golden baselines...')
print('=' * 60)
for test_fn in TEST_SUITE:
result = test_fn()
baseline_db.record(result, env)
print(f' {result.name} | latency={result.latency_ms}ms | '
f'tps={result.throughput_items_per_sec:.0f} | '
f'max_err={result.max_abs_error:.2e}')
print(f'\nBaselines saved in: {BASELINE_DIR.resolve()}')
print('Files:', [f.name for f in BASELINE_DIR.iterdir()])
Exercise 4.5 – Regression Check¶
Simulate running the same tests again and checking for regressions.
We’ll inject artificial performance and correctness degradations to verify that the detection logic catches both.
def run_regression_check(inject_perf_regression=False, inject_corr_regression=False):
"""
Run test suite and compare to golden baselines.
Optionally inject artificial regressions to test detection.
"""
print(f'Running regression check '
f'(perf_inject={inject_perf_regression}, corr_inject={inject_corr_regression})')
print('=' * 70)
run_results = []
for test_fn in TEST_SUITE:
result = test_fn()
# Inject artificial regression for demo purposes
if inject_perf_regression:
# Simulate 25% throughput drop
result.throughput_items_per_sec *= 0.75
result.latency_ms *= 1.33
if inject_corr_regression:
# Simulate numerical error spike
result.max_abs_error *= 100
status = baseline_db.compare(result)
icon = 'PASS' if status == RegressionStatus.PASS else 'FAIL'
print(f' [{icon}] {result.name:45s} status={status.value}')
if status != RegressionStatus.PASS and status != RegressionStatus.NO_BASELINE:
baseline_file = baseline_db._path(baseline_db._key(result))
if baseline_file.exists():
gold = json.loads(baseline_file.read_text())['result']
print(f' Baseline tps={gold["throughput_items_per_sec"]:.0f} '
f'err={gold["max_abs_error"]:.2e}')
print(f' Current tps={result.throughput_items_per_sec:.0f} '
f'err={result.max_abs_error:.2e}')
run_results.append({'name': result.name, 'status': status.value,
'result': result.to_dict()})
passed = sum(1 for r in run_results if r['status'] == 'PASS')
print(f'\n{passed}/{len(run_results)} tests passed')
return run_results
# 1. Normal run — should all pass
print('--- Normal run ---')
normal_results = run_regression_check()
print()
# 2. Inject performance regression
print('--- Injected performance regression ---')
perf_results = run_regression_check(inject_perf_regression=True)
print()
# 3. Inject correctness regression
print('--- Injected correctness regression ---')
corr_results = run_regression_check(inject_corr_regression=True)
Exercise 4.6 – Version Matrix¶
Simulate testing across a version matrix.
In practice: ROCm 6.0 × PyTorch 2.2 vs ROCm 6.1 × PyTorch 2.3
# Simulate a version matrix test
# In real AMD validation, you'd run on actual different driver/framework combos
# Here we simulate different "configs" with slight performance variation
import random
random.seed(42)
VERSION_MATRIX = [
{'rocm': '6.0', 'pytorch': '2.2.0', 'driver': '6.7.0'},
{'rocm': '6.1', 'pytorch': '2.3.0', 'driver': '6.8.0'},
{'rocm': '6.2', 'pytorch': '2.4.0', 'driver': '6.9.0'},
]
print('Version Matrix Compatibility Test')
print('=' * 80)
print(f"{'Version':>30} {'GEMM PASS':>10} {'LayerNorm PASS':>15} {'Perf delta':>12}")
print('-' * 80)
matrix_report = []
for ver in VERSION_MATRIX:
label = f"ROCm {ver['rocm']} + PyTorch {ver['pytorch']}"
# Simulate slight perf variation across versions
perf_multiplier = 1.0 + (random.random() - 0.5) * 0.05 # ±2.5%
gemm_result = run_gemm_benchmark()
gemm_result.throughput_items_per_sec *= perf_multiplier
gemm_status = baseline_db.compare(gemm_result)
ln_result = run_layernorm_benchmark()
ln_result.throughput_items_per_sec *= perf_multiplier
ln_status = baseline_db.compare(ln_result)
perf_pct = (perf_multiplier - 1) * 100
gemm_pass = gemm_status in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)
ln_pass = ln_status in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)
gemm_label = 'PASS' if gemm_pass else 'FAIL'
ln_label = 'PASS' if ln_pass else 'FAIL'
print(f'{label:>30} {gemm_label:>10} {ln_label:>15} {perf_pct:>+11.1f}%')
matrix_report.append({
'version': ver,
'gemm_status': gemm_status.value,
'layernorm_status': ln_status.value,
'perf_delta_pct': round(perf_pct, 2),
})
Exercise 4.7 – CI/CD Report Generator¶
Generate a structured report that a CI/CD system can parse.
AMD’s Jenkins/GitLab CI consumes exactly this kind of JSON.
class ReleaseValidationReport:
"""Generates a CI/CD-ready release validation report."""
def __init__(self, run_name: str, environment: Dict):
self.run_name = run_name
self.environment = environment
self.test_results: List[Dict] = []
self.start_time = datetime.now()
def add_result(self, name: str, status: str, metrics: Dict, notes: str = ''):
self.test_results.append({
'test': name,
'status': status,
'metrics': metrics,
'notes': notes,
})
def finalize(self) -> Dict:
end_time = datetime.now()
passed = sum(1 for r in self.test_results if r['status'] == 'PASS')
total = len(self.test_results)
blockers = [r for r in self.test_results
if 'REGRESSION' in r['status'] and 'CORRECTNESS' in r['status']]
# Overall release decision
if blockers:
release_status = 'BLOCKED'
elif passed < total:
release_status = 'CONDITIONAL' # performance regressions but no correctness issues
else:
release_status = 'APPROVED'
return {
'report_type': 'Release Validation',
'run_name': self.run_name,
'release_status': release_status,
'summary': {
'total': total,
'passed': passed,
'failed': total - passed,
'blockers': len(blockers),
'duration_s': (end_time - self.start_time).total_seconds(),
},
'environment': self.environment,
'tests': self.test_results,
'generated_at': end_time.isoformat(),
}
# Build a full release report
rpt = ReleaseValidationReport(
run_name=f'ROCm-6.1_PyTorch-2.3_MI300X_{datetime.now().strftime("%Y%m%d")}',
environment=env,
)
# Add regression results
for r in normal_results:
rpt.add_result(
name=r['name'],
status=r['status'],
metrics=r['result'],
)
# Add version matrix
for m in matrix_report:
rpt.add_result(
name=f"VersionMatrix_ROCm{m['version']['rocm']}_PT{m['version']['pytorch']}",
status='PASS' if m['gemm_status'] == 'PASS' else m['gemm_status'],
metrics={'perf_delta_pct': m['perf_delta_pct']},
)
final_report = rpt.finalize()
# Save
report_path = Path('lab04_release_validation_report.json')
report_path.write_text(json.dumps(final_report, indent=2))
print(json.dumps(final_report, indent=2)[:2000], '...')
print(f'\nFull report saved: {report_path}')
print(f'\nRelease status: {final_report["release_status"]}')
Exercise 4.8 – Write a System-Level Test Plan (Interview Prep)¶
AMD’s JD says: “defining comprehensive system-level test plans”
Practice writing one here. This is the format AMD expects from a Principal Staff engineer.
Template: System-Level Test Plan¶
Scenario: Validate PyTorch 2.4 + ROCm 6.2 on MI300X for LLM inference workloads.
Test Plan ID: TP-2024-001
Author: [Your Name]
Target: MI300X × ROCm 6.2 × PyTorch 2.4
Layer 1: Hardware Validation (Gate)¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| Power stability (24h) | rocm-smi | Within TDP ±5% | BLOCKER |
| Memory bandwidth | rocm-bandwidth-test | ≥80% of 5.3 TB/s | HIGH |
| Thermal stability | rocm-smi | Temp stabilizes <95°C | HIGH |
| ECC error count | rocm-smi --showras | 0 uncorrectable | BLOCKER |
Layer 2: Kernel Correctness (Gate)¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| GEMM FP16/BF16 | KernelValidator | max_err < 0.01 | BLOCKER |
| Attention (all seq lens) | KernelValidator | max_err < 0.1 | BLOCKER |
| Softmax edge cases | Custom harness | No NaN/Inf | BLOCKER |
| LayerNorm | KernelValidator | max_err < 0.001 | HIGH |
Layer 3: Framework Integration¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| PyTorch ops coverage | Op coverage script | 100% of LLM ops on GPU | BLOCKER |
| ONNX Runtime ROCm EP | ONNX test suite | No CPU fallbacks | HIGH |
| torch.compile | Test suite | Same output as eager | HIGH |
Layer 4: Model Performance¶
| Test | Workload | Pass Criteria | Severity |
|---|---|---|---|
| LLaMA-2 70B inference | Decode throughput | ≥90% of golden baseline | HIGH |
| BERT-Large training | Tokens/sec | ≥90% of golden baseline | HIGH |
| ResNet50 inference | Images/sec | ≥90% of golden baseline | MEDIUM |
Layer 5: Regression vs Previous Release¶
| Test | Comparison | Pass Criteria | Severity |
|---|---|---|---|
| Correctness regression | vs ROCm 6.1 baseline | 0 correctness regressions | BLOCKER |
| Performance regression | vs ROCm 6.1 baseline | <5% perf delta | HIGH |
| Memory regression | vs ROCm 6.1 baseline | <10% memory delta | MEDIUM |
Summary¶
| Concept | Why AMD cares |
|---|---|
| Environment capture | Can’t bisect a regression without knowing exact versions |
| Golden baselines | Source of truth across ROCm releases |
| Regression detection | Automated — no human should manually compare numbers |
| Version matrix | ROCm × PyTorch × driver combos = hundreds of test configs |
| Release report | Goes to engineering managers and partners (Meta, MS, etc.) |
| System test plan | What AMD means by “system-level test plans” in the JD |
AMD Interview Key Points¶
A regression suite is not just “run tests” — it’s record → compare → report
Correctness regressions are BLOCKERS; performance regressions are flagged but may ship
The version matrix is the core deliverable: N versions × M GPUs × K frameworks
A Principal Staff engineer designs the framework, not just runs it
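To close the record → compare → report loop in CI, the JSON report from Exercise 4.7 would typically gate the pipeline via an exit code. A minimal sketch (the file name matches the one written by this lab; the exit-code mapping itself is an assumption for illustration, not an AMD convention):

```python
import json
from pathlib import Path

def ci_gate(report_file: str = 'lab04_release_validation_report.json') -> int:
    """Map the report's release_status to a shell exit code:
    0 = approved (merge), 1 = conditional (needs review), 2 = blocked."""
    report = json.loads(Path(report_file).read_text())
    codes = {'APPROVED': 0, 'CONDITIONAL': 1, 'BLOCKED': 2}
    # Unknown or missing status fails closed: treat it as blocked.
    return codes.get(report.get('release_status'), 2)
```

A CI job would end with `sys.exit(ci_gate())`, letting Jenkins or GitLab CI mark the build green, unstable, or failed without any human comparing numbers.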
All Labs Completed¶
| Lab | Topic |
|---|---|
| 01 | Hardware: power, thermals, memory, stability |
| 02 | Kernels: GEMM, attention, softmax, tolerances |
| 03 | Model: profiling, precision, LLM phases |
| 04 | Regression: baselines, version matrix, CI/CD |
| 05 | Distributed: NCCL/RCCL, multi-GPU |
| 06 | Framework: PyTorch backend, ONNX |
| 07 | GPGPU backends: OpenCL, Vulkan, SYCL |
| 08 | Benchmarking: TTFT, throughput, SLOs |
Previous: lab_03_model_performance.ipynb
Next: lab_05_distributed_training.ipynb — Distributed training validation
Back to Overview: README.md