Lab 04: Regression & Release Validation Suite¶
Golden Baselines · Cross-Version Testing · CI/CD for AI Hardware¶
Role alignment: AMD Principal Staff – AI/ML Performance Validation
Reference: 08_regression_release_validation.ipynb
This is the most JD-aligned lab — AMD’s key ask is:
“system-level test plans” and “compatibility test suites”
What you will do:¶
Record a golden baseline (performance + correctness metrics)
Build a regression checker that compares new runs to golden baselines
Create a version matrix: test PyTorch version × driver config
Design a CI/CD-ready test runner with JSON reports
Practice writing a system-level test plan (interview exercise)
Setup¶
The setup creates the baseline storage directory and initializes the test environment. Golden baselines are stored as individual JSON files keyed by test name, device, and dtype, enabling parallel test execution and easy inspection of individual baseline values. The BASELINE_DIR path is configurable for integration with CI/CD systems that mount shared storage for cross-run baseline persistence.
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import json
import platform
import subprocess
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, asdict, field
from typing import Dict, List, Optional, Any
from enum import Enum
if torch.cuda.is_available():
DEVICE = 'cuda'
elif torch.backends.mps.is_available():
DEVICE = 'mps'
else:
DEVICE = 'cpu'
print(f'Device: {DEVICE}')
if DEVICE == 'cuda':
print(f'GPU: {torch.cuda.get_device_name(0)}')
BASELINE_DIR = Path('baselines')
BASELINE_DIR.mkdir(exist_ok=True)
print(f'Baseline dir: {BASELINE_DIR.resolve()}')
Core Concepts¶
Performance regression vs Correctness regression¶
| Type | Definition | Example | Severity |
|---|---|---|---|
| Correctness | Output values changed beyond tolerance | GEMM result differs by >1e-2 | BLOCKER |
| Performance | Throughput dropped or latency increased | 10% slower than golden | HIGH |
| Memory | VRAM usage increased significantly | +500 MB for same workload | MEDIUM |
AMD releases: ROCm version × GPU driver × framework version must all be tested.
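The cost of that requirement grows multiplicatively with each axis. As a rough sketch of how the combinations pile up (the version strings below are illustrative placeholders, not an actual ROCm support matrix), the full set of configurations can be enumerated with `itertools.product`:

```python
from itertools import product

# Illustrative version lists -- placeholders, not AMD's real support matrix
rocm_versions = ['6.0', '6.1', '6.2']
pytorch_versions = ['2.2', '2.3', '2.4']
drivers = ['6.7.0', '6.8.0']

# Every combination is a distinct configuration that must be validated
matrix = [
    {'rocm': r, 'pytorch': p, 'driver': d}
    for r, p, d in product(rocm_versions, pytorch_versions, drivers)
]
print(f'{len(matrix)} configs to validate')  # 3 * 3 * 2 = 18
```

Adding one more GPU generation or framework version multiplies, not adds, to the count, which is why the regression runner itself has to be automated.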
Exercise 4.1 – System Environment Capture¶
Every regression run starts with capturing the exact environment.
Without this, you can’t reproduce or bisect a regression.
def capture_environment() -> Dict[str, Any]:
"""Capture full system environment for a regression run."""
env = {
'timestamp': datetime.now().isoformat(),
'platform': platform.platform(),
'python_version': platform.python_version(),
'pytorch_version': torch.__version__,
'cuda_available': torch.cuda.is_available(),
'device': DEVICE,
}
if torch.cuda.is_available():
env['gpu_name'] = torch.cuda.get_device_name(0)
env['gpu_count'] = torch.cuda.device_count()
env['cuda_version'] = torch.version.cuda
env['vram_gb'] = round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1)
# Try to get driver version
try:
driver = subprocess.check_output(
['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
text=True, stderr=subprocess.DEVNULL
).strip()
env['driver_version'] = driver
except Exception:
pass
# Try ROCm
try:
rocm = subprocess.check_output(
['rocminfo'], text=True, stderr=subprocess.DEVNULL
)
for line in rocm.split('\n'):
if 'Runtime Version' in line:
env['rocm_version'] = line.split(':')[-1].strip()
break
except Exception:
pass
# Check key packages
for pkg in ['onnxruntime', 'transformers', 'deepspeed']:
try:
mod = __import__(pkg.replace('-', '_'))
env[f'{pkg}_version'] = getattr(mod, '__version__', 'unknown')
except ImportError:
env[f'{pkg}_version'] = 'not_installed'
return env
env = capture_environment()
print('System environment:')
for k, v in env.items():
print(f' {k:30s}: {v}')
Exercise 4.2 – Golden Baseline Framework¶
A golden baseline is a recorded set of metrics from a known-good state.
Future runs compare against it. If anything regresses beyond threshold → flag it.
class RegressionStatus(Enum):
PASS = 'PASS'
PERF_REGR = 'PERF_REGRESSION'
CORR_REGR = 'CORRECTNESS_REGRESSION'
MEM_REGR = 'MEMORY_REGRESSION'
NO_BASELINE = 'NO_BASELINE'
@dataclass
class BenchmarkResult:
name: str
latency_ms: float
throughput_items_per_sec: float
max_abs_error: float # vs FP32 CPU reference
peak_memory_mb: float
dtype: str
config: Dict[str, Any] = field(default_factory=dict)
def to_dict(self):
return asdict(self)
class GoldenBaseline:
"""
Records and compares benchmark results against stored baselines.
Baselines are stored as JSON files, keyed by (test_name, device, dtype).
"""
PERF_THRESHOLD = 0.10 # >10% slower = regression
MEM_THRESHOLD = 0.15 # >15% more memory = regression
def __init__(self, baseline_dir: Path = BASELINE_DIR):
self.baseline_dir = baseline_dir
def _key(self, result: BenchmarkResult) -> str:
return f'{result.name}_{DEVICE}_{result.dtype}'.replace(' ', '_')
def _path(self, key: str) -> Path:
return self.baseline_dir / f'{key}.json'
def record(self, result: BenchmarkResult, env: Dict):
"""Save result as the golden baseline."""
key = self._key(result)
data = {'result': result.to_dict(), 'environment': env}
self._path(key).write_text(json.dumps(data, indent=2))
print(f' Golden baseline recorded: {key}')
def compare(self, result: BenchmarkResult, corr_tol: Optional[float] = None) -> RegressionStatus:
"""Compare result to baseline. Returns regression status."""
key = self._key(result)
path = self._path(key)
if not path.exists():
return RegressionStatus.NO_BASELINE
baseline = json.loads(path.read_text())['result']
# 1. Correctness check
tol = corr_tol or self._default_corr_tol(result.dtype)
if result.max_abs_error > tol * 2: # 2x tolerance = regression
return RegressionStatus.CORR_REGR
# 2. Performance check
baseline_tps = baseline['throughput_items_per_sec']
if baseline_tps > 0:
perf_delta = (baseline_tps - result.throughput_items_per_sec) / baseline_tps
if perf_delta > self.PERF_THRESHOLD:
return RegressionStatus.PERF_REGR
# 3. Memory check
if baseline['peak_memory_mb'] > 0 and result.peak_memory_mb > 0:
mem_delta = (result.peak_memory_mb - baseline['peak_memory_mb']) / baseline['peak_memory_mb']
if mem_delta > self.MEM_THRESHOLD:
return RegressionStatus.MEM_REGR
return RegressionStatus.PASS
def _default_corr_tol(self, dtype_str: str) -> float:
return {'float32': 1e-4, 'float16': 1e-2, 'bfloat16': 0.1}.get(dtype_str, 1e-2)
print('GoldenBaseline class defined')
baseline_db = GoldenBaseline()
Exercise 4.3 – Test Suite: Define Benchmarks¶
A test suite is a collection of named benchmark functions.
Each produces a BenchmarkResult that can be recorded or compared.
def run_gemm_benchmark(M=1024, K=1024, N=1024, dtype=torch.float16, iters=50) -> BenchmarkResult:
"""GEMM performance + correctness benchmark."""
if DEVICE == 'cpu':
M = K = N = 256
dtype = torch.float32
iters = 10
# Reference
a_cpu = torch.randn(M, K)
b_cpu = torch.randn(K, N)
ref = torch.matmul(a_cpu, b_cpu)
# GPU
a = a_cpu.to(DEVICE, dtype=dtype)
b = b_cpu.to(DEVICE, dtype=dtype)
for _ in range(5):
_ = torch.matmul(a, b)
if DEVICE == 'cuda':
torch.cuda.synchronize()
if DEVICE == 'cuda':
torch.cuda.reset_peak_memory_stats()
t0 = time.perf_counter()
for _ in range(iters):
out = torch.matmul(a, b)
if DEVICE == 'cuda':
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
max_err = (out.float().cpu() - ref).abs().max().item()
lat_ms = elapsed / iters * 1000
tps = (M * N * iters) / elapsed
peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0
return BenchmarkResult(
name=f'GEMM_{M}x{K}x{N}',
latency_ms=round(lat_ms, 3),
throughput_items_per_sec=round(tps, 1),
max_abs_error=round(max_err, 8),
peak_memory_mb=round(peak_mem, 1),
dtype=str(dtype).split('.')[-1],
config={'M': M, 'K': K, 'N': N},
)
def run_layernorm_benchmark(batch=8, seq=512, hidden=768, dtype=torch.float16, iters=100) -> BenchmarkResult:
"""LayerNorm performance + correctness benchmark."""
if DEVICE == 'cpu':
batch, seq, hidden = 2, 64, 128
dtype = torch.float32
iters = 10
x_cpu = torch.randn(batch, seq, hidden)
ref = F.layer_norm(x_cpu, [hidden])
x_gpu = x_cpu.to(DEVICE, dtype=dtype)
for _ in range(5):
_ = F.layer_norm(x_gpu, [hidden])
if DEVICE == 'cuda':
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
t0 = time.perf_counter()
for _ in range(iters):
out = F.layer_norm(x_gpu, [hidden])
if DEVICE == 'cuda':
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
max_err = (out.float().cpu() - ref).abs().max().item()
lat_ms = elapsed / iters * 1000
tps = (batch * seq * iters) / elapsed
peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0
return BenchmarkResult(
name=f'LayerNorm_{batch}x{seq}x{hidden}',
latency_ms=round(lat_ms, 3),
throughput_items_per_sec=round(tps, 1),
max_abs_error=round(max_err, 8),
peak_memory_mb=round(peak_mem, 1),
dtype=str(dtype).split('.')[-1],
config={'batch': batch, 'seq': seq, 'hidden': hidden},
)
# Define the test suite as a list of callables
TEST_SUITE = [
lambda: run_gemm_benchmark(),
lambda: run_layernorm_benchmark(),
]
print('Test suite defined:', len(TEST_SUITE), 'benchmarks')
Exercise 4.4 – Record Golden Baselines¶
Recording golden baselines captures the known-good performance and correctness metrics from a validated hardware + software configuration. Each baseline includes throughput, latency, maximum numerical error, and peak memory usage, along with the full environment snapshot (driver version, framework version, GPU model). In AMD’s release workflow, golden baselines are recorded on a certified reference system after the previous release has passed all validation gates. Subsequent development builds are compared against these baselines, and any regression beyond the threshold triggers investigation before the code can be merged.
print('Recording golden baselines...')
print('=' * 60)
for test_fn in TEST_SUITE:
result = test_fn()
baseline_db.record(result, env)
print(f' {result.name} | latency={result.latency_ms}ms | '
f'tps={result.throughput_items_per_sec:.0f} | '
f'max_err={result.max_abs_error:.2e}')
print(f'\nBaselines saved in: {BASELINE_DIR.resolve()}')
print('Files:', [f.name for f in BASELINE_DIR.iterdir()])
Exercise 4.5 – Regression Check¶
Simulate running the same tests again and checking for regressions.
We’ll inject artificial performance and correctness degradations to verify that the detection logic catches both.
def run_regression_check(inject_perf_regression=False, inject_corr_regression=False):
"""
Run test suite and compare to golden baselines.
Optionally inject artificial regressions to test detection.
"""
print(f'Running regression check '
f'(perf_inject={inject_perf_regression}, corr_inject={inject_corr_regression})')
print('=' * 70)
run_results = []
for test_fn in TEST_SUITE:
result = test_fn()
# Inject artificial regression for demo purposes
if inject_perf_regression:
# Simulate 25% throughput drop
result.throughput_items_per_sec *= 0.75
result.latency_ms *= 1.33
if inject_corr_regression:
# Simulate numerical error spike
result.max_abs_error *= 100
status = baseline_db.compare(result)
icon = 'PASS' if status == RegressionStatus.PASS else 'FAIL'
print(f' [{icon}] {result.name:45s} status={status.value}')
if status != RegressionStatus.PASS and status != RegressionStatus.NO_BASELINE:
baseline_file = baseline_db._path(baseline_db._key(result))
if baseline_file.exists():
gold = json.loads(baseline_file.read_text())['result']
print(f' Baseline tps={gold["throughput_items_per_sec"]:.0f} '
f'err={gold["max_abs_error"]:.2e}')
print(f' Current tps={result.throughput_items_per_sec:.0f} '
f'err={result.max_abs_error:.2e}')
run_results.append({'name': result.name, 'status': status.value,
'result': result.to_dict()})
passed = sum(1 for r in run_results if r['status'] == 'PASS')
print(f'\n{passed}/{len(run_results)} tests passed')
return run_results
# 1. Normal run — should all pass
print('--- Normal run ---')
normal_results = run_regression_check()
print()
# 2. Inject performance regression
print('--- Injected performance regression ---')
perf_results = run_regression_check(inject_perf_regression=True)
print()
# 3. Inject correctness regression
print('--- Injected correctness regression ---')
corr_results = run_regression_check(inject_corr_regression=True)
Exercise 4.6 – Version Matrix¶
Simulate testing across a version matrix.
In practice: ROCm 6.0 × PyTorch 2.2 vs ROCm 6.1 × PyTorch 2.3
# Simulate a version matrix test
# In real AMD validation, you'd run on actual different driver/framework combos
# Here we simulate different "configs" with slight performance variation
import random
random.seed(42)
VERSION_MATRIX = [
{'rocm': '6.0', 'pytorch': '2.2.0', 'driver': '6.7.0'},
{'rocm': '6.1', 'pytorch': '2.3.0', 'driver': '6.8.0'},
{'rocm': '6.2', 'pytorch': '2.4.0', 'driver': '6.9.0'},
]
print('Version Matrix Compatibility Test')
print('=' * 80)
print(f"{'Version':>30} {'GEMM PASS':>10} {'LayerNorm PASS':>15} {'Perf delta':>12}")
print('-' * 80)
matrix_report = []
for ver in VERSION_MATRIX:
label = f"ROCm {ver['rocm']} + PyTorch {ver['pytorch']}"
# Simulate slight perf variation across versions
perf_multiplier = 1.0 + (random.random() - 0.5) * 0.05 # ±2.5%
gemm_result = run_gemm_benchmark()
gemm_result.throughput_items_per_sec *= perf_multiplier
gemm_status = baseline_db.compare(gemm_result)
ln_result = run_layernorm_benchmark()
ln_result.throughput_items_per_sec *= perf_multiplier
ln_status = baseline_db.compare(ln_result)
perf_pct = (perf_multiplier - 1) * 100
gemm_pass = gemm_status in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)
ln_pass = ln_status in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)
gemm_label = 'PASS' if gemm_pass else 'FAIL'
ln_label = 'PASS' if ln_pass else 'FAIL'
print(f'{label:>30} {gemm_label:>10} {ln_label:>15} {perf_pct:>+11.1f}%')
matrix_report.append({
'version': ver,
'gemm_status': gemm_status.value,
'layernorm_status': ln_status.value,
'perf_delta_pct': round(perf_pct, 2),
})
Exercise 4.7 – CI/CD Report Generator¶
Generate a structured report that a CI/CD system can parse.
AMD’s Jenkins/GitLab CI consumes exactly this kind of JSON.
class ReleaseValidationReport:
"""Generates a CI/CD-ready release validation report."""
def __init__(self, run_name: str, environment: Dict):
self.run_name = run_name
self.environment = environment
self.test_results: List[Dict] = []
self.start_time = datetime.now()
def add_result(self, name: str, status: str, metrics: Dict, notes: str = ''):
self.test_results.append({
'test': name,
'status': status,
'metrics': metrics,
'notes': notes,
})
def finalize(self) -> Dict:
end_time = datetime.now()
passed = sum(1 for r in self.test_results if r['status'] == 'PASS')
total = len(self.test_results)
blockers = [r for r in self.test_results
if 'REGRESSION' in r['status'] and 'CORRECTNESS' in r['status']]
# Overall release decision
if blockers:
release_status = 'BLOCKED'
elif passed < total:
release_status = 'CONDITIONAL' # performance regressions but no correctness issues
else:
release_status = 'APPROVED'
return {
'report_type': 'Release Validation',
'run_name': self.run_name,
'release_status': release_status,
'summary': {
'total': total,
'passed': passed,
'failed': total - passed,
'blockers': len(blockers),
'duration_s': (end_time - self.start_time).total_seconds(),
},
'environment': self.environment,
'tests': self.test_results,
'generated_at': end_time.isoformat(),
}
# Build a full release report
rpt = ReleaseValidationReport(
run_name=f'ROCm-6.1_PyTorch-2.3_MI300X_{datetime.now().strftime("%Y%m%d")}',
environment=env,
)
# Add regression results
for r in normal_results:
rpt.add_result(
name=r['name'],
status=r['status'],
metrics=r['result'],
)
# Add version matrix
for m in matrix_report:
rpt.add_result(
name=f"VersionMatrix_ROCm{m['version']['rocm']}_PT{m['version']['pytorch']}",
status='PASS' if m['gemm_status'] == 'PASS' else m['gemm_status'],
metrics={'perf_delta_pct': m['perf_delta_pct']},
)
final_report = rpt.finalize()
# Save
report_path = Path('lab04_release_validation_report.json')
report_path.write_text(json.dumps(final_report, indent=2))
print(json.dumps(final_report, indent=2)[:2000], '...')
print(f'\nFull report saved: {report_path}')
print(f'\nRelease status: {final_report["release_status"]}')
Exercise 4.8 – Write a System-Level Test Plan (Interview Prep)¶
AMD’s JD says: “defining comprehensive system-level test plans”
Practice writing one here. This is the format AMD expects from a Principal Staff engineer.
Template: System-Level Test Plan¶
Scenario: Validate PyTorch 2.4 + ROCm 6.2 on MI300X for LLM inference workloads.
Test Plan ID: TP-2024-001
Author: [Your Name]
Target: MI300X × ROCm 6.2 × PyTorch 2.4
Layer 1: Hardware Validation (Gate)¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| Power stability (24h) | rocm-smi | Within TDP ±5% | BLOCKER |
| Memory bandwidth | rocm-bandwidth-test | ≥80% of 5.3 TB/s | HIGH |
| Thermal stability | rocm-smi | Temp stabilizes <95°C | HIGH |
| ECC error count | rocm-smi --showras | 0 uncorrectable | BLOCKER |
Layer 2: Kernel Correctness (Gate)¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| GEMM FP16/BF16 | KernelValidator | max_err < 0.01 | BLOCKER |
| Attention (all seq lens) | KernelValidator | max_err < 0.1 | BLOCKER |
| Softmax edge cases | Custom harness | No NaN/Inf | BLOCKER |
| LayerNorm | KernelValidator | max_err < 0.001 | HIGH |
Layer 3: Framework Integration¶
| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| PyTorch ops coverage | Op coverage script | 100% of LLM ops on GPU | BLOCKER |
| ONNX Runtime ROCm EP | ONNX test suite | No CPU fallbacks | HIGH |
| torch.compile | Test suite | Same output as eager | HIGH |
Layer 4: Model Performance¶
| Test | Workload | Pass Criteria | Severity |
|---|---|---|---|
| LLaMA-2 70B inference | Decode throughput | ≥90% of golden baseline | HIGH |
| BERT-Large training | Tokens/sec | ≥90% of golden baseline | HIGH |
| ResNet50 inference | Images/sec | ≥90% of golden baseline | MEDIUM |
Layer 5: Regression vs Previous Release¶
| Test | Comparison | Pass Criteria | Severity |
|---|---|---|---|
| Correctness regression | vs ROCm 6.1 baseline | 0 correctness regressions | BLOCKER |
| Performance regression | vs ROCm 6.1 baseline | <5% perf delta | HIGH |
| Memory regression | vs ROCm 6.1 baseline | <10% memory delta | MEDIUM |
Summary¶
| Concept | Why AMD cares |
|---|---|
| Environment capture | Can’t bisect a regression without knowing exact versions |
| Golden baselines | Source of truth across ROCm releases |
| Regression detection | Automated — no human should manually compare numbers |
| Version matrix | ROCm × PyTorch × driver combos = hundreds of test configs |
| Release report | Goes to engineering managers and partners (Meta, MS, etc.) |
| System test plan | What AMD means by “system-level test plans” in the JD |
AMD Interview Key Points¶
A regression suite is not just “run tests” — it’s record → compare → report
Correctness regressions are BLOCKERS; performance regressions are flagged but may ship
The version matrix is the core deliverable: N versions × M GPUs × K frameworks
A Principal Staff engineer designs the framework, not just runs it
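To close the record → compare → report loop in CI, the JSON report from Exercise 4.7 would typically gate the pipeline via an exit code. A minimal sketch (the file name matches the one written by this lab; the exit-code mapping itself is an assumption for illustration, not an AMD convention):

```python
import json
from pathlib import Path

def ci_gate(report_file: str = 'lab04_release_validation_report.json') -> int:
    """Map the report's release_status to a shell exit code:
    0 = approved (merge), 1 = conditional (needs review), 2 = blocked."""
    report = json.loads(Path(report_file).read_text())
    codes = {'APPROVED': 0, 'CONDITIONAL': 1, 'BLOCKED': 2}
    # Unknown or missing status fails closed: treat it as blocked.
    return codes.get(report.get('release_status'), 2)
```

A CI job would end with `sys.exit(ci_gate())`, letting Jenkins or GitLab CI mark the build green, unstable, or failed without any human comparing numbers.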
All Labs Completed¶
| Lab | Topic |
|---|---|
| 01 | Hardware: power, thermals, memory, stability |
| 02 | Kernels: GEMM, attention, softmax, tolerances |
| 03 | Model: profiling, precision, LLM phases |
| 04 | Regression: baselines, version matrix, CI/CD |
| 05 | Distributed: NCCL/RCCL, multi-GPU |
| 06 | Framework: PyTorch backend, ONNX |
| 07 | GPGPU backends: OpenCL, Vulkan, SYCL |
| 08 | Benchmarking: TTFT, throughput, SLOs |
Previous: lab_03_model_performance.ipynb
Next: lab_05_distributed_training.ipynb — Distributed training validation
Back to Overview: README.md