Lab 04: Regression & Release Validation Suite

Golden Baselines · Cross-Version Testing · CI/CD for AI Hardware

Role alignment: AMD Principal Staff – AI/ML Performance Validation
Reference: 08_regression_release_validation.ipynb

This is the most JD-aligned lab — AMD’s key ask is:

“system-level test plans” and “compatibility test suites”

What you will do:

  1. Record a golden baseline (performance + correctness metrics)

  2. Build a regression checker that compares new runs to golden baselines

  3. Create a version matrix: test PyTorch version × driver config

  4. Design a CI/CD-ready test runner with JSON reports

  5. Practice writing a system-level test plan (interview exercise)

Setup

The setup creates the baseline storage directory and initializes the test environment. Golden baselines are stored as individual JSON files keyed by test name, device, and dtype, enabling parallel test execution and easy inspection of individual baseline values. The BASELINE_DIR path is configurable for integration with CI/CD systems that mount shared storage for cross-run baseline persistence.

import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import json
import platform
import subprocess
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, asdict, field
from typing import Dict, List, Optional, Any
from enum import Enum

if torch.cuda.is_available():
    DEVICE = 'cuda'
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
else:
    DEVICE = 'cpu'

print(f'Device: {DEVICE}')
if DEVICE == 'cuda':
    print(f'GPU:    {torch.cuda.get_device_name(0)}')

BASELINE_DIR = Path('baselines')
BASELINE_DIR.mkdir(exist_ok=True)
print(f'Baseline dir: {BASELINE_DIR.resolve()}')
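The setup note above says the baseline path should be configurable for CI/CD systems that mount shared storage. A minimal sketch of one way to do that — the `REGRESSION_BASELINE_DIR` environment-variable name is an assumption for illustration, not part of this lab:

```python
# Sketch: let CI override the baseline location via an environment variable
# (REGRESSION_BASELINE_DIR is a hypothetical name), so runs on shared
# runners can persist baselines across jobs. Falls back to ./baselines.
import os
from pathlib import Path

BASELINE_DIR = Path(os.environ.get('REGRESSION_BASELINE_DIR', 'baselines'))
BASELINE_DIR.mkdir(parents=True, exist_ok=True)
print(f'Baseline dir: {BASELINE_DIR.resolve()}')
```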

Core Concepts

Performance regression vs Correctness regression

| Type | Definition | Example | Severity |
|---|---|---|---|
| Correctness | Output values changed beyond tolerance | GEMM result differs by >1e-2 | BLOCKER |
| Performance | Throughput dropped or latency increased | 10% slower than golden | HIGH |
| Memory | VRAM usage increased significantly | +500MB for same workload | MEDIUM |

AMD releases: ROCm version × GPU driver × framework version must all be tested.
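The severity table above can be sketched as a small standalone classifier. The 10% performance and 15% memory thresholds match the `GoldenBaseline` defaults used later in this lab; the function itself is illustrative:

```python
def classify_regression(baseline_tps, current_tps,
                        baseline_mem_mb, current_mem_mb,
                        max_abs_error, corr_tol,
                        perf_threshold=0.10, mem_threshold=0.15):
    """Map measured deltas to the severity classes in the table above.
    Correctness is checked first because it blocks a release outright."""
    if max_abs_error > corr_tol:
        return 'CORRECTNESS_REGRESSION'   # BLOCKER
    if baseline_tps > 0 and (baseline_tps - current_tps) / baseline_tps > perf_threshold:
        return 'PERF_REGRESSION'          # HIGH
    if baseline_mem_mb > 0 and (current_mem_mb - baseline_mem_mb) / baseline_mem_mb > mem_threshold:
        return 'MEMORY_REGRESSION'        # MEDIUM
    return 'PASS'

# 15% throughput drop exceeds the 10% threshold
print(classify_regression(1000, 850, 500, 510, 1e-3, 1e-2))  # -> PERF_REGRESSION
```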

Exercise 4.1 – System Environment Capture

Every regression run starts with capturing the exact environment.
Without this, you can’t reproduce or bisect a regression.

def capture_environment() -> Dict[str, Any]:
    """Capture full system environment for a regression run."""
    env = {
        'timestamp': datetime.now().isoformat(),
        'platform': platform.platform(),
        'python_version': platform.python_version(),
        'pytorch_version': torch.__version__,
        'cuda_available': torch.cuda.is_available(),
        'device': DEVICE,
    }

    if torch.cuda.is_available():
        env['gpu_name']     = torch.cuda.get_device_name(0)
        env['gpu_count']    = torch.cuda.device_count()
        env['cuda_version'] = torch.version.cuda
        env['vram_gb']      = round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1)

        # Try to get driver version
        try:
            driver = subprocess.check_output(
                ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
                text=True, stderr=subprocess.DEVNULL
            ).strip()
            env['driver_version'] = driver
        except Exception:
            pass

        # Try ROCm
        try:
            rocm = subprocess.check_output(
                ['rocminfo'], text=True, stderr=subprocess.DEVNULL
            )
            for line in rocm.split('\n'):
                if 'Runtime Version' in line:
                    env['rocm_version'] = line.split(':')[-1].strip()
                    break
        except Exception:
            pass

    # Check key packages
    for pkg in ['onnxruntime', 'transformers', 'deepspeed']:
        try:
            mod = __import__(pkg.replace('-', '_'))
            env[f'{pkg}_version'] = getattr(mod, '__version__', 'unknown')
        except ImportError:
            env[f'{pkg}_version'] = 'not_installed'

    return env


env = capture_environment()
print('System environment:')
for k, v in env.items():
    print(f'  {k:30s}: {v}')

Exercise 4.2 – Golden Baseline Framework

A golden baseline is a recorded set of metrics from a known-good state.
Future runs compare against it. If anything regresses beyond threshold → flag it.

class RegressionStatus(Enum):
    PASS        = 'PASS'
    PERF_REGR   = 'PERF_REGRESSION'
    CORR_REGR   = 'CORRECTNESS_REGRESSION'
    MEM_REGR    = 'MEMORY_REGRESSION'
    NO_BASELINE = 'NO_BASELINE'


@dataclass
class BenchmarkResult:
    name: str
    latency_ms: float
    throughput_items_per_sec: float
    max_abs_error: float       # vs FP32 CPU reference
    peak_memory_mb: float
    dtype: str
    config: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self):
        return asdict(self)


class GoldenBaseline:
    """
    Records and compares benchmark results against stored baselines.
    Baselines are stored as JSON files, keyed by (test_name, device, dtype).
    """

    PERF_THRESHOLD  = 0.10   # >10% slower = regression
    MEM_THRESHOLD   = 0.15   # >15% more memory = regression

    def __init__(self, baseline_dir: Path = BASELINE_DIR):
        self.baseline_dir = baseline_dir

    def _key(self, result: BenchmarkResult) -> str:
        return f'{result.name}_{DEVICE}_{result.dtype}'.replace(' ', '_')

    def _path(self, key: str) -> Path:
        return self.baseline_dir / f'{key}.json'

    def record(self, result: BenchmarkResult, env: Dict):
        """Save result as the golden baseline."""
        key = self._key(result)
        data = {'result': result.to_dict(), 'environment': env}
        self._path(key).write_text(json.dumps(data, indent=2))
        print(f'  Golden baseline recorded: {key}')

    def compare(self, result: BenchmarkResult, corr_tol: Optional[float] = None) -> RegressionStatus:
        """Compare result to baseline. Returns regression status."""
        key  = self._key(result)
        path = self._path(key)

        if not path.exists():
            return RegressionStatus.NO_BASELINE

        baseline = json.loads(path.read_text())['result']

        # 1. Correctness check: a regression means the error grew well past
        #    the golden run's error (with an absolute per-dtype floor).
        #    Comparing against a fixed cap alone would false-fail fp16 GEMM,
        #    whose expected error scales with the reduction dimension.
        tol = corr_tol if corr_tol is not None else self._default_corr_tol(result.dtype)
        if result.max_abs_error > max(2 * baseline['max_abs_error'], tol):
            return RegressionStatus.CORR_REGR

        # 2. Performance check
        baseline_tps = baseline['throughput_items_per_sec']
        if baseline_tps > 0:
            perf_delta = (baseline_tps - result.throughput_items_per_sec) / baseline_tps
            if perf_delta > self.PERF_THRESHOLD:
                return RegressionStatus.PERF_REGR

        # 3. Memory check
        if baseline['peak_memory_mb'] > 0 and result.peak_memory_mb > 0:
            mem_delta = (result.peak_memory_mb - baseline['peak_memory_mb']) / baseline['peak_memory_mb']
            if mem_delta > self.MEM_THRESHOLD:
                return RegressionStatus.MEM_REGR

        return RegressionStatus.PASS

    def _default_corr_tol(self, dtype_str: str) -> float:
        return {'float32': 1e-4, 'float16': 1e-2, 'bfloat16': 0.1}.get(dtype_str, 1e-2)


print('GoldenBaseline class defined')
baseline_db = GoldenBaseline()

Exercise 4.3 – Test Suite: Define Benchmarks

A test suite is a collection of named benchmark functions.
Each produces a BenchmarkResult that can be recorded or compared.

def run_gemm_benchmark(M=1024, K=1024, N=1024, dtype=torch.float16, iters=50) -> BenchmarkResult:
    """GEMM performance + correctness benchmark."""
    if DEVICE == 'cpu':
        M = K = N = 256
        dtype = torch.float32
        iters = 10

    # Reference
    a_cpu = torch.randn(M, K)
    b_cpu = torch.randn(K, N)
    ref   = torch.matmul(a_cpu, b_cpu)

    # GPU
    a = a_cpu.to(DEVICE, dtype=dtype)
    b = b_cpu.to(DEVICE, dtype=dtype)

    for _ in range(5):
        _ = torch.matmul(a, b)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()

    if DEVICE == 'cuda':
        torch.cuda.reset_peak_memory_stats()

    t0 = time.perf_counter()
    for _ in range(iters):
        out = torch.matmul(a, b)
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    max_err  = (out.float().cpu() - ref).abs().max().item()
    lat_ms   = elapsed / iters * 1000
    tps      = (M * N * iters) / elapsed
    peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0

    return BenchmarkResult(
        name=f'GEMM_{M}x{K}x{N}',
        latency_ms=round(lat_ms, 3),
        throughput_items_per_sec=round(tps, 1),
        max_abs_error=round(max_err, 8),
        peak_memory_mb=round(peak_mem, 1),
        dtype=str(dtype).split('.')[-1],
        config={'M': M, 'K': K, 'N': N},
    )


def run_layernorm_benchmark(batch=8, seq=512, hidden=768, dtype=torch.float16, iters=100) -> BenchmarkResult:
    """LayerNorm performance + correctness benchmark."""
    if DEVICE == 'cpu':
        batch, seq, hidden = 2, 64, 128
        dtype = torch.float32
        iters = 10

    x_cpu  = torch.randn(batch, seq, hidden)
    ref    = F.layer_norm(x_cpu, [hidden])

    x_gpu  = x_cpu.to(DEVICE, dtype=dtype)

    for _ in range(5):
        _ = F.layer_norm(x_gpu, [hidden])
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

    t0 = time.perf_counter()
    for _ in range(iters):
        out = F.layer_norm(x_gpu, [hidden])
    if DEVICE == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    max_err  = (out.float().cpu() - ref).abs().max().item()
    lat_ms   = elapsed / iters * 1000
    tps      = (batch * seq * iters) / elapsed
    peak_mem = torch.cuda.max_memory_allocated() / 1e6 if DEVICE == 'cuda' else 0.0

    return BenchmarkResult(
        name=f'LayerNorm_{batch}x{seq}x{hidden}',
        latency_ms=round(lat_ms, 3),
        throughput_items_per_sec=round(tps, 1),
        max_abs_error=round(max_err, 8),
        peak_memory_mb=round(peak_mem, 1),
        dtype=str(dtype).split('.')[-1],
        config={'batch': batch, 'seq': seq, 'hidden': hidden},
    )


# Define the test suite as a list of callables
TEST_SUITE = [
    lambda: run_gemm_benchmark(),
    lambda: run_layernorm_benchmark(),
]

print('Test suite defined:', len(TEST_SUITE), 'benchmarks')

Exercise 4.4 – Record Golden Baselines

Recording golden baselines captures the known-good performance and correctness metrics from a validated hardware + software configuration. Each baseline includes throughput, latency, maximum numerical error, and peak memory usage, along with the full environment snapshot (driver version, framework version, GPU model). In AMD’s release workflow, golden baselines are recorded on a certified reference system after the previous release has passed all validation gates. Subsequent development builds are compared against these baselines, and any regression beyond the threshold triggers investigation before the code can be merged.

print('Recording golden baselines...')
print('=' * 60)

for test_fn in TEST_SUITE:
    result = test_fn()
    baseline_db.record(result, env)
    print(f'  {result.name} | latency={result.latency_ms}ms | '
          f'tps={result.throughput_items_per_sec:.0f} | '
          f'max_err={result.max_abs_error:.2e}')

print(f'\nBaselines saved in: {BASELINE_DIR.resolve()}')
print('Files:', [f.name for f in BASELINE_DIR.iterdir()])

Exercise 4.5 – Regression Check

Simulate running the same tests again and checking for regressions.
We’ll inject artificial degradations to verify that the detection logic catches them.

def run_regression_check(inject_perf_regression=False, inject_corr_regression=False):
    """
    Run test suite and compare to golden baselines.
    Optionally inject artificial regressions to test detection.
    """
    print(f'Running regression check  '
          f'(perf_inject={inject_perf_regression}, corr_inject={inject_corr_regression})')
    print('=' * 70)

    run_results = []

    for test_fn in TEST_SUITE:
        result = test_fn()

        # Inject artificial regression for demo purposes
        if inject_perf_regression:
            # Simulate 25% throughput drop
            result.throughput_items_per_sec *= 0.75
            result.latency_ms *= 1.33

        if inject_corr_regression:
            # Simulate numerical error spike
            result.max_abs_error *= 100

        status = baseline_db.compare(result)
        icon   = 'PASS' if status == RegressionStatus.PASS else 'FAIL'

        print(f'  [{icon}] {result.name:45s}  status={status.value}')
        if status != RegressionStatus.PASS and status != RegressionStatus.NO_BASELINE:
            baseline_file = baseline_db._path(baseline_db._key(result))
            if baseline_file.exists():
                gold = json.loads(baseline_file.read_text())['result']
                print(f'         Baseline  tps={gold["throughput_items_per_sec"]:.0f}  '
                      f'err={gold["max_abs_error"]:.2e}')
                print(f'         Current   tps={result.throughput_items_per_sec:.0f}  '
                      f'err={result.max_abs_error:.2e}')

        run_results.append({'name': result.name, 'status': status.value,
                             'result': result.to_dict()})

    passed = sum(1 for r in run_results if r['status'] == 'PASS')
    print(f'\n{passed}/{len(run_results)} tests passed')
    return run_results


# 1. Normal run — should all pass
print('--- Normal run ---')
normal_results = run_regression_check()

print()

# 2. Inject performance regression
print('--- Injected performance regression ---')
perf_results = run_regression_check(inject_perf_regression=True)

print()

# 3. Inject correctness regression
print('--- Injected correctness regression ---')
corr_results = run_regression_check(inject_corr_regression=True)

Exercise 4.6 – Version Matrix

Simulate testing across a version matrix.
In practice: ROCm 6.0 × PyTorch 2.2 vs ROCm 6.1 × PyTorch 2.3

# Simulate a version matrix test
# In real AMD validation, you'd run on actual different driver/framework combos
# Here we simulate different "configs" with slight performance variation

import random
random.seed(42)

VERSION_MATRIX = [
    {'rocm': '6.0', 'pytorch': '2.2.0', 'driver': '6.7.0'},
    {'rocm': '6.1', 'pytorch': '2.3.0', 'driver': '6.8.0'},
    {'rocm': '6.2', 'pytorch': '2.4.0', 'driver': '6.9.0'},
]

print('Version Matrix Compatibility Test')
print('=' * 80)
print(f"{'Version':>30}  {'GEMM PASS':>10}  {'LayerNorm PASS':>15}  {'Perf delta':>12}")
print('-' * 80)

matrix_report = []
for ver in VERSION_MATRIX:
    label = f"ROCm {ver['rocm']} + PyTorch {ver['pytorch']}"

    # Simulate slight perf variation across versions
    perf_multiplier = 1.0 + (random.random() - 0.5) * 0.05  # ±2.5%

    gemm_result = run_gemm_benchmark()
    gemm_result.throughput_items_per_sec *= perf_multiplier
    gemm_status = baseline_db.compare(gemm_result)

    ln_result = run_layernorm_benchmark()
    ln_result.throughput_items_per_sec *= perf_multiplier
    ln_status = baseline_db.compare(ln_result)

    perf_pct = (perf_multiplier - 1) * 100
    gemm_pass = gemm_status in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)
    ln_pass   = ln_status   in (RegressionStatus.PASS, RegressionStatus.NO_BASELINE)

    gemm_label = 'PASS' if gemm_pass else 'FAIL'
    ln_label = 'PASS' if ln_pass else 'FAIL'
    print(f'{label:>30}  {gemm_label:>10}  {ln_label:>15}  {perf_pct:>+11.1f}%')

    matrix_report.append({
        'version': ver,
        'gemm_status': gemm_status.value,
        'layernorm_status': ln_status.value,
        'perf_delta_pct': round(perf_pct, 2),
    })

Exercise 4.7 – CI/CD Report Generator

Generate a structured report that a CI/CD system can parse.
AMD’s Jenkins/GitLab CI consumes exactly this kind of JSON.

class ReleaseValidationReport:
    """Generates a CI/CD-ready release validation report."""

    def __init__(self, run_name: str, environment: Dict):
        self.run_name    = run_name
        self.environment = environment
        self.test_results: List[Dict] = []
        self.start_time  = datetime.now()

    def add_result(self, name: str, status: str, metrics: Dict, notes: str = ''):
        self.test_results.append({
            'test': name,
            'status': status,
            'metrics': metrics,
            'notes': notes,
        })

    def finalize(self) -> Dict:
        end_time = datetime.now()
        passed   = sum(1 for r in self.test_results if r['status'] == 'PASS')
        total    = len(self.test_results)
        # Only correctness regressions block a release
        blockers = [r for r in self.test_results
                    if r['status'] == 'CORRECTNESS_REGRESSION']

        # Overall release decision
        if blockers:
            release_status = 'BLOCKED'
        elif passed < total:
            release_status = 'CONDITIONAL'  # performance regressions but no correctness issues
        else:
            release_status = 'APPROVED'

        return {
            'report_type': 'Release Validation',
            'run_name': self.run_name,
            'release_status': release_status,
            'summary': {
                'total': total,
                'passed': passed,
                'failed': total - passed,
                'blockers': len(blockers),
                'duration_s': (end_time - self.start_time).total_seconds(),
            },
            'environment': self.environment,
            'tests': self.test_results,
            'generated_at': end_time.isoformat(),
        }


# Build a full release report
rpt = ReleaseValidationReport(
    run_name=f'ROCm-6.1_PyTorch-2.3_MI300X_{datetime.now().strftime("%Y%m%d")}',
    environment=env,
)

# Add regression results
for r in normal_results:
    rpt.add_result(
        name=r['name'],
        status=r['status'],
        metrics=r['result'],
    )

# Add version matrix
for m in matrix_report:
    # A matrix entry fails if either benchmark regressed
    status = m['gemm_status'] if m['gemm_status'] != 'PASS' else m['layernorm_status']
    rpt.add_result(
        name=f"VersionMatrix_ROCm{m['version']['rocm']}_PT{m['version']['pytorch']}",
        status=status,
        metrics={'perf_delta_pct': m['perf_delta_pct']},
    )

final_report = rpt.finalize()

# Save
report_path = Path('lab04_release_validation_report.json')
report_path.write_text(json.dumps(final_report, indent=2))

print(json.dumps(final_report, indent=2)[:2000], '...')
print(f'\nFull report saved: {report_path}')
print(f'\nRelease status: {final_report["release_status"]}')

Exercise 4.8 – Write a System-Level Test Plan (Interview Prep)

AMD’s JD says: “defining comprehensive system-level test plans”

Practice writing one here. This is the format AMD expects from a Principal Staff engineer.

Template: System-Level Test Plan

Scenario: Validate PyTorch 2.4 + ROCm 6.2 on MI300X for LLM inference workloads.

Test Plan ID: TP-2024-001
Author: [Your Name]
Target: MI300X × ROCm 6.2 × PyTorch 2.4

Layer 1: Hardware Validation (Gate)

| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| Power stability (24h) | rocm-smi | Within TDP ±5% | BLOCKER |
| Memory bandwidth | rocm-bandwidth-test | ≥80% of 5.3 TB/s | HIGH |
| Thermal stability | rocm-smi | Temp stabilizes <95°C | HIGH |
| ECC error count | rocm-smi --showras | 0 uncorrectable | BLOCKER |

Layer 2: Kernel Correctness (Gate)

| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| GEMM FP16/BF16 | KernelValidator | max_err < 0.01 | BLOCKER |
| Attention (all seq lens) | KernelValidator | max_err < 0.1 | BLOCKER |
| Softmax edge cases | Custom harness | No NaN/Inf | BLOCKER |
| LayerNorm | KernelValidator | max_err < 0.001 | HIGH |

Layer 3: Framework Integration

| Test | Tool | Pass Criteria | Severity |
|---|---|---|---|
| PyTorch ops coverage | Op coverage script | 100% of LLM ops on GPU | BLOCKER |
| ONNX Runtime ROCm EP | ONNX test suite | No CPU fallbacks | HIGH |
| torch.compile | Test suite | Same output as eager | HIGH |

Layer 4: Model Performance

| Test | Workload | Pass Criteria | Severity |
|---|---|---|---|
| LLaMA-2 70B inference | Decode throughput | ≥90% of golden baseline | HIGH |
| BERT-Large training | Tokens/sec | ≥90% of golden baseline | HIGH |
| ResNet50 inference | Images/sec | ≥90% of golden baseline | MEDIUM |

Layer 5: Regression vs Previous Release

| Test | Comparison | Pass Criteria | Severity |
|---|---|---|---|
| Correctness regression | vs ROCm 6.1 baseline | 0 correctness regressions | BLOCKER |
| Performance regression | vs ROCm 6.1 baseline | <5% perf delta | HIGH |
| Memory regression | vs ROCm 6.1 baseline | <10% memory delta | MEDIUM |

Summary

| Concept | Why AMD cares |
|---|---|
| Environment capture | Can’t bisect a regression without knowing exact versions |
| Golden baselines | Source of truth across ROCm releases |
| Regression detection | Automated — no human should manually compare numbers |
| Version matrix | ROCm × PyTorch × driver combos = hundreds of test configs |
| Release report | Goes to engineering managers and partners (Meta, MS, etc.) |
| System test plan | What AMD means by “system-level test plans” in the JD |

AMD Interview Key Points

  • A regression suite is not just “run tests” — it’s record → compare → report

  • Correctness regressions are BLOCKERS; performance regressions are flagged but may ship

  • The version matrix is the core deliverable: N versions × M GPUs × K frameworks

  • A Principal Staff engineer designs the framework, not just runs it

All Labs Completed

| Lab | Topic |
|---|---|
| lab_01 | Hardware: power, thermals, memory, stability |
| lab_02 | Kernels: GEMM, attention, softmax, tolerances |
| lab_03 | Model: profiling, precision, LLM phases |
| lab_04 | Regression: baselines, version matrix, CI/CD |
| lab_05 | Distributed: NCCL/RCCL, multi-GPU |
| lab_06 | Framework: PyTorch backend, ONNX |
| lab_07 | GPGPU backends: OpenCL, Vulkan, SYCL |
| lab_08 | Benchmarking: TTFT, throughput, SLOs |

Previous: lab_03_model_performance.ipynb
Next: lab_05_distributed_training.ipynb — Distributed training validation
Back to Overview: README.md