Chapter 9: Industry AI Benchmarking & Performance Analysis
Duration: 3-4 hours | Difficulty: Intermediate-Advanced
Learning Objectives
By the end of this chapter, you will understand:
The standard metrics used to benchmark LLM inference (TTFT, output speed, throughput)
How Artificial Analysis benchmarks API providers and hardware systems
AA-SLT (System Load Test) and AA-AgentPerf methodologies in detail
Intelligence benchmarking: evals, indices, and confidence intervals
Multi-modal benchmarking approaches
Other industry benchmarks: MLPerf, LMSys Arena, Open LLM Leaderboard
How these external benchmarks relate to your internal hardware validation work
Why This Matters for Hardware Validation Engineers
Your internal validation work (Chapters 1-8) produces the numbers that external benchmarks like Artificial Analysis measure. Understanding what gets measured externally tells you:
- What to optimize for – TTFT and output speed are the KPIs customers care about
- How your hardware is judged – MI300X vs H100 vs B200 comparisons are public
- What SLOs to target – real production thresholds for agentic workloads
- Where AMD can differentiate – $/hr, perf/watt, and capacity per rack
| Internal Validation (Ch 1-8) | External Benchmark (This Chapter) |
|---|---|
| GEMM correctness & TFLOPS | Output tokens/sec under load |
| Power draw during sustained ops | Perf/watt normalization |
| Regression baselines | Throughput plateau detection |
| Distributed AllReduce bandwidth | System-level capacity planning |
Part 1 – LLM Performance Metrics: What Gets Measured
1.1 Core Latency & Speed Metrics
Every LLM inference benchmark measures some combination of these:

| Metric | Definition | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Time from request sent to first token received | User-perceived responsiveness |
| TTFAT (Time to First Answer Token) | TTFT + reasoning token time (for reasoning models) | Real wait time before useful output |
| Output Speed | Tokens/sec after first token | Reading speed – how fast text appears |
| System Output Throughput | Total tokens/sec across all concurrent users | Hardware capacity metric |
| E2E Response Time | Total time for complete response | Full request lifecycle |
1.2 The Formulas
Time to First Token
    TTFT = t_first_token - t_request_sent
Time to First Answer Token (reasoning models)
    TTFAT = TTFT + (reasoning_tokens / output_speed)
Output Speed
    Output Speed = (n_tokens - 1) / (t_last_token - t_first_token)
Total Response Time (for 100 output tokens)
    Total Response Time = TTFT + 100 / Output Speed
End-to-End Response Time (reasoning models, 500 answer tokens)
    E2E Response Time = TTFT + (reasoning_tokens / output_speed) + 500 / Output Speed
1.3 Token Normalization
Different models use different tokenizers. Artificial Analysis normalizes everything to OpenAI tokens (using the o200k_base tokenizer via tiktoken) for fair comparison.

| Model | Native tokenizer | Tokens for "Hello, world!" | OpenAI tokens |
|---|---|---|---|
| GPT-4 | o200k_base | 4 | 4 (reference) |
| Llama 3 | SentencePiece | 5 | 4 |
| Claude | Custom BPE | 4 | 4 |

This matters because a model claiming "200 tokens/sec" with a verbose tokenizer might actually be generating less text than one claiming "150 tokens/sec" with a compact tokenizer.
1.4 Pricing Metrics
Blended Price assumes a 3:1 input-to-output token ratio:
    Blended Price = (3 × Input Price + Output Price) / 4
This is expressed as USD per 1M tokens.
# ── 1.5 Metric Calculation Demo ──────────────────────────────────────────────
import time
import numpy as np
def calculate_llm_metrics(token_timestamps: list[float], request_sent: float) -> dict:
"""Calculate standard LLM inference metrics from token arrival timestamps.
Args:
token_timestamps: list of arrival times for each output token
request_sent: timestamp when the request was sent
Returns:
dict with TTFT, output_speed, e2e_latency
"""
ttft = token_timestamps[0] - request_sent
n_tokens = len(token_timestamps)
if n_tokens > 1:
output_speed = (n_tokens - 1) / (token_timestamps[-1] - token_timestamps[0])
else:
output_speed = 0.0
e2e_latency = token_timestamps[-1] - request_sent
return {
"ttft_ms": ttft * 1000,
"output_speed_tps": output_speed,
"e2e_latency_ms": e2e_latency * 1000,
"total_tokens": n_tokens,
"total_response_time_100t_ms": (ttft + 100 / max(output_speed, 1e-9)) * 1000,
}
# Simulate a streaming response: 50 tokens with ~80 t/s speed
np.random.seed(42)
request_t = time.time()
prefill_delay = 0.15 # 150ms TTFT
inter_token = 1.0 / 80 # ~12.5ms per token
timestamps = []
t = request_t + prefill_delay
for _ in range(50):
timestamps.append(t)
t += inter_token * (1 + np.random.normal(0, 0.05)) # 5% jitter
metrics = calculate_llm_metrics(timestamps, request_t)
print("=== Simulated LLM Inference Metrics ===")
for k, v in metrics.items():
if "ms" in k:
print(f" {k:<35} {v:>10.1f} ms")
elif "tps" in k:
print(f" {k:<35} {v:>10.1f} tokens/sec")
else:
print(f" {k:<35} {v:>10}")
# ── 1.6 Token Normalization with tiktoken ────────────────────────────────────
try:
import tiktoken
enc = tiktoken.get_encoding("o200k_base") # OpenAI's standard
test_strings = [
"Hello, world!",
"The quick brown fox jumps over the lazy dog.",
"def fibonacci(n):\n if n <= 1: return n\n return fibonacci(n-1) + fibonacci(n-2)",
"Explain the concept of attention mechanisms in transformer models.",
]
print(f"{'Text (first 50 chars)':<55} {'OpenAI tokens':>15}")
print("-" * 72)
for s in test_strings:
tokens = enc.encode(s)
display = s[:50] + ("..." if len(s) > 50 else "")
print(f"{display:<55} {len(tokens):>15}")
# Why normalization matters
print("\n=== Why normalization matters ===")
text = "Explain quantum computing in simple terms." * 10
openai_tokens = len(enc.encode(text))
# Hypothetical: another tokenizer produces 20% more tokens for same text
native_tokens = int(openai_tokens * 1.2)
native_speed = 200 # tokens/sec in native tokens
normalized_speed = native_speed * (openai_tokens / native_tokens)
print(f" Native tokenizer speed : {native_speed} native tokens/sec")
print(f" OpenAI-normalized speed: {normalized_speed:.0f} OpenAI tokens/sec")
print(f" Actual text generation is {(1 - normalized_speed/native_speed)*100:.0f}% slower than native number suggests")
except ImportError:
print("tiktoken not installed. Run: pip install tiktoken")
# ── 1.7 Blended Pricing Calculator ───────────────────────────────────────────
PROVIDER_PRICING = {
"GPT-4o": {"input": 2.50, "output": 10.00},
"GPT-4o-mini": {"input": 0.15, "output": 0.60},
"Claude Sonnet": {"input": 3.00, "output": 15.00},
"Claude Haiku": {"input": 0.25, "output": 1.25},
"Llama 3.3 70B": {"input": 0.20, "output": 0.20}, # typical open-weight hosting
"DeepSeek V3": {"input": 0.27, "output": 1.10},
}
def blended_price(input_price: float, output_price: float) -> float:
"""Calculate blended price assuming 3:1 input:output ratio."""
return (3 * input_price + output_price) / 4
print(f"{'Model':<20} {'Input $/1M':>12} {'Output $/1M':>13} {'Blended $/1M':>14}")
print("-" * 62)
for model, prices in sorted(PROVIDER_PRICING.items(), key=lambda x: blended_price(x[1]['input'], x[1]['output'])):
bp = blended_price(prices['input'], prices['output'])
print(f"{model:<20} ${prices['input']:>10.2f} ${prices['output']:>11.2f} ${bp:>12.2f}")
Part 2 – API Performance Benchmarking
2.1 How Artificial Analysis Tests API Providers
Testing Infrastructure

| Component | Specification |
|---|---|
| Test server | Virtual machine in GCP |
| Test frequency | 1K & 10K input workloads: 8x/day (~every 3 hours) |
| | Parallel workload (10 concurrent): 1x/day |
| | 100K input workload: 1x/week |
| Metric window | Rolling 72 hours ("live" metrics) |
| Accounts | Mix of anonymous, credited, and provided API keys |

Workload Shapes

| Workload | Input tokens | Purpose |
|---|---|---|
| Short context | ~1,000 | Chatbot-style, quick Q&A |
| Medium context | ~10,000 | Document analysis, code review |
| Long context | ~100,000 | Full codebase, book analysis |
| Vision | Image + text | Multi-modal inference |
| Parallel (10x) | 10 concurrent | Load handling |
What Makes This Methodology Good
- Standardized location – all tests from the same GCP zone eliminates network variance
- Regular cadence – 8x/day captures peak/off-peak performance
- Rolling window – 72 hours smooths transient spikes
- Multiple workloads – short, medium, and long context exercise different system paths
- Real accounts – tests what customers actually experience
What It Doesn't Capture
- Geographic variation – only tests from one zone
- Burst patterns – 10 concurrent is modest; production may see 1000s
- Model-specific optimizations – can't see if a provider has speculative decoding enabled
- Cold start latency – serverless endpoints may have spin-up time not captured
2.2 Reasoning Model Considerations
For models like o1, DeepSeek-R1, and Claude with extended thinking:
- Average Reasoning Tokens is measured across 60 diverse prompts (personal, commercial, coding, math, science)
- Prompts are sourced from MMLU Pro, AIME 2025, and LiveCodeBench
- Default assumption: 2K reasoning tokens when the actual count is unavailable
- TTFAT (Time to First Answer Token) is reported separately from TTFT
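To make the TTFAT adjustment concrete, here is a minimal sketch; the 2,000 reasoning-token default mirrors the assumption above, while the TTFT and speed values are purely illustrative:

```python
# TTFAT sketch for reasoning models. The 2,000 reasoning-token
# default mirrors the assumption above; the speeds are illustrative.
def ttfat_seconds(ttft_s: float, reasoning_tokens: int,
                  output_speed_tps: float) -> float:
    """TTFAT = TTFT + time spent generating reasoning tokens."""
    return ttft_s + reasoning_tokens / output_speed_tps

# 0.3 s TTFT plus 2K reasoning tokens at 150 t/s -> ~13.6 s wait
print(f"TTFAT: {ttfat_seconds(0.3, 2000, 150):.1f} s")
```

Note how the reasoning phase, not the prefill, dominates the wait: this is why TTFAT is reported separately from TTFT.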
# ── 2.3 Simulating an API Performance Benchmark ──────────────────────────────
def simulate_api_benchmark(n_runs: int = 20, base_ttft_ms: float = 200,
base_speed_tps: float = 100, output_tokens: int = 100):
"""Simulate API performance measurements with realistic variance.
Real API performance varies due to:
- Server load (queueing)
- Batch scheduling
- Network jitter
- KV cache hits/misses
"""
results = []
for i in range(n_runs):
# Add realistic noise
ttft = base_ttft_ms * np.random.lognormal(0, 0.3) # log-normal: right-skewed
speed = base_speed_tps * np.random.normal(1.0, 0.1) # normal: symmetric
speed = max(speed, 10) # floor
e2e = ttft + (output_tokens / speed) * 1000
results.append({"ttft_ms": ttft, "speed_tps": speed, "e2e_ms": e2e})
return results
# Simulate two providers
provider_a = simulate_api_benchmark(n_runs=50, base_ttft_ms=150, base_speed_tps=120)
provider_b = simulate_api_benchmark(n_runs=50, base_ttft_ms=300, base_speed_tps=90)
def summarize(results, name):
ttfts = [r["ttft_ms"] for r in results]
speeds = [r["speed_tps"] for r in results]
e2es = [r["e2e_ms"] for r in results]
print(f"\n{'='*55}")
print(f" {name} (n={len(results)})")
print(f"{'='*55}")
print(f" {'Metric':<25} {'Median':>10} {'P5':>10} {'P95':>10}")
print(f" {'-'*55}")
for label, data, unit in [
("TTFT", ttfts, "ms"),
("Output Speed", speeds, "t/s"),
("E2E (100 tokens)", e2es, "ms"),
]:
p5 = np.percentile(data, 5)
p50 = np.percentile(data, 50)
p95 = np.percentile(data, 95)
print(f" {label:<25} {p50:>8.0f} {unit} {p5:>8.0f} {p95:>8.0f}")
summarize(provider_a, "Provider A (fast TTFT, high speed)")
summarize(provider_b, "Provider B (slow TTFT, lower speed)")
Part 3 – Hardware Benchmarking: AA-SLT (System Load Test)
3.1 Overview
The AA-SLT is Artificial Analysis's original hardware benchmark. It measures how a hardware system performs under increasing concurrent load.
Design Principles
- Phased concurrency ramp – start with 1 user, scale to saturation
- Fixed workload – 1K input tokens, 1K output tokens per query
- Sustained load – 3 minutes per phase (excluding ramp-up/cool-down)
- Immediate replacement – when a query finishes, a new one starts immediately
- Streaming enabled – tokens are measured as they arrive
Concurrency Phases
Phase: 1 2 3 4 5 6 7 8 9 ...
Users: 1 2 4 8 16 32 64 128 192 ...
Time: 3m 3m 3m 3m 3m 3m 3m 3m 3m
After 64 users, concurrency increases in increments of 64 until throughput plateaus.
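The ramp schedule described above can be generated with a small sketch; the 320-user cap is an illustrative stand-in, since the real benchmark stops at the throughput plateau rather than a fixed count:

```python
# Generator for the AA-SLT concurrency schedule: doubling up to 64
# users, then steps of 64. The 320-user cap here is illustrative;
# the real benchmark stops when throughput plateaus.
def slt_schedule(cap: int = 320):
    n = 1
    while n < 64:
        yield n
        n *= 2
    while n <= cap:
        yield n
        n += 64

print(list(slt_schedule()))  # [1, 2, 4, 8, 16, 32, 64, 128, 192, 256, 320]
```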
Key MetricsΒΆ
Metric |
Definition |
Aggregation |
|---|---|---|
System Output Throughput |
Total output tokens/sec across all concurrent users |
Average over phase |
Response Rate |
Proportion of queries that received at least 1 token |
Per phase |
E2E Latency per Query |
Full response time per query |
Median per phase |
Output Speed per Query |
Tokens/sec after first token, per query |
Median per phase |
Throughput Plateau Detection
The benchmark stops when adding more concurrent users no longer increases total system throughput, indicating the hardware is saturated:

System Throughput (t/s)
|
|            ___________   ← plateau (hardware saturated)
|           /
|          /
|        /
|       /
|     /
|   /
|  /
|_/
+---------------------------→ Concurrent Users
  1   2   4   8  16  32  64 128
3.2 Current Results (Llama 3.3 70B)
As measured by Artificial Analysis (March 2026):

| Accelerator | Peak System Throughput (t/s) | Per-Query Speed (t/s) | Rental $/GPU/hr |
|---|---|---|---|
| NVIDIA B200 | 15,400 | 144 | – |
| NVIDIA H200 | 8,470 | 97 | $5.00 |
| NVIDIA H100 | 7,290 | 95.9 | $2.70 |
| AMD MI300X | 6,730 | 85 | $2.00 |
| Google TPU v6e | 3,670 | 61.3 | $2.70 |

Key Observations for AMD
- MI300X throughput is ~92% of H100's at 74% of the cost – competitive on perf/$
- The per-query speed gap is larger (85 vs ~96 t/s) – a decode-phase optimization opportunity
- B200 delivers 2.3x MI300X throughput – next-gen AMD (MI350X) needs to close this gap
- MI300X's 192GB of HBM3 enables larger batch sizes than H100 (80GB) – an advantage at high concurrency
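The perf/$ claim above can be checked directly from the table's figures (B200 is omitted because no rental price is listed; these are the table's numbers, not new measurements):

```python
# Perf-per-dollar from the results table above. B200 is omitted:
# no rental price was reported. Table figures only, no new data.
results = {
    "H200":    {"throughput_tps": 8470, "price_hr": 5.00},
    "H100":    {"throughput_tps": 7290, "price_hr": 2.70},
    "MI300X":  {"throughput_tps": 6730, "price_hr": 2.00},
    "TPU v6e": {"throughput_tps": 3670, "price_hr": 2.70},
}
ranked = sorted(results.items(),
                key=lambda kv: kv[1]["throughput_tps"] / kv[1]["price_hr"],
                reverse=True)
for name, r in ranked:
    print(f"{name:<8} {r['throughput_tps'] / r['price_hr']:>7.0f} output t/s per $/hr")
```

MI300X comes out on top of this ranking despite its lower absolute throughput, which is the perf/$ story in the observations above.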
# ── 3.3 Simulate AA-SLT Phased Load Test ─────────────────────────────────────
import dataclasses
@dataclasses.dataclass
class SLTPhaseResult:
concurrent_users: int
system_throughput_tps: float
median_per_query_speed_tps: float
median_e2e_latency_ms: float
response_rate: float # 0.0 - 1.0
def simulate_slt_phase(concurrent_users: int, max_throughput: float = 7000,
single_user_speed: float = 95) -> SLTPhaseResult:
"""Simulate one phase of an AA-SLT benchmark.
Models the fundamental trade-off: as concurrency increases,
per-query speed decreases but total throughput increases (until saturation).
Args:
concurrent_users: number of simultaneous queries
max_throughput: hardware's maximum total output tokens/sec
single_user_speed: per-query speed with 1 concurrent user
"""
# Per-query speed degrades as resources are shared
# Modeled as: speed = single_user_speed / (1 + alpha * log(users))
alpha = 0.35
per_query_speed = single_user_speed / (1 + alpha * np.log(max(concurrent_users, 1)))
# Total throughput = per_query * users, capped at hardware max
raw_throughput = per_query_speed * concurrent_users
system_throughput = min(raw_throughput, max_throughput)
# If throughput-limited, per-query speed is further reduced
if raw_throughput > max_throughput:
per_query_speed = max_throughput / concurrent_users
# E2E latency: TTFT + decode time for 1000 tokens
ttft_ms = 100 + 20 * np.log(max(concurrent_users, 1)) # queuing delay
decode_ms = (1000 / per_query_speed) * 1000
e2e_ms = ttft_ms + decode_ms
# Response rate degrades at extreme concurrency
response_rate = min(1.0, 1.0 - max(0, concurrent_users - 128) * 0.001)
return SLTPhaseResult(
concurrent_users=concurrent_users,
system_throughput_tps=system_throughput,
median_per_query_speed_tps=per_query_speed,
median_e2e_latency_ms=e2e_ms,
response_rate=response_rate,
)
# Run simulated SLT for two "accelerators"
concurrency_levels = [1, 2, 4, 8, 16, 32, 64, 128, 192, 256]
accelerators = {
"MI300X (simulated)": {"max_throughput": 6730, "single_user_speed": 85},
"H100 (simulated)": {"max_throughput": 7290, "single_user_speed": 96},
}
for accel_name, params in accelerators.items():
print(f"\n{'='*70}")
print(f" AA-SLT Results: {accel_name}")
print(f"{'='*70}")
print(f" {'Users':>6} {'Sys Throughput':>16} {'Per-Query Speed':>17} {'E2E Latency':>13} {'Resp Rate':>10}")
print(f" {'-'*65}")
prev_throughput = 0
for n_users in concurrency_levels:
r = simulate_slt_phase(n_users, **params)
plateau = " << plateau" if r.system_throughput_tps - prev_throughput < 100 and n_users > 1 else ""
print(f" {r.concurrent_users:>6} {r.system_throughput_tps:>13.0f} t/s {r.median_per_query_speed_tps:>13.1f} t/s {r.median_e2e_latency_ms:>10.0f} ms {r.response_rate:>9.1%}{plateau}")
prev_throughput = r.system_throughput_tps
Part 4 – Hardware Benchmarking: AA-AgentPerf
4.1 Why a New Benchmark?
AA-SLT uses uniform 1K/1K workloads. Real agentic use cases are very different:

| Property | AA-SLT | AA-AgentPerf |
|---|---|---|
| Input length | Fixed 1K tokens | 1K–131K tokens (mean 27K) |
| Output length | Fixed 1K tokens | Median 150, P95 ~2K |
| Multi-turn | No | Yes (up to 200 turns) |
| Workload source | Synthetic | Real coding agent trajectories |
| Context growth | No | Each turn adds to context |
| KV cache stress | Minimal | Heavy (long sequences, reuse) |
| Optimizations | Baseline | Production (speculative decoding, disaggregated prefill/decode) |
4.2 DatasetΒΆ
Source: Real coding agent trajectories from public repositories
Models used to generate: Top open-source models with reasoning enabled
Languages: 12+ programming languages
Input Sequence Length (ISL): 1Kβ131K tokens, mean ~27K
Output Sequence Length (OSL): Median ~150 tokens, P95 ~2K
Tuning subset: Provided to participants for config validation
Test dataset: Kept private to prevent benchmark-targeted optimizations
4.3 Service-Level Objectives (SLOs)
SLOs are derived from real market data – what providers actually deliver today:

| Model | SLO Tier | P25 Output Speed (t/s) | P95 TTFT (s) |
|---|---|---|---|
| DeepSeek V3.2 (Thinking) | #1 | 30 | 15 |
| | #2 | 100 | 15 |
| | #3 | 300 | 15 |
| gpt-oss-120b (high) | #1 | 100 | 10 |
| | #2 | 250 | 10 |
| | #3 | 500 | 10 |
| | #4 | 2,000 | 10 |

P25 Output Speed (not P50 or P5) is used because agentic workloads have many short-output requests. P25 captures the slower quarter without being overly strict.
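A small sketch of why P25 is a sensible percentile for a right-skewed speed distribution; the log-normal shape below is synthetic and purely illustrative:

```python
import numpy as np

# P25 vs P50 vs P5 on a right-skewed per-request speed distribution.
# The log-normal shape below is synthetic, purely for illustration.
rng = np.random.default_rng(0)
speeds = 300.0 / rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

for q in (5, 25, 50):
    print(f"P{q:<2}: {np.percentile(speeds, q):6.0f} t/s")
# P25 tracks the slower quarter of requests; P5 would anchor the SLO
# on rare outliers, while P50 would ignore the slow tail entirely.
```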
4.4 Binary Search for Max Users
Instead of a linear ramp (like AA-SLT), AgentPerf uses binary search:
Phase 1: 1 user   → 348 t/s → PASS (above 100 t/s SLO)
Phase 2: 2 users  → 310 t/s → PASS
Phase 3: 4 users  → 245 t/s → PASS
Phase 4: 8 users  → 152 t/s → PASS
Phase 5: 16 users → 42 t/s  → FAIL
Phase 6: 12 users → 82 t/s  → FAIL
Phase 7: 10 users → 128 t/s → PASS
Phase 8: 11 users → 108 t/s → PASS
Result: max users = 11 at SLO #1 (100 t/s)
4.5 Result Normalization
Results are normalized to enable fair comparison:

| Normalization | Why It Matters |
|---|---|
| Per accelerator | Compare chip-to-chip (MI300X vs H100) |
| Per kW TDP | Power efficiency – critical for datacenter TCO |
| Per rack | Physical space constraints in datacenters |
| Per $/hr | Cost efficiency for cloud customers |
4.6 Production Optimizations Allowed
Unlike many benchmarks, AgentPerf allows real-world optimizations:

| Optimization | What It Does | Impact |
|---|---|---|
| KV cache reuse | Skip recomputing attention for prior turns | Dramatically reduces TTFT in multi-turn sessions |
| Speculative decoding | Draft model predicts tokens, main model verifies | 2-3x decode speedup |
| Disaggregated prefill/decode | Separate hardware pools for each phase | Better utilization |
| Continuous batching | Add new requests to a running batch | Higher throughput |
| Prefix caching | Cache common prompt prefixes | Faster system prompts |
# ── 4.7 Simulate AA-AgentPerf Binary Search ──────────────────────────────────
def simulate_agentperf_phase(n_users: int, base_speed: float = 350,
degradation_rate: float = 0.6) -> dict:
"""Simulate AgentPerf at a given concurrency.
In AgentPerf, performance degrades faster than SLT because:
- Variable context lengths stress KV cache
- Long sequences increase memory pressure
- Multi-turn sessions create scheduling complexity
"""
# Power-law degradation: speed = base / users^rate
p25_speed = base_speed / (n_users ** degradation_rate)
p95_ttft = 1.0 + 0.5 * n_users # grows linearly
return {"users": n_users, "p25_speed": p25_speed, "p95_ttft": p95_ttft}
def binary_search_max_users(slo_speed: float, slo_ttft: float,
base_speed: float = 350) -> list[dict]:
"""Find max concurrent users meeting SLO via binary search."""
phases = []
# Exponential ramp first
n = 1
while n <= 64:
result = simulate_agentperf_phase(n, base_speed)
passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
phases.append({**result, "pass": passed})
if not passed:
break
n *= 2
if not phases[-1]["pass"]:
# Binary search between last pass and first fail
lo = phases[-2]["users"] if len(phases) > 1 else 1
hi = phases[-1]["users"]
while hi - lo > 1:
mid = (lo + hi) // 2
result = simulate_agentperf_phase(mid, base_speed)
passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
phases.append({**result, "pass": passed})
if passed:
lo = mid
else:
hi = mid
return phases
# Run for SLO #1: 100 t/s, 10s TTFT
print("=== AA-AgentPerf Binary Search (SLO: 100 t/s, 10s TTFT) ===")
print(f" {'Phase':>5} {'Users':>6} {'P25 Speed':>12} {'P95 TTFT':>10} {'Result':>8}")
print(f" {'-'*45}")
phases = binary_search_max_users(slo_speed=100, slo_ttft=10)
max_users = 0
for i, p in enumerate(phases):
status = "PASS" if p["pass"] else "FAIL"
marker = "✓" if p["pass"] else "✗"
if p["pass"]:
max_users = p["users"]
print(f" {i+1:>5} {p['users']:>6} {p['p25_speed']:>9.0f} t/s {p['p95_ttft']:>7.1f} s {status:>8} {marker}")
print(f"\n Max concurrent users at SLO: {max_users}")
# Now run for all SLO tiers
print("\n=== Capacity per SLO Tier ===")
slo_tiers = [
{"name": "SLO #1 (economy)", "speed": 30, "ttft": 15},
{"name": "SLO #2 (standard)", "speed": 100, "ttft": 10},
{"name": "SLO #3 (premium)", "speed": 300, "ttft": 10},
{"name": "SLO #4 (ultra)", "speed": 2000, "ttft": 10},
]
print(f" {'SLO Tier':<25} {'Target Speed':>13} {'Target TTFT':>12} {'Max Users':>10}")
print(f" {'-'*62}")
for tier in slo_tiers:
phases = binary_search_max_users(tier["speed"], tier["ttft"])
max_u = max((p["users"] for p in phases if p["pass"]), default=0)
print(f" {tier['name']:<25} {tier['speed']:>10} t/s {tier['ttft']:>9} s {max_u:>10}")
Part 5 – Intelligence Benchmarking
5.1 AA Intelligence Index v4.0
The Intelligence Index aggregates 10 evaluations into a single score (0-100):

| Evaluation | What It Tests | Source |
|---|---|---|
| GDPval-AA | Real-world tasks across 44 occupations, 9 industries | AA + agentic (shell + web access) |
| τ²-Bench Telecom | Domain-specific (telecom) reasoning | Academic |
| Terminal-Bench Hard | Software engineering, sysadmin, game-playing via terminal | Stanford + AA |
| SciCode | Scientific computing | Academic |
| AA-LCR | Long context recall | AA |
| AA-Omniscience | General knowledge | AA |
| IFBench | Instruction following | Academic |
| Humanity's Last Exam | Extremely hard multi-domain questions | External |
| GPQA Diamond | Graduate-level science Q&A | Academic |
| CritPt | Critical thinking & reasoning | External |

Methodology Principles
- 95% confidence interval < 1% – verified via 10+ repeats on select models
- Agentic evals (GDPval, Terminal-Bench) give models real tool access
- ELO ratings are derived from blind pairwise comparisons (GDPval)
- Sub-indices: a Coding Index and an Agentic Index (subsets of the 10 evals)
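For reference, the pairwise-comparison ratings mentioned above work like the textbook ELO update below; the K-factor and procedure here are the standard chess formulation, and AA's exact implementation may differ:

```python
# Textbook ELO update for one blind pairwise comparison.
# Standard chess formulation; AA's exact procedure may differ.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equal-rated models: a win moves the winner up by k/2
print(elo_update(1500, 1500, 1.0))  # -> (1516.0, 1484.0)
```

Because the expected score depends on the rating gap, an upset win moves ratings much more than a win by the favorite, which is what lets a few thousand blind comparisons converge to a stable ranking.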
5.2 Why Hardware Engineers Should Care About Intelligence Benchmarks
You might think: "I validate hardware, not model quality." But:
- Quantization impact – does INT8/FP8 on your hardware degrade intelligence scores?
- Compiler correctness – torch.compile / graph optimizations must not change model outputs
- Framework parity – the same model on CUDA vs ROCm should score identically
- Numerical precision – your kernel validation (Chapter 2) directly affects these scores
5.3 Other Intelligence Benchmarks

| Benchmark | What It Does | Key Metric | Used By |
|---|---|---|---|
| MMLU / MMLU Pro | Multi-task language understanding | Accuracy % | Everyone |
| HumanEval / MBPP | Code generation | Pass@1 | Code models |
| MATH / GSM8K | Math problem solving | Accuracy % | Reasoning models |
| GPQA | Graduate-level science | Accuracy % | Frontier models |
| IFEval | Instruction following | Strict accuracy % | Aligned models |
| LiveCodeBench | Competitive programming (live) | Pass rate | Coding models |
| SWE-bench | Real GitHub issue resolution | Resolve % | Agentic models |
| AIME | Math competition problems | Solve rate | Reasoning models |
# ── 5.4 Confidence Interval Calculation for Benchmarks ───────────────────────
from scipy import stats as scipy_stats
def benchmark_confidence_interval(scores: list[float], confidence: float = 0.95) -> dict:
"""Calculate confidence interval for a benchmark score.
This is how AA ensures their Intelligence Index has <1% CI.
"""
n = len(scores)
mean = np.mean(scores)
se = scipy_stats.sem(scores) # standard error of the mean
ci = scipy_stats.t.interval(confidence, df=n-1, loc=mean, scale=se)
return {
"mean": mean,
"std": np.std(scores, ddof=1),
"n_runs": n,
"ci_low": ci[0],
"ci_high": ci[1],
"ci_width": ci[1] - ci[0],
"ci_width_pct": (ci[1] - ci[0]) / mean * 100,
}
# Simulate: run an eval 15 times, get slightly different scores each time
np.random.seed(42)
model_a_scores = np.random.normal(52.5, 1.2, size=15) # mean ~52.5, std ~1.2
model_b_scores = np.random.normal(50.8, 1.5, size=15) # mean ~50.8, std ~1.5
for name, scores in [("Model A (frontier)", model_a_scores), ("Model B (challenger)", model_b_scores)]:
ci = benchmark_confidence_interval(scores)
print(f"\n{name}:")
print(f" Mean score : {ci['mean']:.2f}")
print(f" Std dev : {ci['std']:.2f}")
print(f" 95% CI : [{ci['ci_low']:.2f}, {ci['ci_high']:.2f}]")
print(f" CI width : {ci['ci_width']:.2f} ({ci['ci_width_pct']:.1f}% of mean)")
print(f" Meets <1% CI? : {'Yes' if ci['ci_width_pct'] < 1.0 else 'No'}")
# Are the two models statistically different?
t_stat, p_value = scipy_stats.ttest_ind(model_a_scores, model_b_scores)
print(f"\n=== Are they statistically different? ===")
print(f" t-statistic : {t_stat:.3f}")
print(f" p-value : {p_value:.4f}")
print(f" Significant : {'Yes (p < 0.05)' if p_value < 0.05 else 'No (p >= 0.05)'}")
Part 6 – Multi-Modal Benchmarking
Artificial Analysis also benchmarks non-text modalities:
6.1 Modality Coverage

| Modality | Models Tested | Key Metrics |
|---|---|---|
| Text-to-Image | DALL-E 3, Midjourney, Stable Diffusion, Flux | Quality (human eval), generation time, cost/image |
| Speech-to-Text | Whisper, Deepgram, AssemblyAI | Word Error Rate (WER), latency, cost/min |
| Text-to-Speech | ElevenLabs, OpenAI TTS, Play.ht | Quality (MOS), latency, cost/char |
| Speech-to-Speech | GPT-4o Voice, Gemini Live | Latency, naturalness, turn-taking |
6.2 Openness Index
AA also rates models on openness – how much the model creator shares:

| Factor | Examples |
|---|---|
| Weights released | Llama: yes; GPT-4: no |
| Training data disclosed | Few models fully disclose |
| Architecture published | Most open models: yes |
| License permissiveness | Apache 2.0 vs custom restrictive licenses |
| Reproducibility | Training code, hyperparameters |
6.3 Hardware Implications
For hardware validation engineers, multi-modal matters because:
- Image generation stresses the GPU differently than LLMs (UNet/DiT, high VRAM bandwidth)
- Speech processing has real-time constraints (<200ms round-trip)
- Multi-modal models (GPT-4o, Gemini) process images and text in one forward pass – different memory and compute profiles
Part 7 – Other Industry Benchmarks
Artificial Analysis is one player. Here's the broader landscape:
7.1 MLPerf (MLCommons)
The gold standard for ML hardware benchmarking:

| Benchmark | What It Measures | Who Submits |
|---|---|---|
| MLPerf Training | Time to train to a quality target | NVIDIA, Google, AMD, Intel |
| MLPerf Inference | Throughput & latency for inference | Hardware vendors, cloud providers |
| MLPerf HPC | Training on HPC systems | National labs |
| MLPerf Tiny | Inference on microcontrollers | Embedded vendors |
MLPerf Inference Categories

| Scenario | Description |
|---|---|
| Offline | Process all samples ASAP (batch throughput) |
| Server | Process queries arriving at random intervals (latency-bounded) |
| SingleStream | Process one query at a time (edge devices) |
| MultiStream | Multiple concurrent streams (autonomous driving) |
Relevance to AMD
- AMD submits MI300X results to MLPerf – your validation work feeds these submissions
- MLPerf requires verifiable results with auditable configs
- Results are normalized per chip, per system, and per watt
7.2 LMSys Chatbot Arena

| Property | Description |
|---|---|
| Method | Blind pairwise human evaluation |
| Metric | ELO rating (as in chess) |
| Scale | 1M+ human votes |
| Why it matters | The most trusted quality ranking for chatbots |
| Hardware relevance | None directly – but model quality is the end goal |
7.3 Open LLM Leaderboard (HuggingFace)

| Property | Description |
|---|---|
| Method | Automated evaluation on standard benchmarks |
| Benchmarks | MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K |
| Who uses it | Open-source model developers |
| Limitation | Prone to benchmark gaming (training on eval data) |
| Hardware relevance | Used to verify quantized models don't lose quality |
7.4 SPEC ML (Emerging)

| Property | Description |
|---|---|
| Organization | Standard Performance Evaluation Corporation |
| Focus | Standardized ML workload benchmarking |
| Status | Newer and less established than MLPerf |
| Approach | Broader workload coverage, including classical ML |
7.5 Comparison Matrix

| Benchmark | Focus | Measures Hardware? | Measures Quality? | Who Runs It? | Open? |
|---|---|---|---|---|---|
| AA-SLT | HW throughput under load | Yes | No | Artificial Analysis | Results public |
| AA-AgentPerf | HW capacity for agents | Yes | No | Artificial Analysis | Results public |
| AA Intelligence | Model quality | No | Yes | Artificial Analysis | Results public |
| MLPerf Inference | HW throughput (standardized) | Yes | No | Vendor submissions | Fully open |
| MLPerf Training | HW training speed | Yes | No | Vendor submissions | Fully open |
| LMSys Arena | Model quality (human) | No | Yes | LMSYS | Open |
| Open LLM Board | Model quality (auto) | No | Yes | HuggingFace | Open |
| SPEC ML | HW general ML workloads | Yes | No | SPEC | Membership |
# ── 7.6 How Internal Validation Maps to External Benchmarks ──────────────────
mapping = [
("Chapter 1: HW Validation", "Power, thermals, memory", "MLPerf power submission, AA per-kW normalization"),
("Chapter 2: Kernel Val.", "GEMM correctness, TFLOPS", "MLPerf Inference (operator-level perf)"),
("Chapter 3: Framework Val.", "ONNX parity, torch.compile", "AA Intelligence (quantization impact)"),
("Chapter 4: Model Perf.", "Latency, throughput, VRAM", "AA-SLT per-query speed, AA-AgentPerf"),
("Chapter 5: E2E Pipeline", "Data→Model→Post pipeline", "AA E2E Response Time"),
("Chapter 6: Distributed", "AllReduce, scaling eff.", "MLPerf Training (multi-node)"),
("Chapter 7: Datacenter", "K8s, scheduling, monitoring", "AA-AgentPerf (production topology)"),
("Chapter 8: Regression", "Golden baselines, CI/CD", "MLPerf audit trail, AA 72-hour rolling"),
]
print(f"{'Internal Validation':<30} {'What You Test':<30} {'External Benchmark Equivalent'}")
print("=" * 95)
for internal, tests, external in mapping:
print(f"{internal:<30} {tests:<30} {external}")
ExercisesΒΆ
Metric Calculation: Given the following streaming response data, calculate TTFT, output speed, and total response time:
Request sent: T=0.000s
First token: T=0.187s
Last token (token #200): T=2.687s
What would TTFAT be if 500 reasoning tokens were generated at 150 t/s before the answer?
Token Normalization: A model uses a tokenizer that produces 15% more tokens than OpenAIβs for the same text. The model reports 250 native tokens/sec. What is the OpenAI-normalized speed?
SLT Analysis: Using the simulated SLT results above, at what concurrency level does MI300X reach its throughput plateau? What is the efficiency ratio (actual throughput / theoretical throughput) at 64 concurrent users?
AgentPerf SLO Planning: Your MI300X system supports 11 users at SLO #1 (100 t/s). A customer needs to serve 50 concurrent agent users. How many MI300X GPUs do they need? Whatβs the rental cost at $2.00/GPU/hr?
Benchmark Comparison: Explain why MLPerf Inference uses a βServerβ scenario while AA-AgentPerf uses binary search for max users. What different aspect of hardware performance does each approach reveal?
Confidence Intervals: You run a benchmark 5 times and get scores: [48.2, 51.1, 49.7, 50.3, 48.9]. Calculate the 95% CI. Do you need more runs to achieve <1% CI width?
Key TakeawaysΒΆ
TTFT and output speed are the two metrics the industry judges hardware by β your validation work directly feeds these
AA-SLT measures raw hardware throughput with uniform workloads; AA-AgentPerf measures real-world capacity with agentic patterns
Token normalization (to OpenAI tokens) is essential for fair cross-model comparison
SLO-based capacity planning (P25 speed, P95 TTFT) is how production deployments are sized
MLPerf is the hardware industry standard; AA is the customer-facing external validator
Intelligence benchmarks matter for hardware engineers because quantization and compiler optimizations can degrade model quality
Your internal regression baselines (Chapter 8) should track the same metrics external benchmarks measure
Previous: 08_regression_release_validation.ipynb
Back to Overview: README.md