Chapter 9: Industry AI Benchmarking & Performance Analysis

Duration: 3-4 hours | Difficulty: Intermediate-Advanced

Learning Objectives

By the end of this chapter, you will understand:

  1. The standard metrics used to benchmark LLM inference (TTFT, output speed, throughput)

  2. How Artificial Analysis benchmarks API providers and hardware systems

  3. AA-SLT (System Load Test) and AA-AgentPerf methodologies in detail

  4. Intelligence benchmarking: evals, indices, and confidence intervals

  5. Multi-modal benchmarking approaches

  6. Other industry benchmarks: MLPerf, LMSys Arena, Open LLM Leaderboard

  7. How these external benchmarks relate to your internal hardware validation work

Why This Matters for Hardware Validation Engineers

Your internal validation work (Chapters 1-8) produces the numbers that external benchmarks like Artificial Analysis measure. Understanding what gets measured externally tells you:

  • What to optimize for: TTFT and output speed are the KPIs customers care about

  • How your hardware is judged: MI300X vs H100 vs B200 comparisons are public

  • What SLOs to target: real production thresholds for agentic workloads

  • Where AMD can differentiate: $/hr, perf/watt, and capacity per rack

Internal Validation (Ch 1-8)       External Benchmark (This Chapter)
----------------------------       ---------------------------------
GEMM correctness & TFLOPS          Output tokens/sec under load
Power draw during sustained ops    Perf/watt normalization
Regression baselines               Throughput plateau detection
Distributed AllReduce bandwidth    System-level capacity planning

Part 1 - LLM Performance Metrics: What Gets Measured

1.1 Core Latency & Speed Metrics

Every LLM inference benchmark measures some combination of these:

Metric                              Definition                                       Why It Matters
------                              ----------                                       --------------
TTFT (Time to First Token)          Time from request sent to first token received   User-perceived responsiveness
TTFAT (Time to First Answer Token)  TTFT + reasoning token time (reasoning models)   Real wait time before useful output
Output Speed                        Tokens/sec after first token                     Reading speed: how fast text appears
System Output Throughput            Total tokens/sec across all concurrent users     Hardware capacity metric
E2E Response Time                   Total time for complete response                 Full request lifecycle

1.2 The Formulas

Time to First Token

\[\text{TTFT} = T_{\text{first\_token\_arrival}} - T_{\text{request\_sent}}\]

Time to First Answer Token (reasoning models)

\[\text{TTFAT} = \text{Input Processing Time} + \frac{\text{Avg Reasoning Tokens}}{\text{Reasoning Output Speed}}\]

Output Speed

\[\text{Output Speed} = \frac{\text{Total Tokens} - \text{First Chunk Tokens}}{T_{\text{final\_chunk}} - T_{\text{first\_chunk}}}\]

Total Response Time (for 100 output tokens)

\[\text{Total Response Time} = \text{TTFT} + \frac{100}{\text{Output Speed}}\]

End-to-End Response Time (reasoning models, 500 answer tokens)

\[\text{E2E} = \text{Input Processing} + \frac{\text{Reasoning Tokens}}{\text{Reasoning Speed}} + \frac{500}{\text{Answer Speed}}\]
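The reasoning-model formulas translate directly into code. The numbers below are illustrative inputs, not measurements:

```python
def ttfat_seconds(input_processing_s: float, reasoning_tokens: int,
                  reasoning_speed_tps: float) -> float:
    """Time to First Answer Token: input processing plus reasoning time."""
    return input_processing_s + reasoning_tokens / reasoning_speed_tps


def e2e_seconds(input_processing_s: float, reasoning_tokens: int,
                reasoning_speed_tps: float, answer_tokens: int,
                answer_speed_tps: float) -> float:
    """End-to-end response time for a reasoning model."""
    return (ttfat_seconds(input_processing_s, reasoning_tokens, reasoning_speed_tps)
            + answer_tokens / answer_speed_tps)


# Example: 0.2 s input processing, 2,000 reasoning tokens at 100 t/s,
# then 500 answer tokens at 80 t/s
print(f"TTFAT: {ttfat_seconds(0.2, 2000, 100):.2f} s")         # 20.20 s
print(f"E2E  : {e2e_seconds(0.2, 2000, 100, 500, 80):.2f} s")  # 26.45 s
```

Note how reasoning time dominates: 20 of the 26.45 seconds here are spent before the first answer token appears.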

1.3 Token Normalization

Different models use different tokenizers. Artificial Analysis normalizes everything to OpenAI tokens (using o200k_base tokenizer via tiktoken) for fair comparison.

Model     Native tokenizer   Tokens for "Hello, world!"   OpenAI tokens
-----     ----------------   --------------------------   -------------
GPT-4o    o200k_base         4                            4 (reference)
Llama 3   BPE (128K vocab)   5                            4
Claude    Custom BPE         4                            4

This matters because a model claiming "200 tokens/sec" with a verbose tokenizer might actually be generating less text than one claiming "150 tokens/sec" with a compact tokenizer.

1.4 Pricing Metrics

Blended Price assumes 3:1 input-to-output ratio:

\[\text{Blended Price} = \frac{3 \times \text{Input Price} + \text{Output Price}}{4}\]

This is expressed as USD per 1M tokens.

# ── 1.5 Metric Calculation Demo ──────────────────────────────────────────────
import time
import numpy as np

def calculate_llm_metrics(token_timestamps: list[float], request_sent: float) -> dict:
    """Calculate standard LLM inference metrics from token arrival timestamps.

    Args:
        token_timestamps: list of arrival times for each output token
        request_sent: timestamp when the request was sent

    Returns:
        dict with TTFT, output_speed, e2e_latency
    """
    ttft = token_timestamps[0] - request_sent
    n_tokens = len(token_timestamps)

    if n_tokens > 1:
        output_speed = (n_tokens - 1) / (token_timestamps[-1] - token_timestamps[0])
    else:
        output_speed = 0.0

    e2e_latency = token_timestamps[-1] - request_sent

    return {
        "ttft_ms": ttft * 1000,
        "output_speed_tps": output_speed,
        "e2e_latency_ms": e2e_latency * 1000,
        "total_tokens": n_tokens,
        "total_response_time_100t_ms": (ttft + 100 / max(output_speed, 1e-9)) * 1000,
    }


# Simulate a streaming response: 50 tokens with ~80 t/s speed
np.random.seed(42)
request_t = time.time()
prefill_delay = 0.15  # 150ms TTFT
inter_token = 1.0 / 80  # ~12.5ms per token

timestamps = []
t = request_t + prefill_delay
for _ in range(50):
    timestamps.append(t)
    t += inter_token * (1 + np.random.normal(0, 0.05))  # 5% jitter

metrics = calculate_llm_metrics(timestamps, request_t)
print("=== Simulated LLM Inference Metrics ===")
for k, v in metrics.items():
    if "ms" in k:
        print(f"  {k:<35} {v:>10.1f} ms")
    elif "tps" in k:
        print(f"  {k:<35} {v:>10.1f} tokens/sec")
    else:
        print(f"  {k:<35} {v:>10}")
# ── 1.6 Token Normalization with tiktoken ────────────────────────────────────
try:
    import tiktoken
    enc = tiktoken.get_encoding("o200k_base")  # OpenAI's standard

    test_strings = [
        "Hello, world!",
        "The quick brown fox jumps over the lazy dog.",
        "def fibonacci(n):\n    if n <= 1: return n\n    return fibonacci(n-1) + fibonacci(n-2)",
        "Explain the concept of attention mechanisms in transformer models.",
    ]

    print(f"{'Text (first 50 chars)':<55} {'OpenAI tokens':>15}")
    print("-" * 72)
    for s in test_strings:
        tokens = enc.encode(s)
        display = s[:50] + ("..." if len(s) > 50 else "")
        print(f"{display:<55} {len(tokens):>15}")

    # Why normalization matters
    print("\n=== Why normalization matters ===")
    text = "Explain quantum computing in simple terms." * 10
    openai_tokens = len(enc.encode(text))
    # Hypothetical: another tokenizer produces 20% more tokens for same text
    native_tokens = int(openai_tokens * 1.2)
    native_speed = 200  # tokens/sec in native tokens
    normalized_speed = native_speed * (openai_tokens / native_tokens)
    print(f"  Native tokenizer speed : {native_speed} native tokens/sec")
    print(f"  OpenAI-normalized speed: {normalized_speed:.0f} OpenAI tokens/sec")
    print(f"  Actual text generation is {(1 - normalized_speed/native_speed)*100:.0f}% slower than native number suggests")

except ImportError:
    print("tiktoken not installed. Run: pip install tiktoken")
# ── 1.7 Blended Pricing Calculator ───────────────────────────────────────────
PROVIDER_PRICING = {
    "GPT-4o":         {"input": 2.50, "output": 10.00},
    "GPT-4o-mini":    {"input": 0.15, "output": 0.60},
    "Claude Sonnet":  {"input": 3.00, "output": 15.00},
    "Claude Haiku":   {"input": 0.25, "output": 1.25},
    "Llama 3.3 70B":  {"input": 0.20, "output": 0.20},  # typical open-weight hosting
    "DeepSeek V3":    {"input": 0.27, "output": 1.10},
}

def blended_price(input_price: float, output_price: float) -> float:
    """Calculate blended price assuming 3:1 input:output ratio."""
    return (3 * input_price + output_price) / 4


print(f"{'Model':<20} {'Input $/1M':>12} {'Output $/1M':>13} {'Blended $/1M':>14}")
print("-" * 62)
for model, prices in sorted(PROVIDER_PRICING.items(), key=lambda x: blended_price(x[1]['input'], x[1]['output'])):
    bp = blended_price(prices['input'], prices['output'])
    print(f"{model:<20} ${prices['input']:>10.2f} ${prices['output']:>11.2f} ${bp:>12.2f}")

Part 2 - API Performance Benchmarking

2.1 How Artificial Analysis Tests API Providers

Testing Infrastructure

  • Test server: virtual machine in GCP us-central1-a

  • Test frequency: 1K & 10K input workloads 8x/day (~every 3 hours); parallel workload (10 concurrent) 1x/day; 100K input workload 1x/week

  • Metric window: rolling 72 hours ("live" metrics)

  • Accounts: mix of anonymous, credited, and provided API keys

Workload Shapes

Workload         Input tokens    Purpose
--------         ------------    -------
Short context    ~1,000          Chatbot-style, quick Q&A
Medium context   ~10,000         Document analysis, code review
Long context     ~100,000        Full codebase, book analysis
Vision           Image + text    Multi-modal inference
Parallel (10x)   10 concurrent   Load handling

What Makes This Methodology Good

  1. Standardized location: running all tests from the same GCP zone eliminates network variance

  2. Regular cadence: 8x/day captures peak and off-peak performance

  3. Rolling window: 72 hours smooths transient spikes

  4. Multiple workloads: short, medium, and long contexts exercise different system paths

  5. Real accounts: tests what customers actually experience

What It Doesn't Capture

  • Geographic variation: only tests from one zone

  • Burst patterns: 10 concurrent is modest; production may see 1000s

  • Model-specific optimizations: can't see if a provider has speculative decoding enabled

  • Cold start latency: serverless endpoints may have spin-up time not captured

2.2 Reasoning Model Considerations

For models like o1, DeepSeek-R1, and Claude with extended thinking:

  • Average Reasoning Tokens is measured across 60 diverse prompts (personal, commercial, coding, math, science)

  • Prompts sourced from: MMLU Pro, AIME 2025, LiveCodeBench

  • Default assumption: 2K reasoning tokens when actual count is unavailable

  • TTFAT (Time to First Answer Token) is reported separately from TTFT

# ── 2.3 Simulating an API Performance Benchmark ──────────────────────────────

def simulate_api_benchmark(n_runs: int = 20, base_ttft_ms: float = 200,
                            base_speed_tps: float = 100, output_tokens: int = 100):
    """Simulate API performance measurements with realistic variance.

    Real API performance varies due to:
    - Server load (queueing)
    - Batch scheduling
    - Network jitter
    - KV cache hits/misses
    """
    results = []
    for i in range(n_runs):
        # Add realistic noise
        ttft = base_ttft_ms * np.random.lognormal(0, 0.3)  # log-normal: right-skewed
        speed = base_speed_tps * np.random.normal(1.0, 0.1)  # normal: symmetric
        speed = max(speed, 10)  # floor

        e2e = ttft + (output_tokens / speed) * 1000
        results.append({"ttft_ms": ttft, "speed_tps": speed, "e2e_ms": e2e})

    return results


# Simulate two providers
provider_a = simulate_api_benchmark(n_runs=50, base_ttft_ms=150, base_speed_tps=120)
provider_b = simulate_api_benchmark(n_runs=50, base_ttft_ms=300, base_speed_tps=90)

def summarize(results, name):
    ttfts = [r["ttft_ms"] for r in results]
    speeds = [r["speed_tps"] for r in results]
    e2es = [r["e2e_ms"] for r in results]
    print(f"\n{'='*55}")
    print(f"  {name}  (n={len(results)})")
    print(f"{'='*55}")
    print(f"  {'Metric':<25} {'Median':>10} {'P5':>10} {'P95':>10}")
    print(f"  {'-'*55}")
    for label, data, unit in [
        ("TTFT", ttfts, "ms"),
        ("Output Speed", speeds, "t/s"),
        ("E2E (100 tokens)", e2es, "ms"),
    ]:
        p5 = np.percentile(data, 5)
        p50 = np.percentile(data, 50)
        p95 = np.percentile(data, 95)
        print(f"  {label:<25} {p50:>10.0f} {p5:>10.0f} {p95:>10.0f}  ({unit})")

summarize(provider_a, "Provider A (fast TTFT, high speed)")
summarize(provider_b, "Provider B (slow TTFT, lower speed)")

Part 3 - Hardware Benchmarking: AA-SLT (System Load Test)

3.1 Overview

The AA-SLT is Artificial Analysis's original hardware benchmark. It measures how a hardware system performs under increasing concurrent load.

Design Principles

  1. Phased concurrency ramp: start with 1 user, scale to saturation

  2. Fixed workload: 1K input tokens, 1K output tokens per query

  3. Sustained load: 3 minutes per phase (excludes ramp-up/cool-down)

  4. Immediate replacement: when a query finishes, a new one starts immediately

  5. Streaming enabled: tokens measured as they arrive

Concurrency Phases

Phase:  1    2    3    4    5    6    7     8      9     ...
Users:  1    2    4    8    16   32   64    128    192   ...
Time:   3m   3m   3m   3m   3m   3m   3m    3m     3m

After 64, increments of 64 until throughput plateaus.

Key Metrics

Metric                    Definition                                            Aggregation
------                    ----------                                            -----------
System Output Throughput  Total output tokens/sec across all concurrent users   Average over phase
Response Rate             Proportion of queries that received at least 1 token  Per phase
E2E Latency per Query     Full response time per query                          Median per phase
Output Speed per Query    Tokens/sec after first token, per query               Median per phase
Throughput Plateau Detection

The benchmark stops when adding more concurrent users no longer increases total system throughput. This indicates the hardware is saturated:

System Throughput (t/s)
  |
  |          ___________  ← plateau (hardware saturated)
  |         /
  |        /
  |       /
  |      /
  |     /
  |    /
  |   /
  |__/
  +---------------------------→ Concurrent Users
  1  2  4  8  16  32  64  128
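One simple way to flag the plateau programmatically is to stop when the relative throughput gain between phases falls below a cutoff. The 5% threshold here is an assumption for illustration, not AA's published stopping rule:

```python
def find_plateau(throughputs: list[float], rel_gain_threshold: float = 0.05) -> int:
    """Return the index of the first phase where throughput stops growing.

    A phase counts as the plateau once the relative gain over the previous
    phase drops below rel_gain_threshold (5% by default, an assumed cutoff).
    """
    for i in range(1, len(throughputs)):
        prev = throughputs[i - 1]
        if prev > 0 and (throughputs[i] - prev) / prev < rel_gain_threshold:
            return i
    return len(throughputs) - 1  # never plateaued within the measured phases


# System throughput per phase for 1, 2, 4, 8, 16, 32, 64, 128 users (illustrative)
phase_throughputs = [95, 180, 340, 610, 1050, 1700, 2600, 2650]
print(f"Plateau reached at phase index {find_plateau(phase_throughputs)}")  # index 7
```

The last phase gains under 2% over the previous one, so the detector flags it as saturated.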

3.2 Current Results (Llama 3.3 70B)

As measured by Artificial Analysis (March 2026):

Accelerator      Peak System Throughput (t/s)   Per-Query Speed (t/s)   Rental $/GPU/hr
-----------      ----------------------------   ---------------------   ---------------
NVIDIA B200      15,400                         144                     n/a
NVIDIA H200      8,470                          97                      $5.00
NVIDIA H100      7,290                          95.9                    $2.70
AMD MI300X       6,730                          85                      $2.00
Google TPU v6e   3,670                          61.3                    $2.70
Key Observations for AMD

  • MI300X throughput is ~92% of H100 at 74% of the cost: competitive on perf/$

  • The per-query speed gap is larger (85 vs 96 t/s): a decode-phase optimization opportunity

  • B200 is 2.3x MI300X in throughput: next-gen AMD (MI350X) needs to close this

  • MI300X's 192GB HBM3 enables larger batch sizes than H100 (80GB): an advantage at high concurrency
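The perf/$ observation can be checked directly from the table numbers (B200 is omitted because no rental price was listed):

```python
# Peak system throughput (t/s) and rental price ($/GPU/hr) from the table above
results = {
    "NVIDIA H200":    (8470, 5.00),
    "NVIDIA H100":    (7290, 2.70),
    "AMD MI300X":     (6730, 2.00),
    "Google TPU v6e": (3670, 2.70),
}

print(f"{'Accelerator':<16} {'Throughput':>12} {'$/hr':>7} {'Tokens per $':>15}")
for name, (tps, price) in results.items():
    tokens_per_dollar = tps * 3600 / price  # output tokens generated per rental dollar
    print(f"{name:<16} {tps:>8} t/s {price:>7.2f} {tokens_per_dollar:>15,.0f}")
```

At these prices MI300X yields roughly 12.1M tokens per rental dollar versus roughly 9.7M for H100, which is the perf/$ edge the first bullet describes.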

# ── 3.3 Simulate AA-SLT Phased Load Test ─────────────────────────────────────
import dataclasses


@dataclasses.dataclass
class SLTPhaseResult:
    concurrent_users: int
    system_throughput_tps: float
    median_per_query_speed_tps: float
    median_e2e_latency_ms: float
    response_rate: float  # 0.0 - 1.0


def simulate_slt_phase(concurrent_users: int, max_throughput: float = 7000,
                        single_user_speed: float = 95) -> SLTPhaseResult:
    """Simulate one phase of an AA-SLT benchmark.

    Models the fundamental trade-off: as concurrency increases,
    per-query speed decreases but total throughput increases (until saturation).

    Args:
        concurrent_users: number of simultaneous queries
        max_throughput: hardware's maximum total output tokens/sec
        single_user_speed: per-query speed with 1 concurrent user
    """
    # Per-query speed degrades as resources are shared
    # Modeled as: speed = single_user_speed / (1 + alpha * log(users))
    alpha = 0.35
    per_query_speed = single_user_speed / (1 + alpha * np.log(max(concurrent_users, 1)))

    # Total throughput = per_query * users, capped at hardware max
    raw_throughput = per_query_speed * concurrent_users
    system_throughput = min(raw_throughput, max_throughput)

    # If throughput-limited, per-query speed is further reduced
    if raw_throughput > max_throughput:
        per_query_speed = max_throughput / concurrent_users

    # E2E latency: TTFT + decode time for 1000 tokens
    ttft_ms = 100 + 20 * np.log(max(concurrent_users, 1))  # queuing delay
    decode_ms = (1000 / per_query_speed) * 1000
    e2e_ms = ttft_ms + decode_ms

    # Response rate degrades at extreme concurrency
    response_rate = min(1.0, 1.0 - max(0, concurrent_users - 128) * 0.001)

    return SLTPhaseResult(
        concurrent_users=concurrent_users,
        system_throughput_tps=system_throughput,
        median_per_query_speed_tps=per_query_speed,
        median_e2e_latency_ms=e2e_ms,
        response_rate=response_rate,
    )


# Run simulated SLT for two "accelerators"
concurrency_levels = [1, 2, 4, 8, 16, 32, 64, 128, 192, 256]

accelerators = {
    "MI300X (simulated)": {"max_throughput": 6730, "single_user_speed": 85},
    "H100 (simulated)":   {"max_throughput": 7290, "single_user_speed": 96},
}

for accel_name, params in accelerators.items():
    print(f"\n{'='*70}")
    print(f"  AA-SLT Results: {accel_name}")
    print(f"{'='*70}")
    print(f"  {'Users':>6} {'Sys Throughput':>16} {'Per-Query Speed':>17} {'E2E Latency':>13} {'Resp Rate':>10}")
    print(f"  {'-'*65}")

    prev_throughput = 0
    for n_users in concurrency_levels:
        r = simulate_slt_phase(n_users, **params)
        plateau = "  << plateau" if r.system_throughput_tps - prev_throughput < 100 and n_users > 1 else ""
        print(f"  {r.concurrent_users:>6} {r.system_throughput_tps:>13.0f} t/s {r.median_per_query_speed_tps:>13.1f} t/s {r.median_e2e_latency_ms:>10.0f} ms {r.response_rate:>9.1%}{plateau}")
        prev_throughput = r.system_throughput_tps

Part 4 - Hardware Benchmarking: AA-AgentPerf

4.1 Why a New Benchmark?

AA-SLT uses uniform 1K/1K workloads. Real agentic use cases are very different:

Property         AA-SLT            AA-AgentPerf
--------         ------            ------------
Input length     Fixed 1K tokens   1K–131K tokens (mean 27K)
Output length    Fixed 1K tokens   Median 150, P95 ~2K
Multi-turn       No                Yes (up to 200 turns)
Workload source  Synthetic         Real coding agent trajectories
Context growth   No                Each turn adds to context
KV cache stress  Minimal           Heavy (long sequences, reuse)
Optimizations    Baseline          Production (speculative decoding, disaggregated prefill/decode)

4.2 Dataset

  • Source: Real coding agent trajectories from public repositories

  • Models used to generate: Top open-source models with reasoning enabled

  • Languages: 12+ programming languages

  • Input Sequence Length (ISL): 1K–131K tokens, mean ~27K

  • Output Sequence Length (OSL): Median ~150 tokens, P95 ~2K

  • Tuning subset: Provided to participants for config validation

  • Test dataset: Kept private to prevent benchmark-targeted optimizations

4.3 Service-Level Objectives (SLOs)

SLOs are derived from real market data, i.e. what providers actually deliver today:

Model                      SLO Tier   P25 Output Speed (t/s)   P95 TTFT (s)
-----                      --------   ----------------------   ------------
DeepSeek V3.2 (Thinking)   #1         30                       15
                           #2         100                      15
                           #3         300                      15
gpt-oss-120b (high)        #1         100                      10
                           #2         250                      10
                           #3         500                      10
                           #4         2,000                    10
P25 Output Speed (not P50 or P5) is used because agentic workloads have many short-output requests. P25 captures the slower quarter without being overly strict.

4.4 Binary Search for Max Users

Instead of linear ramp (like AA-SLT), AgentPerf uses binary search:

Phase 1: 1 user   → 348 t/s → PASS (above 100 t/s SLO)
Phase 2: 2 users  → 310 t/s → PASS
Phase 3: 4 users  → 245 t/s → PASS
Phase 4: 8 users  → 152 t/s → PASS
Phase 5: 16 users → 42 t/s  → FAIL ✗
Phase 6: 12 users → 82 t/s  → FAIL ✗
Phase 7: 10 users → 128 t/s → PASS
Phase 8: 11 users → 108 t/s → PASS ★

Result: Max users = 11 at SLO #1 (100 t/s)

4.5 Result Normalization

Results are normalized to enable fair comparison:

  • Per accelerator: compare chip-to-chip (MI300X vs H100)

  • Per kW TDP: power efficiency, critical for datacenter TCO

  • Per rack: physical space constraints in datacenters

  • Per $/hr: cost efficiency for cloud customers
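A sketch of the four normalizations applied to one throughput number. The TDP, rack-density, and price figures are assumptions for illustration, not AgentPerf's published configurations:

```python
# Assumed system specs for illustration only (not official figures)
systems = {
    "MI300X": {"throughput_tps": 6730, "tdp_kw": 0.75, "gpus_per_rack": 32, "price_hr": 2.00},
    "H100":   {"throughput_tps": 7290, "tdp_kw": 0.70, "gpus_per_rack": 32, "price_hr": 2.70},
}

print(f"{'System':<8} {'t/s per GPU':>12} {'t/s per kW':>11} {'t/s per rack':>13} {'t/s per $/hr':>13}")
for name, s in systems.items():
    per_kw = s["throughput_tps"] / s["tdp_kw"]           # power efficiency
    per_rack = s["throughput_tps"] * s["gpus_per_rack"]  # rack capacity
    per_dollar = s["throughput_tps"] / s["price_hr"]     # cost efficiency
    print(f"{name:<8} {s['throughput_tps']:>12} {per_kw:>11.0f} {per_rack:>13,} {per_dollar:>13.0f}")
```

The point of the exercise: a chip that loses on raw per-GPU throughput can still win once results are divided by power, rack space, or rental price.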

4.6 Production Optimizations Allowed

Unlike many benchmarks, AgentPerf allows real-world optimizations:

  • KV cache reuse: skip recomputing attention for prior turns. Impact: dramatically reduces TTFT in multi-turn sessions.

  • Speculative decoding: a draft model predicts tokens, the main model verifies. Impact: 2-3x decode speedup.

  • Disaggregated prefill/decode: separate hardware pools for each phase. Impact: better utilization.

  • Continuous batching: add new requests to a running batch. Impact: higher throughput.

  • Prefix caching: cache common prompt prefixes. Impact: faster system prompts.

# ── 4.7 Simulate AA-AgentPerf Binary Search ──────────────────────────────────

def simulate_agentperf_phase(n_users: int, base_speed: float = 350,
                              degradation_rate: float = 0.6) -> dict:
    """Simulate AgentPerf at a given concurrency.

    In AgentPerf, performance degrades faster than SLT because:
    - Variable context lengths stress KV cache
    - Long sequences increase memory pressure
    - Multi-turn sessions create scheduling complexity
    """
    # Power-law degradation: speed = base / users^rate
    p25_speed = base_speed / (n_users ** degradation_rate)
    p95_ttft  = 1.0 + 0.5 * n_users  # grows linearly
    return {"users": n_users, "p25_speed": p25_speed, "p95_ttft": p95_ttft}


def binary_search_max_users(slo_speed: float, slo_ttft: float,
                             base_speed: float = 350) -> list[dict]:
    """Find max concurrent users meeting SLO via binary search."""
    phases = []

    # Exponential ramp first
    n = 1
    while n <= 64:
        result = simulate_agentperf_phase(n, base_speed)
        passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
        phases.append({**result, "pass": passed})
        if not passed:
            break
        n *= 2

    if not phases[-1]["pass"]:
        # Binary search between last pass and first fail
        lo = phases[-2]["users"] if len(phases) > 1 else 1
        hi = phases[-1]["users"]
        while hi - lo > 1:
            mid = (lo + hi) // 2
            result = simulate_agentperf_phase(mid, base_speed)
            passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
            phases.append({**result, "pass": passed})
            if passed:
                lo = mid
            else:
                hi = mid

    return phases


# Run for SLO #1: 100 t/s, 10s TTFT
print("=== AA-AgentPerf Binary Search (SLO: 100 t/s, 10s TTFT) ===")
print(f"  {'Phase':>5} {'Users':>6} {'P25 Speed':>12} {'P95 TTFT':>10} {'Result':>8}")
print(f"  {'-'*45}")

phases = binary_search_max_users(slo_speed=100, slo_ttft=10)
max_users = 0
for i, p in enumerate(phases):
    status = "PASS" if p["pass"] else "FAIL"
    if p["pass"]:
        max_users = p["users"]
    print(f"  {i+1:>5} {p['users']:>6} {p['p25_speed']:>9.0f} t/s {p['p95_ttft']:>7.1f} s {status:>8}")

print(f"\n  Max concurrent users at SLO: {max_users}")

# Now run for all SLO tiers
print("\n=== Capacity per SLO Tier ===")
slo_tiers = [
    {"name": "SLO #1 (economy)",   "speed": 30,   "ttft": 15},
    {"name": "SLO #2 (standard)",  "speed": 100,  "ttft": 10},
    {"name": "SLO #3 (premium)",   "speed": 300,  "ttft": 10},
    {"name": "SLO #4 (ultra)",     "speed": 2000, "ttft": 10},
]

print(f"  {'SLO Tier':<25} {'Target Speed':>13} {'Target TTFT':>12} {'Max Users':>10}")
print(f"  {'-'*62}")
for tier in slo_tiers:
    phases = binary_search_max_users(tier["speed"], tier["ttft"])
    max_u = max((p["users"] for p in phases if p["pass"]), default=0)
    print(f"  {tier['name']:<25} {tier['speed']:>10} t/s {tier['ttft']:>9} s {max_u:>10}")

Part 5 - Intelligence Benchmarking

5.1 AA Intelligence Index v4.0

The Intelligence Index aggregates 10 evaluations into a single score (0-100):

Evaluation            What It Tests                                              Source
----------            -------------                                              ------
GDPval-AA             Real-world tasks across 44 occupations, 9 industries       AA + agentic (shell + web access)
τ²-Bench Telecom      Domain-specific (telecom) reasoning                        Academic
Terminal-Bench Hard   Software engineering, sysadmin, game-playing via terminal  Stanford + AA
SciCode               Scientific computing                                       Academic
AA-LCR                Long context recall                                        AA
AA-Omniscience        General knowledge                                          AA
IFBench               Instruction following                                      Academic
Humanity's Last Exam  Extremely hard multi-domain questions                      External
GPQA Diamond          Graduate-level science Q&A                                 Academic
CritPt                Critical thinking & reasoning                              External

Methodology Principles

  • 95% confidence interval < 1%: verified via 10+ repeats on select models

  • Agentic evals (GDPval, Terminal-Bench) give models real tool access

  • ELO ratings are derived from blind pairwise comparisons (GDPval)

  • Sub-indices: Coding Index and Agentic Index (subsets of the 10 evals)

5.2 Why Hardware Engineers Should Care About Intelligence Benchmarks

You might think: "I validate hardware, not model quality." But:

  1. Quantization impact: does INT8/FP8 on your hardware degrade intelligence scores?

  2. Compiler correctness: torch.compile / graph optimizations must not change model outputs

  3. Framework parity: the same model on CUDA vs ROCm should score identically

  4. Numerical precision: your kernel validation (Chapter 2) directly affects these scores

5.3 Other Intelligence Benchmarks

Benchmark         What It Does                       Key Metric         Used By
---------         ------------                       ----------         -------
MMLU / MMLU Pro   Multi-task language understanding  Accuracy %         Everyone
HumanEval / MBPP  Code generation                    Pass@1             Code models
MATH / GSM8K      Math problem solving               Accuracy %         Reasoning models
GPQA              Graduate-level science             Accuracy %         Frontier models
IFEval            Instruction following              Strict accuracy %  Aligned models
LiveCodeBench     Competitive programming (live)     Pass rate          Coding models
SWE-bench         Real GitHub issue resolution       Resolve %          Agentic models
AIME              Math competition problems          Solve rate         Reasoning models

# ── 5.4 Confidence Interval Calculation for Benchmarks ────────────────────────
from scipy import stats as scipy_stats

def benchmark_confidence_interval(scores: list[float], confidence: float = 0.95) -> dict:
    """Calculate confidence interval for a benchmark score.

    This is how AA ensures their Intelligence Index has <1% CI.
    """
    n = len(scores)
    mean = np.mean(scores)
    se = scipy_stats.sem(scores)  # standard error of the mean
    ci = scipy_stats.t.interval(confidence, df=n-1, loc=mean, scale=se)

    return {
        "mean": mean,
        "std": np.std(scores, ddof=1),
        "n_runs": n,
        "ci_low": ci[0],
        "ci_high": ci[1],
        "ci_width": ci[1] - ci[0],
        "ci_width_pct": (ci[1] - ci[0]) / mean * 100,
    }


# Simulate: run an eval 15 times, get slightly different scores each time
np.random.seed(42)
model_a_scores = np.random.normal(52.5, 1.2, size=15)  # mean ~52.5, std ~1.2
model_b_scores = np.random.normal(50.8, 1.5, size=15)  # mean ~50.8, std ~1.5

for name, scores in [("Model A (frontier)", model_a_scores), ("Model B (challenger)", model_b_scores)]:
    ci = benchmark_confidence_interval(scores)
    print(f"\n{name}:")
    print(f"  Mean score    : {ci['mean']:.2f}")
    print(f"  Std dev       : {ci['std']:.2f}")
    print(f"  95% CI        : [{ci['ci_low']:.2f}, {ci['ci_high']:.2f}]")
    print(f"  CI width      : {ci['ci_width']:.2f} ({ci['ci_width_pct']:.1f}% of mean)")
    print(f"  Meets <1% CI? : {'Yes' if ci['ci_width_pct'] < 1.0 else 'No'}")

# Are the two models statistically different?
t_stat, p_value = scipy_stats.ttest_ind(model_a_scores, model_b_scores)
print(f"\n=== Are they statistically different? ===")
print(f"  t-statistic : {t_stat:.3f}")
print(f"  p-value     : {p_value:.4f}")
print(f"  Significant : {'Yes (p < 0.05)' if p_value < 0.05 else 'No (p >= 0.05)'}")

Part 6 - Multi-Modal Benchmarking

Artificial Analysis also benchmarks non-text modalities:

6.1 Modality Coverage

Modality          Models Tested                                 Key Metrics
--------          -------------                                 -----------
Text-to-Image     DALL-E 3, Midjourney, Stable Diffusion, Flux  Quality (human eval), generation time, cost/image
Speech-to-Text    Whisper, Deepgram, AssemblyAI                 Word Error Rate (WER), latency, cost/min
Text-to-Speech    ElevenLabs, OpenAI TTS, Play.ht               Quality (MOS), latency, cost/char
Speech-to-Speech  GPT-4o Voice, Gemini Live                     Latency, naturalness, turn-taking
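Word Error Rate, the headline speech-to-text metric above, is word-level edit distance divided by the reference word count. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)


ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(ref, hyp):.2%}")  # 2 errors / 9 reference words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.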

6.2 Openness Index

AA also rates models on openness, i.e. how much the model creator shares:

  • Weights released: Llama yes; GPT-4 no

  • Training data disclosed: few models fully disclose

  • Architecture published: most open models yes

  • License permissiveness: Apache 2.0 vs custom restrictive licenses

  • Reproducibility: training code, hyperparameters

6.3 Hardware Implications

For hardware validation engineers, multi-modal matters because:

  • Image generation stresses the GPU differently than LLMs (UNet/DiT, high VRAM bandwidth)

  • Speech processing has real-time constraints (<200ms round-trip)

  • Multi-modal models (GPT-4o, Gemini) process images + text in one forward pass, with different memory and compute profiles

Part 7 - Other Industry Benchmarks

Artificial Analysis is one player. Here's the full landscape:

7.1 MLPerf (MLCommons)

The gold standard for hardware benchmarking in ML:

Benchmark

What It Measures

Who Submits

MLPerf Training

Time to train to quality target

NVIDIA, Google, AMD, Intel

MLPerf Inference

Throughput & latency for inference

Hardware vendors, cloud providers

MLPerf HPC

Training on HPC systems

National labs

MLPerf Tiny

Inference on microcontrollers

Embedded vendors

MLPerf Inference Categories

  • Offline: process all samples as fast as possible (batch throughput)

  • Server: process queries arriving at random intervals (latency-bounded)

  • SingleStream: process one query at a time (edge devices)

  • MultiStream: multiple concurrent streams (autonomous driving)
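The Server scenario idea (queries arrive as a Poisson process and must meet a latency bound) can be sketched with a toy single-worker queue. The service time, latency bound, and QPS values here are illustrative, not official MLPerf targets, and MLPerf's real rule is a per-model P99 constraint:

```python
import numpy as np

def server_scenario_qps_check(qps: float, service_time_s: float = 0.05,
                              latency_bound_s: float = 0.1, n_queries: int = 5000,
                              seed: int = 0) -> float:
    """Toy single-worker queue: Poisson arrivals at `qps`, deterministic
    service time, FIFO order. Returns the fraction of queries whose total
    latency (queueing + service) stays within the bound."""
    rng = np.random.default_rng(seed)
    arrivals = np.cumsum(rng.exponential(1.0 / qps, n_queries))
    server_free = 0.0
    within_bound = 0
    for t in arrivals:
        start = max(t, server_free)        # wait if the server is busy
        server_free = start + service_time_s
        if server_free - t <= latency_bound_s:
            within_bound += 1
    return within_bound / n_queries


for qps in [5, 15, 19.5]:
    print(f"{qps:>5} QPS -> {server_scenario_qps_check(qps):.1%} within latency bound")
```

With a 50 ms service time the worker saturates near 20 QPS, so the fraction of queries meeting the bound collapses as the offered rate approaches capacity. That cliff is exactly what the Server scenario's "max QPS under a latency constraint" result captures.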

Relevance to AMD

  • AMD submits MI300X results to MLPerf; your validation work feeds these submissions

  • MLPerf requires verifiable results with auditable configs

  • Results are normalized per chip, per system, and per power

7.2 LMSys Chatbot Arena

  • Method: blind pairwise human evaluation

  • Metric: ELO rating (like chess)

  • Scale: 1M+ human votes

  • Why it matters: the most trusted quality ranking for chatbots

  • Hardware relevance: none directly, but model quality is the end goal

7.3 Open LLM Leaderboard (HuggingFace)

  • Method: automated eval on standard benchmarks

  • Benchmarks: MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K

  • Who uses it: open-source model developers

  • Limitation: prone to benchmark gaming (training on eval data)

  • Hardware relevance: used to verify quantized models don't lose quality

7.4 SPEC ML (Emerging)

  • Organization: Standard Performance Evaluation Corporation

  • Focus: standardized ML workload benchmarking

  • Status: newer, less established than MLPerf

  • Approach: broader workload coverage, including classical ML

7.5 Comparison Matrix

Benchmark         Focus                         Measures Hardware?  Measures Quality?  Who Runs It?         Open?
---------         -----                         ------------------  -----------------  ------------         -----
AA-SLT            HW throughput under load      Yes                 No                 Artificial Analysis  Results public
AA-AgentPerf      HW capacity for agents        Yes                 No                 Artificial Analysis  Results public
AA Intelligence   Model quality                 No                  Yes                Artificial Analysis  Results public
MLPerf Inference  HW throughput (standardized)  Yes                 No                 Vendor submissions   Fully open
MLPerf Training   HW training speed             Yes                 No                 Vendor submissions   Fully open
LMSys Arena       Model quality (human)         No                  Yes                LMSYS                Open
Open LLM Board    Model quality (auto)          No                  Yes                HuggingFace          Open
SPEC ML           HW general ML workloads       Yes                 No                 SPEC                 Membership

# ── 7.6 How Internal Validation Maps to External Benchmarks ──────────────────

mapping = [
    ("Chapter 1: HW Validation",   "Power, thermals, memory",     "MLPerf power submission, AA per-kW normalization"),
    ("Chapter 2: Kernel Val.",      "GEMM correctness, TFLOPS",    "MLPerf Inference (operator-level perf)"),
    ("Chapter 3: Framework Val.",   "ONNX parity, torch.compile",  "AA Intelligence (quantization impact)"),
    ("Chapter 4: Model Perf.",      "Latency, throughput, VRAM",   "AA-SLT per-query speed, AA-AgentPerf"),
    ("Chapter 5: E2E Pipeline",     "Data→Model→Post pipeline",    "AA E2E Response Time"),
    ("Chapter 6: Distributed",      "AllReduce, scaling eff.",     "MLPerf Training (multi-node)"),
    ("Chapter 7: Datacenter",       "K8s, scheduling, monitoring", "AA-AgentPerf (production topology)"),
    ("Chapter 8: Regression",       "Golden baselines, CI/CD",     "MLPerf audit trail, AA 72-hour rolling"),
]

print(f"{'Internal Validation':<30} {'What You Test':<30} {'External Benchmark Equivalent'}")
print("=" * 95)
for internal, tests, external in mapping:
    print(f"{internal:<30} {tests:<30} {external}")

Exercises

  1. Metric Calculation: Given the following streaming response data, calculate TTFT, output speed, and total response time:

    • Request sent: T=0.000s

    • First token: T=0.187s

    • Last token (token #200): T=2.687s

    • What would TTFAT be if 500 reasoning tokens were generated at 150 t/s before the answer?

  2. Token Normalization: A model uses a tokenizer that produces 15% more tokens than OpenAI's for the same text. The model reports 250 native tokens/sec. What is the OpenAI-normalized speed?

  3. SLT Analysis: Using the simulated SLT results above, at what concurrency level does MI300X reach its throughput plateau? What is the efficiency ratio (actual throughput / theoretical throughput) at 64 concurrent users?

  4. AgentPerf SLO Planning: Your MI300X system supports 11 users at SLO #1 (100 t/s). A customer needs to serve 50 concurrent agent users. How many MI300X GPUs do they need? What’s the rental cost at $2.00/GPU/hr?

  5. Benchmark Comparison: Explain why MLPerf Inference uses a "Server" scenario while AA-AgentPerf uses binary search for max users. What different aspect of hardware performance does each approach reveal?

  6. Confidence Intervals: You run a benchmark 5 times and get scores: [48.2, 51.1, 49.7, 50.3, 48.9]. Calculate the 95% CI. Do you need more runs to achieve <1% CI width?

Key Takeaways

  1. TTFT and output speed are the two metrics the industry judges hardware by; your validation work directly feeds these

  2. AA-SLT measures raw hardware throughput with uniform workloads; AA-AgentPerf measures real-world capacity with agentic patterns

  3. Token normalization (to OpenAI tokens) is essential for fair cross-model comparison

  4. SLO-based capacity planning (P25 speed, P95 TTFT) is how production deployments are sized

  5. MLPerf is the hardware industry standard; AA is the customer-facing external validator

  6. Intelligence benchmarks matter for hardware engineers because quantization and compiler optimizations can degrade model quality

  7. Your internal regression baselines (Chapter 8) should track the same metrics external benchmarks measure

Previous: 08_regression_release_validation.ipynb
Back to Overview: README.md