Chapter 9: Industry AI Benchmarking & Performance Analysis
Duration: 3-4 hours | Difficulty: Intermediate-Advanced
Learning Objectives
By the end of this chapter, you will understand:
The standard metrics used to benchmark LLM inference (TTFT, output speed, throughput)
How Artificial Analysis benchmarks API providers and hardware systems
AA-SLT (System Load Test) and AA-AgentPerf methodologies in detail
Intelligence benchmarking: evals, indices, and confidence intervals
Multi-modal benchmarking approaches
Other industry benchmarks: MLPerf, LMSys Arena, Open LLM Leaderboard
How these external benchmarks relate to your internal hardware validation work
Why This Matters for Hardware Validation Engineers
Your internal validation work (Chapters 1-8) produces the numbers that external benchmarks like Artificial Analysis measure. Understanding what gets measured externally tells you:
- What to optimize for – TTFT and output speed are the KPIs customers care about
- How your hardware is judged – MI300X vs H100 vs B200 comparisons are public
- What SLOs to target – real production thresholds for agentic workloads
- Where AMD can differentiate – $/hr, perf/watt, and capacity per rack
| Internal Validation (Ch 1-8) | External Benchmark (This Chapter) |
|---|---|
| GEMM correctness & TFLOPS | Output tokens/sec under load |
| Power draw during sustained ops | Perf/watt normalization |
| Regression baselines | Throughput plateau detection |
| Distributed AllReduce bandwidth | System-level capacity planning |
Part 1 – LLM Performance Metrics: What Gets Measured
1.1 Core Latency & Speed Metrics
Every LLM inference benchmark measures some combination of these:

| Metric | Definition | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Time from request sent to first token received | User-perceived responsiveness |
| TTFAT (Time to First Answer Token) | TTFT + reasoning token time (for reasoning models) | Real wait time before useful output |
| Output Speed | Tokens/sec after first token | Reading speed – how fast text appears |
| System Output Throughput | Total tokens/sec across all concurrent users | Hardware capacity metric |
| E2E Response Time | Total time for complete response | Full request lifecycle |
1.2 The Formulas
Time to First Token
    TTFT = t_first_token - t_request_sent
Time to First Answer Token (reasoning models)
    TTFAT = TTFT + (reasoning_tokens / output_speed)
Output Speed
    Output Speed = (n_tokens - 1) / (t_last_token - t_first_token)
Total Response Time (for 100 output tokens)
    Total Response Time = TTFT + 100 / Output Speed
End-to-End Response Time (reasoning models, 500 answer tokens)
    E2E Response Time = TTFT + (reasoning_tokens / output_speed) + 500 / Output Speed
1.3 Token Normalization
Different models use different tokenizers. Artificial Analysis normalizes everything to OpenAI tokens (using the o200k_base tokenizer via tiktoken) for fair comparison.

| Model | Native tokenizer | Tokens for "Hello, world!" | OpenAI tokens |
|---|---|---|---|
| GPT-4 | o200k_base | 4 | 4 (reference) |
| Llama 3 | SentencePiece | 5 | 4 |
| Claude | Custom BPE | 4 | 4 |

This matters because a model claiming "200 tokens/sec" with a verbose tokenizer might actually be generating less text than one claiming "150 tokens/sec" with a compact tokenizer.
1.4 Pricing Metrics
Blended Price assumes a 3:1 input-to-output token ratio:
    Blended Price = (3 × Input Price + Output Price) / 4
This is expressed as USD per 1M tokens.
# ── 1.5 Metric Calculation Demo ──────────────────────────────────────────────
import time
import numpy as np
def calculate_llm_metrics(token_timestamps: list[float], request_sent: float) -> dict:
"""Calculate standard LLM inference metrics from token arrival timestamps.
Args:
token_timestamps: list of arrival times for each output token
request_sent: timestamp when the request was sent
Returns:
dict with TTFT, output_speed, e2e_latency
"""
ttft = token_timestamps[0] - request_sent
n_tokens = len(token_timestamps)
if n_tokens > 1:
output_speed = (n_tokens - 1) / (token_timestamps[-1] - token_timestamps[0])
else:
output_speed = 0.0
e2e_latency = token_timestamps[-1] - request_sent
return {
"ttft_ms": ttft * 1000,
"output_speed_tps": output_speed,
"e2e_latency_ms": e2e_latency * 1000,
"total_tokens": n_tokens,
"total_response_time_100t_ms": (ttft + 100 / max(output_speed, 1e-9)) * 1000,
}
# Simulate a streaming response: 50 tokens with ~80 t/s speed
np.random.seed(42)
request_t = time.time()
prefill_delay = 0.15 # 150ms TTFT
inter_token = 1.0 / 80 # ~12.5ms per token
timestamps = []
t = request_t + prefill_delay
for _ in range(50):
timestamps.append(t)
t += inter_token * (1 + np.random.normal(0, 0.05)) # 5% jitter
metrics = calculate_llm_metrics(timestamps, request_t)
print("=== Simulated LLM Inference Metrics ===")
for k, v in metrics.items():
if "ms" in k:
print(f" {k:<35} {v:>10.1f} ms")
elif "tps" in k:
print(f" {k:<35} {v:>10.1f} tokens/sec")
else:
print(f" {k:<35} {v:>10}")
# ── 1.6 Token Normalization with tiktoken ────────────────────────────────────
try:
import tiktoken
enc = tiktoken.get_encoding("o200k_base") # OpenAI's standard
test_strings = [
"Hello, world!",
"The quick brown fox jumps over the lazy dog.",
"def fibonacci(n):\n if n <= 1: return n\n return fibonacci(n-1) + fibonacci(n-2)",
"Explain the concept of attention mechanisms in transformer models.",
]
print(f"{'Text (first 50 chars)':<55} {'OpenAI tokens':>15}")
print("-" * 72)
for s in test_strings:
tokens = enc.encode(s)
display = s[:50] + ("..." if len(s) > 50 else "")
print(f"{display:<55} {len(tokens):>15}")
# Why normalization matters
print("\n=== Why normalization matters ===")
text = "Explain quantum computing in simple terms." * 10
openai_tokens = len(enc.encode(text))
# Hypothetical: another tokenizer produces 20% more tokens for same text
native_tokens = int(openai_tokens * 1.2)
native_speed = 200 # tokens/sec in native tokens
normalized_speed = native_speed * (openai_tokens / native_tokens)
print(f" Native tokenizer speed : {native_speed} native tokens/sec")
print(f" OpenAI-normalized speed: {normalized_speed:.0f} OpenAI tokens/sec")
print(f" Actual text generation is {(1 - normalized_speed/native_speed)*100:.0f}% slower than native number suggests")
except ImportError:
print("tiktoken not installed. Run: pip install tiktoken")
# ── 1.7 Blended Pricing Calculator ───────────────────────────────────────────
PROVIDER_PRICING = {
"GPT-4o": {"input": 2.50, "output": 10.00},
"GPT-4o-mini": {"input": 0.15, "output": 0.60},
"Claude Sonnet": {"input": 3.00, "output": 15.00},
"Claude Haiku": {"input": 0.25, "output": 1.25},
"Llama 3.3 70B": {"input": 0.20, "output": 0.20}, # typical open-weight hosting
"DeepSeek V3": {"input": 0.27, "output": 1.10},
}
def blended_price(input_price: float, output_price: float) -> float:
"""Calculate blended price assuming 3:1 input:output ratio."""
return (3 * input_price + output_price) / 4
print(f"{'Model':<20} {'Input $/1M':>12} {'Output $/1M':>13} {'Blended $/1M':>14}")
print("-" * 62)
for model, prices in sorted(PROVIDER_PRICING.items(), key=lambda x: blended_price(x[1]['input'], x[1]['output'])):
bp = blended_price(prices['input'], prices['output'])
print(f"{model:<20} ${prices['input']:>10.2f} ${prices['output']:>11.2f} ${bp:>12.2f}")
Part 2 – API Performance Benchmarking
2.1 How Artificial Analysis Tests API Providers
Testing Infrastructure

| Component | Specification |
|---|---|
| Test server | Virtual machine in GCP |
| Test frequency | 1K & 10K input workloads: 8x/day (~every 3 hours) |
| | Parallel workload (10 concurrent): 1x/day |
| | 100K input workload: 1x/week |
| Metric window | Rolling 72 hours ("live" metrics) |
| Accounts | Mix of anonymous, credited, and provided API keys |

Workload Shapes

| Workload | Input tokens | Purpose |
|---|---|---|
| Short context | ~1,000 | Chatbot-style, quick Q&A |
| Medium context | ~10,000 | Document analysis, code review |
| Long context | ~100,000 | Full codebase, book analysis |
| Vision | Image + text | Multi-modal inference |
| Parallel (10x) | 10 concurrent | Load handling |
What Makes This Methodology Good
- Standardized location – all tests from the same GCP zone eliminates network variance
- Regular cadence – 8x/day captures peak/off-peak performance
- Rolling window – 72 hours smooths transient spikes
- Multiple workloads – short, medium, and long context exercise different system paths
- Real accounts – tests what customers actually experience
What It Doesn't Capture
- Geographic variation – only tests from one zone
- Burst patterns – 10 concurrent is modest; production may see 1000s
- Model-specific optimizations – can't see if a provider has speculative decoding enabled
- Cold start latency – serverless endpoints may have spin-up time not captured
2.2 Reasoning Model Considerations
For models like o1, DeepSeek-R1, and Claude with extended thinking:
- Average Reasoning Tokens is measured across 60 diverse prompts (personal, commercial, coding, math, science)
- Prompts are sourced from MMLU Pro, AIME 2025, and LiveCodeBench
- Default assumption: 2K reasoning tokens when the actual count is unavailable
- TTFAT (Time to First Answer Token) is reported separately from TTFT
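To make the TTFAT adjustment concrete, here is a minimal sketch; the 2,000 reasoning-token default mirrors the assumption above, while the TTFT and speed values are purely illustrative:

```python
# TTFAT sketch for reasoning models. The 2,000 reasoning-token
# default mirrors the assumption above; the speeds are illustrative.
def ttfat_seconds(ttft_s: float, reasoning_tokens: int,
                  output_speed_tps: float) -> float:
    """TTFAT = TTFT + time spent generating reasoning tokens."""
    return ttft_s + reasoning_tokens / output_speed_tps

# 0.3 s TTFT plus 2K reasoning tokens at 150 t/s -> ~13.6 s wait
print(f"TTFAT: {ttfat_seconds(0.3, 2000, 150):.1f} s")
```

Note how the reasoning phase, not the prefill, dominates the wait: this is why TTFAT is reported separately from TTFT.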
# ── 2.3 Simulating an API Performance Benchmark ──────────────────────────────
def simulate_api_benchmark(n_runs: int = 20, base_ttft_ms: float = 200,
base_speed_tps: float = 100, output_tokens: int = 100):
"""Simulate API performance measurements with realistic variance.
Real API performance varies due to:
- Server load (queueing)
- Batch scheduling
- Network jitter
- KV cache hits/misses
"""
results = []
for i in range(n_runs):
# Add realistic noise
ttft = base_ttft_ms * np.random.lognormal(0, 0.3) # log-normal: right-skewed
speed = base_speed_tps * np.random.normal(1.0, 0.1) # normal: symmetric
speed = max(speed, 10) # floor
e2e = ttft + (output_tokens / speed) * 1000
results.append({"ttft_ms": ttft, "speed_tps": speed, "e2e_ms": e2e})
return results
# Simulate two providers
provider_a = simulate_api_benchmark(n_runs=50, base_ttft_ms=150, base_speed_tps=120)
provider_b = simulate_api_benchmark(n_runs=50, base_ttft_ms=300, base_speed_tps=90)
def summarize(results, name):
ttfts = [r["ttft_ms"] for r in results]
speeds = [r["speed_tps"] for r in results]
e2es = [r["e2e_ms"] for r in results]
print(f"\n{'='*55}")
print(f" {name} (n={len(results)})")
print(f"{'='*55}")
print(f" {'Metric':<25} {'Median':>10} {'P5':>10} {'P95':>10}")
print(f" {'-'*55}")
for label, data, unit in [
("TTFT", ttfts, "ms"),
("Output Speed", speeds, "t/s"),
("E2E (100 tokens)", e2es, "ms"),
]:
p5 = np.percentile(data, 5)
p50 = np.percentile(data, 50)
p95 = np.percentile(data, 95)
print(f" {label:<25} {p50:>8.0f} {unit} {p5:>8.0f} {p95:>8.0f}")
summarize(provider_a, "Provider A (fast TTFT, high speed)")
summarize(provider_b, "Provider B (slow TTFT, lower speed)")
Part 3 – Hardware Benchmarking: AA-SLT (System Load Test)
3.1 Overview
The AA-SLT is Artificial Analysis's original hardware benchmark. It measures how a hardware system performs under increasing concurrent load.
Design Principles
- Phased concurrency ramp – start with 1 user, scale to saturation
- Fixed workload – 1K input tokens, 1K output tokens per query
- Sustained load – 3 minutes per phase (excluding ramp-up/cool-down)
- Immediate replacement – when a query finishes, a new one starts immediately
- Streaming enabled – tokens are measured as they arrive
Concurrency Phases
Phase: 1 2 3 4 5 6 7 8 9 ...
Users: 1 2 4 8 16 32 64 128 192 ...
Time: 3m 3m 3m 3m 3m 3m 3m 3m 3m
After 64 users, concurrency increases in increments of 64 until throughput plateaus.
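The ramp schedule described above can be generated with a small sketch; the 320-user cap is an illustrative stand-in, since the real benchmark stops at the throughput plateau rather than a fixed count:

```python
# Generator for the AA-SLT concurrency schedule: doubling up to 64
# users, then steps of 64. The 320-user cap here is illustrative;
# the real benchmark stops when throughput plateaus.
def slt_schedule(cap: int = 320):
    n = 1
    while n < 64:
        yield n
        n *= 2
    while n <= cap:
        yield n
        n += 64

print(list(slt_schedule()))  # [1, 2, 4, 8, 16, 32, 64, 128, 192, 256, 320]
```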
Key MetricsΒΆ
Metric |
Definition |
Aggregation |
|---|---|---|
System Output Throughput |
Total output tokens/sec across all concurrent users |
Average over phase |
Response Rate |
Proportion of queries that received at least 1 token |
Per phase |
E2E Latency per Query |
Full response time per query |
Median per phase |
Output Speed per Query |
Tokens/sec after first token, per query |
Median per phase |
Throughput Plateau Detection
The benchmark stops when adding more concurrent users no longer increases total system throughput, indicating the hardware is saturated:

System Throughput (t/s)
|
|            ___________   ← plateau (hardware saturated)
|           /
|          /
|        /
|       /
|     /
|   /
|  /
|_/
+---------------------------→ Concurrent Users
  1   2   4   8  16  32  64 128
3.2 Current Results (Llama 3.3 70B)
As measured by Artificial Analysis (March 2026):

| Accelerator | Peak System Throughput (t/s) | Per-Query Speed (t/s) | Rental $/GPU/hr |
|---|---|---|---|
| NVIDIA B200 | 15,400 | 144 | – |
| NVIDIA H200 | 8,470 | 97 | $5.00 |
| NVIDIA H100 | 7,290 | 95.9 | $2.70 |
| AMD MI300X | 6,730 | 85 | $2.00 |
| Google TPU v6e | 3,670 | 61.3 | $2.70 |

Key Observations for AMD
- MI300X throughput is ~92% of H100's at 74% of the cost – competitive on perf/$
- The per-query speed gap is larger (85 vs ~96 t/s) – a decode-phase optimization opportunity
- B200 delivers 2.3x MI300X throughput – next-gen AMD (MI350X) needs to close this gap
- MI300X's 192GB of HBM3 enables larger batch sizes than H100 (80GB) – an advantage at high concurrency
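The perf/$ claim above can be checked directly from the table's figures (B200 is omitted because no rental price is listed; these are the table's numbers, not new measurements):

```python
# Perf-per-dollar from the results table above. B200 is omitted:
# no rental price was reported. Table figures only, no new data.
results = {
    "H200":    {"throughput_tps": 8470, "price_hr": 5.00},
    "H100":    {"throughput_tps": 7290, "price_hr": 2.70},
    "MI300X":  {"throughput_tps": 6730, "price_hr": 2.00},
    "TPU v6e": {"throughput_tps": 3670, "price_hr": 2.70},
}
ranked = sorted(results.items(),
                key=lambda kv: kv[1]["throughput_tps"] / kv[1]["price_hr"],
                reverse=True)
for name, r in ranked:
    print(f"{name:<8} {r['throughput_tps'] / r['price_hr']:>7.0f} output t/s per $/hr")
```

MI300X comes out on top of this ranking despite its lower absolute throughput, which is the perf/$ story in the observations above.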
# ── 3.3 Simulate AA-SLT Phased Load Test ─────────────────────────────────────
import dataclasses
@dataclasses.dataclass
class SLTPhaseResult:
concurrent_users: int
system_throughput_tps: float
median_per_query_speed_tps: float
median_e2e_latency_ms: float
response_rate: float # 0.0 - 1.0
def simulate_slt_phase(concurrent_users: int, max_throughput: float = 7000,
single_user_speed: float = 95) -> SLTPhaseResult:
"""Simulate one phase of an AA-SLT benchmark.
Models the fundamental trade-off: as concurrency increases,
per-query speed decreases but total throughput increases (until saturation).
Args:
concurrent_users: number of simultaneous queries
max_throughput: hardware's maximum total output tokens/sec
single_user_speed: per-query speed with 1 concurrent user
"""
# Per-query speed degrades as resources are shared
# Modeled as: speed = single_user_speed / (1 + alpha * log(users))
alpha = 0.35
per_query_speed = single_user_speed / (1 + alpha * np.log(max(concurrent_users, 1)))
# Total throughput = per_query * users, capped at hardware max
raw_throughput = per_query_speed * concurrent_users
system_throughput = min(raw_throughput, max_throughput)
# If throughput-limited, per-query speed is further reduced
if raw_throughput > max_throughput:
per_query_speed = max_throughput / concurrent_users
# E2E latency: TTFT + decode time for 1000 tokens
ttft_ms = 100 + 20 * np.log(max(concurrent_users, 1)) # queuing delay
decode_ms = (1000 / per_query_speed) * 1000
e2e_ms = ttft_ms + decode_ms
# Response rate degrades at extreme concurrency
response_rate = min(1.0, 1.0 - max(0, concurrent_users - 128) * 0.001)
return SLTPhaseResult(
concurrent_users=concurrent_users,
system_throughput_tps=system_throughput,
median_per_query_speed_tps=per_query_speed,
median_e2e_latency_ms=e2e_ms,
response_rate=response_rate,
)
# Run simulated SLT for two "accelerators"
concurrency_levels = [1, 2, 4, 8, 16, 32, 64, 128, 192, 256]
accelerators = {
"MI300X (simulated)": {"max_throughput": 6730, "single_user_speed": 85},
"H100 (simulated)": {"max_throughput": 7290, "single_user_speed": 96},
}
for accel_name, params in accelerators.items():
print(f"\n{'='*70}")
print(f" AA-SLT Results: {accel_name}")
print(f"{'='*70}")
print(f" {'Users':>6} {'Sys Throughput':>16} {'Per-Query Speed':>17} {'E2E Latency':>13} {'Resp Rate':>10}")
print(f" {'-'*65}")
prev_throughput = 0
for n_users in concurrency_levels:
r = simulate_slt_phase(n_users, **params)
plateau = " << plateau" if r.system_throughput_tps - prev_throughput < 100 and n_users > 1 else ""
print(f" {r.concurrent_users:>6} {r.system_throughput_tps:>13.0f} t/s {r.median_per_query_speed_tps:>13.1f} t/s {r.median_e2e_latency_ms:>10.0f} ms {r.response_rate:>9.1%}{plateau}")
prev_throughput = r.system_throughput_tps
Part 4 – Hardware Benchmarking: AA-AgentPerf
4.1 Why a New Benchmark?
AA-SLT uses uniform 1K/1K workloads. Real agentic use cases are very different:

| Property | AA-SLT | AA-AgentPerf |
|---|---|---|
| Input length | Fixed 1K tokens | 1K–131K tokens (mean 27K) |
| Output length | Fixed 1K tokens | Median 150, P95 ~2K |
| Multi-turn | No | Yes (up to 200 turns) |
| Workload source | Synthetic | Real coding agent trajectories |
| Context growth | No | Each turn adds to context |
| KV cache stress | Minimal | Heavy (long sequences, reuse) |
| Optimizations | Baseline | Production (speculative decoding, disaggregated prefill/decode) |
4.2 DatasetΒΆ
Source: Real coding agent trajectories from public repositories
Models used to generate: Top open-source models with reasoning enabled
Languages: 12+ programming languages
Input Sequence Length (ISL): 1Kβ131K tokens, mean ~27K
Output Sequence Length (OSL): Median ~150 tokens, P95 ~2K
Tuning subset: Provided to participants for config validation
Test dataset: Kept private to prevent benchmark-targeted optimizations
4.3 Service-Level Objectives (SLOs)
SLOs are derived from real market data – what providers actually deliver today:

| Model | SLO Tier | P25 Output Speed (t/s) | P95 TTFT (s) |
|---|---|---|---|
| DeepSeek V3.2 (Thinking) | #1 | 30 | 15 |
| | #2 | 100 | 15 |
| | #3 | 300 | 15 |
| gpt-oss-120b (high) | #1 | 100 | 10 |
| | #2 | 250 | 10 |
| | #3 | 500 | 10 |
| | #4 | 2,000 | 10 |

P25 Output Speed (not P50 or P5) is used because agentic workloads have many short-output requests. P25 captures the slower quarter without being overly strict.
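A small sketch of why P25 is a sensible percentile for a right-skewed speed distribution; the log-normal shape below is synthetic and purely illustrative:

```python
import numpy as np

# P25 vs P50 vs P5 on a right-skewed per-request speed distribution.
# The log-normal shape below is synthetic, purely for illustration.
rng = np.random.default_rng(0)
speeds = 300.0 / rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

for q in (5, 25, 50):
    print(f"P{q:<2}: {np.percentile(speeds, q):6.0f} t/s")
# P25 tracks the slower quarter of requests; P5 would anchor the SLO
# on rare outliers, while P50 would ignore the slow tail entirely.
```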
4.4 Binary Search for Max Users
Instead of a linear ramp (like AA-SLT), AgentPerf uses binary search:
Phase 1: 1 user   → 348 t/s → PASS (above 100 t/s SLO)
Phase 2: 2 users  → 310 t/s → PASS
Phase 3: 4 users  → 245 t/s → PASS
Phase 4: 8 users  → 152 t/s → PASS
Phase 5: 16 users → 42 t/s  → FAIL
Phase 6: 12 users → 82 t/s  → FAIL
Phase 7: 10 users → 128 t/s → PASS
Phase 8: 11 users → 108 t/s → PASS
Result: max users = 11 at SLO #1 (100 t/s)
4.5 Result Normalization
Results are normalized to enable fair comparison:

| Normalization | Why It Matters |
|---|---|
| Per accelerator | Compare chip-to-chip (MI300X vs H100) |
| Per kW TDP | Power efficiency – critical for datacenter TCO |
| Per rack | Physical space constraints in datacenters |
| Per $/hr | Cost efficiency for cloud customers |
4.6 Production Optimizations Allowed
Unlike many benchmarks, AgentPerf allows real-world optimizations:

| Optimization | What It Does | Impact |
|---|---|---|
| KV cache reuse | Skip recomputing attention for prior turns | Dramatically reduces TTFT in multi-turn sessions |
| Speculative decoding | Draft model predicts tokens, main model verifies | 2-3x decode speedup |
| Disaggregated prefill/decode | Separate hardware pools for each phase | Better utilization |
| Continuous batching | Add new requests to a running batch | Higher throughput |
| Prefix caching | Cache common prompt prefixes | Faster system prompts |
# ── 4.7 Simulate AA-AgentPerf Binary Search ──────────────────────────────────
def simulate_agentperf_phase(n_users: int, base_speed: float = 350,
degradation_rate: float = 0.6) -> dict:
"""Simulate AgentPerf at a given concurrency.
In AgentPerf, performance degrades faster than SLT because:
- Variable context lengths stress KV cache
- Long sequences increase memory pressure
- Multi-turn sessions create scheduling complexity
"""
# Power-law degradation: speed = base / users^rate
p25_speed = base_speed / (n_users ** degradation_rate)
p95_ttft = 1.0 + 0.5 * n_users # grows linearly
return {"users": n_users, "p25_speed": p25_speed, "p95_ttft": p95_ttft}
def binary_search_max_users(slo_speed: float, slo_ttft: float,
base_speed: float = 350) -> list[dict]:
"""Find max concurrent users meeting SLO via binary search."""
phases = []
# Exponential ramp first
n = 1
while n <= 64:
result = simulate_agentperf_phase(n, base_speed)
passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
phases.append({**result, "pass": passed})
if not passed:
break
n *= 2
if not phases[-1]["pass"]:
# Binary search between last pass and first fail
lo = phases[-2]["users"] if len(phases) > 1 else 1
hi = phases[-1]["users"]
while hi - lo > 1:
mid = (lo + hi) // 2
result = simulate_agentperf_phase(mid, base_speed)
passed = result["p25_speed"] >= slo_speed and result["p95_ttft"] <= slo_ttft
phases.append({**result, "pass": passed})
if passed:
lo = mid
else:
hi = mid
return phases
# Run for SLO #1: 100 t/s, 10s TTFT
print("=== AA-AgentPerf Binary Search (SLO: 100 t/s, 10s TTFT) ===")
print(f" {'Phase':>5} {'Users':>6} {'P25 Speed':>12} {'P95 TTFT':>10} {'Result':>8}")
print(f" {'-'*45}")
phases = binary_search_max_users(slo_speed=100, slo_ttft=10)
max_users = 0
for i, p in enumerate(phases):
status = "PASS" if p["pass"] else "FAIL"
marker = "✓" if p["pass"] else "✗"
if p["pass"]:
max_users = p["users"]
print(f" {i+1:>5} {p['users']:>6} {p['p25_speed']:>9.0f} t/s {p['p95_ttft']:>7.1f} s {status:>8} {marker}")
print(f"\n Max concurrent users at SLO: {max_users}")
# Now run for all SLO tiers
print("\n=== Capacity per SLO Tier ===")
slo_tiers = [
{"name": "SLO #1 (economy)", "speed": 30, "ttft": 15},
{"name": "SLO #2 (standard)", "speed": 100, "ttft": 10},
{"name": "SLO #3 (premium)", "speed": 300, "ttft": 10},
{"name": "SLO #4 (ultra)", "speed": 2000, "ttft": 10},
]
print(f" {'SLO Tier':<25} {'Target Speed':>13} {'Target TTFT':>12} {'Max Users':>10}")
print(f" {'-'*62}")
for tier in slo_tiers:
phases = binary_search_max_users(tier["speed"], tier["ttft"])
max_u = max((p["users"] for p in phases if p["pass"]), default=0)
print(f" {tier['name']:<25} {tier['speed']:>10} t/s {tier['ttft']:>9} s {max_u:>10}")
Part 5 – Intelligence Benchmarking
5.1 AA Intelligence Index v4.0
The Intelligence Index aggregates 10 evaluations into a single score (0-100):

| Evaluation | What It Tests | Source |
|---|---|---|
| GDPval-AA | Real-world tasks across 44 occupations, 9 industries | AA + agentic (shell + web access) |
| τ²-Bench Telecom | Domain-specific (telecom) reasoning | Academic |
| Terminal-Bench Hard | Software engineering, sysadmin, game-playing via terminal | Stanford + AA |
| SciCode | Scientific computing | Academic |
| AA-LCR | Long context recall | AA |
| AA-Omniscience | General knowledge | AA |
| IFBench | Instruction following | Academic |
| Humanity's Last Exam | Extremely hard multi-domain questions | External |
| GPQA Diamond | Graduate-level science Q&A | Academic |
| CritPt | Critical thinking & reasoning | External |

Methodology Principles
- 95% confidence interval < 1% – verified via 10+ repeats on select models
- Agentic evals (GDPval, Terminal-Bench) give models real tool access
- ELO ratings are derived from blind pairwise comparisons (GDPval)
- Sub-indices: a Coding Index and an Agentic Index (subsets of the 10 evals)
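For reference, the pairwise-comparison ratings mentioned above work like the textbook ELO update below; the K-factor and procedure here are the standard chess formulation, and AA's exact implementation may differ:

```python
# Textbook ELO update for one blind pairwise comparison.
# Standard chess formulation; AA's exact procedure may differ.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equal-rated models: a win moves the winner up by k/2
print(elo_update(1500, 1500, 1.0))  # -> (1516.0, 1484.0)
```

Because the expected score depends on the rating gap, an upset win moves ratings much more than a win by the favorite, which is what lets a few thousand blind comparisons converge to a stable ranking.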
5.2 Why Hardware Engineers Should Care About Intelligence Benchmarks
You might think: "I validate hardware, not model quality." But:
- Quantization impact – does INT8/FP8 on your hardware degrade intelligence scores?
- Compiler correctness – torch.compile / graph optimizations must not change model outputs
- Framework parity – the same model on CUDA vs ROCm should score identically
- Numerical precision – your kernel validation (Chapter 2) directly affects these scores
5.3 Other Intelligence Benchmarks

| Benchmark | What It Does | Key Metric | Used By |
|---|---|---|---|
| MMLU / MMLU Pro | Multi-task language understanding | Accuracy % | Everyone |
| HumanEval / MBPP | Code generation | Pass@1 | Code models |
| MATH / GSM8K | Math problem solving | Accuracy % | Reasoning models |
| GPQA | Graduate-level science | Accuracy % | Frontier models |
| IFEval | Instruction following | Strict accuracy % | Aligned models |
| LiveCodeBench | Competitive programming (live) | Pass rate | Coding models |
| SWE-bench | Real GitHub issue resolution | Resolve % | Agentic models |
| AIME | Math competition problems | Solve rate | Reasoning models |
# ── 5.4 Confidence Interval Calculation for Benchmarks ───────────────────────
from scipy import stats as scipy_stats
def benchmark_confidence_interval(scores: list[float], confidence: float = 0.95) -> dict:
"""Calculate confidence interval for a benchmark score.
This is how AA ensures their Intelligence Index has <1% CI.
"""
n = len(scores)
mean = np.mean(scores)
se = scipy_stats.sem(scores) # standard error of the mean
ci = scipy_stats.t.interval(confidence, df=n-1, loc=mean, scale=se)
return {
"mean": mean,
"std": np.std(scores, ddof=1),
"n_runs": n,
"ci_low": ci[0],
"ci_high": ci[1],
"ci_width": ci[1] - ci[0],
"ci_width_pct": (ci[1] - ci[0]) / mean * 100,
}
# Simulate: run an eval 15 times, get slightly different scores each time
np.random.seed(42)
model_a_scores = np.random.normal(52.5, 1.2, size=15) # mean ~52.5, std ~1.2
model_b_scores = np.random.normal(50.8, 1.5, size=15) # mean ~50.8, std ~1.5
for name, scores in [("Model A (frontier)", model_a_scores), ("Model B (challenger)", model_b_scores)]:
ci = benchmark_confidence_interval(scores)
print(f"\n{name}:")
print(f" Mean score : {ci['mean']:.2f}")
print(f" Std dev : {ci['std']:.2f}")
print(f" 95% CI : [{ci['ci_low']:.2f}, {ci['ci_high']:.2f}]")
print(f" CI width : {ci['ci_width']:.2f} ({ci['ci_width_pct']:.1f}% of mean)")
print(f" Meets <1% CI? : {'Yes' if ci['ci_width_pct'] < 1.0 else 'No'}")
# Are the two models statistically different?
t_stat, p_value = scipy_stats.ttest_ind(model_a_scores, model_b_scores)
print(f"\n=== Are they statistically different? ===")
print(f" t-statistic : {t_stat:.3f}")
print(f" p-value : {p_value:.4f}")
print(f" Significant : {'Yes (p < 0.05)' if p_value < 0.05 else 'No (p >= 0.05)'}")
Part 6 – Multi-Modal Benchmarking
Artificial Analysis also benchmarks non-text modalities:
6.1 Modality Coverage

| Modality | Models Tested | Key Metrics |
|---|---|---|
| Text-to-Image | DALL-E 3, Midjourney, Stable Diffusion, Flux | Quality (human eval), generation time, cost/image |
| Speech-to-Text | Whisper, Deepgram, AssemblyAI | Word Error Rate (WER), latency, cost/min |
| Text-to-Speech | ElevenLabs, OpenAI TTS, Play.ht | Quality (MOS), latency, cost/char |
| Speech-to-Speech | GPT-4o Voice, Gemini Live | Latency, naturalness, turn-taking |
6.2 Openness Index
AA also rates models on openness – how much the model creator shares:

| Factor | Examples |
|---|---|
| Weights released | Llama: yes; GPT-4: no |
| Training data disclosed | Few models fully disclose |
| Architecture published | Most open models: yes |
| License permissiveness | Apache 2.0 vs custom restrictive licenses |
| Reproducibility | Training code, hyperparameters |
6.3 Hardware Implications
For hardware validation engineers, multi-modal matters because:
- Image generation stresses the GPU differently than LLMs (UNet/DiT, high VRAM bandwidth)
- Speech processing has real-time constraints (<200ms round-trip)
- Multi-modal models (GPT-4o, Gemini) process images and text in one forward pass – different memory and compute profiles
Part 7 – Other Industry Benchmarks
Artificial Analysis is one player. Here's the broader landscape:
7.1 MLPerf (MLCommons)
The gold standard for ML hardware benchmarking:

| Benchmark | What It Measures | Who Submits |
|---|---|---|
| MLPerf Training | Time to train to a quality target | NVIDIA, Google, AMD, Intel |
| MLPerf Inference | Throughput & latency for inference | Hardware vendors, cloud providers |
| MLPerf HPC | Training on HPC systems | National labs |
| MLPerf Tiny | Inference on microcontrollers | Embedded vendors |
MLPerf Inference Categories

| Scenario | Description |
|---|---|
| Offline | Process all samples ASAP (batch throughput) |
| Server | Process queries arriving at random intervals (latency-bounded) |
| SingleStream | Process one query at a time (edge devices) |
| MultiStream | Multiple concurrent streams (autonomous driving) |
Relevance to AMD
- AMD submits MI300X results to MLPerf – your validation work feeds these submissions
- MLPerf requires verifiable results with auditable configs
- Results are normalized per chip, per system, and per watt
7.2 LMSys Chatbot Arena

| Property | Description |
|---|---|
| Method | Blind pairwise human evaluation |
| Metric | ELO rating (as in chess) |
| Scale | 1M+ human votes |
| Why it matters | The most trusted quality ranking for chatbots |
| Hardware relevance | None directly – but model quality is the end goal |
7.3 Open LLM Leaderboard (HuggingFace)

| Property | Description |
|---|---|
| Method | Automated evaluation on standard benchmarks |
| Benchmarks | MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K |
| Who uses it | Open-source model developers |
| Limitation | Prone to benchmark gaming (training on eval data) |
| Hardware relevance | Used to verify quantized models don't lose quality |
7.4 SPEC ML (Emerging)

| Property | Description |
|---|---|
| Organization | Standard Performance Evaluation Corporation |
| Focus | Standardized ML workload benchmarking |
| Status | Newer and less established than MLPerf |
| Approach | Broader workload coverage, including classical ML |
7.5 Comparison Matrix

| Benchmark | Focus | Measures Hardware? | Measures Quality? | Who Runs It? | Open? |
|---|---|---|---|---|---|
| AA-SLT | HW throughput under load | Yes | No | Artificial Analysis | Results public |
| AA-AgentPerf | HW capacity for agents | Yes | No | Artificial Analysis | Results public |
| AA Intelligence | Model quality | No | Yes | Artificial Analysis | Results public |
| MLPerf Inference | HW throughput (standardized) | Yes | No | Vendor submissions | Fully open |
| MLPerf Training | HW training speed | Yes | No | Vendor submissions | Fully open |
| LMSys Arena | Model quality (human) | No | Yes | LMSYS | Open |
| Open LLM Board | Model quality (auto) | No | Yes | HuggingFace | Open |
| SPEC ML | HW general ML workloads | Yes | No | SPEC | Membership |
# ── 7.6 How Internal Validation Maps to External Benchmarks ──────────────────
mapping = [
("Chapter 1: HW Validation", "Power, thermals, memory", "MLPerf power submission, AA per-kW normalization"),
("Chapter 2: Kernel Val.", "GEMM correctness, TFLOPS", "MLPerf Inference (operator-level perf)"),
("Chapter 3: Framework Val.", "ONNX parity, torch.compile", "AA Intelligence (quantization impact)"),
("Chapter 4: Model Perf.", "Latency, throughput, VRAM", "AA-SLT per-query speed, AA-AgentPerf"),
("Chapter 5: E2E Pipeline", "Data→Model→Post pipeline", "AA E2E Response Time"),
("Chapter 6: Distributed", "AllReduce, scaling eff.", "MLPerf Training (multi-node)"),
("Chapter 7: Datacenter", "K8s, scheduling, monitoring", "AA-AgentPerf (production topology)"),
("Chapter 8: Regression", "Golden baselines, CI/CD", "MLPerf audit trail, AA 72-hour rolling"),
]
print(f"{'Internal Validation':<30} {'What You Test':<30} {'External Benchmark Equivalent'}")
print("=" * 95)
for internal, tests, external in mapping:
print(f"{internal:<30} {tests:<30} {external}")
ExercisesΒΆ
Metric Calculation: Given the following streaming response data, calculate TTFT, output speed, and total response time:
Request sent: T=0.000s
First token: T=0.187s
Last token (token #200): T=2.687s
What would TTFAT be if 500 reasoning tokens were generated at 150 t/s before the answer?
Token Normalization: A model uses a tokenizer that produces 15% more tokens than OpenAIβs for the same text. The model reports 250 native tokens/sec. What is the OpenAI-normalized speed?
SLT Analysis: Using the simulated SLT results above, at what concurrency level does MI300X reach its throughput plateau? What is the efficiency ratio (actual throughput / theoretical throughput) at 64 concurrent users?
AgentPerf SLO Planning: Your MI300X system supports 11 users at SLO #1 (100 t/s). A customer needs to serve 50 concurrent agent users. How many MI300X GPUs do they need? Whatβs the rental cost at $2.00/GPU/hr?
Benchmark Comparison: Explain why MLPerf Inference uses a βServerβ scenario while AA-AgentPerf uses binary search for max users. What different aspect of hardware performance does each approach reveal?
Confidence Intervals: You run a benchmark 5 times and get scores: [48.2, 51.1, 49.7, 50.3, 48.9]. Calculate the 95% CI. Do you need more runs to achieve <1% CI width?
Key TakeawaysΒΆ
TTFT and output speed are the two metrics the industry judges hardware by β your validation work directly feeds these
AA-SLT measures raw hardware throughput with uniform workloads; AA-AgentPerf measures real-world capacity with agentic patterns
Token normalization (to OpenAI tokens) is essential for fair cross-model comparison
SLO-based capacity planning (P25 speed, P95 TTFT) is how production deployments are sized
MLPerf is the hardware industry standard; AA is the customer-facing external validator
Intelligence benchmarks matter for hardware engineers because quantization and compiler optimizations can degrade model quality
Your internal regression baselines (Chapter 8) should track the same metrics external benchmarks measure
Previous: 08_regression_release_validation.ipynb
Back to Overview: README.md