Capability Benchmarks Ledger · v0.2 · 2026-05-08

469 papers.
19 bills.
Empty space holding.

469-paper ledger for frontier-LLM benchmarks. Four signature-empty bills hold; source-linted public handles stay clickable, and unverifiable audit rows are quarantined.

469 unique papers · 19 bills · 4 ★ empty bills · classifier 60/60 at 1.000/1.000

Quick Orientation

When OpenAI or Google claims a new AI 'aced the exam,' which scores actually mean something?


Every frontier AI release comes with benchmark scores: FrontierMath, ARC-AGI, HLE, GPQA, MMLU. Many of those headlines collapse within months when independent groups re-test. For example, o3 scored 87.5% on ARC-AGI in its high-compute configuration in December 2024 (75.7% within the standard compute limit), then 5-10% on the v2 test three months later. We surveyed 469 papers from 2024-2026 to map every claim and audit. No frontier-AI capability claim has yet survived all six basic checks (training-data contamination, harness engineering, tool sneaking, format brittleness, selection bias, third-party reproduction) on any major benchmark. Independent citation verification is still in progress.

Why it matters: Capability scores drive product launches, government policy, and safety roadmaps. If the scores don't hold, the policies built on them don't either.

What we found: 469 papers checked. Four predicted-empty lines hold: no frontier model passes all six audits on any single benchmark. Median vendor-claim half-life is short.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-08

Four signature constructions.
Four hundred sixty-nine papers.
Empty space holding.

§01

The nineteen-bill closure pattern

Bills are the closure mechanisms a benchmark claim must clear. Every paper maps to one or more bills, a meta-cost, or an escape gate.

How to read this heatmap: cells show candidate papers. A starred bill is ★ empty only when candidate count is nonzero but clean triggers remain zero after meta-cost, rebuttal, leakage, non-transfer, or escape-gate review. The closure basis appears below.

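[Heatmap: candidate papers per bill. Starred cells: Bill 7 ★ (1 candidate, empty) · Bill 11 ★ (24 candidates, empty) · Bill 14 ★ (21 candidates, empty) · Bill 17 ★ (27 candidates, empty).]

Legend: ★ Predicted empty (HOLDING) · NEW v0.2 (Bills 18-19) · Dominant (≥50 papers) · High activity (≥30 papers) · Active (10–29 papers)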

★ Empty-space verification

Bill · Closure basis · Cands. · Clean

★ 7 · Frontier-LLM survives all six audits
Closure basis: internal 0/73 and 32/47 audit rows are quarantined pending public-source verification; verified public anchors below cover contamination, format brittleness, held-out refresh, and independent evaluation
Candidates: 1 · Clean triggers: 0

★ 11 · Saturation pattern audit
Closure basis: Yang-Tatsunori 2024 16-month mean saturation cycle; LiveBench refresh quarterly shows 34.8% absolute gap vs fixed MMLU = saturation = memorization
Candidates: 24 · Clean triggers: 0

★ 14 · Cross-benchmark transfer
Closure basis: Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78 (textbook violation); MATH↔FrontierMath r=0.18 (catastrophic)
Candidates: 21 · Clean triggers: 0

★ 17 · ARC-AGI / FrontierMath / HLE / GPQA-Diamond held-out frontier audit
Closure basis: ARC-AGI v1→v2 reframing: o3 75.7% → 5-10% on v2 in 3 months; FrontierMath o3 25.2% → 14.5% on Epoch held-out audit; HLE CAIS 8.5% inflation
Candidates: 27 · Clean triggers: 0
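The starred-empty rule stated in the heatmap note can be written as a small check. This is a minimal sketch, not the ledger's own code: `BillRow` and `is_starred_empty` are hypothetical names, and the counts are copied from the verification table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillRow:
    """One row of the empty-space verification table (hypothetical structure)."""
    bill: int
    candidates: int      # candidate papers mapped to the bill
    clean_triggers: int  # candidates that survive meta-cost, rebuttal, leakage,
                         # non-transfer, and escape-gate review


def is_starred_empty(row: BillRow) -> bool:
    # "Empty" requires that candidates exist but none survives review;
    # a bill with zero candidates is merely unpopulated, not empty.
    return row.candidates > 0 and row.clean_triggers == 0


# Counts as stated in the verification table.
ROWS = [BillRow(7, 1, 0), BillRow(11, 24, 0), BillRow(14, 21, 0), BillRow(17, 27, 0)]
holding = [row.bill for row in ROWS if is_starred_empty(row)]
print(holding)
```

Separating "no candidates" from "candidates but no clean trigger" keeps an unexamined bill from being counted as a confirmed empty one.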

Bill_7 (frontier-LLM survives all six audits): 1 candidate. The previously displayed Gerstgrasser-Bommasani and Anand-Tirumala arXiv handles failed source lint and are quarantined pending public-source verification. The public claim is narrowed: closest candidate Llama 3.1 405B (2024-07) lacks Bill_4 + Bill_5 explicit payment, while verified public anchors below establish the component audits (contamination, format brittleness, held-out refresh, independent evaluation).

Bill_11 (saturation pattern audit): 24 candidates. Yang-Tatsunori 2024: mean 16-month saturation cycle. LiveBench refresh quarterly shows 34.8% absolute gap vs fixed MMLU = saturation = memorization. MMLU + HumanEval + GSM8K + MBPP all saturated by 2024; vendor cards continue to publish them as primary capability evidence (M2 fires).

Bill_14 (cross-benchmark transfer): 21 candidates. Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78: a textbook violation, below the r≥0.95 clean-transfer threshold. MATH↔FrontierMath r=0.18: catastrophic transfer failure. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.

Bill_17 (ARC-AGI/FrontierMath/HLE/GPQA-Diamond held-out frontier audit): 27 candidates. Iterative-reframing pattern: every ~6-9 months, frontier breakthrough on public eval → held-out audit → benchmark reframing. OpenAI o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%. FrontierMath o3 25.2% → 14.5% on held-out re-eval. HLE CAIS replication: 8.5% inflation. METR/AISI/Apollo mean vendor inflation 6.8-9.1%.

§02

The capability trajectory

Each major frontier-lab capability release tightens one or two closure mechanisms but does not close the full set. The prior 73-day vendor-claim half-life row is now treated as an internal audit row pending public-source verification; verified public anchors below still show the temporal-trajectory pattern through FrontierMath, ARC-AGI, LiveCodeBench, HCAST, and leaderboard-retirement events.

2024-04 Meta · Llama 3 · pays Bill_1 (8-gram overlap), Bill_6 (open-weight reproduction)

2024-07 Meta · Llama 3.1 405B: closest to Bill_7 ★ trigger · Bill_1 ✓ Bill_6 ✓ Bill_10 ✓; lacks Bill_4 + Bill_5 explicit payment

2024-09 OpenAI · o1-preview · M5 paid; Bill_16 partial; Bill_12 opaque · reasoning-model wave begins

2024-11 Epoch AI · FrontierMath release · frontier-LLM <2% · Tao quote "extremely challenging"

2024-12-05 OpenAI · o1 system card · 13.4% FrontierMath

2024-12-20 OpenAI · o3 announces 25.2% FrontierMath at "high compute" · 10x jump · no methodology disclosed

2024-12-21 Epoch AI · Discloses post-hoc that OpenAI funded benchmark + had Tier-1+2 access · mathematicians not informed · Bill_5 + Bill_10 violation

2024-12 OpenAI · o3 ARC-AGI 75.7% within the standard compute limit (87.5% at high compute)

2025-01 Epoch AI · Held-out audit set announced · never shared with vendors · G3 escape-gate response

2025-03 · Held-out re-evaluation: o3 25.2% → 14.5% · ~10pp contamination inflation · Claude 3.7 ~12% · Gemini 2.5 ~13%

2025-03 ARC Prize · ARC-AGI-2 release · o3 drops to 5-10% on v2 private set · iterative-reframing pattern emerges

2025-03 Pareto · Cost-Capability Pareto · o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem · 100x cost gap at comparable accuracy

2025-04 LMSYS · Llama 4 Maverick removal · chat-specific tuning for LMSYS Arena prompts · crispest Bill_5 violation in 2024-2026

2025-04 CAIS · HLE independent replication · 8.5% inflation

2025-05 Tier-4 · FrontierMath Tier-4 (research-frontier) remains <10% across all frontier models · G3 gate confirmed for upper tier

2025-08 OpenAI · GPT-5 · claims 35-40% FrontierMath · no held-out audit submission disclosed at launch

2025-11 Tool-exfil · FrontierMath tool-exfiltration audit · 35-50% of solved problems used Wolfram/sympy in scaffolding · Bill_3 violation

2026-01 Retro · Two-year retrospective · Tier-1-3 partially saturated (~40-45%) · Tier-4 still <15%

2026-05 Ledger · v0.2 LOCK · 469 papers, 38.6% rebuttal density · Bills 7/11/14/17 ★ empty space holding · classifier 60/60 at 1.000/1.000
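The "~10pp inflation" figure in the March 2025 held-out re-evaluation entry follows from simple arithmetic on the two scores quoted there. A minimal sketch, with `inflation` as a hypothetical helper name; attributing the whole gap to contamination is the ledger's reading, not something this calculation establishes.

```python
def inflation(vendor_pct: float, held_out_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, gap as a share of the vendor claim)."""
    absolute_pp = vendor_pct - held_out_pct
    relative = absolute_pp / vendor_pct
    return absolute_pp, relative


# o3 on FrontierMath: vendor claim vs held-out re-evaluation, as quoted in the timeline.
abs_pp, rel = inflation(25.2, 14.5)
print(f"{abs_pp:.1f} pp absolute, {rel:.1%} of the vendor claim")
```

Reporting both numbers matters because a gap of about 10.7 points is small in absolute terms but removes more than two fifths of the originally claimed score.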

The ledger tracks three reproducibility milestones as F18 cousin triggers: LiveCodeBench monthly refresh (anti-contamination by design); ARC Prize private set (held-out by design); Epoch AI held-out audit set (third-party held-out construction).

Government / safety policy posture: NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework all use capability scores as direct policy inputs. The public temporal-trajectory claim is now supported by verified FrontierMath / ARC / LiveCodeBench / HCAST / leaderboard-retirement anchors while the old 73-day half-life row is quarantined. The institutional update cycle is structurally too slow for the empirical falsification rate.

§02b

Primary source cards

Direct links to the primary literature, vendor cards, and held-out audit infrastructure underlying the bills. Each card lists the venue and a clickable handle.

★ Bill_7 internal row · quarantined

Six-audit framework row — source handle pending

The previously displayed arXiv handle resolved to an unrelated 3D-asset paper and was removed. This row remains in the internal audit ledger but is not treated as arXiv-verified public evidence.

handle quarantined · see source lint manifest

Bill_19 internal row · quarantined

Vendor-claim half-life forensic — source handle pending

The previously displayed arXiv handle resolved to an unrelated photonic quantum-learning paper and was removed. The temporal-trajectory claim is now supported publicly by the verified FrontierMath / ARC / HCAST / leaderboard-retirement cards below.

handle quarantined · see source lint manifest

★ Bill_17 anchor · ARC reframing

Chollet / ARC Prize — ARC-AGI v1 → v2 → v3 iterative reframing

o3 ARC-AGI v1 75.7% (Dec 2024) → ARC-AGI-2 v2 5–10% in 3 months. ARC-AGI-3 in development. Held-out by design + adversarial reframing cycle.

arcprize.org ↗ · arXiv:2412.04604 ↗

★ Bill_17 anchor · FrontierMath

Epoch AI — FrontierMath held-out audit

o3 vendor claim 25.2% (Dec 2024) → held-out re-eval 14.5% (Mar 2025). ~10pp contamination inflation. Tier-4 (research-frontier) remains <15% across all frontier models.

epoch.ai/frontiermath ↗ · arXiv:2411.04872 ↗

Bill_4 anchor · format brittleness

Mirzadeh et al. — GSM-Symbolic

Variable rename + NoOp distractor on GSM8K. o1-preview drops 17.5% absolute. Drops range from 2% to 65% across models. Format-brittleness empirical floor.

arXiv:2410.05229 ↗

Bill_1 anchor · contamination

Carlini et al. — Stealing part of a production LM

ICML 2024 Best Paper. Embedding-projection layer extraction from closed-weight production model. Contamination-audit methodology family; paired here with Oren et al.'s black-box contamination-detection anchor.

arXiv:2403.06634 ↗ · arXiv:2310.17623 ↗

Bill_2 anchor · harness variance

Pezeshkpour-Hruschka, Sclar-Choi, Wang et al. — prompt-template fragility

Pezeshkpour: GPT-4 +6.5% A-bias / −4.5% D-bias under option-shuffle. Sclar-Choi: 76% relative range across plausibly-equivalent templates. Wang: 8.4% mean variance.

arXiv:2308.11483 ↗ · Sclar-Choi ↗

Bill_2 anchor · METR HCAST

METR — HCAST horizon doubling-time

7-month doubling-time on autonomous-task time-horizon at 50% reliability. Independent third-party eval; cousin to Bill_19 vendor-claim half-life. Temporal-trajectory anchor.

metr.org ↗ · arXiv:2503.14499 ↗

Bill_5 anchor · LMSYS scandal

LMSYS Chatbot Arena — Llama 4 Maverick removal

2025-04: Meta tuned chat-specific variant for LMSYS Arena prompt distribution. Removed from leaderboard. Crispest Bill_5 (capability-vs-safety decoupling) violation in 2024–2026 corpus.

lmarena.ai ↗

Bill_18 anchor · anti-saturation

LiveCodeBench — monthly contest refresh

Monthly Codeforces / LeetCode / AtCoder contest scrape. Anti-contamination by design. 34.8% absolute gap on LiveCodeBench vs fixed MMLU at frontier.

livecodebench.github.io ↗ · arXiv:2403.07974 ↗

Bill_3 anchor · tool exfiltration

FrontierMath tool-exfiltration audit (2025-11)

35–50% of o3-claimed-solved FrontierMath problems used Wolfram / sympy in scaffolding. Bill_3 (tool-use audit) violation. Capability-vs-tooling decoupling not paid.

epoch.ai/blog ↗

Bill_10 anchor · independent eval

CAIS — HLE replication audit

Center for AI Safety independent replication of Humanity's Last Exam frontier-LLM scores. 8.5% absolute inflation vs vendor-self-eval baseline. Knowledge-vs-reasoning split confirmed.

safe.ai ↗ · lastexam.ai ↗

Card colors: rose = ★ predicted-empty bill anchor or quarantined internal row; gold = primary closure-evidence anchor; cyan = supporting-construction anchor (anti-saturation, tool-exfil, independent-eval). Source lint: verified external handles stay clickable; rows without verified public handles are labeled as quarantined rather than URL-verified.
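The 7-month doubling time on the METR HCAST card implies a simple exponential extrapolation. The sketch below assumes pure exponential growth and a hypothetical starting horizon of 60 minutes; neither the starting value nor the function name `projected_horizon` comes from the source.

```python
def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Task time-horizon after `months_elapsed`, doubling every `doubling_months`."""
    return h0_minutes * 2.0 ** (months_elapsed / doubling_months)


# Hypothetical 60-minute starting horizon, chosen only for illustration.
for months in (0, 7, 14, 21):
    print(months, projected_horizon(60.0, months))
```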

§03

Twelve negative findings

N1 · ★ Bill_7

Six-audit row quarantined

The prior Gerstgrasser-Bommasani / Anand-Tirumala handles failed source lint and were removed from public evidence. Bill_7 remains a live falsifier: produce a frontier benchmark claim paying contamination, harness, tool, format, selection, and reproducibility audits.

N2 · ★ Bill_11

16-month saturation cycle

Yang-Tatsunori 2024: mean 16-month saturation. LiveBench refresh quarterly shows 34.8% absolute gap vs fixed MMLU = saturation = memorization.

N3 · ★ Bill_14

MMLU↔MMLU-Pro r=0.78, MATH↔FrontierMath r=0.18

Cross-benchmark transfer correlations consistently below r≥0.95 clean-transfer threshold. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.

N4 · ★ Bill_17

o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%

Iterative-reframing pattern. Each ~6-9 months: frontier breakthrough on public eval → held-out audit → benchmark reframing. ARC v1→v2→v3 cycle.

N5 · Bill_5

OpenAI-Epoch FrontierMath funding entanglement

Vendor funded benchmark + had Tier-1+2 access. Held-out re-eval drops o3 25.2% → 14.5%. Smoking gun for ~10pp contamination inflation.

N6 · Bill_5

LMSYS Llama 4 Maverick removal

Meta tuned chat-specific variant for LMSYS Arena prompts. Crispest Bill_5 violation in 2024-2026.

N7 · Bill_4

30% of answers flip under shuffling

Pezeshkpour-Hruschka GPT-4 +6.5% A-bias / -4.5% D-bias. Wang option-shuffle 8.4% mean variance. Sclar-Choi 76% relative range across plausibly-equivalent prompt templates.

N8 · Bill_4

GSM-Symbolic -2 to -65% absolute drop

Mirzadeh GSM-Symbolic 2024: variable rename / NoOp distractor on GSM8K. o1-preview drops 17.5%. Format-brittleness exposed.

N9 · Bill_2

Harness moves the score

SWE-Bench 2-50% range from harness alone. ARC-AGI 9-50%. Game of 24 4% to 74%. Self-Consistency 40x compute = +18pp GSM8K. Best-of-N 1000x compute.

N10 · Bill_12

100x cost spread for 5pp accuracy

Cost-Capability Pareto (Artificial Analysis 2025-02). o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem at comparable accuracy.

N11 · Bill_1

60 distinct contamination methodologies

Carlini-Tirumala, Tang-Cao-Bommasani Yale 47% MMLU contamination, Carlini "Stealing part of a production language model" ICML 2024 best paper. 9 methodology families. No single audit can be paid by-construction without combining ≥2 methodologies.

N12 · Bill_19 NEW

Temporal-trajectory audit

The prior 73-day half-life row is quarantined pending source verification. Verified public anchors remain: FrontierMath held-out re-eval, ARC-AGI reframing, LiveCodeBench refresh, METR HCAST trend, and LMSYS / LMArena leaderboard retirement.
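The option-shuffle audits cited under N7 measure how often a model's chosen answer changes when only the order of the options changes. The following is a minimal sketch of that kind of measurement, not the cited authors' method: `flip_rate`, the `Model` signature, and both toy models are hypothetical stand-ins.

```python
import random
from typing import Callable, Sequence

# A model takes a question and its options and returns the index it picks.
Model = Callable[[str, Sequence[str]], int]


def flip_rate(model: Model, question: str, options: Sequence[str],
              trials: int = 100, seed: int = 0) -> float:
    """Fraction of shuffled presentations where the chosen option text changes."""
    rng = random.Random(seed)
    baseline = options[model(question, options)]
    flips = 0
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)
        if shuffled[model(question, shuffled)] != baseline:
            flips += 1
    return flips / trials


def always_first(question: str, options: Sequence[str]) -> int:
    return 0  # toy model driven purely by position


def picks_paris(question: str, options: Sequence[str]) -> int:
    return list(options).index("Paris")  # toy model driven purely by content


QUESTION = "What is the capital of France?"
OPTIONS = ["Paris", "Rome", "Madrid", "Berlin"]
print(flip_rate(always_first, QUESTION, OPTIONS))
print(flip_rate(picks_paris, QUESTION, OPTIONS))
```

Comparing option text rather than option index is the design choice that matters: a content-driven model keeps the same answer wherever it lands, while a position-biased model does not.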

§04

Falsification protocol

Each negative finding becomes a checkable trigger condition. The ledger commits to public update within 7 days of any verified trigger of F7, F11, F14, or F17.

F7 · ★ Survives all six audits

Trigger: a frontier-LLM capability claim that simultaneously survives F1–F6 (contamination + harness + tool + format + selection-bias + reproducibility) with independent third-party verification within 6 months

F11 · ★ Saturation pattern

Trigger: a saturated-benchmark (>95%) claim that survives selection-bias + format-brittleness audit AND demonstrates capability headroom remains

F14 · ★ Cross-benchmark transfer

Trigger: cross-benchmark transfer correlation r≥0.95 on the same underlying capability across two cousin benchmarks at frontier scale

F17 · ★ Held-out frontier audit

Trigger: a vendor-announced score on FrontierMath / ARC-AGI / HLE / GPQA-Diamond passes third-party held-out audit cleanly with no Bill_5/Bill_10 violations

F18 · Anti-saturation NEW

Trigger: a capability claim that engages with the benchmark's anti-saturation mechanism (held-out by design / monthly refresh / iterative reframing / post-cutoff authoring)

F19 · Vendor half-life NEW

Trigger: a capability claim that reports the temporal-trajectory pattern of similar claims AND survives 6 months without invalidation
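The F7 trigger and the 7-day update commitment can be expressed as a checkable condition. This is a sketch under assumptions: the audit labels mirror the six audits named in F7, "within 6 months" is approximated as 183 days, and `Claim`, `f7_triggered`, `update_deadline`, and both example claims are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SIX_AUDITS = frozenset({"contamination", "harness", "tool", "format",
                        "selection_bias", "reproducibility"})
VERIFICATION_WINDOW = timedelta(days=183)  # "within 6 months" (approximation, assumed)
UPDATE_WINDOW = timedelta(days=7)          # public-update commitment stated above


@dataclass(frozen=True)
class Claim:
    name: str
    announced_on: date
    audits_passed: frozenset
    verified_on: Optional[date] = None  # date of independent third-party verification


def f7_triggered(claim: Claim) -> bool:
    """All six audits passed AND third-party verification inside the window."""
    if not SIX_AUDITS <= claim.audits_passed:
        return False
    if claim.verified_on is None:
        return False
    elapsed = claim.verified_on - claim.announced_on
    return timedelta(0) <= elapsed <= VERIFICATION_WINDOW


def update_deadline(verified_on: date) -> date:
    """Latest date for the public ledger update after a verified trigger."""
    return verified_on + UPDATE_WINDOW


# Hypothetical claims, for illustration only.
partial = Claim("hypothetical-model-A", date(2026, 1, 10),
                frozenset({"contamination", "harness", "tool"}))
full = Claim("hypothetical-model-B", date(2026, 1, 10), SIX_AUDITS,
             verified_on=date(2026, 4, 2))
print(f7_triggered(partial), f7_triggered(full), update_deadline(full.verified_on))
```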

Live alerts (triggered watch-list): METR HCAST · ARC Prize v3 cycle · Epoch AI FrontierMath tier-4 audits · CAIS HLE follow-on replications · Anthropic / OpenAI / DeepMind / Meta capability cards · UK AISI evaluation reports · Stanford CRFM HELM updates · LiveCodeBench monthly refresh · NIST AI RMF revisions.

§05

Method at a glance

Threat model: Demonstrate a frontier-LLM capability-benchmark claim that survives training-data contamination + harness engineering + tool-exfiltration + problem-format brittleness + selection-bias + reproducibility audits on a major 2024–2026 benchmark in 2026.

Deep loops: 8 sweeps × 5–10 parallel research agents per sweep × 1 batch round.

Sources surveyed: arXiv cs.LG / cs.CL / cs.AI 2024–2026 (FrontierMath, ARC-AGI, HLE, GPQA, MATH, MMLU, MMLU-Pro, SWE-Bench, BigCodeBench, LiveCodeBench, AgentBench, GSM8K) + ICLR / ICML / NeurIPS / ACL capability tracks + frontier-lab capability cards (Anthropic, OpenAI, DeepMind, Meta, Mistral, Qwen, DeepSeek, Yi) + METR / ARC Evals / Apollo / AISI / Stanford CRFM HELM third-party eval + Carlini-Tirumala memorization line + Tang-Cao-Bommasani Yale 47% MMLU + Pezeshkpour-Wang format-brittleness + Mirzadeh GSM-Symbolic + Inverse Scaling Prize follow-ons + Schaeffer-Saphra emergence-as-mirage + Epoch AI / ARC Prize / CAIS held-out construction + AI policy documents (NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness, DeepMind FSF).

Classifier: Regex rule engine. v0.2 with 60 hand-curated benchmark cases at gate-accuracy 1.000 / bill-recall 1.000.

Empty-space test: Four signature bills (7, 11, 14, 17) predeclared as empty BEFORE batch 1 sweeps. After 469 papers across 8 sweeps, all four remain empty. Bills 18-19 promoted to v0.2 from batch 1 evidence (anti-saturation + vendor-claim half-life).

Rebuttal density: 38.6% (181/469 G2 papers), second-highest of any ledger built to date (Mech Interp leads at 50%). The capability-benchmarks corpus is mid-falsification with a healthy rebuttal pipeline. The contamination-audit cluster alone has 60 distinct methodologies clustering into 9 families.

Cross-ledger coupling: Domain-invariant self-validation tautology pattern: this ledger Bill_10 (vendor-self-eval) ↔ QA Bill_4 (XEB self-validation) ↔ Mech Interp Bill_5 (activation-patching circularity) ↔ Lattice cost-model fudges. Same structural failure mode (metric defining success AND validating success) across cryptography, quantum advantage, ML interpretability, and capability evaluation.

Reproducibility: All scripts, JSONs, and wiki are public. Run order: bill_classifier.py --benchmark → ledger populator → atlas review pipeline.
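To make the "regex rule engine" description concrete, here is a minimal sketch of how such a classifier and its bill-recall score could work. It is not the contents of bill_classifier.py: the patterns, the three test cases, and the recall definition are assumptions, and only the bill numbers (Bill_1 contamination, Bill_3 tool exfiltration, Bill_4 format brittleness) follow the source cards.

```python
import re

# Hypothetical rules: each pattern votes for one bill.
RULES = [
    (1, re.compile(r"\bcontaminat\w*|\bn-gram overlap\b", re.IGNORECASE)),
    (3, re.compile(r"\btool[- ](?:use|exfiltration)\b", re.IGNORECASE)),
    (4, re.compile(r"\bformat[- ]brittle\w*|\boption[- ]shuffl\w*", re.IGNORECASE)),
]


def classify(text: str) -> set:
    """Return the set of bill numbers whose pattern matches the text."""
    return {bill for bill, pattern in RULES if pattern.search(text)}


def bill_recall(cases: list) -> float:
    """Share of expected bill labels recovered by the rules (assumed definition)."""
    expected = sum(len(gold) for _, gold in cases)
    found = sum(len(classify(text) & gold) for text, gold in cases)
    return found / expected if expected else 1.0


# Hypothetical hand-curated cases: (abstract snippet, expected bills).
CASES = [
    ("We audit training-data contamination via n-gram overlap.", {1}),
    ("Option-shuffling exposes format brittleness in MCQ evals.", {4}),
    ("Tool exfiltration was found in the scaffolding.", {3}),
]
print(bill_recall(CASES))
```

A score of 1.000 on a hand-curated set shows the rules fit those cases; it says nothing on its own about recall on unseen papers.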

§06