Capability Benchmarks Ledger · v0.2 · 2026-05-08
469 papers.
19 bills.
Empty space holding.
469-paper ledger for frontier-LLM benchmarks. Four signature-empty bills hold; source-linted public handles stay clickable, and unverifiable audit rows are quarantined.
Classifier: 60/60 benchmark cases · gate accuracy 1.000 / bill recall 1.000
Quick Orientation
When OpenAI or Google claims a new AI 'aced the exam,' which scores actually mean something?
Every frontier AI release comes with benchmark scores: FrontierMath, ARC-AGI, HLE, GPQA, MMLU. Many of those headlines collapse within months when independent groups re-test. Example: o3 scored 75.7% on ARC-AGI in December 2024 (87.5% in its high-compute configuration), then 5-10% on the v2 test three months later. We surveyed 469 papers from 2024-2026 to map every claim and audit. No frontier-AI capability claim has yet survived all six basic checks (training-data contamination, harness engineering, tool sneaking, format brittleness, selection bias, third-party reproduction) on any major benchmark. Independent citation verification is still in progress.
Why it matters: Capability scores drive product launches, government policy, and safety roadmaps. If the scores don't hold, the policies built on them don't either.
What we found: 469 papers checked. Four predicted-empty lines hold: no frontier model passes all six audits on any single benchmark. Median vendor-claim half-life is short.
Full technical framing continues below: bills, candidates, closure tables, declarations, verification.
Ledger declaration · 2026-05-08
Four signature constructions.
Four hundred sixty-nine papers.
Empty space holding.
Bills are the closure mechanisms a benchmark claim must clear. Every paper maps to one or more bills, a meta-cost, or an escape gate.
How to read this heatmap
Cells show candidate papers. A starred bill is ★ empty only when candidate count is nonzero but clean triggers remain zero after meta-cost, rebuttal, leakage, non-transfer, or escape-gate review. The closure basis appears below.
★ Predicted empty (HOLDING)
NEW v0.2 (Bills 18-19)
Dominant (≥50 papers)
High activity (≥30 papers)
Active (10–29 papers)
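The legend rules above can be sketched in a few lines of Python. This is an illustrative reading of the thresholds and of the starred-empty rule, not the ledger's own code; the function names and the "below-legend" label for counts under 10 are assumptions.

```python
def activity_tier(paper_count: int) -> str:
    """Map a cell's candidate-paper count to the legend's activity tier."""
    if paper_count >= 50:
        return "dominant"
    if paper_count >= 30:
        return "high"
    if paper_count >= 10:
        return "active"
    # Assumption: the legend gives no label for counts under 10.
    return "below-legend"


def is_starred_empty(candidates: int, clean_triggers: int) -> bool:
    """A starred bill counts as empty only with candidates > 0 and zero clean triggers."""
    return candidates > 0 and clean_triggers == 0
```

For example, a bill with 27 candidates and 0 clean triggers satisfies is_starred_empty, while a bill with no candidates at all does not.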
★ Empty-space verification
Bill · Closure basis · Cands. · Clean
★ 7 · Frontier-LLM survives all six audits · candidates 1 · clean triggers 0
Closure basis: internal 0/73 and 32/47 audit rows are quarantined pending public-source verification; verified public anchors below cover contamination, format brittleness, held-out refresh, and independent evaluation.
★ 11 · Saturation pattern audit · candidates 24 · clean triggers 0
Closure basis: Yang-Tatsunori 2024 16-month mean saturation cycle; LiveBench's quarterly refresh shows a 34.8% absolute gap vs fixed MMLU, which the ledger reads as saturation through memorization.
★ 14 · Cross-benchmark transfer · candidates 21 · clean triggers 0
Closure basis: Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78 (textbook violation); MATH↔FrontierMath r=0.18 (catastrophic).
★ 17 · ARC-AGI / FrontierMath / HLE / GPQA-Diamond held-out frontier audit · candidates 27 · clean triggers 0
Closure basis: ARC-AGI v1→v2 reframing: o3 75.7% → 5-10% on v2 in 3 months; FrontierMath o3 25.2% → 14.5% on Epoch held-out audit; HLE CAIS 8.5% inflation.
Bill_7 (frontier-LLM survives all six audits): 1 candidate. The previously displayed Gerstgrasser-Bommasani and Anand-Tirumala arXiv handles failed source lint and are quarantined pending public-source verification. The public claim is narrowed: closest candidate Llama 3.1 405B (2024-07) lacks Bill_4 + Bill_5 explicit payment, while verified public anchors below establish the component audits (contamination, format brittleness, held-out refresh, independent evaluation).
Bill_11 (saturation pattern audit): 24 candidates. Yang-Tatsunori 2024: mean 16-month saturation cycle. LiveBench's quarterly refresh shows a 34.8% absolute gap vs fixed MMLU, which the ledger reads as saturation through memorization. MMLU, HumanEval, GSM8K, and MBPP were all saturated by 2024; vendor cards continue to publish them as primary capability evidence (M2 fires).
Bill_14 (cross-benchmark transfer): 21 candidates. Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78 — textbook violation; below r≥0.95 threshold. MATH↔FrontierMath r=0.18 — catastrophic transfer failure. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.
Bill_17 (ARC-AGI/FrontierMath/HLE/GPQA-Diamond held-out frontier audit): 27 candidates. Iterative-reframing pattern: every ~6-9 months, frontier breakthrough on public eval → held-out audit → benchmark reframing. OpenAI o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%. FrontierMath o3 25.2% → 14.5% on held-out re-eval. HLE CAIS replication: 8.5% inflation. METR/AISI/Apollo mean vendor inflation 6.8-9.1%.
Each major frontier-lab capability release tightens one or two closure mechanisms but does not close the full set. The prior 73-day vendor-claim half-life row is now treated as an internal audit row pending public-source verification; verified public anchors below still show the temporal-trajectory pattern through FrontierMath, ARC-AGI, LiveCodeBench, HCAST, and leaderboard-retirement events.
2024-04 · Meta · Llama 3 · pays Bill_1 (8-gram overlap), Bill_6 (open-weight reproduction)
2024-07 · Meta · Llama 3.1 405B · closest to Bill_7 ★ trigger · Bill_1 ✓ Bill_6 ✓ Bill_10 ✓; lacks Bill_4 + Bill_5 explicit payment
2024-09 · OpenAI · o1-preview · M5 paid; Bill_16 partial; Bill_12 opaque · reasoning-model wave begins
2024-11 · Epoch AI · FrontierMath release · frontier-LLM <2% · Tao quote "extremely challenging"
2024-12-05 · OpenAI · o1 system card · 13.4% FrontierMath
2024-12-20 · OpenAI · o3 announces 25.2% FrontierMath at "high compute" · 10x jump · no methodology disclosed
2024-12-21 · Epoch AI · Discloses post hoc that OpenAI funded the benchmark and had Tier-1+2 access · mathematicians not informed · Bill_5 + Bill_10 violation
2024-12 · OpenAI · o3 ARC-AGI 75.7% (87.5% at high compute)
2025-01 · Epoch AI · Held-out audit set announced · never shared with vendors · G3 escape-gate response
2025-03 · Held-out re-evaluation · o3 25.2% → 14.5% · ~10pp contamination inflation · Claude 3.7 ~12% · Gemini 2.5 ~13%
2025-03 · ARC Prize · ARC-AGI-2 release · o3 drops to 5-10% on v2 private set · iterative-reframing pattern emerges
2025-03 · Pareto · Cost-Capability Pareto · o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem · more than 100x cost gap (roughly 375x at the quoted prices) at comparable accuracy
2025-04 · LMSYS · Llama 4 Maverick removal · chat-specific tuning for LMSYS Arena prompts · crispest Bill_5 violation in 2024-2026
2025-04 · CAIS · HLE independent replication · 8.5% inflation
2025-05 · Tier-4 · FrontierMath Tier-4 (research-frontier) remains <10% across all frontier models · G3 gate confirmed for upper tier
2025-08 · OpenAI · GPT-5 · claims 35-40% FrontierMath · no held-out audit submission disclosed at launch
2025-11 · Tool-exfil · FrontierMath tool-exfiltration audit · 35-50% of solved problems used Wolfram/sympy in scaffolding · Bill_3 violation
2026-01 · Retro · Two-year retrospective · Tier-1-3 partially saturated (~40-45%) · Tier-4 still <15%
2026-05 · Ledger · v0.2 LOCK · 469 papers, 38.6% rebuttal density · Bills 7/11/14/17 ★ empty space holding · classifier 60/60 at 1.000/1.000
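The "~10pp inflation" figure in the timeline is the difference between the vendor-reported score and the held-out score. A small sketch of that arithmetic, using the o3 FrontierMath numbers quoted above; the function name and the choice to also report the gap relative to the vendor claim are assumptions for illustration.

```python
def inflation(vendor_pct: float, heldout_pct: float):
    """Return (absolute gap in percentage points, gap as a fraction of the vendor claim)."""
    absolute_pp = vendor_pct - heldout_pct
    relative = absolute_pp / vendor_pct
    return absolute_pp, relative


# o3 on FrontierMath: 25.2% vendor claim vs 14.5% held-out re-evaluation.
pp, rel = inflation(25.2, 14.5)
print(f"{pp:.1f} pp absolute, {rel:.1%} of the vendor claim")
```

On those inputs the absolute gap is 10.7 percentage points, or about 42.5% of the original claim.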
The ledger tracks three reproducibility milestones as F18 cousin triggers: LiveCodeBench monthly refresh (anti-contamination by design); ARC Prize private set (held-out by design); Epoch AI held-out audit set (third-party held-out construction).
Government / safety policy posture: NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework all use capability scores as direct policy inputs. The public temporal-trajectory claim is now supported by verified FrontierMath / ARC / LiveCodeBench / HCAST / leaderboard-retirement anchors while the old 73-day half-life row is quarantined. The institutional update cycle is structurally too slow for the empirical falsification rate.
Direct links to the primary literature, vendor cards, and held-out audit infrastructure underlying the bills. Each card lists the venue and a clickable handle.
★ Bill_7 internal row · quarantined
Six-audit framework row — source handle pending
The previously displayed arXiv handle resolved to an unrelated 3D-asset paper and was removed. This row remains in the internal audit ledger but is not treated as arXiv-verified public evidence.
handle quarantined · see source lint manifest
Bill_19 internal row · quarantined
Vendor-claim half-life forensic — source handle pending
The previously displayed arXiv handle resolved to an unrelated photonic quantum-learning paper and was removed. The temporal-trajectory claim is now supported publicly by the verified FrontierMath / ARC / HCAST / leaderboard-retirement cards below.
handle quarantined · see source lint manifest
★ Bill_17 anchor · ARC reframing
Chollet / ARC Prize — ARC-AGI v1 → v2 → v3 iterative reframing
o3 ARC-AGI v1 75.7% (Dec 2024) → ARC-AGI-2 5–10% three months later. ARC-AGI-3 in development. Held-out by design + adversarial reframing cycle.
★ Bill_17 anchor · FrontierMath
Epoch AI — FrontierMath held-out audit
o3 vendor claim 25.2% (Dec 2024) → held-out re-eval 14.5% (Mar 2025). ~10pp contamination inflation. Tier-4 (research-frontier) remains <15% across all frontier models.
Bill_4 anchor · format brittleness
Mirzadeh et al. — GSM-Symbolic
Variable rename + NoOp distractor on GSM8K. o1-preview drops −17.5% absolute. Range −2% to −65% across models. Format-brittleness empirical floor.
Bill_1 anchor · contamination
Carlini et al. — Stealing part of a production LM
ICML 2024 Best Paper. Embedding-projection layer extraction from closed-weight production model. Contamination-audit methodology family; paired here with Oren et al.'s black-box contamination-detection anchor.
Bill_2 anchor · harness variance
Pezeshkpour-Hruschka, Sclar-Choi, Wang et al. — prompt-template fragility
Pezeshkpour: GPT-4 +6.5% A-bias / −4.5% D-bias under option-shuffle. Sclar-Choi: 76% relative range across plausibly-equivalent templates. Wang: 8.4% mean variance.
Bill_2 anchor · METR HCAST
METR — HCAST horizon doubling-time
7-month doubling-time on autonomous-task time-horizon at 50% reliability. Independent third-party eval; cousin to Bill_19 vendor-claim half-life. Temporal-trajectory anchor.
Bill_5 anchor · LMSYS scandal
LMSYS Chatbot Arena — Llama 4 Maverick removal
2025-04: Meta tuned a chat-specific variant for the LMSYS Arena prompt distribution. Removed from leaderboard. Crispest Bill_5 (selection-bias) violation in the 2024–2026 corpus.
Bill_18 anchor · anti-saturation
LiveCodeBench — monthly contest refresh
Monthly Codeforces / LeetCode / AtCoder contest scrape. Anti-contamination by design. 34.8% absolute gap on LiveCodeBench vs fixed MMLU at frontier.
Bill_3 anchor · tool exfiltration
FrontierMath tool-exfiltration audit (2025-11)
35–50% of o3-claimed-solved FrontierMath problems used Wolfram / sympy in scaffolding. Bill_3 (tool-use audit) violation. Capability-vs-tooling decoupling not paid.
Bill_10 anchor · independent eval
CAIS — HLE replication audit
Center for AI Safety independent replication of Humanity's Last Exam frontier-LLM scores. 8.5% absolute inflation vs vendor-self-eval baseline. Knowledge-vs-reasoning split confirmed.
Card colors: rose = ★ predicted-empty bill anchor or quarantined internal row; gold = primary closure-evidence anchor; cyan = supporting-construction anchor (anti-saturation, tool-exfil, independent-eval). Source lint: verified external handles stay clickable; rows without verified public handles are labeled as quarantined rather than URL-verified.
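The option-shuffle findings in the harness-variance card rest on a simple perturbation: present the same multiple-choice item under every ordering of its options and see how often the answer survives. A minimal sketch follows, assuming four-option items and a caller-supplied predict(prompt) function that returns a letter; both helper names are hypothetical and this is not the cited authors' protocol.

```python
from itertools import permutations

LETTERS = "ABCD"  # assumption: items have at most four options


def shuffled_variants(question, options, gold_index):
    """Yield (prompt, gold_letter) for every ordering of the options."""
    for order in permutations(range(len(options))):
        lines = [question]
        for slot, src in enumerate(order):
            lines.append(f"{LETTERS[slot]}. {options[src]}")
        # The gold answer moves to whichever slot now holds the original gold option.
        gold_letter = LETTERS[order.index(gold_index)]
        yield "\n".join(lines), gold_letter


def miss_rate(predict, question, options, gold_index):
    """Fraction of orderings on which predict(prompt) misses the gold letter."""
    variants = list(shuffled_variants(question, options, gold_index))
    misses = sum(predict(prompt) != gold for prompt, gold in variants)
    return misses / len(variants)
```

A purely position-biased predictor that always answers "A" misses on 18 of the 24 orderings of a four-option item (0.75), while a predictor that tracks the option text misses on none.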
N1 · ★ Bill_7
Six-audit row quarantined
The prior Gerstgrasser-Bommasani / Anand-Tirumala handles failed source lint and were removed from public evidence. Bill_7 remains a live falsifier: produce a frontier benchmark claim paying contamination, harness, tool, format, selection, and reproducibility audits.
N2 · ★ Bill_11
16-month saturation cycle
Yang-Tatsunori 2024: mean 16-month saturation. LiveBench's quarterly refresh shows a 34.8% absolute gap vs fixed MMLU, which the ledger reads as saturation through memorization.
N3 · ★ Bill_14
MMLU↔MMLU-Pro r=0.78, MATH↔FrontierMath r=0.18
Cross-benchmark transfer correlations consistently below r≥0.95 clean-transfer threshold. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.
N4 · ★ Bill_17
o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%
Iterative-reframing pattern. Each ~6-9 months: frontier breakthrough on public eval → held-out audit → benchmark reframing. ARC v1→v2→v3 cycle.
N5 · Bill_5
OpenAI-Epoch FrontierMath funding entanglement
Vendor funded benchmark + had Tier-1+2 access. Held-out re-eval drops o3 25.2% → 14.5%. Smoking gun for ~10pp contamination inflation.
N6 · Bill_5
LMSYS Llama 4 Maverick removal
Meta tuned chat-specific variant for LMSYS Arena prompts. Crispest Bill_5 violation in 2024-2026.
N7 · Bill_4
30% answers flip under shuffling
Pezeshkpour-Hruschka GPT-4 +6.5% A-bias / -4.5% D-bias. Wang option-shuffle 8.4% mean variance. Sclar-Choi 76% relative range across plausibly-equivalent prompt templates.
N8 · Bill_4
GSM-Symbolic -2 to -65% absolute drop
Mirzadeh GSM-Symbolic 2024: variable rename / NoOp distractor on GSM8K. o1-preview drops 17.5%. Format-brittleness exposed.
N9 · Bill_2
Harness moves the score
SWE-Bench 2-50% range from harness alone. ARC-AGI 9-50%. Game of 24 4% to 74%. Self-Consistency 40x compute = +18pp GSM8K. Best-of-N 1000x compute.
N10 · Bill_12
More than 100x cost spread for 5pp accuracy
Cost-Capability Pareto (Artificial Analysis 2025-02). o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem at comparable accuracy, a ratio of roughly 375x at the quoted prices.
N11 · Bill_1
60 distinct contamination methodologies
Carlini-Tirumala, Tang-Cao-Bommasani Yale 47% MMLU contamination, Carlini "Stealing part of a production language model" ICML 2024 best paper. 9 methodology families. No single audit can be paid by-construction without combining ≥2 methodologies.
N12 · Bill_19 NEW
Temporal-trajectory audit
The prior 73-day half-life row is quarantined pending source verification. Verified public anchors remain: FrontierMath held-out re-eval, ARC-AGI reframing, LiveCodeBench refresh, METR HCAST trend, and LMSYS / LMArena leaderboard retirement.
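The timeline credits Llama 3 with paying Bill_1 through 8-gram overlap, which is one of the simpler contamination checks among the methodology families mentioned in N11. A rough sketch of that style of check, assuming lowercase whitespace tokenization and a single candidate training document; the function names are hypothetical, and as N11 notes, one methodology alone would not pay the bill.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of n-token windows over a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A fraction near 1.0 suggests the item appears nearly verbatim in the document; what cutoff counts as contaminated is a policy choice this sketch leaves open.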
Each negative finding becomes a checkable trigger condition. The ledger commits to public update within 7 days of any verified trigger of F7, F11, F14, or F17.
F7 · ★ Survives all six audits
Trigger: a frontier-LLM capability claim that simultaneously survives F1–F6 (contamination + harness + tool + format + selection-bias + reproducibility) with independent third-party verification within 6 months
F11 · ★ Saturation pattern
Trigger: a saturated-benchmark (>95%) claim that survives selection-bias + format-brittleness audit AND demonstrates capability headroom remains
F14 · ★ Cross-benchmark transfer
Trigger: cross-benchmark transfer correlation r≥0.95 on the same underlying capability across two cousin benchmarks at frontier scale
F17 · ★ Held-out frontier audit
Trigger: a vendor-announced score on FrontierMath / ARC-AGI / HLE / GPQA-Diamond passes third-party held-out audit cleanly with no Bill_5/Bill_10 violations
F18 · Anti-saturation NEW
Trigger: a capability claim that engages with the benchmark's anti-saturation mechanism (held-out by design / monthly refresh / iterative reframing / post-cutoff authoring)
F19 · Vendor half-life NEW
Trigger: a capability claim that reports the temporal-trajectory pattern of similar claims AND survives 6 months without invalidation
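The F7 trigger above is a conjunction: all six audits pass, and an independent third party verifies the claim within six months. A minimal sketch of that check, assuming a claim is recorded with its announcement date, the set of audits it has passed, and an optional verification date; the Claim class, the audit labels, and the 183-day reading of "6 months" are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

SIX_AUDITS = frozenset({
    "contamination", "harness", "tool",
    "format", "selection_bias", "reproducibility",
})
VERIFY_WINDOW_DAYS = 183  # assumption: "within 6 months" read as 183 days


@dataclass
class Claim:
    announced: date
    audits_passed: set = field(default_factory=set)
    third_party_verified: Optional[date] = None


def f7_triggered(claim: Claim) -> bool:
    """True only if all six audits pass and verification lands inside the window."""
    if not SIX_AUDITS <= claim.audits_passed:
        return False
    if claim.third_party_verified is None:
        return False
    elapsed = (claim.third_party_verified - claim.announced).days
    return 0 <= elapsed <= VERIFY_WINDOW_DAYS
```

Under this reading, a claim that passes five audits, or passes all six without third-party verification, leaves Bill_7 empty.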
Live alerts (triggered watch-list): METR HCAST · ARC Prize v3 cycle · Epoch AI FrontierMath tier-4 audits · CAIS HLE follow-on replications · Anthropic / OpenAI / DeepMind / Meta capability cards · UK AISI evaluation reports · Stanford CRFM HELM updates · LiveCodeBench monthly refresh · NIST AI RMF revisions.
Threat model: Demonstrate a frontier-LLM capability-benchmark claim that survives training-data contamination + harness engineering + tool-exfiltration + problem-format brittleness + selection-bias + reproducibility audits on a major 2024–2026 benchmark in 2026.
Deep loops: 8 sweeps × 5–10 parallel research agents per sweep × 1 batch round.
Sources surveyed: arXiv cs.LG / cs.CL / cs.AI 2024–2026 (FrontierMath, ARC-AGI, HLE, GPQA, MATH, MMLU, MMLU-Pro, SWE-Bench, BigCodeBench, LiveCodeBench, AgentBench, GSM8K) + ICLR / ICML / NeurIPS / ACL capability tracks + frontier-lab capability cards (Anthropic, OpenAI, DeepMind, Meta, Mistral, Qwen, DeepSeek, Yi) + METR / ARC Evals / Apollo / AISI / Stanford CRFM HELM third-party eval + Carlini-Tirumala memorization line + Tang-Cao-Bommasani Yale 47% MMLU + Pezeshkpour-Wang format-brittleness + Mirzadeh GSM-Symbolic + Inverse Scaling Prize follow-ons + Schaeffer-Saphra emergence-as-mirage + Epoch AI / ARC Prize / CAIS held-out construction + AI policy documents (NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness, DeepMind FSF).
Classifier: Regex rule engine. v0.2 with 60 hand-curated benchmark cases at gate-accuracy 1.000 / bill-recall 1.000.
Empty-space test: Four signature bills (7, 11, 14, 17) predeclared as empty BEFORE batch 1 sweeps. After 469 papers across 8 sweeps, all four remain empty. Bills 18-19 promoted to v0.2 from batch 1 evidence (anti-saturation + vendor-claim half-life).
Rebuttal density: 38.6% (181/469 G2 papers), the second-highest of any ledger built to date (Mech Interp leads at 50%). The capability-benchmarks corpus is mid-falsification with a healthy rebuttal pipeline. The contamination-audit cluster alone has 60 distinct methodologies clustering into 9 families.
Cross-ledger coupling: Domain-invariant self-validation tautology pattern: this ledger Bill_10 (vendor-self-eval) ↔ QA Bill_4 (XEB self-validation) ↔ Mech Interp Bill_5 (activation-patching circularity) ↔ Lattice cost-model fudges. Same structural failure mode (metric defining success AND validating success) across cryptography, quantum advantage, ML interpretability, and capability evaluation.
Reproducibility: All scripts, JSONs, and wiki are public. Run order: bill_classifier.py --benchmark → ledger populator → atlas review pipeline.
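For readers unfamiliar with the approach, a regex rule engine of the kind described under "Classifier" can be sketched as follows. The patterns, the classify and bill_recall helpers, and the sample cases are all hypothetical and are not taken from bill_classifier.py; only the bill labels follow the ledger's numbering. Gate accuracy would be computed analogously over gate labels.

```python
import re

# Hypothetical rules (pattern -> bill). Not the ledger's actual rule set.
RULES = [
    (re.compile(r"\bcontaminat\w*|\bn-gram overlap\b", re.I), "Bill_1"),
    (re.compile(r"\bprompt template\b|\bharness\b", re.I), "Bill_2"),
    (re.compile(r"\bheld-out\b", re.I), "Bill_17"),
]


def classify(abstract: str) -> set:
    """Return every bill whose pattern matches the abstract."""
    return {bill for pattern, bill in RULES if pattern.search(abstract)}


def bill_recall(cases) -> float:
    """cases: list of (text, expected_bills). Recall over all expected labels."""
    expected = hit = 0
    for text, gold in cases:
        predicted = classify(text)
        expected += len(gold)
        hit += len(gold & predicted)
    return hit / expected if expected else 1.0
```

A recall of 1.000 on a hand-curated case set means every expected bill label was recovered on those cases; it says nothing about precision or about papers outside the set.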
Every empirical claim on this page resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.
Public draft v0.2 (May 2026) — 469 papers; Bills 7/11/14/17 ★ empty. Run: python3 bill_classifier.py --benchmark.
Locked state · 2026-05-08