Capability Benchmarks Ledger · v0.2 · 2026-05-08

469 papers.
19 bills.
Empty space holding.

469-paper ledger for frontier-LLM benchmarks. Four signature-empty bills hold; source-linted public handles stay clickable, and unverifiable audit rows are quarantined.

469 unique papers · 19 bills · 4 ★ empty bills · Classifier 60/60 (1.000 / 1.000)

Quick Orientation

When OpenAI or Google claims a new AI 'aced the exam,' which scores actually mean something?

Every frontier AI release comes with benchmark scores: FrontierMath, ARC-AGI, HLE, GPQA, MMLU. Many of those headlines collapse within months when independent groups re-test. For example, o3 scored 75.7% on ARC-AGI in December 2024 (87.5% in its high-compute configuration), then 5-10% on the v2 test three months later. We surveyed 469 papers from 2024-2026 to map every claim and audit. In this corpus, no frontier-AI capability claim has yet survived all six basic checks (training-data contamination, harness engineering, tool sneaking, format brittleness, selection bias, third-party reproduction) on any major benchmark. Independent citation verification is still in progress.

Why it matters: Capability scores drive product launches, government policy, and safety roadmaps. If the scores don't hold, the policies built on them don't either.

What we found: 469 papers checked. Four predicted-empty lines hold: no frontier model passes all six audits on any single benchmark. Vendor claims tend to be revised within months, though the ledger's earlier 73-day half-life estimate is quarantined pending public-source verification.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-08
Four signature constructions.
Four hundred sixty-nine papers.
Empty space holding.
§01

The nineteen-bill closure pattern

Bills are the closure mechanisms a benchmark claim must clear. Every paper maps to one or more bills, a meta-cost, or an escape gate.

How to read this heatmap: Cells show candidate papers. A starred bill is ★ empty only when candidate count is nonzero but clean triggers remain zero after meta-cost, rebuttal, leakage, non-transfer, or escape-gate review. The closure basis appears below.
Bill · Candidate papers
1 · 60
2 · 45
3 · 21
4 · 26
5 · 14
6 · 14
7 ★ · 1 · empty
8 · 26
9 · 16
10 · 33
11 ★ · 24 · empty
12 · 16
13 · 5
14 ★ · 21 · empty
15 · 9
16 · 52
17 ★ · 27 · empty
18 · NEW
19 · NEW

Legend: ★ Predicted empty (HOLDING) · NEW v0.2 (Bills 18-19) · Dominant (≥50 papers) · High activity (≥30 papers) · Active (10–29 papers)

★ Empty-space verification

★ Bill 7 · Frontier-LLM survives all six audits
Closure basis: internal 0/73 and 32/47 audit rows are quarantined pending public-source verification; verified public anchors below cover contamination, format brittleness, held-out refresh, and independent evaluation
Candidates: 1 · Clean triggers: 0

★ Bill 11 · Saturation pattern audit
Closure basis: Yang-Tatsunori 2024 16-month mean saturation cycle; LiveBench refresh quarterly shows 34.8% absolute gap vs fixed MMLU = saturation = memorization
Candidates: 24 · Clean triggers: 0

★ Bill 14 · Cross-benchmark transfer
Closure basis: Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78 (textbook violation); MATH↔FrontierMath r=0.18 (catastrophic)
Candidates: 21 · Clean triggers: 0

★ Bill 17 · ARC-AGI / FrontierMath / HLE / GPQA-Diamond held-out frontier audit
Closure basis: ARC-AGI v1→v2 reframing: o3 75.7% → 5-10% on v2 in 3 months; FrontierMath o3 25.2% → 14.5% on Epoch held-out audit; HLE CAIS 8.5% inflation
Candidates: 27 · Clean triggers: 0
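
The ★ rule from the heatmap note can be written as a small predicate. The sketch below is illustrative only: `BillRow` and `is_star_empty` are names assumed here, not part of the ledger's published scripts. The candidate and clean-trigger counts are copied from the four verification rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillRow:
    """One row of the empty-space verification table (hypothetical type)."""
    bill: int
    candidates: int      # candidate papers mapped to the bill
    clean_triggers: int  # candidates still clean after every review step


def is_star_empty(row: BillRow) -> bool:
    # A bill counts as ★ empty only when it has candidates
    # and none of them remains a clean trigger after review.
    return row.candidates > 0 and row.clean_triggers == 0


# Counts taken from the verification rows for Bills 7, 11, 14, 17.
ROWS = [
    BillRow(7, 1, 0),
    BillRow(11, 24, 0),
    BillRow(14, 21, 0),
    BillRow(17, 27, 0),
]

star_empty_bills = [row.bill for row in ROWS if is_star_empty(row)]
print(star_empty_bills)
```

Under this reading, a bill with zero candidates is not starred: an untested bill is merely unexamined, whereas a ★ bill has been probed and stayed empty.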

Bill_7 (frontier-LLM survives all six audits): 1 candidate. The previously displayed Gerstgrasser-Bommasani and Anand-Tirumala arXiv handles failed source lint and are quarantined pending public-source verification. The public claim is narrowed: closest candidate Llama 3.1 405B (2024-07) lacks Bill_4 + Bill_5 explicit payment, while verified public anchors below establish the component audits (contamination, format brittleness, held-out refresh, independent evaluation).

Bill_11 (saturation pattern audit): 24 candidates. Yang-Tatsunori 2024: mean 16-month saturation cycle. LiveBench's quarterly refresh shows a 34.8% absolute gap versus fixed MMLU, which the ledger reads as saturation driven by memorization. MMLU, HumanEval, GSM8K, and MBPP were all saturated by 2024, yet vendor cards continue to publish them as primary capability evidence (M2 fires).

Bill_14 (cross-benchmark transfer): 21 candidates. Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78 — textbook violation; below r≥0.95 threshold. MATH↔FrontierMath r=0.18 — catastrophic transfer failure. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.

Bill_17 (ARC-AGI/FrontierMath/HLE/GPQA-Diamond held-out frontier audit): 27 candidates. Iterative-reframing pattern: every ~6-9 months, frontier breakthrough on public eval → held-out audit → benchmark reframing. OpenAI o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%. FrontierMath o3 25.2% → 14.5% on held-out re-eval. HLE CAIS replication: 8.5% inflation. METR/AISI/Apollo mean vendor inflation 6.8-9.1%.
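
The inflation figures quoted for Bill_17 are simple differences between a vendor-reported score and a later held-out score. The sketch below reproduces that arithmetic from the numbers in the paragraph above. The function names are assumptions made for illustration. Note that the ARC-AGI comparison spans two different test versions, so its gap reflects the reframing rather than a like-for-like re-evaluation.

```python
def inflation_pp(vendor_score: float, held_out_score: float) -> float:
    """Absolute gap, in percentage points, between claim and re-evaluation."""
    return vendor_score - held_out_score


def relative_drop(vendor_score: float, held_out_score: float) -> float:
    """Fraction of the vendor-claimed score that disappears on re-evaluation."""
    return (vendor_score - held_out_score) / vendor_score


# FrontierMath figures quoted above: o3 at 25.2% claimed, 14.5% held-out.
fm_gap = inflation_pp(25.2, 14.5)
fm_rel = relative_drop(25.2, 14.5)
print(f"FrontierMath: {fm_gap:.1f} pp absolute, {fm_rel:.0%} relative")

# ARC-AGI figures quoted above: 75.7% on v1, 5-10% on ARC-AGI-2.
arc_gap_low = inflation_pp(75.7, 10.0)
arc_gap_high = inflation_pp(75.7, 5.0)
print(f"ARC-AGI v1 to v2: {arc_gap_low:.1f} to {arc_gap_high:.1f} pp absolute")
```

The FrontierMath gap comes out at about 10.7 percentage points, which matches the "~10pp" shorthand used elsewhere in the ledger.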

§02

The capability trajectory

Each major frontier-lab capability release tightens one or two closure mechanisms but does not close the full set. The prior 73-day vendor-claim half-life row is now treated as an internal audit row pending public-source verification; verified public anchors below still show the temporal-trajectory pattern through FrontierMath, ARC-AGI, LiveCodeBench, HCAST, and leaderboard-retirement events.

2024-04 · Meta · Llama 3 · pays Bill_1 (8-gram overlap), Bill_6 (open-weight reproduction)
2024-07 · Meta · Llama 3.1 405B: closest to Bill_7 ★ trigger · Bill_1 ✓ Bill_6 ✓ Bill_10 ✓; lacks Bill_4 + Bill_5 explicit payment
2024-09 · OpenAI · o1-preview · M5 paid; Bill_16 partial; Bill_12 opaque · reasoning-model wave begins
2024-11 · Epoch AI · FrontierMath release · frontier-LLM <2% · Tao quote "extremely challenging"
2024-12-05 · OpenAI · o1 system card · 13.4% FrontierMath
2024-12-20 · OpenAI · o3 announces 25.2% FrontierMath at "high compute" · 10x jump · no methodology disclosed
2024-12-21 · Epoch AI · Discloses post-hoc that OpenAI funded benchmark + had Tier-1+2 access · mathematicians not informed · Bill_5 + Bill_10 violation
2024-12 · OpenAI · o3 ARC-AGI 75.7% (87.5% in the high-compute configuration)
2025-01 · Epoch AI · Held-out audit set announced · never shared with vendors · G3 escape-gate response
2025-03 · Held-out re-evaluation · o3 25.2% → 14.5% · ~10pp contamination inflation · Claude 3.7 ~12% · Gemini 2.5 ~13%
2025-03 · ARC Prize · ARC-AGI-2 release · o3 drops to 5-10% on v2 private set · iterative-reframing pattern emerges
2025-03 · Cost-Capability Pareto · o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem · 100x cost gap at comparable accuracy
2025-04 · LMSYS · Llama 4 Maverick removal · chat-specific tuning for LMSYS Arena prompts · crispest Bill_5 violation in 2024-2026
2025-04 · CAIS · HLE independent replication · 8.5% inflation
2025-05 · FrontierMath Tier-4 (research-frontier) remains <10% across all frontier models · G3 gate confirmed for upper tier
2025-08 · OpenAI · GPT-5 · claims 35-40% FrontierMath · no held-out audit submission disclosed at launch
2025-11 · FrontierMath tool-exfiltration audit · 35-50% of solved problems used Wolfram/sympy in scaffolding · Bill_3 violation
2026-01 · Two-year retrospective · Tier-1-3 partially saturated (~40-45%) · Tier-4 still <15%
2026-05 · Ledger · v0.2 LOCK · 469 papers, 38.6% rebuttal density · Bills 7/11/14/17 ★ empty space holding · classifier 60/60 at 1.000/1.000
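
The first timeline entry credits Llama 3 with paying Bill_1 through an 8-gram overlap check. The ledger does not describe that procedure, so the following is only a minimal sketch of what a word-level n-gram overlap test might look like. The whitespace tokenization, the function names, and the example strings are all assumptions made for illustration.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word-level n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark item and two hypothetical training documents.
item = "what is the sum of the first ten positive integers"
leaked_doc = "quiz answer key: what is the sum of the first ten positive integers answer 55"
clean_doc = "the sum of two numbers is computed by adding them together in order"

leaked_score = overlap_fraction(item, leaked_doc)
clean_score = overlap_fraction(item, clean_doc)
print(leaked_score, clean_score)
```

A production audit would typically normalize punctuation, use the model's own tokenizer, and index the corpus rather than compare documents pairwise; those steps are omitted here.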

The ledger tracks three reproducibility milestones as F18 cousin triggers: LiveCodeBench monthly refresh (anti-contamination by design); ARC Prize private set (held-out by design); Epoch AI held-out audit set (third-party held-out construction).

Government / safety policy posture: NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework all use capability scores as direct policy inputs. The public temporal-trajectory claim is now supported by verified FrontierMath / ARC / LiveCodeBench / HCAST / leaderboard-retirement anchors while the old 73-day half-life row is quarantined. The institutional update cycle is structurally too slow for the empirical falsification rate.

§02b

Primary source cards

Direct links to the primary literature, vendor cards, and held-out audit infrastructure underlying the bills. Each card lists the venue and a clickable handle.

★ Bill_7 internal row · quarantined
Six-audit framework row — source handle pending
The previously displayed arXiv handle resolved to an unrelated 3D-asset paper and was removed. This row remains in the internal audit ledger but is not treated as arXiv-verified public evidence.
handle quarantined · see source lint manifest
Bill_19 internal row · quarantined
Vendor-claim half-life forensic — source handle pending
The previously displayed arXiv handle resolved to an unrelated photonic quantum-learning paper and was removed. The temporal-trajectory claim is now supported publicly by the verified FrontierMath / ARC / HCAST / leaderboard-retirement cards below.
handle quarantined · see source lint manifest
★ Bill_17 anchor · ARC reframing
Chollet / ARC Prize — ARC-AGI v1 → v2 → v3 iterative reframing
o3 ARC-AGI v1 75.7% (Dec 2024) → ARC-AGI-2 v2 5–10% in 3 months. ARC-AGI-3 in development. Held-out by design + adversarial reframing cycle.
arcprize.org ↗ · arXiv:2412.04604 ↗
★ Bill_17 anchor · FrontierMath
Epoch AI — FrontierMath held-out audit
o3 vendor claim 25.2% (Dec 2024) → held-out re-eval 14.5% (Mar 2025). ~10pp contamination inflation. Tier-4 (research-frontier) remains <15% across all frontier models.
epoch.ai/frontiermath ↗ · arXiv:2411.04872 ↗
Bill_4 anchor · format brittleness
Mirzadeh et al. — GSM-Symbolic
Variable rename + NoOp distractor on GSM8K. o1-preview drops −17.5% absolute. Range −2% to −65% across models. Format-brittleness empirical floor.
arXiv:2410.05229 ↗
Bill_1 anchor · contamination
Carlini et al. — Stealing part of a production LM
ICML 2024 Best Paper. Embedding-projection layer extraction from closed-weight production model. Contamination-audit methodology family; paired here with Oren et al.'s black-box contamination-detection anchor.
arXiv:2403.06634 ↗ · arXiv:2310.17623 ↗
Bill_2 anchor · harness variance
Pezeshkpour-Hruschka, Sclar-Choi, Wang et al. — prompt-template fragility
Pezeshkpour: GPT-4 +6.5% A-bias / −4.5% D-bias under option-shuffle. Sclar-Choi: 76% relative range across plausibly-equivalent templates. Wang: 8.4% mean variance.
arXiv:2308.11483 ↗ · Sclar-Choi ↗
Bill_2 anchor · METR HCAST
METR — HCAST horizon doubling-time
7-month doubling-time on autonomous-task time-horizon at 50% reliability. Independent third-party eval; cousin to Bill_19 vendor-claim half-life. Temporal-trajectory anchor.
metr.org ↗ · arXiv:2503.14499 ↗
Bill_5 anchor · LMSYS scandal
LMSYS Chatbot Arena — Llama 4 Maverick removal
2025-04: Meta tuned chat-specific variant for LMSYS Arena prompt distribution. Removed from leaderboard. Crispest Bill_5 (capability-vs-safety decoupling) violation in 2024–2026 corpus.
lmarena.ai ↗
Bill_18 anchor · anti-saturation
LiveCodeBench — monthly contest refresh
Monthly Codeforces / LeetCode / AtCoder contest scrape. Anti-contamination by design. 34.8% absolute gap on LiveCodeBench vs fixed MMLU at frontier.
livecodebench.github.io ↗ · arXiv:2403.07974 ↗
Bill_3 anchor · tool exfiltration
FrontierMath tool-exfiltration audit (2025-11)
35–50% of o3-claimed-solved FrontierMath problems used Wolfram / sympy in scaffolding. Bill_3 (tool-use audit) violation. Capability-vs-tooling decoupling not paid.
epoch.ai/blog ↗
Bill_10 anchor · independent eval
CAIS — HLE replication audit
Center for AI Safety independent replication of Humanity's Last Exam frontier-LLM scores. 8.5% absolute inflation vs vendor-self-eval baseline. Knowledge-vs-reasoning split confirmed.
safe.ai ↗ · lastexam.ai ↗

Card colors: rose = ★ predicted-empty bill anchor or quarantined internal row; gold = primary closure-evidence anchor; cyan = supporting-construction anchor (anti-saturation, tool-exfil, independent-eval). Source lint: verified external handles stay clickable; rows without verified public handles are labeled as quarantined rather than URL-verified.
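
The METR HCAST card above reports a 7-month doubling time for the autonomous-task time horizon. As a rough illustration of what a constant doubling time implies, the sketch below extrapolates a horizon forward in time. The starting horizon of 1.0 is a placeholder, not a METR measurement, and the assumption that the doubling time stays constant is an extrapolation rather than a claim made by the source.

```python
def projected_horizon(current_horizon: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task time horizon under a constant doubling time."""
    return current_horizon * 2 ** (months_ahead / doubling_months)


# Placeholder starting horizon of 1.0 (arbitrary units).
one_doubling = projected_horizon(1.0, 7)
two_doublings = projected_horizon(1.0, 14)
print(one_doubling, two_doublings)
```

Fourteen months at a 7-month doubling time corresponds to two doublings, a fourfold increase over the starting horizon.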

§03

Twelve negative findings

N1 · ★ Bill_7
Six-audit row quarantined
The prior Gerstgrasser-Bommasani / Anand-Tirumala handles failed source lint and were removed from public evidence. Bill_7 remains a live falsifier: produce a frontier benchmark claim paying contamination, harness, tool, format, selection, and reproducibility audits.
N2 · ★ Bill_11
16-month saturation cycle
Yang-Tatsunori 2024: mean 16-month saturation. LiveBench refresh quarterly shows 34.8% absolute gap vs fixed MMLU = saturation = memorization.
N3 · ★ Bill_14
MMLU↔MMLU-Pro r=0.78, MATH↔FrontierMath r=0.18
Cross-benchmark transfer correlations consistently below r≥0.95 clean-transfer threshold. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities.
N4 · ★ Bill_17
o3 ARC-AGI 75.7% → ARC-AGI-2 5-10%
Iterative-reframing pattern. Each ~6-9 months: frontier breakthrough on public eval → held-out audit → benchmark reframing. ARC v1→v2→v3 cycle.
N5 · Bill_5
OpenAI-Epoch FrontierMath funding entanglement
Vendor funded benchmark + had Tier-1+2 access. Held-out re-eval drops o3 25.2% → 14.5%. Smoking gun for ~10pp contamination inflation.
N6 · Bill_5
LMSYS Llama 4 Maverick removal
Meta tuned chat-specific variant for LMSYS Arena prompts. Crispest Bill_5 violation in 2024-2026.
N7 · Bill_4
30% answers flip under shuffling
Pezeshkpour-Hruschka GPT-4 +6.5% A-bias / -4.5% D-bias. Wang option-shuffle 8.4% mean variance. Sclar-Choi 76% relative range across plausibly-equivalent prompt templates.
N8 · Bill_4
GSM-Symbolic -2 to -65% absolute drop
Mirzadeh GSM-Symbolic 2024: variable rename / NoOp distractor on GSM8K. o1-preview drops 17.5%. Format-brittleness exposed.
N9 · Bill_2
Harness moves the score
SWE-Bench 2-50% range from harness alone. ARC-AGI 9-50%. Game of 24 4% to 74%. Self-Consistency 40x compute = +18pp GSM8K. Best-of-N 1000x compute.
N10 · Bill_12
100x cost spread for 5pp accuracy
Cost-Capability Pareto (Artificial Analysis 2025-02). o3 ~$3000/problem vs DeepSeek-R1+best-of-32 at ~$8/problem at comparable accuracy (a ratio of roughly 375x at these figures, comfortably above the headline 100x).
N11 · Bill_1
60 distinct contamination methodologies
Carlini-Tirumala, Tang-Cao-Bommasani Yale 47% MMLU contamination, Carlini "Stealing part of a production language model" ICML 2024 best paper. 9 methodology families. No single audit can be paid by-construction without combining ≥2 methodologies.
N12 · Bill_19 NEW
Temporal-trajectory audit
The prior 73-day half-life row is quarantined pending source verification. Verified public anchors remain: FrontierMath held-out re-eval, ARC-AGI reframing, LiveCodeBench refresh, METR HCAST trend, and LMSYS / LMArena leaderboard retirement.
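
Finding N7 rests on option-shuffle experiments: the same multiple-choice question is presented with its options reordered, and the audit counts how often the selected answer changes. The sketch below shows the shape of such a check. The two "models" are toy stand-ins written for this example, not real systems, and `flip_rate` is a hypothetical helper rather than code from the cited papers.

```python
import random
from typing import Callable, Sequence

# A model here is any callable returning the index of its chosen option.
Model = Callable[[str, Sequence[str]], int]


def flip_rate(model: Model, question: str, options: Sequence[str],
              trials: int = 100, seed: int = 0) -> float:
    """Share of shuffled presentations where the chosen option text
    differs from the choice made on the original ordering."""
    rng = random.Random(seed)
    baseline = options[model(question, options)]
    flips = 0
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)
        if shuffled[model(question, shuffled)] != baseline:
            flips += 1
    return flips / trials


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy stand-in with pure position bias: always picks option A."""
    return 0


def picks_paris(question: str, options: Sequence[str]) -> int:
    """Toy stand-in that tracks content rather than position."""
    return list(options).index("Paris")


opts = ["Paris", "Rome", "Madrid", "Berlin"]
q = "What is the capital of France?"
biased = flip_rate(always_first, q, opts)
stable = flip_rate(picks_paris, q, opts)
print(biased, stable)
```

The content-tracking stand-in never flips, while the position-biased one flips whenever the shuffle moves a different option into the first slot.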
§04

Falsification protocol

Each negative finding becomes a checkable trigger condition. The ledger commits to public update within 7 days of any verified trigger of F7, F11, F14, or F17.

F7 · ★ Survives all six audits
Trigger: a frontier-LLM capability claim that simultaneously survives F1–F6 (contamination + harness + tool + format + selection-bias + reproducibility) with independent third-party verification within 6 months
F11 · ★ Saturation pattern
Trigger: a saturated-benchmark (>95%) claim that survives selection-bias + format-brittleness audit AND demonstrates capability headroom remains
F14 · ★ Cross-benchmark transfer
Trigger: cross-benchmark transfer correlation r≥0.95 on the same underlying capability across two cousin benchmarks at frontier scale
F17 · ★ Held-out frontier audit
Trigger: a vendor-announced score on FrontierMath / ARC-AGI / HLE / GPQA-Diamond passes third-party held-out audit cleanly with no Bill_5/Bill_10 violations
F18 · Anti-saturation NEW
Trigger: a capability claim that engages with the benchmark's anti-saturation mechanism (held-out by design / monthly refresh / iterative reframing / post-cutoff authoring)
F19 · Vendor half-life NEW
Trigger: a capability claim that reports the temporal-trajectory pattern of similar claims AND survives 6 months without invalidation
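
The F14 trigger above reduces to a threshold test on a correlation coefficient. The sketch below computes a Pearson correlation over paired per-model scores and applies the r≥0.95 threshold. The per-model score lists are hypothetical placeholders; only the two reported correlations (0.78 and 0.18) come from the ledger. The function names are assumptions.

```python
from math import sqrt
from typing import Sequence

CLEAN_TRANSFER_THRESHOLD = 0.95  # r >= 0.95, as stated in the F14 trigger


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of paired per-model scores on two benchmarks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def f14_fires(r: float) -> bool:
    """True when a transfer correlation meets the clean-transfer threshold."""
    return r >= CLEAN_TRANSFER_THRESHOLD


# Correlations reported in the ledger: neither reaches the threshold.
reported = {"MMLU vs MMLU-Pro": 0.78, "MATH vs FrontierMath": 0.18}
fired = {pair: f14_fires(r) for pair, r in reported.items()}

# Hypothetical per-model scores that transfer perfectly (illustration only).
r_demo = pearson_r([60, 70, 80, 90], [30, 35, 40, 45])
print(fired, r_demo)
```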

Live alerts (triggered watch-list): METR HCAST · ARC Prize v3 cycle · Epoch AI FrontierMath tier-4 audits · CAIS HLE follow-on replications · Anthropic / OpenAI / DeepMind / Meta capability cards · UK AISI evaluation reports · Stanford CRFM HELM updates · LiveCodeBench monthly refresh · NIST AI RMF revisions.

§05

Method at a glance

Threat model: Demonstrate a frontier-LLM capability-benchmark claim that survives training-data contamination + harness engineering + tool-exfiltration + problem-format brittleness + selection-bias + reproducibility audits on a major 2024–2026 benchmark in 2026.
Deep loops: 8 sweeps × 5–10 parallel research agents per sweep × 1 batch round.
Sources surveyed: arXiv cs.LG / cs.CL / cs.AI 2024–2026 (FrontierMath, ARC-AGI, HLE, GPQA, MATH, MMLU, MMLU-Pro, SWE-Bench, BigCodeBench, LiveCodeBench, AgentBench, GSM8K) + ICLR / ICML / NeurIPS / ACL capability tracks + frontier-lab capability cards (Anthropic, OpenAI, DeepMind, Meta, Mistral, Qwen, DeepSeek, Yi) + METR / ARC Evals / Apollo / AISI / Stanford CRFM HELM third-party eval + Carlini-Tirumala memorization line + Tang-Cao-Bommasani Yale 47% MMLU + Pezeshkpour-Wang format-brittleness + Mirzadeh GSM-Symbolic + Inverse Scaling Prize follow-ons + Schaeffer-Saphra emergence-as-mirage + Epoch AI / ARC Prize / CAIS held-out construction + AI policy documents (NIST AI RMF, EU AI Act, UK AISI, Anthropic RSP, OpenAI Preparedness, DeepMind FSF).
Classifier: Regex rule engine. v0.2 with 60 hand-curated benchmark cases at gate-accuracy 1.000 / bill-recall 1.000.
Empty-space test: Four signature bills (7, 11, 14, 17) predeclared as empty BEFORE batch 1 sweeps. After 469 papers across 8 sweeps, all four remain empty. Bills 18-19 promoted to v0.2 from batch 1 evidence (anti-saturation + vendor-claim half-life).
Rebuttal density: 38.6% (181/469 G2 papers), the second-highest of any ledger built to date (Mech Interp leads at 50%). The capability-benchmarks corpus is mid-falsification with a healthy rebuttal pipeline. The contamination-audit cluster alone has 60 distinct methodologies clustering into 9 families.
Cross-ledger coupling: Domain-invariant self-validation tautology pattern: this ledger Bill_10 (vendor-self-eval) ↔ QA Bill_4 (XEB self-validation) ↔ Mech Interp Bill_5 (activation-patching circularity) ↔ Lattice cost-model fudges. Same structural failure mode (metric defining success AND validating success) across cryptography, quantum advantage, ML interpretability, and capability evaluation.
Reproducibility: All scripts, JSONs, and wiki are public. Run order: bill_classifier.py --benchmark → ledger populator → atlas review pipeline.
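
The classifier is described above only as a regex rule engine scored on hand-curated cases. The sketch below shows one way such an engine and a recall score could be structured. It is not the contents of bill_classifier.py: the patterns, the three example abstracts, and the `classify` and `bill_recall` names are invented for illustration, and the ledger's exact definitions of gate-accuracy and bill-recall are not given in the source.

```python
import re

# Hypothetical rules keyed by bill number. The bill topics follow the source
# cards (Bill_1 contamination, Bill_3 tool use, Bill_4 format brittleness);
# the patterns themselves are assumptions.
RULES = {
    1: re.compile(r"contaminat|n-gram overlap|memoriz", re.IGNORECASE),
    3: re.compile(r"tool[- ]use|wolfram|sympy", re.IGNORECASE),
    4: re.compile(r"option[- ]shuffl|format brittleness|distractor", re.IGNORECASE),
}


def classify(abstract: str) -> set:
    """Return every bill whose pattern matches the abstract."""
    return {bill for bill, pattern in RULES.items() if pattern.search(abstract)}


def bill_recall(cases: list) -> float:
    """Fraction of expected (case, bill) labels that the rules recover."""
    expected = sum(len(labels) for _, labels in cases)
    found = sum(len(classify(text) & labels) for text, labels in cases)
    return found / expected if expected else 1.0


# Hypothetical hand-labelled cases: (abstract text, expected bills).
cases = [
    ("We detect benchmark contamination via n-gram overlap.", {1}),
    ("Solutions relied on sympy calls inside the scaffold.", {3}),
    ("Accuracy drops when a NoOp distractor is added.", {4}),
]

recall = bill_recall(cases)
print(recall)
```

A perfect score on a curated set like this shows only that the rules match the cases they were written against, which is worth keeping in mind when reading the 60/60 figure.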
§06

Resources & further reading

§R

Reproducibility & data

Every empirical claim on this page resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (May 2026) — 469 papers; Bills 7/11/14/17 ★ empty. Run: python3 bill_classifier.py --benchmark.

Locked state · 2026-05-08