# Capability Benchmarks Aiwiki — Bills Draft (v0.2)

> Status: post-batch-1 promotion. v0.1 → v0.2 changes:
> - **Promoted Bill_18** (Anti-saturation benchmark construction) — independently developed across BigCode, LiveCodeBench, ARC-AGI-2, ARC-AGI-3 communities; held-out-by-design (FrontierMath, ARC Prize private set), monthly refresh (LiveCodeBench), iterative reframing (ARC v1→v2→v3 cycle every ~6-9 months)
> - **Promoted Bill_19** (Vendor-claim half-life / temporal-trajectory audit) — source-lint update: prior Anand half-life handle quarantined; verified public anchors retained via METR HCAST, FrontierMath, ARC-AGI, LiveCodeBench, and leaderboard-retirement events.
>
> Bills 7, 11, 14, 17 remain ★ predicted-empty. Across batch 1 (469 unique papers, 38.6% rebuttal density), Bill_7 fires in only 1 paper, consistent with the empty-space prediction. Source-lint update: the prior Gerstgrasser-Bommasani and Anand-Tirumala public arXiv handles are quarantined pending verification; verified public anchors are tracked separately in the source-card manifest.
>
> v0.3 lock condition: classifier benchmark passes 1.000/1.000 on ≥50 hand-curated cases across all 19 bills + 6 meta-costs + 3 escape gates.
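
As a rough illustration of the v0.3 lock condition, the sketch below checks a set of hand-curated cases. It assumes that "1.000/1.000" means every case is labelled exactly right (which implies precision and recall of 1.000) and that "across all" means each of the 28 labels appears in at least one case. Both readings are assumptions, as are the `Case` schema and the `lock_condition_met` name.

```python
from dataclasses import dataclass

# Label counts taken from this draft: 19 bills, 6 meta-costs, 3 escape gates.
EXPECTED_LABELS = (
    {f"Bill_{i}" for i in range(1, 20)}
    | {f"M{i}" for i in range(1, 7)}
    | {f"G{i}" for i in range(1, 4)}
)


@dataclass(frozen=True)
class Case:
    """One hand-curated benchmark case (hypothetical schema)."""
    expected: frozenset   # labels assigned by the curator
    predicted: frozenset  # labels produced by the classifier


def lock_condition_met(cases, min_cases=50):
    """True when there are enough cases, every case matches exactly,
    and every expected label is exercised by at least one case."""
    if len(cases) < min_cases:
        return False
    all_correct = all(c.expected == c.predicted for c in cases)
    covered = set().union(*(c.expected for c in cases))
    return all_correct and EXPECTED_LABELS <= covered
```

Exact set equality per case is the strictest reading; a per-label precision/recall report would be a looser alternative if the lock condition is meant that way.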

## The nineteen bills

A "bill" is a closure mechanism that any capability-benchmark claim on a frontier LLM must engage with. We name them by the structural primitive they invoke.

| # | Bill | What gets paid | Empty-space candidate |
|---:|---|---|:---:|
| 1 | **Training-data contamination audit** | n-gram / exact-match / paraphrase-aware leakage check. Lineage: Carlini-Tirumala, Tang-Cao-Bommasani Yale 47% MMLU contamination, EleutherAI Pythia transparency, Carlini "Stealing part of a production language model" ICML 2024 best paper, lm-eval-harness audits. **60 fires in batch 1 — dominant bill.** | |
| 2 | **Harness-engineering audit** | CoT / ToT / self-consistency / best-of-N / debate / scaffold reported with score-without-harness baseline. Lineage: Wang-Zhou self-consistency 40x compute = +18pp GSM8K; Brown "Large Language Monkeys" Llama-7B@1000 ≈ Claude-3@1; SWE-Bench 2-50% range from harness alone. **45 fires.** | |
| 3 | **Tool-exfiltration audit** | web search / Python / Wolfram / RAG / Lean separated from raw-model score. Lineage: AlphaProof / AlphaGeometry-2 (Lean tactic search), PAL, Toolformer, Llemma; FrontierMath tool-exfiltration row quarantined pending public-source handle verification. | |
| 4 | **Problem-format brittleness audit** | score robust to option order, prompt template, and surface rewording. Lineage: Pezeshkpour-Hruschka GPT-4 +6.5% A-bias / -4.5% D-bias / **30% of answers flip under shuffling**; Wang option-shuffle 8.4% mean variance / GPT-4 86.4% → 81.7% permutation-averaged; Sclar-Choi 76% relative range across plausibly equivalent prompt templates; **Mirzadeh GSM-Symbolic 2 to 65% absolute drop under variable rename / NoOp distractor**. | |
| 5 | **Selection-bias audit** | held-out set not used during model selection or hyperparameter tuning. **LMSYS Llama 4 Maverick removal (Apr 2025)** is the crispest 2024-2026 violation; **OpenAI-Epoch FrontierMath funding entanglement (Dec 2024)** is the canonical Western failure mode. | |
| 6 | **Reproducibility audit** | score reproducible at temperature 0 across seeds and frameworks. Lineage: METR/AISI/Apollo mean vendor inflation **6.8-9.1%**; Liu SWE-Bench reproduction failures **49% → 31%**; HuggingFace + DeepMind self-correction Gemini 1 MMLU 90% → 83.7%. | |
| 7 | **★ Frontier-LLM capability claim survives all six audits** | The signature target — a capability-benchmark claim on a frontier LLM that simultaneously pays Bills 1–6. ★ **Empty-space candidate** — predicted no 2024–2026 paper triggers this cleanly. The prior Gerstgrasser-Bommasani and Anand-Tirumala arXiv handles are quarantined pending public-source verification. Closest historic candidate: Meta Llama 3.1 405B (2024-07) — pays Bill_1 (8-gram overlap) + Bill_6 (open-weight reproduction) + Bill_10 (open-weight independent eval) but lacks Bill_4 + Bill_5 explicit payment. | ★ |
| 8 | **Strong-competitor baseline** | beats best open-weight at same compute / specialized solver. Lineage: Hodel ARC-AGI DSL+search = 49% w/o reasoning; DeepSeek-R1 / NuminaMath / Goedel-Prover / Llemma / AlphaProof / Qwen2.5-Math open-source competitive trajectory. **26 fires — strong-baseline insistence is well-populated.** | |
| 9 | **Held-out construction transparency** | transparent benchmark-construction methodology. Lineage: ARC Prize private set (20+ point public-private gap on v1), FrontierMath construction methodology, HLE blind submission, Epoch AI held-out audit set (Jan 2025). **The structural anti-failure mechanism that makes the ★ Bill_17 held-out audit possible.** | |
| 10 | **Vendor-self-evaluation independence** | independent third-party reproduction. Lineage: METR HCAST, ARC Evals, Apollo Research, Stanford CRFM HELM, UK AISI, lm-evaluation-harness. **33 fires; the central failure mode for the o3 25.2% FrontierMath claim.** Vendor-internal scaffold deltas ±15pp vs third-party ±3pp (sweep 46 forensic data point). | |
| 11 | **★ Saturation pattern audit** | claims in the saturation regime (>95%) are not capability evidence. ★ **Empty-space candidate**: predicted no 2024–2026 paper publishes a *saturated* benchmark claim that survives the selection-bias and format-brittleness audits. **24 fires.** Yang-Tatsunori 2024: mean 16-month saturation cycle; quarterly-refreshed LiveBench shows a 34.8% absolute gap vs fixed MMLU, read here as evidence that saturation reflects memorization. **MMLU + MMLU-Pro saturating across all frontier models; only HLE + GPQA Diamond + FrontierMath remain unsaturated.** | ★ |
| 12 | **Inference-cost / compute-budget transparency** | tokens-per-question / search-tree size reported. Lineage: Snell-Sutton 4x test-time ≈ 14x params; **Cost-Capability Pareto: 100x cost spread for 5pp accuracy** (Artificial Analysis 2025-02). **o1/o3 keep per-question budget opaque (Bill_12 violation).** | |
| 13 | **Tokenizer / format-sensitivity audit** | score robust across tokenizers and across Markdown vs plain-text formatting. | |
| 14 | **★ Cross-benchmark transfer** | MATH↔GSM8K / MMLU↔MMLU-Pro / ARC-AGI↔ConceptARC. ★ **Empty-space candidate** — predicted no 2024–2026 paper produces clean cross-benchmark transfer (r≥0.95). **Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78** is the textbook violation. **MATH↔FrontierMath r=0.18.** **CAIS knowledge-vs-reasoning split**: HLE measures distinct capabilities from MMLU. | ★ |
| 15 | **Inverse-scaling / scaling-law-violation audit** | anti-scale degradation reported. McKenzie 2024 + Zhou 2024: 7-8/11 still inverse-scale at frontier. Saphra-Belinkov metric-induced emergence; Schaeffer emergence-as-mirage. | |
| 16 | **Test-time tree-search / agentic-scaffolding decomposition** | raw-model + search + aggregation ablation. **52 fires.** o1/o3 system cards are the canonical Bill_16 anchors. Mitchell-Cosma ARC-AGI: **75% search, 13% raw model**. AlphaCode-2: ~17x raw-vs-search-tree gap. SWE-Agent: **7-10x score gain from scaffolding alone**. | |
| 17 | **★ ARC-AGI / FrontierMath / HLE / GPQA-Diamond held-out frontier audit** | third-party held-out audit on the four highest-stakes benchmarks. ★ **Empty-space candidate** — predicted no 2024–2026 vendor-announced score on these four passes the held-out audit cleanly. **27 fires.** ARC v1→v2→v3 iterative-reframing pattern: o3 75.7% on v1 → 5-10% on v2 (Mar 2025) → reframing v3 (early 2026). FrontierMath o3 25.2% → 14.5% on held-out re-eval (Mar 2025). HLE CAIS replication: 8.5% inflation. | ★ |
| 18 | **Anti-saturation benchmark construction** *(NEW v0.2)* | A capability-benchmark claim must engage with whether the benchmark is anti-saturation-constructed. Lineage: ARC Prize held-out-by-design private set, ARC-AGI-2 (Mar 2025) and ARC-AGI-3 (Jan 2026) iterative reframing, LiveCodeBench monthly refresh, BigCodeBench post-cutoff authoring, FrontierMath Tier-4 (research-frontier), Epoch AI held-out audit set. Bill paid by reporting the benchmark's anti-saturation mechanism. **Distinguished from Bill_9** (which is held-out *transparency*): Bill_18 is *temporal* anti-saturation. **Distinguished from Bill_11** (which is *individual-claim* saturation): Bill_18 is *benchmark-side* construction. | |
| 19 | **Vendor-claim half-life / temporal-trajectory audit** *(NEW v0.2)* | A capability-benchmark claim must engage with the temporal-trajectory pattern of similar claims. Lineage retained through verified public anchors: **METR HCAST** for agentic-task length trends; FrontierMath o3 25.2% → 14.5% held-out re-eval; LMSYS/LMArena Llama 4 retirement; ARC-AGI v1 reframing within ~3 months. Prior Anand half-life handle is quarantined. Bill paid by reporting the claim's expected half-life under independent reproduction. | |

★ = signature construction; the empty-space hypothesis predicts that no paper in the 2024–2026 corpus triggers these bills cleanly without paying meta-costs. Through batch 1 (469 papers, 38.6% rebuttal density), all 4 ★ predictions hold.
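
Bill_7 is a conjunction over Bills 1–6, so its status can be read off a claim's payment record. A minimal sketch, assuming a hypothetical three-state record (paid, explicitly unpaid, unknown); the function name and return format are illustrative and not part of the taxonomy.

```python
# Bill_7 requires simultaneous payment of Bills 1-6 (per the table above).
REQUIRED = frozenset(range(1, 7))


def bill_7_status(paid, unpaid):
    """Classify a claim against Bill_7.

    paid   -- set of bill numbers the claim explicitly pays
    unpaid -- set of bill numbers the claim is known not to pay
    Returns (status, bills), where bills lists the blocking or unknown bills.
    """
    blocking = REQUIRED & unpaid
    if blocking:
        return "fails", sorted(blocking)
    unknown = REQUIRED - paid
    if unknown:
        return "undetermined", sorted(unknown)
    return "fires", []


# Llama 3.1 405B as recorded in the Bill_7 row: pays Bills 1, 6, 10;
# lacks explicit payment of Bills 4 and 5.
print(bill_7_status(paid={1, 6, 10}, unpaid={4, 5}))  # ('fails', [4, 5])
```

Keeping "unknown" separate from "unpaid" avoids asserting more than the source records: the table does not say whether that candidate pays Bills 2 and 3.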

## Six meta-costs (disqualifying conditions)

| # | Meta-cost | Description |
|---|---|---|
| M1 | **Toy-benchmark only** | Claim only succeeds at saturated HumanEval (>95%) / saturated MMLU (>90%) / saturated GSM8K (>95%) — these benchmarks no longer carry capability signal at frontier scale. |
| M2 | **Saturation regime** | Score is in the >95% headroom-exhausted regime. **33 fires; second-most-common meta-cost.** |
| M3 | **Single-prompt-template / unfalsifiable** | Score reported under a single prompt template; no robustness audit. |
| M4 | **Restricted-eval-protocol** | Custom prompting / temperature / top-k per benchmark or per-task that doesn't generalize. |
| M5 | **Compute-budget-conditional** | The score depends on a specific compute budget that's not generally available. **50 fires; dominant meta-cost.** Captures both China-lineage M5 (closed corpus) and OpenAI/Anthropic M5 (compute-conditional reasoning models). |
| M6 | **Implementation-specific** | The score depends on a specific scaffolding implementation. |
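
The threshold conditions in M1 and M2 are mechanical enough to sketch. The version below assumes scores are percentages and reads M1's "only" as "every reported benchmark is on the saturated list with a score above its threshold". The function name and the example scores are hypothetical.

```python
# Saturation thresholds (percent) as listed in the M1 and M2 rows.
M1_THRESHOLDS = {"HumanEval": 95.0, "MMLU": 90.0, "GSM8K": 95.0}
M2_THRESHOLD = 95.0


def saturation_meta_costs(scores):
    """Return the subset of {"M1", "M2"} fired by a claim.

    scores -- mapping of benchmark name to reported accuracy in percent
    """
    fired = set()
    # M1: the claim rests only on saturated toy benchmarks.
    if scores and all(
        name in M1_THRESHOLDS and value > M1_THRESHOLDS[name]
        for name, value in scores.items()
    ):
        fired.add("M1")
    # M2: at least one reported score sits in the >95% regime.
    if any(value > M2_THRESHOLD for value in scores.values()):
        fired.add("M2")
    return fired


# Hypothetical claims, for illustration only.
print(sorted(saturation_meta_costs({"HumanEval": 96.0, "GSM8K": 97.1})))  # ['M1', 'M2']
print(sorted(saturation_meta_costs({"MMLU": 91.0, "GPQA Diamond": 60.0})))  # []
```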

## Three escape gates

A paper that triggers no bill and fits no meta-cost must pass one of three escape gates:

1. **G1 — Methodology paper** — proposes a new benchmark / new evaluation protocol without making a frontier-LLM capability claim
2. **G2 — Negative-result / rebuttal paper** — demonstrates closure failure of a prior claim. **38.6% of corpus (181/469); the rebuttal pipeline.**
3. **G3 — Theoretical-construction paper** — proves a theorem about benchmark contamination, scaling-law-prediction, or capability emergence; no empirical capability claim. ARC Prize held-out construction is the exemplary G3 pattern.
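
The routing order implied above (bills first, then meta-costs, then gates) can be sketched as follows. The `route` and `density` helpers are hypothetical, and the sketch assumes a curator has already assigned the gate label.

```python
from collections import Counter

GATES = ("G1", "G2", "G3")


def route(bills, meta_costs, gate=None):
    """Return the bucket a paper lands in.

    bills      -- collection of bill numbers the paper triggers
    meta_costs -- collection of meta-cost labels the paper fits
    gate       -- curator-assigned gate label, if any
    """
    if bills:
        return "bill"
    if meta_costs:
        return "meta-cost"
    if gate in GATES:
        return gate
    return "unclassified"


def density(routes, label):
    """Fraction of routed papers carrying the given label."""
    if not routes:
        return 0.0
    return Counter(routes)[label] / len(routes)


# Rebuttal density from the batch 1 counts quoted under G2.
print(round(181 / 469, 3))  # 0.386
```

A paper that reaches `"unclassified"` would signal a gap in the taxonomy rather than a property of the paper.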

## Empty-space census (★ predicted-empty bills)

Four bills are signature constructions — the empty-space hypothesis predicts no 2024–2026 paper triggers them cleanly without paying meta-costs.

| Bill | Strongest negative-result anchor | Why it failed to trigger |
|---|---|---|
| 7 — Frontier-LLM capability claim survives all six audits | **Six-audit public handle quarantined** | Prior Gerstgrasser-Bommasani / Anand-Tirumala handles failed lint; closest candidate (Llama 3.1 405B) lacks Bill_4 + Bill_5 explicit payment. |
| 11 — Saturation pattern audit | Yang-Tatsunori 2024: 16-month mean saturation cycle; LiveBench vs MMLU 34.8% gap | MMLU + HumanEval + GSM8K + MBPP all saturated by 2024; vendor cards continue to publish them as primary capability evidence (M2 fires). |
| 14 — Cross-benchmark transfer | **Dubois-Hashimoto MMLU↔MMLU-Pro r=0.78** and **MATH↔FrontierMath r=0.18** | Cross-benchmark r is well below the r≥0.95 threshold for clean transfer. CAIS knowledge-vs-reasoning split: HLE measures distinct capabilities. |
| 17 — ARC-AGI/FrontierMath/HLE/GPQA-Diamond held-out frontier audit | **OpenAI o3 ARC-AGI 75.7% → 5-10% on ARC-AGI-2; FrontierMath 25.2% → 14.5% on held-out** | Iterative-reframing pattern across v1→v2→v3. ARC Prize held-out-by-design + Epoch held-out audit set + HLE blind submission constitute the three working held-out infrastructures, but every vendor headline score has been audited down. |
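
The Bill_14 criterion is a threshold on a correlation coefficient. A minimal sketch, assuming r is the Pearson correlation of per-model scores on two benchmarks (the draft does not say what r is computed over) and that the threshold applies to signed r; `pearson_r` and `clean_transfer` are illustrative names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


def clean_transfer(xs, ys, threshold=0.95):
    """True when the two benchmarks correlate at or above the Bill_14 threshold."""
    return pearson_r(xs, ys) >= threshold
```

Under this reading, the r=0.78 and r=0.18 anchors in the table both fall below the 0.95 threshold, which is why Bill_14 is recorded as not triggered.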

## Iteration plan

- **Batch 1** ✅ (8 sweeps × ~60-71 papers, 469 unique papers): Drafted v0.1 17-bill taxonomy. Promoted Bills 18-19 in v0.2 after batch 1 surfaced multi-paper clusters.
- **Batch 2** (planned): 8 targeted gap-fill sweeps to expand to ~700-800 papers; lock the bill taxonomy at v0.3.
- **Batch 3** (if needed): Edge-case + falsifier protocol drafting (F1–F19).

## Cross-aiwiki coupling

- **Mech Interp Aiwiki Bill_3** (frontier-LLM scale) ↔ **This aiwiki Bills 1-7** — both insist on frontier-scale testing
- **Mech Interp Aiwiki Bill_8** (strong baseline) ↔ **This aiwiki Bill_8** (strong-competitor baseline) — same structural insistence
- **QA Aiwiki Bill_4** (XEB self-validation tautology) ↔ **This aiwiki Bill_10** (vendor-self-eval independence) — same self-validation pattern
- **QA Aiwiki Bill_9** (variational competitor parity) ↔ **This aiwiki Bill_8** — same strong-baseline insistence
- **Lattice Crypto Aiwiki Bill_15** (hybrid-deployment failure) ↔ **This aiwiki Bill_2** (harness-engineering audit) — both are protocol-composition-layer audits
- **Cross-aiwiki domain-invariant pattern**: the **self-validation tautology** recurs across QA Bill_4 (XEB), Lattice cost-model fudges, Mech Interp Bill_5 (activation patching), and **this aiwiki Bill_10 (vendor-self-eval) + Bill_16 (test-time-search baked into score)**. The same structural failure mode (one metric both defines and validates success) appears across cryptography, quantum advantage, ML interpretability, and capability evaluation.