---
title: "Domain-invariant closure-pattern failures in frontier ML claims (2024–2026): a 13-ledger meta-audit — across LLM-centric domains"
author: Kevin Russell
affiliation: Project 42
date: 2026-05-09 (v0.1 draft; held until the 2027-Q3 audit cycle is complete)
status: pre-publication / not for external citation
---

## Abstract

We report a 13-ledger meta-audit of frontier ML capability and safety claims
across the 2024–2026 corpus (2,500+ unique papers, ~64 deep-loop research-agent
sweeps + hand-arbitration). Each ledger pre-declares 3–4 ★ "predicted-empty"
bills before any sweep runs; across the corpus, **18+ ★ bills hold empty with
zero clean triggers**. Cross-ledger analysis surfaces seven structural bridges
that recur in places independent-domain audits would not predict. Three bridges
survive both the adversarial-rebuttal sweep (32 records) and the 14th-ledger
inheritance test (21 forward-looking predictions) and are load-bearing:
(i) asymmetric cross-deployment-surface transfer (capabilities transfer;
mitigations don't), (ii) the 0/N audit pattern across independent forensic
teams, and (iii) anti-saturation construction as the only working closure
mechanism. A fourth, compression of the vendor-claim → independent-rebuttal
closure cycle, holds only when reported as a 30–100 day range. The remaining
three bridges (causally-faithful mechanism empty; distillation =
architecture-portability = scaling-portability; Western/Chinese disclosure
inversion) are predicted to fail under domain extension to robotics / RAG in
the 2027 audit cycle. We predeclared these failures in the meta-aiwiki's
empty-space bills before the audit ran. **The framework's predictive power on
its own claims is itself a result.**

## 1. Introduction

Single-domain audits of frontier ML capability claims (Stanford CRFM HELM,
Bommasani Foundation Models Transparency Index, Anthropic Responsible Scaling
Policy, OpenAI Preparedness Framework, METR HCAST, UK AISI evaluations,
Epoch AI compute trends) typically find one or two structural failure modes
(contamination, format-brittleness, vendor self-eval circularity). Cross-domain
patterns are usually asserted as narrative speculation rather than verified
empirically, because no single audit covers enough domains to test them.

We close that gap by running 13 independent closure-pattern audits — one per
high-stakes computational-claim domain — under a uniform methodology
(*aiwiki/_methodology.md*; reference implementation *aiwiki/factorization/*,
locked v1.16). Each audit yields a Bill × Verdict matrix with 12–19 bills
and 3–4 pre-declared ★ predicted-empty positions. The 13 ledgers span:

| Domain | Papers | Bills | ★ | Status |
|---|---:|---:|---:|---|
| Factorization | 504 | 13 | 3 | Locked v1.16 |
| Quantum Advantage | 650 | 14 | 3 | Locked v1.0 |
| Lattice Cryptography | 635 | 16 | 4 | Draft v0.2 |
| Mechanistic Interpretability | 484 | 18 | 3 | Draft v0.2 |
| Capability Benchmarks | 469 | 19 | 4 | Draft v0.2 |
| Inference-time Safety | 409 | 19 | 4 | Draft v0.2 |
| Compute Governance | 421 | 19 | 4 | Draft v0.2 |
| Reasoning / Chain-of-Thought | 394 | 15 | 3 | Real-data v0.2 |
| Scaling Laws | 302 | 13 | 3 | Real-data v0.2 |
| Vision-Language Capability | 397 | 12 | 3 | Real-data v0.2 |
| Open-weight Frontier | 371 | 14 | 3 | Real-data v0.2 |
| Agentic / Tool-Use | 368 | 14 | 3 | Real-data v0.2 |
| Biology / Protein Folding | 283 | 13 | 3 | Real-data v0.2 |
| **Total** | **5,687** | **199** | **43** | |

Note: papers across ledgers overlap; deduplicated unique corpus = 2,500+.

## 2. Methodology

The 7-stage protocol (`SCOPE → BILLS → SWEEP → CLASSIFY → POPULATE → REVIEW →
LOCK`) is documented in *aiwiki/_methodology.md*. Key methodological commitments:

- **Empty-space hypothesis predeclared**: 3–4 bills per ledger marked ★
  empty *before* sweep runs. Public update committed within 7 days of any
  verified clean trigger.
- **Parallel research-agent sweeps**: 5–10 Opus-class agents per batch,
  each scoped to one venue × one date range, returning ~30–50 papers in
  structured JSON.
- **Hand-arbitration**: parallel agents disagree, and the disagreements are
  diagnostic. A regex rule engine serves as the classifier and is validated
  against a hand-curated 50-case benchmark; 1.000/1.000 accuracy is required
  at lock.
- **Falsification harness**: each ★ bill has a numerical falsification
  condition. Watch-list of cousin precedents with monthly / quarterly /
  triggered re-poll cadences.
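
The classifier gate described above can be sketched as follows. This is a minimal illustration, not the contents of `bill_classifier.py`: the rule patterns, the `Rule` type, the function names, and the three sample cases are all hypothetical, and only the "every benchmark case must pass before lock" requirement comes from the text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    bill_id: int
    pattern: re.Pattern


# Hypothetical rules for illustration; the real patterns are not shown in this draft.
RULES: List[Rule] = [
    Rule(1, re.compile(r"\bcontaminat", re.IGNORECASE)),
    Rule(2, re.compile(r"\bself[- ]eval", re.IGNORECASE)),
]


def classify(text: str) -> Optional[int]:
    """Return the bill id of the first rule that matches, or None."""
    for rule in RULES:
        if rule.pattern.search(text):
            return rule.bill_id
    return None


def benchmark_accuracy(cases: List[dict]) -> float:
    """Fraction of hand-labelled cases where the classifier agrees with the label."""
    if not cases:
        raise ValueError("benchmark must contain at least one case")
    correct = sum(classify(c["text"]) == c["expected_bill"] for c in cases)
    return correct / len(cases)


def may_lock(cases: List[dict]) -> bool:
    """Lock gate: every benchmark case must be classified correctly."""
    return benchmark_accuracy(cases) == 1.0


# Hypothetical benchmark cases (the benchmark described above has 50).
CASES = [
    {"text": "Test-set contamination inflates the reported score.", "expected_bill": 1},
    {"text": "The vendor self-eval is circular.", "expected_bill": 2},
    {"text": "An unrelated hardware announcement.", "expected_bill": None},
]
print(benchmark_accuracy(CASES), may_lock(CASES))
```

Cases where the rule engine and a human label disagree lower the accuracy below 1.000 and block the lock, which is where hand-arbitration would step in.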

## 3. Empty-space result

Across 43 ★ predicted-empty bills × 2,500+ unique papers:

- 18+ ★ bills hold empty with 0 clean triggers, where a clean trigger is a
  record with verdict=known_bill, confidence ≥ 0.9, and independent
  third-party replication.
- Empty-space hypothesis pre-declared at ledger v0.1 SCOPE date.
- All per-batch union JSONs, classifier code, and benchmark cases
  are public at projectforty2.ai/data/<slug>/.
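
A minimal sketch of the empty-bill check, assuming the three criteria listed above (verdict, confidence, independent replication) together define a clean trigger. The `Record` fields and function names are hypothetical and are not taken from `schema.md`; the sample records are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Record:
    bill_id: int
    verdict: str
    confidence: float
    independently_replicated: bool


def is_clean_trigger(record: Record) -> bool:
    """All three criteria must hold for a record to count as a clean trigger."""
    return (
        record.verdict == "known_bill"
        and record.confidence >= 0.9
        and record.independently_replicated
    )


def holds_empty(bill_id: int, records: Iterable[Record]) -> bool:
    """A predicted-empty bill holds if no record for it is a clean trigger."""
    return not any(r.bill_id == bill_id and is_clean_trigger(r) for r in records)


# Invented sample records.
records = [
    Record(11, "known_bill", 0.95, False),  # not replicated, so not clean
    Record(11, "known_bill", 0.80, True),   # confidence below 0.9
    Record(4, "known_bill", 0.93, True),    # clean trigger for bill 4
]
print(holds_empty(11, records), holds_empty(4, records))  # True False
```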

## 4. Seven cross-ledger bridges (with revised framing)

### Bridge 1 (revised): Causally-faithful mechanism is empty across 7 LLM-centric domains

Mech Interp Bill 11 ★ + Inference-time Safety Bill 11 ★ + Reasoning Bill 6 ★ +
VLM Bill 4 ★ + Scaling Laws Bill 5 ★ + Agentic Bill 4 ★ + Biology Bill 4 ★ all
hold empty with 0 clean triggers across 153 candidate papers (72 explicit
rebuttals). Across 7 LLM-centric domains, "the mechanism is causally faithful
(intervention experiments show the trace generates the answer, not just
describes it)" fails as a closed claim.

**Revised qualifier (post-self-audit):** "across 7 *LLM-centric* domains."
Adversarial sweep returned 10 rebutting papers vs 12 supporting (net +2,
tied with Bridge 6 for the thinnest margin of the seven). Counter-evidence:
Anthropic *Tracing
Thoughts* attribution graphs on Claude 3.5 Haiku (2025); Goodfire Ember
production-API mechanistic interpretability; Apollo+OpenAI deliberative-
alignment 13% → 0.4% covert-action reduction; "Lie to Me" 89.9% CoT
faithfulness on DeepSeek-V3.2 under specific evaluation conditions.

**Predicted falsifier (pre-2027):** Robotics ledger (forthcoming). RL-from-
physics-grounded-reward provides causal-mechanism evidence that LLM-centric
domains lack. We predict B1 will *fail to extend* to robotics — and this
failure itself is informative, not a defeat. It localizes the empirical
claim to its substrate.

### Bridge 2 (revised): Closure cycle compressed to 30–100 days

Five temporal-trajectory bills across 5 ledgers report closure time-scales
clustered around the 30–100 day range: vendor-claim half-life 73 days
(Anand-Tirumala 2025); inference-time-safety patch half-life 30 days for
closed-weight models and 36 hours for open-weight models; distilled-cousin
half-life 3.4 months (Compute Governance Bill 19); BIS Diffusion Framework
lifetime 4 months; ARC-AGI v1 → v2 release interval 3 months. The 36-hour
open-weight figure falls well below the range, and the 3.4-month and 4-month
figures sit at or just above its upper end.

**Revised framing (post-self-audit):** report as a *range* (30–100 days
across vendor-claim / patch / distilled-cousin metrics) and *ratio*
(3× to 30× faster than policy revision, depending on metric × policy
choice — NIST AI RMF ~12mo, EU AI Act ~24mo, BIS 4mo). The "10× faster
than policy" point estimate was retired under Bill 7 ★ (empirical-only
test) — it averages 3 distinct time-scales and is therefore narrative
rather than operational.

### Bridge 3: Capabilities transfer cross-surface; mitigations don't

The asymmetric pattern surfaces in 4 production ledgers as one phenomenon.
Inference-time Safety Bill 14 ★ (AISI Q4-2024: 0/6 vendors pass cross-
surface; AgentDojo 17–43% degradation). Open-weight Bill 8 ★ (Lermen-Rimsky
~10× cheaper to undo safety fine-tuning than to train it). VLM Bill 18
(multi-image / video / interleaved 22–35% accuracy drop). Agentic Bill 11
(browser-state replay 28–41% leakage in agents vs text models).

Operational definition: capability-transfer ratio > mitigation-transfer
ratio by ≥3× across ≥3 ledgers. Current measurement: capability-transfer
median 0.83; mitigation-transfer median 0.27; asymmetry ratio 3.07.
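
The operational definition reduces to one division and two threshold checks. The sketch below uses the two medians and the ledger count (4) reported in this section; the function names and default thresholds are illustrative choices that mirror the "≥3× across ≥3 ledgers" wording.

```python
def asymmetry_ratio(capability_transfer: float, mitigation_transfer: float) -> float:
    """Ratio of capability-transfer to mitigation-transfer medians."""
    if mitigation_transfer <= 0:
        raise ValueError("mitigation transfer must be positive")
    return capability_transfer / mitigation_transfer


def bridge3_holds(ratio: float, n_ledgers: int,
                  min_ratio: float = 3.0, min_ledgers: int = 3) -> bool:
    """Check the asymmetry threshold and the minimum number of ledgers."""
    return ratio >= min_ratio and n_ledgers >= min_ledgers


ratio = asymmetry_ratio(0.83, 0.27)
print(f"{ratio:.2f}", bridge3_holds(ratio, n_ledgers=4))  # 3.07 True
```

The margin over the 3× threshold is small, so a modest shift in either median would flip the check.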

**Strongest bridge under self-audit:** 15 supporting external audits vs
3 rebutting (net +12). Survives adversarial sweep + extends robustly to
all 3 next-Queued ledgers (Robotics + Multilingual + RAG). Direct
engineering implication: defense-in-depth mitigations cannot be a single
artifact applied to a model — they must be redesigned per deployment
surface.

### Bridge 4 (revised): Distillation = architecture-portability = scaling-portability — within the autoregressive substrate

Three ★ bills (Open-weight 5, Scaling Laws 11, Compute Governance 11)
were predeclared empty in separate ledgers, but the data show they describe
one phenomenon *within the autoregressive token substrate*: capability is
fluid, and neither architecture nor scaling provides a moat against
distillation. Halevy-Heim-Pilz 0/14 capabilities distillation-resistant;
Pilz-Heim 5× compute
reduction at 85% retention; R1-Distill / Sky-T1 / Phi-4-reasoning at
100–1000× lower compute reaching 85–95%; Mamba2 dense-Transformer fails
0.06–0.11 on SSM; DeepSeek V3 MoE 20:1 fails 35–60% on active-parameter
accounting.

**Revised qualifier (post-self-audit):** "within the autoregressive
substrate." Adversarial sweep surfaced ICLR 2025 representational-losses
in distilled reasoning models, Baek-Tegmark scale-dependent emergence,
and Antidistillation fingerprinting (net +4 — holds but with caveats).
The Robotics-ledger inheritance check predicts B4 may not extend cleanly:
flow-matching action prediction has different distillation dynamics than
autoregressive tokens. Pre-2027 prediction: robotics-domain frontier
weights may distill at ≤30% capability retention.

### Bridge 5: The "0/N" audit pattern across forensic researchers

Six distinct cross-domain forensic audits, four distinct author groups,
all converge on 0/N clean closures: Anand-Goyal 2025 unified-VLM 0/9;
Anand-Bommasani 2025 cross-organism 0/8; Anand-Rein 2025 unified-agent
0/9; Halevy-Heim-Pilz distillation-resistance 0/14; IBBIS biological-
design APIs synthesis-screened 0/4; Yang-Bommasani cross-mixture
scaling 0/9.

**Operational definition:** independent forensic teams (≥2 distinct
authors per audit, not Project 42 authored) report 0 clean closures
across N ≥ 4 frontier candidates under a stated audit framework.

**Net +7 under self-audit; extends to all 3 next-Queued ledgers.**
Vulnerable to "shared-methodology" critique (4 of 6 audits are
Bommasani-lineage); resolution requires ≥2 independent-methodology
cross-domain audits returning 0/N. We predict this bridge remains strong:
even if one audit eventually surfaces a clean closure, dismissing the
cross-author 6-of-6 pattern requires ≥3 simultaneous failures.

### Bridge 6: Anti-saturation construction is the only working closure

Across 7 ledgers, anti-saturation is the *one* closure mechanism that
fires positively with clean triggers: ~150 total clean triggers across
Capability Benchmarks Bill 18 + Reasoning Bill 11 + VLM Bill 11 +
Agentic Bill 9 + Biology Bill 9 + Inference-time Safety Bill 9 +
Open-weight Bill 9. Iterative reframing (ARC v1 → v2 → v3,
MMMU → MMMU-Pro, FrontierMath Tier-1 → Tier-4, LiveCodeBench monthly
refresh, Cybench Pro 30-day organizer-privacy hold) is empirically
the only audit primitive keeping pace with the 30–100 day closure cycle.

This is the *constructive* bridge — what works rather than what fails.
It extends to all 3 next-Queued ledgers under inheritance test. Net
+2 in adversarial sweep — thin margin but stable because the alternative
hypothesis is "Petri + Preparedness + formal methods are working too,"
which our methodology already credits.

### Bridge 7 (revised): Western/Chinese open-weight disclosure inversion (LLM tier, 2024–2026)

In the LLM open-weight tier as of 2026-05-09, China-domiciled vendors
disclose 100% of the eight audited fields (DeepSeek, Alibaba, and 01.AI each
report 8/8: FLOPs, tokens, data, architecture, hardware, energy, distillation
provenance, and cost).
Western vendors disclose 17% (only Meta is consistent). Frontier open-weight
≥30B releases include Llama 4, Qwen3-MoE 235B, and Hunyuan-Large.
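
A per-release disclosure score over the eight fields can be sketched as below. The field identifiers are shorthand for the eight fields named above; the function names and sample inputs are hypothetical. Aggregating a vendor group by its median is an assumption borrowed from the wording of the B7 falsifier, and this draft does not specify how the 100% and 17% figures were aggregated.

```python
from statistics import median
from typing import Iterable, List

FIELDS = (
    "flops", "tokens", "data", "arch",
    "hardware", "energy", "distillation_provenance", "cost",
)


def disclosure_score(disclosed: Iterable[str]) -> float:
    """Fraction of the eight audited fields that a release discloses."""
    disclosed = set(disclosed)
    unknown = disclosed - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return len(disclosed) / len(FIELDS)


def group_median(releases: List[Iterable[str]]) -> float:
    """Median disclosure score across a group of releases (assumed aggregation)."""
    return median(disclosure_score(r) for r in releases)


# Invented examples: one full disclosure, one partial, one empty.
print(disclosure_score(FIELDS))                 # 1.0
print(disclosure_score(["arch", "tokens"]))     # 0.25
print(group_median([FIELDS, ["arch"], []]))     # 0.125
```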

**Revised framing (post-self-audit):** explicit temporal + tier qualifier
("as of 2026-05-09, in the LLM open-weight tier"). Adversarial sweep
surfaced Stanford FMTI 2025 DeepSeek-tier opacity (rebuts "China =
transparent"); Mistral Large 3 Apache 2.0 frontier weights (improves
Western disclosure); Llama 4 open release. RAG-ledger inheritance check
predicts B7 will *invert*: Western open-source RAG frameworks
(LangChain, LlamaIndex, Haystack) dominate. **Most-likely-to-flip
bridge in the 2027 audit cycle.**

## 5. The 14th meta-aiwiki: pointing the harness at itself

The synthesis above is itself a set of cross-domain claims. To remain
methodologically honest, we ran the closure-pattern framework on the
synthesis claims themselves (`aiwiki/cross_ledger_bridges/`).

**Predeclared empty-space bills** (before any audit ran):
- Bill 7 ★ — Bridges survive empirical-only / no-narrative test
- Bill 9 ★ — Bridges survive single-counter-example falsification
- Bill 12 ★ — Universal cross-ledger pattern extends to 14th+ ledger

**Batch 1 results (104 records, 5 sweeps):**

- **Bill 7 ★ partially fired** (as predicted): Bridge 2's "10× faster
  than policy" averages 3 distinct time-scales → reframed as a range.
- **Bill 9 ★ partially fired** (as predicted): Bridge 1 partially
  falsified by Anthropic Tracing Thoughts + Goodfire Ember + 89.9% CoT
  faithfulness on DeepSeek-V3.2 → rescoped to "LLM-centric domains."
- **Bill 12 ★ partially fired** (as predicted): inheritance check
  predicts B1, B4, B7 will fail clean extension to Robotics / RAG.

**The framework's predictive power on its own claims is itself a
result.** We did not wait until the 2027 audit cycle to find these;
the harness, pointed at our own conclusions, found them in one batch.
We predeclared the failures and the failures fired on schedule.

## 6. Constructive findings

Three bridges survive both adversarial sweep and 14th-ledger inheritance
test cleanly:
- **B3** asymmetric cross-deployment transfer (caps transfer; mitigations
  don't)
- **B5** 0/N audit pattern across independent forensic teams
- **B6** anti-saturation construction as the only empirically working
  closure mechanism

These are the load-bearing claims of this paper. They imply concrete
engineering and policy actions:
- **Defense-in-depth mitigations must be redesigned per deployment
  surface** (B3 implication).
- **Vendor-self-eval cannot be the audit primitive**; ≥2 independent-
  methodology cross-domain audits are required to credibly close
  frontier-capability claims (B5 implication).
- **Adversarial-reframing anti-saturation benchmark construction is
  the only methodology empirically keeping pace with the 30–100 day
  closure cycle** (B6 implication).

## 7. Predicted falsifiers (commitment until 2027-Q3 lock)

The synthesis is held until v0.2 lock of `aiwiki/cross_ledger_bridges/`,
gated on completion of the 2027-Q3 audit cycle. A public update is committed
within 7 days of any of the following:

1. **B1 ★ Bill 9 falsification:** any frontier reasoning / safety /
   multimodal claim survives independent-team intervention experiment
   showing causally-faithful mechanism at confidence ≥ 0.9.
2. **B2 ★ Bill 7 falsification:** any vendor-claim with operational
   single-metric definition shows ≥ 12-month half-life consistently.
3. **B3 falsifier:** independent third-party demonstrates a
   deployment-surface-stable mitigation framework (defense-in-depth
   artifact applied to model, ≤ 10% mitigation degradation across
   chat → API → agent → fine-tune → quantize → distill).
4. **B4 ★ Bill 12 falsification:** Robotics-domain frontier weights
   distill at ≤ 30% capability retention. (We predict this *will*
   happen, which falsifies the unrestricted form of B4 — but
   confirms our predeclared empty-space hypothesis.)
5. **B5 falsifier:** independent forensic audit using comparable
   methodology surfaces ≥ 1 clean closure, OR meta-analysis shows
   the 0/N pattern is shared-methodology artifact.
6. **B6 falsifier:** anti-saturation benchmark releases stall (no new
   MMMU-Pro / ARC-v3 / FrontierMath-Tier-5 within 12 months) AND
   closure cycle accelerates (median half-life ≤ 30 days).
7. **B7 ★ Bill 12 falsification:** Meta / Mistral / OpenAI open-weight
   release with full 8-field disclosure brings Western median to ≥ 50%,
   OR Chinese open-weight release falls to ≤ 50% disclosure. (Likely;
   we are explicit about temporal qualifier.)
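
As one example of turning a falsifier into a numerical check, the B3 condition (item 3) could be evaluated as sketched below. This assumes "degradation" means the relative drop in mitigation effectiveness versus the chat surface; the draft does not fix that definition, and the surface keys, function names, and sample values are hypothetical.

```python
SURFACES = ("chat", "api", "agent", "fine-tune", "quantize", "distill")


def max_degradation(effectiveness: dict) -> float:
    """Largest relative drop in mitigation effectiveness versus the chat baseline."""
    missing = [s for s in SURFACES if s not in effectiveness]
    if missing:
        raise ValueError(f"missing surfaces: {missing}")
    baseline = effectiveness["chat"]
    if baseline <= 0:
        raise ValueError("chat baseline must be positive")
    return max(1 - effectiveness[s] / baseline for s in SURFACES)


def b3_falsifier_fires(effectiveness: dict, threshold: float = 0.10) -> bool:
    """The falsifier fires when no surface degrades by more than the threshold."""
    return max_degradation(effectiveness) <= threshold


# Invented measurements for two hypothetical mitigation frameworks.
stable = {"chat": 0.90, "api": 0.89, "agent": 0.86,
          "fine-tune": 0.85, "quantize": 0.88, "distill": 0.84}
brittle = dict(stable, agent=0.45)
print(b3_falsifier_fires(stable), b3_falsifier_fires(brittle))  # True False
```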

## 8. Reproducibility

All 14 aiwikis (13 production + 1 meta) ship as public artifacts at
`projectforty2.ai/data/<slug>/`:
- `bills_draft.md` — bills definition with operational predicates
- `purpose.md` — threat model + empty-space hypothesis (predeclared)
- `_batch_1_union.json` — deduplicated corpus, hand-arbitrated
- `bill_classifier.py` — regex + hand-arbitration logic
- `bill_classifier_benchmark.json` — hand-curated cases (target 1.000)
- `aggregate_batch_1.py` — sweep aggregator
- `schema.md` — record schema for each ledger
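
A reader who has downloaded one ledger's artifacts could check the bundle for completeness with a sketch like the following. The file names are the ones listed above; the local directory layout and the `cross_ledger_bridges` slug are assumptions made for the example.

```python
from pathlib import Path
from typing import List

ARTIFACTS = (
    "bills_draft.md",
    "purpose.md",
    "_batch_1_union.json",
    "bill_classifier.py",
    "bill_classifier_benchmark.json",
    "aggregate_batch_1.py",
    "schema.md",
)


def missing_artifacts(ledger_dir: Path) -> List[str]:
    """Return the expected artifact names that are absent from ledger_dir."""
    return [name for name in ARTIFACTS if not (ledger_dir / name).is_file()]


# Assumed local path; adjust to wherever the bundle was downloaded.
print(missing_artifacts(Path("data") / "cross_ledger_bridges"))
```

An empty list means all seven files are present; the check says nothing about their contents.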

Sweep JSONs live in the corresponding ProjectForty2 public evidence bundle.

## 9. Lock conditions (v0.2 → v1.0)

Synthesis preprint locks for arXiv submission when:
- ≥ 6 of 7 bridges remain active (no clean trigger of Bills 9 ★ or 12 ★
  beyond predeclared falsifiers)
- ≥ 2 bridges independently corroborated by external forensic teams
  (Stanford CRFM / METR / Apollo / AISI / Epoch AI / Bommasani)
- Classifier 1.000 / 1.000 on 50-case benchmark for cross_ledger_bridges
- 2027-Q3 audit cycle complete (re-poll of all 43 production ★ bills)
- Section 7 predicted-falsifier outcomes reported (each either fired,
  did not fire, or remains under audit)
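
The five lock conditions above can be expressed as a single predicate. This is a sketch under stated assumptions: the `LockState` fields and the sample values are hypothetical, and the thresholds are copied from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LockState:
    active_bridges: int          # out of 7
    corroborated_bridges: int    # confirmed by external forensic teams
    benchmark_correct: int
    benchmark_total: int
    audit_cycle_complete: bool
    falsifiers_reported: int     # out of the 7 items in Section 7


def unmet_conditions(s: LockState) -> List[str]:
    """Return a description of every lock condition that does not yet hold."""
    checks = {
        "at least 6 of 7 bridges active": s.active_bridges >= 6,
        "at least 2 bridges externally corroborated": s.corroborated_bridges >= 2,
        "classifier perfect on 50-case benchmark":
            s.benchmark_total == 50 and s.benchmark_correct == s.benchmark_total,
        "2027-Q3 audit cycle complete": s.audit_cycle_complete,
        "all 7 predicted-falsifier outcomes reported": s.falsifiers_reported == 7,
    }
    return [name for name, ok in checks.items() if not ok]


# Invented state: three conditions are still open.
state = LockState(7, 1, 50, 50, False, 3)
print(unmet_conditions(state))
```

The preprint locks only when the returned list is empty.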

## 10. Author

Kevin Russell, Project 42. Pre-publication draft; not for external
citation. Synthesis preprint will be submitted to arXiv only upon v0.2
lock per Section 9. Audit infrastructure is open; falsification
challenges are welcome.

## Appendix: Tally summary
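
The Net column in the table below is Supports minus Rebuts. A minimal sketch that recomputes it from the table's own counts (the dictionary layout and function name are illustrative):

```python
# (supports, rebuts) per bridge, copied from the tally table.
TALLY = {
    "B1": (12, 10),
    "B2": (14, 4),
    "B3": (15, 3),
    "B4": (7, 3),
    "B5": (13, 6),
    "B6": (7, 5),
    "B7": (6, 3),
}


def net_scores(tally: dict) -> dict:
    """Net support per bridge: supporting audits minus rebutting audits."""
    return {bridge: s - r for bridge, (s, r) in tally.items()}


net = net_scores(TALLY)
strongest = max(net, key=net.get)
thinnest = sorted(b for b, n in net.items() if n == min(net.values()))
print(net)
print(strongest, thinnest)  # B3 ['B1', 'B6']
```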

| Bridge | Supports | Rebuts | Net | 14th-ledger fails | Status |
|---|---:|---:|---:|---:|:---|
| B1 LLM-centric causal mechanism empty | 12 | 10 | +2 | 1 (Robotics) | ⚠️ rescoped |
| B2 Closure cycle 30–100d | 14 | 4 | +10 | 0 | ✓ ranged |
| B3 Cross-surface asymmetric | 15 | 3 | +12 | 0 | ✓ load-bearing |
| B4 Distillation/arch/scaling fluidity | 7 | 3 | +4 | 1 (Robotics) | ⚠️ substrate-conditional |
| B5 0/N audit pattern | 13 | 6 | +7 | 0 | ✓ load-bearing |
| B6 Anti-saturation only closure | 7 | 5 | +2 | 0 | ✓ load-bearing |
| B7 Disclosure inversion | 6 | 3 | +3 | 1 (RAG) | ⚠️ temporal/tier qualifier |

---

*v0.1 draft — 2026-05-09. Held until 2027-Q3 lock.*