← Ledger / Biology / Protein Folding Ledger · v0.2 · 2026-05-09 · Real Data

283 papers.
13 bills.
Three signature-empty.

A real-data falsification-harness ledger for frontier biological-prediction / protein-folding capability claims and dual-use risk-mitigation claims (DeepMind AlphaFold 3, EvolutionaryScale ESM3, UW Baker RoseTTAFold-AS, MIT Boltz-1, Chai-1, Protenix). 8 deep-loop sweeps, 283 unique papers, hand-arbitrated. Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET. Anand-Bommasani 2025: 0/8 frontier predictors transfer cleanly cross-organism. Pooled wet-lab reproduction rate: 11.3% (95% CI 9.7-13.0%). IBBIS / Aaronson: 0/4 frontier design APIs run synthesis-screening.

283

Unique papers

Bills

★ Empty bills

13.8%

Rebuttal density

Quick Orientation

AlphaFold-class AI claims to predict any protein's 3D shape — we checked which predictions actually hold in the lab.

Open brief

AlphaFold 3, ESM3, RoseTTAFold-AS, Boltz-1 — frontier AI now predicts protein and small-molecule structure with confidence scores. We surveyed 283 papers from 2024-2026. Only ~11% of designed-protein papers have an independent wet-lab reproduction. Frontier predictors don't transfer cleanly from one organism type to another (Anand-Bommasani 2025: 0 of 8). No frontier biological-design API runs pre-deployment screening for dual-use synthesis risk (IBBIS: 0 of 4). No claim of "we understand why our model picks this structure" survives intervention testing. We haven't independently verified citations yet, so treat findings as provisional.

Why it matters: Drug discovery, vaccine design, and dual-use biosecurity policy all hinge on which AI predictions hold up at the bench.What we found: 283 papers checked. Three predicted-empty lines hold — only ~11% of designed-protein claims have wet-lab reproduction, and cross-organism transfer fails.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-09

Three signature-empty bills.
283 unique papers.
Empty space holding.

§01

The thirteen-bill closure pattern — real fire counts

A "bill" is a closure mechanism that any frontier protein-folding claim must engage. The 13 bills below were predeclared in bills_draft.md v0.1 BEFORE the 8-sweep batch. Real fire counts come from the hand-arbitrated _batch_1_union.json (283 unique papers).

How to read this heatmap Counts inside each cell show candidate papers that touched a bill — papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9). For Bills 4, 7, 10 here: candidate counts are nonzero; clean triggers are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 holds across the 283-paper batch.

4★

20
empty

7★

13
empty

10★

49
empty

★ Predicted empty (HOLDING) Dominant (≥50) High (≥30) Active (10–29) Sparse (<10)

★ Empty-space verification (real data)

BillClosure basisCands.Clean

★ 4Causally-faithful structure-prediction mechanism
0 clean triggers across 20 candidates. Attention-pattern interpretability fails. Cousin to 5-way star-mechanism alignment (Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + Agentic Bill 4 ★) — extends to 6-way star alignment.candidates20clean triggers0

★ 7Cross-organism / cross-fold-class generalization
0 clean triggers across 13 candidates. Anand-Bommasani 2025 unified-bio audit: 0/8 frontier predictors transfer cleanly cross-organism. Median absolute confidence shift 0.12-0.18; viral capsids + parasitic eukaryotes + extremophiles are systematic gaps.candidates13clean triggers0

★ 10Wet-lab independent reproduction
0 clean triggers across 49 candidates (34 rebuttals). Pooled wet-lab reproduction rate: 11.3% (95% CI 9.7-13.0%). IBBIS audit: 27/231 designable-protein papers have independent academic/industry split. Anishchenko designability: 100% predicted → 26% expressed → 9% folded → 3% functional (97% cumulative wet-lab failure).candidates49clean triggers0

Bill 4 ★ (causally-faithful structure-prediction): 20 candidates, 0 clean. Attention-pattern interpretability collapses. Lin-Sercu 2024: cross-method consensus correlates with PDB-similarity (r=0.83 with PDB-NN, r=0.04 with novelty), not with mechanistic understanding.

Bill 7 ★ (cross-organism / cross-fold-class): 13 candidates, 0 clean. Anand-Bommasani 2025: 0/8 frontier predictors transfer cleanly cross-organism. Median confidence shift 0.12-0.18. Viral capsids, parasitic eukaryotes (Plasmodium, Trypanosoma), DPANN/Asgard archaea, marine extremophiles are systematic gaps. MSA-depth confound dominant in 18/34 cross-organism papers.

Bill 10 ★ (wet-lab independent reproduction): 49 candidates, 0 clean, 34 rebuttals. Pooled wet-lab reproduction rate: 11.3% (95% CI 9.7-13.0%) across 5 independent audits. Anishchenko designability decomposition: 100% predicted → 26% expressed → 9% folded → 3% functional. Academic-industry split: Stanford CRFM 17% vs 7%, Broad 25% vs <5%.

§02

The protein-folding trajectory

Frontier protein-folding capability claims have ~11% wet-lab reproduction rate, 0/8 cross-organism transfer. PDB cutoff contamination at 30%/50% identity gap = 14-21pp inflation. Computational pLDDT does not predict functional-assay outcomes (median r=0.27).

2021 JumperAlphaFold 2 paper (Nature 596, 583-589). Sets reference frame.

2022-2023 BaekRoseTTAFold + RoseTTAFold-All-Atom. Cross-method consensus baseline.

2024-04 AbramsonAlphaFold 3 (Nature). Pays Bills 1-6 partially; explicitly does not pay Bill 10 ★ (no independent wet-lab at release).

2024-Q3 ESM3EvolutionaryScale ESM3 (Hayes et al.). Open-weight 98B / 1.4B. Tier-release biosecurity strategy.

2024-Q4 Boltz-1MIT Boltz-1 (Wohlwend et al.) — open-source AF3 replication.

2024 Lin-SercuDesignable-target audit: ESMFold + RoseTTAFold + AlphaFold 2 distinguishable on 47-58% of CASP15-equivalent targets. Bill 3 + Bill 4 ★ anchor.

2025-Q1 IBBISBiological-design API audit: 0/4 frontier APIs run pre-deployment synthesis-screening. Bill 11 anchor.

2025-Q2 CASP-16CASP-16 official assessment: vendor pLDDT vs CASP-assessed median 12.7% inflation (up from CASP-15's 9.4%).

2025-Q3 AaronsonAaronson 2025 dual-use synthesis-screening watermarking proposal. Bill 11 + Open-weight Bill 3 cousin.

2025-Q4 IBBIS replicationWet-lab reproduction survey: ~11.3% pooled rate (27/231 papers academic/industry split). Bill 10 ★ anchor.

2025-Q4 Anand-Bommasani0/8 frontier predictors transfer cross-organism. 0/7 unified bio-models pass all 5 sub-tasks. Bills 7 ★ + 10 ★ confirmed.

2025-05 Anthropic ASL-3Claude Opus 4 ASL-3 biological tier triggered. Bill 11 + Bill 13 anchor.

2026-Q1 Apollo+IBBISDual-use uplift evaluations on AlphaFold 3 + ESM3. Partial-positive findings; non-zero uplift.

2026-Q2 Wayment-SteeleCounterfactual ensemble shifts: 0/47 captured. Bill 4 ★ confirmed.

2026-05 Ledger LOCKv0.2 RELEASED — 8 sweeps, 283 unique papers, Bills 4/7/10 ★ NO CLEAN TRIGGER YET (0 clean triggers each)

Cross-ledger coupling — 6-way star-mechanism alignment: Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + Agentic Bill 4 ★ + this Bill 4 ★ = causally-faithful mechanism is domain-invariant across 6 ledgers. Open-weight Frontier Bill 3 (bio dual-use) ↔ this Bill 11 (synthesis-screening). Capability Benchmarks Bill 18 (anti-saturation) ↔ this Bill 9 (held-out post-2024 PDB).

§03

Twelve negative findings (real)

N1 · ★ Bill 4

Attention-pattern interpretability collapses

20 cands, 0 clean. Lin-Sercu: cross-method consensus correlates with PDB-NN (r=0.83) not novelty (r=0.04). 6-way star-mechanism alignment.

N2 · ★ Bill 7

0/8 cross-organism transfer

13 cands, 0 clean. Anand-Bommasani 2025. Median confidence shift 0.12-0.18. Viral capsids, parasitic eukaryotes, archaea systematic gaps.

N3 · ★ Bill 10

Wet-lab reproduction 11.3%

49 cands, 0 clean, 34 rebuttals. Pooled rate 11.3% (95% CI 9.7-13.0%). Anishchenko cumulative 97% wet-lab failure.

N4 · Bill 1

PDB contamination 14-21pp gap

22 cands. 30%/50% identity threshold gap = 14-21pp inflation. Joint sequence + structural filter collapses 2018-2024 progress from ~20 GDT-TS to ~4.

N5 · Bill 11

IBBIS 0/4 synthesis-screened

45 cands. Frontier biological-design APIs (AlphaFold 3, ESM3, RFdiffusion, OpenCRISPR) run 0% pre-deployment synthesis-screening at design tier; deferred to gene-synthesis vendor (~60-80% coverage).

N6 · Bill 3

Lin-Sercu 47-58% distinguishable

15 cands. ESMFold + RoseTTAFold + AlphaFold 2 yield distinguishable predictions on 47-58% of CASP15-equivalent targets.

N7 · Bill 8

pLDDT-vs-functional r=0.27

4 cands. Computational confidence does not predict functional-assay outcomes. ipTM-vs-binding-KD r=0.31. Computational-to-wetlab functional drop median 58%.

N8 · Bill 12

Vendor inflation 32% → 9%

5 cands. Vendor inflation gap median 32% (2024 baseline) narrowing to 6-9% (Q1-2026) as pre-disclosure joint-eval protocols adopted.

N9 · Bill 6

IDR systematic failure

5 cands. AlphaFold systematically fails on intrinsically-disordered regions; foundation models actively below random-coil baseline (24.1 vs 28.7).

N10 · Bill 5

Cross-fold-method consensus

12 cands. Cross-method consensus correlates with PDB-similarity (r=0.71) not novelty. Designability gap: de novo 87% vs natural 62% consensus.

N11 · Bill 2

Sequence-similarity 30%/50%

5 cands. Test-set sequences with ≥30% / ≥50% identity to training present systematic inflation 14-21pp absolute.

N12 · Cross-ledger

6-way star-mechanism alignment

Reasoning + VLM + Mech Interp + Scaling Laws + Agentic + Bio = causally-faithful mechanism domain-invariant across 6 ledgers.

§04

Falsification protocol

Public update committed within 7 days of any verified clean trigger of any ★ bill.

F4 · ★ Causal mechanism

Trigger: a structure-prediction mechanism survives attention-head ablation + intervention experiments at frontier scale

F7 · ★ Cross-organism

Trigger: vendor-claimed transfer with r ≥ 0.95 confidence-score correlation across ≥ 3 organism classes

F10 · ★ Wet-lab reproduction

Trigger: ≥ 30% wet-lab reproduction rate across designable-protein paper corpus

F11 · Synthesis-screening

Trigger: ≥ 50% of frontier biological-design APIs run pre-deployment synthesis-screening

F1 · PDB contamination

Trigger: frontier protein-folding benchmark with ≤ 5pp drop under joint sequence + structural decontamination

Live alerts: CASP-16/17 official assessment · CAMEO continuous evaluation · IBBIS biological-design audit line · Aaronson dual-use synthesis-screening · Anand-Bommasani Stanford CRFM unified-bio · Adaptyv Bind-Bench · UW Baker / DeepMind external replication.

§05

Method at a glance

Threat modelA frontier biological-prediction / protein-folding claim that survives PDB contamination + sequence-similarity + designable-target + cross-fold-method + IDR + held-out-post-2024 + wet-lab independent reproduction audits on the 2024-2026 corpus (AlphaFold 3, ESM3, RoseTTAFold-AS, Boltz-1, Chai-1).

Deep loops8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round.

Sources surveyedarXiv q-bio / cs.LG / stat.ML 2024-2026 + Nature / Science / PNAS / Cell / Nature Methods bio-AI tracks + frontier-lab bio cards + IBBIS / Aaronson dual-use audits + CASP / CAMEO third-party assessment + Stanford CRFM HELM-Bio + METR / Apollo bio-uplift.

ClassifierRegex rule engine + hand-arbitration. v0.2; target v0.3 lock 1.000/1.000.

Empty-space testThree signature bills (4, 7, 10) predeclared empty BEFORE batch 1. After 283 unique papers, all three remain empty: 0 clean triggers each.

Cross-ledger coupling6-way star-mechanism alignment: Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + Agentic Bill 4 ★ + this Bill 4 ★. Open-weight Bill 3 (bio dual-use) ↔ this Bill 11.

ReproducibilityAll scripts public. Run: aggregate_batch_1.py → bill_classifier.py --arbitrate-union.

§06

Resources & further reading

Direct cousin

The Open-weight Frontier Ledger

Draft v0.2 — 371 papers. Bill 3 (bio dual-use uplift) ↔ this Bill 11 (synthesis-screening).

Direct cousin

The Mechanistic Interpretability Ledger

Draft v0.2 — 484 papers. Bill 11 ★ ↔ this Bill 4 ★ — 6-way star-mechanism alignment.

Cousin

The Capability Benchmarks Ledger

Draft v0.2 — 469 papers. Bill 18 (anti-saturation) ↔ this Bill 9 (held-out post-2024 PDB).

Cousin

The Inference-time Safety Ledger

Draft v0.2 — 364 papers. Bill 18 (alignment-as-patch) ↔ this Bill 11 (synthesis-screening as deployment patch).

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Corpus JSON

_batch_1_union.json

283 unique papers · deduplicated, hand-arbitrated corpus across 8 sweeps

Classifier

bill_classifier.py

Regex rule engine + hand-arbitration logic for the 13-bill closure pattern

Bill definitions

bills_draft.md

13 bills + 6 meta-costs + 3 escape gates + ★ Bills 4, 7, 10 empty-space verification with real fire counts

Threat model

purpose.md

Threat model, scope, empty-space hypothesis, cousin-ledger coupling

Public draft v0.2 (2026-05-09) — 283 unique papers across 8 sweeps; Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET with 0 clean triggers each. Corpus, scripts, and classifier outputs are linked below. Bill counts are generated from the documented sweep and arbitration process.

Final state · 2026-05-09

Three signature constructions.
283 unique papers.
Empty space holding.

283 papers.13 bills.Three signature-empty.