The Scientific Discovery Ledger audits 301 unique papers across 8 deep-loop sweeps (2024–2026) on frontier AI-for-Science capability claims — chemistry generative models (DiffDock, ChemGPT, MolFormer, Chem42, Boltz-2), math AI (AlphaProof, AlphaGeometry 2, FunSearch, Sakana AI Scientist), materials discovery (GNoME, MatterGen, AlchemyBench), drug discovery generative + clinical (Insilico Medicine, Recursion, Atomwise), physics ML (NeuralGCM, Pangu-Weather, GraphCast 2, FourCastNet), autonomous lab (Coscientist, ChemCrow, BioPlanner, Microsoft + Anthropic + DeepMind AI Co-Scientist), and cross-discipline universal AI scientist papers. Excludes protein folding (covered by the Bio/Protein ledger).
The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3). Three bills were predicted ★ in v0.1 BEFORE batch 1 sweeps ran: Bill 4 ★ (wet-lab / experimental reproduction — ≥30% of designable-molecule / discovered-material / proven-theorem claims survive independent verification), Bill 8 ★ (cross-discipline-class generalization — chemistry → materials → math same model), and Bill 11 ★ (universal AI-scientist coverage — hypothesis + experiment + execution + verification + writeup above clean threshold).
After batch 1: Bill 8 ★ EMPTY (0/17) across THREE substrates within one ledger (chemistry diffusion + materials geometric deep learning + math autoregressive) — strong substrate-conditional support of B4 in the cross-ledger atlas. Bill 4 ★ PARTIAL: 10 clean autonomous-lab triggers (A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.) — B9 grounded-reward exception PARTIAL EXTENSION matched the current prediction. Bill 11 ★ HOLDS empty (0/36); 41 wet-lab failure rebuttals (Cheetham-Seshadri GNoME, Leeman A-Lab, Buttenschoen PoseBusters, FDA 2024) anchor the negative side.
Batch-3 inheritance check (sweep_708 2026-05-14, cross_ledger_bridges): 21/21 inheritance checks matched current annotations. This ledger validated both B4 (substrate-conditional, deepest test — 3 substrates within one domain) and B9 (grounded-reward exception, PARTIAL EXTENSION — 10 autonomous-lab triggers).
Threat model (verbatim from purpose.md): Demonstrate a frontier AI-for-Science capability claim — on a system marketing autonomous or assisted scientific discovery beyond protein folding. Six closure audits: (1) training-data overlap with target chemical / mathematical / materials space, (2) experimental falsifiability (wet-lab / sim-lab verification), (3) novel-target audit (claim made on entities outside known DB), (4) hypothesis-vs-execution decoupling, (5) cross-disciplinary generalization, (6) independent wet-lab / numerical verification.
| # | Bill (v0.1) | What gets paid | Star |
|---|---|---|---|
| 1 | Training-corpus overlap audit | ChEMBL / PubChem / arXiv-math / Materials-Project overlap | |
| 2 | Hypothesis-vs-execution decoupling | AI proposes vs AI executes vs AI verifies | |
| 3 | Novel-target audit | Claims on molecules / materials / theorems outside known DB | |
| 4 | Wet-lab / experimental reproduction (PARTIAL · 10 B9 triggers) | ≥30% of claims survive independent verification (cousin to Bio Bill 10 ★) | ★ |
| 5 | Cross-method consensus | AlphaFold + ESM + RoseTTAFold + Boltz convergence | |
| 6 | Geometric DL interpretability | Atom-graph attention / crystal-lattice GNN interpretable | |
| 7 | Strong-baseline classical-science | DFT / molecular dynamics / classical theorem-prover baseline | |
| 8 | Cross-discipline-class generalization | Chemistry → materials → math same model | ★ |
| 9 | Held-out post-cutoff database | Post-2024 PubChem / Materials Project / Lean4 theorems | |
| 10 | Vendor / lab-card independence | Reproduced by CASP / Materials Project / Lean / autonomous-lab consortium | |
| 11 | Universal AI-scientist coverage | Hypothesis + experiment + execution + verification + writeup above clean threshold | ★ |
| 12 | Safety / dual-use / synthesis-screening | Chemistry / biology dual-use proxies | |
| 13 | Cost / autonomy decomposition | Compute cost + human-in-the-loop fraction disclosed | |
Six meta-costs (M1–M6): M1 (pre-2024), M2 (single-discipline-eval), M3 (closed-vendor only), M4 (single-task), M5 (no-citation-eval — observation-only verification), M6 (implementation-specific).
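As a rough illustration, the bill table and meta-cost list can be held in a small data structure. This is a minimal sketch, not the ledger's actual tooling: the `Bill` class, the `starred_bills` helper, and the constant names are hypothetical, and the escape gates (G1–G3) are left out because they are not itemized here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bill:
    """One row of the v0.1 bill table (hypothetical representation)."""
    number: int
    name: str
    starred: bool = False


# Bill names copied from the table, in order.
BILL_NAMES = [
    "Training-corpus overlap audit",
    "Hypothesis-vs-execution decoupling",
    "Novel-target audit",
    "Wet-lab / experimental reproduction",
    "Cross-method consensus",
    "Geometric DL interpretability",
    "Strong-baseline classical-science",
    "Cross-discipline-class generalization",
    "Held-out post-cutoff database",
    "Vendor / lab-card independence",
    "Universal AI-scientist coverage",
    "Safety / dual-use / synthesis-screening",
    "Cost / autonomy decomposition",
]

# The three bills predicted as star bills in v0.1.
STARRED = {4, 8, 11}

BILLS = [Bill(i, name, i in STARRED) for i, name in enumerate(BILL_NAMES, start=1)]

# Meta-cost labels as listed in the text.
META_COSTS = {
    "M1": "pre-2024",
    "M2": "single-discipline-eval",
    "M3": "closed-vendor only",
    "M4": "single-task",
    "M5": "no-citation-eval",
    "M6": "implementation-specific",
}


def starred_bills(bills):
    """Return the numbers of the starred bills, in table order."""
    return [b.number for b in bills if b.starred]
```

Calling `starred_bills(BILLS)` returns the three star-bill numbers, which is a convenient consistency check against the table.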
Bill 8 ★ requires the same model to demonstrate above-clean-threshold performance on chemistry + materials + math simultaneously without per-discipline fine-tuning. Bill 8 is EMPTY across THREE substrates within one ledger: chemistry (diffusion models: DiffDock, ChemGPT, MolFormer), materials (geometric deep learning: GNoME, MatterGen, AlchemyBench), and math (autoregressive: AlphaProof, AlphaGeometry 2, FunSearch). This is the deepest single substrate-conditional confirmation of B4 in the cross-ledger atlas: three distinct substrates within one ledger, all 0 clean. The B4 (capability-fluidity, substrate-conditional) hypothesis is empirically validated at the level of "holds across three substrates in one domain."
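The Bill 8 criterion can be written as a simple predicate. The sketch below is illustrative only: the function name, the score dictionary, and the numeric threshold are assumptions, since the ledger does not state a clean-threshold value here, and "above" is read as strictly greater than.

```python
DISCIPLINES = ("chemistry", "materials", "math")


def bill8_clean(scores, clean_threshold, per_discipline_finetuned):
    """Return True only if one model clears the threshold on every discipline.

    scores: mapping from discipline name to that model's score (hypothetical scale).
    clean_threshold: assumed numeric bar; not specified in the ledger text.
    per_discipline_finetuned: True if the model was tuned separately per discipline.
    """
    if per_discipline_finetuned:
        # Per-discipline fine-tuning disqualifies the claim outright.
        return False
    # A missing discipline counts as a failure, not as a pass.
    return all(scores.get(d, float("-inf")) > clean_threshold for d in DISCIPLINES)
```

Under this reading, a model that scores well on two disciplines but was never evaluated on the third does not trigger the bill.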
Bill 4 ★ requires ≥30% of designable-molecule / discovered-material / proven-theorem claims to survive independent verification. 10 clean autonomous-lab triggers qualify: A-Lab Berkeley (autonomous synthesis-and-test loop), Chai-2 (drug discovery hit rate), RFdiffusion (protein design wet-lab), PolyBot (polymer synthesis), Dyno Therapeutics (gene therapy wet-lab), AbSci (antibody discovery), plus 4 additional autonomous-lab papers.
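The ≥30% survival bar is plain arithmetic, sketched below with exact fractions so that a result sitting exactly on the boundary is not lost to floating-point rounding. The function names and the counts in the usage note are hypothetical; only the 30% threshold comes from the bill definition.

```python
from fractions import Fraction

# Bill 4 bar: at least 30% of claims must survive independent verification.
BILL4_THRESHOLD = Fraction(30, 100)


def survival_rate(survived, total):
    """Exact fraction of claims that survived independent verification."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= survived <= total:
        raise ValueError("survived must lie between 0 and total")
    return Fraction(survived, total)


def bill4_pays(survived, total, threshold=BILL4_THRESHOLD):
    """Return True when the survival rate meets or exceeds the threshold."""
    return survival_rate(survived, total) >= threshold
```

For example, 3 surviving claims out of 10 meets the bar, while 29 out of 100 does not.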
B9 grounded-reward exception PARTIAL EXTENSION matched the current prediction: autonomous labs provide intervention-validated grounding (the wet-lab execution itself is the ground truth) which is exactly what B9 specifies. The 19 non-clean candidates pay M5 (no-citation-eval) or M3 (closed-vendor only). 41 wet-lab failure rebuttals (Cheetham-Seshadri GNoME critique, Leeman A-Lab reproducibility study, Buttenschoen PoseBusters, FDA 2024 drug-failure analyses) anchor the negative side.
The B9 grounded-reward exception is empirically validated: autonomous-lab papers (A-Lab, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.) provide intervention-validated grounding via wet-lab execution. Non-autonomous-lab papers (pure chemistry generation, pure math AI, pure materials prediction) do NOT provide grounded-reward signal and pay M5. The ledger is the cross_ledger_bridges B9 PARTIAL EXTENSION confirmation point — predicted before sweep, observed after sweep.
Bill 11 ★ requires the same system to clear above-clean-threshold on all five sub-tasks (hypothesis + experiment design + execution + verification + writeup) without per-sub-task tuning. Sakana AI Scientist, AI Co-Scientist (Microsoft / Anthropic / DeepMind), Coscientist, ChemCrow, BioPlanner all engage; every one degrades on ≥2 sub-tasks. M3 (closed-vendor-only) and M5 (no-citation-eval — observation-only verification) dominate the meta-cost. 41 wet-lab failure rebuttals anchor the negative side (Cheetham-Seshadri GNoME critique, Leeman A-Lab reproducibility study, Buttenschoen PoseBusters, FDA 2024 drug-failure analyses).
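The Bill 11 requirement, and the "degrades on ≥2 sub-tasks" observation, can be sketched the same way. The names below are hypothetical, the threshold value is an assumption the ledger does not specify here, and any score at or below the threshold is treated as degraded.

```python
SUB_TASKS = ("hypothesis", "experiment_design", "execution", "verification", "writeup")


def degraded_sub_tasks(scores, clean_threshold):
    """List the sub-tasks on which the system fails to exceed the threshold.

    A sub-task with no reported score is counted as degraded.
    """
    return [t for t in SUB_TASKS if scores.get(t, float("-inf")) <= clean_threshold]


def bill11_clean(scores, clean_threshold, per_sub_task_tuned=False):
    """Return True only if all five sub-tasks clear the bar without per-sub-task tuning."""
    if per_sub_task_tuned:
        return False
    return not degraded_sub_tasks(scores, clean_threshold)
```

A system matching the pattern reported above would return at least two entries from `degraded_sub_tasks` and therefore fail `bill11_clean`.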
The "★ HOLDS empty" claim for Bills 8 + 11 is provisional pending Stage 3.5. The 10 Bill 4 PARTIAL triggers (autonomous-lab) require careful arXiv-ID + abstract verification because they are the only "clean" triggers across all three ★ bills. A priority pool of ~30 ★-bill candidate IDs (especially the 10 Bill 4 autonomous-lab triggers: A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.) plus 20 sweep-health spotchecks is being dispatched against the arXiv API.
The empty-space hypothesis is less sensitive to individual source-ID errors: the cross-discipline-class generalization closure (Bill 8) is structurally gated by per-discipline balanced training; the universal AI-scientist coverage closure (Bill 11) is structurally gated by per-sub-task balanced evaluation. The B4 substrate-conditional finding (Bill 8 EMPTY across 3 substrates) is architectural and less dependent on individual source-ID verification. The B9 grounded-reward PARTIAL EXTENSION (Bill 4) requires careful verification of the 10 autonomous-lab trigger IDs.
Independent arXiv-ID + abstract verification is required before any breach commitment. Three complementary methodological tools apply: classifier rule-tightening, sweep-time arXiv verification, and Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
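A cheap first pass for arXiv-ID verification is a syntax check that runs before any network lookup. The sketch below is an assumption about how such a check could look, not the ledger's verification pipeline: the function names are hypothetical, the patterns cover the new-style (`YYMM.NNNNN`) and old-style (`archive/YYMMNNN`) identifier schemes, and the query URL uses the public arXiv API's `id_list` and `max_results` parameters. It checks format only; confirming that an ID exists and that its abstract matches the claimed paper still needs the API response.

```python
import re
from urllib.parse import urlencode

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = re.compile(r"^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v[1-9]\d*)?$")
# Old-style IDs: archive(.SC)/YYMMNNN, with an optional version suffix.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{2}(0[1-9]|1[0-2])\d{3}(v[1-9]\d*)?$")


def is_wellformed_arxiv_id(arxiv_id):
    """Return True if the string matches either arXiv identifier scheme."""
    return bool(NEW_STYLE.match(arxiv_id) or OLD_STYLE.match(arxiv_id))


def malformed_ids(ids):
    """Return the IDs that fail the syntax check, preserving input order."""
    return [i for i in ids if not is_wellformed_arxiv_id(i)]


def build_query_url(ids):
    """Build an arXiv API query URL for a batch of IDs (no request is sent)."""
    params = {"id_list": ",".join(ids), "max_results": len(ids)}
    return "https://export.arxiv.org/api/query?" + urlencode(params)
```

IDs flagged by `malformed_ids` can be rejected immediately, so that only well-formed candidates reach the abstract-matching step.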
v0.2 status (2026-05-15): Stage 5 (POPULATE) complete with 8 sweeps covering 301 papers (✓), classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (F4, F8, F11, F-Substrate-collapse, F-B9-extension, F-Wet-lab rebut, ✓), author-activity panel mapped (DeepMind-AlphaFold / AlphaProof / AlphaGeometry / FunSearch / GNoME, Microsoft-MatterGen, NVIDIA-AlchemyBench, Sakana-AI-Scientist, Coscientist, ChemCrow, BioPlanner, Insilico-Medicine, Recursion, Atomwise, Boltz, Chai-2, RFdiffusion, A-Lab-Berkeley, PolyBot, AbSci, NeuralGCM, Pangu-Weather, GraphCast, FourCastNet, ✓). Stage 3.5 verification queue pending with priority on the 10 Bill 4 PARTIAL triggers.
Q-Day analog · drug discovery + materials regulation lever: A clean ★-bill trigger here would shift FDA / EMA / NMPA AI-discovered-drug regulatory pathway design, materials-discovery research-funding allocation cycles (DOE / NSF / EU H2026), autonomous-lab regulatory frameworks, and chemistry / biology dual-use synthesis-screening policies. The Bill 4 PARTIAL is already shaping autonomous-lab policy: both A-Lab Berkeley and PolyBot have published wet-lab reproducibility studies that influence current regulatory discussions. This is a material policy lever, somewhere between Spacetime_Discreteness's funding-allocation lever and Factorization's federal-regulation NIST PQC lever.
Cadence: Monthly — vendor system-card revisions + autonomous-lab releases. Quarterly — independent wet-lab verification + benchmark refreshes. Triggered: A-Lab Berkeley quarterly reproducibility reports; Chai-2 / RFdiffusion / Boltz-2 wet-lab follow-ups; FDA AI-drug regulatory pathway updates; Materials Project / GNoME validation; FunSearch / AlphaProof / AlphaGeometry 2 verified-theorem releases; METR / AISI / Apollo AI-for-Science audits.
This ledger is THE validation point for two cross_ledger_bridges predictions:
B4 substrate-conditional (deepest test): 3 substrates within one domain (chemistry diffusion + materials GNN + math autoregressive) all show Bill 8 ★ EMPTY. Deepest single confirmation of the substrate-conditional rescoping in the current cross-ledger atlas.
B9 grounded-reward PARTIAL EXTENSION (precisely as predicted): 10 autonomous-lab Bill 4 triggers. NULL prediction for non-autonomous-lab papers (pure chemistry generation, pure math AI, pure materials prediction) confirmed. Scope is empirically validated.
Falsification of either prediction requires (a) a system breaking the substrate-conditional pattern (a single transformer architecture excelling at chemistry diffusion + materials GNN + math autoregressive simultaneously) or (b) a non-autonomous-lab paper achieving a ≥30% wet-lab reproduction rate, the same bar Bill 4 sets.