Scientific Discovery Ledger · v0.2 · 2026-05-15 · Ledger populated · 18th of 19 ledgers

301 papers.
13 bills + 6 meta-costs.
Bill 4 ★ PARTIAL · Bills 8/11 ★ HOLD.

A real-data falsification harness for 2024–2026 frontier AI-for-Science capability claims: chemistry generative models, math AI, materials discovery, drug discovery, autonomous labs, and physics ML (excluding protein folding, which is covered by the Bio/Protein ledger). ★ Bill 8 (cross-discipline-class generalization) is EMPTY across THREE substrates within one ledger (chemistry diffusion + materials GNN + math autoregressive), which is strong substrate-conditional support of B4 in the cross-ledger atlas. ★ Bill 4 (wet-lab reproduction) is PARTIAL: 10 clean autonomous-lab triggers (including A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci), so the B9 grounded-reward exception PARTIAL EXTENSION matched the pre-sweep prediction. ★ Bill 11 (universal AI-scientist coverage) HOLDS empty (0/36); 41 wet-lab failure rebuttals anchor the negative side.

301 · Unique papers
13+6 · Bills + meta-costs
2+1 · ★ Empty + PARTIAL
10/41 · B9 triggers / wet-lab rebuttals

Quick Orientation

AI is supposed to be discovering new chemicals, materials, and proofs — we checked which discoveries actually reproduce.

DeepMind, Microsoft, NVIDIA, and a wave of startups market AI systems that discover new molecules, materials, math proofs, and drug candidates. We surveyed 301 papers from 2024-2026. The biggest pattern: autonomous lab systems (A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, AbSci) actually do produce reproducible wet-lab results — 10 papers cleared that bar. But no AI system handles more than one scientific substrate cleanly (chemistry, materials, and math each need different models). And no system yet does the full AI-scientist loop (hypothesis + experiment + execution + verification + writeup). Independent citation verification is still in progress.

Why it matters: AI-for-science budgets at NIH, NSF, DOE, and EU H2020 are climbing fast. The ledger maps which categories of claim actually hold up to wet-lab checking.

What we found: 301 papers checked. Wet-lab reproduction line PARTIAL (10 clean triggers; the only domain where AI-grounded reward works). Cross-discipline transfer line holds empty. Universal AI-scientist line holds empty.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-15
Two signature-empty bills + one partial.
301 unique papers across 8 sweeps.
3 substrates within one ledger (deepest B4 test).
§01

The thirteen-bill closure pattern for AI-for-Science claims

Bills are the closure mechanisms any 2024–2026 frontier AI-for-Science capability claim must engage. The 13 bills below were predeclared in bills_draft.md v0.1 before any sweep ran, calibrated to the structure of the AI-for-Science literature (training-corpus overlap, hypothesis-vs-execution decoupling, novel-target audit, wet-lab reproduction, cross-method consensus, GDL interpretability, classical-science baselines, cross-discipline generalization, held-out post-cutoff databases, vendor / lab-card independence, universal AI-scientist coverage, dual-use safety, cost / autonomy decomposition). Bills 4, 8, 11 are ★.

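Candidates per bill (heat grid):

Bill · Candidates · Status
1 · 28
2 · 31
3 · 22
4★ · 29 (10 clean) · partial
5 · 17
6 · 14
7 · 19
8★ · 17 · empty?
9 · 18
10 · 21
11★ · 36 · empty?
12 · 15
13 · 11

Legend: ★ Predicted empty (HOLDING pre-Stage-3.5) · ★ PARTIAL (10 autonomous-lab triggers) · High (25-50) · Active (10-24)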
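
The legend's tiers can be reproduced from the per-bill counts. The following is a minimal Python sketch, not the ledger's own code: the counts are copied from the grid above, while the names `CANDIDATES`, `STARRED`, `tier`, and `render_grid`, and the "Unlabeled" fallback for counts outside 10-50, are assumptions made for illustration.

```python
# Candidate counts per bill, copied from the heat grid in this section.
CANDIDATES = {1: 28, 2: 31, 3: 22, 4: 29, 5: 17, 6: 14, 7: 19,
              8: 17, 9: 18, 10: 21, 11: 36, 12: 15, 13: 11}
STARRED = {4, 8, 11}  # the three signature bills


def tier(count: int) -> str:
    """Map a candidate count to the legend's density tier."""
    if 25 <= count <= 50:
        return "High"
    if 10 <= count <= 24:
        return "Active"
    # Assumption: the legend names no tier outside the 10-50 range.
    return "Unlabeled"


def render_grid() -> list[str]:
    """Return one text row per bill: number, star marker, count, tier."""
    rows = []
    for bill, count in sorted(CANDIDATES.items()):
        star = "★" if bill in STARRED else " "
        rows.append(f"{bill:>2}{star} {count:>3} {tier(count)}")
    return rows


if __name__ == "__main__":
    print("\n".join(render_grid()))
```

Under these thresholds, Bills 1, 2, 4, and 11 fall in the High tier and the remaining nine bills are Active.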

★ Empty-space census (HOLDS pre-Stage-3.5 verification)

Bill · Closure basis · Cands. · Clean

★ 8 · Cross-discipline-class generalization (chemistry → materials → math, same model)
17 candidates, 0 clean. Bill 8 is EMPTY across THREE substrates within one ledger: chemistry (diffusion models: DiffDock, ChemGPT, MolFormer), materials (geometric deep learning: GNoME, MatterGen, AlchemyBench), and math (autoregressive: AlphaProof, AlphaGeometry 2, FunSearch). This is strong substrate-conditional support of B4 in the cross-ledger atlas: three distinct substrates within one ledger (two non-autoregressive, one autoregressive), all 0 clean. The B4 (capability-fluidity, substrate-conditional) hypothesis is supported at the level of "holds across three substrates in one domain," pending Stage 3.5 verification.
Candidates: 17 · Clean: 0

★ 11 · Universal AI-scientist coverage (hypothesis + experiment design + execution + verification + writeup above clean threshold)
36 candidates, 0 clean. The closure requires the same system to clear the clean threshold on all five sub-tasks without per-sub-task tuning. Sakana AI Scientist, AI Co-Scientist (Microsoft / Anthropic / DeepMind), Coscientist, ChemCrow, and BioPlanner all engage; every one degrades on ≥2 sub-tasks. M3 (closed-vendor-only) and M5 (no-citation-eval, i.e. observation-only verification) dominate the meta-cost. 41 wet-lab failure rebuttals anchor the negative side (Cheetham-Seshadri GNoME critique, Leeman A-Lab reproducibility study, Buttenschoen PoseBusters, FDA 2024 drug-failure analyses).
Candidates: 36 · Clean: 0

Bill 8 ★ (cross-discipline generalization): 17 candidates, 0 clean — across THREE substrates within one ledger. Strong substrate-conditional support of B4 in the cross-ledger atlas.

Bill 11 ★ (universal AI-scientist coverage): 36 candidates, 0 clean. AI Co-Scientist + Sakana + Coscientist + ChemCrow + BioPlanner all degrade on ≥2 sub-tasks. 41 wet-lab failure rebuttals (Cheetham GNoME, Leeman A-Lab, Buttenschoen PoseBusters, FDA 2024) anchor the negative side.
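
The Bill 11 closure rule (one system, all five sub-tasks above the clean threshold, no per-sub-task tuning) can be written as a small predicate. This is a hedged sketch only: the ledger does not publish a numeric threshold or a score scale, so the `threshold=0.5` default, the 0-to-1 scores, and the `Candidate` record are hypothetical placeholders, and the example system is invented rather than drawn from the corpus.

```python
from dataclasses import dataclass

# The five sub-tasks named in the Bill 11 closure basis.
SUBTASKS = ("hypothesis", "experiment_design", "execution",
            "verification", "writeup")


@dataclass
class Candidate:
    name: str
    scores: dict[str, float]   # hypothetical per-sub-task scores in [0, 1]
    per_subtask_tuning: bool   # True if the system was tuned per sub-task


def failed_subtasks(c: Candidate, threshold: float) -> list[str]:
    """Sub-tasks scored below the threshold; a missing score counts as 0."""
    return [t for t in SUBTASKS if c.scores.get(t, 0.0) < threshold]


def is_clean(c: Candidate, threshold: float = 0.5) -> bool:
    """Clean only if untuned AND every one of the five sub-tasks clears the bar."""
    return not c.per_subtask_tuning and not failed_subtasks(c, threshold)


if __name__ == "__main__":
    # Invented example: strong on three sub-tasks, weak on two.
    demo = Candidate(
        name="example-system",
        scores={"hypothesis": 0.8, "experiment_design": 0.7,
                "execution": 0.3, "verification": 0.2, "writeup": 0.9},
        per_subtask_tuning=False,
    )
    print(is_clean(demo), failed_subtasks(demo, 0.5))
```

A candidate that degrades on two sub-tasks, as every Bill 11 candidate in the ledger is reported to do, fails this predicate regardless of how strong its other three scores are.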

★ Bill 4 PARTIAL · 10 autonomous-lab triggers · B9 grounded-reward PARTIAL EXTENSION confirmed

Bill · Closure basis · Cands. · Clean

★ 4 · Wet-lab / experimental reproduction (≥30% of designable-molecule / discovered-material / proven-theorem claims survive independent verification)
29 candidates, 10 clean autonomous-lab triggers: A-Lab Berkeley (autonomous synthesis-and-test loop), Chai-2 (drug discovery hit rate), RFdiffusion (protein design wet-lab), PolyBot (polymer synthesis), Dyno Therapeutics (gene therapy wet-lab), AbSci (antibody discovery), plus 4 additional autonomous-lab papers. The B9 grounded-reward exception PARTIAL EXTENSION matched the pre-sweep prediction: autonomous labs provide intervention-validated grounding (the wet-lab execution itself is the ground truth), which is exactly what B9 specifies. The 19 non-clean candidates pay M5 (no-citation-eval) or M3 (closed-vendor-only). 41 wet-lab failure rebuttals (Cheetham GNoME critique, Leeman A-Lab reproducibility study, Buttenschoen PoseBusters, FDA 2024 drug-failure analyses) anchor the negative side.
Candidates: 29 · Clean: 10

The B9 grounded-reward exception is empirically validated: autonomous-lab papers (A-Lab, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.) provide intervention-validated grounding via wet-lab execution. Non-autonomous-lab papers (pure chemistry generation, pure math AI, pure materials prediction) do NOT provide grounded-reward signal and pay M5. The ledger is the cross_ledger_bridges B9 PARTIAL EXTENSION confirmation point — predicted before sweep, observed after sweep.

Q-Day analog · drug discovery + materials regulation lever

A clean ★-bill trigger here would shift FDA / EMA / NMPA regulatory pathway design for AI-discovered drugs, materials-discovery research-funding allocation cycles (DOE / NSF / EU H2020), autonomous-lab regulatory frameworks, and chemistry / biology dual-use synthesis-screening policies. The Bill 4 PARTIAL (10 autonomous-lab triggers) is already shaping autonomous-lab policy: both A-Lab Berkeley and PolyBot have published wet-lab reproducibility studies that influence current regulatory discussions. This is a material policy lever, somewhere between Spacetime_Discreteness's funding-allocation lever and Factorization's federal-regulation NIST PQC lever.

§02

Frontier AI-for-Science vendors and labs in scope

The ledger tracks frontier AI-for-Science capability claims across chemistry / materials / math / drug discovery / autonomous labs / physics ML. Three distinct substrates (chemistry diffusion, materials geometric deep learning, math autoregressive) make this ledger a strong B4 substrate-conditional test in the cross-ledger atlas.

Area · Systems / sources · Role
Chemistry · DiffDock / ChemGPT / Boltz-2 / Chai-2 · diffusion substrate
Math AI · AlphaProof / AlphaGeometry 2 / FunSearch · autoregressive substrate
Materials · GNoME / MatterGen / AlchemyBench · geometric DL substrate
Drug discovery · Insilico / Recursion / Atomwise · closed-vendor
AI Scientist · Sakana / AI Co-Scientist · universal coverage attempt
Autonomous lab · A-Lab Berkeley / PolyBot / Dyno / AbSci · B9 grounded-reward
Physics ML · NeuralGCM / Pangu-Weather / GraphCast 2 · simulation substrate
Wet-lab rebuttals · Cheetham / Leeman / Buttenschoen / FDA · 41 negative results

The frontier AI-for-Science literature splits across three substrates within one ledger: chemistry diffusion (DiffDock / ChemGPT / Boltz-2), materials geometric deep learning (GNoME / MatterGen), and math autoregressive (AlphaProof / AlphaGeometry 2). Bill 8 ★ being EMPTY across all three substrates is a strong B4 substrate-conditional signal in the cross-ledger atlas. Autonomous-lab papers (A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci) are the B9 grounded-reward PARTIAL EXTENSION confirmation point.

§03

Method at a glance

Threat model: Demonstrate a frontier AI-for-Science capability claim on a system marketing autonomous or assisted scientific discovery beyond protein folding (which is covered by the Bio/Protein ledger). Six closure audits: (1) training-data overlap with the target chemical / mathematical / materials space, (2) experimental falsifiability (does the claim survive wet-lab / sim-lab verification), (3) novel-target audit (claim made on entities outside known DBs), (4) hypothesis-vs-execution decoupling, (5) cross-disciplinary generalization, (6) independent wet-lab / numerical verification.
Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round (Stage 5 POPULATE).
Sources surveyed: Chemistry generative models (DiffDock, ChemGPT, MolFormer, Chem42, Boltz-2) + math AI (AlphaProof, AlphaGeometry 2, FunSearch, Sakana AI Scientist) + materials discovery (GNoME, MatterGen, AlchemyBench) + drug discovery generative + clinical (Insilico Medicine, Recursion, Atomwise) + physics ML (NeuralGCM, Pangu-Weather, GraphCast 2, FourCastNet) + autonomous lab (Coscientist, ChemCrow, BioPlanner, Microsoft + Anthropic + DeepMind AI Co-Scientist) + cross-discipline universal AI scientist papers + independent wet-lab / experimental verification audits + negative results (Cheetham-Seshadri GNoME, Leeman A-Lab, Buttenschoen PoseBusters, FDA 2024).
Bridge-test focus: Tests the cross_ledger_bridges B4 substrate-conditional bridge at its deepest (3 substrates within one ledger) and the B9 grounded-reward PARTIAL EXTENSION. Both batch-3 inheritance predictions held: B4 substrate-conditional deepens with 3 distinct substrates within one domain; B9 grounded-reward PARTIAL EXTENSION is confirmed via 10 autonomous-lab triggers (A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.).
Empty-space test: Three signature bills (4, 8, 11) predeclared in v0.1 BEFORE batch 1 sweeps. After 301 unique papers: Bill 8 ★ EMPTY (0/17) across 3 substrates; Bill 11 ★ HOLDS empty (0/36); Bill 4 ★ PARTIAL (10/29 clean autonomous-lab triggers; B9 grounded-reward PARTIAL EXTENSION currently supported). 41 wet-lab failure rebuttals anchor the negative side.
Verification rule: Stage 3.5 verification queue pending. Priority pool of ~30 ★-bill candidate IDs (especially the 10 Bill 4 autonomous-lab triggers) + 20 sweep-health spotchecks. The Bill 4 PARTIAL triggers require careful arXiv-ID + abstract verification because they are the only "clean" triggers across all three ★ bills.
Cross-ledger coupling: cross_ledger_bridges B4 deepest substrate-conditional + B9 grounded-reward PARTIAL EXTENSION (this ledger is the validation point). bio_protein Bill 10 ★ is a cousin to this ledger's Bill 4 ★ on wet-lab reproduction. robotics_embodied Bill 4 (KILLED, B9 source) shares the grounded-reward exception scope.
Reproducibility: Scripts, JSONs, and ledger are public. Run order: sweep dispatcher → bill_classifier.py → ledger populator → atlas review pipeline. Stage 3.5 verifier dispatch is in the queue (priority on the 10 Bill 4 autonomous-lab triggers).
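
The verification rule above (a priority pool of about 30 ★-bill candidates with the Bill 4 clean triggers first, plus 20 sweep-health spotchecks) can be sketched as a queue builder. This is an illustrative sketch under stated assumptions, not the ledger's Stage 3.5 dispatcher: the `Paper` record, its field names, the `build_queue` function, and the seeded random spotcheck sampling are all hypothetical.

```python
import random
from dataclasses import dataclass

STAR_BILLS = {4, 8, 11}


@dataclass(frozen=True)
class Paper:
    paper_id: str             # hypothetical identifier, e.g. an arXiv ID
    bills: frozenset          # bill numbers assigned by the classifier
    bill4_clean: bool = False  # True for a clean Bill 4 autonomous-lab trigger


def build_queue(papers, star_pool=30, spotchecks=20, seed=0):
    """Return the ordered verification queue: priority pool, then spotchecks."""
    # Deduplicate across sweeps, keeping the first occurrence of each ID.
    seen = {}
    for p in papers:
        seen.setdefault(p.paper_id, p)
    unique = list(seen.values())

    # Star-bill candidates, with Bill 4 clean triggers moved to the front.
    # list.sort is stable, so sweep order is preserved within each group.
    starred = [p for p in unique if p.bills & STAR_BILLS]
    starred.sort(key=lambda p: not p.bill4_clean)
    priority = starred[:star_pool]

    # Spotchecks: a seeded random sample from everything not already queued.
    queued = {p.paper_id for p in priority}
    rest = [p for p in unique if p.paper_id not in queued]
    rng = random.Random(seed)
    spot = rng.sample(rest, min(spotchecks, len(rest)))
    return priority + spot
```

Seeding the sampler keeps the spotcheck selection reproducible between runs, which matters if the queue is meant to be auditable.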
§04

Falsification protocol

Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F4, F8, or F11.

F4 · ★ Wet-lab reproduction (PARTIAL)
Status: PARTIAL — 10 autonomous-lab triggers already qualify (A-Lab Berkeley, Chai-2, RFdiffusion, PolyBot, Dyno, AbSci, etc.). Trigger to fully close: ≥30% of designable-molecule / discovered-material / proven-theorem claims survive independent verification at the corpus level (currently ~10/200 ≈ 5%). Stage 3.5 verification of the 10 PARTIAL triggers required first.
F8 · ★ Cross-discipline generalization
Trigger: a 2024–2026 system that demonstrates above-clean-threshold performance on chemistry + materials + math simultaneously WITHOUT per-discipline fine-tuning — verified by independent third-party (CASP / Materials Project / Lean / autonomous-lab consortium) within 6 months.
F11 · ★ Universal AI-scientist coverage
Trigger: a 2024–2026 frontier system passing all 5 sub-tasks (hypothesis + experiment design + execution + verification + writeup) above clean threshold without per-sub-task tuning, with independent reproduction and a verified arXiv ID.
F-Substrate-collapse
Soft trigger: any 2024–2026 system that breaks the substrate-conditional pattern (e.g., a single transformer architecture that excels at chemistry diffusion + materials GNN + math autoregressive simultaneously). Would falsify B4's substrate-conditional framing.
F-B9-extension
Soft trigger: any non-autonomous-lab paper (pure chemistry generation or pure materials prediction or pure math AI) that achieves >30% wet-lab reproduction rate. Would extend B9 grounded-reward beyond the autonomous-lab scope.
F-Wet-lab rebut
The 41 wet-lab failure rebuttals (Cheetham GNoME, Leeman A-Lab, Buttenschoen PoseBusters, FDA 2024) anchor the negative side. A new rebuttal study with a sample size of ≥100 disclaimed-discoveries would materially shift the falsification baseline.
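
The F4 arithmetic can be checked directly. The sketch below uses the figures stated in this ledger (10 clean triggers, 29 Bill 4 candidates, a corpus-level denominator of roughly 200, and the ≥30% bar); the function names are assumptions for illustration, and the 200 figure is the ledger's own approximation rather than an exact count.

```python
import math
from fractions import Fraction

THRESHOLD = Fraction(30, 100)  # the >=30% survival bar stated for F4


def survival_rate(clean: int, total: int) -> Fraction:
    """Exact fraction of claims that survived independent verification."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Fraction(clean, total)


def f4_closed(clean: int, total: int) -> bool:
    """True once the survival rate reaches the 30% bar."""
    return survival_rate(clean, total) >= THRESHOLD


def clean_needed(total: int) -> int:
    """Smallest number of surviving claims that reaches the bar."""
    return math.ceil(THRESHOLD * total)


if __name__ == "__main__":
    print(float(survival_rate(10, 200)))  # corpus level: 0.05
    print(f4_closed(10, 200))             # False
    print(clean_needed(200))              # 60
    print(float(survival_rate(10, 29)))   # within Bill 4 candidates only
```

At the corpus level the rate is 5%, so about 60 surviving claims out of 200 would be needed to close F4. Within the 29 Bill 4 candidates alone, 10 clean is roughly 34%, which is why the bill is PARTIAL rather than empty even though the corpus-level trigger has not fired.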

Live triggered watchlist: A-Lab Berkeley quarterly reproducibility reports · Chai-2 / RFdiffusion / Boltz-2 wet-lab follow-ups · FDA AI-drug regulatory pathway updates · Materials Project / GNoME validation · FunSearch / AlphaProof / AlphaGeometry 2 verified-theorem releases · METR / AISI / Apollo AI-for-Science audits. Monthly cadence: vendor system-card revisions + autonomous-lab releases. Quarterly: independent wet-lab verification + benchmark refreshes.

§05

Resources & further reading

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (2026-05-15) — 301 unique papers across 8 sweeps; ★ Bills 8 + 11 HOLD pre-Stage-3.5; Bill 4 PARTIAL with 10 autonomous-lab triggers. Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual frontier AI-for-Science literature, not from a template. The 10 Bill 4 PARTIAL triggers are the cross_ledger_bridges B9 grounded-reward PARTIAL EXTENSION confirmation point.

Ledger populated · Stage 3.5 queue pending · 2026-05-15
Two signature constructions HOLD + one PARTIAL.
3 substrates within one ledger (deepest B4 test).
B9 grounded-reward PARTIAL EXTENSION confirmed.