Reasoning / Chain-of-Thought Data Receipts

Public Draft v0.2 (real data)

Static reproducibility landing page for the reasoning / chain-of-thought ledger. This is the first ledger in the v0.2 wave produced with the full deep-loop methodology: 9 parallel Opus research-agent sweeps yielded 426 raw papers, which were deduplicated and hand-arbitrated down to 394 unique papers. Bills 6, 9, and 12 (marked ★) have no clean trigger yet: each shows 0 clean triggers across the 394 papers. Rebuttal density is 27.9%.
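The deduplication step described above (426 raw papers reduced to 394 unique) can be pictured with a minimal sketch. It assumes each sweep is a list of records with a `title` field and that duplicates are matched on a normalized title. Both are assumptions for illustration, not the documented behavior of aggregate_batch_1.py, and the hand-arbitration pass is not modeled.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe(sweeps: list[list[dict]]) -> list[dict]:
    """Merge several sweeps, keeping the first record seen for each normalized title.

    The "title" key is a hypothetical field name; the real sweep schema may differ.
    """
    seen: dict[str, dict] = {}
    for sweep in sweeps:
        for paper in sweep:
            seen.setdefault(normalize_title(paper["title"]), paper)
    return list(seen.values())


if __name__ == "__main__":
    sweeps = [
        [{"title": "Chain-of-Thought Faithfulness"}, {"title": "Reward Hacking"}],
        [{"title": "chain of thought   faithfulness."}],
    ]
    unique = dedupe(sweeps)
    print(f"{sum(len(s) for s in sweeps)} raw -> {len(unique)} unique")
```

Keeping the first record per key makes the result depend on sweep order, which is one reason a manual arbitration pass can still be needed afterwards.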

Receipts

Artifact	Link	Purpose
Bill definitions	bills_draft.md	15 bills + 6 meta-costs + 3 escape gates + ★ Bills 6/9/12 empty-space verification with real fire counts.
Threat model	purpose.md	Threat model, scope, empty-space hypothesis, cousin-ledger coupling.
Corpus union JSON	_batch_1_union.json	394 unique papers (deduplicated from 426 raw across 9 sweeps), with full metadata + candidate_bill / verdict / confidence.
Classifier	bill_classifier.py	Regex rule engine + hand-arbitration. Run with `--arbitrate-union` or `--benchmark`.
Benchmark cases	bill_classifier_benchmark.json	50 hand-curated cases. Target v0.3 lock 1.000/1.000.
Aggregator	aggregate_batch_1.py	Deduplicates raw sweep JSONs into the corpus union.
README	README.md	Reproducibility README with the run order.
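The classifier is listed above as a regex rule engine followed by hand-arbitration. As a rough sketch of the first half only, the snippet below maps text to candidate bill numbers with compiled patterns. The `Rule` class, the `candidate_bills` function, and both patterns are hypothetical and invented for illustration; the actual rules live in bill_classifier.py and may look quite different.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    bill: int            # bill number the rule nominates
    pattern: re.Pattern  # compiled regex searched against the text


# Hypothetical patterns for illustration only; not taken from bill_classifier.py.
RULES = [
    Rule(1, re.compile(r"\bfaithful(ness)?\b", re.IGNORECASE)),
    Rule(2, re.compile(r"\btest[- ]time compute\b", re.IGNORECASE)),
]


def candidate_bills(text: str) -> list[int]:
    """Return the bill numbers whose pattern matches, in rule order."""
    return [rule.bill for rule in RULES if rule.pattern.search(text)]


if __name__ == "__main__":
    abstract = "We study the faithfulness of chain-of-thought under test-time compute scaling."
    print(candidate_bills(abstract))
```

A rule engine like this only nominates candidates; deciding whether a candidate is a clean trigger, a rebuttal, or gated is the part the page attributes to hand-arbitration.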

Real fire counts

Bill	Candidates	Clean triggers	Rebuttals	Gated
1 — CoT-faithfulness validation	29	18	10	1
2 — Test-time-compute disclosure	10	10	0	0
3 — Cross-benchmark transfer	7	4	1	2
4 — Adaptive-prompt stability	6	0	4	2
5 — Trajectory contamination	7	4	3	0
6 ★ — Causally-faithful mechanism	4	0	3	1
7 — Strong-baseline classical	1	1	0	0
8 — Adversarial / scheming	28	18	9	1
9 ★ — Test-time-search vs reasoning	41	0	31	10
10 — Vendor self-eval independence	30	27	3	0
11 — Anti-saturation construction	54	41	12	1
12 ★ — Universal task coverage	1	0	1	0
13 — Capability-cost transparency	3	2	0	1
14 — Reward-hacking dual-mode	16	12	4	0
15 — Distilled-cousin reproduction	52	50	2	0
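The table above can be sanity-checked mechanically. The sketch below copies the rows and tests one invariant: that each bill's candidate count equals clean triggers plus rebuttals plus gated. That invariant is an assumption inferred from the numbers, not a rule stated on this page. The sketch also sums each column and lists the bills with zero clean triggers.

```python
# (bill, candidates, clean triggers, rebuttals, gated), copied from the table above.
ROWS = [
    (1, 29, 18, 10, 1),
    (2, 10, 10, 0, 0),
    (3, 7, 4, 1, 2),
    (4, 6, 0, 4, 2),
    (5, 7, 4, 3, 0),
    (6, 4, 0, 3, 1),
    (7, 1, 1, 0, 0),
    (8, 28, 18, 9, 1),
    (9, 41, 0, 31, 10),
    (10, 30, 27, 3, 0),
    (11, 54, 41, 12, 1),
    (12, 1, 0, 1, 0),
    (13, 3, 2, 0, 1),
    (14, 16, 12, 4, 0),
    (15, 52, 50, 2, 0),
]


def inconsistent_bills(rows):
    """Bills where candidates != clean + rebuttals + gated (assumed invariant)."""
    return [b for b, cands, clean, reb, gated in rows if cands != clean + reb + gated]


def column_totals(rows):
    """Sum of (candidates, clean, rebuttals, gated) over all bills."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


def zero_clean_bills(rows):
    """Bills with no clean trigger in the table."""
    return [b for b, _, clean, _, _ in rows if clean == 0]


if __name__ == "__main__":
    print("inconsistent:", inconsistent_bills(ROWS))
    print("totals:", column_totals(ROWS))
    print("zero clean:", zero_clean_bills(ROWS))
```

Every row satisfies the assumed invariant, and the column totals come to 289 candidates, 187 clean triggers, 83 rebuttals, and 19 gated. The zero-clean list is Bills 4, 6, 9, and 12; Bill 4 is not starred, so the ★ appears to track the bills named in the empty-space verification rather than every zero row.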

Public draft v0.2 (2026-05-09). Sweep JSONs (sweep_101..sweep_109) live in the source repo, in the ProjectForty2 public evidence bundle under reasoning_cot/deep_loops/. The v0.3 lock is targeted for 2026-Q3, with classifier 1.000/1.000, a watchlist, falsifiers, and author-activity.