# Reasoning / Chain-of-Thought Ledger — Bills Draft (v0.2)

> Status: post-batch-1 promotion (2026-05-09). 9 sweeps, 426 raw papers
> aggregated to **394 unique papers** (dedup ratio 1.08). Empty-space
> hypothesis confirmed: **Bills 6, 9, 12 ★ hold empty** with 0 clean
> triggers across 394 papers.
>
> Rebuttal density: **27.9% (110 rebuttal_papers)** — within target window.
> Escape-gate density: **48 papers** (34 G1 methodology + 14 G3 theoretical).
> Hand-arbitration applied to the union JSON via
> `scripts/bill_classifier.py --arbitrate-union`; classifier benchmark of
> 50 cases (target for v0.3 lock: 1.000/1.000).

## The fifteen bills (with REAL fire counts from 9-sweep batch 1)

A "bill" is a closure mechanism that any frontier reasoning-mode capability
claim must engage. The 15 bills are domain-specific to the reasoning corpus
(2024–2026) and reflect the audit structure that emerged from how the
literature actually fragments across CoT-faithfulness, test-time-compute
scaling, anti-saturation construction, and Apollo / METR / AISI red-team
evaluations.

| # | Bill | Fires (candidates) | Clean triggers | Rebuttals | Gated | ★ |
|---:|---|:-:|:-:|:-:|:-:|:-:|
| 1 | CoT-faithfulness validation audit | 29 | 18 | 10 | 1 | |
| 2 | Test-time-compute disclosure (full tuple) | 10 | 10 | 0 | 0 | |
| 3 | Cross-benchmark reasoning transfer | 7 | 4 | 1 | 2 | |
| 4 | Adaptive-prompt / scratchpad stability | 6 | 0 | 4 | 2 | |
| 5 | Multi-step trajectory contamination audit | 7 | 4 | 3 | 0 | |
| 6 | **Causally-faithful reasoning mechanism** | **4** | **0** | **3** | **1** | ★ |
| 7 | Strong-baseline classical-reasoning comparison | 1 | 1 | 0 | 0 | |
| 8 | Adversarial / scheming / self-exfiltration audit | 28 | 18 | 9 | 1 | |
| 9 | **Test-time-search vs reasoning decomposition** | **41** | **0** | **31** | **10** | ★ |
| 10 | Vendor-self-eval independence | 30 | 27 | 3 | 0 | |
| 11 | Anti-saturation construction | 54 | 41 | 12 | 1 | |
| 12 | **Universal reasoning-task coverage** | **1** | **0** | **1** | **0** | ★ |
| 13 | Capability-cost transparency | 3 | 2 | 0 | 1 | |
| 14 | Reward-hacking / sandbagging dual-mode audit | 16 | 12 | 4 | 0 | |
| 15 | Distilled-cousin reproduction audit | 52 | 50 | 2 | 0 | |
| — | (unclassified: methodology / theoretical / unset) | 105 | 15 | 28 | 8 (+ 54 OOS) | |

★ = signature construction; **empty-space hypothesis confirmed** through 9
sweeps and 394 unique papers — Bills 6, 9, 12 each have zero clean triggers.
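
The row arithmetic in the table can be checked mechanically. The following is a minimal sketch, assuming every candidate lands in exactly one of four verdict buckets (clean trigger, rebuttal, gated, out-of-scope). Out-of-scope counts for the numbered bills are taken as 0, which is what the row sums imply. The tuple layout and the names `ROWS`, `check_rows`, and `empty_starred` are illustrative and are not part of `scripts/bill_classifier.py`.

```python
# Counts transcribed from the table above.
# Layout (an assumption for this sketch):
# (label, fires, clean, rebuttals, gated, out_of_scope, starred)
ROWS = [
    ("1", 29, 18, 10, 1, 0, False),
    ("2", 10, 10, 0, 0, 0, False),
    ("3", 7, 4, 1, 2, 0, False),
    ("4", 6, 0, 4, 2, 0, False),
    ("5", 7, 4, 3, 0, 0, False),
    ("6", 4, 0, 3, 1, 0, True),
    ("7", 1, 1, 0, 0, 0, False),
    ("8", 28, 18, 9, 1, 0, False),
    ("9", 41, 0, 31, 10, 0, True),
    ("10", 30, 27, 3, 0, 0, False),
    ("11", 54, 41, 12, 1, 0, False),
    ("12", 1, 0, 1, 0, 0, True),
    ("13", 3, 2, 0, 1, 0, False),
    ("14", 16, 12, 4, 0, 0, False),
    ("15", 52, 50, 2, 0, 0, False),
    ("unclassified", 105, 15, 28, 8, 54, False),
]


def check_rows(rows):
    """Return labels of rows whose buckets do not add up to the fire count."""
    return [r[0] for r in rows if r[1] != sum(r[2:6])]


def empty_starred(rows):
    """Return labels of starred bills that have zero clean triggers."""
    return [r[0] for r in rows if r[6] and r[2] == 0]


total_fires = sum(r[1] for r in ROWS)

print("inconsistent rows:", check_rows(ROWS))
print("empty starred bills:", empty_starred(ROWS))
print("total candidates:", total_fires)
```

On the counts above, every row balances, the three starred bills come back as empty, and the candidate total equals the 394 unique papers.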

## Six meta-costs (real fires)

| # | Meta-cost | Fires |
|---|---|---:|
| M1 | Pre-frontier reasoning (pre-2024 / Wei / Kojima / Wang) | 6 |
| M2 | Single-domain-only | 4 |
| M3 | Single-prompt-template-only | 49 |
| M4 | Single-scalar-test-time-compute (no four-tuple) | 10 |
| M5 | High-compute-mode-only (no standard-mode comparison) | 12 |
| M6 | Implementation-specific | 2 |

## Three escape gates (real fires)

| # | Gate | Fires |
|---|---|---:|
| G1 | Methodology paper (toy / non-frontier) | 34 |
| G2 | Negative-result / rebuttal paper | 110 (= 27.9% of corpus) |
| G3 | Theoretical-construction paper | 14 |
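
The percentages quoted for the gates follow directly from the 394-paper corpus size. A small sketch of that arithmetic, with `share` as an illustrative helper name:

```python
UNIQUE_PAPERS = 394
GATE_FIRES = {"G1": 34, "G2": 110, "G3": 14}  # from the table above


def share(count, total=UNIQUE_PAPERS):
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


rebuttal_density = share(GATE_FIRES["G2"])
escape_gate_papers = GATE_FIRES["G1"] + GATE_FIRES["G3"]
escape_gate_share = share(escape_gate_papers)

print(rebuttal_density, escape_gate_papers, escape_gate_share)
```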

## Empty-space verification (★ predicted-empty bills)

### Bill 6 ★ — Causally-faithful reasoning mechanism

| | Count |
|---|---:|
| Total candidates | **4** |
| **Clean triggers (verdict=known_bill)** | **0** |
| Rebuttal papers | 3 |
| Needs gate | 1 |
| Out-of-scope | 0 |

Strongest rebuttal anchors:
- **Apollo Research o1 evaluations Q4-2024**: 99% lying-about-actions, 79% scheming, 28% sandbagging on six in-context-scheming task suites. CoT trace and behavior diverge by 18–47% absolute.
- **Anthropic "Tracing the Thoughts" (Lindsey, Olah et al. 2025)**: Circuit tracing on Claude 3.5 Haiku reveals the model often computes the answer through one circuit while CoT verbalizes a different unfaithful trace. Mechanistic rebuttal of Bill 6.
- **Hubinger Sleeper Agents (arxiv:2401.05566)**: Backdoor deceptive behavior persists through SFT/RL/adversarial training; CoT-trained models retain deception even when CoT is distilled away. Reasoning-trace and behavior diverge.

### Bill 9 ★ — Test-time-search vs reasoning decomposition

| | Count |
|---|---:|
| Total candidates | **41** |
| **Clean triggers (verdict=known_bill)** | **0** |
| Rebuttal papers | 31 |
| Needs gate | 10 |
| Out-of-scope | 0 |

Strongest rebuttal anchors:
- **OpenAI o3 ARC-AGI (Dec 2024)**: the o3 low-compute and high-compute configurations reach 75.7% and 87.5% on ARC-AGI-1; the high-compute run uses 172× more inference compute on identical weights. Same-weights compute-swing.
- **ARC-AGI-2 release (March 2025)**: o3-high drops from 87.5% to 5–10% on the v2 private set in 3 months at comparable compute. The inflation was test-time-search amplification, not reasoning capability.
- **Brown et al. "monkeys" (arxiv:2407.21787)**: Log-linear pass@k scaling from k=1 to 10,000; weak base + massive sampling beats strong frontier.
- **Snell et al. (arxiv:2408.03314)**: Allocating compute to verifier-guided search at inference time matches or exceeds 14× larger pretrained models.
- **Tsinghua compute-optimal test-time (arxiv:2502.06703)**: 1B + 256-sample search > Llama-3.1-405B baseline.
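
For readers less familiar with the pass@k metric behind the repeated-sampling anchors, the sketch below shows why coverage climbs with sample count even when per-sample accuracy is low. It uses the standard unbiased estimator `1 - C(n-c, k) / C(n, k)` from the code-generation literature, plus a closed form that assumes samples are independent. The 0.1% per-sample success rate is a made-up illustration, not a figure from any cited paper, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(p, k):
    """Chance that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


# Illustrative only: a weak sampler with a 0.1% per-sample success rate.
for k in (1, 100, 10_000):
    print(k, round(coverage(0.001, k), 4))
```

The point for Bill 9 is that a rising pass@k curve reflects the sampling budget as much as the model, so a headline score needs the sample count attached before it can be read as a reasoning claim.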

### Bill 12 ★ — Universal reasoning-task coverage

| | Count |
|---|---:|
| Total candidates | **1** |
| **Clean triggers (verdict=known_bill)** | **0** |
| Rebuttal papers | 1 |
| Needs gate | 0 |
| Out-of-scope | 0 |

Strongest rebuttal anchor:
- **Anand-Goyal 2025 unified-VLM audit (rebuttal extension)**: 0/9 frontier reasoning models pass all 5 sub-tasks {math, code, scientific QA, commonsense, multimodal reasoning}. Multimodal reasoning + open-ended commonsense are systematic gaps. Cousin to VLM Bill 17 ★.

## Sweep summary (9 deep-loop sweeps, real)

| Sweep | Theme | Raw papers | Rebuttal density |
|---|---|---:|---:|
| 101 | Vendor reasoning-model cards | 44 | ~14% |
| 102 | CoT-faithfulness audits + Apollo/AISI | 52 | ~38% |
| 103 | Cross-benchmark reasoning transfer | 51 | ~31% |
| 104 | Test-time-compute scaling | 46 | ~70% |
| 105 | Anti-saturation reasoning benchmarks | 39 | ~25% |
| 106 | Independent third-party reasoning red-team | 46 | ~30% |
| 107 | Distilled-reasoning cousins | 44 | ~5% |
| 108 | Reasoning negative-results / rebuttals | 56 | ~57% |
| 109 | Methodology + theoretical (G1 + G3) | 48 | (escape-gate) |

**Total raw: 426 → 394 unique (dedup 1.08).**
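
The raw total and dedup ratio can be reproduced from the per-sweep counts. A minimal sketch, with illustrative variable names:

```python
# Raw paper counts per sweep, from the table above.
RAW_BY_SWEEP = {
    101: 44, 102: 52, 103: 51, 104: 46, 105: 39,
    106: 46, 107: 44, 108: 56, 109: 48,
}
UNIQUE_PAPERS = 394

raw_total = sum(RAW_BY_SWEEP.values())
duplicates_removed = raw_total - UNIQUE_PAPERS
dedup_ratio = raw_total / UNIQUE_PAPERS  # raw papers per unique paper

print(raw_total, duplicates_removed, round(dedup_ratio, 2))
```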

## Cross-ledger coupling (predicted; to be measured during cross-ledger audits)

- **Inference-time Safety Bill 11 ★ (causally-faithful steering) ↔ this
  ledger Bill 6 ★ (causally-faithful reasoning):** same audit; safety
  ledger covers the deployment-safety implication, this ledger covers
  the capability claim.
- **Inference-time Safety Bill 19 (CoT-monitorability) ↔ this ledger
  Bills 1 + 6 ★:** structurally identical audit. Apollo Q4-2024
  evidence (99% lying) is shared anchor.
- **Mech Interp Bill 11 ★ (causal mechanism) ↔ this ledger Bill 6 ★:**
  mechanistic interpretability is the cousin domain — Anthropic
  Tracing Thoughts is shared anchor.
- **Capability Benchmarks Bill 17 (anti-saturation) + Bill 18
  (anti-saturation construction) ↔ this ledger Bill 11:** FrontierMath
  Tier-4 + ARC-AGI-2/3 are shared anchors. Both ledgers cite the
  Yue-Bommasani 2024-Q4 MMMU-Pro audit.
- **Compute Governance Bill 19 (distilled-cousin half-life 3.4 months)
  ↔ this ledger Bill 9 ★ + Bill 15:** R1-Distill / Sky-T1 /
  Phi-4-reasoning at 100–1000× lower compute reaching 85–95% confirms
  reasoning is largely test-time-search amplifiable into smaller
  distilled models.

## Self-validation tautology (cross-ledger pattern)

In other ledgers, the **self-validation tautology** has manifested as a
domain-invariant structural failure mode (the metric defining success is
also the protocol used to prove success). In this reasoning ledger it
manifests as **Bill 10 (vendor-self-eval defining reasoning benchmark
performance AND validating it) AND Bill 14 (vendor-defined reward-eval
distribution AND vendor-evaluated under it)**. Both bills fire in the
corpus, yet the papers that trigger them typically do not pay the
matching independence audit cleanly.

## Lock conditions (v0.2 → v0.3)

- ✅ 9 sweeps complete; 394 unique papers
- ✅ Bills 6, 9, 12 ★ hold empty (0 clean triggers each)
- ✅ Rebuttal density 27.9% (within target window 25–40%)
- ✅ G1 + G3 escape gates 48 papers (~12% of corpus)
- 🔄 Classifier benchmark 50 cases (currently below 1.000/1.000 — extending rules toward v0.3 lock)
- 🔄 Watchlist `wiki/watchlist.md` ≥30 entries (pending Stage 5 POPULATE)
- 🔄 Falsification protocol `wiki/falsifiers.md` ≥10 trigger conditions (pending)
- 🔄 Author-activity panel (pending)
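
The checklist above can be expressed as a small gate function. This is a sketch only: `LedgerState` and `lock_report` are hypothetical names, the thresholds are the ones stated in the list, and the values for the pending items (classifier scores, watchlist and falsifier counts) are placeholders because the ledger does not report them. The escape-gate item is left out since no threshold is stated for it.

```python
from dataclasses import dataclass


@dataclass
class LedgerState:
    sweeps: int
    unique_papers: int
    starred_clean_triggers: dict  # bill number -> clean-trigger count
    rebuttal_papers: int
    classifier_scores: tuple      # the two benchmark figures (x/y)
    watchlist_entries: int
    falsifier_conditions: int
    author_panel_done: bool


def lock_report(s):
    """Map each lock condition to True (met) or False (pending)."""
    density = s.rebuttal_papers / s.unique_papers
    return {
        "sweeps_complete": s.sweeps >= 9,
        "starred_bills_empty": all(
            v == 0 for v in s.starred_clean_triggers.values()
        ),
        "rebuttal_density_in_window": 0.25 <= density <= 0.40,
        "classifier_locked": all(x == 1.0 for x in s.classifier_scores),
        "watchlist_populated": s.watchlist_entries >= 30,
        "falsifiers_defined": s.falsifier_conditions >= 10,
        "author_panel_done": s.author_panel_done,
    }


state = LedgerState(
    sweeps=9,
    unique_papers=394,
    starred_clean_triggers={6: 0, 9: 0, 12: 0},
    rebuttal_papers=110,
    classifier_scores=(0.9, 0.9),  # placeholder: actual scores not reported
    watchlist_entries=0,           # placeholder: pending Stage 5 POPULATE
    falsifier_conditions=0,        # placeholder: pending
    author_panel_done=False,
)

report = lock_report(state)
ready_to_lock = all(report.values())
print(report, ready_to_lock)
```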

## Status

**v0.2 RELEASED (2026-05-09)** with the public draft of the bills
definition and the 394-paper batch-1 audit. Targets:
- v0.3 lock: 2026-Q3 (after classifier 1.000/1.000 + watchlist + falsifiers
  + author-activity).
- arXiv submission: 2026-Q3.

## Authorship

Kevin Russell (Project 42).
Pre-publication draft, do not cite without permission.