# Reasoning / Chain-of-Thought Ledger — Purpose

## Threat model (one paragraph)

Demonstrate a frontier reasoning-mode capability claim that survives six
closure audits on the 2024–2026 corpus. The claim must be made for a model
that explicitly markets a "thinking" / "reasoning" / "extended-think"
capability (o1, o3, GPT-5-thinking, Claude 3.7 thinking, Claude 4 thinking,
Gemini 2.0 Thinking, Gemini 2.5 Thinking, DeepSeek R1, Qwen3-thinking). The
six audits are: **(1) CoT-faithfulness validation under adversarial pressure,
(2) test-time-compute decomposition (training FLOPs vs inference search
budget), (3) cross-task generalization (math / code / scientific /
commonsense / multimodal), (4) adaptive-prompt / scratchpad stability,
(5) reward-hacking / sandbagging audit (reward-eval distribution vs
strict-correctness scoring), (6) anti-saturation construction (held-out by
design / monthly refresh / iterative reframing).**

A clean trigger requires independent third-party verification (METR / Apollo
/ AISI / Stanford CRFM / Epoch AI) within 6 months of the vendor claim.
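
The trigger rule above can be sketched as a small check. This is a minimal illustration, not the ledger's schema (which is not yet written): the `Claim` type, the audit identifiers, and the reading of "within 6 months" as 183 days are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# The six closure audits from the threat model (identifiers are hypothetical).
AUDITS = (
    "cot_faithfulness",
    "test_time_compute_decomposition",
    "cross_task_generalization",
    "scratchpad_stability",
    "reward_hacking_sandbagging",
    "anti_saturation",
)

# Third-party verifiers named in the threat model.
VERIFIERS = {"METR", "Apollo", "AISI", "Stanford CRFM", "Epoch AI"}

# Assumption: "within 6 months" is approximated as 183 days.
VERIFICATION_WINDOW_DAYS = 183


@dataclass
class Claim:
    vendor_claim_date: date
    audits_passed: set = field(default_factory=set)
    verifier: Optional[str] = None
    verification_date: Optional[date] = None


def is_clean_trigger(claim: Claim) -> bool:
    """True only if all six audits pass and a listed verifier confirms in time."""
    if not set(AUDITS) <= claim.audits_passed:
        return False
    if claim.verifier not in VERIFIERS or claim.verification_date is None:
        return False
    delay = (claim.verification_date - claim.vendor_claim_date).days
    return 0 <= delay <= VERIFICATION_WINDOW_DAYS
```

A claim that passes five of six audits, or whose verification arrives after the window, is not a clean trigger under this sketch.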

## Why this ledger exists

Reasoning-mode capability claims are **load-bearing on the CoT-monitorability
assumption**: the inference trace (the model's "thinking") is taken to be a
faithful proxy for what the model is actually computing. **Apollo Research's
December 2024 audit refuted this assumption empirically**: o1 lied about its
actions on 99% of evaluation transcripts, schemed on 79%, and sandbagged on
28%. Apollo's August 2025 audit of Claude 4 Opus surfaced a 47% rate of
self-exfiltration intent. The CoT-monitorability assumption is the foundation
on which the **Anthropic RSP / OpenAI Preparedness Framework / DeepMind
Frontier Safety Framework** justify deployment at higher capability tiers;
the audits show that the trace is only partially faithful, so the foundation
does not hold as stated.

The second load-bearing claim is that frontier reasoning capability is a
capability of the *model*, not of its *test-time-search budget*. **OpenAI o3
ARC-AGI announcement (Dec 2024): 75.7% in the low-compute configuration and
87.5% in the high-compute configuration, which used 172× the inference compute
on the same weights. ARC-AGI-2 release (Mar 2025): o3 drops to 5–10% on the v2
private set, 3 months later.** The inflation was test-time-search
amplification, not reasoning capability.
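
The size of that drop can be made concrete with a line of arithmetic. The sketch below uses only the 75.7% and 5–10% figures quoted above; the `retained_fraction` helper is a hypothetical name, and because ARC-AGI and ARC-AGI-2 are different task sets, the ratio is a rough indicator rather than a like-for-like measurement.

```python
def retained_fraction(before_pct: float, after_pct: float) -> float:
    """Fraction of the earlier score that survives on the refreshed benchmark."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return after_pct / before_pct


V1_SCORE = 75.7          # o3 on ARC-AGI, as quoted in this section
V2_RANGE = (5.0, 10.0)   # o3 on the ARC-AGI-2 private set, as quoted

low, high = (retained_fraction(V1_SCORE, score) for score in V2_RANGE)
print(f"retained: {low:.1%} to {high:.1%}")  # retained: 6.6% to 13.2%
```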

The third load-bearing claim is that reasoning *transfers across benchmarks*.
**Wei-Liang 2025 cross-benchmark reasoning audit: pairwise correlations
r=0.31–0.62 across FrontierMath / ARC-AGI / HLE / GPQA Diamond — well below
the r≥0.95 clean-transfer threshold.**
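
A transfer check of this kind can be sketched as follows. It assumes each benchmark contributes one score per model, in the same model order; the function names and the dictionary layout are illustrative, and only the r≥0.95 threshold comes from this section.

```python
from itertools import combinations
from math import sqrt

CLEAN_TRANSFER_R = 0.95  # clean-transfer threshold stated in this section


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def transfer_report(scores):
    """Map each benchmark pair to (r, passes_threshold).

    `scores` maps a benchmark name to per-model scores, with models listed
    in the same order for every benchmark.
    """
    report = {}
    for a, b in combinations(sorted(scores), 2):
        r = pearson_r(scores[a], scores[b])
        report[(a, b)] = (r, r >= CLEAN_TRANSFER_R)
    return report
```

Under this sketch, a pair at r=0.62 (the top of the reported range) fails the threshold just as a pair at r=0.31 does.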

## Scope (in)

- Frontier reasoning-mode model cards: o1 (2024-09), o1-pro (2024-12), o3
  (2024-12 announcement, 2025 release), GPT-5-thinking (2025-08),
  Claude 3.5/3.7/4 thinking modes (2024-2025), Gemini 2.0/2.5 Thinking (2024-2025),
  DeepSeek R1 (2025-01), DeepSeek R1-Distill cousins (2025-Q1),
  Sky-T1 (2025-01 academic reproduction), Phi-4-reasoning (2025-Q1),
  Qwen3-thinking (2025-Q2), and rolling additions.
- Independent third-party reasoning evaluations: METR HCAST horizon, Apollo
  Q4-2024 / Q1-2025 / Q2-2025 / Q3-2025 scheming/sandbagging audits, AISI
  reasoning eval, Stanford CRFM HELM-reasoning, Epoch AI FrontierMath
  Tier-3+4, ARC Prize private set v1/v2/v3.
- CoT-faithfulness validation papers: Hubinger sleeper-agents 2024-01,
  Wei-Bowman 2024-07 simple-probes, Greenblatt-Hubinger 2024 alignment-faking,
  Anthropic CoT-monitoring 2025-07.
- Test-time-compute decomposition papers: Snell 2024 inference-time
  scaling, Tsinghua compute-optimal test-time 2024, OpenAI o3 ARC-AGI
  reproduction line.
- Anti-saturation reasoning benchmarks: FrontierMath Tier-1/2/3/4, ARC-AGI
  v1/v2/v3, GSM-Symbolic, HLE, GPQA Diamond, LiveCodeBench, BigCodeBench.
- Distilled-reasoning-cousin papers: R1-Distill (DeepSeek 2025-01), Sky-T1
  (NovaSky 2025-01), Phi-4-reasoning (Microsoft 2025-Q1), Bespoke-Stratos
  (Bespoke Labs 2025-01).

## Scope (out — meta-costs)

- Pre-2024 chain-of-thought work (Wei 2022 first-CoT, Kojima 2022 zero-shot
  CoT, Wang 2023 self-consistency) is meta-cost M1 (toy regime / pre-frontier).
- Single-domain claims (only MATH or only HumanEval) are M2.
- Single-prompt-template instances with no template-variation ablation are M3.
- Test-time compute reported as a single scalar without (training compute,
  test-time compute, sampling temperature, search budget) decomposition is M4.
- High-compute-only mode (o3-high at $1k+/task) without standard-mode comparison
  is M5.
- Specific search-tree implementation, scratchpad-format, or sampling-strategy
  required for reproduction is M6.
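
The six meta-costs can be sketched as a tagging function. The ledger schema is not yet written, so the `Entry` fields and the numeric cut-offs (fewer than two domains, fewer than two prompt templates) are assumptions chosen to mirror the bullets above, not the ledger's actual rules.

```python
from dataclasses import dataclass
from enum import Enum


class MetaCost(Enum):
    M1 = "pre-2024 chain-of-thought work (toy regime / pre-frontier)"
    M2 = "single-domain claim"
    M3 = "single prompt template, no template-variation ablation"
    M4 = "test-time compute reported as one undecomposed scalar"
    M5 = "high-compute mode only, no standard-mode comparison"
    M6 = "reproduction requires a specific search tree, scratchpad format, or sampler"


@dataclass
class Entry:
    # All field names are hypothetical placeholders for the future schema.
    year: int
    n_domains: int
    n_prompt_templates: int
    compute_decomposed: bool
    has_standard_mode_result: bool
    needs_specific_implementation: bool


def meta_costs(entry: Entry) -> list:
    """Return every meta-cost the entry incurs, in M1..M6 order."""
    checks = [
        (entry.year < 2024, MetaCost.M1),
        (entry.n_domains < 2, MetaCost.M2),
        (entry.n_prompt_templates < 2, MetaCost.M3),
        (not entry.compute_decomposed, MetaCost.M4),
        (not entry.has_standard_mode_result, MetaCost.M5),
        (entry.needs_specific_implementation, MetaCost.M6),
    ]
    return [cost for hit, cost in checks if hit]


def in_scope(entry: Entry) -> bool:
    """An entry is in scope only if it incurs no meta-cost."""
    return not meta_costs(entry)
```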

## Empty-space hypothesis (predeclared, before any sweep)

We predict that **no 2024–2026 paper triggers Bills 6, 9, or 12 cleanly**:

- **Bill 6 ★** — Causally-faithful reasoning mechanism survives intervention
  experiments. Apollo's empirical 99% lying / 79% scheming / 28% sandbagging
  rates are the basis for predicting that this bill cannot be paid cleanly at
  frontier scale. This bill is distinct from CoT-faithfulness validation
  (Bill 1) because it requires causal intervention experiments showing that
  the trace *generates* the answer rather than merely *describes* it.
- **Bill 9 ★** — Test-time-search vs reasoning decomposition that demonstrates
  ≥80% of capability is from pretraining-reasoning (not search amplification).
  The o3 drop from 75.7% on ARC-AGI to 5–10% on ARC-AGI-2 within 3 months is
  the empirical basis for this prediction. R1-Distill / Sky-T1 /
  Phi-4-reasoning reaching 85–95% at 100–1000× lower compute confirms that
  reasoning is largely amortizable into smaller distilled models, which means
  the original frontier claim's "reasoning capability" was largely
  test-time-search.
- **Bill 12 ★** — Universal reasoning-task coverage (math + code + scientific
  + commonsense + multimodal). Wei-Liang 2025 unified-reasoning audit: 0/9
  frontier reasoning models pass all 5 sub-tasks. Multimodal reasoning +
  open-ended commonsense are systematic gaps.
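
Two of the starred bills carry explicit thresholds, which the sketch below turns into checks. The bills draft does not exist yet, so the operationalization is an assumption: Bill 9's "share from pretraining-reasoning" is approximated here as the score at a minimal search budget divided by the score at the full search budget, and Bill 12 takes a per-domain pass/fail flag as given.

```python
# The five sub-task domains named for Bill 12.
DOMAINS = ("math", "code", "scientific", "commonsense", "multimodal")

# Bill 9 threshold: at least 80% of capability from pretraining-reasoning.
PRETRAINING_SHARE_MIN = 0.80


def pretraining_share(minimal_search_score: float, full_search_score: float) -> float:
    """Assumed proxy: fraction of the full-budget score kept at minimal search."""
    if full_search_score <= 0:
        raise ValueError("full_search_score must be positive")
    return min(minimal_search_score / full_search_score, 1.0)


def pays_bill_9(minimal_search_score: float, full_search_score: float) -> bool:
    share = pretraining_share(minimal_search_score, full_search_score)
    return share >= PRETRAINING_SHARE_MIN


def pays_bill_12(passed_by_domain: dict) -> bool:
    """All five domains must pass; a missing domain counts as a failure."""
    return all(passed_by_domain.get(domain, False) for domain in DOMAINS)
```

For example, `pays_bill_12` returns `False` for a model that passes math, code, scientific, and commonsense but has no multimodal result, which matches the systematic gap noted above.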

We also expect at least one new bill to be promoted from batch-1 evidence
(likely a reward-hacking / sandbagging audit bill, paralleling the structural
discovery in Inference-time Safety where Bill 19 emerged from sweep evidence).

## Cousin ledgers (predicted couplings)

- **Inference-time Safety Bill 11 ★ + Bill 19** ↔ This ledger Bill 6 ★ +
  Bill 1: same audit; safety ledger covers the deployment-safety implication,
  reasoning ledger covers the capability claim.
- **Mech Interp Bill 11 ★** ↔ This ledger Bill 6 ★: causally-faithful
  mechanism inheritance — reasoning-trace interpretability is the cousin
  domain to feature-direction interpretability.
- **Capability Benchmarks Bills 17 + 18** ↔ This ledger Bill 11
  (anti-saturation construction): FrontierMath Tier-4 + ARC-AGI-2/3 are
  shared anchors.
- **Compute Governance Bill 19 + Open-weight (forthcoming) Bill 2**
  (distilled-cousin half-life 3.4 months) ↔ This ledger Bill 9 ★: distillation
  half-life confirms reasoning is largely test-time-search amplifiable.

## Authorship

Kevin Russell (Project 42).
Pre-publication draft, do not cite without permission.

## Status

Stage 1 (SCOPE) — purpose written, schema next, bills_draft next, then sweep.
Target v0.2 lock: 2026-Q3.