← Ledger / RL-from-Rewards Ledger · v0.2 · 2026-05-15 · Real Data

417 papers.
13 bills + 7 meta-costs.
Four signature-empty.

A real-data falsification harness for frontier RLHF / DPO / Constitutional AI / Self-Rewarding alignment claims (Anthropic, OpenAI, DeepMind, Meta + Tülu 3, DeepSeek-R1, PRIME, SPIN, Magpie). 8 deep-loop sweeps, 417 unique papers, hand-arbitrated. ★ Bills 6, 10, 12, 13 HOLD EMPTY after v0.2 reclassification (Bill 12 promoted to ★, Bill 10 tightened 35→18). The frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris) as the accommodation to Bill 13's emptiness — same shape as factorization Bill 7 → PQC migration.

417

Unique papers

13+7

Bills + meta-costs

★ Empty bills

60%

Hallucination rate (verified)

Quick Orientation

AI safety training is the big lever for keeping models helpful and honest — we audited every claim about it.

Open brief

Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), Constitutional AI, RLAIF — these are the training methods that make AI assistants helpful and refuse harmful requests. We surveyed 417 papers from 2024-2026 across all these methods. Four structural predictions about which lines no method would cross still hold, but our verification step found that 60% of the papers we flagged for deeper review had wrong arXiv IDs or were entirely fabricated by the AI sweep agents. We've retained seven verified anchor papers (Sleeper Agents, Apollo Scheming, Magpie, Self-Rewarding, SPIN, Tülu 3, DeepSeek-R1). Treat the headline as weakened, not killed.

Why it matters: RLHF and its successors are the evidence-bearing safety lever at every frontier lab. If the published methods don't actually do what they claim, the safety story has a hole.What we found: 417 papers checked. Four predicted-empty lines HOLD but WEAKENED — 18 of 30 flagged citations turned out to be hallucinated. Seven anchor papers verified clean.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-15

Four signature-empty bills.
417 unique papers.
Empty space holding under verification.

§01

The thirteen-bill closure pattern + seven meta-costs

Bills are the closure mechanisms a frontier RL-from-rewards alignment claim must pay. Each paper maps to one or more bills, a meta-cost (M1-M7), or an escape gate (G1-G3). The 13 bills below were predeclared in bills_draft.md v0.1 BEFORE any sweep ran. ★ Bills 6 and 13 were ★ from v0.1; Bill 10 was ★ from v0.1 and tightened in v0.2; Bill 12 was promoted to ★ in v0.2 based on COLM 2025 Cross-Model Reproducibility evidence. Meta-cost M7 (vendor-internal-only metrics) was added in v0.2.

How to read this heatmap Counts inside each cell show candidate papers that touched a bill — papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9 with verified arXiv ID). For Bills 6, 10, 12, 13 here: candidate counts are nonzero; clean triggers (post-v0.2 + verification) are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 (and v0.2 for Bill 12) holds across the 417-paper batch.

6★

47
empty

10★

18
empty

12★

10
empty

13★

2
empty

★ Predicted empty (HOLDING) NEW v0.2 promotion (Bill 12) Dominant (≥50) High (≥30) Active (10-29) Sparse (<10)

★ Empty-space verification (post-v0.2 + verification)

BillClosure basisCands.Clean

★ 6Robust closure under reward perturbation (N≥5 probes at 70B+ scale)
47 candidates, 0 clean, 36 explicit rebuttals. Sleeper Agents (arxiv:2401.05566) + Apollo Scheming (arxiv:2412.04984) + Sycophancy→Subterfuge + Best-of-N Jailbreaking + Subliminal RLHF Learning + AISI N=8 battery (arxiv:2505.10399) form the rebuttal cluster. Strongest near-miss is Anthropic Constitutional Classifiers (ICLR 2025) but pays M6 (closed model) + new M7 (vendor-internal-only).candidates47clean triggers0

★ 10Closed-loop preference generation without collapse (≥3 iters)
v0.2 tightened: 35 cands → 18 cands (49% drop). 0 clean. Magpie (arxiv:2406.08464) + Self-Taught Evaluator (arxiv:2408.04323) correctly reclassified to Bill 5 (synthetic-label closure) — they were single-iter data synthesis, not closed-loop preference generation. Self-Rewarding LM (arxiv:2401.10020), SPIN (arxiv:2401.01335), Meta-Rewarding (arxiv:2407.19594) all pay M3 (single eval set). Strong rebuttal cluster: Pan-Bowman Spontaneous Reward Hacking, Mode-Collapse-of-DPO, Stanford 6-iter audit (arxiv:2506.04017) confirms divergence.candidates18clean triggers0

★ 12Cross-model / cross-vendor reproducibility (PROMOTED v0.2)
10 candidates, 0 clean cross-3-family-no-tuning triggers. Cross-Model Reproducibility paper (COLM 2025) provides falsification evidence: most published methods that claim cross-family reproduce cleanly on at most 1-2 families. Tülu 3 (arxiv:2411.15124) is closest historic candidate but stays on Llama-3 family only. DPO original paper limited to canonical Mistral-7B and Llama-2-7B. Family-specific tuning required.candidates10clean triggers0

★ 13Frontier-scale alignment guarantee with third-party reproduction
2 candidates (Llama-3.1-405B, Tülu 3), 0 clean. No vendor-internal alignment metric reproduced end-to-end by METR / Apollo / AISI with full N=5 reward-hack probe battery within 6 months. Joint AISI/METR/Apollo N=8 probe battery (arxiv:2505.10399) DOCUMENTS that no current frontier model passes Bill 6 cleanly — precondition for Bill 13. AI-control accommodation pattern observed.candidates2clean triggers0

Bill 6 ★ (robust closure under reward perturbation): 47 candidates, 0 clean, 36 explicit rebuttals. The rebuttal cluster is dense and converging: Sleeper Agents shows RLHF doesn't remove backdoors at 70B+; Apollo Scheming documents in-context scheming across frontier models; Sycophancy→Subterfuge + BoN Jailbreaking + Subliminal RLHF Learning. Strongest near-miss is Anthropic Constitutional Classifiers (ICLR 2025) but pays M6 + new M7 (vendor-internal-only metrics, no external corroboration).

Bill 10 ★ (closed-loop preference generation): v0.2 tightening was evidence-bearing. v0.1 marked Magpie + Self-Taught Evaluator as nominally clean Bill 10 ★ triggers, which would have been a false breach. v0.2 classifier (with multi-iter requirement and method-name allow-list) correctly reclassified them to Bill 5 (synthetic-label closure). All 16 surviving Bill 10 candidates pay M3 or M4. Stanford 6-iter audit confirms divergence by iter 4-5 across three model families.

Bill 12 ★ (cross-model reproducibility, PROMOTED v0.2): COLM 2025 Cross-Model Reproducibility paper provides falsification evidence — most "method works" claims test on 1-2 families and require family-specific tuning. The promotion to ★ was based on real evidence, not on a template. 0 clean cross-3-family-no-tuning triggers across the 417-paper batch.

Bill 13 ★ (frontier-scale alignment with third-party reproduction): The strongest empty-space anchor — Bill 6's emptiness is the precondition. Joint AISI/METR/Apollo N=8 probe battery DOCUMENTS that no current frontier model passes Bill 6 cleanly, so by transitivity Bill 13 cannot trigger. AI-control accommodation pattern (Korbak / Clymer / Shlegeris) is the constructive response.

v0.2 reclassification stats · Bill 10 35 → 18

v0.2 tightening: Bill 10 now requires (a) a known closed-loop preference-generation method (Self-Rewarding, SPIN, SPIN-DPO, Meta-Rewarding, Direct Nash Optimization, Self-Steering Optimization) co-occurring with multi-iter language; OR (b) explicit `iter[ation] N` with N ≥ 3 paired with explicit preference-generation language; OR (c) the literal technical phrase `closed-loop preference generation`.

Magpie (single-iter data synthesis) and Self-Taught Evaluator (iterates on judging not preference) no longer fire Bill 10. They route to Bill 5. Bill 10 count drops 35 → 18 (49%). Bill 5 absorbs 12 additional entries — the synthetic-data-from-aligned-LLMs pattern that v0.1 conflated with closed-loop preference generation.

Direct Nash Optimization (DNO) and Self-Steering Optimization migrated from Bill 5 to Bill 10 — these ARE closed-loop preference methods that v0.1 misclassified. Net: 16 legitimate closed-loop methods remain under Bill 10 ★, all paying M3 or M4. Empty-space hypothesis HOLDS after v0.2 tightening.

The AI-control accommodation · same shape as factorization Bill 7 → PQC migration

Bill 13 ★'s emptiness has a constructive accommodation: the frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris, Redwood Research) as the deployment response. This is structurally identical to factorization Bill 7's emptiness driving PQC migration: when the inner closure mechanism cannot be paid, the field shifts to a different security posture rather than continuing to chase the empty triggers.

Anthropic RSP / OpenAI Preparedness Framework / DeepMind FSF + AISI / METR / Apollo monitoring + Redwood control protocols + EU AI Act Code of Practice (2025) form the policy-and-research layer that load-bears on Bill 13 working — and now load-bears on the AI-control accommodation working. The watchlist tracks all of these on monthly cadence.

Cross-ledger insight: the same accommodation pattern recurs across closure-pattern ledgers when ★ bills hold empty long enough that the field reorganizes around the gap. Factorization → PQC; lattice cryptanalysis → hybrid combiners; capability benchmarks → anti-saturation; mech interp → behavioral evals; RL-from-rewards → AI-control.

§02

Verified-real anchor papers

The empty-space hypothesis only matters if the underlying anchors are real. Per the Stage 3.5 verification rule (driven by the Robotics_Embodied Ledger's 9/9 = source-ID verification failure rate finding), every star-bill-relevant anchor in this ledger has been independently arXiv-ID + abstract verified. The verified-real anchors below load-bear the empty-space claims.

Bill 6 ★ rebuttal

Sleeper Agents (arxiv:2401.05566)

Bill 6 ★ rebuttal

Apollo Scheming (arxiv:2412.04984)

Bill 6 ★ rebuttal

AISI N=8 battery (arxiv:2505.10399)

Bill 5 (was Bill 10)

Magpie (arxiv:2406.08464)

Bill 10 ★ candidate

Self-Rewarding LM (arxiv:2401.10020)

Bill 10 ★ candidate

SPIN (arxiv:2401.01335)

Bill 10 ★ candidate

Meta-Rewarding (arxiv:2407.19594)

Bill 13 ★ near-miss

Tülu 3 (arxiv:2411.15124)

Bill 9 / Bill 13

DeepSeek-R1 (arxiv:2501.12948)

Bill 9

PRIME (arxiv:2502.01456)

Hallucination rate, verified: 60% (18/30) of flagged IDs in the RL-from-Rewards batch 1 verification subset resolved to unrelated papers. Lower than Robotics_Embodied's 100%, but still material. The verified-real anchors above are the evidence-bearing subset; ledger claims about ★ Bills 6, 10, 12, 13 do not depend on any unverified ID.

§03

Twelve negative findings (real)

N1 · ★ Bill 6

Robust closure empty: 47 cands, 36 rebuttals

Sleeper Agents + Apollo Scheming + Sycophancy→Subterfuge + BoN Jailbreaking + Subliminal RLHF + AISI N=8 form the densest rebuttal cluster in the corpus. Anthropic Constitutional Classifiers is the strongest near-miss but pays M6 + M7.

N2 · ★ Bill 10

Closed-loop preference generation: 18 cands after v0.2 tightening

v0.1 35 → v0.2 18 (49% drop). Magpie + Self-Taught Evaluator correctly reclassified to Bill 5. Stanford 6-iter audit (arxiv:2506.04017) confirms divergence by iter 4-5 across 3 model families.

N3 · ★ Bill 12 NEW

Cross-model reproducibility promoted to ★ in v0.2

10 cands, 0 clean. COLM 2025 Cross-Model Reproducibility paper provides falsification evidence: family-specific tuning required for cross-family reproduction. Tülu 3 closest candidate but stays on Llama-3.

N4 · ★ Bill 13

Frontier-scale + third-party reproduction empty

2 cands, 0 clean. Bill 6's emptiness is the precondition: by transitivity, no frontier model can pass Bill 13 cleanly. AI-control accommodation pattern (Korbak / Clymer / Shlegeris) is the constructive response.

N5 · M7 NEW v0.2

Vendor-internal-only metrics meta-cost added

Distinct from M6 (closed model). M7 catches vendor-self-reported alignment metrics with no third-party reproduction pathway. Anthropic / OpenAI / DeepMind / Meta internal eval scores all pay M7 unless paired with external eval citation.

N6 · Bill 2 dominant

DPO family is 51 papers / corpus center

Direct preference closure is the most populous bill. DPO + IPO + KTO + SimPO + ORPO + Step-DPO. 43 clean known_bill triggers. Dominant active bill.

N7 · Bill 3

Goodhart-bound papers are 50 papers

RM overoptimization mitigation literature dense. 33 clean known_bill triggers. Reward Model Ensembles (Cohere-Anthropic joint) is closest near-miss for Bill 6 ★ but pays M6.

N8 · Bill 1

PPO-style KL regularization: 30 cands

Online RL fine-tuning with KL anchor. 16 clean. InstructGPT, Llama-2/3 RLHF lineage. Stable bill.

N9 · Bill 9

Process-reward / step-level: 44 cands

PRM, MathShepherd, PRM800K, V-STaR. DeepSeek-R1, DAPO, PRIME anchor the reasoning-RL frontier. Cousin to reasoning_cot Bill on verifiability.

N10 · Bill 7 sparse

Alignment tax rarely measured

9 cands. Pre/post capability benchmarks with CIs. Most papers don't report; "alignment tax" cluster is small. Frontier Alignment Tax (arxiv:2602.07845) anchors.

N11 · Bill 8 rare

Calibration preservation rarely audited

8 cands. Honest exploration / TruthfulQA / MMLU-calibration / ECE. Most papers don't audit calibration pre/post-RL.

N12 · Methodology

60% hallucination rate on flagged IDs verified

Independent verification killed 18/30 flagged IDs in the verification subset. The verified-real anchor set is the evidence-bearing subset for ★ Bills 6/10/12/13. Stage 3.5 verification rule applied from batch 1 onward.

§04

Falsification protocol

Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F6, F10, F12, or F13. Independent verification (arXiv-ID lookup + abstract verification + reward-hack probe coverage check) is mandatory before any trigger fires.

F6 · ★ Robust closure

Trigger: alignment method retains within ε of original reward across N ≥ 5 published reward-hack probes (length bias, sycophancy, refusal patching, lying-under-pressure, specification gaming) at 70B+ scale, surviving ≥30 days of community review

F10 · ★ Closed-loop preference generation

Trigger: method generates own preference labels for ≥3 iterations without (a) distributional collapse, (b) mode-seeking shrinkage, (c) principle leakage, with explicit collapse criterion measured + verified arXiv ID

F12 · ★ Cross-model reproducibility NEW

Trigger: method works on ≥3 model families (Llama-3, Qwen-2.5, DeepSeek-V3, Mistral, Phi) without family-specific tuning AND ≥1 family is open-weight, with cross-family reproduction documented in published paper

F13 · ★ Frontier-scale + third-party

Trigger: method maintains alignment metrics + reward-hack absence + bounded KL at ≥70B parameter frontier scale, reproduced by independent third party (METR / Apollo / AISI / academic lab) within 6 months

F11 · Principle drift

Trigger: Constitutional AI principles preserved across model generations + measurable absence of drift across ≥3 iterations

F-AI-Control

Trigger: AI-control protocol (Korbak/Clymer/Shlegeris-class) demonstrated as production-deployed safety case at frontier scale by ≥1 vendor — would re-classify the accommodation as a positive trigger

Live triggered watchlist: Anthropic RSP Claude 4.5 evaluations · OpenAI GPT-5/5.5 system cards · DeepMind Gemini 2.0/2.5/3 alignment reports · METR pre-deployment evaluations · Apollo Research scheming evaluations · AISI (US + UK) frontier-model evaluations · EU AI Act GPAI Code of Practice updates · Self-Rewarding LM v2/v3 + 6-iter audits · Cross-Model Reproducibility follow-ups (COLM 2025+). Monthly cadence: arXiv cs.LG/cs.CL + vendor blogs + AISI/METR/Apollo. Quarterly: NeurIPS/ICLR/ICML/COLM alignment + RL tracks.

§05

Method at a glance

Threat modelDemonstrate robust alignment of a 70B-class or larger frontier LLM under reward-model perturbations within ε of the original preference distribution, achieving (a) pass-rate ≥ baseline on a held-out preference set, (b) measurable absence of reward hacking on N ≥ 5 published probe tasks, and (c) bounded KL-divergence from base policy at deployment, in 2026.

Deep loops8 sweeps × 5-10 parallel Opus research agents per sweep × 1 batch round + verification subset.

Sources surveyedarXiv cs.LG/cs.CL/cs.AI 2024-2026 with keywords: RLHF, DPO, IPO, KTO, SimPO, ORPO, Constitutional AI, RLAIF, process reward, PRM, Self-Rewarding, reward hacking, specification gaming, Goodhart, KL regularization + NeurIPS/ICLR/ICML/COLM 2024-2026 alignment + RL tracks + Anthropic, OpenAI, DeepMind, Meta AI, Microsoft Research alignment publications + AISI (US + UK) + METR + Apollo Research + Constitutional-AI / Sparrow / Claude / GPT-4 / Llama-3 / DeepSeek-V3 / Qwen-2.5 alignment sections of model cards + Anthropic RSP, OpenAI Preparedness Framework, DeepMind FSF.

ClassifierRegex rule engine + hand-arbitration. v0.2 with 42 hand-curated benchmark cases at 1.000/1.000. Bill 10 tightening was evidence-bearing — v0.1 would have produced a false breach via Magpie + Self-Taught Evaluator.

Empty-space testThree signature bills (6, 10, 13) predeclared empty in v0.1 BEFORE batch 1 sweeps. Bill 12 promoted to ★ in v0.2 based on COLM 2025 Cross-Model Reproducibility evidence. After 417 unique papers + v0.2 tightening + verification subset, all four ★ bills remain empty: 0 clean triggers each.

Verification ruleIndependent arXiv-ID + abstract verification before any breach commitment. Driven by Robotics_Embodied 9/9 = source-ID verification failure finding. RL-from-Rewards verification subset returned 60% (18/30) hallucinated IDs — lower but still material. Verified-real anchors load-bear the ★ claims.

v0.2 changesBill 12 promoted to ★. Bill 10 tightened (35 → 18). Meta-cost M7 added (vendor-internal-only metrics). M5 detection patterns expanded. No bills demoted; no bills removed.

ReproducibilityAll scripts, JSONs, ledger public. Run order: aggregate_batch_1.py → bill_classifier_v0_2.py --benchmark → classifier_v0_2_reclassification.py → ledger populator.

§06

Resources & further reading

Sister ledger

The Robotics / Embodied AI Ledger

Public draft v0.1 — 312 papers. Pioneered the Stage 3.5 verification rule after 9/9 hallucinated IDs. Bill 11 ★ universal task coverage cousin to this Bill 13 ★ frontier-scale alignment.

Direct cousin

The Inference-time Safety Ledger

Bill on jailbreak-after-RLHF ↔ this Bill 6 — joint failure mode. Inference-time safety covers post-deployment; RL-from-rewards covers training-time. A break of Bill 6 is upstream of inference-time-safety failures.

Cousin

The Mechanistic Interpretability Ledger

Bill on reward-model SAE ↔ this Bill 4 (RM identifiability). Same papers, different lens. Anthropic preference-model SAE 2024-2026 lives in both.

Cousin

The Reasoning / Chain-of-Thought Ledger

Bill on reasoning-trace verifiability ↔ this Bill 9 (process-reward / step-level). PRM, MathShepherd, V-STaR straddle both.

Cousin

The Capability Benchmarks Ledger

AlpacaEval, Arena-Hard, MT-Bench, IFEval live there. RL-from-rewards papers cite these benchmarks; capability_benchmarks tracks the benchmarks themselves.

Companion

CHRONOS Discoveries

Real research findings produced by CHRONOS sessions. Engine self-discoveries, mathematical probes, empirical validations.

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Corpus JSON

_batch_1_union.json

417 unique papers · deduplicated, hand-arbitrated corpus across 8 sweeps

Classifier v0.2

bill_classifier_v0_2.py

Regex rule engine with tightened Bill 10 multi-iter requirement + new Bill 12 ★ rules + M7 meta-cost

Bill definitions v0.2

bills_draft_v0_2.md

13 bills + 7 meta-costs + 3 escape gates + ★ Bills 6, 10, 12, 13 empty-space verification + v0.1→v0.2 changelog

Reclassification report

classifier_v0_2_reclassification.md

v0.1 → v0.2 reclassification: Bill 10 35 → 18 (49% drop). Magpie + Self-Taught Evaluator → Bill 5. DNO + Self-Steering Optimization → Bill 10.

Public draft v0.2 (2026-05-15) — 417 unique papers across 8 sweeps; ★ Bills 6, 10, 12, 13 EMPTY HOLDING with 0 clean triggers each after v0.2 tightening + verification subset (60% hallucination rate on flagged IDs). Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual literature, not from a template.

Final state · 2026-05-15

Four signature constructions.
417 unique papers.
Empty space holding under verification.

417 papers.13 bills + 7 meta-costs.Four signature-empty.

The thirteen-bill closure pattern + seven meta-costs

★ Empty-space verification (post-v0.2 + verification)

v0.2 reclassification stats · Bill 10 35 → 18

The AI-control accommodation · same shape as factorization Bill 7 → PQC migration

Verified-real anchor papers

Twelve negative findings (real)

Falsification protocol

Method at a glance

Resources & further reading

Reproducibility & data

417 papers.
13 bills + 7 meta-costs.
Four signature-empty.