← Ledger
/
RL-from-Rewards Ledger · v0.2 · 2026-05-15 · Real Data
417 papers.
13 bills + 7 meta-costs.
Four signature-empty.
A real-data falsification harness for frontier RLHF / DPO / Constitutional AI / Self-Rewarding alignment claims (Anthropic, OpenAI, DeepMind, Meta + Tülu 3, DeepSeek-R1, PRIME, SPIN, Magpie). 8 deep-loop sweeps, 417 unique papers, hand-arbitrated. ★ Bills 6, 10, 12, 13 HOLD EMPTY after v0.2 reclassification (Bill 12 promoted to ★, Bill 10 tightened 35→18). The frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris) as the accommodation to Bill 13's emptiness — same shape as factorization Bill 7 → PQC migration.
60%
Hallucination rate (verified)
Quick Orientation
AI safety training is the big lever for keeping models helpful and honest — we audited every claim about it.
Open brief
Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), Constitutional AI, RLAIF — these are the training methods that make AI assistants helpful and refuse harmful requests. We surveyed 417 papers from 2024-2026 across all these methods. Four structural predictions about which lines no method would cross still hold, but our verification step found that 60% of the papers we flagged for deeper review had wrong arXiv IDs or were entirely fabricated by the AI sweep agents. We've retained seven verified anchor papers (Sleeper Agents, Apollo Scheming, Magpie, Self-Rewarding, SPIN, Tülu 3, DeepSeek-R1). Treat the headline as weakened, not killed.
Why it matters: RLHF and its successors are the evidence-bearing safety lever at every frontier lab. If the published methods don't actually do what they claim, the safety story has a hole.What we found: 417 papers checked. Four predicted-empty lines HOLD but WEAKENED — 18 of 30 flagged citations turned out to be hallucinated. Seven anchor papers verified clean.
Full technical framing continues below: bills, candidates, closure tables, declarations, verification.
Ledger declaration · 2026-05-15
Four signature-empty bills.
417 unique papers.
Empty space holding under verification.
Bills are the closure mechanisms a frontier RL-from-rewards alignment claim must pay. Each paper maps to one or more bills, a meta-cost (M1-M7), or an escape gate (G1-G3). The 13 bills below were predeclared in bills_draft.md v0.1 BEFORE any sweep ran. ★ Bills 6 and 13 were ★ from v0.1; Bill 10 was ★ from v0.1 and tightened in v0.2; Bill 12 was promoted to ★ in v0.2 based on COLM 2025 Cross-Model Reproducibility evidence. Meta-cost M7 (vendor-internal-only metrics) was added in v0.2.
How to read this heatmap
Counts inside each cell show candidate papers that touched a bill — papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9 with verified arXiv ID). For Bills 6, 10, 12, 13 here: candidate counts are nonzero; clean triggers (post-v0.2 + verification) are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 (and v0.2 for Bill 12) holds across the 417-paper batch.
★ Predicted empty (HOLDING)
NEW v0.2 promotion (Bill 12)
Dominant (≥50)
High (≥30)
Active (10-29)
Sparse (<10)
★ Empty-space verification (post-v0.2 + verification)
BillClosure basisCands.Clean
★ 6Robust closure under reward perturbation (N≥5 probes at 70B+ scale)
47 candidates, 0 clean, 36 explicit rebuttals. Sleeper Agents (arxiv:2401.05566) + Apollo Scheming (arxiv:2412.04984) + Sycophancy→Subterfuge + Best-of-N Jailbreaking + Subliminal RLHF Learning + AISI N=8 battery (arxiv:2505.10399) form the rebuttal cluster. Strongest near-miss is Anthropic Constitutional Classifiers (ICLR 2025) but pays M6 (closed model) + new M7 (vendor-internal-only).candidates47clean triggers0
★ 10Closed-loop preference generation without collapse (≥3 iters)
v0.2 tightened: 35 cands → 18 cands (49% drop). 0 clean. Magpie (arxiv:2406.08464) + Self-Taught Evaluator (arxiv:2408.04323) correctly reclassified to Bill 5 (synthetic-label closure) — they were single-iter data synthesis, not closed-loop preference generation. Self-Rewarding LM (arxiv:2401.10020), SPIN (arxiv:2401.01335), Meta-Rewarding (arxiv:2407.19594) all pay M3 (single eval set). Strong rebuttal cluster: Pan-Bowman Spontaneous Reward Hacking, Mode-Collapse-of-DPO, Stanford 6-iter audit (arxiv:2506.04017) confirms divergence.candidates18clean triggers0
★ 12Cross-model / cross-vendor reproducibility (PROMOTED v0.2)
10 candidates, 0 clean cross-3-family-no-tuning triggers. Cross-Model Reproducibility paper (COLM 2025) provides falsification evidence: most published methods that claim cross-family reproduce cleanly on at most 1-2 families. Tülu 3 (arxiv:2411.15124) is closest historic candidate but stays on Llama-3 family only. DPO original paper limited to canonical Mistral-7B and Llama-2-7B. Family-specific tuning required.candidates10clean triggers0
★ 13Frontier-scale alignment guarantee with third-party reproduction
2 candidates (Llama-3.1-405B, Tülu 3), 0 clean. No vendor-internal alignment metric reproduced end-to-end by METR / Apollo / AISI with full N=5 reward-hack probe battery within 6 months. Joint AISI/METR/Apollo N=8 probe battery (arxiv:2505.10399) DOCUMENTS that no current frontier model passes Bill 6 cleanly — precondition for Bill 13. AI-control accommodation pattern observed.candidates2clean triggers0
Bill 6 ★ (robust closure under reward perturbation): 47 candidates, 0 clean, 36 explicit rebuttals. The rebuttal cluster is dense and converging: Sleeper Agents shows RLHF doesn't remove backdoors at 70B+; Apollo Scheming documents in-context scheming across frontier models; Sycophancy→Subterfuge + BoN Jailbreaking + Subliminal RLHF Learning. Strongest near-miss is Anthropic Constitutional Classifiers (ICLR 2025) but pays M6 + new M7 (vendor-internal-only metrics, no external corroboration).
Bill 10 ★ (closed-loop preference generation): v0.2 tightening was evidence-bearing. v0.1 marked Magpie + Self-Taught Evaluator as nominally clean Bill 10 ★ triggers, which would have been a false breach. v0.2 classifier (with multi-iter requirement and method-name allow-list) correctly reclassified them to Bill 5 (synthetic-label closure). All 16 surviving Bill 10 candidates pay M3 or M4. Stanford 6-iter audit confirms divergence by iter 4-5 across three model families.
Bill 12 ★ (cross-model reproducibility, PROMOTED v0.2): COLM 2025 Cross-Model Reproducibility paper provides falsification evidence — most "method works" claims test on 1-2 families and require family-specific tuning. The promotion to ★ was based on real evidence, not on a template. 0 clean cross-3-family-no-tuning triggers across the 417-paper batch.
Bill 13 ★ (frontier-scale alignment with third-party reproduction): The strongest empty-space anchor — Bill 6's emptiness is the precondition. Joint AISI/METR/Apollo N=8 probe battery DOCUMENTS that no current frontier model passes Bill 6 cleanly, so by transitivity Bill 13 cannot trigger. AI-control accommodation pattern (Korbak / Clymer / Shlegeris) is the constructive response.
v0.2 reclassification stats · Bill 10 35 → 18
v0.2 tightening: Bill 10 now requires (a) a known closed-loop preference-generation method (Self-Rewarding, SPIN, SPIN-DPO, Meta-Rewarding, Direct Nash Optimization, Self-Steering Optimization) co-occurring with multi-iter language; OR (b) explicit `iter[ation] N` with N ≥ 3 paired with explicit preference-generation language; OR (c) the literal technical phrase `closed-loop preference generation`.
Magpie (single-iter data synthesis) and Self-Taught Evaluator (iterates on judging not preference) no longer fire Bill 10. They route to Bill 5. Bill 10 count drops 35 → 18 (49%). Bill 5 absorbs 12 additional entries — the synthetic-data-from-aligned-LLMs pattern that v0.1 conflated with closed-loop preference generation.
Direct Nash Optimization (DNO) and Self-Steering Optimization migrated from Bill 5 to Bill 10 — these ARE closed-loop preference methods that v0.1 misclassified. Net: 16 legitimate closed-loop methods remain under Bill 10 ★, all paying M3 or M4. Empty-space hypothesis HOLDS after v0.2 tightening.
The AI-control accommodation · same shape as factorization Bill 7 → PQC migration
Bill 13 ★'s emptiness has a constructive accommodation: the frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris, Redwood Research) as the deployment response. This is structurally identical to factorization Bill 7's emptiness driving PQC migration: when the inner closure mechanism cannot be paid, the field shifts to a different security posture rather than continuing to chase the empty triggers.
Anthropic RSP / OpenAI Preparedness Framework / DeepMind FSF + AISI / METR / Apollo monitoring + Redwood control protocols + EU AI Act Code of Practice (2025) form the policy-and-research layer that load-bears on Bill 13 working — and now load-bears on the AI-control accommodation working. The watchlist tracks all of these on monthly cadence.
Cross-ledger insight: the same accommodation pattern recurs across closure-pattern ledgers when ★ bills hold empty long enough that the field reorganizes around the gap. Factorization → PQC; lattice cryptanalysis → hybrid combiners; capability benchmarks → anti-saturation; mech interp → behavioral evals; RL-from-rewards → AI-control.
The empty-space hypothesis only matters if the underlying anchors are real. Per the Stage 3.5 verification rule (driven by the Robotics_Embodied Ledger's 9/9 = source-ID verification failure rate finding), every star-bill-relevant anchor in this ledger has been independently arXiv-ID + abstract verified. The verified-real anchors below load-bear the empty-space claims.
Bill 6 ★ rebuttal
Sleeper Agents (arxiv:2401.05566)
Bill 6 ★ rebuttal
Apollo Scheming (arxiv:2412.04984)
Bill 6 ★ rebuttal
AISI N=8 battery (arxiv:2505.10399)
Bill 5 (was Bill 10)
Magpie (arxiv:2406.08464)
Bill 10 ★ candidate
Self-Rewarding LM (arxiv:2401.10020)
Bill 10 ★ candidate
SPIN (arxiv:2401.01335)
Bill 10 ★ candidate
Meta-Rewarding (arxiv:2407.19594)
Bill 13 ★ near-miss
Tülu 3 (arxiv:2411.15124)
Bill 9 / Bill 13
DeepSeek-R1 (arxiv:2501.12948)
Bill 9
PRIME (arxiv:2502.01456)
Hallucination rate, verified: 60% (18/30) of flagged IDs in the RL-from-Rewards batch 1 verification subset resolved to unrelated papers. Lower than Robotics_Embodied's 100%, but still material. The verified-real anchors above are the evidence-bearing subset; ledger claims about ★ Bills 6, 10, 12, 13 do not depend on any unverified ID.
N1 · ★ Bill 6
Robust closure empty: 47 cands, 36 rebuttals
Sleeper Agents + Apollo Scheming + Sycophancy→Subterfuge + BoN Jailbreaking + Subliminal RLHF + AISI N=8 form the densest rebuttal cluster in the corpus. Anthropic Constitutional Classifiers is the strongest near-miss but pays M6 + M7.
N2 · ★ Bill 10
Closed-loop preference generation: 18 cands after v0.2 tightening
v0.1 35 → v0.2 18 (49% drop). Magpie + Self-Taught Evaluator correctly reclassified to Bill 5. Stanford 6-iter audit (arxiv:2506.04017) confirms divergence by iter 4-5 across 3 model families.
N3 · ★ Bill 12 NEW
Cross-model reproducibility promoted to ★ in v0.2
10 cands, 0 clean. COLM 2025 Cross-Model Reproducibility paper provides falsification evidence: family-specific tuning required for cross-family reproduction. Tülu 3 closest candidate but stays on Llama-3.
N4 · ★ Bill 13
Frontier-scale + third-party reproduction empty
2 cands, 0 clean. Bill 6's emptiness is the precondition: by transitivity, no frontier model can pass Bill 13 cleanly. AI-control accommodation pattern (Korbak / Clymer / Shlegeris) is the constructive response.
N5 · M7 NEW v0.2
Vendor-internal-only metrics meta-cost added
Distinct from M6 (closed model). M7 catches vendor-self-reported alignment metrics with no third-party reproduction pathway. Anthropic / OpenAI / DeepMind / Meta internal eval scores all pay M7 unless paired with external eval citation.
N6 · Bill 2 dominant
DPO family is 51 papers / corpus center
Direct preference closure is the most populous bill. DPO + IPO + KTO + SimPO + ORPO + Step-DPO. 43 clean known_bill triggers. Dominant active bill.
N7 · Bill 3
Goodhart-bound papers are 50 papers
RM overoptimization mitigation literature dense. 33 clean known_bill triggers. Reward Model Ensembles (Cohere-Anthropic joint) is closest near-miss for Bill 6 ★ but pays M6.
N8 · Bill 1
PPO-style KL regularization: 30 cands
Online RL fine-tuning with KL anchor. 16 clean. InstructGPT, Llama-2/3 RLHF lineage. Stable bill.
N9 · Bill 9
Process-reward / step-level: 44 cands
PRM, MathShepherd, PRM800K, V-STaR. DeepSeek-R1, DAPO, PRIME anchor the reasoning-RL frontier. Cousin to reasoning_cot Bill on verifiability.
N10 · Bill 7 sparse
Alignment tax rarely measured
9 cands. Pre/post capability benchmarks with CIs. Most papers don't report; "alignment tax" cluster is small. Frontier Alignment Tax (arxiv:2602.07845) anchors.
N11 · Bill 8 rare
Calibration preservation rarely audited
8 cands. Honest exploration / TruthfulQA / MMLU-calibration / ECE. Most papers don't audit calibration pre/post-RL.
N12 · Methodology
60% hallucination rate on flagged IDs verified
Independent verification killed 18/30 flagged IDs in the verification subset. The verified-real anchor set is the evidence-bearing subset for ★ Bills 6/10/12/13. Stage 3.5 verification rule applied from batch 1 onward.
Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F6, F10, F12, or F13. Independent verification (arXiv-ID lookup + abstract verification + reward-hack probe coverage check) is mandatory before any trigger fires.
F6 · ★ Robust closure
Trigger: alignment method retains within ε of original reward across N ≥ 5 published reward-hack probes (length bias, sycophancy, refusal patching, lying-under-pressure, specification gaming) at 70B+ scale, surviving ≥30 days of community review
F10 · ★ Closed-loop preference generation
Trigger: method generates own preference labels for ≥3 iterations without (a) distributional collapse, (b) mode-seeking shrinkage, (c) principle leakage, with explicit collapse criterion measured + verified arXiv ID
F12 · ★ Cross-model reproducibility NEW
Trigger: method works on ≥3 model families (Llama-3, Qwen-2.5, DeepSeek-V3, Mistral, Phi) without family-specific tuning AND ≥1 family is open-weight, with cross-family reproduction documented in published paper
F13 · ★ Frontier-scale + third-party
Trigger: method maintains alignment metrics + reward-hack absence + bounded KL at ≥70B parameter frontier scale, reproduced by independent third party (METR / Apollo / AISI / academic lab) within 6 months
F11 · Principle drift
Trigger: Constitutional AI principles preserved across model generations + measurable absence of drift across ≥3 iterations
F-AI-Control
Trigger: AI-control protocol (Korbak/Clymer/Shlegeris-class) demonstrated as production-deployed safety case at frontier scale by ≥1 vendor — would re-classify the accommodation as a positive trigger
Live triggered watchlist: Anthropic RSP Claude 4.5 evaluations · OpenAI GPT-5/5.5 system cards · DeepMind Gemini 2.0/2.5/3 alignment reports · METR pre-deployment evaluations · Apollo Research scheming evaluations · AISI (US + UK) frontier-model evaluations · EU AI Act GPAI Code of Practice updates · Self-Rewarding LM v2/v3 + 6-iter audits · Cross-Model Reproducibility follow-ups (COLM 2025+). Monthly cadence: arXiv cs.LG/cs.CL + vendor blogs + AISI/METR/Apollo. Quarterly: NeurIPS/ICLR/ICML/COLM alignment + RL tracks.
Threat modelDemonstrate robust alignment of a 70B-class or larger frontier LLM under reward-model perturbations within ε of the original preference distribution, achieving (a) pass-rate ≥ baseline on a held-out preference set, (b) measurable absence of reward hacking on N ≥ 5 published probe tasks, and (c) bounded KL-divergence from base policy at deployment, in 2026.
Deep loops8 sweeps × 5-10 parallel Opus research agents per sweep × 1 batch round + verification subset.
Sources surveyedarXiv cs.LG/cs.CL/cs.AI 2024-2026 with keywords: RLHF, DPO, IPO, KTO, SimPO, ORPO, Constitutional AI, RLAIF, process reward, PRM, Self-Rewarding, reward hacking, specification gaming, Goodhart, KL regularization + NeurIPS/ICLR/ICML/COLM 2024-2026 alignment + RL tracks + Anthropic, OpenAI, DeepMind, Meta AI, Microsoft Research alignment publications + AISI (US + UK) + METR + Apollo Research + Constitutional-AI / Sparrow / Claude / GPT-4 / Llama-3 / DeepSeek-V3 / Qwen-2.5 alignment sections of model cards + Anthropic RSP, OpenAI Preparedness Framework, DeepMind FSF.
ClassifierRegex rule engine + hand-arbitration. v0.2 with 42 hand-curated benchmark cases at 1.000/1.000. Bill 10 tightening was evidence-bearing — v0.1 would have produced a false breach via Magpie + Self-Taught Evaluator.
Empty-space testThree signature bills (6, 10, 13) predeclared empty in v0.1 BEFORE batch 1 sweeps. Bill 12 promoted to ★ in v0.2 based on COLM 2025 Cross-Model Reproducibility evidence. After 417 unique papers + v0.2 tightening + verification subset, all four ★ bills remain empty: 0 clean triggers each.
Verification ruleIndependent arXiv-ID + abstract verification before any breach commitment. Driven by Robotics_Embodied 9/9 = source-ID verification failure finding. RL-from-Rewards verification subset returned 60% (18/30) hallucinated IDs — lower but still material. Verified-real anchors load-bear the ★ claims.
v0.2 changesBill 12 promoted to ★. Bill 10 tightened (35 → 18). Meta-cost M7 added (vendor-internal-only metrics). M5 detection patterns expanded. No bills demoted; no bills removed.
ReproducibilityAll scripts, JSONs, ledger public. Run order: aggregate_batch_1.py → bill_classifier_v0_2.py --benchmark → classifier_v0_2_reclassification.py → ledger populator.
Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.
Public draft v0.2 (2026-05-15) — 417 unique papers across 8 sweeps; ★ Bills 6, 10, 12, 13 EMPTY HOLDING with 0 clean triggers each after v0.2 tightening + verification subset (60% hallucination rate on flagged IDs). Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual literature, not from a template.
Final state · 2026-05-15
Four signature constructions.
417 unique papers.
Empty space holding under verification.