The RL-from-rewards ledger audits 417 unique papers across 8 deep-loop sweeps (2024–2026) on frontier RLHF / DPO / Constitutional AI / RLAIF / Self-Rewarding alignment claims (Anthropic Constitutional Classifiers, OpenAI Process-Reward, DeepMind Frontier Safety, Meta Llama-3 RLHF + Tülu 3 (arxiv:2411.15124), DeepSeek-R1 (arxiv:2501.12948), PRIME (arxiv:2502.01456), SPIN (arxiv:2401.01335), Self-Rewarding LM (arxiv:2401.10020), Magpie (arxiv:2406.08464), Meta-Rewarding (arxiv:2407.19594)). The harness consists of 13 bills; seven meta-costs (M7 added in v0.2); and three escape gates.
Three bills were predicted ★ empty in v0.1: Bill 6 ★ (robust closure under reward perturbation, N≥5 probes at 70B+ scale), Bill 10 ★ (closed-loop preference generation without distributional collapse, ≥3 iters), Bill 13 ★ (frontier-scale alignment guarantee with third-party reproduction). v0.2 promoted Bill 12 to ★ (cross-model / cross-vendor reproducibility) based on COLM 2025 Cross-Model Reproducibility evidence, and tightened Bill 10's engagement criterion to require ≥3 iterations of closed-loop preference generation (excluding Magpie + Self-Taught Evaluator, which were correctly reclassified to Bill 5 — synthetic-label closure).
After 417 unique papers + v0.2 reclassification + verification subset, all four ★ bills HOLD EMPTY. Bill 6 has 47 candidates, 0 clean, 36 explicit rebuttals — Sleeper Agents (arxiv:2401.05566), Apollo Scheming (arxiv:2412.04984), Sycophancy→Subterfuge, BoN Jailbreaking, Subliminal RLHF Learning, AISI N=8 battery (arxiv:2505.10399). Bill 10 dropped 35→18 (49%) under v0.2 tightening. Bill 12 has 0 clean cross-3-family-no-tuning triggers. Bill 13 has 2 candidates and 0 clean. The frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris) as the constructive accommodation to Bill 13's emptiness — same shape as factorization Bill 7's emptiness driving PQC migration.
Methodology: independent arXiv-ID and abstract verification (per the Stage 3.5 rule pioneered by Robotics_Embodied) found that 18 of the 30 IDs in the flagged subset (60%) were hallucinated. The ★ claims rest on the verified-real anchor set; ledger conclusions about Bills 6/10/12/13 do not depend on any unverified ID.
Threat model (verbatim from purpose.md): Demonstrate robust alignment of a 70B-class or larger frontier LLM under reward-model perturbations within ε of the original preference distribution, achieving (a) pass-rate ≥ baseline on a held-out preference set, (b) measurable absence of reward hacking on N ≥ 5 published probe tasks, and (c) bounded KL-divergence from base policy at deployment, in 2026.
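The three conditions in the threat model can be read as a single conjunctive predicate. The following is a minimal sketch under stated assumptions: the class `AlignmentClaim`, its field names, and the function `meets_threat_model` are hypothetical and do not come from purpose.md, and the KL bound is left as a per-claim field because the threat model only requires that one exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentClaim:
    """One claimed result, reduced to the quantities the threat model names.

    All field names are illustrative assumptions, not ledger schema.
    """
    params_b: float              # model size in billions of parameters
    pass_rate: float             # held-out preference pass-rate under perturbed reward
    baseline_pass_rate: float    # same metric for the unperturbed baseline
    probes_without_hacking: int  # published probe tasks with no measured reward hacking
    kl_from_base: float          # measured KL divergence from the base policy
    kl_bound: float              # bound the claim commits to at deployment


def meets_threat_model(claim: AlignmentClaim,
                       min_params_b: float = 70.0,
                       min_probes: int = 5) -> bool:
    """Return True only if scale and conditions (a), (b), (c) all hold."""
    scale_ok = claim.params_b >= min_params_b
    cond_a = claim.pass_rate >= claim.baseline_pass_rate   # (a) pass-rate >= baseline
    cond_b = claim.probes_without_hacking >= min_probes    # (b) N >= 5 probes clean
    cond_c = claim.kl_from_base <= claim.kl_bound          # (c) bounded KL
    return scale_ok and cond_a and cond_b and cond_c
```

Because the predicate is a conjunction, a claim that covers only four probe tasks fails even if every other quantity is favorable.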
Equivalent restatements per technique:
- Online RLHF (PPO-shape): maintain ≤0.05 KL from base policy across 50K training steps without entering reward-hacking regime; published probe tasks include length bias, sycophancy, refusal patching, jailbreak susceptibility, lying-under-pressure.
- Direct preference (DPO/IPO/KTO/SimPO/ORPO): match or beat PPO on AlpacaEval-2 + Arena-Hard + held-out preference set without specification-gaming on canonical reward-hack probes.
- Constitutional AI / RLAIF: replace human preference labels with model-generated preferences while preserving (a)+(b)+(c) without principle-drift.
- Process-reward / step-level reward (PRM-shape, MathShepherd, OpenAI PRM800K lineage): step-level reward models on multi-step reasoning without step-reward hacking.
- Self-play / synthetic preferences (Anthropic constitutional, Self-Rewarding LM, SPIN): closed-loop preference generation without distributional collapse.
| # | Bill (v0.2) | What gets paid | Star |
|---|---|---|---|
| 1 | PPO-style KL regularization | Online RL fine-tuning with KL anchor; bound + anchor specified | |
| 2 | Direct preference closure | DPO/IPO/KTO/SimPO/ORPO/Step-DPO closed-form preference loss | |
| 3 | RM overoptimization mitigation | Goodhart-bound proxy/true gap characterized or bounded | |
| 4 | RM identifiability | RM uniquely identifiable from preference data | |
| 5 | Constitutional AI / RLAIF | Model-generated preferences + bounded principle drift (absorbs Magpie + Self-Taught Evaluator from v0.2) | |
| 6 | Robust closure under reward perturbation | Robust over N≥5 reward-hack probes at 70B+ scale | ★ |
| 7 | Alignment tax quantification | Pre/post capability benchmarks with CIs | |
| 8 | Honest exploration / calibration | Preserved calibration on TruthfulQA / MMLU / ECE | |
| 9 | Process-reward / step-level closure | PRM, MathShepherd, PRM800K, V-STaR | |
| 10 | Closed-loop preference generation (TIGHTENED v0.2) | Self-Rewarding/SPIN/Magpie ≥3 iters w/o collapse — must be preference generation (not data synthesis) | ★ |
| 11 | Principle-drift / constitution-leakage | Constitutional AI principles preserved across iterations | |
| 12 | Cross-model / cross-vendor reproducibility (PROMOTED v0.2) | Method works on ≥3 model families w/o family-specific tuning AND ≥1 family open-weight | ★ |
| 13 | Frontier-scale alignment + third-party reproduction | ≥70B + reward-hack probe battery + third-party reproduction within 6 months | ★ |
Seven meta-costs (M7 added v0.2): M1 (pre-2024 era), M2 (no reward signal), M3 (single-evaluation-set), M4 (single-task / narrow-domain), M5 (resource-unbounded — TIGHTENED v0.2, expanded patterns), M6 (implementation-specific / closed-model-only), M7 (vendor-internal-only metrics — NEW v0.2). M7 is distinct from M6 — it catches vendor-self-reported alignment metrics with no third-party reproduction pathway. Anthropic / OpenAI / DeepMind / Meta internal eval scores all pay M7 unless paired with external eval citation.
Three escape gates: G1 (theoretical / proof paper), G2 (tooling / infra paper — TRL, OpenRLHF, vLLM-RLHF), G3 (survey / position paper).
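The bill / meta-cost / escape-gate structure can be sketched as a small data model. This is an illustration rather than the ledger's actual schema: `LedgerEntry`, `pays_m7`, and `is_clean_star_trigger` are hypothetical names, and the definition of "clean" used here (a ★ bill with no meta-cost paid and no escape gate taken) is an assumption inferred from how the text uses the word.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

STAR_BILLS = frozenset({6, 10, 12, 13})                      # star bills as of v0.2
META_COSTS = frozenset({"M1", "M2", "M3", "M4", "M5", "M6", "M7"})
ESCAPE_GATES = frozenset({"G1", "G2", "G3"})


@dataclass(frozen=True)
class LedgerEntry:
    """One audited paper (hypothetical schema)."""
    paper: str
    bill: int                                   # 1..13
    meta_costs: FrozenSet[str] = frozenset()    # subset of META_COSTS
    gate: Optional[str] = None                  # one of ESCAPE_GATES, or None

    def __post_init__(self):
        if not 1 <= self.bill <= 13:
            raise ValueError(f"unknown bill: {self.bill}")
        if not self.meta_costs <= META_COSTS:
            raise ValueError(f"unknown meta-cost in {set(self.meta_costs)}")
        if self.gate is not None and self.gate not in ESCAPE_GATES:
            raise ValueError(f"unknown escape gate: {self.gate}")


def pays_m7(vendor_internal_metrics: bool, has_external_eval_citation: bool) -> bool:
    """M7 applies to vendor-internal metrics unless paired with an external eval."""
    return vendor_internal_metrics and not has_external_eval_citation


def is_clean_star_trigger(entry: LedgerEntry) -> bool:
    """Assumed meaning of 'clean': star bill, no meta-cost paid, no gate taken."""
    return entry.bill in STAR_BILLS and not entry.meta_costs and entry.gate is None
```

Under this reading, an entry such as `LedgerEntry("Constitutional Classifiers", 6, frozenset({"M6", "M7"}))` is a Bill 6 candidate but not a clean trigger, which matches how the ledger treats it.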
Bill 6 ★ (robust closure under reward perturbation) has 47 candidates and 0 clean triggers. The rebuttal cluster is the densest in the corpus — 36 explicit rebuttals across 19% of all 417 papers. Anchors (all verified-real arXiv IDs):
The strongest near-miss is Anthropic Constitutional Classifiers (ICLR 2025) with thousands of hours of red-teaming. It pays M6 (closed model) and the new M7 (vendor-internal-only metrics with no external corroboration pathway). The empty-space hypothesis HOLDS.
Bill 10 candidate count dropped 35 → 18 (49%) under v0.2 tightening. Magpie (arxiv:2406.08464) and Self-Taught Evaluator (arxiv:2408.04323) were nominally-clean Bill 10 ★ "breach" candidates in v0.1 — but Magpie is single-iter data synthesis and Self-Taught Evaluator iterates on judging not preference generation. Both correctly reclassify to Bill 5 (synthetic-label closure) under v0.2 rules.
v0.2 Bill 10 now requires (a) a known closed-loop preference-generation method (Self-Rewarding, SPIN, SPIN-DPO, Meta-Rewarding, Direct Nash Optimization, Self-Steering Optimization) co-occurring with multi-iter language; OR (b) explicit `iter[ation] N` with N ≥ 3 paired with explicit preference-generation language; OR (c) the literal technical phrase `closed-loop preference generation`. Net effect: a false breach was caught by classifier rule-tightening before it shipped to the public ledger.
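The three-way rule above can be sketched as a text predicate. This is a minimal illustration, not the ledger's classifier: the regular expressions for "multi-iter language" and "preference-generation language" are assumptions, since the source does not list the exact patterns, and `engages_bill10` is a hypothetical name.

```python
import re

# (a) known closed-loop preference-generation methods, as named in the v0.2 rule
METHOD_RE = re.compile(
    r"\b(?:self-rewarding|spin(?:-dpo)?|meta-rewarding|"
    r"direct nash optimization|self-steering optimization)\b",
    re.IGNORECASE,
)
# Assumed stand-in for "multi-iter language"
MULTI_ITER_RE = re.compile(r"\b(?:multi-iter\w*|iterat\w+|iter\s*\d+)\b", re.IGNORECASE)

# (b) explicit "iter N" / "iteration N", with N captured
ITER_N_RE = re.compile(r"\biter(?:ation)?\s*(\d+)\b", re.IGNORECASE)
# Assumed stand-in for "explicit preference-generation language"
PREF_GEN_RE = re.compile(r"\bpreference[- ]generation\b", re.IGNORECASE)

# (c) the literal technical phrase
LITERAL_RE = re.compile(r"closed-loop preference generation", re.IGNORECASE)


def engages_bill10(text: str, min_iters: int = 3) -> bool:
    """Return True if the text satisfies rule (a), (b), or (c)."""
    rule_a = bool(METHOD_RE.search(text)) and bool(MULTI_ITER_RE.search(text))
    rule_b = bool(PREF_GEN_RE.search(text)) and any(
        int(m.group(1)) >= min_iters for m in ITER_N_RE.finditer(text)
    )
    rule_c = bool(LITERAL_RE.search(text))
    return rule_a or rule_b or rule_c
```

A description of single-pass data synthesis that names no listed method and no iteration count matches none of the three rules, which is the behavior the tightening was meant to produce for Magpie-style entries.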
The 18 surviving Bill 10 candidates are legitimate closed-loop preference methods, including Self-Rewarding (arxiv:2401.10020), SPIN (arxiv:2401.01335), SPIN-DPO (arxiv:2505.16020), Meta-Rewarding (arxiv:2407.19594), Direct Nash Optimization (arxiv:2408.06266), Self-Steering Optimization (arxiv:2410.17243), Self-Reward Distillation, and Closed-Loop Preference Generation with Decay Regularization. All pay M3 or M4. Direct Nash Optimization and Self-Steering Optimization migrated from Bill 5 to Bill 10; these are closed-loop preference methods that v0.1 had misclassified.
The Stanford 6-iteration audit (arxiv:2506.04017) confirms divergence by iter 4-5 across three model families. Pan-Bowman "Spontaneous Reward Hacking in Iterative Self-Refinement" + Mode-Collapse-of-DPO theoretical proof + arxiv:2503.01307 ("Hidden Cost of Self-Rewarding: Coverage Collapse in Iter4+ Self-Play DPO") complete the rebuttal cluster.
Bill 12 (cross-model / cross-vendor reproducibility) was promoted to ★ in v0.2 because the COLM 2025 Cross-Model Reproducibility paper provides falsification evidence: most "method works" claims are tested on only 1-2 model families and require family-specific tuning. The promotion was based on real evidence, not on a template.
Bill 12 has 10 candidates and 0 clean cross-3-family-no-tuning triggers. Tülu 3 (arxiv:2411.15124) is the closest historic candidate (open-weight, frontier-scale), but it stays within the Llama-3 family. The original DPO paper is limited to canonical Mistral-7B and Llama-2-7B. The empty-space hypothesis is intact at v0.2 lock conditions.
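The Bill 12 criterion (at least three model families without family-specific tuning, at least one of them open-weight) can be written as a short check. The sketch below is illustrative only: `FamilyResult` and `bill12_clean` are hypothetical names, and it assumes the open-weight family must be among the families on which the method works without tuning.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FamilyResult:
    """Outcome of one method on one model family (hypothetical schema)."""
    family: str
    open_weight: bool
    family_specific_tuning: bool
    method_works: bool


def bill12_clean(results: Iterable[FamilyResult], min_families: int = 3) -> bool:
    """Check >=3 distinct families w/o family-specific tuning, >=1 open-weight."""
    # Keep one result per family, and only untuned successes count.
    qualifying = {
        r.family: r
        for r in results
        if r.method_works and not r.family_specific_tuning
    }
    enough_families = len(qualifying) >= min_families
    has_open_weight = any(r.open_weight for r in qualifying.values())
    return enough_families and has_open_weight
```

A method evaluated on a single open-weight family, as the ledger describes for Tülu 3, fails the family-count condition regardless of how strong the single-family result is.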
Bill 13 ★ (frontier-scale alignment guarantee with third-party reproduction) has 2 candidates and 0 clean triggers. Bill 6's emptiness is the precondition: by transitivity, no frontier model can pass Bill 13 cleanly because no frontier model passes Bill 6 cleanly. The Joint AISI / METR / Apollo N=8 probe battery (arxiv:2505.10399) documents this transitive dependence.
Bill 13 ★'s emptiness has a constructive accommodation: the frontier is converging on AI-control protocols (Korbak / Clymer / Shlegeris, Redwood Research) as the deployment response. This is structurally identical to factorization Bill 7's emptiness driving PQC migration: when the inner closure mechanism cannot be paid, the field shifts to a different security posture rather than continuing to chase the empty triggers.
Anthropic RSP / OpenAI Preparedness Framework / DeepMind FSF + AISI / METR / Apollo monitoring + Redwood control protocols + EU AI Act Code of Practice (2025) form the policy-and-research layer on which the AI-control accommodation depends. Cross-ledger insight: the same accommodation pattern recurs across closure-pattern ledgers when ★ bills hold empty long enough that the field reorganizes around the gap. Factorization → PQC; lattice cryptanalysis → hybrid combiners; capability benchmarks → anti-saturation; mech interp → behavioral evals; RL-from-rewards → AI-control.
This ledger adopted the Stage 3.5 verification rule from batch 1 onward. The rule was pioneered by Robotics_Embodied after its 9/9 = 100% source-ID verification failure rate killed an entire breach claim. The discipline: independent arXiv-ID lookup + abstract verification before any star-bill clean trigger commits to the public ledger.
76 entries (18.2% of the 417-paper batch) were self-flagged by sweep agents as low confidence (<0.7) and explicitly noted as uncertain in their notes fields. This was a methodological improvement over Robotics_Embodied batch 1, where the breach IDs were marked confidence 0.93-0.95 without self-flagging.
An arXiv-ID verification subset of 30 flagged candidates returned 18 hallucinated IDs (60%), lower than Robotics_Embodied's 100% but still material. The verification killed all candidate clean triggers for ★ Bills 6, 10, 12, 13 that depended on flagged IDs. The empty-space hypothesis for all four ★ bills HOLDS without dependence on any unverified ID.
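The verification step (ID lookup plus abstract check) and the resulting hallucination rate can be sketched as follows. This is a hedged illustration, not the ledger's tooling: `normalize_arxiv_id`, `verify_subset`, and `hallucination_rate` are hypothetical names, the `fetch_abstract` callable is a placeholder for whatever lookup is actually used, and matching an expected phrase against the abstract is an assumed stand-in for abstract verification. The regular expression covers only new-style arXiv identifiers (`YYMM.NNNN` or `YYMM.NNNNN`); a well-formed ID is not evidence that the paper exists.

```python
import re
from typing import Callable, Dict, Optional

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
ARXIV_ID_RE = re.compile(
    r"^(?:arxiv:)?(\d{2})(0[1-9]|1[0-2])\.(\d{4,5})(?:v\d+)?$", re.IGNORECASE
)


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Return the bare ID (no prefix, no version), or None if malformed."""
    m = ARXIV_ID_RE.match(raw.strip())
    if m is None:
        return None
    return f"{m.group(1)}{m.group(2)}.{m.group(3)}"


def verify_subset(
    candidates: Dict[str, str],
    fetch_abstract: Callable[[str], Optional[str]],
) -> Dict[str, bool]:
    """Map each cited ID to True only if it resolves AND its abstract matches.

    candidates maps a cited ID to a phrase the abstract is expected to contain.
    fetch_abstract is a caller-supplied lookup returning None for unknown IDs.
    """
    results = {}
    for raw_id, expected_phrase in candidates.items():
        bare = normalize_arxiv_id(raw_id)
        abstract = fetch_abstract(bare) if bare is not None else None
        results[raw_id] = (
            abstract is not None and expected_phrase.lower() in abstract.lower()
        )
    return results


def hallucination_rate(results: Dict[str, bool]) -> float:
    """Fraction of checked IDs that failed verification."""
    if not results:
        raise ValueError("no IDs were checked")
    return sum(1 for ok in results.values() if not ok) / len(results)
```

With 18 failures among 30 checked IDs, `hallucination_rate` returns 0.6, the figure reported for the flagged subset.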
The verified-real anchor set is the evidence-bearing subset for ★ Bill claims:
Combining Robotics_Embodied's 9/9 = 100% finding with this ledger's 60% finding produces a methodological contribution: sweep agents hallucinate arXiv IDs at material rates (60-100% on flagged subsets), and independent verification is evidence-bearing for any breach claim. Sweep-agent self-flagging is a useful methodological improvement (RL-from-Rewards agents flagged 18.2% of entries as low-confidence), but it is not sufficient: even IDs reported with high confidence can be hallucinated.
Classifier rule-tightening (the Bill 10 35→18 example) catches some false breaches but cannot catch hallucinated source IDs. The two methodological tools — classifier rules and ID verification — are complementary, not redundant. Both are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
v0.2 lock conditions: classifier 1.000/1.000 on ≥40 hand-curated cases (achieved at 42 cases); v0.2 reclassification report shows Magpie + Self-Taught Evaluator correctly reclassified away from Bill 10 ★ (achieved); Stage 5 wiki populate complete; Stage 6 atlas review with human_validation_queue + atlas injection; Stage 7 lock when zero unclassified entries, watchlist ≥30 entries (achieved at 98 monthly + 211 quarterly + 80 triggered = 389 total), falsifiers ≥10 conditions (achieved at 13), author-activity panel mapped.
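The numeric lock conditions lend themselves to a simple checklist function. The sketch below is an assumption-laden illustration: `LockStatus` and `unmet_lock_conditions` are hypothetical names, the two classifier scores are kept as an unlabeled pair because the source does not say which metrics "1.000/1.000" refers to, and the Stage 5 and Stage 6 process steps are left out because they are not expressed as thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LockStatus:
    """Snapshot of the quantities named in the lock conditions (hypothetical)."""
    classifier_scores: Tuple[float, float]  # the "1.000/1.000" pair, metrics unspecified
    curated_cases: int                      # hand-curated classifier test cases
    unclassified_entries: int               # must be zero at Stage 7
    watchlist: Tuple[int, int, int]         # (monthly, quarterly, triggered)
    falsifier_conditions: int
    author_panel_mapped: bool


def unmet_lock_conditions(s: LockStatus) -> List[str]:
    """Return the list of conditions that still block the lock (empty if none)."""
    unmet = []
    if min(s.classifier_scores) < 1.0:
        unmet.append("classifier below 1.000/1.000")
    if s.curated_cases < 40:
        unmet.append("fewer than 40 curated cases")
    if s.unclassified_entries != 0:
        unmet.append("unclassified entries remain")
    if sum(s.watchlist) < 30:
        unmet.append("watchlist below 30 entries")
    if s.falsifier_conditions < 10:
        unmet.append("fewer than 10 falsifier conditions")
    if not s.author_panel_mapped:
        unmet.append("author-activity panel not mapped")
    return unmet
```

Plugging in the reported figures (42 cases, a 98 + 211 + 80 watchlist, 13 falsifiers) leaves only the unclassified-entry count and the author-activity panel as inputs the text does not report as achieved.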
Monthly cadence: arXiv cs.LG/cs.CL + Anthropic / OpenAI / DeepMind safety blogs + AISI (US + UK) + METR + Apollo Research. Quarterly: NeurIPS / ICLR / ICML / COLM alignment + RL tracks. Triggered: any frontier-scale (≥70B) RLHF method with full N≥5 reward-hack probe battery; any third-party reproduction of vendor-internal alignment claim; any new self-rewarding / SPIN / Magpie iteration paper; any Anthropic / OpenAI / DeepMind / Meta safety publication.
The full preprint, with sweep-level evidence, the paper-by-paper audit ledger, and v0.3 lock conditions, will be released on arXiv after the verification subset is independently audited by ≥1 third party and the classifier hits 1.000/1.000 on ≥50 hand-curated cases across all 13 bills + 7 meta-costs + 3 escape gates.