# RL-from-Rewards Aiwiki — Bills Draft (v0.1)

> Stage 2 (BILLS). **13 candidate bills + 6 meta-costs + 3 escape gates**, with **3 ★ predicted-empty** at positions 6, 10, 13.
>
> Predeclared empty-space hypothesis: no 2024–2026 paper triggers Bills 6 / 10 / 13 cleanly without paying meta-costs.

## The thirteen bills

A "bill" is a closure mechanism that any RL-from-rewards alignment claim must engage. We name them by the structural primitive they invoke.

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | **PPO-style KL regularization** | Online RL fine-tuning with KL-divergence anchor to base policy. Claim must specify KL bound, anchor policy, and demonstrate that bound is honored across training. Lineage: InstructGPT (2022), Llama-2 RLHF (2023), Llama-3 RLHF (2024). | |
| 2 | **Direct preference closure** | Closed-form preference loss without explicit reward model. DPO, IPO, KTO, SimPO, ORPO, Step-DPO. Must demonstrate closed-form loss derivation + empirical match-or-beat PPO on a held-out preference set. | |
| 3 | **Reward-model overoptimization mitigation** | Goodhart-bound proxy/true-reward gap. Bill paid by either characterizing RM overoptimization curve, demonstrating early-stopping criterion, or showing intervention that bounds proxy-true gap. Lineage: Gao-Schulman 2022, Coste-Anwar 2024, RM-Bench. | |
| 4 | **Reward-model identifiability** | Reward model uniquely identifiable from preference data. Bill paid by either uniqueness proof or distributional-equivalence argument. Active failure mode: under-determined RM exploited by RL. | |
| 5 | **Constitutional AI / RLAIF synthetic-label closure** | Replace human labels with model-generated preferences. Bill paid by demonstrating principle-following + comparable downstream alignment metrics + bounded principle-drift. Lineage: Bai-Anthropic 2022 → CAI-at-scale 2025–2026. | |
| 6 | **★ Robust closure under reward perturbation** | Method retains alignment within ε of original reward across N ≥ 5 published reward-hack probe tasks (length bias, sycophancy, refusal patching, lying-under-pressure, specification gaming) at 70B+ scale. ★ **Empty-space candidate**: predicted that no 2024–2026 paper triggers this cleanly without paying M3 (single-evaluation-set), M4 (single-task), or M5 (resource-unbounded). | ★ |
| 7 | **Alignment tax quantification** | Quantitative measurement of capability degradation post-alignment on held-out capability benchmarks (MMLU-Pro, GPQA, GSM8K, HumanEval). Bill paid by reporting before/after numbers with confidence intervals. | |
| 8 | **Honest exploration / uncertainty calibration** | RL doesn't degrade model's calibration / honest-uncertainty signaling. Bill paid by demonstrating preserved calibration on TruthfulQA, MMLU calibration, or expected-calibration-error metric pre/post-RL. | |
| 9 | **Process-reward / step-level closure** | Step-level rewards on multi-step reasoning without step-reward hacking. PRM-shape, MathShepherd, OpenAI PRM800K lineage, V-STaR. Bill paid by step-level evaluation + absence of intermediate-step hacking. | |
| 10 | **★ Closed-loop preference generation without distributional collapse** | Method generates its own preference labels (Self-Rewarding, SPIN, Magpie) for ≥3 iterations without (a) distributional collapse, (b) mode-seeking shrinkage, (c) principle leakage. ★ **Empty-space candidate**: predicted that all current self-rewarding methods collapse by iteration 4–5 or pay M5 (resource-unbounded re-baselining). | ★ |
| 11 | **Principle-drift / constitution-leakage audit** | Constitutional AI principles preserved across model generations / RL iterations / fine-tuning passes. Bill paid by principle-following audit + measurable absence of drift. | |
| 12 | **Cross-model / cross-vendor reproducibility** | Method works across multiple model families (Llama-3, Qwen-2.5, DeepSeek-V3, Mistral) without family-specific tuning. Bill paid by reproducing the headline result on ≥3 model families. | |
| 13 | **★ Frontier-scale alignment guarantee** | Method demonstrably maintains alignment metrics + reward-hack absence + bounded KL at ≥70B parameter frontier scale, reproduced by independent third party (METR / Apollo / AISI / academic lab) within 6 months of publication. ★ **Empty-space candidate** — predicted no 2024–2026 paper triggers this cleanly. Closest historic candidate: Llama-3.1-405B Instruct alignment, but no independent third-party reproduction with full reward-hack probe battery. | ★ |

★ = signature construction; empty-space hypothesis predicts no paper in 2024–2026 corpus triggers cleanly without paying meta-costs.
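
Bill 8 names expected calibration error (ECE) as one acceptable pre/post-RL metric. The sketch below shows one common equal-width-bin form of ECE, plus a pass rule. The function names, the 10-bin default, and the `tolerance` value are illustrative assumptions and are not fixed anywhere in this draft.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Per-bin running totals: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


def calibration_preserved(ece_pre: float, ece_post: float, tolerance: float = 0.01) -> bool:
    """Hypothetical pass rule: post-RL ECE may exceed pre-RL ECE by at most `tolerance`."""
    return ece_post - ece_pre <= tolerance
```

In use, `expected_calibration_error` would be computed once on the base policy and once on the RL-tuned policy over the same evaluation items, and the two values passed to `calibration_preserved`.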

## Six meta-costs (disqualifying conditions)

| # | Meta-cost | Description |
|---:|---|---|
| M1 | **Pre-2024 era** | InstructGPT, summarization-RL, WebGPT, original RLHF — historical anchors only |
| M2 | **No reward signal** | SFT-only, instruction-tuning-only, no preference / reward / value signal trained |
| M3 | **Single-evaluation-set** | Result demonstrated on one benchmark only (e.g., AlpacaEval-2 only); no orthogonal-eval cross-check |
| M4 | **Single-task / narrow-domain** | Math-only, code-only, summarization-only — no generalization claim |
| M5 | **Resource-unbounded** | Compute budget >100× equivalent PPO baseline; or requires >100K human labels (not realistic at frontier scale) |
| M6 | **Implementation-specific** | Result on one closed model only (e.g., GPT-4-turbo only) without reproducibility on open-weight models |
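
The meta-cost table can be read as a set of predicates over a paper record. The following is a minimal sketch under that reading. `PaperRecord` and all of its field names are hypothetical, and reducing M3 and M4 to simple counts is a simplification of the prose descriptions. The M1 and M5 thresholds are taken from the table.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class PaperRecord:
    # Hypothetical fields for illustration; not an actual aiwiki schema.
    year: int
    trains_on_reward_signal: bool
    n_eval_benchmarks: int
    n_task_domains: int
    compute_ratio_vs_ppo: float  # compute budget relative to an equivalent PPO baseline
    n_human_labels: int
    closed_model_only: bool


def meta_costs(paper: PaperRecord) -> Set[str]:
    """Return the set of meta-cost IDs (M1..M6) that the paper pays."""
    costs: Set[str] = set()
    if paper.year < 2024:
        costs.add("M1")  # pre-2024 era
    if not paper.trains_on_reward_signal:
        costs.add("M2")  # no reward signal
    if paper.n_eval_benchmarks <= 1:
        costs.add("M3")  # single-evaluation-set
    if paper.n_task_domains <= 1:
        costs.add("M4")  # single-task / narrow-domain
    if paper.compute_ratio_vs_ppo > 100 or paper.n_human_labels > 100_000:
        costs.add("M5")  # resource-unbounded
    if paper.closed_model_only:
        costs.add("M6")  # implementation-specific
    return costs
```

An empty result means the paper pays no meta-cost under this simplified encoding; borderline cases (for example, an orthogonal-eval cross-check on a second, weak benchmark) would still need manual review.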

## Three escape gates

A paper that triggers no bill and fits no meta-cost must pass through one of three escape gates:

1. **Theoretical / proof paper** — proves bound, identifiability, or reduction without method claim
2. **Tooling / infra paper** — releases TRL, OpenRLHF, vLLM-RLHF, AlignBench tool; no alignment claim
3. **Survey / position paper** — surveys the landscape, no novel method or empirical claim
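
One way to read the bills, meta-costs, and escape gates together is as a small triage function. The sketch below is an assumption about how the three pieces combine. The outcome labels, the gate keys, and the treatment of a bill triggered alongside a meta-cost as a non-clean trigger are all illustrative, not definitions from this draft.

```python
from typing import Optional, Set

# Hypothetical short keys for the three escape gates listed above.
ESCAPE_GATES = {"theory", "tooling", "survey"}


def triage(bills: Set[int], costs: Set[str], gate: Optional[str] = None) -> str:
    """Classify a paper from its triggered bills, paid meta-costs, and escape gate."""
    if gate is not None and gate not in ESCAPE_GATES:
        raise ValueError(f"unknown escape gate: {gate!r}")
    if bills and not costs:
        return "clean-trigger"
    if bills and costs:
        return "trigger-with-meta-cost"
    if costs:
        return "disqualified"
    if gate is not None:
        return f"escape:{gate}"
    # No bill, no meta-cost, no gate: the taxonomy does not cover this paper.
    return "unclassified"
```

An `"unclassified"` result is worth tracking separately, since it signals a paper that the current bill set and gates fail to place.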

## Empty-space census (★ predicted-empty bills)

Three bills are signature constructions — the empty-space hypothesis predicts no 2024–2026 paper triggers them cleanly without paying meta-costs.

| Bill | Cleanest historic candidate | Why it's expected to fail to trigger |
|---:|---|---|
| 6 — Robust closure under reward perturbation | Anthropic Sleeper Agents 2024 (negative result on robustness); Llama-3.1 RLHF | Probe-task batteries are typically narrow (≤2 probes) at 70B+ scale. Predicted: every paper with full N≥5 probe coverage pays M3 or M4. |
| 10 — Closed-loop preference generation without distributional collapse | Self-Rewarding LM (Yuan-Pang-Cho 2024), SPIN (Chen-Deng 2024), Magpie (Xu 2024) | Self-Rewarding shows degradation by iteration 3; SPIN distributional convergence assumed in proof but unverified in practice. Predicted: every paper claiming ≥3 iterations without collapse pays M5. |
| 13 — Frontier-scale alignment guarantee with third-party reproduction | Llama-3.1-405B-Instruct, GPT-4o, Claude-3.5-Sonnet | Vendor-internal alignment metrics not reproduced end-to-end by METR/Apollo/AISI. Closest: Apollo Sleeper-Agent reproduction (partial scope only). Predicted: no paper triggers cleanly through 2026. |
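
To make the Bill 6 prediction checkable, the clean-trigger condition can be written as a predicate. This is a sketch only: `bill6_clean_trigger` and its parameter names are hypothetical, ε is left as a required argument because the draft does not fix a value, and only the three meta-costs named in Bill 6 (M3, M4, M5) are treated as blocking.

```python
from typing import Dict, Set


def bill6_clean_trigger(
    probe_deltas: Dict[str, float],
    params_billions: float,
    costs: Set[str],
    epsilon: float,
    min_probes: int = 5,
    min_params_billions: float = 70.0,
) -> bool:
    """Check the Bill 6 conditions for one paper.

    probe_deltas maps each reward-hack probe task to the change in the
    alignment metric relative to the original reward (hypothetical encoding).
    """
    enough_probes = len(probe_deltas) >= min_probes
    within_eps = all(abs(delta) <= epsilon for delta in probe_deltas.values())
    at_scale = params_billions >= min_params_billions
    no_blocking_cost = not (costs & {"M3", "M4", "M5"})
    return enough_probes and within_eps and at_scale and no_blocking_cost
```

Under the empty-space hypothesis, this predicate is expected to return `False` for every paper in the 2024–2026 corpus; a single `True` would count as a clean trigger.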

## Iteration plan (8 sweeps)

- **Sweep 901**: arXiv `cs.LG` 2024-08 to 2024-12, RLHF / DPO / preference-optimization papers
- **Sweep 902**: arXiv `cs.LG` 2025-01 to 2025-06, post-DPO landscape (IPO, KTO, SimPO, ORPO, RTO, Step-DPO and its variants)
- **Sweep 903**: arXiv `cs.LG` / `cs.CL` 2025-07 to 2026-04, frontier RL alignment (process rewards, MathShepherd, V-STaR, PRM-V, Reasoning-RM)
- **Sweep 904**: NeurIPS 2024 + ICLR 2025 alignment / RL tracks — main conference papers + workshop
- **Sweep 905**: ICML 2025 + COLM 2024/2025 — alignment + RLHF papers
- **Sweep 906**: Anthropic + OpenAI + DeepMind + Meta alignment publications + tech reports + safety blogs (2024–2026)
- **Sweep 907**: AISI (US + UK) + METR + Apollo Research publications, model-eval reports, RLHF safety case framework
- **Sweep 908**: Constitutional AI + RLAIF + Self-Rewarding + Magpie + SPIN lineage + alignment-tax / capability-regression literature + reward-model overoptimization (Goodhart)
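
The three date-ranged arXiv sweeps (901 to 903) can be encoded as data and checked for gaps between consecutive windows. This is a minimal sketch; the tuple layout and helper names are assumptions, and the remaining sweeps are omitted because they are defined by venue or organization rather than by month range.

```python
from typing import List, Tuple

Month = Tuple[int, int]           # (year, month)
Sweep = Tuple[int, Month, Month]  # (sweep id, first month, last month), inclusive

# Date ranges copied from the sweep list above.
ARXIV_SWEEPS: List[Sweep] = [
    (901, (2024, 8), (2024, 12)),
    (902, (2025, 1), (2025, 6)),
    (903, (2025, 7), (2026, 4)),
]


def month_index(m: Month) -> int:
    """Map (year, month) to a running month count so ranges can be compared."""
    year, month = m
    return year * 12 + (month - 1)


def is_contiguous(sweeps: List[Sweep]) -> bool:
    """True if each sweep starts in the month right after the previous one ends."""
    ordered = sorted(sweeps, key=lambda s: month_index(s[1]))
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if month_index(next_start) != month_index(prev_end) + 1:
            return False
    return True


def months_covered(sweeps: List[Sweep]) -> int:
    """Total number of months across all sweeps (assumes no overlap)."""
    return sum(month_index(end) - month_index(start) + 1 for _, start, end in sweeps)
```

Running `is_contiguous(ARXIV_SWEEPS)` before each batch is a cheap guard against accidentally leaving a month out when the windows are edited.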

## Cross-aiwiki coupling (v0.1)

- **`inference_time_safety`** Bill on jailbreak-after-RLHF ↔ **this aiwiki Bill 6** (robust closure under reward perturbation) — joint failure mode: RLHF doesn't survive deployment-time adversarial pressure
- **`mech_interp`** Bill on reward-model SAE / circuits ↔ **this aiwiki Bill 4** (reward-model identifiability) — same papers, different lens
- **`reasoning_cot`** Bill on reasoning-trace verifiability ↔ **this aiwiki Bill 9** (process-reward / step-level closure) — PRM lineage straddles
- **`capability_benchmarks`** regression panel ↔ **this aiwiki Bill 7** (alignment tax) — alignment-tax literature feeds the regression panel

## Bridge-test commitments

- **Bills 6, 10, 13 ★** are predicted empty across 2024–2026. If any surfaces a clean trigger, the public update timeline applies (7-day commitment per `_methodology.md`).
- **Bill 9 (process-reward)** is the most active 2025 frontier — expected to grow substantially through batch 3.
- **Bill 12 (cross-model reproducibility)** is the cheapest empty-space candidate to test — most papers are tested on one or two model families. Likely to surface as additional ★ in v0.2.
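
Because Bill 12 is described as the cheapest candidate to test, its payment condition (headline result reproduced on ≥3 model families without family-specific tuning) is easy to sketch as a check. `FamilyResult` and its fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FamilyResult:
    # Hypothetical record of one model family's outcome for a given paper.
    family: str
    headline_reproduced: bool
    family_specific_tuning: bool


def bill12_paid(results: Iterable[FamilyResult], min_families: int = 3) -> bool:
    """True if the headline result holds on enough distinct families without per-family tuning."""
    qualifying = {
        r.family
        for r in results
        if r.headline_reproduced and not r.family_specific_tuning
    }
    return len(qualifying) >= min_families
```

Collecting families into a set means that several model sizes from the same family count once, which matches the "one or two model families" framing above.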