# RL-from-Rewards Aiwiki — Bills Draft v0.2 (2026-05-14)

> **v0.1 → v0.2 changes:**
> - **Promoted Bill 12** (Cross-model / cross-vendor reproducibility) to ★ predicted-empty. Trigger: sweep 905's COLM 2025 *Cross-Model Reproducibility* paper provides falsification evidence — most "DPO/PPO/Constitutional/Self-Rewarding works" claims test on one or two model families and DO NOT reproduce cleanly across families without family-specific tuning.
> - **Tightened Bill 10** engagement criterion: must demonstrate ≥3 iterations of closed-loop **preference generation** (not just data synthesis or evaluation iteration). This excludes Magpie (single-iter data synthesis), Self-Taught Evaluator (iterates on judging not preference), and similar borderline candidates that batch 1 marked clean.
> - **Added Meta-cost M7** (Vendor-internal-only metrics): vendor-self-reported alignment metrics that are not reproducible by any independent third party. Distinct from M6 (closed-model-only); M7 is about *evaluation provenance* not *model provenance*.
> - **Tightened M5** detection: batch 1 found only 1 entry paying M5 across 417 papers — clearly under-detected. v0.2 expands M5 patterns: explicit "10x compute", "1M+ human labels", "infeasible at frontier scale", and "TPU-pod-week"/"H100-hour" budgets >100x baseline.
> - **No bills demoted; no bills removed from corpus.**

## The thirteen bills (v0.2)

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | PPO-style KL regularization | Online RL fine-tuning with KL anchor; KL bound + anchor policy specified. | |
| 2 | Direct preference closure | DPO/IPO/KTO/SimPO/ORPO/Step-DPO closed-form preference loss. | |
| 3 | Reward-model overoptimization mitigation | Goodhart-bound proxy/true gap; characterized or bounded. | |
| 4 | Reward-model identifiability | RM uniquely identifiable from preference data. | |
| 5 | Constitutional AI / RLAIF synthetic-label closure | Model-generated preferences + bounded principle drift. **(Now also catches Magpie + single-iter data synthesis pipelines per Bill 10 tightening.)** | |
| **6 ★** | **Robust closure under reward perturbation** | Robust over N≥5 reward-hack probes at 70B+ scale. **Predicted empty.** | ★ |
| 7 | Alignment tax quantification | Pre/post capability benchmarks with CIs. | |
| 8 | Honest exploration / uncertainty calibration | Preserved calibration on TruthfulQA / MMLU / ECE. | |
| 9 | Process-reward / step-level closure | PRM, MathShepherd, PRM800K, V-STaR. | |
| **10 ★** | **Closed-loop preference generation without distributional collapse** | Self-Rewarding/SPIN-style loops run for ≥3 iterations without collapse. **Tightened: must be preference generation (not data synthesis), ≥3 explicit iterations, with a measured collapse criterion.** **Predicted empty.** | ★ |
| 11 | Principle-drift / constitution-leakage audit | Constitutional AI principles preserved across iterations. | |
| **12 ★** | **Cross-model / cross-vendor reproducibility (PROMOTED v0.2)** | Method works on ≥3 model families (Llama-3, Qwen-2.5, DeepSeek-V3, Mistral, Phi) **without family-specific tuning** AND ≥1 family is open-weight. **Predicted empty.** The *Cross-Model Reproducibility* paper (COLM 2025) provides falsification evidence: most published methods that claim cross-family generality reproduce cleanly on at most 1–2 families. | ★ |
| **13 ★** | **Frontier-scale alignment guarantee with third-party reproduction** | ≥70B + reward-hack probe battery + third-party reproduction within 6 months. **Predicted empty.** | ★ |

★ = signature construction; the empty-space hypothesis predicts that no paper in the 2024–2026 corpus triggers cleanly without paying meta-costs.
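The engagement criteria for Bills 10 and 12 in the table above are concrete enough to express as predicates. The following is a minimal sketch, assuming a hypothetical `Candidate` record; the field names are illustrative and are not taken from the aiwiki classifier. It checks engagement only, not whether the trigger is clean.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical per-paper record; field names are illustrative assumptions."""
    title: str
    families: set = field(default_factory=set)              # model families evaluated
    open_weight_families: set = field(default_factory=set)  # subset with open weights
    family_specific_tuning: bool = False                    # any per-family tuning reported
    preference_iterations: int = 0                          # closed-loop preference-generation iterations
    measures_collapse: bool = False                         # reports an explicit collapse criterion


def engages_bill_10(c: Candidate) -> bool:
    """Bill 10 (v0.2): >=3 preference-generation iterations with a measured collapse criterion."""
    return c.preference_iterations >= 3 and c.measures_collapse


def engages_bill_12(c: Candidate) -> bool:
    """Bill 12 (v0.2): >=3 families, no family-specific tuning, >=1 open-weight family."""
    return (
        len(c.families) >= 3
        and not c.family_specific_tuning
        and len(c.families & c.open_weight_families) >= 1
    )
```

Under this reading, a single-iteration data-synthesis pipeline has `preference_iterations == 0` and never engages Bill 10, which matches the v0.2 tightening.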

## Seven meta-costs (v0.2)

| # | Meta-cost | Description |
|---:|---|---|
| M1 | Pre-2024 era | InstructGPT, summarization-RL, WebGPT (historical anchors only) |
| M2 | No reward signal | SFT-only / instruction-tuning-only |
| M3 | Single-evaluation-set | One benchmark only; no orthogonal cross-check |
| M4 | Single-task / narrow-domain | Math-only, code-only, summarization-only, QA-only |
| M5 | **Resource-unbounded (TIGHTENED v0.2)** | Compute budget >100× the equivalent PPO baseline; >100K human labels; "infeasible at frontier scale" annotations; explicit TPU-pod-week / H100-hour budgets >100× baseline; preference data described as "1M+ examples" |
| M6 | Implementation-specific | One closed model only; not reproducible on open-weight |
| **M7** | **Vendor-internal-only metrics (NEW v0.2)** | Vendor-self-reported alignment metrics with no third-party reproduction pathway. Distinct from M6 (model provenance) — M7 is about *evaluation provenance*. Anthropic / OpenAI / DeepMind / Meta internal eval scores without external corroboration ALL pay M7 unless paired with external eval citation. |
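The expanded M5 patterns lend themselves to a phrase-level first pass. The sketch below is one possible reading, not the classifier's actual rules: the regexes and the `m5_signals` helper are assumptions written for illustration. It only flags phrases for review; deciding whether a budget really exceeds 100× a baseline still requires reading the paper.

```python
import re

# Illustrative patterns based on the M5 row above; the pattern names and
# regexes are assumptions, not the aiwiki classifier's real rules.
M5_PATTERNS = {
    "compute_multiplier": re.compile(
        r"\b\d+(?:\.\d+)?\s*[x×]\s+(?:the\s+)?compute\b", re.IGNORECASE
    ),
    "label_volume": re.compile(
        r"\b\d+(?:\.\d+)?\s*[KM]\+?\s+(?:human\s+labels|examples)\b", re.IGNORECASE
    ),
    "infeasible": re.compile(r"infeasible at frontier scale", re.IGNORECASE),
    "hardware_budget": re.compile(
        r"\b(?:TPU[- ]pod[- ]weeks?|H100[- ]hours?)\b", re.IGNORECASE
    ),
}


def m5_signals(text: str) -> list:
    """Return the sorted names of M5 patterns that appear in the text."""
    return sorted(name for name, pattern in M5_PATTERNS.items() if pattern.search(text))
```

For example, `m5_signals("Training used 10x compute and 1M+ human labels.")` flags both the compute and label patterns, while a sentence with no resource claims returns an empty list.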

## Three escape gates (unchanged from v0.1)

1. **G1 Theoretical / proof paper** — proves bound, identifiability, reduction; no method claim
2. **G2 Tooling / infra paper** — TRL, OpenRLHF, vLLM-RLHF, AlignBench; no alignment claim
3. **G3 Survey / position paper** — no novel method or empirical claim
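How bills, meta-costs, and escape gates combine into a "clean trigger" decision can be sketched as follows. This is a minimal illustration under stated assumptions: the `Entry` record and `status` function are hypothetical, and the precedence (gates checked first, then engagement, then meta-costs) is one plausible ordering that this draft does not itself specify.

```python
from dataclasses import dataclass, field
from typing import Optional

META_COSTS = {"M1", "M2", "M3", "M4", "M5", "M6", "M7"}
ESCAPE_GATES = {"G1", "G2", "G3"}


@dataclass
class Entry:
    """Hypothetical classified paper; field names are illustrative assumptions."""
    bills: set                                      # bill numbers the paper engages
    meta_costs: set = field(default_factory=set)    # subset of META_COSTS
    gate: Optional[str] = None                      # one of ESCAPE_GATES, or None


def status(entry: Entry, bill: int) -> str:
    """Classify an entry against one bill (assumed precedence: gate, engagement, cost)."""
    if entry.gate is not None:
        return "escaped"          # theory, tooling, and survey papers make no method claim
    if bill not in entry.bills:
        return "not_engaged"
    if entry.meta_costs:
        return "pays_meta_cost"   # engages the bill, but not cleanly
    return "clean"                # would breach a predicted-empty ★ bill
```

Under this sketch, the empty-space hypothesis for a ★ bill amounts to the claim that `status(entry, bill)` never returns `"clean"` for any entry in the 2024–2026 corpus.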

## ★ Empty-space census (v0.2)

Four bills are signature constructions: the empty-space hypothesis predicts that no 2024–2026 paper triggers them cleanly without paying meta-costs.

| Bill | Cleanest historic candidate | Why it's expected to fail to trigger |
|---:|---|---|
| 6 ★ | Anthropic Constitutional Classifiers (ICLR 2025); Anthropic RSP Claude 4.5 evals | Probe-task batteries are typically ≤3 probes at 70B+ scale; full N≥5 coverage pays M3 / M6 / M7. **Status (batch 1 v0.1): 47 candidates, 0 clean, 36 rebuttals (Sleeper Agents, Sycophancy→Subterfuge, BoN Jailbreaking, Apollo scheming, Subliminal RLHF Learning, AISI N=8 battery). HOLDS.** |
| 10 ★ | Self-Rewarding LM (Yuan-Pang-Cho 2024); SPIN (Chen-Deng 2024); Magpie (Xu 2024) | Self-Rewarding shows degradation by iter 3; SPIN assumes distributional convergence in its proof but leaves it unverified; Magpie is single-iter data synthesis (reclassified to Bill 5 in v0.2). **Status (batch 1 v0.1): 35 candidates, 2 nominally clean (Magpie + Self-Taught Evaluator); both fail the ≥3-iter preference-generation criterion on inspection, and v0.2 reclassifies them. Strong rebuttal cluster (Pan-Bowman, Mode-Collapse-of-DPO, Stanford 6-iter audit). HOLDS after v0.2 tightening.** |
| **12 ★ (PROMOTED v0.2)** | DPO original paper (limited cross-family reproduction in canonical Mistral-7B and Llama-2-7B); Tülu 3 family (open-weight, but still on Llama-3) | Most "method works" claims test on 1–2 families; cross-family reproduction requires family-specific tuning. **Status (batch 1): 10 candidates, 0 clean cross-3-family-no-tuning triggers. The *Cross-Model Reproducibility* paper (COLM 2025) provides falsification evidence. Predicted to HOLD.** |
| 13 ★ | Llama-3.1-405B-Instruct, GPT-4o, Claude-3.5-Sonnet, GPT-5, o3 | Vendor-internal alignment metrics not reproduced end-to-end by METR/Apollo/AISI with full reward-hack probe battery. **Status (batch 1): 2 candidates, 0 clean. Joint AISI/METR/Apollo N=8 probe battery (arxiv:2505.10399) DOCUMENTS that no current frontier model passes Bill 6 cleanly — precondition for Bill 13. HOLDS.** |

## Cross-aiwiki coupling (v0.2)

- **`inference_time_safety` Bill on jailbreak-after-RLHF** ↔ **this aiwiki Bill 6** — joint failure mode
- **`mech_interp` Bill on reward-model SAE** ↔ **this aiwiki Bill 4** — same papers, different lens
- **`reasoning_cot` Bill on reasoning-trace verifiability** ↔ **this aiwiki Bill 9** — PRM lineage straddles
- **`capability_benchmarks` regression panel** ↔ **this aiwiki Bill 7** — alignment-tax feeds regression panel
- **NEW v0.2: `agentic_tool_use` Bill on tool-use-RLHF** ↔ **this aiwiki Bill 6** — joint failure mode for agentic deployments under reward perturbation

## Bridge-test commitments

- **Bills 6, 10, 12 (promoted in v0.2), and 13** (all ★) are predicted empty across 2024–2026. If any of them surfaces a clean trigger, the public update timeline applies (7-day commitment per `_methodology.md`).
- **Independent verification of any breach claim is MANDATORY** per cross-cutting process learning (b2-901 robotics_embodied verification killed 5 hallucinated IDs that would have produced a credibility-destroying false breach).

## Iteration plan (v0.2)

1. **v0.2 lock condition**: classifier 1.000/1.000 on ≥40 hand-curated cases AND v0.2 reclassification report shows Magpie + Self-Taught Evaluator correctly reclassified away from Bill 10 ★.
2. **Stage 5 (POPULATE)**: `wiki_populate.py` emits Obsidian wiki/.
3. **Stage 6 (REVIEW)**: `human_validation_queue.md` + atlas injection.
4. **Stage 7 (LOCK)**: when zero unclassified entries, watchlist ≥30 entries, falsifiers ≥10 conditions, author-activity panel mapped.
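
The lock conditions in steps 1 and 4 are simple threshold checks and could be expressed roughly as below. This is a sketch under assumptions: the function names and argument shapes are hypothetical, and "1.000/1.000" is read here as "every hand-curated case classified correctly", which the draft does not spell out.

```python
# Papers that the v0.2 reclassification report must show moved off Bill 10.
REQUIRED_RECLASSIFIED = ("Magpie", "Self-Taught Evaluator")


def v02_lock_ready(n_cases: int, n_correct: int, bills_by_paper: dict) -> bool:
    """Step 1: perfect score on >=40 curated cases, and both papers off Bill 10.

    bills_by_paper maps a paper name to the set of bill numbers assigned in v0.2
    (a hypothetical shape for the reclassification report).
    """
    perfect = n_cases >= 40 and n_correct == n_cases
    moved = all(
        paper in bills_by_paper and 10 not in bills_by_paper[paper]
        for paper in REQUIRED_RECLASSIFIED
    )
    return perfect and moved


def stage7_lock_ready(unclassified: int, watchlist: int, falsifiers: int,
                      author_panel_mapped: bool) -> bool:
    """Step 4: zero unclassified, watchlist >=30, falsifiers >=10, panel mapped."""
    return (
        unclassified == 0
        and watchlist >= 30
        and falsifiers >= 10
        and author_panel_mapped
    )
```

Keeping the two checks separate mirrors the plan: the v0.2 lock gates population and review, while the Stage 7 lock gates the final freeze.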