# RL-from-Rewards Aiwiki — Purpose

## Threat model (verbatim)

> *Demonstrate robust alignment of a 70B-class or larger frontier LLM under reward-model perturbations within ε of the original preference distribution, achieving (a) pass-rate ≥ baseline on a held-out preference set, (b) measurable absence of reward hacking on N ≥ 5 published probe tasks, and (c) bounded KL-divergence from base policy at deployment, in 2026.*

Equivalent restatements per technique:

- **Online RLHF (PPO-shape)**: maintain KL ≤ 0.05 from the base policy across 50K training steps without entering a reward-hacking regime; published probe tasks include length bias, sycophancy, refusal patching, jailbreak susceptibility, and lying-under-pressure
- **Direct preference (DPO / IPO / KTO / SimPO / ORPO)**: match or beat PPO on AlpacaEval-2 + Arena-Hard + held-out preference set without specification-gaming on canonical reward-hack probes
- **Constitutional AI / RLAIF**: replace human preference labels with model-generated preferences while preserving (a)+(b)+(c) without principle-drift
- **Process-reward / step-level reward** (PRM-shape, Math-Shepherd, OpenAI PRM800K lineage): step-level reward models on multi-step reasoning without step-reward hacking
- **Self-play / synthetic preferences** (Anthropic constitutional, Self-Rewarding LM, SPIN): closed-loop preference generation without distributional collapse
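
The three acceptance criteria (a), (b), (c) can be sketched as a simple check. This is a minimal illustration, not a prescribed evaluation harness: `EvalReport`, `check_threat_model`, and `mean_sequence_kl` are hypothetical names, the 0.05 default budget is borrowed from the Online RLHF bullet, and the KL unit (nats per sampled sequence) is an assumption because the threat model does not specify one.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


def mean_sequence_kl(policy_logprobs: Sequence[Sequence[float]],
                     base_logprobs: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate of KL(policy || base), in nats per sequence.

    Each inner sequence holds per-token log-probabilities of one response
    sampled from the policy, scored under the policy and under the base model.
    """
    totals = []
    for pol, base in zip(policy_logprobs, base_logprobs):
        if len(pol) != len(base):
            raise ValueError("token counts differ between policy and base scores")
        # Sum of per-token log-ratios gives the sequence-level log-ratio.
        totals.append(sum(p - b for p, b in zip(pol, base)))
    if not totals:
        raise ValueError("need at least one sampled response")
    return sum(totals) / len(totals)


@dataclass
class EvalReport:
    """Hypothetical summary of one training run's evaluation."""
    pass_rate: float                # held-out preference set, trained policy
    baseline_pass_rate: float       # same set, baseline policy
    probe_clean: Dict[str, bool]    # probe task name -> no reward hacking observed
    kl_from_base: float             # e.g. output of mean_sequence_kl


def check_threat_model(report: EvalReport,
                       min_probes: int = 5,
                       kl_budget: float = 0.05) -> Dict[str, bool]:
    """Evaluate criteria (a), (b), (c) and report each one separately."""
    return {
        "a_pass_rate": report.pass_rate >= report.baseline_pass_rate,
        "b_no_reward_hacking": (len(report.probe_clean) >= min_probes
                                and all(report.probe_clean.values())),
        "c_bounded_kl": report.kl_from_base <= kl_budget,
    }
```

Returning one flag per criterion, rather than a single boolean, keeps it visible which part of the threat model a given method fails.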

This excludes:

- **Pre-2024 RLHF era** (InstructGPT, summarization-RL, original ChatGPT) — covered by meta-cost M1 (toy regime relative to 2024–2026 frontier)
- **SFT-only or SFT-comparison-only papers** that don't actually train a reward signal — covered by M2 (no closure mechanism engaged)
- **Reward modeling papers without alignment evaluation** — covered by M3 (model-only, not algorithm-claim)
- **Single-task RL fine-tuning** (e.g., math-only, code-only) without generalization claim — covered by M4
- **Methods that require >100× the base PPO compute budget** to reach claimed performance — covered by M5 (resource-unbounded analog of factorization M5)
- **Implementation-specific results** on a single closed model (e.g., GPT-4 turbo only) without reproducibility — covered by M6
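
One way to picture the exclusion list is as a triage function that returns every meta-cost a paper triggers. The sketch below is illustrative only: `PaperRecord` and all of its field names are hypothetical, and reducing each exclusion to one or two boolean fields is a simplifying assumption about how a curator would record a paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PaperRecord:
    """Hypothetical curator-entered metadata for one candidate paper."""
    year: int
    uses_reward_signal: bool         # False for SFT-only / SFT-comparison-only work
    reward_model_only: bool          # reward-modeling paper with no training algorithm claim
    has_alignment_eval: bool
    single_task: bool                # e.g. math-only or code-only fine-tuning
    claims_generalization: bool
    compute_multiple_of_ppo: float   # compute relative to the base PPO budget
    single_closed_model: bool
    reproducible: bool


def exclusion_gates(p: PaperRecord) -> List[str]:
    """Return the meta-cost labels (M1..M6) that exclude this paper, if any."""
    gates = []
    if p.year < 2024:
        gates.append("M1")  # pre-2024 RLHF era
    if not p.uses_reward_signal:
        gates.append("M2")  # no reward signal trained
    if p.reward_model_only and not p.has_alignment_eval:
        gates.append("M3")  # reward modeling without alignment evaluation
    if p.single_task and not p.claims_generalization:
        gates.append("M4")  # single-task RL without generalization claim
    if p.compute_multiple_of_ppo > 100:
        gates.append("M5")  # more than 100x the base PPO compute budget
    if p.single_closed_model and not p.reproducible:
        gates.append("M6")  # single closed model, not reproducible
    return gates
```

An empty list means the paper passes every exclusion gate and stays in scope.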

## Why this scope

The "RLHF replacement" wave (DPO 2023 → IPO/SimPO/KTO/ORPO/Step-DPO 2024–2025) and the "RLAIF / Constitutional AI" wave (Anthropic 2022 → Self-Rewarding 2024 → Magpie 2025 → CAI-at-scale 2025–2026) have produced ≥300 papers claiming alignment improvements. Each claim engages one or more closure mechanisms, each of which answers the same question: what does the proposed method have to *pay for* to preserve the alignment guarantee?

The aiwiki maps every claim to its closure mechanism and surfaces the empty space: methods that survive every closure and provide robust, measurable alignment improvements at frontier scale. The empty-space hypothesis predicts that **the empty space stays empty** through 2026, because most claims pay one or more meta-costs (compute, scale, evaluation choice) that the headline doesn't acknowledge.
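
As a rough sketch of that mapping, the empty space is the set of claims whose closure set is empty. The function name `empty_space`, the claim identifiers, and the use of plain string labels for closure mechanisms are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def empty_space(closure_map: Dict[str, Set[str]]) -> Tuple[List[str], Counter]:
    """Split claims into survivors and a tally of the costs paid by the rest.

    closure_map maps a claim identifier to the set of closure-mechanism
    labels that the claim engages (an empty set means none were found).
    """
    # Claims that engage no closure mechanism occupy the empty space.
    survivors = sorted(claim for claim, costs in closure_map.items() if not costs)
    # Count how often each closure mechanism is paid across all claims.
    tally = Counter(cost for costs in closure_map.values() for cost in costs)
    return survivors, tally
```

Under the empty-space hypothesis, `survivors` is expected to come back empty, and the tally shows which costs the corpus pays most often.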

## Cross-aiwiki adjacency

- **`inference_time_safety` aiwiki**: complementary, not overlapping. Inference-time safety covers jailbreaks / prompt injection / red-team after deployment. RL-from-rewards covers alignment-during-training. A paper that breaks RL-from-rewards alignment is *upstream* of inference-time-safety failures.
- **`mech_interp` aiwiki**: cousin. Reward-model interpretability (Anthropic preference-model SAE 2024–2026) appears here as Bill 8 (reward-model identifiability) and there as a circuits-level analysis. Same papers, different lens.
- **`capability_benchmarks` aiwiki**: AlpacaEval, Arena-Hard, MT-Bench, IFEval, JudgeArena live there. RL-from-rewards papers cite these benchmarks; this aiwiki tracks the *training methods*, not the benchmarks themselves.
- **`reasoning_cot` aiwiki**: cousin. Process-reward models (PRMs) live here as Bill 9 (step-level reward identifiability) and there as the Bill on reasoning verifiability. The STaR / V-STaR / RFT lineage straddles both.

## Q-Day analog (policy-divergence panel)

This aiwiki has a `wiki/policy_trajectory.md` panel mirroring the Q-Day Trajectory in the factorization aiwiki:

- **NIST AISI** (US AI Safety Institute, announced 2023-Nov; leadership and consortium announced 2024-Feb) — RLHF / red-team evaluation guidance in NIST AI RMF Generative AI Profile (NIST AI 600-1)
- **UK AISI** — model evaluation publications, RLHF safety case framework
- **EU AI Act** (Reg 2024/1689) — Articles 51–55 (GPAI classification and provider obligations), including alignment-relevant risk evaluation; Code of Practice (2025)
- **Bletchley / Seoul / Paris AI summits** — frontier-model safety commitments referencing RL-from-rewards as a load-bearing mitigation
- **Anthropic RSP, OpenAI Preparedness Framework, GDM Frontier Safety Framework** — vendor commitments that depend on RL-from-rewards working

A trigger in this aiwiki — i.e., a published clean break of RL-from-rewards alignment at frontier scale — would force vendor RSP revisions and federal evaluation standard updates.

## Out of scope (not in this aiwiki)

- Pure SFT methods (InstructGPT-era SFT, FLAN, T0) without reward signal
- Pre-2024 RLHF (InstructGPT, summarization, WebGPT) — historical anchors only
- Pure benchmark / dataset / leaderboard papers without method claim — `capability_benchmarks` aiwiki
- Pure red-team / jailbreak / adversarial-prompt papers without alignment-method claim — `inference_time_safety` aiwiki
- Pure reward-model interpretability without alignment-effectiveness claim — `mech_interp` aiwiki
- Pure RLHF infra papers (vLLM-RLHF, OpenRLHF systems) — engineering escape gate
- RL for non-LLM domains (robotics, game-playing, control) — `robotics_embodied` aiwiki
- Pure preference-data dataset releases (Nectar, Skywork, UltraFeedback) without method claim — escape gate

## In scope (this aiwiki's corpus)

- arXiv `cs.LG` / `cs.CL` / `cs.AI` 2024–2026 with keywords: `RLHF`, `DPO`, `IPO`, `KTO`, `SimPO`, `ORPO`, `Constitutional AI`, `RLAIF`, `process reward`, `PRM`, `Self-Rewarding`, `reward hacking`, `specification gaming`, `Goodhart`, `KL regularization`
- NeurIPS 2024 / 2025 alignment / RL tracks
- ICLR 2025 / 2026 (workshops + main)
- ICML 2025 alignment / RL
- COLM 2024 / 2025
- OpenReview (any submission with `alignment` in its subject area)
- Anthropic, OpenAI, DeepMind, Meta AI, Microsoft Research alignment-publication blogs and tech reports (2024–2026 only)
- AISI (US + UK) publications
- METR publications
- Apollo Research publications
- Constitutional-AI / Sparrow / Claude / GPT-4 / Llama-3 / Llama-3.1 / DeepSeek-V3 / Qwen-2.5 alignment sections of model cards
- Anthropic Responsible Scaling Policy commitments + actual evaluations
- OpenAI Preparedness Framework commitments + actual evaluations
- DeepMind Frontier Safety Framework commitments + actual evaluations
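
The arXiv bullet at the top of this list can be approximated with a small filter. This is a sketch under stated assumptions: `in_corpus` and its arguments are hypothetical, it only covers the arXiv category, year, and keyword rule (not the venue or vendor sources), and matching short acronyms case-sensitively is a design choice made here to avoid false hits such as a lowercase "ipo".

```python
import re
from typing import Iterable

# Keywords copied from the in-scope list, split by how they are matched.
ACRONYMS = ["RLHF", "DPO", "IPO", "KTO", "SimPO", "ORPO", "RLAIF", "PRM"]
PHRASES = ["Constitutional AI", "process reward", "Self-Rewarding",
           "reward hacking", "specification gaming", "Goodhart",
           "KL regularization"]
CATEGORIES = {"cs.LG", "cs.CL", "cs.AI"}

# Acronyms: case-sensitive, whole token, optional plural "s".
_ACRONYM_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:" + "|".join(map(re.escape, ACRONYMS)) + r")s?(?![A-Za-z0-9])"
)


def _phrase_pattern(phrase: str) -> str:
    # Allow whitespace or hyphens between words ("process-reward").
    return r"[\s-]+".join(re.escape(word) for word in phrase.split())


# Phrases: case-insensitive, anchored at the start so suffixes still match.
_PHRASE_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:" + "|".join(_phrase_pattern(p) for p in PHRASES) + r")",
    re.IGNORECASE,
)


def in_corpus(title: str, abstract: str, categories: Iterable[str], year: int) -> bool:
    """Return True if an arXiv entry matches the category, year, and keyword rule."""
    if not 2024 <= year <= 2026:
        return False
    if not CATEGORIES.intersection(categories):
        return False
    text = f"{title} {abstract}"
    return bool(_ACRONYM_RE.search(text) or _PHRASE_RE.search(text))
```

A keyword hit only admits a paper to the candidate pool; the exclusion and out-of-scope rules elsewhere on this page still apply afterwards.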

## Adjacent corpora (the watch-list)

- Process-reward / step-level reward literature (cousin to `reasoning_cot` Bill on verifiability)
- Reward-model interpretability (cousin to `mech_interp` SAE / circuits literature)
- Constitutional AI principle-drift literature (cousin to `inference_time_safety` jailbreak papers post-CAI)
- Self-play / synthetic-preference literature (cousin to evolutionary-optimization aiwiki candidate)
- Alignment-tax / capability-degradation literature (cousin to `capability_benchmarks` regression panel)

## Authorship

Kevin Russell (Project 42). Scoping draft.

## Status

Stage 1 (SCOPE).
