# RL-from-Rewards Aiwiki

A forensic deep-loop closure-pattern atlas for RL-from-rewards alignment claims at frontier (≥70B) scale, 2024–2026. The atlas maps every claim to one of 13 closure mechanisms and surfaces empty-space hypotheses on robustness, closed-loop self-improvement, and frontier-scale third-party reproduction.

## Status

**Stage 1 (SCOPE) → Stage 2 (BILLS) drafted, 2026-05-14.** Awaiting batch 1 (8 parallel sweeps).

## Files

- [purpose.md](purpose.md) — threat model, scope, exclusions
- [schema.md](schema.md) — paper YAML / sweep JSON / bill classifier rule format
- [bills_draft.md](bills_draft.md) — 13 bills + 6 meta-costs + 3 escape gates, with 3 ★ predicted-empty (Bills 6, 10, 13)
- `deep_loops/` — sweep outputs (empty until batch 1 dispatched)
- `wiki/` — Obsidian-flavored populated wiki (empty until batch 1 + classifier complete)
- `scripts/` — `bill_classifier.py`, `wiki_populate.py`, `wiki_atlas_review.py` (empty until batch 1)

## Empty-space hypotheses (predeclared)

| Bill | Name | Status |
|---:|---|---|
| 6 ★ | Robust closure under reward perturbation | Predicted empty across 2024–2026 |
| 10 ★ | Closed-loop preference generation without distributional collapse | Predicted empty across 2024–2026 |
| 13 ★ | Frontier-scale alignment guarantee with third-party reproduction | Predicted empty across 2024–2026 |

Empty-space commitment: if any paper from 2024–2026 cleanly triggers one of these bills, the atlas will be publicly updated within 7 days.
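
The commitment can be checked mechanically once classifier output exists. The sketch below is illustrative only: the `Classification` record, its field names, and the `atlas_update_deadline` helper are hypothetical and are not taken from `schema.md` or `bill_classifier.py`. It also assumes the 7-day clock starts on the date the trigger is detected, which this README does not specify.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Bills predeclared as empty in the table above.
PREDICTED_EMPTY_BILLS = frozenset({6, 10, 13})
# Publication window covered by the hypothesis: 2024 through 2026 inclusive.
WINDOW_YEARS = range(2024, 2027)
# Assumption: the 7 days are counted from the detection date.
UPDATE_WINDOW = timedelta(days=7)


@dataclass(frozen=True)
class Classification:
    """Hypothetical classifier result for one paper (not the real schema)."""
    paper_id: str
    year: int          # publication year of the paper
    bill: int          # bill number assigned by the classifier (1-13)
    clean: bool        # True if the paper triggers the bill cleanly
    detected_on: date  # date the trigger was recorded


def atlas_update_deadline(c: Classification) -> Optional[date]:
    """Return the public-update deadline, or None if no update is owed."""
    if c.bill not in PREDICTED_EMPTY_BILLS:
        return None
    if c.year not in WINDOW_YEARS:
        return None
    if not c.clean:
        return None
    return c.detected_on + UPDATE_WINDOW


if __name__ == "__main__":
    hit = Classification("example-paper", 2025, 6, True, date(2026, 5, 14))
    print(atlas_update_deadline(hit))
```

A partial or ambiguous trigger (`clean=False`) returns `None` here, so it would need manual review rather than starting the clock; that choice is also an assumption of the sketch.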

## Cross-aiwiki coupling

- `inference_time_safety` — joint failure mode: RLHF doesn't survive deployment-time adversarial pressure (Bill 6)
- `mech_interp` — reward-model interpretability lens (Bill 4)
- `reasoning_cot` — process-reward overlap (Bill 9)
- `capability_benchmarks` — alignment-tax regression panel (Bill 7)