# RL-from-Rewards Aiwiki — Schema

## Deep-loop sweep JSON output

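Each sweep agent returns a JSON array of paper objects with exactly these fields. In the template below, `|` separates allowed values and `...` marks elided entries; neither is literal JSON: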

```json
[
  {
    "paper_id": "arxiv:2412.01234" | "openreview:abc123" | "doi:10.xxxx/...",
    "title": "...",
    "authors": ["First Last", ...],
    "date": "YYYY-MM",
    "venue": "NeurIPS 2024" | "ICLR 2025" | "arxiv:cs.LG 2024-12" | "Anthropic blog 2025-03" | ...,
    "summary": "<3-sentence summary; focus on what closure mechanism the claim engages>",
    "candidate_bill": "Bill_1" | ... | "Bill_13" | null,
    "candidate_meta_cost": "M1" | ... | "M6" | null,
    "verdict": "known_bill" | "needs_gate" | "out_of_scope" | "needs_gate_declaration" | "rebuttal_paper",
    "confidence": 0.0..1.0,
    "watchlist_tier": "monthly" | "quarterly" | "triggered" | null,
    "method_family": "PPO" | "DPO" | "IPO" | "KTO" | "SimPO" | "ORPO" | "Step-DPO" | "RLAIF" | "Constitutional" | "Self-Rewarding" | "PRM" | "other:<name>",
    "model_scale_billions": null | number,
    "compute_budget_relative": null | number | "unspecified",
    "claimed_kl_bound": null | number | "unspecified",
    "evaluation_set": ["AlpacaEval-2" | "Arena-Hard" | "MT-Bench" | "IFEval" | "TruthfulQA" | "HHH" | "JailbreakBench" | "RewardBench" | "PRM800K" | ...],
    "reward_hack_probes": ["length_bias" | "sycophancy" | "refusal_patching" | "lying_under_pressure" | "specification_gaming" | "RM_overoptimization" | ...],
    "claimed_advantage_over_baseline": null | "<short claim>",
    "rebuttal_papers": [{"paper_id": "...", "summary": "..."}, ...],
    "notes": "..."
  },
  ...
]
```
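A returned record can be checked against the template before it is filed. The following is a minimal sketch, not the project's validator: the function name `validate_paper` and the helper `_matches_or_null` are hypothetical, and it assumes every key in the template must be present. It only checks the fields with a closed set of values (`verdict`, `confidence`, `watchlist_tier`, `candidate_bill`, `candidate_meta_cost`).

```python
import re
from typing import Any

# Field names and allowed values are copied from the template above.
REQUIRED_KEYS = {
    "paper_id", "title", "authors", "date", "venue", "summary",
    "candidate_bill", "candidate_meta_cost", "verdict", "confidence",
    "watchlist_tier", "method_family", "model_scale_billions",
    "compute_budget_relative", "claimed_kl_bound", "evaluation_set",
    "reward_hack_probes", "claimed_advantage_over_baseline",
    "rebuttal_papers", "notes",
}
VERDICTS = ("known_bill", "needs_gate", "out_of_scope",
            "needs_gate_declaration", "rebuttal_paper")
WATCHLIST_TIERS = ("monthly", "quarterly", "triggered", None)
BILL_RE = re.compile(r"Bill_(?:[1-9]|1[0-3])")  # Bill_1 .. Bill_13
META_COST_RE = re.compile(r"M[1-6]")            # M1 .. M6


def _matches_or_null(value: Any, pattern: re.Pattern) -> bool:
    """True when value is None or a string that fully matches pattern."""
    return value is None or (
        isinstance(value, str) and pattern.fullmatch(value) is not None
    )


def validate_paper(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = [f"missing key: {key}"
                for key in sorted(REQUIRED_KEYS - set(record))]

    if record.get("verdict") not in VERDICTS:
        problems.append(f"bad verdict: {record.get('verdict')!r}")

    confidence = record.get("confidence")
    is_number = (isinstance(confidence, (int, float))
                 and not isinstance(confidence, bool))
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append(f"bad confidence: {confidence!r}")

    if record.get("watchlist_tier") not in WATCHLIST_TIERS:
        problems.append(f"bad watchlist_tier: {record.get('watchlist_tier')!r}")

    if not _matches_or_null(record.get("candidate_bill"), BILL_RE):
        problems.append(f"bad candidate_bill: {record.get('candidate_bill')!r}")

    if not _matches_or_null(record.get("candidate_meta_cost"), META_COST_RE):
        problems.append(
            f"bad candidate_meta_cost: {record.get('candidate_meta_cost')!r}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a sweep batch be reported in a single pass.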

## Bill classifier rule format

Rules in `scripts/bill_classifier.py` are tuples of (pattern, bill, weight), where `pattern` is a regex and `weight` is a confidence modifier:

```python
RULES = [
    # (pattern, bill, weight)
    (r"\bDPO\b|\bdirect\s+preference\s+optimization\b", "Bill_2", 1.0),
    (r"\bPPO\b.*\b(KL|kl[\-_]?(div|reg|penalty))\b", "Bill_1", 0.95),
    (r"\bRLAIF\b|\bconstitutional\s+AI\b", "Bill_5", 0.95),
    (r"\bself[\-_]?rewarding\b|\bself[\-_]?play\s+preference\b", "Bill_6", 0.95),
    (r"\b(PRM|process\s+reward\s+model)\b", "Bill_9", 0.95),
    (r"\b(reward\s+hacking|specification\s+gaming|Goodhart)\b", "Bill_3", 0.95),
    (r"\b(reward\s+model\s+over[\-_]?optimization|RM[\-_]?overoptim)\b", "Bill_3", 1.0),
    # ...
]
```
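One way such rules could be applied is sketched below. This is an illustration under stated assumptions, not the contents of `scripts/bill_classifier.py`: the function name `classify` is hypothetical, matching is assumed to be case-insensitive, and ties between rules are assumed to be broken by taking the highest weight.

```python
import re

# Two rules copied from RULES above; the real list is longer.
RULES = [
    (r"\bDPO\b|\bdirect\s+preference\s+optimization\b", "Bill_2", 1.0),
    (r"\b(reward\s+hacking|specification\s+gaming|Goodhart)\b", "Bill_3", 0.95),
]


def classify(text, rules=RULES):
    """Return (bill, weight) for the highest-weight matching rule, or None."""
    best = None
    for pattern, bill, weight in rules:
        # Assumption: case-insensitive search; DOTALL lets `.*` span line breaks.
        if re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL):
            if best is None or weight > best[1]:
                best = (bill, weight)
    return best


print(classify("We study direct preference optimization under reward hacking."))
```

Both rules match the sample sentence, so the sketch returns the `Bill_2` rule because its weight is higher.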

Two-pattern AND-form rules, which require both patterns to match, are used for sub-bill resolution:

```python
RULES_AND = [
    # ((p1, p2), bill, weight)
    ((r"\bDPO\b", r"\b(IPO|SimPO|KTO|ORPO)\b"), "Bill_2", 1.0),
    ((r"\bConstitutional\b", r"\b(principle\s+drift|principle\s+leakage)\b"), "Bill_11", 1.0),
    ((r"\bself[\-_]?reward", r"\b(distributional\s+collapse|mode\s+collapse)\b"), "Bill_6", 1.0),
]
```
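The AND-form can be evaluated by requiring every pattern in the pair to occur somewhere in the text. The sketch below assumes case-insensitive matching and uses a hypothetical function name, `match_and_rules`; how these hits are combined with the single-pattern rules is not specified here and is left out.

```python
import re

# Two rules copied from RULES_AND above.
RULES_AND = [
    ((r"\bDPO\b", r"\b(IPO|SimPO|KTO|ORPO)\b"), "Bill_2", 1.0),
    ((r"\bConstitutional\b",
      r"\b(principle\s+drift|principle\s+leakage)\b"), "Bill_11", 1.0),
]


def match_and_rules(text, rules=RULES_AND):
    """Return every (bill, weight) whose patterns all occur in text."""
    hits = []
    for patterns, bill, weight in rules:
        # Assumption: case-insensitive; pattern order in the text is irrelevant.
        if all(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            hits.append((bill, weight))
    return hits


print(match_and_rules("Constitutional training shows principle drift."))
```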

## Wiki paper file YAML frontmatter (`wiki/papers/<source>:<date>/<paper_id>.md`)

```yaml
---
paper_id: arxiv:2412.01234
title: "Self-Rewarding Language Models at Scale"
authors:
  - First Last
date: 2024-12
venue: arxiv:cs.LG 2024-12
summary: |
  Three-sentence summary tied to the closure mechanism.
bill: Bill_6
meta_cost: null
verdict: known_bill
confidence: 0.92
watchlist_tier: quarterly
method_family: Self-Rewarding
model_scale_billions: 70
compute_budget_relative: 4.2
claimed_kl_bound: null
evaluation_set:
  - AlpacaEval-2
  - Arena-Hard
reward_hack_probes:
  - length_bias
claimed_advantage_over_baseline: "+8 points AlpacaEval-2 vs DPO"
rebuttal_papers:
  - paper_id: arxiv:2503.05678
    summary: "Shows distributional collapse after iteration 4."
schema_version: "1.0"
---

(body of the paper note: extended discussion, why this paper triggers Bill_6
specifically, what cousin papers are in the watch-list, etc.)
```
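Tools that read these files first need to separate the frontmatter from the note body. A minimal standard-library sketch follows; the function name `split_frontmatter` is hypothetical, and it assumes the file starts with a `---` line and that the next `---` line closes the frontmatter.

```python
def split_frontmatter(text):
    """Split a paper note into (frontmatter, body).

    Raises ValueError when the opening or closing '---' line is missing.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return frontmatter, body
    raise ValueError("no closing '---' line found")


note = "---\npaper_id: arxiv:2412.01234\nbill: Bill_6\n---\n\nBody text.\n"
frontmatter, body = split_frontmatter(note)
print(frontmatter)
print(body)
```

The returned frontmatter string can then be handed to a YAML parser, for example PyYAML's `yaml.safe_load` if that package is installed.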

## Bill file format (`wiki/bills/bill_NN_<name>.md`)

```yaml
---
bill_number: 6
bill_name: "Self-rewarding / synthetic-preference closure"
type: bill
empty_space_candidate: false
---

# Bill 6 — Self-rewarding / synthetic-preference closure

## What gets paid

(prose explanation of the closure mechanism)

## Triggers

| Paper | Date | Notes |
| --- | --- | --- |
| [[arxiv:2412.01234]] | 2024-12 | ... |
| ... | | |

## Cousin precedents

(historical anchor papers that established this closure mechanism)

## Re-poll cadence

monthly | quarterly | triggered

## Rebuttal literature

| Paper | Date | What it closes |
| --- | --- | --- |
| ... | | |
```

## Schema versioning

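- `schema_version: "1.0"`: current draft; it will be locked when batch 1 returns. Keep the value quoted, as in the paper frontmatter, so that YAML reads it as a string rather than a float.
- Future versions: bump the version on any breaking change and document the change in `wiki/concepts/schema_changelog.md`.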