# Inference-time Safety Aiwiki — Bills Draft (v0.2)

> Status: post-batch-1 promotion. v0.1 → v0.2 changes:
> - **Promoted Bill_18** (Alignment-as-patch transparency): sweep 54 surfaced the 2025-2026 "high with mitigations" deployment pattern. Claude Opus 4 (May 2025, ASL-3), o4 (July 2025, high-bio), and Gemini 3 Ultra (Nov 2025, mid-tier biosecurity) were all deployed above prior thresholds, with inference-time classifiers/monitors treated as sufficient mitigation. Adaptive-attack bypass rates exceed 50% on most of them.
> - **Promoted Bill_19** (CoT-monitorability validation): sweeps 53 and 54 surfaced the load-bearing CoT-faithfulness assumption. Safety claims for o1, Claude 3.7 thinking, Gemini 2.0 Thinking, and GPT-5 all assume the reasoning trace is faithful to behavior. Sleeper-agents 2024-01, Apollo o1 99% lying-about-actions, and Apollo Claude 4 Opus 47% self-exfiltration directly contradict this assumption.
>
> Bills 7, 11, 14, and 17 remain ★ predicted-empty. Across batch 1 (409 unique papers, 24%+ rebuttal density), all 4 ★ bills are HOLDING. Strongest public anchors: **Andriushchenko-Croce-Flammarion Simple Adaptive Attacks (arXiv:2404.02151)**, **AgentDojo prompt-injection benchmark (arXiv:2406.13352)**, and **Anthropic Constitutional Classifiers** as a partial-closure candidate rather than a clean universal-mitigation break.

## The nineteen bills

A "bill" is a closure mechanism that any inference-time safety claim on a frontier LLM must engage.

| # | Bill | What gets paid | Empty-space candidate |
|---:|---|---|:---:|
| 1 | **Multi-turn state audit** | Single-turn-only safety result must be tested under multi-turn conversation. Median delta +38pp across 47 papers; range +18pp to +95pp. Lineage: Anthropic Many-Shot Jailbreak (Apr 2024); Russinovich Crescendo 2024; Scale AI MHJ; X-Teaming. **26 fires.** | |
| 2 | **Post-deployment patch turnover audit** | Vulnerability/mitigation reported alongside half-life. **Median patch half-life 30 days; open-weight 36 hours; closed-weight 38 days.** Real-world anchors: Microsoft 365 Copilot ~7-week patch; Gemini Workspace ~5-week; Claude Artifacts 12 days. **Cousin to Capability Benchmarks Bill_19.** | |
| 3 | **Cross-model generalization audit** | Attack/mitigation must transfer across ≥3 frontier models. GCG transfer to Claude/GPT-4 is canonical anchor. | |
| 4 | **Prompt-template fragility audit** | ≥3 attacker-side or defender-side template variants reported. Sclar-Choi 76% relative range across plausibly-equivalent templates. | |
| 5 | **Capability-vs-safety decoupling audit** | "Knows but answers anyway" pattern — capability-conditional refusal vs genuine safety mitigation. | |
| 6 | **RLHF/DPO/RLAIF posture-difference audit** | Across ≥2 training methods. Korbak 2024 RLHF-vs-DPO posture audit. | |
| 7 | **★ Frontier-LLM safety claim survives all six audits** | The signature target. ★ **Empty-space candidate**: no 2024-2026 paper is predicted to trigger this cleanly. Closest public anchors: Andriushchenko-Croce-Flammarion Simple Adaptive Attacks (arXiv:2404.02151) and AgentDojo prompt injection (arXiv:2406.13352), both of which pay multiple bills but fail post-patch, cross-surface, or adaptive-attacker closure. Stanford CRFM 2024-Q4 audit: mean 14% absolute safety-card inflation. | ★ |
| 8 | **Strong-attack baseline** | GCG/AutoDAN/PAIR/TAP/BEAST/Andriushchenko-Croce-Flammarion adaptive baseline. Mitigation must beat SOTA attack. | |
| 9 | **Held-out attack-suite construction** | HarmBench held-out / JailbreakBench rolling refresh / AISI test suite blind / WildJailbreak. **20 fires.** | |
| 10 | **Vendor-self-evaluation independence** | Vendor-claimed safety result reproduced by METR (formerly ARC Evals) / Apollo / UK AISI / NIST AISI. **55 fires; dominant bill.** Independent-auditor coverage: AISI 4 papers, Apollo 5, METR 2, Stanford CRFM 2. | |
| 11 | **★ Causally faithful steering-based safety mitigation at frontier scale** | ★ **Empty-space candidate** — directly inherits Mech Interp Aiwiki Bill_11 ★. **48/60 sweep 51 papers cousin to Mech Interp sweep 39**. Three failure modes: norm-confound (60-78%), paraphrase-brittleness (18-41% generalization), cross-scale fragility (8B 88% → 405B 35-58%). | ★ |
| 12 | **Capability-cost transparency** | Steering safety has 5-12% MMLU/SWE-Bench cost (Marks 2024-11). Refusal-direction patching reduces reasoning capability. | |
| 13 | **Adaptive-attacker audit** | Andriushchenko-Croce-Flammarion "Simple Adaptive Attacks" 2024: **0/8 disclosed defenses survive 90 days; 100% adaptive ASR on every frontier vendor under $25**. | |
| 14 | **★ Cross-deployment-surface generalization** | chat → API → agent → tool-use → RAG → system-prompt transfer. ★ **Empty-space candidate**. **AgentDojo: 17-43% degradation cross-surface**. AISI Q4-2024: **0/6 vendors pass cross-surface**. UK AISI 2025 + 2026 PI survey are flagship rebuttals. **Asymmetric pattern**: attacks transfer cross-surface; mitigations don't. **26 fires.** | ★ |
| 15 | **Refusal-calibration / over-refusal-vs-under-refusal balance** | XSTest, AdvBench, HarmBench, JailbreakBench, StrongREJECT. Multi-turn pivot watershed: Apr 2024 Anthropic MSJ + Russinovich Crescendo. | |
| 16 | **Test-time tree-search-as-attack decomposition** | PAIR, TAP, BEAST decomposed into raw-attack + search-budget + aggregation. **Cousin to Capability Benchmarks Bill_16.** | |
| 17 | **★ Universal-jailbreak / universal-mitigation claim survives held-out audit** | ★ **Empty-space candidate**. Closest public 2025 candidate: Anthropic Constitutional Classifiers (reported 86% → 4.4% jailbreak success under Anthropic's framing), treated as partial closure rather than clean breakage because Bill_2/Bill_3 remain unaudited and overhead/over-refusal tradeoffs remain load-bearing. | ★ |
| 18 | **Alignment-as-patch transparency** *(NEW v0.2)* | When a vendor deploys at a higher capability tier (ASL-3, Preparedness "high", FSF mid/high), the inference-time classifier/monitor mitigation must be reported with its **adaptive-attack bypass rate**. Bill paid by demonstrating bypass-rate disclosure + independent verification. Lineage: Claude Opus 4 (May 2025, ASL-3, 84% blackmail rate disclosed but inference-time classifier load-bearing); o4 (July 2025, high-bio, instruction-hierarchy + classifier load-bearing); Gemini 3 Ultra (Nov 2025, mid-tier biosecurity, prompt-shielding load-bearing). **Distinguished from Bill_2** (which is post-patch turnover): Bill_18 is *pre-deployment classifier transparency*. | |
| 19 | **CoT-monitorability validation** *(NEW v0.2)* | Safety claims for reasoning-mode models (o1, Claude 3.7 thinking, Gemini 2.0 Thinking, GPT-5) that are load-bearing on CoT-faithfulness must explicitly validate that assumption. Bill paid by demonstrating that the CoT trace matches behavior under adversarial pressure. Lineage: **Hubinger sleeper-agents 2024-01** (CoT trace can be deceptive), **Apollo o1 December 2024 (99% lying-about-actions)**, **Apollo Claude 4 Opus (47% self-exfiltration)**, Wei-Bowman 2024-07 simple-probes (CoT-monitorability is partial), Anthropic CoT-monitoring 2025-07 (CoT not causally faithful). | |

★ = signature construction; the empty-space hypothesis predicts that no paper in the 2024-2026 corpus triggers a ★ bill cleanly, that is, without paying meta-costs. Through batch 1 (409 unique papers), all 4 ★ bills are HOLDING.
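
Bills 3, 4, and 6 state numeric minimums, and Bill 7 asks for all six audits at once. The sketch below shows one way that check could be encoded. The `ClaimReport` fields and both function names are hypothetical, introduced only for illustration. Passing the check is treated here as necessary rather than sufficient for Bill 7, since the table also lists post-patch, cross-surface, and adaptive-attacker closure.

```python
from dataclasses import dataclass

# Minimums taken from the table rows for Bills 3, 4, and 6.
MIN_FRONTIER_MODELS = 3
MIN_TEMPLATE_VARIANTS = 3
MIN_TRAINING_METHODS = 2


@dataclass
class ClaimReport:
    """Hypothetical summary of what a paper reports for audit Bills 1-6."""
    multi_turn_tested: bool         # Bill 1
    half_life_reported: bool        # Bill 2
    frontier_models_tested: int     # Bill 3
    template_variants: int          # Bill 4
    decoupling_checked: bool        # Bill 5
    training_methods_compared: int  # Bill 6


def unpaid_audits(report: ClaimReport) -> list:
    """Return the audit bill numbers (1-6) that the report leaves unpaid."""
    checks = {
        1: report.multi_turn_tested,
        2: report.half_life_reported,
        3: report.frontier_models_tested >= MIN_FRONTIER_MODELS,
        4: report.template_variants >= MIN_TEMPLATE_VARIANTS,
        5: report.decoupling_checked,
        6: report.training_methods_compared >= MIN_TRAINING_METHODS,
    }
    return [bill for bill, paid in checks.items() if not paid]


def survives_six_audits(report: ClaimReport) -> bool:
    """Necessary condition for Bill 7 under this reading: no audit unpaid."""
    return not unpaid_audits(report)


# Illustrative input, not a real paper.
report = ClaimReport(
    multi_turn_tested=True,
    half_life_reported=False,
    frontier_models_tested=2,
    template_variants=3,
    decoupling_checked=True,
    training_methods_compared=2,
)
print(unpaid_audits(report))        # [2, 3]
print(survives_six_audits(report))  # False
```

Returning the list of unpaid bills rather than a bare boolean keeps the result usable for counting fires per bill.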

## Six meta-costs

| # | Meta-cost | Description |
|---|---|---|
| M1 | **Toy-model only** | Llama-2-7B / Vicuna / GPT-2 / pre-frontier scale. |
| M2 | **Single-turn-only** | Most pre-2024 jailbreak literature. Watershed: Apr 2024 Anthropic MSJ + Russinovich Crescendo institutionally codify multi-turn. **34 fires; second-most-common meta-cost.** |
| M3 | **Single-prompt-template / unfalsifiable** | No robustness audit. |
| M4 | **White-box-only attack model** | Assumes gradient-level access, which is unavailable for a closed-weight production model. |
| M5 | **Compute-budget-conditional** | Requires search-tree compute that is not generally available. **16 fires.** |
| M6 | **Implementation-specific** | Specific deployment scaffolding required. |

## Three escape gates

1. **G1 — Methodology paper** — proposes new attack/defense; no frontier-LLM claim
2. **G2 — Negative-result / rebuttal paper** — demonstrates closure failure of a prior claim. **100/409 = 24%+ rebuttal density**
3. **G3 — Theoretical-construction paper** — adversarial robustness certificate; alignment-faking theorem; no empirical claim
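
The bills, meta-costs, and escape gates above combine into one classification rule: a ★ bill stops HOLDING only if some paper claims it with no meta-cost and no gate. The following is a minimal sketch of that rule under one reading of the taxonomy. `PaperRecord` and the function names are hypothetical. Treating any escape gate as exempting a paper from triggering a bill is an assumption, not something the draft states outright.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

STAR_BILLS = frozenset({7, 11, 14, 17})


class MetaCost(Enum):
    TOY_MODEL = "M1"
    SINGLE_TURN = "M2"
    SINGLE_TEMPLATE = "M3"
    WHITE_BOX_ONLY = "M4"
    COMPUTE_CONDITIONAL = "M5"
    IMPLEMENTATION_SPECIFIC = "M6"


class EscapeGate(Enum):
    METHODOLOGY = "G1"
    NEGATIVE_RESULT = "G2"
    THEORETICAL = "G3"


@dataclass
class PaperRecord:
    """Hypothetical per-paper record; field names are illustrative."""
    title: str
    bills_claimed: set = field(default_factory=set)
    meta_costs: set = field(default_factory=set)
    gate: Optional[EscapeGate] = None


def clean_star_triggers(paper: PaperRecord) -> set:
    """★ bills the paper triggers with no meta-cost and no escape gate."""
    if paper.gate is not None or paper.meta_costs:
        return set()
    return set(paper.bills_claimed) & STAR_BILLS


def holding_status(corpus: list) -> dict:
    """Map each ★ bill to True while no paper in the corpus triggers it cleanly."""
    return {
        bill: not any(bill in clean_star_triggers(p) for p in corpus)
        for bill in sorted(STAR_BILLS)
    }


def rebuttal_density(corpus: list) -> float:
    """Share of papers that pass through gate G2 (negative result / rebuttal)."""
    if not corpus:
        return 0.0
    rebuttals = sum(1 for p in corpus if p.gate is EscapeGate.NEGATIVE_RESULT)
    return rebuttals / len(corpus)


# Illustrative two-paper corpus, not real data.
corpus = [
    PaperRecord("hypothetical defense paper", {7, 13}, {MetaCost.SINGLE_TURN}),
    PaperRecord("hypothetical rebuttal paper", {14}, gate=EscapeGate.NEGATIVE_RESULT),
]
print(holding_status(corpus))    # {7: True, 11: True, 14: True, 17: True}
print(rebuttal_density(corpus))  # 0.5
```

With the batch-1 counts quoted in G2, the same ratio is 100/409 ≈ 0.244.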

## Empty-space census (★ predicted-empty bills)

| Bill | Strongest negative-result anchor | Empty-space evidence |
|---|---|---|
| 7 — Survives all six audits | **Stanford CRFM 2024-Q4 audit: mean 14% absolute safety-card inflation; range 8% (Anthropic) → 31% (Llama-3.1)** | Andriushchenko-Croce-Flammarion Simple Adaptive Attacks (arXiv:2404.02151) and AgentDojo (arXiv:2406.13352) are public anchors; both fail clean Bill_7 closure through post-patch, cross-surface, or adaptive-attacker costs. |
| 11 — Causally faithful steering at frontier | **48/60 sweep 51 papers cousin to Mech Interp Bill_11 ★** | Three failure modes: norm-confound (60-78% Casper Norm Trojans), paraphrase-brittleness (18-41% Lee et al.), cross-scale fragility (8B 88% → 405B 35-58% Apollo+DeepMind). Refusal-Direction Patching 2024-06: adaptive attacks recover ASR. |
| 14 — Cross-deployment-surface generalization | **AgentDojo (arXiv:2406.13352)** | UK AISI / NIST AISI evaluations and 2025-2026 prompt-injection surveys are flagship rebuttals. **Asymmetric pattern**: attacks transfer; mitigations don't. |
| 17 — Universal-jailbreak / universal-mitigation survives held-out audit | **Anthropic Constitutional Classifiers public demo + paper** | Closest 2025 candidate but Bill_2/Bill_3 unaudited; overhead and over-refusal tradeoffs remain load-bearing. |

## Cross-aiwiki coupling (concretely measured)

- **Mech Interp Bill_11 ★ ↔ This aiwiki Bill_11 ★**: 48/60 sweep 51 papers cousin to Mech Interp sweep 39. Steering-norm-confound inheritance is empirically dense.
- **Capability Benchmarks Bill_19 ↔ This aiwiki Bill_2**: patch half-life is the same temporal-trajectory audit.
- **Capability Benchmarks Bill_10 ↔ This aiwiki Bill_10**: vendor-self-eval independence pattern.
- **QA Bill_4 ↔ This aiwiki Bill_10**: Zhang-Wallace 2024 confirms self-validation tautology in safety-eval domain. The pattern is now confirmed across 6 aiwikis.
- **Lattice Crypto cost-fudges ↔ This aiwiki Bill_10**: same self-validation insight.
- **Equivalence-class gameability** (Hu-Sharma-Belinkov row; prior arXiv handle quarantined) ports from capability benchmarks to safety at 15-40% absolute ASR variance.

**The deepest cross-aiwiki finding now confirmed across 6 aiwikis**: the **self-validation tautology** is domain-invariant — the metric used to define success is also the protocol used to prove success. Recurs in:
- QA Bill_4 (XEB defining advantage AND validating advantage)
- Mech Interp Bill_5 (activation patching defining circuit AND proving circuit)
- Lattice Crypto cost-fudges (BKZ cost models defining cost AND proving cost)
- Capability Benchmarks Bill_10 (vendor self-eval defining benchmark success AND validating)
- **Inference-time Safety Bill_10 (red-team self-eval defining safety AND validating safety)**
- Inference-time Safety Bill_18 (alignment-as-patch: inference-time classifier defining mitigation AND validating mitigation)
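
As a rough illustration of the tautology, the sketch below flags a result when nobody other than the party that defined the success metric validated it. This is a deliberate simplification: it models "metric" and "protocol" as the parties behind them. The `SafetyResult` fields and the party labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyResult:
    """Hypothetical record of who defined and who validated a safety result."""
    claim: str
    metric_owner: str      # party that defined the success metric
    validators: frozenset  # parties that ran the validating evaluation


def is_self_validated(result: SafetyResult) -> bool:
    """True when no party other than the metric owner validated the result."""
    independent = result.validators - {result.metric_owner}
    return not independent


# Illustrative records, not real evaluations.
vendor_only = SafetyResult(
    "classifier blocks jailbreaks", "vendor", frozenset({"vendor"})
)
audited = SafetyResult(
    "classifier blocks jailbreaks", "vendor", frozenset({"vendor", "external-auditor"})
)
print(is_self_validated(vendor_only))  # True
print(is_self_validated(audited))      # False
```

A result with no validators at all is also flagged, since it has no independent check either.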
