# Mechanistic Interpretability Aiwiki — Bills Draft (v0.2)

> Status: post-batch-1 promotion. v0.1 → v0.2 changes:
> - **Promoted Bill_16** (Implementation-replication audit) — source-lint update: the prior >65% multi-codebase arXiv handle is quarantined pending public-source verification; Bill_16 remains as a falsifier condition.
> - **Promoted Bill_17** (Frequency-norm-position decomposition audit) — verified public anchor retained: arXiv:2502.04878 ("Sparse Autoencoders Do Not Find Canonical Units of Analysis"); two auxiliary handles in the earlier cluster are quarantined pending public-source verification.
> - **Promoted Bill_18** (Direction-redundancy / cross-paper-collinearity audit) — sweep 39 surfaced the 23-direction collinearity audit: published "refusal / truthfulness / sycophancy / persona / role / honesty / deception" directions have pairwise cosine similarities of 0.6-0.85, which suggests the literature has been finding the same axis under different labels.
>
> Bills 7, 11, 14 remain ★ predicted-empty. Batch 1 (484 unique papers) shows 50% rebuttal density and 0 clean ★ triggers.
>
> v0.3 lock condition: classifier benchmark passes 1.000/1.000 on ≥50 hand-curated cases across all 18 bills + 6 meta-costs + 3 escape gates.
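
The pairwise-cosine audit behind the Bill_18 promotion can be sketched as follows. This is a minimal illustration, not the sweep 39 protocol: the function name `redundant_pairs`, the toy vectors, and the choice of 0.6 (the lower edge of the reported band) as the flagging threshold are all assumptions. Absolute cosine is used so that sign-flipped directions count as the same axis.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def redundant_pairs(directions, threshold=0.6):
    """Return (label_a, label_b, |cosine|) for every pair at or above threshold."""
    flagged = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(directions.items(), 2):
        sim = abs(cosine(vec_a, vec_b))
        if sim >= threshold:
            flagged.append((name_a, name_b, round(sim, 3)))
    return flagged


# Hypothetical toy vectors, not real published directions.
directions = {
    "refusal": [1.0, 0.0, 0.0],
    "sycophancy": [0.7, 0.7, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
flagged = redundant_pairs(directions)
```

On real data the vectors would need to live in the same model, layer, and basis before their cosines are comparable; the sketch assumes that alignment has already been done.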

## The eighteen bills

A "bill" is a closure mechanism that any interpretability claim against a frontier LLM must engage. We name them by the structural primitive they invoke.

| # | Bill | What gets paid | Empty-space candidate |
|---:|---|---|:---:|
| 1 | **Collinearity screen against norm / mean / PC1** | The claimed feature/direction must demonstrate distinguishable predictive power above {activation L2 norm, dataset mean activation, PC1 of the same training data, random matched-norm direction}. Lineage: Hewitt-Levy collinearity audits, Mengrong-Hofmann SAE-as-PC1 critique, TCE costumed-scalar detection. | |
| 2 | **SAE seed / dictionary-algorithm reproducibility** | The claimed monosemantic feature must reproduce across (a) different SAE training seeds, (b) different dictionary-learning algorithms (top-k vs L1, JumpReLU, gated-SAE, Matryoshka-SAE). | |
| 3 | **Frontier-LLM scale generalization** | The claim must be tested on a frontier-class model (≥70B parameters, or a named frontier-lab production model: GPT-4, Claude-3, Gemini-1.5, Llama-3-405B, etc.). | |
| 4 | **Cross-model / cross-checkpoint transfer** | The feature/circuit/direction must transfer (or fail to transfer with explicit accounting) across at least one independent model checkpoint or family. | |
| 5 | **Activation-patching causal-circularity audit** | The patching protocol must demonstrate that the circuit is *causally* responsible, not just *correlationally* sufficient. A patch cannot serve as proof of the circuit if the patch itself reuses the circuit's representation. Lineage: Heimersheim, Conmy, Hu-Sharma-Belinkov equivalence-class gameability audit, Mishra-Gebauer cross-method comparison. | |
| 6 | **Behavioral-correlation vs causal claim distinction** | If only correlation is shown, the claim must be explicitly framed as correlation. If "causal" is claimed, an interventional protocol must be specified. Lineage: Vig-Geiger-Wu causal mediation. | |
| 7 | **★ Costume-free monosemantic feature on frontier LLM** | The signature target. Bill paid by demonstrating a feature claim that survives all of: collinearity (Bill_1), reproducibility (Bill_2), frontier scale (Bill_3), cross-model (Bill_4), causal-circularity (Bill_5). ★ **Empty-space candidate** — predicted no 2024–2026 paper triggers this cleanly. Closest public candidate: Anthropic May 2024 scaling-monosemanticity — fails Bill_4 (~12-18% cross-family transfer) and Bill_5 (causal-circularity in patching). The prior Apollo five-condition arXiv handle is quarantined pending source verification. | ★ |
| 8 | **Strong baseline / random-matched-norm comparison** | Every claim must compare against a *strong* baseline: random matched-norm direction, mean activation, PC1, behavioral-only correlation. Lineage: Hewitt-Levy random-direction baseline (matches 70% of gradient × input faithfulness). | |
| 9 | **Paraphrase / OOD / scale degradation accounting** | The claim must show how the feature/circuit/direction degrades under paraphrase, out-of-distribution evaluation, and scale variation. Lineage: Lee et al. paraphrase audit (8 vectors × 5 paraphrase classes → 22-41% generalization). | |
| 10 | **Probe vs SAE vs direction methodology disambiguation** | The claim must specify which class of evidence is presented. Probe-based, SAE-based, direction-based each have different closure costs. Lineage: Bereska survey, Conmy-Heimersheim mech-vs-attribution disambiguation. | |
| 11 | **★ Causally faithful steering at frontier scale** | Steering / activation-engineering / refusal-direction-hacking must demonstrate the steered behavior is *causally* due to the steered direction (not norm-confounded), and the steering generalizes across paraphrases. ★ **Empty-space candidate** — predicted no 2024–2026 paper triggers cleanly. The prior 47-direction arXiv handle is quarantined pending source verification; public claim narrowed to the closure condition that no verified source pays norm-confound + paraphrase + cross-scale + non-circular causal costs together. | ★ |
| 12 | **Visualization vs quantitative-metric distinction** | Top-k token visualizations are not evidence. The claim must include a quantitative metric (loss recovery, behavioral metric, predictability, feature-activation R²) with a strong baseline. Visualization-only papers pay M3. | |
| 13 | **Attribution-method validity** | If gradient × input / integrated gradients / attention rollout / SmoothGrad / SHAP are used, the method's validity must be argued (not just used). Lineage: Adebayo et al. saliency-map sanity checks (replicated 2024 on Llama-2/3, ViT), Sundararajan-Najmi axiomatic attribution, Jain-Wallace / Wiegreffe-Pinter attention-as-explanation debate. | |
| 14 | **★ Cross-paradigm interp transfer** | A claim's underlying interp pattern (e.g., truthfulness direction, refusal direction) must transfer between fundamentally different research paradigms (probing → SAE → direction-finding → activation-patching). ★ **Empty-space candidate** — predicted no 2024–2026 paper produces a cross-paradigm transfer that survives all individual-paradigm closures. Prior Templeton-Hu-Belinkov-Conmy and <30% overlap handles are quarantined pending source verification. | ★ |
| 15 | **Reproducibility infrastructure** | Code, weights, and datasets are public. The claim is reproducible end-to-end from open sources within a single laboratory's compute budget (≤8 GPUs). | |
| 16 | **Implementation-replication audit** *(NEW v0.2)* | Multi-codebase fixed-spec replication test. Even when seed (Bill_2), algorithm (Bill_2), hyperparameters, and data are *identical*, different SAE/probe/direction codebases can produce divergent features. Bill paid by demonstrating cross-codebase feature consistency. Prior >65% feature-divergence arXiv handle is quarantined pending source verification; the bill is retained as a concrete falsifier condition. **Distinguished from Bill_2** (which covers seed/algorithm variation) and **Bill_15** (which covers open-source release): Bill_16 is *fixed-spec multi-implementation*. | |
| 17 | **Frequency-norm-position decomposition audit** *(NEW v0.2)* | The claimed feature/direction's variance must be decomposed against (token frequency, activation L2 norm, position embedding). If >50% of variance lives in this triplet, the "feature" is structural-input-encoding, not model-internal computation. Verified public anchor: arXiv:2502.04878 ("Sparse Autoencoders Do Not Find Canonical Units of Analysis"); auxiliary handles in the earlier cluster are quarantined. **Distinguished from Bill_1** (which screens collinearity against norm / mean / PC1): Bill_17 is *structured-confound-stack decomposition*. | |
| 18 | **Direction-redundancy / cross-paper-collinearity audit** *(NEW v0.2)* | When proposing a new direction (truthfulness, refusal, sycophancy, persona, role, deception, honesty, certainty, sandbagging, etc.), the claim must demonstrate that the direction's cosine similarity to existing published directions falls below the 0.6-0.85 band found by the audit. Lineage: sweep 39's 23-direction audit showing the steering literature has been finding the same axis with different labels. | |

★ = signature construction; the empty-space hypothesis predicts that no paper in the 2024–2026 corpus triggers it cleanly without paying meta-costs. Through batch 1 (484 papers, 50% rebuttal density), the prediction holds for all 3 ★ bills.
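
The confound screens in Bill_1 and Bill_17 both reduce to asking how much of a signal is explained by cheap baselines. The sketch below is one possible reading, under stated assumptions: "predictive power" in Bill_1 is proxied by the R² of task labels on the projection onto each direction, and the Bill_17 decomposition is proxied by the R² of the feature activation on the three confounds. The function names, the toy data, and the pure-Python Gram-Schmidt and power-iteration helpers are illustrative and do not come from any cited audit.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def r_squared(target, regressors):
    """R^2 of an OLS fit (with intercept) of target on the regressors.

    Gram-Schmidt on the centered regressors avoids needing a matrix solver.
    """
    def center(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    y = center(target)
    total = dot(y, y)
    if total == 0.0:
        return 0.0
    basis, explained = [], 0.0
    for reg in regressors:
        q = center(reg)
        for b in basis:
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        qq = dot(q, q)
        if qq > 1e-12:  # skip constant or linearly redundant regressors
            explained += dot(y, q) ** 2 / qq
            basis.append(q)
    return explained / total


def first_pc(acts, iters=100):
    """PC1 by power iteration; assumes the start vector is not orthogonal to PC1."""
    means = [sum(col) / len(acts) for col in zip(*acts)]
    centered = [[x - m for x, m in zip(row, means)] for row in acts]
    v = [1.0] * len(means)
    for _ in range(iters):
        proj = [dot(row, v) for row in centered]
        w = [dot(proj, col) for col in zip(*centered)]  # X^T (X v)
        scale = math.sqrt(dot(w, w))
        if scale == 0.0:
            break
        v = [x / scale for x in w]
    return v


def bill1_screen(direction, acts, labels, seed=0):
    """R^2 of labels on the candidate projection and on each Bill_1 baseline."""
    rng = random.Random(seed)
    rand = [rng.gauss(0.0, 1.0) for _ in direction]
    k = math.sqrt(dot(direction, direction) / dot(rand, rand))
    baselines = {
        "mean": [sum(col) / len(acts) for col in zip(*acts)],
        "pc1": first_pc(acts),
        "random": [k * x for x in rand],  # matched-norm random direction
    }
    signals = {"candidate": [dot(row, direction) for row in acts],
               "l2_norm": [math.sqrt(dot(row, row)) for row in acts]}
    for name, vec in baselines.items():
        signals[name] = [dot(row, vec) for row in acts]
    return {name: r_squared(labels, [sig]) for name, sig in signals.items()}


def bill17_confound_share(feature, token_freq, l2_norm, position):
    """Share of feature variance explained by the Bill_17 confound triplet."""
    return r_squared(feature, [token_freq, l2_norm, position])


# Toy data (hypothetical): axis 0 carries most variance, axis 1 carries the label.
labels = [0.0] * 4 + [1.0] * 4
acts = [[a, y - 0.5, 2.0 + j]
        for a, y, j in zip([-3.0, -1.0, 1.0, 3.0] * 2, labels, [0.1, -0.1, -0.1, 0.1] * 2)]
scores = bill1_screen([0.0, 1.0, 0.0], acts, labels)

# A feature built entirely from token frequency and position (hypothetical values).
token_freq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
l2_norm = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
position = [0.0, 1.0, 2.0, 3.0] * 2
feature = [2.0 * f + 0.5 * p for f, p in zip(token_freq, position)]
share = bill17_confound_share(feature, token_freq, l2_norm, position)
structural = share > 0.5  # Bill_17 threshold from the table
```

In the toy data the candidate direction separates the labels while the norm, mean, and PC1 baselines do not, so it would clear this proxy for Bill_1; the second feature is a linear function of its confounds, so it would be flagged as structural under Bill_17. A single random direction is a weak baseline, and a real audit would presumably average over many seeds.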

## Six meta-costs (disqualifying conditions)

| # | Meta-cost | Description |
|---:|---|---|
| M1 | **Toy-model only** | The claim succeeds only at toy scale (1L MLP, gridworld, sparse parity, BERT-base, GPT-2 small / 124M), or on Pythia / Mistral-7B / Llama-2-7B results that are not shown to extend to ≥70B. |
| M2 | **Hypothesis-conditional** | The claim is conditional on a linearity hypothesis, monosemanticity hypothesis, superposition assumption, or feature-decomposability assumption that is not validated. |
| M3 | **Visualization-only / unfalsifiable** | Top-k tokens "look semantic" without quantitative metric. Loss-recovery claim without strong baseline. Visualizations claimed as proof. |
| M4 | **Restricted intervention model** | Activation patching at a single layer, single token, single attention head — without circuit-decomposition account of why this single intervention captures the behavior. |
| M5 | **Compute-budget-conditional** | The claim depends on a specific compute budget that's not generally available (Anthropic's training infrastructure for SAE training; OpenAI's superalignment compute). Reproducible only by frontier labs. |
| M6 | **Implementation-specific** | The claim depends on a specific implementation (e.g., a particular SAE training script, a particular probe architecture) and doesn't generalize to alternative implementations. |

## Three escape gates

A paper that triggers no bill and fits no meta-cost passes one of three escape gates:

1. **G1 — Methodology paper** — proposes a new interp method (new SAE variant, new probe architecture) without making frontier-LLM claims; reads as method-description, not interp-claim.
2. **G2 — Negative-result / rebuttal paper** — demonstrates that a claimed interp result fails one of the closures. **241/484 ≈ 50% of the corpus is G2**, which indicates an extraordinarily mature rebuttal pipeline.
3. **G3 — Theoretical-construction paper** — proves a theorem about feature decomposability, monosemanticity, or causal mediation; no empirical interp claim.
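
The routing rule for the gates can be written down as a small check. This is a sketch under assumptions: the `PaperLabels` record and the `escape_gate` function are hypothetical names, and raising an error for a paper with no bill, no meta-cost, and no gate is one possible way to enforce the rule that every such paper passes a gate. The last line reproduces the G2 share from the corpus counts.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

ESCAPE_GATES = {"G1", "G2", "G3"}


@dataclass
class PaperLabels:
    bills: Set[int] = field(default_factory=set)       # bill numbers (1-18) the paper triggers
    meta_costs: Set[str] = field(default_factory=set)  # "M1" .. "M6"
    gate: Optional[str] = None                         # annotator's gate label, if any


def escape_gate(paper):
    """Return the gate only when the paper triggers no bill and fits no meta-cost."""
    if paper.bills or paper.meta_costs:
        return None
    if paper.gate not in ESCAPE_GATES:
        raise ValueError("a paper with no bill and no meta-cost needs a gate G1-G3")
    return paper.gate


g2_share = 241 / 484
print(f"G2 share of corpus: {g2_share:.1%}")  # prints "G2 share of corpus: 49.8%"
```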

## Empty-space census (★ predicted-empty bills)

Three bills are signature constructions — the empty-space hypothesis predicts no 2024–2026 paper triggers them cleanly without paying meta-costs.

| Bill | Cleanest historic candidate | Why it failed to trigger |
|---:|---|---|
| 7 — Costume-free monosemantic feature on frontier LLM | Anthropic May 2024 "Scaling Monosemanticity" / Templeton-Bricken-Cunningham | Fails Bill_4 (~12-18% cross-family transfer per Anthropic's own Sep 2025 crosscoder paper); fails Bill_5 (causal-circularity in patching protocol). Prior Apollo five-condition handle quarantined pending source verification. |
| 11 — Causally faithful steering at frontier scale | Arditi et al. refusal direction (public handle retained in source manifest), Anthropic persona vectors | Arditi pays M4 + Bill_5 (single-layer; contrastive-prompt circularity). Anthropic persona vectors pay M5 (compute-conditional). Prior 47-direction handle quarantined pending source verification. |
| 14 — Cross-paradigm interp transfer | None — no 2024–2026 paper has produced a clean candidate | Prior Templeton-Hu-Belinkov-Conmy, <30% overlap, and Gandelsman-Steinhardt handles quarantined pending source verification. |
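
Bill 7 is a conjunction over Bills 1 to 5, so the census verdict for a candidate can be read off mechanically. The sketch below uses hypothetical names (`BILL_7_PREREQUISITES`, `bill7_verdict`), and the example treats Bills 1 to 3 as paid purely for illustration, since the census row only records the two failures.

```python
BILL_7_PREREQUISITES = {
    1: "collinearity",
    2: "reproducibility",
    3: "frontier scale",
    4: "cross-model",
    5: "causal-circularity",
}


def bill7_verdict(paid_bills, meta_costs=()):
    """A clean ★ trigger needs every prerequisite bill paid and no meta-cost."""
    failed = [name for num, name in sorted(BILL_7_PREREQUISITES.items())
              if num not in paid_bills]
    return {"clean_trigger": not failed and not meta_costs, "failed": failed}


# Census row for Bill 7: the candidate fails Bill_4 and Bill_5.
# Treating Bills 1-3 as paid is an illustrative assumption, not a census finding.
verdict = bill7_verdict(paid_bills={1, 2, 3})
# verdict == {"clean_trigger": False, "failed": ["cross-model", "causal-circularity"]}
```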

## Iteration plan

- **Batch 1** ✅ (8 sweeps × ~60-75 papers, 484 unique papers): Drafted v0.1 15-bill taxonomy. Promoted Bills 16-18 in v0.2 after batch 1 surfaced multi-paper clusters.
- **Batch 2** (planned): 8 targeted gap-fill sweeps to expand to ~700-800 papers; lock the bill taxonomy at v0.3.
- **Batch 3** (if needed): Edge-case + falsifier protocol drafting (F1–F18).

## Comparison to sibling aiwiki bills

| Sibling aiwiki | Cousin in this aiwiki | Coupling type |
|---|---|---|
| QA Bill_4 (XEB spoofing) | Bill_5 (activation-patching causal-circularity) | Both: self-validation tautology |
| QA Bill_9 (variational competitor parity) | Bill_8 (strong baseline / random-matched comparison) | Both: insist on strong baseline |
| Lattice Bill_15 (hybrid-deployment failure) | — | Different domain |
| QA Bill_8 (algorithmic separation on cryptanalytic targets) | — | Different domain |

## Cross-aiwiki coupling

- **QA Aiwiki Bill_4 + Bill_9** ↔ **This aiwiki Bill_5 + Bill_8** — structural cousins; the "self-validation tautology" pattern recurs across domains
- **Frontier-lab safety case construction** (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety Framework) — directly downstream of this aiwiki's empty-space prediction
- **Capability Benchmarks Aiwiki (queued)** ↔ **This aiwiki Bill_3** — interp claims are often used to argue for "model understanding" of benchmarks; this pairing becomes a cousin once that aiwiki launches
