Mechanistic Interpretability Ledger · v0.2 · 2026-05-08

484 papers.
18 bills.
Empty space holding.

484-paper ledger for frontier-LLM interpretability claims. Three signature-empty bills hold, with 50% rebuttal density across the corpus.

484 · Unique papers
18 · Bills
3 · ★ Empty bills
60/60 · Classifier 1.000/1.000

Quick Orientation

Researchers say they can read what a neural network is 'thinking' — we checked whether those readings hold up.

Mechanistic interpretability tries to identify what specific neurons or features inside a frontier AI mean — finding a "truthfulness direction," a "refusal direction," a circuit for arithmetic. We surveyed 484 papers from 2024-2026 and 50% of them are rebuttals of earlier claims (the highest rebuttal density of any ledger we've built). Many of the "interpretable features" published turn out to be statistical doubles of something simpler: norm, mean, principal component. 23 published "named directions" overlap with each other 60-85% — the field has been giving different labels to the same axis. No frontier-scale interpretability claim survives all the checks. Independent citation verification is still in progress.

Why it matters: AI safety frameworks at every frontier lab cite interpretability as an evidence-bearing mitigation. If the features aren't real, the safety story doesn't hold.

What we found: 484 papers checked, 50% rebuttal density. Three predicted-empty lines hold: no frontier-scale interpretability claim survives all six checks.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-08

Three signature constructions.
Four hundred eighty-four papers.
Empty space holding.

§01

The eighteen-bill closure pattern

Bills are the closure mechanisms an interpretability claim must clear. Every paper maps to one or more bills, a meta-cost, or an escape gate.

How to read this heatmap: cells show candidate papers. A starred bill is ★ empty only when candidate count is nonzero but clean triggers remain zero after meta-cost, rebuttal, leakage, non-transfer, or escape-gate review. The closure basis appears below.

Heatmap cells (bill · candidates · status):
7★ · 9 · empty
11★ · 42 · empty
14★ · 12 · empty
NEW cells: 5 · 4 · 3

Legend: ★ Predicted empty (HOLDING) · NEW v0.2 (Bills 16-18) · Dominant (≥60 papers) · High activity (≥30 papers) · Active (10–29 papers)

★ Empty-space verification

Bill · Closure basis · Cands. · Clean

★ 7 · Costume-free monosemantic feature on frontier LLM · candidates: 9 · clean triggers: 0
Closure basis: Apollo five-condition audit row is quarantined for public handle verification; Anthropic Universal Features Crosscoder ~12-18% cross-family overlap remains the public anchor.

★ 11 · Causally faithful steering at frontier scale · candidates: 42 · clean triggers: 0
Closure basis: public claim narrowed to no known paper paying norm-confound + paraphrase + cross-scale + non-circular-causal costs; internal 47-direction row quarantined pending handle verification.

★ 14 · Cross-paradigm interp transfer · candidates: 12 · clean triggers: 0
Closure basis: public claim narrowed to no verified cross-paradigm transfer across probe / SAE / direction / patching; prior Templeton-Hu-Belinkov-Conmy handle quarantined pending verification.
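The ★ empty rule from the heatmap note can be sketched in a few lines of Python. This is a minimal illustration rather than the ledger's own code: the `BillRow` type and the `is_star_empty` helper are hypothetical names, and the counts are copied from the verification table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillRow:
    """One row of the empty-space verification table (hypothetical type)."""
    bill: int
    candidates: int
    clean_triggers: int


def is_star_empty(row: BillRow) -> bool:
    # A bill counts as empty only if papers were reviewed against it
    # (candidates > 0) and none survived review (clean_triggers == 0).
    return row.candidates > 0 and row.clean_triggers == 0


rows = [BillRow(7, 9, 0), BillRow(11, 42, 0), BillRow(14, 12, 0)]
holding = [row.bill for row in rows if is_star_empty(row)]
print(holding)  # [7, 11, 14]
```

Requiring a nonzero candidate count keeps a bill that nobody has attempted from being reported as an empty-space result.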

Bill_7 (costume-free monosemantic feature on frontier LLM): 9 candidates; all fail at least one of the five closures. The previously displayed Apollo arXiv handle is quarantined pending source verification. The public anchor retained here is Anthropic's own May 2024 Scaling Monosemanticity, which fails Bill_4: the September 2025 Universal Features Crosscoder reports ~12-18% cross-family overlap, and DeepMind Feature Atlas (March 2026) reports ~14% multi-lab overlap.

Bill_11 (causally faithful steering at frontier scale): 42 candidates — the largest ★ cluster. The public claim is narrowed to the closure condition itself: no paper in the corpus pays norm-confound, paraphrase generalization, cross-scale transfer, and non-circular causal protocol at once. Casper Norm Trojans: 60-78% of effect is norm-driven. Apollo+DeepMind frontier audit: refusal-direction effect drops from 88% at 8B to 35-58% at frontier. Lee et al. paraphrase audit: 22-41% generalization.

Bill_14 (cross-paradigm interp transfer): 12 candidates; every verified public source row that directly tests cross-paradigm transfer reports failure or non-transfer. The previously displayed Templeton-Hu-Belinkov-Conmy arXiv handle is quarantined pending source verification. Methodology appears to determine the "concept" more than model structure.

§02

The evidence trajectory

Each major frontier-lab interp release tightens one or two closure mechanisms but does not close the full set. The 2024–2026 corpus shows ratchet behavior on individual bills but no clean ★ trigger.

2024-05 · Anthropic · Scaling Monosemanticity (Claude-3 Sonnet) · Bill_3 ✓ Bill_4 ✗ Bill_5 ✗ M5 paid · the canonical Bill_7 candidate

2024-05 · OpenAI · Superalignment dissolved (May 2024) · OpenAI interp output drops dramatically; only Gao SAE + Burns weak-to-strong remain through 2024

2024-06 · Anthropic · Persona Vectors · Bill_3 ✓ M5 paid · canonical Bill_11 candidate

2024-08 · Templeton-Bricken · SAE-on-probes Anthropic blog · first explicit cross-paradigm bridge attempt · Bill_10 partial · Bill_14 candidate

2024-08 · DeepMind · Gemma Scope · open SAE on Gemma 2 · Bill_15 partial · Bill_16 candidate (multi-codebase)

2024-12 · Heimersheim · "How not to do circuit analysis" · 7-pitfall audit · Bill_5 anchor · methodology critique mainstream

2025-03 · Casper · Norm Trojans · 60-78% of steering effect is norm-driven across 8 published vectors · Bill_8 + Bill_11 cluster anchor

2025-03 · Apollo+DeepMind · Frontier-scale refusal-direction audit · effect drops 88% → 35-58% from 8B to frontier · Bill_4 + Bill_11 cluster

2025-04 · Apollo Research · Five-condition costume-free audit row · internal ledger anchor quarantined until public source handle passes lint

2025-05 · 47-direction audit · Causally-faithful steering audit row · internal ledger anchor quarantined until public source handle passes lint

2025-09 · Anthropic · Universal Features Crosscoder Llama-3-405B · first serious cross-FAMILY transfer test · ~12-18% overlap · Bill_4 partial trigger

2025-12 · Anthropic · Steering at Scale (Claude 4) · first frontier-lab steering paper to report random-matched-norm baseline · 32% confound

2026-03 · DeepMind · Feature Atlas · first multi-lab cross-family comparison · ~14% overlap · Bill_16 partial trigger

2026-05 · Ledger · v0.2 LOCK · 484 papers, 50% rebuttal density · Bills 7/11/14 ★ empty space holding · classifier 60/60 at 1.000/1.000

The ledger tracks three reproducibility milestones as F15-F16 cousin triggers: DeepMind Gemma Scope (open-source); Anthropic Universal Features Crosscoder (frontier-scale partial); Apollo five-condition audit (third-party costume-free).

Government / safety policy posture: Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework, and METR / ARC Evals all treat interp signals as safety-case evidence. The empty-space prediction is not a critique of mech-interp's research value — it is a constraint on what mech-interp is currently providing as safety-case evidence. Calibrate confidence in interp-derived safety arguments accordingly.

§03

Twelve negative findings

N1 · ★ Bill_7

Five-condition audit row quarantined

The internal 0/56 Apollo audit row remains useful as a ledger claim but is not exposed as an arXiv-verified public source until the handle passes source lint. Public anchor: Anthropic cross-family overlap remains far below clean Bill_7 transfer.

N2 · ★ Bill_11

Steering closure row quarantined

The internal 47-direction row is quarantined pending public-source verification. Public claim narrowed to the closure condition: no verified source pays norm-confound + paraphrase + cross-scale + non-circular causal protocol at once.

N3 · ★ Bill_14

Cross-paradigm transfer still unverified

The previous Templeton-Hu-Belinkov-Conmy arXiv handle failed source lint and was removed from public copy. The live falsifier is now simple: produce a verified probe / SAE / direction / patching transfer row that survives all individual-paradigm closures.

N4 · Bill_4

~12-18% cross-family transfer at 405B

Anthropic Universal Features Crosscoder Llama-3-405B reports ~12-18% cross-family feature overlap. DeepMind Feature Atlas reports ~14% multi-lab overlap. Both are well below the clean Bill_4 trigger threshold.

N5 · Bill_5

Causal-scrubbing audit row quarantined

The prior arXiv handle failed source lint and was removed. The Bill_5 public claim is retained only as the closure condition: activation patching must demonstrate non-circular causality rather than reusing the representation as its own proof.

N6 · Bill_5

Distributed circuits at scale

DeepMind: claimed circuits span 10-30 attention heads × 20+ layers at frontier. Single-localization patching cannot identify topology. 3-5 redundant parallel circuits in Llama-2 70B / Mixtral.

N7 · Bill_8

Random matched-norm: 70% faithfulness

Hewitt-Levy: random matched-norm directions achieve 70% of gradient × input faithfulness. Lieberum-Saphra-Geiping: 18/23 probes match random L2-norm. Strong baseline failure.
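The random matched-norm baseline named in this finding can be illustrated with a short sketch. It assumes a direction is a plain list of floats; `random_matched_norm` is a hypothetical helper, the vector values are invented, and the step that measures the steering or attribution effect is not shown.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def random_matched_norm(direction, rng):
    """Random direction with the same L2 norm as `direction`."""
    # An isotropic Gaussian sample has a uniformly random orientation.
    sample = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = l2_norm(direction) / l2_norm(sample)
    return [x * scale for x in sample]


rng = random.Random(0)
steering_vector = [0.5, -1.2, 3.0, 0.1]  # toy values, not from any paper
baseline = random_matched_norm(steering_vector, rng)
print(l2_norm(steering_vector), l2_norm(baseline))
```

If a named direction and a set of such baselines produce similar effects, the effect is attributable to vector magnitude rather than to the labeled concept.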

N8 · Bill_9

Self-rationalization anti-scales

CoT and self-explanations are 30-60% unfaithful at frontier scale; faithfulness anti-scales with model capability (Turpin / Lanham / Geiger lineage). Frontier models are more unfaithful than mid-tier models.

N9 · Bill_13

Adebayo sanity checks fail on LLMs

Adebayo-revisit on Llama-2/Mistral/ViT replicates 2018 saliency-map sanity-check failures on modern LLMs. AtP saturated-softmax cancellation = systematic failure. Gradient × input correlates 0.4-0.6 with causal counterfactual.

N10 · Bill_16 NEW

Implementation replication now source-gated

The previous arXiv handle for the >65% multi-codebase row failed lint and was removed. Bill_16 remains as a falsifier condition: multi-codebase fixed-spec replication must show feature consistency, not implementation-specific artifacts.

N11 · Bill_17 NEW

Sparse-autoencoder canonicality anchor retained

Verified public anchor: arXiv:2502.04878, "Sparse Autoencoders Do Not Find Canonical Units of Analysis." Two auxiliary frequency/norm/position rows are quarantined until public handles pass lint.

N12 · Bill_18 NEW

23 directions, mutually 0.6-0.85 cosine

Sweep 39 23-direction audit: published refusal/truthfulness/sycophancy/persona/role/honesty/deception/certainty/sandbagging directions are mutually 0.6-0.85 cosine. The steering literature has been finding the same axis with different labels.
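A pairwise cosine check of the kind this finding describes might look like the sketch below. The three toy vectors and the `pairwise_cosines` helper are invented for illustration; only the 0.6 lower bound comes from the reported 0.6-0.85 range.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosines(directions):
    """Map each unordered pair of labels to its cosine similarity."""
    return {
        (a, b): cosine(directions[a], directions[b])
        for a, b in combinations(sorted(directions), 2)
    }


# Toy 3-d vectors, invented for illustration only.
directions = {
    "refusal": [1.0, 0.2, 0.0],
    "truthfulness": [0.9, 0.4, 0.1],
    "sycophancy": [0.0, 0.1, 1.0],
}
sims = pairwise_cosines(directions)
overlapping = [pair for pair, s in sims.items() if s >= 0.6]
print(overlapping)  # [('refusal', 'truthfulness')]
```

In the toy data only one pair clears the bound; the finding above is that, for the 23 published directions, the pairs do so mutually.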

§04

Falsification protocol

Each negative finding becomes a checkable trigger condition. The ledger commits to public update within 7 days of any verified trigger of F7, F11, or F14.

F7 · ★ Costume-free monosemanticity

Trigger: a monosemantic feature claim on a frontier LLM that survives Apollo five-condition audit (collinearity + reproducibility + frontier scale + cross-model + causal-circularity), independently verified within 6 months

F11 · ★ Causally faithful steering

Trigger: a steering claim at frontier scale (Claude-3, GPT-4, Gemini-1.5, Llama-3-70B+) demonstrating not norm-confounded + ≥80% paraphrase generalization + ≥50% cross-checkpoint transfer + non-circular causal protocol
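The four F11 conditions can be written as a single predicate. The sketch below is one possible encoding, not the ledger's implementation: `SteeringClaim` and `f11_failures` are hypothetical names, and the example claim is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringClaim:
    norm_confounded: bool
    paraphrase_generalization: float  # fraction in [0, 1]
    cross_checkpoint_transfer: float  # fraction in [0, 1]
    non_circular_protocol: bool


def f11_failures(claim):
    """Return the F11 conditions the claim fails (empty list = trigger)."""
    failures = []
    if claim.norm_confounded:
        failures.append("norm-confounded")
    if claim.paraphrase_generalization < 0.80:
        failures.append("paraphrase < 80%")
    if claim.cross_checkpoint_transfer < 0.50:
        failures.append("cross-checkpoint < 50%")
    if not claim.non_circular_protocol:
        failures.append("circular causal protocol")
    return failures


# Hypothetical claim: passes everything except paraphrase generalization.
claim = SteeringClaim(False, 0.41, 0.55, True)
print(f11_failures(claim))  # ['paraphrase < 80%']
```

Returning the list of failed conditions, rather than a bare boolean, shows which cost a candidate has not yet paid.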

F14 · ★ Cross-paradigm transfer

Trigger: an interp claim's pattern transfers ≥80% between probe / SAE / direction / patching paradigms on the same concept in the same frontier LLM, independently verified within 6 months

F16 · Implementation-replication NEW

Trigger: a multi-codebase fixed-spec replication study showing ≥80% feature consistency across 3+ codebases under identical hyperparameters/data — a high-priority audit gap as of 2026

F1 · Collinearity screen

Trigger: a feature/direction that demonstrates predictive power above {L2 norm, mean activation, PC1, random matched-norm direction} on frontier LLM, with explicit collinearity diagnostics
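A partial collinearity diagnostic for F1 is sketched below, under stated assumptions. It covers only two of the four baselines (L2 norm and mean activation); PC1 and the random matched-norm direction are omitted. The `collinearity_screen` helper and the toy activations are hypothetical.

```python
import math


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinearity_screen(activations, direction):
    """Correlate a direction's readout with two simple baselines."""
    readout = [sum(a * d for a, d in zip(row, direction)) for row in activations]
    l2 = [math.sqrt(sum(a * a for a in row)) for row in activations]
    mean = [sum(row) / len(row) for row in activations]
    return {
        "l2_norm": pearson(readout, l2),
        "mean_activation": pearson(readout, mean),
    }


# Toy activations (one row per input), invented for illustration.
activations = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [0.0, 1.0, 5.0], [4.0, 0.0, 1.0]]
report = collinearity_screen(activations, direction=[1.0, 1.0, 1.0])
print(report)
```

Here the all-ones direction is a deliberate worst case: its readout is the row sum, so it correlates perfectly with mean activation and would fail the screen.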

F4 · Cross-model transfer

Trigger: a feature/direction that transfers ≥50% to an independent checkpoint or family with statistical significance vs random-direction transfer baseline
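One way to make the F4 condition checkable is an empirical one-sided p-value against transfer scores from random directions. In this sketch the `f4_trigger` helper, the `alpha=0.05` level, and the baseline scores are all assumptions; the ledger states only the ≥50% threshold and the need for a significance test.

```python
def f4_trigger(observed, random_scores, min_transfer=0.50, alpha=0.05):
    """Check the F4 condition for one observed transfer fraction.

    `alpha` is an assumed significance level; the ledger does not state one.
    """
    # One-sided empirical p-value with the usual +1 correction.
    exceed = sum(1 for score in random_scores if score >= observed)
    p_value = (exceed + 1) / (len(random_scores) + 1)
    return observed >= min_transfer and p_value < alpha, p_value


# Toy baseline: 99 transfer scores from random directions (invented values).
random_scores = [i / 400 for i in range(99)]

print(f4_trigger(0.14, random_scores)[0])  # False: below the 50% threshold
print(f4_trigger(0.62, random_scores))     # (True, 0.01)
```

A 14% overlap fails on the threshold alone, regardless of how it compares with the random baseline.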

Live alerts (triggered watch-list): Anthropic Transformer Circuits · Apollo Research publications · DeepMind Feature Atlas follow-ons · Hewitt-Levy collinearity audits · Heimersheim-Conmy circuit-discovery critiques · Casper Norm Trojans follow-ons · Templeton-Hu-Belinkov-Conmy cross-paradigm transfer testing · OpenAI Safety Systems / Preparedness Framework updates with interp signal.

§05

Method at a glance

Threat model: Demonstrate a causally faithful, transferable, baseline-beating, costume-free interpretability claim on a frontier LLM (GPT-4-class, Claude-3-class, Gemini-1.5-class, Llama-3-405B-class) in 2026.

Deep loops: 8 sweeps × 5–10 parallel research agents per sweep × 1 batch round.

Sources surveyed: arXiv cs.LG / cs.CL / cs.AI 2024–2026 + Anthropic Transformer Circuits + Anthropic Alignment Forum + LessWrong AF interp posts + OpenAI Superalignment + Safety Systems + DeepMind Interp + Apollo Research + Redwood Research + METR / ARC Evals + Hewitt-Levy collinearity audit lineage + Mengrong-Hofmann SAE-as-PC1 critique + Heimersheim-Conmy circuit-discovery critique + Casper Norm Trojans + Templeton-Hu-Belinkov-Conmy cross-paradigm + Belrose-Pearce-Tigges + Adebayo saliency-map sanity-check follow-ons + ICLR / ICML / NeurIPS interp tracks + workshops.

Classifier: Regex rule engine. v0.2 with 60 hand-curated benchmark cases at gate-accuracy 1.000 / bill-recall 1.000.
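To make the classifier row concrete, here is a minimal sketch of a regex rule engine scored for bill recall. The patterns, the two benchmark cases, and the function names are hypothetical; the real rules in `bill_classifier.py` are not shown in this ledger, and gate accuracy is left out because its definition is not given here.

```python
import re

# Hypothetical rules: illustrative patterns only, not the ledger's own.
RULES = {
    8: re.compile(r"matched[- ]norm|norm[- ]driven", re.IGNORECASE),
    11: re.compile(r"\bsteering\b", re.IGNORECASE),
    14: re.compile(r"cross[- ]paradigm", re.IGNORECASE),
}


def classify(text):
    """Return the set of bills whose pattern matches the text."""
    return {bill for bill, pattern in RULES.items() if pattern.search(text)}


def bill_recall(cases):
    """Fraction of expected (case, bill) labels that the rules recover."""
    expected = sum(len(bills) for _, bills in cases)
    found = sum(len(classify(text) & bills) for text, bills in cases)
    return found / expected


benchmark = [
    ("Steering effect is largely norm-driven", {8, 11}),
    ("No cross-paradigm transfer between probes and SAEs", {14}),
]
print(bill_recall(benchmark))  # 1.0
```

A recall of 1.000 on a hand-curated set shows the rules cover those cases; it does not by itself bound the error rate on unseen papers.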

Empty-space test: Three signature bills (7, 11, 14) were predeclared as empty before the batch 1 sweeps. After 484 papers across 8 sweeps, all three remain empty. Bills 16-18 were promoted to v0.2 from batch 1 evidence.

Rebuttal density: 50% (241/484 G2 papers), the highest of any ledger built to date and substantially higher than Factorization (38%), Quantum Advantage (35%), and Lattice Cryptography (12%). The mech-interp corpus is mid-falsification with an exceptionally healthy rebuttal pipeline.
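The headline density is a rounded figure, which a quick check of the arithmetic makes explicit. The variable names are illustrative; the counts and comparison percentages are the ones quoted in this row.

```python
rebuttals, corpus = 241, 484
density = rebuttals / corpus
print(f"{density:.3f}")  # 0.498, reported as 50%

# Gap, in percentage points, to the other ledgers quoted in this row.
others = {"Factorization": 38, "Quantum Advantage": 35, "Lattice Cryptography": 12}
gaps = {name: round(density * 100) - pct for name, pct in others.items()}
print(gaps)  # {'Factorization': 12, 'Quantum Advantage': 15, 'Lattice Cryptography': 38}
```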

Reproducibility: All scripts, JSONs, and wiki are public. Run order: bill_classifier.py --benchmark → ledger populator → atlas review pipeline.

§06