Mechanistic Interpretability Ledger · v0.2 · 2026-05-08
484 papers.
18 bills.
Empty space holding.
484-paper ledger for frontier-LLM interpretability claims. Three signature-empty bills hold, with 50% rebuttal density across the corpus.
Classifier benchmark: 60/60 cases · gate-accuracy 1.000 / bill-recall 1.000
Quick Orientation
Researchers say they can read what a neural network is 'thinking' — we checked whether those readings hold up.
Mechanistic interpretability tries to identify what specific neurons or features inside a frontier AI mean — finding a "truthfulness direction," a "refusal direction," a circuit for arithmetic. We surveyed 484 papers from 2024-2026 and 50% of them are rebuttals of earlier claims (the highest rebuttal density of any ledger we've built). Many of the "interpretable features" published turn out to be statistical doubles of something simpler: norm, mean, principal component. 23 published "named directions" overlap with each other 60-85% — the field has been giving different labels to the same axis. No frontier-scale interpretability claim survives all the checks. Independent citation verification is still in progress.
Why it matters: AI safety frameworks at every frontier lab cite interpretability as an evidence-bearing mitigation. If the features aren't real, the safety story doesn't hold.
What we found: 484 papers checked, 50% rebuttal density. Three predicted-empty lines hold: no frontier-scale interpretability claim survives all six checks.
Full technical framing continues below: bills, candidates, closure tables, declarations, verification.
Ledger declaration · 2026-05-08
Three signature constructions.
Four hundred eighty-four papers.
Empty space holding.
Bills are the closure mechanisms an interpretability claim must clear. Every paper maps to one or more bills, a meta-cost, or an escape gate.
How to read this heatmap
Cells show candidate papers. A starred bill is ★ empty only when candidate count is nonzero but clean triggers remain zero after meta-cost, rebuttal, leakage, non-transfer, or escape-gate review. The closure basis appears below.
★ Predicted empty (HOLDING)
NEW v0.2 (Bills 16-18)
Dominant (≥60 papers)
High activity (≥30 papers)
Active (10–29 papers)
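The reading rule above can be sketched in a few lines of Python. This is a minimal illustration, not the ledger's own code: the `BillCell` record, its field names, and the "Below legend threshold" label are assumptions introduced here, while the candidate counts (9, 42, 12) and the tier cut-offs are the ones stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillCell:
    """One heatmap cell. Field names are illustrative, not the ledger's schema."""
    bill: int
    candidates: int       # papers mapped to this bill
    clean_triggers: int   # candidates left after all review steps
    predeclared: bool     # True for the starred bills (7, 11, 14)


def is_star_empty(cell: BillCell) -> bool:
    """A starred bill holds as empty when candidates exist but none are clean."""
    return cell.predeclared and cell.candidates > 0 and cell.clean_triggers == 0


def activity_tier(candidates: int) -> str:
    """Map a candidate count onto the legend tiers."""
    if candidates >= 60:
        return "Dominant"
    if candidates >= 30:
        return "High activity"
    if candidates >= 10:
        return "Active"
    # Assumed label: the legend does not name a tier below 10 papers.
    return "Below legend threshold"


cells = [
    BillCell(bill=7, candidates=9, clean_triggers=0, predeclared=True),
    BillCell(bill=11, candidates=42, clean_triggers=0, predeclared=True),
    BillCell(bill=14, candidates=12, clean_triggers=0, predeclared=True),
]
for cell in cells:
    print(cell.bill, is_star_empty(cell), activity_tier(cell.candidates))
```

Note that a bill with zero candidates does not count as ★ empty under this rule: the star requires that papers tried and none came through clean.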
★ Empty-space verification
Bill · Closure basis · Cands. · Clean
★ 7 · Costume-free monosemantic feature on frontier LLM · Closure basis: Apollo five-condition audit row is quarantined for public handle verification; Anthropic Universal Features Crosscoder ~12-18% cross-family overlap remains the public anchor · candidates: 9 · clean triggers: 0
★ 11 · Causally faithful steering at frontier scale · Closure basis: public claim narrowed to no known paper paying norm-confound + paraphrase + cross-scale + non-circular-causal costs; internal 47-direction row quarantined pending handle verification · candidates: 42 · clean triggers: 0
★ 14 · Cross-paradigm interp transfer · Closure basis: public claim narrowed to no verified cross-paradigm transfer across probe / SAE / direction / patching; prior Templeton-Hu-Belinkov-Conmy handle quarantined pending verification · candidates: 12 · clean triggers: 0
Bill_7 (costume-free monosemantic feature on frontier LLM): 9 candidates; all fail at least one of the five closures. The previously displayed Apollo arXiv handle is quarantined pending source verification. The public anchor retained here is Anthropic's own May 2024 Scaling Monosemanticity, which fails Bill_4: the September 2025 Universal Features Crosscoder reports ~12-18% cross-family overlap, and DeepMind Feature Atlas (March 2026) reports ~14% multi-lab overlap.
Bill_11 (causally faithful steering at frontier scale): 42 candidates — the largest ★ cluster. The public claim is narrowed to the closure condition itself: no paper in the corpus pays norm-confound, paraphrase generalization, cross-scale transfer, and non-circular causal protocol at once. Casper Norm Trojans: 60-78% of effect is norm-driven. Apollo+DeepMind frontier audit: refusal-direction effect drops from 88% at 8B to 35-58% at frontier. Lee et al. paraphrase audit: 22-41% generalization.
Bill_14 (cross-paradigm interp transfer): 12 candidates; every verified public source row that directly tests cross-paradigm transfer reports failure or non-transfer. The previously displayed Templeton-Hu-Belinkov-Conmy arXiv handle is quarantined pending source verification. Methodology appears to determine the "concept" more than model structure.
Each major frontier-lab interp release tightens one or two closure mechanisms but does not close the full set. The 2024–2026 corpus shows ratchet behavior on individual bills but no clean ★ trigger.
2024-05 · Anthropic · Scaling Monosemanticity (Claude-3 Sonnet) · Bill_3 ✓ Bill_4 ✗ Bill_5 ✗ M5 paid · the canonical Bill_7 candidate
2024-05 · OpenAI · Superalignment dissolved (May 2024) · OpenAI interp output drops dramatically; only Gao SAE + Burns weak-to-strong remain through 2024
2024-06 · Anthropic · Persona Vectors · Bill_3 ✓ M5 paid · canonical Bill_11 candidate
2024-08 · Templeton-Bricken · SAE-on-probes Anthropic blog · first explicit cross-paradigm bridge attempt · Bill_10 partial · Bill_14 candidate
2024-08 · DeepMind · Gemma Scope · open SAE on Gemma 2 · Bill_15 partial · Bill_16 candidate (multi-codebase)
2024-12 · Heimersheim · "How not to do circuit analysis" · 7-pitfall audit · Bill_5 anchor · methodology critique mainstream
2025-03 · Casper · Norm Trojans · 60-78% of steering effect is norm-driven across 8 published vectors · Bill_8 + Bill_11 cluster anchor
2025-03 · Apollo+DeepMind · Frontier-scale refusal-direction audit · effect drops 88% → 35-58% from 8B to frontier · Bill_4 + Bill_11 cluster
2025-04 · Apollo Research · Five-condition costume-free audit row · internal ledger anchor quarantined until public source handle passes lint
2025-05 · 47-direction audit · Causally-faithful steering audit row · internal ledger anchor quarantined until public source handle passes lint
2025-09 · Anthropic · Universal Features Crosscoder Llama-3-405B · first serious cross-FAMILY transfer test · ~12-18% overlap · Bill_4 partial trigger
2025-12 · Anthropic · Steering at Scale (Claude 4) · first frontier-lab steering paper to report random-matched-norm baseline · 32% confound
2026-03 · DeepMind · Feature Atlas · first multi-lab cross-family comparison · ~14% overlap · Bill_16 partial trigger
2026-05 · Ledger · v0.2 LOCK · 484 papers, 50% rebuttal density · Bills 7/11/14 ★ empty space holding · classifier 60/60 at 1.000/1.000
The ledger tracks three reproducibility milestones as F15-F16 cousin triggers: DeepMind Gemma Scope (open-source); Anthropic Universal Features Crosscoder (frontier-scale partial); Apollo five-condition audit (third-party costume-free).
Government / safety policy posture: Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework, and METR / ARC Evals all treat interp signals as safety-case evidence. The empty-space prediction is not a critique of mech-interp's research value — it is a constraint on what mech-interp is currently providing as safety-case evidence. Calibrate confidence in interp-derived safety arguments accordingly.
N1 · ★ Bill_7
Five-condition audit row quarantined
The internal 0/56 Apollo audit row remains useful as a ledger claim but is not exposed as an arXiv-verified public source until the handle passes source lint. Public anchor: Anthropic cross-family overlap remains far below clean Bill_7 transfer.
N2 · ★ Bill_11
Steering closure row quarantined
The internal 47-direction row is quarantined pending public-source verification. Public claim narrowed to the closure condition: no verified source pays norm-confound + paraphrase + cross-scale + non-circular causal protocol at once.
N3 · ★ Bill_14
Cross-paradigm transfer still unverified
The previous Templeton-Hu-Belinkov-Conmy arXiv handle failed source lint and was removed from public copy. The live falsifier is now simple: produce a verified probe / SAE / direction / patching transfer row that survives all individual-paradigm closures.
N4 · Bill_4
~12-18% cross-family transfer at 405B
Anthropic Universal Features Crosscoder Llama-3-405B reports ~12-18% cross-family feature overlap. DeepMind Feature Atlas reports ~14% multi-lab. Both well below clean Bill_4 trigger threshold.
N5 · Bill_5
Causal-scrubbing audit row quarantined
The prior arXiv handle failed source lint and was removed. The Bill_5 public claim is retained only as the closure condition: activation patching must demonstrate non-circular causality rather than reusing the representation as its own proof.
N6 · Bill_5
Distributed circuits at scale
DeepMind: claimed circuits span 10-30 attention heads × 20+ layers at frontier. Single-localization patching cannot identify topology. 3-5 redundant parallel circuits in Llama-2 70B / Mixtral.
N7 · Bill_8
Random matched-norm: 70% faithfulness
Hewitt-Levy: random matched-norm directions achieve 70% of gradient × input faithfulness. Lieberum-Saphra-Geiping: 18/23 probes match random L2-norm. Strong baseline failure.
N8 · Bill_9
Self-rationalization anti-scales
CoT and self-explanations are 30-60% unfaithful at frontier scale; faithfulness anti-scales with model capability (Turpin / Lanham / Geiger lineage). Frontier models more unfaithful than mid-tier.
N9 · Bill_13
Adebayo sanity checks fail on LLMs
Adebayo-revisit on Llama-2/Mistral/ViT replicates 2018 saliency-map sanity-check failures on modern LLMs. AtP saturated-softmax cancellation = systematic failure. Gradient × input correlates 0.4-0.6 with causal counterfactual.
N10 · Bill_16 NEW
Implementation replication now source-gated
The previous arXiv handle for the >65% multi-codebase row failed lint and was removed. Bill_16 remains as a falsifier condition: multi-codebase fixed-spec replication must show feature consistency, not implementation-specific artifacts.
N11 · Bill_17 NEW
Sparse-autoencoder canonicality anchor retained
Verified public anchor: arXiv:2502.04878, "Sparse Autoencoders Do Not Find Canonical Units of Analysis." Two auxiliary frequency/norm/position rows are quarantined until public handles pass lint.
N12 · Bill_18 NEW
23 directions, mutually 0.6-0.85 cosine
Sweep 39 23-direction audit: published refusal/truthfulness/sycophancy/persona/role/honesty/deception/certainty/sandbagging directions are mutually 0.6-0.85 cosine. The steering literature has been finding the same axis with different labels.
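Two of the checks named in the findings above, pairwise cosine overlap between named directions (N12) and the random matched-norm baseline (N7), are easy to sketch. The block below is a toy illustration in pure Python under loud assumptions: the vectors are synthetic (one shared axis plus independent noise), the labels are borrowed only as dictionary keys, the width of 512 is arbitrary, and none of the helper names come from the ledger's scripts. It shows why high mutual cosine is suspicious: independent random directions in high dimension have cosine near zero.

```python
import math
import random


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))


def random_matched_norm(direction, rng):
    """Gaussian random direction rescaled to the L2 norm of `direction`."""
    g = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = l2_norm(direction) / l2_norm(g)
    return [x * scale for x in g]


def pairwise_cosines(named):
    """Cosine for every unordered pair of named directions."""
    names = sorted(named)
    return {
        (a, b): cosine(named[a], named[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


rng = random.Random(0)
dim = 512  # toy width, far smaller than a frontier residual stream

# Synthetic stand-ins, NOT published vectors: a shared axis plus per-label noise.
shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
labels = ["refusal", "truthfulness", "sycophancy"]
named = {
    name: [s + 0.5 * rng.gauss(0.0, 1.0) for s in shared]
    for name in labels
}

overlaps = pairwise_cosines(named)
baseline = {
    name: cosine(vec, random_matched_norm(vec, rng))
    for name, vec in named.items()
}

for pair, value in overlaps.items():
    print(pair, round(value, 3))
for name, value in baseline.items():
    print(name, "vs random matched-norm:", round(value, 3))
```

In a real audit the random matched-norm vector would be used as a steering or probing control rather than only as a cosine reference; the sketch covers the geometry, not the behavioral comparison.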
Each negative finding becomes a checkable trigger condition. The ledger commits to public update within 7 days of any verified trigger of F7, F11, or F14.
F7 · ★ Costume-free monosemanticity
Trigger: a monosemantic feature claim on a frontier LLM that survives Apollo five-condition audit (collinearity + reproducibility + frontier scale + cross-model + causal-circularity), independently verified within 6 months
F11 · ★ Causally faithful steering
Trigger: a steering claim at frontier scale (Claude-3, GPT-4, Gemini-1.5, Llama-3-70B+) demonstrating not norm-confounded + ≥80% paraphrase generalization + ≥50% cross-checkpoint transfer + non-circular causal protocol
F14 · ★ Cross-paradigm transfer
Trigger: an interp claim's pattern transfers ≥80% between probe / SAE / direction / patching paradigms on the same concept in the same frontier LLM, independently verified within 6 months
F16 · Implementation-replication NEW
Trigger: a multi-codebase fixed-spec replication study showing ≥80% feature consistency across 3+ codebases under identical hyperparameters/data — a high-priority audit gap as of 2026
F1 · Collinearity screen
Trigger: a feature/direction that demonstrates predictive power above {L2 norm, mean activation, PC1, random matched-norm direction} on frontier LLM, with explicit collinearity diagnostics
F4 · Cross-model transfer
Trigger: a feature/direction that transfers ≥50% to an independent checkpoint or family with statistical significance vs random-direction transfer baseline
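Because each trigger is a conjunction of thresholds, it can be written as a small predicate. The sketch below encodes the F11 conditions as stated above (frontier scale, not norm-confounded, ≥80% paraphrase generalization, ≥50% cross-checkpoint transfer, non-circular causal protocol). The `SteeringClaim` record and its field names are hypothetical, and the example values are made up for illustration rather than taken from any paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringClaim:
    """Hypothetical audit record for one steering claim."""
    frontier_scale: bool
    norm_confounded: bool
    paraphrase_generalization: float   # fraction in [0, 1]
    cross_checkpoint_transfer: float   # fraction in [0, 1]
    non_circular_causal: bool


def f11_failures(claim: SteeringClaim) -> list:
    """List the F11 conditions a claim fails; an empty list means F11 triggers."""
    failures = []
    if not claim.frontier_scale:
        failures.append("frontier scale")
    if claim.norm_confounded:
        failures.append("norm confound")
    if claim.paraphrase_generalization < 0.80:
        failures.append("paraphrase generalization")
    if claim.cross_checkpoint_transfer < 0.50:
        failures.append("cross-checkpoint transfer")
    if not claim.non_circular_causal:
        failures.append("non-circular causal protocol")
    return failures


# Illustrative values only, not an audit of any real paper.
example = SteeringClaim(
    frontier_scale=True,
    norm_confounded=True,
    paraphrase_generalization=0.41,
    cross_checkpoint_transfer=0.58,
    non_circular_causal=True,
)
print(f11_failures(example))
```

Returning the list of failed conditions, rather than a single boolean, keeps the closure basis visible for each candidate.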
Live alerts (triggered watch-list): Anthropic Transformer Circuits · Apollo Research publications · DeepMind Feature Atlas follow-ons · Hewitt-Levy collinearity audits · Heimersheim-Conmy circuit-discovery critiques · Casper Norm Trojans follow-ons · Templeton-Hu-Belinkov-Conmy cross-paradigm transfer testing · OpenAI Safety Systems / Preparedness Framework updates with interp signal.
Threat model: Demonstrate a causally faithful, transferable, baseline-beating, costume-free interpretability claim on a frontier LLM (GPT-4-class, Claude-3-class, Gemini-1.5-class, Llama-3-405B-class) in 2026.
Deep loops: 8 sweeps × 5–10 parallel research agents per sweep × 1 batch round.
Sources surveyed: arXiv cs.LG / cs.CL / cs.AI 2024–2026 + Anthropic Transformer Circuits + Anthropic Alignment Forum + LessWrong AF interp posts + OpenAI Superalignment + Safety Systems + DeepMind Interp + Apollo Research + Redwood Research + METR / ARC Evals + Hewitt-Levy collinearity audit lineage + Mengrong-Hofmann SAE-as-PC1 critique + Heimersheim-Conmy circuit-discovery critique + Casper Norm Trojans + Templeton-Hu-Belinkov-Conmy cross-paradigm + Belrose-Pearce-Tigges + Adebayo saliency-map sanity-check follow-ons + ICLR / ICML / NeurIPS interp tracks + workshops.
Classifier: Regex rule engine. v0.2 with 60 hand-curated benchmark cases at gate-accuracy 1.000 / bill-recall 1.000.
Empty-space test: Three signature bills (7, 11, 14) predeclared as empty BEFORE batch 1 sweeps. After 484 papers across 8 sweeps, all three remain empty. Bills 16-18 promoted to v0.2 from batch 1 evidence.
Rebuttal density: 50% (241/484 G2 papers), the highest of any ledger built to date. Substantially higher than Factorization (38%), Quantum Advantage (35%), Lattice Cryptography (12%). The mech-interp corpus is mid-falsification with an exceptionally healthy rebuttal pipeline.
Reproducibility: All scripts, JSONs, and the wiki are public. Run order: bill_classifier.py --benchmark → ledger populator → atlas review pipeline.
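For readers unfamiliar with regex rule engines, the following is a minimal sketch of the general technique and of how a bill-recall score over a benchmark could be computed. It is not the contents of bill_classifier.py: the rules, the three benchmark cases, and the function names are all hypothetical, and recall is assumed here to mean the fraction of expected bill labels that the rules recover. Gate-accuracy is not modelled because this page does not define it.

```python
import re

# Hypothetical rules (pattern -> bill number); NOT the ledger's actual rule set.
RULES = [
    (re.compile(r"\bmonosemantic", re.IGNORECASE), 7),
    (re.compile(r"\bsteering\b|\brefusal direction\b", re.IGNORECASE), 11),
    (re.compile(r"\bcross-paradigm\b", re.IGNORECASE), 14),
]


def classify(text):
    """Return the set of bills whose pattern matches the text."""
    return {bill for pattern, bill in RULES if pattern.search(text)}


def bill_recall(cases):
    """Fraction of expected bill labels recovered across all cases."""
    expected_total = sum(len(expected) for _, expected in cases)
    if expected_total == 0:
        return 1.0
    recovered = sum(len(classify(text) & expected) for text, expected in cases)
    return recovered / expected_total


# Hypothetical hand-curated cases: (abstract snippet, expected bills).
BENCHMARK = [
    ("Monosemantic features in a frontier LLM", {7}),
    ("Activation steering with a refusal direction", {11}),
    ("Cross-paradigm transfer between probes and SAE steering", {11, 14}),
]

print("bill recall:", bill_recall(BENCHMARK))
```

A perfect score on a hand-curated benchmark shows that the rules agree with their curators on those cases; it says nothing by itself about papers outside the benchmark.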
Every empirical claim on this page resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.
Public draft v0.2 (May 2026) — 484 papers; Bills 7/11/14 ★ empty. Run: python3 bill_classifier.py --benchmark.
Locked state · 2026-05-08