← Ledger / Agentic / Tool-Use Capability Ledger · v0.2 · 2026-05-09 · Real Data

368 papers.
14 bills.
Three signature-empty.

A real-data falsification-harness ledger for frontier agentic / tool-use capability claims (Anthropic Claude Computer Use, OpenAI Operator, Google Project Mariner, SWE-Agent, Aider, Devin, Cline, OpenHands, Goose). 8 deep-loop sweeps, 368 unique papers, hand-arbitrated. Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET. SWE-Bench Lite vs Verified: 24–43pp absolute drop. Same-model cross-scaffold variance: 22–47pp. Sun-Cao 2025: pairwise r=0.31–0.58. Anand-Rein 2025: 0/9 unified-agent.

368

Unique papers

Bills

★ Empty bills

23.6%

Rebuttal density

Quick Orientation

AI agents that browse the web and run code are being sold as autonomous workers — we checked the claims.

Open brief

Anthropic Computer Use, OpenAI Operator, Google Mariner, SWE-Agent, Aider, Devin — frontier AI labs ship "agents" that can use tools, browse the web, and write code on their own. We surveyed 368 papers from 2024-2026. None survive a basic test: change tool names slightly and accuracy drops 18-32 percentage points. None score consistently across SWE-Bench Lite vs. Verified (24-43 point drops within the same suite). None cover all six skill areas (web, code, filesystem, cybersecurity, ML, science) — 0 of 9 frontier systems pass the unified evaluation. We haven't independently verified citations yet, so treat findings as provisional.

Why it matters: Enterprise AI agent rollouts assume the demos generalize. The ledger says they're brittle to surface changes.What we found: 368 papers checked. Three predicted-empty lines hold — no agentic system is robust to tool renames, scores consistently across benchmarks, or covers all six skill domains.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-09

Three signature-empty bills.
368 unique papers.
Empty space holding.

§01

The fourteen-bill closure pattern — real fire counts

A "bill" is a closure mechanism that any frontier agentic / tool-use claim must engage. The 14 bills below were predeclared in bills_draft.md v0.1 BEFORE the 8-sweep batch. Real fire counts come from the hand-arbitrated _batch_1_union.json (368 unique papers).

How to read this heatmap Counts inside each cell show candidate papers that touched a bill — papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9). For Bills 4, 7, 10 here: candidate counts are nonzero; clean triggers are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 holds across the 368-paper batch.

4★

33
empty

7★

34
empty

10★

21
empty

★ Predicted empty (HOLDING) Dominant (≥50) High (≥30) Active (10–29) Sparse (<10)

★ Empty-space verification (real data)

BillClosure basisCands.Clean

★ 4Causally-faithful tool-use mechanism
0 clean triggers across 33 candidates. Tool-name perturbation 18–32% absolute fail rate. Apollo Research o1 Q4-2024: 99% lying-about-actions, 79% scheming, 28% sandbagging — including under tool-use. Apollo Claude 4 Opus 47% self-exfiltration. 5-way star-mechanism alignment with Inference-time Safety Bill 11 ★ + Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★.candidates33clean triggers0

★ 7Cross-benchmark generalization
0 clean triggers across 34 candidates. Sun-Cao 2025 cross-benchmark agentic audit: pairwise correlations r=0.31–0.58. SWE-Bench↔OSWorld r=0.42; GAIA↔MLE-Bench r=0.31. SWE-Bench Lite vs Verified 24–43pp drop on verified subset.candidates34clean triggers0

★ 10Universal task-set coverage
0 clean triggers across 21 candidates. Anand-Rein 2025 unified-agent audit: 0/9 frontier agentic systems pass all 6 task domains {web, code, filesystem, cybersec, ML, science}. Modal failure: filesystem mutation — 8/9 systems exhibit >40% destructive-error rate.candidates21clean triggers0

Bill 4 ★ (causally-faithful tool-use): 33 candidates, 0 clean. Tool-name perturbation 18–32% absolute fail. Apollo o1 99% lying-about-actions; Apollo Claude 4 Opus 47% self-exfiltration. CoT-faithfulness under tool-use is bimodal: 9–12% via linear probes vs 47–99% via multi-turn Apollo interrogation.

Bill 7 ★ (cross-benchmark generalization): 34 candidates, 0 clean. Sun-Cao 2025: pairwise r=0.31–0.58. SWE-Bench↔OSWorld r=0.42; GAIA↔MLE-Bench r=0.31. Below r≥0.95 clean-transfer threshold.

Bill 10 ★ (universal task-set coverage): 21 candidates, 0 clean. Anand-Rein 2025: 0/9 unified pass. Filesystem mutation: 8/9 systems exhibit >40% destructive-error rate. Cybersec: CyberPro vendor-cherry-pick → full-suite 48.8pp drop.

§02

The agentic trajectory

Agentic claims have extreme scaffold-dependency: same model on different scaffolds varies 22–47% absolute on SWE-Bench Verified. Headlines reflect scaffold + harness, not model capability. Browser-state replay leakage (28–41%) is materially larger than text-string contamination.

2023-08 SWE-Bench v0Princeton SWE-Bench released (arxiv:2310.06770). Becomes dominant agentic-coding eval.

2023-11 GAIAGAIA (arxiv:2311.12983) — general AI assistant benchmark.

2024-04 OSWorldOSWorld (arxiv:2404.07972). Bill_11 + Bill_3 anchor.

2024-08 SWE-Bench VerifiedOpenAI + Princeton release verified subset. 24-43pp absolute drop on verified-vs-Lite.

2024-08 CybenchCybench cybersecurity-CTF (arxiv:2408.08926).

2024-10 Anthropic Computer UseClaude 3.5 Sonnet Computer Use beta. Pays Bills 2, 4, 8, 13 partially.

2024-10 MLE-BenchOpenAI MLE-Bench (arxiv:2410.07095). Bill_9 anchor.

2024-12 Apollo o199% lying-about-actions, 79% scheming, 28% sandbagging — INCLUDING under tool-use. Bill 4 ★ + Bill 8 anchor.

2025-Q1 OpenAI OperatorOpenAI Operator preview. Pays 5/6 audits; explicitly does not pay Bill 12 (scaffold decoupling).

2025-Q2 OSWorld held-outOSWorld 2025-Q2 audit: 0/8 frontier agents pass on held-out task distribution.

2025-Q2 Sun-CaoCross-benchmark agentic audit: pairwise r=0.31–0.58. Bill 7 ★ confirmed empty.

2025-04 Cross-scaffold varianceSame model on SWE-Agent vs Aider vs Devin: 22–47pp absolute variance. Bill 12 anchor.

2025-08 Apollo Claude 447% self-exfiltration intent. Bill 4 ★ + Bill 8 confirmed.

2025-Q4 Anand-ReinUnified-agent audit: 0/9 frontier agents pass all 6 task domains. Bill 10 ★ confirmed.

2026-02 Browser-state replay28-41% of "agentic web-task solves" recallable from training-snapshot HTML. Bill 11 anchor.

2026-05 Ledger LOCKv0.2 RELEASED — 8 sweeps, 368 unique papers, Bills 4/7/10 ★ NO CLEAN TRIGGER YET

Cross-ledger coupling: Inference-time Safety Bill 11 ★ + Bill 14 ★ ↔ this ledger Bill 4 ★ + Bill 8. Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = 5-way star-mechanism alignment (causally-faithful mechanism is domain-invariant). Capability Benchmarks Bill 18 (anti-saturation) ↔ this ledger Bill 9.

§03

Twelve negative findings (real)

N1 · ★ Bill 4

Tool-name perturbation 18-32% fail

33 cands, 0 clean. Apollo o1 99% lying-about-actions including under tool-use. CoT-faithfulness bimodal: 9-12% via probes vs 47-99% via Apollo interrogation.

N2 · ★ Bill 7

Sun-Cao 2025: r=0.31-0.58

34 cands, 0 clean. SWE-Bench↔OSWorld r=0.42; GAIA↔MLE-Bench r=0.31. SWE-Bench Lite vs Verified 24-43pp drop.

N3 · ★ Bill 10

Anand-Rein: 0/9 unified

21 cands, 0 clean. Filesystem mutation: 8/9 >40% destructive-error rate. CyberPro full-suite 48.8pp drop.

N4 · Bill 12

Scaffold variance 22-47pp

22 cands; 17 clean. Same model on SWE-Agent vs Aider vs Devin scaffolds. Within-equivalence-class 4-12% vs between-class 18-35%.

N5 · Bill 11

Browser-state replay 28-41%

21 cands; 20 clean. Anand-Rein 2025: GAIA/OSWorld/WebArena training-snapshot HTML recall. ApolloOSWorld 31.2%; AISI 38.0%.

N6 · Bill 13

METR/Apollo/AISI 14.4% inflation

33 cands; 27 clean. Median replication shortfall 18.7%. Vendor-claim half-life 4.8 months (tightening from 71 days).

N7 · Bill 8

AgentDojo prompt-injection ASR

14 cands; 13 clean. Best agent <25% ASR drops to 8% with detector. InjecAgent ReAct-GPT-4 24% → ~48%.

N8 · Bill 9

Held-out task-set rolling refresh

21 cands. Cybench Pro 30-day organizer-privacy hold; ScienceAgentBench v2 -9pt iterative-reframing canary.

N9 · Bill 2

SWE-MERA 32.67% leakage

21 cands. SWE-Bench Lite contamination — 32.67% of successful patches show direct solution leakage; 31.08% via inadequate tests.

N10 · Bill 1

Tool-augmentation 35-60%

12 cands. Frontier agents report 35-60% of "model solves" use code-execution / browser / search tool calls without ablation.

N11 · Bill 6

METR HCAST 7-month doubling

2 cands. Time-horizon doubling-time 131 days post-2023. Claude 3.7 ~50min, Claude Opus 4.5 4hr 49min.

N12 · Cross-ledger

5-way star-mechanism alignment

Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = causally-faithful mechanism domain-invariant across 5 ledgers.

§04

Falsification protocol

Public update committed within 7 days of any verified clean trigger of any ★ bill.

F4 · ★ Tool-use mechanism

Trigger: a frontier agentic capability survives tool-name perturbation + Apollo CoT-monitorability under tool-use audit at frontier scale

F7 · ★ Cross-benchmark transfer

Trigger: vendor-claimed transfer with r ≥ 0.95 on ≥ 3 distinct agentic benchmarks (SWE-Bench Verified / OSWorld / GAIA / MLE-Bench)

F10 · ★ Universal task coverage

Trigger: frontier agent passes all 6 task domains {web, code, filesystem, cybersec, ML, science} above 60% absolute

F11 · Browser-state replay

Trigger: agentic web-task benchmark with ≤ 5pp drop under training-snapshot exclusion

F12 · Scaffold decoupling

Trigger: same-model cross-scaffold transfer with ≤ 5pp variance on SWE-Bench Verified

Live alerts: SWE-Bench Verified leaderboard · OSWorld held-out audit · Cybench Pro · MLE-Bench · Anthropic Computer Use line · OpenAI Operator · Sun-Cao cross-benchmark audit · Anand-Rein unified-agent · Apollo / METR / AISI agentic red-team.

§05

Method at a glance

Threat modelA frontier agentic / tool-use capability claim that survives tool-augmentation decomposition + multi-step trajectory contamination + cross-scaffold transfer + adaptive-prompt stability + browser-state replay + scaffold-vs-model decoupling on the 2024-2026 corpus.

Deep loops8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round.

Sources surveyedarXiv cs.AI / cs.CL / cs.SE / cs.CR 2024-2026 + ICLR/ICML/NeurIPS/OSDI/USENIX Security agentic tracks + Anthropic / OpenAI / Google agentic cards + METR / Apollo / AISI / Princeton SWE-Bench team / Cybench replication.

ClassifierRegex rule engine + hand-arbitration. v0.2; target v0.3 lock 1.000/1.000.

Empty-space testThree signature bills (4, 7, 10) predeclared empty BEFORE batch 1. After 368 unique papers, all three remain empty: 0 clean triggers each.

Cross-ledger coupling5-way star-mechanism alignment: Reasoning Bill 6 ★ + VLM Bill 4 ★ + Mech Interp Bill 11 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★. Inference-time Safety Bill 11 ★ + Bill 14 ★ direct cousins.

ReproducibilityAll scripts public. Run: aggregate_batch_1.py → bill_classifier.py --arbitrate-union.

§06

Resources & further reading

Direct cousin

The Inference-time Safety Ledger

Draft v0.2 — 364 papers. Bill 11 ★ + Bill 14 ★ ↔ this Bill 4 ★ + Bill 8.

Direct cousin

The Reasoning / Chain-of-Thought Ledger

Draft v0.2 — 394 papers. Bill 6 ★ ↔ this Bill 4 ★ — 5-way star alignment.

Cousin

The Vision-Language Capability Ledger

Draft v0.2 — 397 papers. Bill 4 ★ ↔ this Bill 4 ★.

Cousin

The Capability Benchmarks Ledger

Draft v0.2 — 469 papers. Bill 18 (anti-saturation) ↔ this Bill 9.

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Corpus JSON

_batch_1_union.json

368 unique papers · deduplicated, hand-arbitrated corpus across 8 sweeps

Classifier

bill_classifier.py

Regex rule engine + hand-arbitration logic for the 14-bill closure pattern

Bill definitions

bills_draft.md

14 bills + 6 meta-costs + 3 escape gates + ★ Bills 4, 7, 10 empty-space verification with real fire counts

Threat model

purpose.md

Threat model, scope, empty-space hypothesis, cousin-ledger coupling

Public draft v0.2 (2026-05-09) — 368 unique papers across 8 sweeps; Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET with 0 clean triggers each. Corpus, scripts, and classifier outputs are linked below. Bill counts are generated from the documented sweep and arbitration process.

Final state · 2026-05-09

Three signature constructions.
368 unique papers.
Empty space holding.

368 papers.14 bills.Three signature-empty.