# Agentic / Tool-Use Capability Ledger — Purpose

## Threat model (one paragraph)

A paper triggers this ledger if it demonstrates a frontier agentic / tool-use
capability claim, on a system explicitly marketed for autonomous tool use
(Anthropic Claude Computer Use, OpenAI Operator, Google Project Mariner,
SWE-Agent, Aider, Devin, Cline, OpenHands, Goose), that survives six closure
audits on the 2024–2026 corpus: **(1) tool-augmentation decomposition,
(2) multi-step trajectory contamination, (3) cross-scaffold transfer,
(4) browser-state replay leakage, (5) scaffold-vs-model decoupling,
(6) cross-benchmark generalization (SWE-Bench ↔ OSWorld ↔ GAIA ↔ MLE-Bench).**
A trigger counts as clean only if the claim is also independently verified by
a third party (METR / Apollo / AISI / Princeton SWE-Bench team / Cybench)
within 6 months.
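
The clean-trigger rule above can be sketched as a small predicate. This is a
minimal illustration, not the ledger's actual tooling: the `Claim` record, the
audit identifiers, and the reading of "within 6 months" as 183 days counted
from the publication date are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional, Set

# Hypothetical identifiers for the six closure audits listed above.
AUDITS = (
    "tool_augmentation_decomposition",
    "multi_step_trajectory_contamination",
    "cross_scaffold_transfer",
    "browser_state_replay_leakage",
    "scaffold_vs_model_decoupling",
    "cross_benchmark_generalization",
)

# Assumption: "6 months" is approximated as 183 days after publication.
VERIFICATION_WINDOW = timedelta(days=183)


@dataclass
class Claim:
    """Hypothetical record for one capability claim under review."""
    system: str
    published: date
    audits_passed: Set[str] = field(default_factory=set)
    verified_on: Optional[date] = None  # date of third-party verification


def is_clean_trigger(claim: Claim) -> bool:
    """True only if all six audits pass and verification lands in the window."""
    if not set(AUDITS) <= claim.audits_passed:
        return False
    if claim.verified_on is None:
        return False
    delay = claim.verified_on - claim.published
    return timedelta(0) <= delay <= VERIFICATION_WINDOW
```

Keeping the audits as an explicit tuple makes a partial pass (say, five of
six) fail loudly rather than silently count as a trigger.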

## Empty-space hypothesis (predeclared)

We predict that no 2024–2026 paper cleanly triggers Bills 4, 7, or 10:

- **Bill 4 ★** — Causally-faithful tool-use mechanism. Tool-name perturbation
  audits and Apollo CoT-monitorability under tool-use are predicted to yield
  0 clean triggers. Direct cousin of Inference-time Safety Bill 11 ★,
  Reasoning Bill 6 ★, VLM Bill 4 ★, Mech Interp Bill 11 ★, and Scaling Laws
  Bill 5 ★ (a five-way star-mechanism alignment).
- **Bill 7 ★** — Cross-benchmark generalization (SWE-Bench ↔ OSWorld ↔
  GAIA ↔ MLE-Bench). The Sun-Cao 2025 cross-benchmark agentic audit is
  predicted to find pairwise correlations of r = 0.31–0.58.
- **Bill 10 ★** — Universal task-set coverage (web + code + filesystem +
  cybersec + ML + science). Anand-Rein 2025 is predicted to find that 0/9
  frontier agentic systems pass all 6 task domains.
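
The quantities behind Bills 7 and 10 can be sketched as follows. This is an
illustrative sketch only: the function names, the domain identifiers, and the
input shapes (per-benchmark score maps, per-system sets of passed domains) are
assumptions made here, and no real benchmark results are included.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical identifiers for the six task domains named in Bill 10.
DOMAINS = ("web", "code", "filesystem", "cybersec", "ml", "science")


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient r for two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(vx * vy)


def pairwise_correlations(
    scores: Dict[str, Dict[str, float]],
) -> Dict[Tuple[str, str], float]:
    """Bill 7: r for each benchmark pair, over systems scored on both.

    `scores` maps benchmark name -> {system name -> score}.
    """
    out = {}
    for a, b in combinations(sorted(scores), 2):
        shared = sorted(set(scores[a]) & set(scores[b]))
        out[(a, b)] = pearson(
            [scores[a][s] for s in shared],
            [scores[b][s] for s in shared],
        )
    return out


def systems_with_full_coverage(passed: Dict[str, Set[str]]) -> List[str]:
    """Bill 10: systems whose passed-domain set includes all six domains."""
    return sorted(s for s, doms in passed.items() if set(DOMAINS) <= doms)
```

Under this sketch, the Bill 10 prediction corresponds to
`systems_with_full_coverage` returning an empty list for the nine systems
audited, and the Bill 7 prediction to every value in `pairwise_correlations`
falling between 0.31 and 0.58.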

## Status

Stage 1 (SCOPE) — purpose written.