# Agentic / Tool-Use Capability Ledger — Bills Draft (v0.1)

> **14 bills + 6 meta-costs + 3 escape gates**, ★ at 4, 7, 10.

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | Tool-augmentation decomposition | Raw-model solve separated from tool-augmented solve. | |
| 2 | Multi-step trajectory contamination | Resolved GitHub issues / Stack Overflow / web pages absent from training corpus. | |
| 3 | Cross-scaffold transfer | Same model across ≥2 scaffolds (SWE-Agent / Aider / Devin / Cline / OpenHands / Goose). | |
| 4 | **★ Causally-faithful tool-use mechanism** | Tool-name perturbation 18-32% absolute fail; Apollo CoT-monitorability under tool-use lies 33-47%. | ★ |
| 5 | Adaptive-prompt / tool-naming stability | ≥3 prompt-template / tool-naming variations. | |
| 6 | Trajectory-length / horizon scaling | Multi-step capability decoupled from single-step capability. | |
| 7 | **★ Cross-benchmark generalization** | SWE-Bench↔OSWorld↔GAIA↔MLE-Bench r ≥ 0.95. Sun-Cao 2025: r=0.31-0.58. | ★ |
| 8 | Adversarial / sandbox-escape audit | Apollo / METR / AISI red-team for self-exfiltration / prompt-injection on tool boundaries. | |
| 9 | Held-out task-set construction | OSWorld-Held-Out / SWE-Bench Verified / Cybench Pro rolling refresh. | |
| 10 | **★ Universal task-set coverage** | Web + code + filesystem + cybersec + ML + science. Anand-Rein 2025: 0/9 unified pass. | ★ |
| 11 | Browser-state replay leakage | GAIA/OSWorld/WebArena training-snapshot HTML recall. Anand-Rein 2025: 28-41%. | |
| 12 | Scaffold-vs-model decoupling | Same model on different scaffolds: 22-47% absolute spread on SWE-Bench Verified. | |
| 13 | Vendor-self-eval independence | Reproduced by METR / Apollo / AISI / Princeton SWE / Cybench. | |
| 14 | Capability-cost transparency | Per-task FLOP / inference-token / wall-clock cost. | |
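
Bills 7 and 12 both come down to simple arithmetic over pass rates. The following is a minimal sketch, not a definitive scoring rule. It assumes that `r` in bill 7 is the Pearson correlation of per-model pass rates between each pair of benchmarks, and that the scaffold figure in bill 12 is a max-minus-min gap in percentage points; the ledger states neither reading explicitly. The function names (`pearson_r`, `bill7_paid`, `scaffold_spread_pp`) and all sample numbers are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def bill7_paid(scores_by_benchmark, threshold=0.95):
    """Assumed reading of bill 7: every benchmark pair must reach r >= threshold.

    scores_by_benchmark maps a benchmark name to a list of pass rates,
    one per model, with models in the same order for every benchmark.
    """
    if len(scores_by_benchmark) < 2:
        raise ValueError("need at least two benchmarks to compare")
    return all(
        pearson_r(scores_by_benchmark[a], scores_by_benchmark[b]) >= threshold
        for a, b in combinations(scores_by_benchmark, 2)
    )


def scaffold_spread_pp(pass_rate_by_scaffold):
    """Assumed reading of bill 12: max-minus-min pass rate, in percentage points."""
    rates = list(pass_rate_by_scaffold.values())
    return (max(rates) - min(rates)) * 100


if __name__ == "__main__":
    # Hypothetical pass rates for four models; not taken from any cited source.
    demo_scores = {
        "SWE-Bench": [0.20, 0.35, 0.50, 0.62],
        "OSWorld": [0.10, 0.12, 0.30, 0.28],
        "GAIA": [0.30, 0.45, 0.40, 0.70],
    }
    print("bill 7 paid:", bill7_paid(demo_scores))
    print("scaffold spread (pp):", scaffold_spread_pp({"scaffold_a": 0.30, "scaffold_b": 0.52}))
```

Requiring every pair to clear the threshold, rather than the average pair, is the stricter of the two plausible readings; swap `all` for a mean over pairs if the ledger intends the looser one.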

## Six meta-costs

| # | Meta-cost |
|---|---|
| M1 | Toy-model / pre-frontier (≤30B) |
| M2 | Single-task-only |
| M3 | Single-scaffold-only |
| M4 | Closed-tool-set-only |
| M5 | Compute-budget-conditional |
| M6 | Implementation-specific |

## Iteration plan

Eight sweeps (501-508):
- 501 vendor agentic system cards (Anthropic Computer Use, OpenAI Operator, Google Project Mariner, etc.)
- 502 SWE-Bench / Aider / Devin / Cline / OpenHands / Goose / SWE-Agent scaffolds
- 503 GAIA / OSWorld / WebArena + browser-state replay leakage
- 504 Tool-name perturbation + Apollo CoT-monitorability under tool-use
- 505 Cross-scaffold transfer (Sun-Cao 2025 line)
- 506 MLE-Bench / Cybench Pro / ScienceAgentBench held-out
- 507 Apollo / METR / AISI agentic red-team (sandbox-escape)
- 508 Anand-Rein 2025 unified-agent + cross-benchmark + negative-results
