Agentic / Tool-Use Capability Ledger: Data Receipts
Public Draft v0.2 REAL DATA
Eight parallel Opus research-agent sweeps yielded ~368 raw papers, which were deduplicated and hand-arbitrated to 368 unique papers. Bills 4, 7, and 10 (★) have NO CLEAN TRIGGER YET (0 clean triggers each). Rebuttal density: 23.6%.
Receipts
| Artifact | Link | Purpose |
|---|---|---|
| Bill definitions | bills_draft.md | 14 bills, 6 meta-costs, 3 escape gates, and empty-space verification for ★ bills 4, 7, and 10, with real fire counts |
| Threat model | purpose.md | Threat model, scope, empty-space hypothesis, cousin-ledger coupling |
| Corpus union JSON | _batch_1_union.json | 368 unique papers (deduplicated from ~368 raw across 8 sweeps), with full metadata |
| Classifier | bill_classifier.py | Regex rule engine + hand-arbitration. Run with --arbitrate-union |
| Aggregator | aggregate_batch_1.py | Deduplicates raw sweep JSONs into the corpus union |
| README | README.md | Reproducibility README with run order |
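
The deduplication step attributed to the aggregator can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of aggregate_batch_1.py: the record fields (`title`, `arxiv_id`), the choice of dedup key, and the function names `dedup_key` and `union_sweeps` are all hypothetical, and the real script may key on different metadata.

```python
import re


def dedup_key(paper: dict) -> str:
    """Hypothetical key: prefer a stable ID, else fall back to a normalized title."""
    if paper.get("arxiv_id"):
        return "id:" + paper["arxiv_id"].strip().lower()
    # Lowercase and collapse punctuation/whitespace so trivial variants match.
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return "title:" + title


def union_sweeps(sweeps: list) -> list:
    """Merge several sweeps, keeping the first record seen for each key."""
    seen = {}
    for sweep in sweeps:
        for paper in sweep:
            seen.setdefault(dedup_key(paper), paper)
    return list(seen.values())


# Placeholder records for illustration only (not entries from the corpus).
sweep_a = [{"title": "Example Paper A", "arxiv_id": "0000.00001"}]
sweep_b = [
    {"title": "Example paper A!", "arxiv_id": "0000.00001"},  # same ID as sweep_a
    {"title": "Example Paper B", "arxiv_id": None},           # no ID, keyed by title
]
union = union_sweeps([sweep_a, sweep_b])
print(len(union))
```

Keeping the first record per key is one simple policy; records that collide on a key but disagree on metadata would still need the hand-arbitration pass described for the classifier.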
Real fire counts
| Bill | Candidates | Clean | Rebuttals | Gated |
|---|---|---|---|---|
| 1 — Tool-augmentation decomposition | 12 | 12 | 0 | 0 |
| 2 — Multi-step trajectory contamination | 21 | 20 | 1 | 0 |
| 3 — Cross-scaffold transfer | 22 | 10 | 12 | 0 |
| 4 ★ Causally-faithful tool-use mechanism | 33 | 0 | 4 | 29 |
| 5 — Adaptive-prompt / tool-naming stability | 9 | 9 | 0 | 0 |
| 6 — Trajectory-length / horizon scaling | 2 | 2 | 0 | 0 |
| 7 ★ Cross-benchmark generalization | 34 | 0 | 11 | 23 |
| 8 — Adversarial / sandbox-escape audit | 14 | 13 | 1 | 0 |
| 9 — Held-out task-set construction | 21 | 21 | 0 | 0 |
| 10 ★ Universal task-set coverage | 21 | 0 | 0 | 21 |
| 11 — Browser-state replay leakage | 21 | 20 | 1 | 0 |
| 12 — Scaffold-vs-model decoupling | 22 | 17 | 5 | 0 |
| 13 — Vendor-self-eval independence | 33 | 27 | 6 | 0 |
| 14 — Capability-cost transparency | 3 | 3 | 0 | 0 |
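
As a quick sanity check on the table above, the following sketch copies the rows into Python, verifies that each bill's candidate count equals clean + rebuttals + gated, and lists the bills with zero clean triggers. That row identity is an assumption inferred from the numbers, not a rule stated in this draft, and the helper names are hypothetical. The 23.6% rebuttal density is not recomputed here because its denominator is not given in this draft.

```python
# (bill, candidates, clean, rebuttals, gated), copied from the table above.
ROWS = [
    (1, 12, 12, 0, 0),
    (2, 21, 20, 1, 0),
    (3, 22, 10, 12, 0),
    (4, 33, 0, 4, 29),
    (5, 9, 9, 0, 0),
    (6, 2, 2, 0, 0),
    (7, 34, 0, 11, 23),
    (8, 14, 13, 1, 0),
    (9, 21, 21, 0, 0),
    (10, 21, 0, 0, 21),
    (11, 21, 20, 1, 0),
    (12, 22, 17, 5, 0),
    (13, 33, 27, 6, 0),
    (14, 3, 3, 0, 0),
]


def inconsistent_bills(rows):
    """Bills where candidates != clean + rebuttals + gated (assumed identity)."""
    return [b for b, cands, clean, reb, gated in rows if cands != clean + reb + gated]


def column_totals(rows):
    """Sum each numeric column across all bills."""
    names = ("candidates", "clean", "rebuttals", "gated")
    return {name: sum(r[i + 1] for r in rows) for i, name in enumerate(names)}


def no_clean_trigger(rows):
    """Bills with zero clean triggers."""
    return [b for b, _, clean, _, _ in rows if clean == 0]


print(inconsistent_bills(ROWS))  # []
print(column_totals(ROWS))       # candidates 268, clean 154, rebuttals 41, gated 73
print(no_clean_trigger(ROWS))    # [4, 7, 10]
```

The zero-clean list matches the three ★ bills, and those are also the only bills with any gated candidates.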
Public draft v0.2 (2026-05-09). Sweep JSONs live in the source repo for the ProjectForty2 public evidence bundle, under agentic_tool_use/deep_loops/. Target lock for v0.3: 2026-Q3.