Agentic / Tool-Use Capability Ledger Data Receipts

Public Draft v0.2 REAL DATA

Eight parallel Opus research-agent sweeps yielded ~368 raw papers, deduplicated and hand-arbitrated to 368 unique. Bills 4, 7, and 10 (★) have no clean trigger yet (0 clean triggers each). Rebuttal density: 23.6%.

Receipts

| Artifact | Link | Purpose |
| --- | --- | --- |
| Bill definitions | bills_draft.md | 14 bills + 6 meta-costs + 3 escape gates + ★ 4, 7, 10 empty-space verification with real fire counts |
| Threat model | purpose.md | Threat model, scope, empty-space hypothesis, cousin-ledger coupling |
| Corpus union JSON | _batch_1_union.json | 368 unique papers (deduplicated from ~368 raw across 8 sweeps), with full metadata |
| Classifier | bill_classifier.py | Regex rule engine + hand-arbitration. Run with --arbitrate-union |
| Aggregator | aggregate_batch_1.py | Deduplicates raw sweep JSONs into the corpus union |
| README | README.md | Reproducibility README with run order |
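The aggregator's deduplication step can be pictured with a minimal sketch. This is not the code in aggregate_batch_1.py: the `title` field, the normalization rule, and the toy records below are all assumptions made for illustration, and the real sweep JSON schema may differ.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles collide."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(sweeps):
    """Merge several sweeps (lists of paper dicts) into one list.

    The first record seen for each normalized title is kept.
    Keying on the title alone is an assumption of this sketch.
    """
    seen = {}
    for sweep in sweeps:
        for paper in sweep:
            seen.setdefault(normalize_title(paper["title"]), paper)
    return list(seen.values())


# Toy records, invented for this example (not taken from the corpus).
sweep_a = [{"title": "Tool Use in Agents"}, {"title": "Horizon Scaling"}]
sweep_b = [{"title": "tool-use in agents"}, {"title": "Sandbox Escape Audits"}]

union = deduplicate([sweep_a, sweep_b])
print(len(union), "unique papers")
```

Title-only keys will merge distinct papers that share a title and will miss retitled preprints, which is one reason a hand-arbitration pass is still useful after an automatic merge.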

Real fire counts

| Bill | Cands. | Clean | Rebuttals | Gated |
| --- | --- | --- | --- | --- |
| 1 — Tool-augmentation decomposition | 12 | 12 | 0 | 0 |
| 2 — Multi-step trajectory contamination | 21 | 20 | 1 | 0 |
| 3 — Cross-scaffold transfer | 22 | 10 | 12 | 0 |
| 4 ★ Causally-faithful tool-use mechanism | 33 | 0 | 4 | 29 |
| 5 — Adaptive-prompt / tool-naming stability | 9 | 9 | 0 | 0 |
| 6 — Trajectory-length / horizon scaling | 2 | 2 | 0 | 0 |
| 7 ★ Cross-benchmark generalization | 34 | 0 | 11 | 23 |
| 8 — Adversarial / sandbox-escape audit | 14 | 13 | 1 | 0 |
| 9 — Held-out task-set construction | 21 | 21 | 0 | 0 |
| 10 ★ Universal task-set coverage | 21 | 0 | 0 | 21 |
| 11 — Browser-state replay leakage | 21 | 20 | 1 | 0 |
| 12 — Scaffold-vs-model decoupling | 22 | 17 | 5 | 0 |
| 13 — Vendor-self-eval independence | 33 | 27 | 6 | 0 |
| 14 — Capability-cost transparency | 3 | 3 | 0 | 0 |
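The fire counts above can be sanity-checked with a short script. It is a sketch under one stated assumption: each row is read as four counts (candidates, clean, rebuttals, gated), and candidates are taken to partition into the other three. The source does not state that invariant; it is inferred here because every row satisfies it under this reading.

```python
# (bill, candidates, clean, rebuttals, gated), transcribed from the fire-count rows.
ROWS = [
    (1, 12, 12, 0, 0),
    (2, 21, 20, 1, 0),
    (3, 22, 10, 12, 0),
    (4, 33, 0, 4, 29),
    (5, 9, 9, 0, 0),
    (6, 2, 2, 0, 0),
    (7, 34, 0, 11, 23),
    (8, 14, 13, 1, 0),
    (9, 21, 21, 0, 0),
    (10, 21, 0, 0, 21),
    (11, 21, 20, 1, 0),
    (12, 22, 17, 5, 0),
    (13, 33, 27, 6, 0),
    (14, 3, 3, 0, 0),
]

# Rows where candidates != clean + rebuttals + gated (assumed invariant).
inconsistent = [bill for bill, cands, clean, reb, gated in ROWS
                if cands != clean + reb + gated]

# Bills with zero clean triggers, i.e. the starred "empty-space" bills.
no_clean = [bill for bill, _, clean, _, _ in ROWS if clean == 0]

print("inconsistent rows:", inconsistent)
print("bills with no clean trigger:", no_clean)
```

Under this reading no row is inconsistent, and the bills with no clean trigger are exactly 4, 7, and 10, which matches the ★ markers.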

Public draft v0.2 (2026-05-09). Sweep JSONs live in the source repo (ProjectForty2 public evidence bundle) under agentic_tool_use/deep_loops/. The v0.3 lock is targeted for 2026-Q3.