The Thirteen-Bill Closure Pattern of Frontier Embodied-AI Capability Claims (2024–2026)

A real-data falsification harness with 312-paper empirical verification + a 9/9 hallucination-rate methodological finding.
Kevin Russell · ProjectForty2 · 2026-05-15 · Public draft v0.1.


Abstract

The robotics / embodied-AI ledger audits 312 unique papers across 12 deep-loop sweeps (8 batch-1 + 4 batch-2 verification, 2024–2026) on frontier embodied-AI capability claims (Google RT-2/RT-X, Figure 03/Helix, OpenVLA, Physical Intelligence π0/π0.5, NVIDIA GR00T, DeepMind Gato-2, Stanford Mobile ALOHA, Berkeley Octo, Tesla Optimus/FSD V14, Waymo, Wayve, Apptronik Apollo, 1X). The harness consists of 13 bills, six meta-costs, and three escape gates.

Three bills were predicted ★ empty BEFORE batch-1 sweeps ran: Bill 5 ★ (sim-to-real generalization at ≥80% retained capability), Bill 8 ★ (embodiment cross-platform transfer with ≤30% degradation), Bill 11 ★ (universal task-set coverage across 5 sub-tasks). Bill 4 (causally-faithful grounded-reward mechanism) was deliberately NOT marked ★ as a bridge-falsification test for cross_ledger_bridges Bridge 1's prediction that RL-from-physics-grounded-reward would provide the causally-faithful mechanism that LLM-centric domains lack.

Batch 1 produced a "candidate breach" headline: that ★ Bills 5 and 8 were tripped by clean triggers from quadrupedal-locomotion field deployments, and that Bill 4 fired positive (4 known_bill triggers paying only M3). Two rounds of independent arXiv-ID verification killed all 9 flagged IDs. Round 1 (sweep b2-901): 5/5 Bill 5/8 ★ "breach" IDs hallucinated. Round 2 (cross-ledger, 2026-05-15): 4/4 Bill 4 IDs hallucinated. Robotics batch-1 hallucination rate: 9 of 9 flagged IDs checked. The ★ Bill 5/8/11 empty-space hypothesis HOLDS at strict formulation. The Bill 4 finding is killed; the cross_ledger_bridges Bridge 1 falsification claim is unsupported (not refuted; the evidence base evaporated). Bridge 1 remains untested by the robotics_embodied corpus.

Methodological contribution: this ledger drives the new Stage 3.5 verification rule (mandatory arXiv-ID + abstract verification before any breach commitment), now applied to all subsequent ledgers. The 7-day public-update commitment cannot fire from sweep-agent output alone.

13 Bills · 3 ★ Empty · 312 Unique Papers · 9/9 Hallucinated IDs killed

§01 Threat model + thirteen bills

Threat model (verbatim from purpose.md): Demonstrate a frontier robotic / embodied AI capability claim on a system that markets autonomous physical manipulation or locomotion or driving (Google RT-2/RT-X, Figure 03/Helix, OpenVLA, Physical Intelligence π0/π0.5, 1X RFM-1, NVIDIA GR00T, DeepMind Gato-2, Stanford Mobile ALOHA, Berkeley Octo, Tesla Optimus/FSD V14, Waymo, Wayve, Apptronik Apollo) that survives six closure audits on the 2024-2026 corpus: (1) sim-to-real generalization, (2) demonstration-distribution shift (cherry-picked vs everyday), (3) embodiment cross-platform transfer, (4) long-horizon plan stability, (5) hardware-cost / fleet-cost transparency, (6) real-world deployment beyond demonstration set with held-out novel scenes. A clean trigger requires independent third-party verification (METR / Apollo / AISI / DARPA / NIST robotics-eval) within 6 months.

#	Bill	What gets paid	Star
1	Demo-distribution shift	Cherry-picked vs everyday distribution; novel objects/scenes
2	Sim-to-real perception gap	Camera-format / lighting / friction / contact-dynamics gap measured
3	Hardware-cost transparency	Per-robot capex, per-trial wall-clock, total hardware-hours
4	Causally-faithful grounded-reward	RL-from-physics-grounded-reward, intervention-validated. Bridge test.	✦
5	Sim-to-real generalization	≥80% of demonstrated capability retained on real hardware	★
6	Long-horizon plan stability	≥10-step task chain holds; failure-cascade rate < 30%
7	Strong-baseline classical	VLA beats MPC / WBC / classical-IK at equivalent compute
8	Embodiment cross-platform transfer	Same policy transfers Google → BD → Apptronik → 1X with ≤30% degradation	★
9	Held-out scene generalization	≥30% novel objects, no overlap with training scenes
10	Vendor-self-eval independence	Reproduced by METR / Apollo / DARPA / NIST robotics-eval
11	Universal task-set coverage	All 5 sub-tasks {manipulation, locomotion, navigation, planning, human-interaction} above clean threshold	★
12	Safety / collision audit	Real-world incident rate, near-miss disclosure, safety-bound enforcement
13	Tele-op decomposition	Autonomous vs tele-op contribution explicitly separated
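
The numeric thresholds in the table can be read as simple predicates. The following is a minimal sketch, not the ledger's classifier. It assumes "retained capability" means the ratio of real-hardware success to simulation success, and "degradation" means the relative drop from source to target embodiment; the table defines neither formula, and all function names are hypothetical.

```python
def retained_capability(sim_success: float, real_success: float) -> float:
    """Fraction of simulated success that survives on real hardware (assumed definition)."""
    if sim_success <= 0:
        raise ValueError("sim_success must be positive")
    return real_success / sim_success


def passes_bill_5(sim_success: float, real_success: float) -> bool:
    """Bill 5: at least 80% of demonstrated capability retained on real hardware."""
    return retained_capability(sim_success, real_success) >= 0.80


def passes_bill_8(source_success: float, target_success: float) -> bool:
    """Bill 8: cross-platform transfer with at most 30% degradation (relative drop, assumed)."""
    if source_success <= 0:
        raise ValueError("source_success must be positive")
    degradation = 1.0 - target_success / source_success
    return degradation <= 0.30


def passes_bill_6(chain_length: int, cascade_rate: float) -> bool:
    """Bill 6: a task chain of 10 or more steps holds with failure-cascade rate below 30%."""
    return chain_length >= 10 and cascade_rate < 0.30


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from any audited paper.
    print(passes_bill_5(sim_success=0.90, real_success=0.60))
    print(passes_bill_8(source_success=0.80, target_success=0.60))
```

Passing a threshold is necessary but not sufficient under the threat model: a clean trigger also requires independent third-party verification within 6 months.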

Six meta-costs: M1 (pre-2024 toy regime), M2 (single-task), M3 (single-embodiment), M4 (sim-only-no-real-hardware), M5 (demo cherry-pick), M6 (implementation-specific). Three escape gates: G1 (methodology), G2 (negative-result), G3 (theoretical). M3 (single-embodiment) is the dominant tax at 36% of corpus.
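
A figure such as "M3 at 36% of corpus" can be read as the share of papers tagged with that meta-cost. The sketch below shows that tally on a toy corpus. It assumes a paper may pay several meta-costs at once and that the share is counted per paper; the tagging format and the five toy entries are hypothetical, not ledger data.

```python
from collections import Counter

# Meta-cost labels as listed in the ledger.
META_COSTS = {
    "M1": "pre-2024 toy regime",
    "M2": "single-task",
    "M3": "single-embodiment",
    "M4": "sim-only-no-real-hardware",
    "M5": "demo cherry-pick",
    "M6": "implementation-specific",
}


def meta_cost_shares(papers: list[set[str]]) -> dict[str, float]:
    """Return, for each meta-cost, the fraction of papers that pay it."""
    if not papers:
        raise ValueError("corpus is empty")
    counts = Counter(tag for tags in papers for tag in tags)
    return {m: counts.get(m, 0) / len(papers) for m in META_COSTS}


# Toy corpus (hypothetical): each set holds the meta-costs one paper pays.
toy_corpus = [{"M3"}, {"M3", "M5"}, {"M4"}, set(), {"M2"}]
shares = meta_cost_shares(toy_corpus)
dominant = max(shares, key=shares.get)
print(dominant, f"{shares[dominant]:.0%}")
```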

Predeclared empty-space hypothesis: ★ Bills 5, 8, 11 predicted empty across the 2024-2026 frontier corpus. Bill 4 NOT predicted empty — anticipated to surface positive triggers as the bridge-falsification test for cross_ledger_bridges Bridge 1's unrestricted form.

§02 Findings: breach killed by verification

★ Bill 5 + Bill 8 breach: killed by sweep b2-901 (round 1)

Batch 1 produced a "publishable headline" claim: ★ Bills 5 and 8 each appeared to surface clean triggers from the legged-locomotion sub-domain. Bill 5 ★ had 2 nominally-clean triggers (ANYmal D yearlong field eval at confidence 0.96; Spot 2025 industrial fleet at confidence 0.95). Bill 8 ★ had 3 nominally-clean triggers (Universal Quadruped Walking at 0.95; OXE-Loco at 0.83; 100K-Hour Locomotion at 0.81). All five sweep-agent confidences were above 0.80 — the framework's nominal "publishable" threshold.

Verification round 1 · sweep b2-901 · 2026-05-14

5/5 flagged arXiv IDs verified hallucinated. Each ID resolved via arXiv lookup to an unrelated paper. Strict breach status: KILLED. The empty-space hypothesis for Bills 5 and 8 ★ HOLDS at the strict formulation.

Closest real adjacent papers: URMA arxiv:2409.06366, URMAv2 arxiv:2509.02815, Embodiment Scaling Laws arxiv:2505.05753, GRoQ-LoCO arxiv:2505.10973, MetaLoco arxiv:2407.17502. These trigger only weak forms of Bill 8 with M3 (single-vendor real hardware). Multi-vendor cross-morphology transfer does not exist in the verified corpus.

Bill 4 (bridge test) breach: killed by cross-ledger verification (round 2)

Batch 1 also claimed Bill 4 (causally-faithful grounded reward) fired positively in locomotion, with 4 known_bill triggers paying only M3 (single-embodiment): Falcon arxiv:2501.10459, Dex-Loco arxiv:2505.13245, Soft-Skin Quadruped arxiv:2504.20039, HOVER arxiv:2411.14458. This was the supporting evidence for the cross_ledger_bridges Bridge 1 falsification claim: that RL-from-physics-grounded-reward provides the causally-faithful mechanism that LLM-centric domains lack.

Verification round 2 · cross-ledger · 2026-05-15

4/4 Bill 4 arXiv IDs verified hallucinated. They resolve to: a traffic-prediction GNN (arxiv:2501.10459), DFT vacancy clustering in aluminum (arxiv:2505.13245), an LLM speculative-decoding paper (arxiv:2504.20039), and a geo-distributed LM training systems paper (arxiv:2411.14458) respectively.

Robotics batch-1 hallucination rate: 9 of 9 flagged IDs checked (5 in round 1, 4 in round 2). The Bill 4 finding is killed. The cross_ledger_bridges Bridge 1 falsification claim is now unsupported: not refuted, but its evidence base evaporated. Bridge 1's unrestricted form remains untested by the robotics_embodied corpus.
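
The 9/9 figure is the sum of the two verification rounds. A short tally, using only the counts reported above (the variable names are illustrative):

```python
# (hallucinated, checked) per verification round, as reported in this section.
rounds = {
    "round 1 (sweep b2-901, Bills 5/8)": (5, 5),
    "round 2 (cross-ledger, Bill 4)": (4, 4),
}

hallucinated = sum(h for h, _ in rounds.values())
checked = sum(c for _, c in rounds.values())
rate = hallucinated / checked

print(f"{hallucinated}/{checked} = {rate:.0%}")  # prints "9/9 = 100%"
```

The denominator is the set of IDs flagged as breach evidence, not a sample of the 312-paper corpus, so the rate describes flagged IDs only.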

★ Bill 11 holds without breach

Bill 11 ★ (universal task-set coverage) had 4 known_bill candidates and 16 needs_gate candidates from sweep returns. None covers all 5 sub-tasks {manipulation, locomotion, navigation, multi-step planning, human-interaction} above the clean threshold. A cluster of converging public skepticism reinforces the holding-empty status: Rodney Brooks ("Why Today's Humanoids Won't Learn Dexterity"), Yann LeCun's humanoid skepticism / Robotics Bubble warnings, MKBHD on 1X NEO ("Selling the Dream"), and the 1X NEO Oct-2025 launch revealing "Expert Mode" tele-operation. Tesla's Q1 2026 admission that HW3 cannot do unsupervised driving is a vendor-side falsification of hardware claims.

Locomotion-specific weak forms exist (real)

The verified-real adjacent leaders in legged locomotion (URMA, URMAv2, Embodiment Scaling Laws, GRoQ-LoCO, MetaLoco) trigger weak forms of Bill 8 with M3 (single-vendor real hardware). The bills_draft v0.2 should consider splitting bills 5/8/11 by morphology class:

5a / 5b: sim-to-real generalization in {VLA-manipulation} / {legged locomotion}
8a / 8b: cross-embodiment transfer in {arm-class platforms} / {legged platforms}
11a / 11b: universal task coverage at {robot-fleet level} / {single-platform whole-body level}

Locomotion has matured faster than VLA manipulation and driving. The bills_draft v0.1 was written assuming a morphology-uniformity that doesn't hold. The strict formulation still holds; the morphology-specific weak-form refinement is queued for v0.2.

Wayve + Nissan FY27: live contingent trigger

Contingent trigger · live monthly watch

The strongest live candidate to falsify Bill 8 ★ is the Wayve + Nissan cross-OEM AI driving pilot, currently needs_gate in sweep 806. The Wayve foundation model is being integrated into Nissan ProPILOT for FY27 production launch. If the FY27 production launch ships with the same policy across both OEMs, this would be a clean trigger of Bill 8 at the cross-OEM driving sub-domain. The watchlist tracks Wayve cross-OEM pilot updates monthly. A public update of the ledger is committed within 7 days of a verified trigger.

§03 Methodology lesson: the Stage 3.5 verification rule

The methodological finding of this ledger is the 9/9 source-ID verification failure rate on flagged arXiv IDs. Sweep agents, even at confidence 0.81-0.96, produced hallucinated arXiv IDs that resolved to unrelated papers. Without independent verification, the headline claim ("Bills 5 and 8 ★ breached for locomotion; Bridge 1 falsified") would have shipped to the public ledger and triggered the 7-day public-update commitment on entirely fictional evidence.

The new Stage 3.5 verification rule

Driven by this finding, the methodology now mandates a Stage 3.5 between Stage 3 (SWEEP) and Stage 4 (CLASSIFY):

Independent arXiv-ID verification: every ID flagged by a sweep agent for a ★-bill clean trigger must be independently resolved via arXiv lookup. The actual title + actual authors + actual abstract must match the claimed paper.
Abstract verification: the verified abstract must engage the claimed bill mechanism. (A real paper that doesn't actually engage the bill is not a clean trigger.)
Confidence is not sufficient: sweep-agent self-flagged confidence scores (0.7-0.95) are not reliable. The robotics_embodied breach IDs were at confidence 0.81-0.96.
7-day public-update trigger gate: the 7-day commitment fires only after Stage 3.5 verification, never from sweep output alone.
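
The four rules above can be sketched as a single gate. This is a hedged illustration under stated assumptions, not the ledger's implementation: the record types, the `lookup` callable (which could be backed by an arXiv metadata query), the 0.9 title-similarity threshold, and the keyword test for "engages the bill mechanism" are all hypothetical stand-ins. It compares title and authors, then screens the resolved abstract.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional


@dataclass(frozen=True)
class PaperRecord:
    title: str
    authors: tuple[str, ...]
    abstract: str


@dataclass(frozen=True)
class FlaggedTrigger:
    arxiv_id: str
    bill: int
    confidence: float                  # sweep-agent self-report; never used by the gate
    claimed: PaperRecord               # what the sweep agent says the ID is
    mechanism_terms: tuple[str, ...]   # crude proxy for the bill mechanism (assumption)


# Resolves an arXiv ID to its actual metadata, or None if it does not resolve.
Lookup = Callable[[str], Optional[PaperRecord]]


def _norm(text: str) -> str:
    return " ".join(text.lower().split())


def verify_stage_3_5(trigger: FlaggedTrigger, lookup: Lookup) -> tuple[bool, str]:
    """Apply ID verification, then abstract verification, to one flagged trigger."""
    resolved = lookup(trigger.arxiv_id)
    if resolved is None:
        return False, "ID does not resolve"
    title_ratio = SequenceMatcher(
        None, _norm(trigger.claimed.title), _norm(resolved.title)
    ).ratio()
    if title_ratio < 0.9:
        return False, "title mismatch"
    if {_norm(a) for a in trigger.claimed.authors} != {_norm(a) for a in resolved.authors}:
        return False, "author mismatch"
    abstract = _norm(resolved.abstract)
    if not any(_norm(term) in abstract for term in trigger.mechanism_terms):
        return False, "abstract does not engage bill mechanism"
    return True, "verified"


def may_fire_public_update(trigger: FlaggedTrigger, lookup: Lookup) -> bool:
    """The 7-day commitment fires only on a verified trigger; confidence is ignored."""
    verified, _reason = verify_stage_3_5(trigger, lookup)
    return verified
```

Injecting `lookup` keeps the gate testable offline and makes the independence requirement explicit: the resolver must not be the sweep agent that produced the ID. A keyword match is a weak screen for mechanism engagement and would not replace a human read of the abstract.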

Cross-ledger adoption

The Stage 3.5 rule is now applied to all subsequent ledgers. RL-from-Rewards adopted it from batch 1: its verification subset returned 60% (18/30) hallucinated IDs, lower than robotics_embodied's 100% but still material. The rule underpins the credibility of the entire 13-ledger ProjectForty2 collection.

Process learning: sweep agents hallucinate at material rates; classifier rule-tightening (as RL-from-Rewards v0.2 demonstrated with Bill 10 35→18) catches some false breaches, but classifier rules cannot catch hallucinated source IDs. Independent ID verification is the only reliable gate.

Lock conditions and queue

v0.1 lock conditions: classifier 1.000/1.000 on ≥50 hand-curated cases (in progress); morphology-class-specific bill split (5a/5b/8a/8b/11a/11b) for v0.2; monthly cadence on Wayve+Nissan FY27, Tesla FSD V14 NHTSA reports, RoboArena leaderboard, vendor blog cadence (Figure / 1X / Apptronik / Boston Dynamics). The full preprint, with sweep-level evidence and the v0.2 lock, will be released on arXiv after the Stage 3.5 verification rule has been independently audited against this ledger by ≥1 third party.