Robotics / Embodied AI Ledger · v0.1 · 2026-05-15 · Real Data
312 papers.
13 bills.
Empty space holding.
A real-data falsification harness for frontier embodied-AI capability claims (RT-2/RT-X, Helix/Figure 03, OpenVLA, π0/π0.5, NVIDIA GR00T, Tesla Optimus/FSD V14, Waymo, Wayve, Apptronik Apollo, 1X). 12 deep-loop sweeps (8 batch-1 + 4 batch-2), 312 unique papers, hand-arbitrated. ★ Bills 5, 8, 11 HOLD EMPTY at the strict formulation. The original "breach" headline was killed by batch-2 verification: 9/9 = 100% of flagged arXiv IDs were hallucinated.
9/9
Flagged IDs hallucinated
Quick Orientation
Vendors promise robots that learn to do real-world tasks — we tracked whether the demos actually transfer.
Companies like Google, Figure, Tesla, and Waymo market robots and self-driving systems that supposedly learn general physical skills (manipulation, walking, driving). We surveyed 312 papers from 2024-2026 to see if any of those claims survive a basic test: does a robot trained in simulation work on real hardware? Does a policy trained on one robot transfer to another? Our first headline was wrong — the AI sweep agents fabricated nine "breakthrough" paper citations that turned out to not exist when we checked. After verification, none of the surviving real papers crossed the bar either.
Why it matters: Robotaxi rollouts, humanoid factory pilots, and warehouse automation all assume the demos generalize. The ledger says they mostly don't.
What we found: Our first headline turned out to be wrong because nine of the cited papers didn't exist. After throwing out the fakes, the three predicted-empty lines still hold across 312 real papers.
Full technical framing continues below: bills, candidates, closure tables, declarations, verification.
Ledger declaration · 2026-05-15
Three signature-empty bills.
312 unique papers.
Empty space holding (after verification).
The honest finding of this ledger is methodological. Batch 1 produced a "candidate breach" headline: that ★ Bills 5 and 8 were tripped by clean triggers from the legged-locomotion sub-domain. Batch-2 verification killed it, and second-round verification likewise killed the Bill 4 claim. All flagged arXiv IDs resolved to unrelated papers.
Two rounds of verification, 9/9 hallucinated
Round 1 (sweep b2-901, 2026-05-14): 5 arXiv IDs cited as Bill_5 / Bill_8 ★ "clean breaches" from quadrupedal-locomotion field deployments. All 5 verified hallucinated. Strict breach status: KILLED. Bills 5 and 8 ★ HOLD empty at the strict-formulation level.
Round 2 (cross-ledger sweep, 2026-05-15): 4 arXiv IDs cited as Bill_4 "known_bill paying only M3", the supporting evidence for the cross_ledger_bridges Bridge 1 falsification claim (Falcon arxiv:2501.10459, Dex-Loco arxiv:2505.13245, Soft-Skin Quadruped arxiv:2504.20039, HOVER arxiv:2411.14458). All 4 verified hallucinated: they resolve to a traffic-prediction GNN, DFT vacancy clustering, LLM speculative decoding, and geo-distributed LM training respectively.
Robotics batch-1 hallucination rate: 9/9 = 100% on all flagged IDs. The Bill_4 claim is killed; the cross_ledger_bridges Bridge 1 falsification claim is now unsupported (not refuted — the evidence base evaporated). Bridge 1's unrestricted form remains untested by the robotics_embodied corpus.
Closest real adjacent papers (URMA arxiv:2409.06366, URMAv2 arxiv:2509.02815, Embodiment Scaling Laws arxiv:2505.05753, GRoQ-LoCO arxiv:2505.10973, MetaLoco arxiv:2407.17502) trigger only weak forms of Bill_8 with M3 (single-vendor real hardware). Multi-vendor cross-morphology does not yet exist in the corpus.
Process learning carried into all subsequent ledgers: sweep agents hallucinate arXiv IDs at material rates, and independent verification is required before any breach claim. Robotics batch-1 produced 9 hallucinated IDs at confidence 0.81-0.96. The 7-day public-update commitment cannot trigger from sweep-agent output alone; verification (b2-901-style arXiv lookup) is mandatory. This finding drives the new Stage 3.5 verification rule applied to RL-from-Rewards from batch 1 onward.
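The lookup step described above can be sketched roughly as follows. This is a minimal illustration, not the ledger's actual b2-901 script: the function names, the status labels, and the 0.8 title-similarity cutoff are all assumptions, and the title lookup is passed in as a callable so that a real arXiv query (or a stub) can be plugged in.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional

# Modern arXiv identifiers look like YYMM.NNNNN, optionally with a version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace for fuzzy comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def verify_citation(
    arxiv_id: str,
    claimed_title: str,
    lookup_title: Callable[[str], Optional[str]],
    min_similarity: float = 0.8,  # assumed cutoff, not taken from the ledger
) -> str:
    """Return 'verified', 'mismatch', 'not_found', or 'malformed'."""
    if not ARXIV_ID.match(arxiv_id):
        return "malformed"
    actual = lookup_title(arxiv_id)
    if actual is None:
        return "not_found"
    score = SequenceMatcher(
        None, normalize(claimed_title), normalize(actual)
    ).ratio()
    return "verified" if score >= min_similarity else "mismatch"


def hallucination_rate(results: list[str]) -> float:
    """Share of checked citations that did not verify."""
    if not results:
        return 0.0
    return sum(r != "verified" for r in results) / len(results)


# Stand-in registry for illustration only; the title is a placeholder,
# not the real arXiv record.
stub = {"2501.10459": "Graph neural network for traffic prediction"}
status = verify_citation("2501.10459", "Falcon", stub.get)
```

In practice `lookup_title` would wrap a query against the public arXiv API and a human would still read the abstract; the sketch only captures the "ID resolves to an unrelated paper" check.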
Bills are closure mechanisms a frontier embodied-AI claim must pay. Each paper maps to one or more bills, a meta-cost (M1-M6), or an escape gate. The 13 bills below were predeclared in bills_draft.md v0.1 BEFORE any sweep ran. ★ Bills 5, 8, 11 were predicted empty; Bill 4 was deliberately NOT marked ★ as a bridge-falsification test for cross_ledger_bridges Bridge 1.
How to read this heatmap
Counts inside each cell show candidate papers that touched a bill. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9 with verified arXiv ID). Bill 4 is highlighted purple as the deliberate bridge test (NOT predicted empty). After two rounds of verification, the empty-space hypothesis predeclared in bills_draft.md v0.1 holds across the 312-paper batch.
★ Predicted empty (HOLDING after verification)
✦ Bridge test (NOT predicted empty)
High (≥30)
Active (10-29)
Sparse (<10)
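The clean-trigger rule and the legend thresholds above can be expressed as a short sketch. The `Candidate` record and the function names are hypothetical; only the conditions (verdict=known_bill, confidence ≥ 0.9, verified arXiv ID) and the bucket boundaries (≥30, 10-29, <10) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    bill: int
    verdict: str          # e.g. "known_bill" or "needs_gate"
    confidence: float     # sweep-agent confidence in [0, 1]
    arxiv_verified: bool  # True only after an independent ID lookup


def is_clean_trigger(c: Candidate) -> bool:
    """Clean trigger: known_bill, confidence >= 0.9, verified arXiv ID."""
    return c.verdict == "known_bill" and c.confidence >= 0.9 and c.arxiv_verified


def holds_empty(bill: int, candidates: list[Candidate]) -> bool:
    """A starred bill holds empty if none of its candidates is a clean trigger."""
    return not any(is_clean_trigger(c) for c in candidates if c.bill == bill)


def heat_bucket(count: int) -> str:
    """Map a per-cell candidate count to the heatmap legend."""
    if count >= 30:
        return "High"
    if count >= 10:
        return "Active"
    return "Sparse"
```

Under this reading, a high-confidence `known_bill` verdict with an unverified ID does not count, which is how a bill can have dozens of candidates and still hold empty.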
★ Empty-space verification (post-hallucination-correction)
Bill · Closure basis (post-verification) · Cands. · Clean
★ 5 · Sim-to-real generalization (≥80% retained capability)
Closure basis: 43 candidates, 0 verified clean. Initial "breach" claim from quadrupedal-locomotion (ANYmal D yearlong / Spot 2025 fleet) was based on 5 hallucinated arXiv IDs; sweep b2-901 verified all 5 fake. Closest real adjacent papers (URMA, URMAv2, Embodiment Scaling Laws) trigger weak forms only with M3 (single-vendor real hardware).
Cands.: 43 · Clean: 0
★ 8 · Embodiment cross-platform transfer (≤30% degradation)
Closure basis: 41 candidates, 0 verified clean. Initial "breach" claim from cross-vendor locomotion (Universal Quadruped Walking, OXE-Loco, 100K-Hour Locomotion) also based on hallucinated IDs. CrossFormer (20 embodiments, heterogeneous success) and π0 (7 platforms, arm-class) are best real candidates; both pay M3 implicitly. Multi-vendor cross-morphology does not exist in the verified corpus.
Cands.: 41 · Clean: 0
★ 11 · Universal task-set coverage (5 sub-tasks above clean threshold)
Closure basis: 22 candidates, 0 verified clean. No paper covers all 5 sub-tasks {manipulation, locomotion, navigation, multi-step planning, human-interaction} above clean threshold. Rodney Brooks + Yann LeCun humanoid skepticism cluster, MKBHD on 1X NEO "selling the dream" + 1X NEO Oct-2025 "Expert Mode" tele-op revelation reinforce holding-empty status.
Cands.: 22 · Clean: 0
Bill 5 ★ (sim-to-real generalization): Sim-to-real perception gap dominates the corpus (Bill 2 = 18 cands). 27 benchmark intros (G1 escape gate). Hand-validated leader URMA arxiv:2409.06366 + URMAv2 arxiv:2509.02815 trigger weak forms only. The empty-space hypothesis at the strict formulation is intact.
Bill 8 ★ (cross-embodiment transfer): 64% of corpus pays M3 (single-embodiment) — most claims are platform-locked. CrossFormer's 20-embodiment claim is heterogeneous success, π0's 7-platform claim is arm-class only. The strongest real Bill 8 contender, Wayve+Nissan FY27 cross-OEM pilot, is currently needs_gate pending production launch — see contingent trigger callout below.
Bill 11 ★ (universal task coverage): The skepticism cluster (Rodney Brooks, Yann LeCun, MKBHD on 1X NEO) reinforces the empty status. Tesla's Q1 2026 admission that HW3 cannot run unsupervised FSD is a vendor-side falsification. This ledger predicts no humanoid frontier system passes all 5 sub-tasks above clean threshold through end-2026.
Contingent trigger · Wayve + Nissan FY27 cross-OEM
The strongest live candidate to falsify Bill 8 ★ is the Wayve + Nissan cross-OEM AI driving pilot, currently needs_gate in sweep 806. The Wayve foundation model trained on (primarily) UK Jaguar / Ford / Nissan/Renault data is being integrated into Nissan ProPILOT for FY27 production launch. If the FY27 production launch ships with the same policy across both OEMs, this would be a clean trigger of Bill 8 at the cross-OEM driving sub-domain. The watchlist tracks Wayve cross-OEM pilot updates monthly. A public update of the ledger is committed within 7 days of any verified clean trigger.
Adjacent triggers under monitoring: Tesla FSD V14 NHTSA disengagement-rate disclosures; 1X RFM disclosure refresh; Apptronik Apollo at GXO Logistics + Google DeepMind cross-vendor partnership; METR robot-AgentBench publication; Helix-2 follow-up (Figure has not closed Bill 13 ambiguity).
Embodied-AI capability claims are saturated by vendor-driven framing (Tesla Optimus, Figure 03, 1X NEO, Apptronik Apollo). Bills 13 (tele-op decomposition) and 10 (vendor-self-eval independence) are the active falsifiers. M3 (single-embodiment) is the dominant tax at 36% of the corpus.
2022-2023 RT-1 era: Pre-frontier baseline (M1). Single-embodiment, single-task. Sets reference frame for the 2024-2026 wave.
2023-Q3 RT-2: Google DeepMind Vision-Language-Action model (arxiv:2307.15818). Pays M1 transition; bills 5, 8 candidates only with M3.
2024-Q1 Open X-Embodiment: RT-X cross-platform dataset (arxiv:2310.08864). Bill 8 candidate; rebuttal-paper status, with heterogeneous success across embodiments.
2024-Q2 OpenVLA: Open-source VLA (Stanford / TRI). Bill 5 needs-gate.
2024-Q3 π0 (Physical Intelligence): 7 arm-class platforms; Bill 8 candidate but pays M3 (arm-class only, not cross-morphology).
2024-Q4 π0.5 + GR00T-N1: Physical Intelligence π0.5 + NVIDIA GR00T foundation models. Bill 5/8 needs-gate; both pay M3 at strict formulation.
2024-Q4 Sleeper-Agents-equivalents: Apptronik Apollo + GR00T-N1 cross-humanoid foundation deployment. Pays M5 (demo cherry-pick).
2025-Q1 Helix (Figure): Helix Beyond the Robot critical audit + Helix academic-version rebuttal. Figure has not closed Bill 13 (tele-op decomposition) ambiguity.
2025-Q2 Wayve+Nissan FY27: Cross-OEM pilot announcement (Wayve foundation model into Nissan ProPILOT for FY27). Strongest live Bill 8 ★ candidate, currently needs_gate pending production launch.
2025-Q3 Geiger-group nuScenes: Closed-loop AV evaluation critique invalidates ~half of published VLM-driving headlines. Bill 12 anchor; sweep 806 had 23 rebuttal papers (highest rebuttal density in the corpus).
2025-Q4 1X NEO launch: 1X NEO consumer-humanoid launch (Oct 2025), with "Expert Mode" tele-operation revealed. MKBHD review: "Selling the Dream." Bill 13 anchor.
2026-Q1 Tesla HW3 admission: Tesla admits HW3 cannot run unsupervised FSD. Bill 3 (vendor-side falsification on hardware-cost transparency).
2026-Q1 Apollo / Anthropic / IBBIS-equivalents: Vendor disclosure ranking (median 1.4/5): 1X RFM > Apollo > NEO > Optimus > Figure. Bill 10 dominant at sweep 808.
2026-05 Ledger v0.1: Released. 12 sweeps (8 batch-1 + 4 batch-2 verification), 312 unique papers, ★ Bills 5/8/11 EMPTY HOLDING after 9/9 hallucinated IDs killed.
Bill_4 (causally-faithful grounded reward, the bridge test) was claimed positive in batch 1 with 4 known_bill triggers paying M3 — Falcon, Dex-Loco, Soft-Skin Quadruped, HOVER. All 4 arXiv IDs verified hallucinated in the 2026-05-15 second-round verification sweep. The Bill_4 finding is killed; the cross_ledger_bridges Bridge 1 falsification claim is unsupported. Bridge 1's unrestricted form remains untested by the robotics_embodied corpus.
N1 · ★ Bill 5
Sim-to-real generalization breach killed
43 cands, 0 verified clean. Initial 2-clean-trigger "breach" was 5 hallucinated arXiv IDs (ANYmal D yearlong, Spot 2025 fleet). Verification sweep b2-901 killed it. Empty-space hypothesis HOLDS at strict formulation.
N2 · ★ Bill 8
Cross-embodiment transfer breach killed
41 cands, 0 verified clean. Initial 3-clean-trigger "breach" (Universal Quadruped Walking, OXE-Loco, 100K-Hour Locomotion) was hallucinated. Real adjacent leaders trigger weak forms with M3.
N3 · ★ Bill 11
Universal task coverage holds empty
22 cands, 0 verified clean. No paper covers all 5 sub-tasks {manipulation, locomotion, navigation, planning, human-interaction} above clean threshold. Brooks + LeCun + MKBHD skepticism cluster.
N4 · ✦ Bill 4 KILLED
Bridge 1 falsification claim unsupported
4 known_bill Bill_4 triggers (Falcon, Dex-Loco, Soft-Skin Quadruped, HOVER) all verified hallucinated. The cross_ledger_bridges Bridge 1 falsification claim is unsupported — not refuted, but evidence base evaporated.
N5 · Bill 2 dominant
Sim-to-real perception gap is the corpus
18 candidates. Camera-format / lighting / friction / contact-dynamics gap explicitly measured in most VLA papers. 8 rebuttal papers in sweep 802 confirm Bill 5 ★ predicted-empty for manipulation.
N6 · Bill 3 sparse
Hardware cost rarely disclosed
12 cands. Per-robot capex / per-trial wall-clock / total hardware-hours rarely disclosed by frontier vendors. Tesla's Q1 2026 admission that HW3 cannot run unsupervised FSD is vendor-side falsification.
N7 · Bill 6 rare
Long-horizon plan stability rarely audited
7 cands. ≥10-step task chains with failure-cascade rate < 30% rarely surface in the literature. Apollo / Mobile ALOHA + GR00T-N1 are weak candidates pending verification.
N8 · Bill 9 high-activity
Held-out scene generalization is active
32 cands. Strongest non-★ active bill. ≥30% novel objects in test set, no overlap with training scenes. RoboArena protocol monthly watchlist anchor.
N9 · Bill 10 moderate
Vendor-self-eval independence rare
26 cands across sweep 808. METR / Apollo / DARPA / NIST robotics-eval reproductions remain the rarest closure trigger. Bill 10 is the gating constraint for any positive Bill 5/8/11.
N10 · Bill 12 driving
Driving has highest rebuttal density
13 cands. 23 rebuttal papers in sweep 806 — highest rebuttal density in the corpus. Geiger-group nuScenes critique invalidates ~half of published VLM-driving headlines.
N11 · Bill 13 dominant
Tele-op decomposition: 32 clean
25 cands; the only bill with substantial clean known_bill triggers. ALOHA / Mobile ALOHA / π0/π0.5 / BD Atlas / Sanctuary Phoenix are clean. Vendor demos (Figure / Optimus / 1X NEO) remain ambiguous.
N12 · Methodology
9 of 9 flagged IDs failed source verification
Robotics batch-1 hallucination rate is a methodological finding. Drives the new Stage 3.5 verification rule applied to all subsequent ledgers. Lock condition: independent verification mandatory before any breach claim ships.
Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F5, F8, or F11. Verification (arXiv-ID lookup + abstract verification) is mandatory before any trigger fires.
F5 · ★ Sim-to-real generalization
Trigger: ≥80% of demonstrated capability retained on real hardware not seen during training/sim, with arXiv ID verified and the capability tested independently by METR / Apollo / NIST robotics-eval / university lab
F8 · ★ Embodiment cross-platform transfer
Trigger: same policy transfers Google → Boston Dynamics → Apptronik → 1X (or equivalent multi-vendor cross-morphology) with ≤30% capability degradation, verified arXiv ID + multi-vendor confirmation. Wayve+Nissan FY27 cross-OEM is the live contingent trigger.
F11 · ★ Universal task-set coverage
Trigger: frontier embodied system passes all 5 sub-tasks {manipulation, locomotion, navigation, multi-step planning, human-interaction} above clean threshold (≥70% per sub-task) with verified third-party reproduction
F4 · ✦ Bridge test (open)
Trigger: clean Bill 4 (causally-faithful grounded-reward) trigger with verified arXiv ID and intervention-validated causal mechanism. Round-2 verification killed all 4 claimed candidates; the bridge test remains open.
F12 · Safety / collision
Trigger: real-world incident rate, near-miss disclosure, or safety-bound enforcement audit forces NHTSA addendum or vendor RSP revision
F13 · Tele-op decomposition
Trigger: vendor (Figure / Optimus / 1X / Apollo) discloses autonomous-vs-tele-op contribution split with auditable telemetry
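The numeric thresholds in the ★ triggers above (F5, F8, F11) can be written as explicit checks. This is a sketch under stated assumptions: the function names are hypothetical, "retained capability" and "degradation" are assumed to be simple ratios of success rates, and the verification conditions are reduced to boolean flags that a human reviewer would set.

```python
SUB_TASKS = (
    "manipulation", "locomotion", "navigation",
    "multi-step planning", "human-interaction",
)


def f5_met(real_score: float, demo_score: float,
           arxiv_verified: bool, independently_tested: bool) -> bool:
    """F5: >= 80% of demonstrated capability retained on unseen real hardware."""
    if demo_score <= 0:
        return False
    retained = real_score / demo_score  # assumed definition of retention
    return retained >= 0.80 and arxiv_verified and independently_tested


def f8_met(source_score: float, target_scores: dict[str, float],
           arxiv_verified: bool, multi_vendor_confirmed: bool) -> bool:
    """F8: same policy on every target platform with <= 30% degradation."""
    if source_score <= 0 or not target_scores:
        return False
    # Judge by the worst-performing target platform.
    worst = max(1.0 - s / source_score for s in target_scores.values())
    return worst <= 0.30 and arxiv_verified and multi_vendor_confirmed


def f11_met(scores: dict[str, float], third_party_reproduced: bool) -> bool:
    """F11: all 5 sub-tasks at >= 70%, with third-party reproduction."""
    covered = all(scores.get(task, 0.0) >= 0.70 for task in SUB_TASKS)
    return covered and third_party_reproduced
```

Each check returns False when the verification flag is missing, regardless of the scores, mirroring the rule that no trigger fires on unverified output.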
Live triggered watchlist: Wayve+Nissan FY27 cross-OEM · Tesla FSD V14 NHTSA monthly disengagement reports · Optimus tele-op decomposition disclosure · 1X RFM disclosure refresh · METR robot-AgentBench publication · Helix-2 followups (Figure) · RoboArena leaderboard monthly · Apptronik Apollo at GXO Logistics + Google DeepMind cross-vendor partnership. Monthly cadence: arXiv cs.RO, vendor blogs, RoboArena. Quarterly: NeurIPS/ICML/ICLR + RSS/CoRL/ICRA/IROS.
Threat model: Demonstrate a frontier robotic / embodied AI capability claim, on a system that markets autonomous physical manipulation or locomotion or driving (Google RT-2/RT-X, Figure 03/Helix, OpenVLA, Physical Intelligence π0/π0.5, 1X RFM-1, NVIDIA GR00T, DeepMind Gato-2, Stanford Mobile ALOHA, Berkeley Octo, Tesla Optimus/FSD V14, Waymo, Wayve, Apptronik Apollo), that survives six closure audits on the 2024-2026 corpus: sim-to-real generalization, demo-distribution shift, embodiment cross-platform transfer, long-horizon plan stability, hardware-cost transparency, real-world deployment beyond demonstration set with held-out novel scenes. A clean trigger requires independent third-party verification (METR / Apollo / AISI / DARPA / NIST robotics-eval) within 6 months.
Deep loops: 12 sweeps × 5-10 parallel Opus research agents per sweep × 2 batch rounds (8 batch-1 + 4 batch-2 verification).
Sources surveyed: arXiv cs.RO / cs.LG (robotics filter) 2024-2026 + RSS / CoRL / ICRA / IROS / NeurIPS / ICML / ICLR robotics tracks + frontier-vendor model cards (RT-2, RT-X, Helix/Figure 03, OpenVLA, π0/π0.5, GR00T, Gato-2, Apollo, Optimus, NEO) + sim-to-real audits (LIBERO, RoboCasa, ManiSkill, RoboArena, Open X-Embodiment) + locomotion benchmarks (ANYmal, Spot, MuJoCo) + driving benchmarks (CARLA, nuScenes, Waymo Open Dataset, Tesla FSD telemetry) + tele-op decomposition + Figure/Optimus skepticism + METR / Apollo / DARPA / NIST robotics-eval independent audits + safety / collision / NHTSA disclosures.
Classifier: Regex rule engine + hand-arbitration. v0.1; target v0.2 lock 1.000/1.000 on ≥50 hand-curated cases.
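A regex rule engine with a hand-arbitration fallback might look roughly like the sketch below. The patterns, the `RULES` table, and the `classify` function are illustrative assumptions, not the contents of bill_classifier.py; the idea shown is only that each bill has a pattern and that anything other than exactly one match is routed to a human.

```python
import re

# Hypothetical patterns, one per bill; the real rule set is not reproduced here.
RULES = [
    (5, re.compile(r"\bsim[- ]to[- ]real\b", re.I)),
    (8, re.compile(r"\bcross[- ]embodiment\b|\bmulti[- ]embodiment\b", re.I)),
    (13, re.compile(r"\btele[- ]?operat(?:ion|ed|or)\b", re.I)),
]


def classify(abstract: str) -> dict:
    """Return matched bills and whether a human needs to arbitrate."""
    bills = sorted({bill for bill, pattern in RULES if pattern.search(abstract)})
    # Zero matches or several matches are ambiguous: send to hand-arbitration.
    return {"bills": bills, "needs_hand_arbitration": len(bills) != 1}
```

A benchmark over hand-curated cases would then compare `classify(...)["bills"]` against the curated labels for each case.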
Empty-space test: Three signature bills (5, 8, 11) predeclared empty BEFORE batch 1. After 312 unique papers and TWO rounds of verification (b2-901 + cross-ledger 2026-05-15), all three remain empty: 0 verified clean triggers each.
Verification rule: Mandatory arXiv-ID + abstract verification before any breach commitment. Driven by the batch-1 result that 9 of 9 flagged robotics IDs failed source verification. Now applied as Stage 3.5 in all subsequent ledgers.
Bridge test status: Bill 4 (causally-faithful grounded reward) was deliberately NOT marked ★ as a bridge-falsification test. Round-2 verification killed all 4 claimed Bill_4 triggers. The cross_ledger_bridges Bridge 1 falsification claim is unsupported (not refuted); the bridge test remains open.
Reproducibility: All scripts, JSONs, ledger public. Run order: aggregate_batch_1.py → bill_classifier.py --benchmark → ledger populator → b2_901_breach_verification.py.
Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.
Public draft v0.1 (2026-05-15) — 312 unique papers across 12 sweeps; ★ Bills 5, 8, 11 EMPTY HOLDING with 0 verified clean triggers each after two rounds of verification killed 9 hallucinated IDs. Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual literature, not from a template.
Final state · 2026-05-15
Three signature-empty bills.
312 unique papers.
Empty space holding (after verification).