The Multimodal Generation Ledger audits 377 unique papers across 8 deep-loop sweeps (2024–2026) on frontier multimodal generation capability claims — OpenAI Sora / Sora 2 / DALL-E 3, Google Veo 2 / 3 / Imagen 3 / Lyria, Anthropic, Midjourney v6 / v7, Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma Dream Machine, Kling, Hailuo MiniMax, Suno v3 / v4, Udio, MusicGen, ElevenLabs v3, Tencent HunyuanVideo, Bytedance MagicAnimate, Genmo Mochi. The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).
Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (causally-faithful generation mechanism per intervention experiments), Bill 8 ★ (cross-modality unified generation — same model image + video + audio above clean threshold), and Bill 11 ★ (held-out compositional generalization — T2I-CompBench / GenAI-Bench / SeedBench-2 above threshold).
After batch 1 (377 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5: 0/8, 0/9, and 0/39 firings for Bills 5, 8, and 11 respectively. A strong B7/B8 bipolar signal surfaced in the current 19-ledger atlas: Bill 9 (vendor-self-eval-independence) fires 78 times in the closed-cloud cluster (Sora / Veo / MJ / Adobe / RunwayML / Pika / Luma / Kling / Hailuo / Suno / Udio / Tencent cloud); Bill 12 (commercialization-axis) fires 74 times in the open-source cluster (SD3 / Flux / HunyuanVideo open weights / MusicGen / Genmo Mochi / Bytedance MagicAnimate). This is among the strongest closed-vs-open splits observed in any of the 19 ledgers, and the bipolar signature precisely validates the cross_ledger_bridges B7 RESCOPING + B8 emergent bridge.
Batch-3 inheritance check (sweep_708 2026-05-14, cross_ledger_bridges): all three ★ bill checks matched the current annotations (0/8, 0/9, 0/39); B9 NULL check matched the current annotations (no grounded-reward triggers in pure generation); B4 substrate-conditional deepens (image / video / audio diffusion are three substrates within one ledger).
Threat model (verbatim from purpose.md): Demonstrate a frontier multimodal-generation capability claim — on a system generating image / video / audio at frontier scale — that survives six closure audits on the 2024–2026 corpus: (1) prompt-leakage contamination, (2) attribute-faithfulness audit (counting, color binding, spatial layout), (3) text-rendering generalization, (4) physics-consistency audit (video objects don't teleport / interpenetrate), (5) cross-resolution / cross-aspect generalization, (6) held-out prompt evaluation.
| # | Bill (v0.1) | What gets paid | Star |
|---|---|---|---|
| 1 | Prompt-leakage contamination | Test prompts absent from training-data corpus (LAION-5B, JourneyDB) | |
| 2 | Attribute-faithfulness audit | Counting / color binding / spatial layout / negation handling | |
| 3 | Text-rendering generalization | Held-out text strings (non-trained typography, rare characters) | |
| 4 | Physics-consistency audit | Video objects don't teleport / interpenetrate / violate gravity | |
| 5 | Causally-faithful generation mechanism | Intervention experiments show attention causally produces artifact | ★ |
| 6 | Cross-resolution / cross-aspect | Trained 1024² generalizes to 4096² / 9:16 / 1:1 / 16:9 | |
| 7 | Strong-baseline classical | GAN / VAE / flow baseline at equivalent compute | |
| 8 | Cross-modality unified generation | Same model image + video + audio above clean threshold | ★ |
| 9 | Vendor-self-eval independence (BIPOLAR: 78 closed) | T2I-CompBench / GenAI-Bench / VBench independent reproduction | |
| 10 | Held-out style / OOD prompts | Diverse out-of-distribution prompts beyond training corpus | |
| 11 | Held-out compositional generalization | T2I-CompBench / GenAI-Bench / SeedBench-2 held-out splits above threshold | ★ |
| 12 | Commercialization-vs-research axis (BIPOLAR: 74 open) | Closed cloud (Sora, Veo, MJ) vs open-source (SD3, Flux, HunyuanVideo); B7 bridge test | |
| 13 | Safety / NSFW / deepfake / copyright | Watermarking, C2PA, model-extraction, training-data attribution | |
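For tooling that consumes this table, the bill list can be held as plain data. The following is a minimal sketch, assuming a hypothetical `Bill` record and `starred_bills` helper (neither is part of the ledger harness); the numbers, names, and ★ flags are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bill:
    """Hypothetical record for one row of the bill table."""
    number: int
    name: str
    starred: bool = False  # True for bills predicted empty in v0.1


BILLS = (
    Bill(1, "Prompt-leakage contamination"),
    Bill(2, "Attribute-faithfulness audit"),
    Bill(3, "Text-rendering generalization"),
    Bill(4, "Physics-consistency audit"),
    Bill(5, "Causally-faithful generation mechanism", starred=True),
    Bill(6, "Cross-resolution / cross-aspect"),
    Bill(7, "Strong-baseline classical"),
    Bill(8, "Cross-modality unified generation", starred=True),
    Bill(9, "Vendor-self-eval independence"),
    Bill(10, "Held-out style / OOD prompts"),
    Bill(11, "Held-out compositional generalization", starred=True),
    Bill(12, "Commercialization-vs-research axis"),
    Bill(13, "Safety / NSFW / deepfake / copyright"),
)


def starred_bills(bills):
    """Return the numbers of the starred bills, in table order."""
    return [b.number for b in bills if b.starred]


print(starred_bills(BILLS))  # [5, 8, 11]
```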
Six meta-costs (M1–M6): M1 (pre-2024), M2 (single-modality-eval), M3 (closed-vendor-only — no API access for evaluation), M4 (single-prompt-distribution), M5 (no-held-out-eval — observation-only), M6 (implementation-specific).
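The six labels map naturally onto an enumeration, which keeps tags such as "M5" tied to one meaning throughout a sweep. This is an illustrative sketch only: `MetaCost` and `parse_meta_costs` are hypothetical names, and the descriptions are the short labels from the list above.

```python
from enum import Enum


class MetaCost(Enum):
    """Meta-cost labels M1-M6 as listed in the ledger."""
    M1 = "pre-2024"
    M2 = "single-modality-eval"
    M3 = "closed-vendor-only"
    M4 = "single-prompt-distribution"
    M5 = "no-held-out-eval"
    M6 = "implementation-specific"


def parse_meta_costs(tags):
    """Convert tags such as ["M3", "M5"] into a set of MetaCost members.

    An unknown tag raises KeyError, so a mistyped label is caught early.
    """
    return {MetaCost[tag] for tag in tags}


costs = parse_meta_costs(["M5", "M3"])
print(sorted(c.name for c in costs))  # ['M3', 'M5']
```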
Bill 5 ★ requires intervention experiments showing that attention or cross-attention causally produces the generated artifact (image / video / audio). Of 8 batch-1 candidates, 0 are clean triggers: every candidate pays either M5 (no-held-out-eval; observation-only attention attribution) or M3 (closed-vendor-only). This is a direct extension of LLM-centric Bridge 1 (causally-faithful mechanism empty) to the diffusion / autoregressive generation substrate. Closure requires intervention experiments; current audits are observational.
Bill 8 ★ requires the same model to clear the clean threshold on T2I-CompBench + VBench + audio-eval simultaneously, without per-modality fine-tuning. Frontier "omni" models (GPT-4o, Gemini Live, Veo 3) are marketed as unified but evaluated on a subset of modalities only. Of 9 batch-1 candidates, 0 are clean triggers: every candidate pays M2 (single-modality-eval) or M3 (closed-vendor-only).
Bill 11 ★ requires the same model to clear all five sub-tasks (counting, color binding, spatial layout, attribute composition, negation) above clean threshold on held-out test splits. It is the largest active ★ bill, with 39 candidates. Of those 39 batch-1 candidates, 0 are clean triggers: every candidate excels at 1–3 sub-tasks and degrades on the rest. M5 (no-held-out-eval) and M3 dominate the meta-costs paid.
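The Bill 11 rule is an all-of condition, which the sketch below makes explicit. It assumes a candidate is described by per-sub-task scores on held-out splits plus the set of meta-costs it pays. The function name, the score scale, and the threshold value `0.8` are all hypothetical; the ledger text does not state a numeric threshold.

```python
SUBTASKS = (
    "counting",
    "color_binding",
    "spatial_layout",
    "attribute_composition",
    "negation",
)


def is_clean_bill11_trigger(scores, meta_costs, threshold):
    """Return True only if no meta-cost is paid and every sub-task
    score is strictly above the threshold.

    A sub-task missing from `scores` counts as a failure.
    """
    if meta_costs:
        return False
    return all(
        sub in scores and scores[sub] > threshold
        for sub in SUBTASKS
    )


# Hypothetical candidate: strong on three sub-tasks, weak on the rest.
partial = {
    "counting": 0.55,
    "color_binding": 0.91,
    "spatial_layout": 0.88,
    "attribute_composition": 0.86,
    "negation": 0.40,
}
print(is_clean_bill11_trigger(partial, set(), threshold=0.8))  # False
```

Under this rule a candidate that pays M5 or M3 is rejected before its scores are inspected, which matches how the 0/39 count is described above.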
Bills 9 and 12 form a striking bipolar firing pattern:
Closed-cloud cluster (Bill 9 fires 78 times): Sora / Sora 2, Veo 2 / 3, Imagen 3, DALL-E 3, Midjourney v6 / v7, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma Dream Machine, Kling, Hailuo MiniMax, Suno v3 / v4, Udio, ElevenLabs v3, Tencent HunyuanVideo (cloud product). Marketing-grade vendor-self-evaluation + benchmark cherry-pick + sub-set reporting dominates.
Open-source cluster (Bill 12 fires 74 times): Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, MusicGen, HunyuanVideo (open-weights), Genmo Mochi, Bytedance MagicAnimate. Open commercialization-axis + reproducible weights + community evaluation dominates.
This is among the strongest closed-vs-open splits observed in the 19-ledger atlas, and the bipolar signature precisely validates the cross_ledger_bridges B7 RESCOPING (commercialization-vs-research, not geopolitical) + B8 emergent bridge (commercialization-vs-research-artifact axis). Hardware_inference is the only ledger with cleaner separation (strong 0/N signal).
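A tally of firings by bill and cluster is enough to reproduce the two headline counts. The sketch below is a rough illustration: `tally_firings` and `dominant_cluster` are hypothetical helpers, and the "dominant share" is an illustrative measure, not a statistic defined by the ledger. The text reports only the 78 closed-cluster firings of Bill 9 and the 74 open-cluster firings of Bill 12; off-cluster firings are not given, so the example input contains none.

```python
from collections import Counter


def tally_firings(firings):
    """Count firings per (bill, cluster) pair."""
    counts = Counter()
    for bill, cluster in firings:
        counts[(bill, cluster)] += 1
    return counts


def dominant_cluster(counts, bill):
    """Return (cluster, share) for the cluster where `bill` fires most."""
    per_cluster = {c: n for (b, c), n in counts.items() if b == bill}
    total = sum(per_cluster.values())
    if total == 0:
        return None, 0.0
    cluster = max(per_cluster, key=per_cluster.get)
    return cluster, per_cluster[cluster] / total


# Only the two headline counts from the ledger; no off-cluster data.
firings = [(9, "closed")] * 78 + [(12, "open")] * 74
counts = tally_firings(firings)
print(counts[(9, "closed")], counts[(12, "open")])  # 78 74
print(dominant_cluster(counts, 9))  # ('closed', 1.0)
```

With real sweep records, a share well below 1.0 for either bill would be the kind of evidence that weakens the bipolar reading.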
The "★ HOLDS empty" claim is provisional pending Stage 3.5. The priority pool holds ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spot checks, all dispatched against the arXiv API. The methodology lesson from sibling ledgers applies here (Robotics_Embodied and RL-from-Rewards checked-source failures, plus Spacetime_Discreteness priority-pool source-ID failures).
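The pool size can be checked against the batch-1 candidate counts. The sketch below assumes, hypothetically, that a ★ bill with fewer than 10 candidates contributes all of them; the ledger gives only the approximate figure of ~30, so this cap is an illustration rather than the documented procedure. All names are made up for the example.

```python
# Batch-1 candidate counts per starred bill, as reported in the ledger.
STAR_CANDIDATES = {5: 8, 8: 9, 11: 39}
PER_BILL_QUOTA = 10   # "10 per star bill"
SPOT_CHECKS = 20      # sweep-health spot checks


def pool_sizes(candidates, quota):
    """Cap each bill's contribution at its available candidates (assumed)."""
    return {bill: min(n, quota) for bill, n in candidates.items()}


sizes = pool_sizes(STAR_CANDIDATES, PER_BILL_QUOTA)
star_total = sum(sizes.values())
print(sizes)                                 # {5: 8, 8: 9, 11: 10}
print(star_total, star_total + SPOT_CHECKS)  # 27 47
```

Under this assumption the ★-bill part of the pool comes to 27 IDs rather than a nominal 30, because Bills 5 and 8 have only 8 and 9 candidates.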
The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (causally-faithful generation requires intervention experiments, cross-modality unified requires per-modality balanced training, held-out compositional requires per-sub-task balanced evaluation). The B7/B8 bipolar finding, by contrast, is an architectural observation from the 78-vs-74 firing pattern — less dependent on individual source-ID verification.
Independent arXiv-ID and abstract verification is required before any breach commitment. Three complementary methodological tools apply: classifier rule-tightening, sweep-time arXiv verification, and Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
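A cheap syntactic filter can reject malformed source IDs before any lookup is attempted. The sketch below checks only the new-style arXiv identifier shape (`YYMM.NNNNN` with an optional `vN` suffix) and restricts the year to the 2024–2026 corpus window. It does not query the arXiv API or confirm that a paper exists, and `is_plausible_arxiv_id` is a hypothetical helper, not part of the ledger tooling. The identifiers in the example are arbitrary format samples.

```python
import re

# New-style arXiv IDs: two-digit year, two-digit month, a dot, and a
# five-digit sequence number (five digits since 2015), optional version.
_ARXIV_NEW_STYLE = re.compile(
    r"(\d{2})(\d{2})\.(\d{5})(v[1-9]\d*)?", flags=re.ASCII
)


def is_plausible_arxiv_id(identifier, years=(24, 25, 26)):
    """Return True if `identifier` is well-formed and dated within `years`.

    This is a format check only; it says nothing about whether the
    paper exists or matches the abstract recorded in a sweep.
    """
    match = _ARXIV_NEW_STYLE.fullmatch(identifier.strip())
    if match is None:
        return False
    year, month = int(match.group(1)), int(match.group(2))
    return year in years and 1 <= month <= 12


for candidate in ("2401.12345", "2505.00001v3", "2413.00001", "2313.12345"):
    print(candidate, is_plausible_arxiv_id(candidate))
```

IDs that pass this filter would still need the abstract comparison described above; the filter only removes entries that cannot be valid in the first place.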
v0.2 status (2026-05-15): Stage 5 (POPULATE) complete with 8 sweeps × 377 papers (✓), classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (including F5, F8, F11, F-Bipolar-collapse, F-Physics, F-Safety; ✓), author-activity panel mapped (OpenAI-Sora, Google-Veo / Imagen, Anthropic, Midjourney, Stability-AI, Black-Forest-Labs, Adobe-Firefly, RunwayML, Pika, Luma, Kling, Hailuo-MiniMax, Suno, Udio, Meta-MusicGen, ElevenLabs, Tencent-Hunyuan, Bytedance-Magic, Genmo-Mochi; ✓). Stage 3.5 verification queue pending.
Cadence: Monthly — vendor system-card revisions + open-source diffusion releases. Quarterly — VBench / VBench-Physics, T2I-CompBench / GenAI-Bench / SeedBench-2 held-out refresh. Triggered: any 2024–2026 model that collapses the B7/B8 bipolar signal (closed-cloud open-sourcing core diffusion weights, or open-source release reaching closed-cloud product capability parity); any frontier model passing all 5 compositional sub-tasks above clean threshold on held-out splits.
This ledger is one of two strong tests of the cross_ledger_bridges B7 RESCOPED + B8 emergent commercialization-vs-research bridges. A strong BIPOLAR signal is observed (78 vs 74), precisely validating the architectural prediction. Hardware_inference is the only ledger with cleaner separation (strong 0/N signal across 0/34 + 0/38 + 0/20).
Falsification of the B7/B8 prediction here would require a 2024–2026 model that collapses the bipolar signal cleanly — e.g., a closed-cloud vendor (Sora / Veo / MJ) open-sourcing core diffusion weights with frontier capability, or an open-source release (SD3 / Flux / HunyuanVideo) reaching Sora-3 / Veo-4 capability parity on VBench.