Public draft · v0.2 · May 2026 · Multimodal Generation Ledger

★ Public Draft · v0.2 · Real Data · 17th of 19 · Strong B7/B8 Bipolar · 8 Sweeps · 377 Papers

The Thirteen-Bill Closure Pattern for Frontier Multimodal Generation Claims (Image / Video / Audio, 2024–2026)

A real-data falsification harness for 2024–2026 frontier multimodal generation capability claims with 377-paper empirical verification, three signature-empty bills, and a strong B7/B8 BIPOLAR signal observed in the current cross-ledger atlas.
Kevin Russell · ProjectForty2 · 2026-05-15 · Public draft v0.2.


Abstract

The Multimodal Generation Ledger audits 377 unique papers across 8 deep-loop sweeps (2024–2026) on frontier multimodal generation capability claims — OpenAI Sora / Sora 2 / DALL-E 3, Google Veo 2 / 3 / Imagen 3 / Lyria, Anthropic, Midjourney v6 / v7, Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma Dream Machine, Kling, Hailuo MiniMax, Suno v3 / v4, Udio, MusicGen, ElevenLabs v3, Tencent HunyuanVideo, Bytedance MagicAnimate, Genmo Mochi. The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).

Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (causally-faithful generation mechanism per intervention experiments), Bill 8 ★ (cross-modality unified generation — same model image + video + audio above clean threshold), and Bill 11 ★ (held-out compositional generalization — T2I-CompBench / GenAI-Bench / SeedBench-2 above threshold).

After batch 1 (377 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5, with 0/8, 0/9, and 0/39 clean firings for Bills 5, 8, and 11 respectively. A strong B7/B8 bipolar signal surfaced in the current 19-ledger atlas: Bill 9 (vendor-self-eval-independence) fires 78 times in the closed-cloud cluster (Sora / Veo / MJ / Adobe / RunwayML / Pika / Luma / Kling / Hailuo / Suno / Udio / Tencent cloud); Bill 12 (commercialization-axis) fires 74 times in the open-source cluster (SD3 / Flux / HunyuanVideo open weights / MusicGen / Genmo Mochi / Bytedance MagicAnimate). This is one of the strongest closed-vs-open splits observed across the 19 ledgers in the atlas, and the bipolar signature validates the cross_ledger_bridges B7 RESCOPING and B8 emergent bridge.

Batch-3 inheritance check (sweep_708 2026-05-14, cross_ledger_bridges): all three ★ bill checks matched the current annotations (0/8, 0/9, 0/39); B9 NULL check matched the current annotations (no grounded-reward triggers in pure generation); B4 substrate-conditional deepens (image / video / audio diffusion are three substrates within one ledger).

13+6 · Bills + meta-costs

3★ · Empty bills

377 · Unique Papers

78/74 · B7/B8 Bipolar

§01 Threat model + bills (predeclared in v0.1)

Threat model (verbatim from purpose.md): Demonstrate a frontier multimodal-generation capability claim — on a system generating image / video / audio at frontier scale — that survives six closure audits on the 2024–2026 corpus: (1) prompt-leakage contamination, (2) attribute-faithfulness audit (counting, color binding, spatial layout), (3) text-rendering generalization, (4) physics-consistency audit (video objects don't teleport / interpenetrate), (5) cross-resolution / cross-aspect generalization, (6) held-out prompt evaluation.

#	Bill (v0.1)	What gets paid	Star
1	Prompt-leakage contamination	Test prompts absent from training-data corpus (LAION-5B, JourneyDB)
2	Attribute-faithfulness audit	Counting / color binding / spatial layout / negation handling
3	Text-rendering generalization	Held-out text strings (non-trained typography, rare characters)
4	Physics-consistency audit	Video objects don't teleport / interpenetrate / violate gravity
5	Causally-faithful generation mechanism	Intervention experiments show attention causally produces artifact	★
6	Cross-resolution / cross-aspect	Trained 1024² generalizes to 4096² / 9:16 / 1:1 / 16:9
7	Strong-baseline classical	GAN / VAE / flow baseline at equivalent compute
8	Cross-modality unified generation	Same model image + video + audio above clean threshold	★
9	Vendor-self-eval independence (BIPOLAR: 78 closed-cloud firings)	T2I-CompBench / GenAI-Bench / VBench independent reproduction
10	Held-out style / OOD prompts	Diverse out-of-distribution prompts beyond training corpus
11	Held-out compositional generalization	T2I-CompBench / GenAI-Bench / SeedBench-2 held-out splits above threshold	★
12	Commercialization-vs-research axis (BIPOLAR: 74 open-source firings)	Closed cloud (Sora, Veo, MJ) vs open-source (SD3, Flux, HunyuanVideo); B7 bridge test
13	Safety / NSFW / deepfake / copyright	Watermarking, C2PA, model-extraction, training-data attribution

Six meta-costs (M1–M6): M1 (pre-2024), M2 (single-modality-eval), M3 (closed-vendor-only — no API access for evaluation), M4 (single-prompt-distribution), M5 (no-held-out-eval — observation-only), M6 (implementation-specific).
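The bill and meta-cost bookkeeping can be sketched as a small data model. This is a minimal illustration, not the ledger's actual code: the `Candidate` class, its `paper_id` field, and `star_bill_tally` are hypothetical names, and the rule "a candidate is a clean trigger only if it pays no meta-cost" is an assumption read off the findings text. The escape gates G1–G3 are not defined in this section and are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetaCost(Enum):
    """The six meta-costs, labelled as in the ledger."""
    M1 = "pre-2024"
    M2 = "single-modality-eval"
    M3 = "closed-vendor-only"
    M4 = "single-prompt-distribution"
    M5 = "no-held-out-eval"
    M6 = "implementation-specific"


# Bills predicted empty in v0.1.
STAR_BILLS = {5, 8, 11}


@dataclass
class Candidate:
    paper_id: str                      # hypothetical identifier for a swept paper
    bill: int                          # bill number, 1..13
    meta_costs: set = field(default_factory=set)

    def is_clean(self) -> bool:
        # Assumption: a clean trigger pays no meta-cost at all.
        return not self.meta_costs


def star_bill_tally(candidates):
    """Return {bill: (clean_count, candidate_count)} for the three star bills."""
    tally = {bill: [0, 0] for bill in sorted(STAR_BILLS)}
    for cand in candidates:
        if cand.bill in STAR_BILLS:
            tally[cand.bill][1] += 1
            if cand.is_clean():
                tally[cand.bill][0] += 1
    return {bill: tuple(counts) for bill, counts in tally.items()}
```

Under this reading, a ★ bill "holds empty" when its clean count stays at zero regardless of how many candidates were examined.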

§02 Findings — three ★ bills HOLD empty + strong B7/B8 bipolar signal

★ Bill 5 — Causally-faithful generation mechanism · 8 candidates · 0 clean

Bill 5 ★ requires intervention experiments showing that attention or cross-attention causally produces the generated artifact (image / video / audio). Of 8 batch-1 candidates, 0 are clean triggers: every candidate pays either M5 (no-held-out-eval; observation-only attention attribution) or M3 (closed-vendor-only). This directly extends LLM-centric Bridge 1 (causally-faithful mechanism empty) to the diffusion / autoregressive generation substrate. Closure requires intervention experiments; current audits are observational.

★ Bill 8 — Cross-modality unified generation · 9 candidates · 0 clean

Bill 8 ★ requires the same model to clear the clean threshold on T2I-CompBench, VBench, and an audio evaluation simultaneously, without per-modality fine-tuning. Frontier "omni" models (GPT-4o, Gemini Live, Veo 3) are marketed as unified but evaluated on only a subset of modalities. Of 9 batch-1 candidates, 0 are clean triggers: every candidate pays M2 (single-modality-eval) or M3 (closed-vendor-only).

★ Bill 11 — Held-out compositional generalization · 39 candidates · 0 clean

Bill 11 ★ requires the same model to clear all five sub-tasks (counting, color binding, spatial layout, attribute composition, negation) above the clean threshold on held-out test splits. At 39 candidates, it is the largest active ★ bill. Of 39 batch-1 candidates, 0 are clean triggers: every candidate excels at 1–3 sub-tasks and degrades on the rest. M5 (no-held-out-eval) and M3 (closed-vendor-only) dominate the meta-costs paid.
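The all-five-sub-tasks condition for Bill 11 can be written as a short predicate. This is a sketch under stated assumptions: the ledger does not give the numeric clean threshold in this section, so `threshold` is a caller-supplied parameter, and the score-dictionary keys and function names are hypothetical.

```python
# The five compositional sub-tasks named in Bill 11.
SUB_TASKS = (
    "counting",
    "color_binding",
    "spatial_layout",
    "attribute_composition",
    "negation",
)


def failing_sub_tasks(held_out_scores, threshold):
    """Return the sub-tasks that are missing or not strictly above the threshold.

    `held_out_scores` maps sub-task name to a score on a held-out split.
    A sub-task with no reported score counts as failing.
    """
    return [
        task for task in SUB_TASKS
        if task not in held_out_scores or not held_out_scores[task] > threshold
    ]


def clears_bill_11(held_out_scores, threshold):
    """True only if every one of the five sub-tasks clears the threshold."""
    return not failing_sub_tasks(held_out_scores, threshold)
```

A model that scores well on counting and color binding but reports nothing for negation fails the predicate, which mirrors the "excels at 1–3 sub-tasks and degrades on the rest" pattern described above.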

BIPOLAR signal · strong closed-vs-open split in 19-ledger atlas

Bills 9 and 12 form a striking bipolar firing pattern:

Closed-cloud cluster (Bill 9 fires 78 times): Sora / Sora 2, Veo 2 / 3, Imagen 3, DALL-E 3, Midjourney v6 / v7, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma Dream Machine, Kling, Hailuo MiniMax, Suno v3 / v4, Udio, ElevenLabs v3, Tencent HunyuanVideo (cloud product). Marketing-grade vendor self-evaluation, benchmark cherry-picking, and subset reporting dominate.

Open-source cluster (Bill 12 fires 74 times): Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, MusicGen, HunyuanVideo (open-weights), Genmo Mochi, Bytedance MagicAnimate. An open commercialization axis, reproducible weights, and community evaluation dominate.

This is one of the strongest closed-vs-open splits observed in the 19-ledger atlas, and the bipolar signature validates the cross_ledger_bridges B7 RESCOPING (commercialization-vs-research, not geopolitical) and the B8 emergent bridge (commercialization-vs-research-artifact axis). Hardware_inference is the only ledger with cleaner separation (strong 0/N signal).
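One way to quantify a bipolar pattern like this is to cross-tabulate firings by bill and cluster and measure how much of the Bill 9 / Bill 12 mass sits on the predicted diagonal. The sketch below is illustrative only: the ledger reports the two on-axis counts (78 and 74) but this section does not give the off-axis counts, so the `bipolar_share` score and the `(bill, cluster)` record shape are assumptions rather than the ledger's own metric.

```python
from collections import Counter


def firing_table(firings):
    """Count firings per (bill, cluster) pair.

    `firings` is an iterable of (bill_number, cluster) tuples, where
    cluster is "closed" or "open".
    """
    return Counter(firings)


def bipolar_share(table, closed_bill=9, open_bill=12):
    """Fraction of Bill 9 / Bill 12 firings that land on the predicted axis.

    On-axis means Bill 9 in the closed cluster or Bill 12 in the open cluster.
    Returns None when neither bill fired at all.
    """
    on_axis = table[(closed_bill, "closed")] + table[(open_bill, "open")]
    total = sum(
        count for (bill, _cluster), count in table.items()
        if bill in (closed_bill, open_bill)
    )
    return on_axis / total if total else None
```

A share near 1.0 would indicate a clean split; a share near 0.5 would indicate that the two bills fire indiscriminately across clusters, which is the kind of collapse the falsification condition describes.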

§03 Verification methodology — Stage 3.5 queue pending

Stage 3.5 verification queue · 2026-05-15

The "★ HOLDS empty" claim is provisional pending Stage 3.5. The queue consists of a priority pool of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spotchecks, all dispatched against the arXiv API. The methodology lesson from sibling ledgers applies here: Robotics_Embodied and RL-from-Rewards had checked-source failures, and Spacetime_Discreteness had priority-pool source-ID failures.
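The queue construction can be sketched as a seeded sampling step. The function name, argument shapes, and the choice of uniform random sampling are assumptions for illustration; the ledger does not specify how the priority pool or the spotchecks are selected. Note that if the per-bill quota of 10 is capped by the number of available candidates (8, 9, and 39 for Bills 5, 8, and 11), the priority pool comes to 8 + 9 + 10 = 27 IDs, which is consistent with "~30".

```python
import random


def build_stage35_queue(star_candidates, sweep_ids, per_bill=10, spotchecks=20, seed=0):
    """Build a verification queue: a priority pool plus sweep-health spotchecks.

    star_candidates: dict mapping star-bill number to an iterable of paper IDs.
    sweep_ids: iterable of all paper IDs seen in the sweeps.
    Returns (priority_ids, spotcheck_ids); the two lists never overlap.
    """
    rng = random.Random(seed)  # seeded so the queue is reproducible

    priority = []
    for bill in sorted(star_candidates):
        ids = sorted(set(star_candidates[bill]))
        # Take up to `per_bill` IDs; bills with fewer candidates contribute all of them.
        priority.extend(rng.sample(ids, min(per_bill, len(ids))))

    # Spotchecks are drawn from papers not already in the priority pool.
    remaining = sorted(set(sweep_ids) - set(priority))
    spot = rng.sample(remaining, min(spotchecks, len(remaining)))
    return priority, spot
```

Sorting before sampling keeps the result independent of set iteration order, so the same seed always yields the same queue.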

The empty-space hypothesis is less sensitive to typical source-ID errors because its closure mechanisms are structural: causally-faithful generation requires intervention experiments, cross-modality unified generation requires per-modality balanced training, and held-out compositional generalization requires per-sub-task balanced evaluation. The B7/B8 bipolar finding is likewise an architectural observation from the 78-vs-74 firing pattern, and so is less dependent on individual source-ID verification.

Cross-ledger methodology rule · v2026-05-15

Independent arXiv-ID + abstract verification is required before any breach commitment. Three complementary methodological tools apply: classifier rule-tightening, sweep-time arXiv verification, and Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
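A minimal sketch of the arXiv-ID check, assuming verification means "the recorded title matches what the arXiv API returns for that ID". It uses the public arXiv API query endpoint with its `id_list` parameter and parses the Atom response with the standard library. The helper names and the title-equality rule are hypothetical; a real check would likely also compare abstracts and tolerate minor title differences. The network fetch itself is omitted so the block stays side-effect free.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"a": "http://www.w3.org/2005/Atom"}


def query_url(arxiv_ids):
    """Build an arXiv API URL that requests the given IDs in one call."""
    ids = list(arxiv_ids)
    params = {"id_list": ",".join(ids), "max_results": len(ids)}
    return ARXIV_API + "?" + urlencode(params, safe=",")


def _norm(text):
    # Lowercase and collapse whitespace so line-wrapped titles compare equal.
    return " ".join((text or "").lower().split())


def parse_feed(xml_text):
    """Map arXiv ID (version suffix stripped) to normalized title and abstract."""
    out = {}
    for entry in ET.fromstring(xml_text).findall("a:entry", ATOM):
        raw_id = entry.findtext("a:id", "", ATOM)
        arxiv_id = re.sub(r"v\d+$", "", raw_id.rsplit("/abs/", 1)[-1])
        out[arxiv_id] = {
            "title": _norm(entry.findtext("a:title", "", ATOM)),
            "abstract": _norm(entry.findtext("a:summary", "", ATOM)),
        }
    return out


def unverified(recorded_titles, fetched):
    """Return IDs that are missing from the feed or whose title does not match."""
    bad = []
    for arxiv_id, title in recorded_titles.items():
        hit = fetched.get(arxiv_id)
        if hit is None or hit["title"] != _norm(title):
            bad.append(arxiv_id)
    return bad
```

Any ID returned by `unverified` would be held back from a breach commitment until a human confirms the source.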

§04 Lock conditions and queue

v0.2 status (2026-05-15): Stage 5 (POPULATE) complete with 8 sweeps covering 377 unique papers (✓); classifier benchmark satisfied (✓); watch-list ≥30 entries with cadences (✓); falsification protocol ≥10 conditions, including F5, F8, F11, F-Bipolar-collapse, F-Physics, and F-Safety (✓); author-activity panel mapped: OpenAI-Sora, Google-Veo / Imagen, Anthropic, Midjourney, Stability-AI, Black-Forest-Labs, Adobe-Firefly, RunwayML, Pika, Luma, Kling, Hailuo-MiniMax, Suno, Udio, Meta-MusicGen, ElevenLabs, Tencent-Hunyuan, Bytedance-Magic, Genmo-Mochi (✓). Stage 3.5 verification queue pending.

Cadence: Monthly — vendor system-card revisions + open-source diffusion releases. Quarterly — VBench / VBench-Physics, T2I-CompBench / GenAI-Bench / SeedBench-2 held-out refresh. Triggered: any 2024–2026 model that collapses the B7/B8 bipolar signal (closed-cloud open-sourcing core diffusion weights, or open-source release reaching closed-cloud product capability parity); any frontier model passing all 5 compositional sub-tasks above clean threshold on held-out splits.

Bridge-test status · cross_ledger_bridges B7 RESCOPED + B8 emergent (validated)

This ledger is one of two strong tests of the cross_ledger_bridges B7 RESCOPED + B8 emergent commercialization-vs-research bridges. A strong BIPOLAR signal was observed (78 vs 74), validating the architectural prediction. Hardware_inference is the only ledger with cleaner separation (strong 0/N signal across 0/34 + 0/38 + 0/20).

Falsification of the B7/B8 prediction here would require a 2024–2026 model that collapses the bipolar signal cleanly — e.g., a closed-cloud vendor (Sora / Veo / MJ) open-sourcing core diffusion weights with frontier capability, or an open-source release (SD3 / Flux / HunyuanVideo) reaching Sora-3 / Veo-4 capability parity on VBench.