Multimodal Generation Ledger · v0.2 · 2026-05-15 · Ledger populated · 17th of 19 ledgers

377 papers.
13 bills + 6 meta-costs.
Three signature-empty + strong B7/B8 bipolar.

A real-data falsification harness for 2024–2026 frontier multimodal generation claims across image, video, and audio. ★ Bills 5 (causally-faithful generation mechanism), 8 (cross-modality unified generation), and 11 (held-out compositional generalization) HOLD pre-Stage-3.5, with 0 clean triggers among 8, 9, and 39 candidates respectively. Strong B7/B8 BIPOLAR signal in the 19-ledger atlas: Bill 9 (vendor-self-eval-independence) fires 78 times in the closed-cloud cluster (Sora / Veo / MJ); Bill 12 (commercialization-axis) fires 74 times in the open-source cluster (SD3 / Flux / HunyuanVideo). 8 deep-loop sweeps covered 377 unique papers across vendor system cards, compositional benchmarks, physics-consistency audits, the B7 commercialization-axis bridge test, and independent third-party audits.

377 · Unique papers
13+6 · Bills + meta-costs
3 · ★ Empty bills
78/74 · B7/B8 bipolar

Quick Orientation

AI now generates images, video, and music — we checked which generators do what their marketing says.

Sora, Veo, Midjourney, Stable Diffusion, Suno, ElevenLabs: frontier generative AI now ships across image, video, and audio. We surveyed 377 papers from 2024–2026 covering the major systems. No model has shown that its internal mechanism causally produces a specific generated artifact in an intervention experiment. No single model handles all three modalities (image + video + audio) above a clean threshold. None passes the held-out compositional benchmarks designed to defeat training-data memorization. The clearest pattern in the ledger is that closed-cloud products and open-source research artifacts split sharply on independent evaluation discipline. Independent citation verification is still in progress.

Why it matters: Generative media is the front-line consumer AI product. The line between 'genuinely understands what it's drawing' and 'remixes the training set' matters.

What we found: 377 papers checked. Three predicted-empty lines hold (0 clean triggers among 8, 9, and 39 candidates). A strong signal so far for a closed-product vs. open-research evaluation split.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-15
Three signature-empty bills.
377 unique papers across 8 sweeps.
Strong B7/B8 bipolar signal in atlas.
§01

The thirteen-bill closure pattern for multimodal-generation claims

Bills are the closure mechanisms any 2024–2026 frontier multimodal generation capability claim must engage. The 13 bills below were predeclared in bills_draft.md v0.1 before any sweep ran, calibrated to the structure of the multimodal generation literature (vendor system cards, prompt-leakage contamination, attribute-faithfulness, text-rendering, physics-consistency, cross-resolution, held-out compositional benchmarks, commercialization-axis, safety / NSFW / deepfake / copyright). Bills 5, 8, 11 are ★ — empty-space hypothesis predicts no clean trigger without paying a meta-cost.

How to read this heatmap: Counts inside each cell show candidate papers that touched a bill. ★ Bills 5, 8, 11 hold empty pre-Stage-3.5 (0 clean triggers among 8, 9, and 39 candidates). Bills 9 and 12 form a striking BIPOLAR signal: Bill 9 (vendor-self-eval) fires 78 times in the closed-cloud cluster; Bill 12 (commercialization-axis) fires 74 times in the open-source cluster. This is a strong closed-vs-open split in the current 19-ledger atlas.
Bill · Count · Note
1 · 28
2 · 39
3 · 17
4 · 31
5★ · 8 · empty?
6 · 19
7 · 14
8★ · 9 · empty?
9 · 78 · closed
10 · 22
11★ · 39 · empty?
12 · 74 · open
13 · 26
Legend: ★ Predicted empty (HOLDING pre-Stage-3.5) · BIPOLAR (closed-vs-open split) · High (25-50) · Active (10-24)
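
The legend bands can be read as a small lookup. The sketch below is illustrative only: the counts are copied from the heatmap, while the name `legend_band` and the "unbanded" fallback for counts outside the stated ranges are assumptions and not part of the ledger tooling.

```python
# Counts per bill as listed in the heatmap; the star set marks predeclared-empty bills.
BILL_COUNTS = {1: 28, 2: 39, 3: 17, 4: 31, 5: 8, 6: 19, 7: 14,
               8: 9, 9: 78, 10: 22, 11: 39, 12: 74, 13: 26}
STAR_BILLS = {5, 8, 11}
BIPOLAR_BILLS = {9: "closed", 12: "open"}


def legend_band(bill: int, count: int) -> str:
    """Map one heatmap cell to the legend category it would be drawn with."""
    if bill in STAR_BILLS:
        return "predicted empty"
    if bill in BIPOLAR_BILLS:
        return f"bipolar ({BIPOLAR_BILLS[bill]})"
    if 25 <= count <= 50:
        return "high"
    if 10 <= count <= 24:
        return "active"
    # Assumption: the legend defines no band for other counts.
    return "unbanded"


if __name__ == "__main__":
    for bill, count in BILL_COUNTS.items():
        print(bill, count, legend_band(bill, count))
```

Star and bipolar membership are checked before the numeric bands, since Bill 11 (39 candidates) would otherwise fall into the High range despite being a ★ bill.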

★ Empty-space census (HOLDS pre-Stage-3.5 verification)

Bill · Closure basis · Cands. · Clean

★ 5 · Causally-faithful generation mechanism (intervention-verified attention/cross-attention) · 8 · 0
8 candidates. The closure requires intervention experiments showing attention or cross-attention causally produces the generated artifact (image / video / audio). 0 clean triggers: every candidate either pays M5 (no-citation-eval; observation-only attention attribution) or M3 (closed-vendor only). Direct extension of LLM-centric Bridge 1 (causally-faithful mechanism empty) to the diffusion / autoregressive generation substrate.

★ 8 · Cross-modality unified generation (same model image + video + audio above clean threshold) · 9 · 0
9 candidates. Frontier "omni" models (GPT-4o, Gemini Live, Veo 3) marketed as unified but evaluated subset-only. 0 clean triggers: every candidate pays M2 (single-modality-eval) or M3 (closed-vendor only). The closure requires a single model to clear above-threshold on T2I-CompBench + VBench + audio-eval simultaneously without per-modality fine-tuning.

★ 11 · Held-out compositional generalization (T2I-CompBench / GenAI-Bench / SeedBench-2 held-out splits above threshold) · 39 · 0
39 candidates, the largest active ★ bill. The closure requires the same model to clear all five sub-tasks (counting, color binding, spatial layout, attribute composition, negation) above clean threshold on held-out test splits. 0 clean triggers: every candidate excels at 1–3 sub-tasks and degrades on the rest. M5 (no-held-out-eval) and M3 dominate the meta-cost paid.
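
The census rule above can be sketched in a few lines, assuming a candidate counts as a clean trigger only when it touches a ★ bill without paying any meta-cost. The `Candidate` record, the `census` function, and the paper IDs are hypothetical and shown for illustration; they are not the ledger's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    paper_id: str                     # hypothetical identifier, not a real source ID
    bill: int                         # bill the paper touches
    meta_costs: set = field(default_factory=set)  # e.g. {"M3", "M5"}


def census(candidates, star_bills=(5, 8, 11)):
    """Count candidates and clean triggers per star bill.

    Assumption: "clean" means no meta-cost was paid.
    """
    table = {b: {"candidates": 0, "clean": 0} for b in star_bills}
    for c in candidates:
        if c.bill not in table:
            continue
        table[c.bill]["candidates"] += 1
        if not c.meta_costs:
            table[c.bill]["clean"] += 1
    return table


if __name__ == "__main__":
    # Toy records only; they do not correspond to papers in the ledger.
    toy = [
        Candidate("toy-a", 5, {"M5"}),
        Candidate("toy-b", 8, {"M2", "M3"}),
        Candidate("toy-c", 11, {"M3"}),
    ]
    print(census(toy))
```

Under this reading, the reported 0/8, 0/9, and 0/39 rows mean that every candidate on those bills carries at least one meta-cost tag.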

Bill 5 ★ (causally-faithful generation): 8 candidates, 0 clean. Direct extension of LLM-centric Bridge 1 to the diffusion / autoregressive generation substrate. Closure requires intervention experiments; current audits are observational.

Bill 8 ★ (cross-modality unified): 9 candidates, 0 clean. Frontier "omni" models (GPT-4o, Gemini Live, Veo 3) marketed as unified but evaluated subset-only.

Bill 11 ★ (held-out compositional): 39 candidates, 0 clean. Largest active ★ bill. Models excel at 1–3 sub-tasks and degrade on the rest.

BIPOLAR signal · strong closed-vs-open split in the 19-ledger atlas

Bills 9 (vendor-self-eval-independence) and 12 (commercialization-axis) form a striking bipolar firing pattern:

Closed-cloud cluster (Bill 9 fires 78 times): Sora / Sora 2, Veo 2 / 3, Imagen 3, DALL-E 3, Midjourney v6 / v7, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma Dream Machine, Kling, Hailuo MiniMax, Suno v3 / v4, Udio, ElevenLabs v3, Tencent HunyuanVideo (cloud product). Marketing-grade vendor-self-evaluation + benchmark cherry-pick + sub-set reporting dominates.

Open-source cluster (Bill 12 fires 74 times): Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, MusicGen, HunyuanVideo (open-weights), Genmo Mochi, Bytedance MagicAnimate. Open commercialization-axis + reproducible weights + community evaluation dominates.

This is the strong closed-vs-open split observed in the 19-ledger atlas. The bipolar signature supports the cross_ledger_bridges B7 RESCOPING (commercialization-vs-research, not geopolitical) and the B8 emergent bridge (commercialization-vs-research-artifact axis). hardware_inference is the only ledger with cleaner separation (strong 0/N signal).

Stage 3.5 verification queue · ledger populated, verification pending

Status: ledger populated; verification partial. The ledger ran 8 deep-loop sweeps over 377 unique papers and produced a strong B7/B8 bipolar signal in the current cross-ledger atlas. The Stage 3.5 verification queue is pending: a priority pool of ~30 ★-bill candidate IDs (up to 10 per ★ bill) plus 20 sweep-health spotchecks, to be dispatched against the arXiv API.
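
A minimal sketch of how such a queue could be assembled, assuming "10 per ★ bill" acts as a cap and that spotchecks are drawn at random from the non-★ papers. The function name `build_queue`, the fixed seed, and the ID format are assumptions for illustration, not the ledger's dispatcher.

```python
import random


def build_queue(star_candidates, other_ids, per_bill=10, spotchecks=20, seed=0):
    """Return (priority_ids, spotcheck_ids) for a verification pass.

    star_candidates: dict mapping star bill number -> list of candidate IDs
    other_ids: IDs of papers outside the star bills, used for spotchecks
    """
    rng = random.Random(seed)  # seeded so the queue is reproducible
    priority = []
    for bill in sorted(star_candidates):
        ids = list(star_candidates[bill])
        # Cap at per_bill; a bill with fewer candidates contributes all of them.
        priority.extend(rng.sample(ids, min(per_bill, len(ids))))
    pool = list(other_ids)
    spot = rng.sample(pool, min(spotchecks, len(pool)))
    return priority, spot


if __name__ == "__main__":
    # Placeholder IDs sized like the star bills in this ledger (8, 9, 39).
    star = {
        5: [f"b5-{i}" for i in range(8)],
        8: [f"b8-{i}" for i in range(9)],
        11: [f"b11-{i}" for i in range(39)],
    }
    others = [f"p-{i}" for i in range(100)]
    priority, spot = build_queue(star, others)
    print(len(priority), len(spot))
```

With candidate pools of 8, 9, and 39, a cap of 10 yields 8 + 9 + 10 = 27 priority IDs, which fits the "~30" figure.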

The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (causally-faithful generation requires intervention experiments; cross-modality unified requires per-modality balanced training; held-out compositional requires per-sub-task balanced evaluation), not contingent on individual paper IDs. The B7/B8 bipolar finding is likewise an architectural observation from the 78-vs-74 firing pattern, and so is less dependent on individual source-ID verification.

§02

Frontier multimodal generation vendors and models in scope

The ledger tracks frontier multimodal generation capability claims by vendor / model lineage across image, video, and audio. The closed-cloud / open-source cluster boundary creates the bipolar B7/B8 signal.

Vendor · Models · Access
OpenAI · Sora / Sora 2 / DALL-E 3 · closed cloud
Google · Veo 2 / 3 / Imagen 3 / Lyria · closed cloud
Midjourney · v6 / v7 · closed cloud
Stability / Flux · SD3 / 3.5 + Flux · open weights
RunwayML / Pika · Gen-3/4 + Pika 2.0 · closed cloud
Suno / Udio · audio generation · closed cloud
Tencent / Bytedance · HunyuanVideo + MagicAnimate · mixed open/closed
Adobe / Microsoft · Firefly 3 + Designer · closed cloud

The frontier multimodal generation literature splits sharply between closed-cloud vendor products and open-source releases. Bill 9 (vendor-self-eval) catches the closed-cloud reporting gap; Bill 12 (commercialization-axis) catches the open-source community evaluation reporting. The 78-vs-74 bipolar firing is a strong signal of the B7 RESCOPING + B8 emergent bridge across the current cross-ledger atlas.

§03

Method at a glance

Threat model: Demonstrate a frontier multimodal-generation capability claim, on a system generating image / video / audio at frontier scale, that survives six closure audits: (1) prompt-leakage contamination, (2) attribute-faithfulness audit (counting, color binding, spatial layout), (3) text-rendering generalization, (4) physics-consistency audit (video objects don't teleport / interpenetrate), (5) cross-resolution / cross-aspect generalization, (6) held-out prompt evaluation.

Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round (Stage 5 POPULATE).

Sources surveyed: Vendor system cards (OpenAI Sora / Sora 2, Google Veo 2 / 3, Imagen 3, Lyria, Midjourney v6 / v7, Stable Diffusion 3 / 3.5, SDXL Turbo, Flux dev / pro / schnell, Adobe Firefly 3, RunwayML Gen-3 / Gen-4, Pika 2.0, Luma, Kling, Hailuo MiniMax, Suno, Udio, MusicGen, ElevenLabs v3, Tencent HunyuanVideo, Bytedance MagicAnimate, Genmo Mochi) + prompt-leakage / training-data contamination audits (LAION-5B Carlini extraction, JourneyDB) + attribute faithfulness benchmarks (T2I-CompBench, GenAI-Bench, VBench, SeedBench-2) + physics-consistency video audits (VBench-Physics, Sora-violations critique) + B7 commercialization-axis bridge test + independent third-party audits + safety / NSFW / deepfake / copyright papers.

Bridge-test focus: Tests cross_ledger_bridges B7 RESCOPED / B8 emergent and B4 substrate-conditional. Strong BIPOLAR signal observed: 78 closed-cloud Bill 9 vs 74 open-source Bill 12. B9 NULL check matched the current annotations (no grounded-reward triggers in pure generation; power/thermal/latency don't ground generation pipelines). B4 substrate-conditional deepens (image / video / audio diffusion are three substrates within one ledger).

Empty-space test: Three signature bills (5, 8, 11) predeclared empty in v0.1 BEFORE batch 1 sweeps. After 377 unique papers, all three ★ bills HOLD pre-Stage-3.5: 8 / 9 / 39 candidates, 0 clean triggers across all three. Bill 11 (held-out compositional) is the largest active ★ bill at 39 candidates.

Verification rule: Stage 3.5 verification queue pending. Priority pool of ~30 ★-bill candidate IDs + 20 sweep-health spotchecks. Closure mechanisms are structural; the empty-space hypothesis is less sensitive to typical source-ID errors. The B7/B8 bipolar finding is architectural and less dependent on individual source-ID verification.

Cross-ledger coupling: cross_ledger_bridges B7 RESCOPED + B8 emergent + B4 substrate-conditional + B9 NULL; this ledger validated all four. hardware_inference shares the bipolar B7/B8 architecture (strong signal there). vision_language Bill 4 cousin to Bill 5 here on attention-faithfulness.

Reproducibility: Scripts, JSONs, ledger public. Run order: sweep dispatcher → bill_classifier.py → ledger populator → atlas review pipeline. Stage 3.5 verifier dispatch in queue.

§04

Falsification protocol

Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F5, F8, or F11.

F5 · ★ Causally-faithful generation
Trigger: a 2024–2026 paper demonstrating a frontier image / video / audio generation system whose attention or cross-attention causally produces the generated artifact per intervention experiments — verified by an independent third-party (METR / AISI / Apollo / Anthropic mech interp lab) within 6 months.
F8 · ★ Cross-modality unified
Trigger: a 2024–2026 frontier "omni" model (GPT-4o, Gemini Live, Veo 3, etc.) demonstrating above-clean-threshold performance on T2I-CompBench + VBench + audio-eval simultaneously without per-modality fine-tuning — verified by an independent third-party.
F11 · ★ Held-out compositional
Trigger: a 2024–2026 frontier model passing all 5 compositional sub-tasks (counting + color binding + spatial layout + attribute composition + negation) above clean threshold on held-out T2I-CompBench / GenAI-Bench / SeedBench-2 splits — verified by an independent third-party.
F-Bipolar-collapse
Soft trigger: a 2024–2026 model that collapses the B7/B8 bipolar signal (e.g., a closed-cloud vendor open-sourcing core diffusion weights, or an open-source release reaching closed-cloud product capability parity on VBench). Would re-test the B7 RESCOPING.
F-Physics
Soft trigger: a 2024–2026 video generation system passing VBench-Physics + Sora-violations critique benchmarks at >90% — currently every system fails on object permanence / interpenetration / gravity violation at material rates.
F-Safety
Soft trigger: a frontier multimodal generation system passing C2PA + watermarking + training-data attribution at full coverage — currently Bill 13 (safety / NSFW / deepfake / copyright) fires across most candidates with M3 paid.
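
As one example of turning a trigger into a check, the F11 condition can be sketched as a predicate. This is an illustration under stated assumptions: the ledger does not publish a numeric "clean threshold", so `threshold` is a caller-supplied parameter, and the score dictionary layout and the name `f11_fires` are hypothetical.

```python
# The five compositional sub-tasks named in the F11 trigger.
SUBTASKS = (
    "counting",
    "color_binding",
    "spatial_layout",
    "attribute_composition",
    "negation",
)


def f11_fires(heldout_scores: dict, threshold: float,
              independently_verified: bool) -> bool:
    """Return True only if every sub-task clears the threshold on held-out
    splits AND an independent third party has verified the result.

    A missing sub-task score is treated as a failure (assumption).
    """
    if not independently_verified:
        return False
    return all(
        sub in heldout_scores and heldout_scores[sub] >= threshold
        for sub in SUBTASKS
    )


if __name__ == "__main__":
    # Toy scores: strong on three sub-tasks, weak on two.
    toy = {
        "counting": 0.91,
        "color_binding": 0.88,
        "spatial_layout": 0.86,
        "attribute_composition": 0.52,
        "negation": 0.40,
    }
    print(f11_fires(toy, threshold=0.8, independently_verified=True))
```

The toy scores mirror the pattern reported for Bill 11, where a model excels at one to three sub-tasks and degrades on the rest, so the predicate stays false.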

Live triggered watchlist: VBench / VBench-Physics quarterly releases · T2I-CompBench / GenAI-Bench / SeedBench-2 held-out refresh · METR / AISI / Apollo independent multimodal audits · frontier vendor system-card revisions (Sora / Veo / MJ / SD / Flux / Hunyuan) · open-weight diffusion releases. Monthly cadence: vendor system-card revisions + open-source releases. Quarterly: benchmark refreshes + independent audits.

§05

Resources & further reading

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (2026-05-15) — 377 unique papers across 8 sweeps; ★ Bills 5, 8, 11 HOLD pre-Stage-3.5. Strong B7/B8 BIPOLAR signal in the 19-ledger atlas. Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual frontier multimodal generation literature, not from a template.

Ledger populated · Stage 3.5 queue pending · 2026-05-15
Three signature constructions.
Strong B7/B8 bipolar signal in atlas.
Empty space HOLDS pending verification.