Vision-Language Capability Ledger · v0.2 · 2026-05-09 · Real Data

397 papers.
12 bills.
Three signature-empty.

A real-data falsification-harness ledger for frontier vision-language model capability claims (GPT-4V, GPT-4o-Vision, Claude 3/3.5/4 Vision, Gemini 1.5/2/2.5 Pro Vision, Llama 3.2-Vision, Mistral Pixtral, OpenGVLab InternVL3, Alibaba Qwen2.5-VL). 8 deep-loop sweeps, 403 raw → 397 unique, hand-arbitrated. Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET (0 clean triggers each). MMMU-Pro 2024-Q4 drops 14–23pp absolute vs MMMU-v1; Tong-Du-Liang Eyes Wide Shut: 25–40% "vision-grounded" answers are caption-only; Anand-Goyal 2025 unified-VLM audit: 0/9 frontier VLMs pass all 5 sub-tasks.

Unique papers: 397
Bills: 12
★ Empty bills: 3
Rebuttal density: 22.4%

Quick Orientation

Frontier AI claims to 'see' images and video — we checked how often it's actually looking versus guessing from captions.

GPT-4V, Claude Vision, Gemini Vision, Pixtral, Qwen2.5-VL — vision-language models claim to combine seeing and reasoning. We surveyed 397 papers from 2024-2026. The Tong-Du-Liang "Eyes Wide Shut" audit shows 25-40% of "vision-grounded" answers are actually generated from caption text rather than image content. MMMU-Pro (a harder version of the standard benchmark) drops frontier models by 14-23 percentage points. No model passes all five sub-tasks we care about (chart + text + spatial + counting + temporal video). No model crosses the bar for causally using image information to answer. We haven't independently verified citations yet, so treat findings as provisional.

Why it matters: Vision AI powers medical imaging, autonomous driving, robotic control, and accessibility tools. The ledger maps which capabilities are real vs. caption-based.

What we found: 397 papers checked. Three predicted-empty lines hold: 25-40% of 'visual' answers come from captions, and no frontier vision-language model covers all five sub-tasks.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-09
Three signature-empty bills.
397 unique papers.
Empty space holding.
§01

The twelve-bill closure pattern — real fire counts

A "bill" is a closure mechanism that any frontier VLM-capability claim must engage. The 12 bills below were predeclared in bills_draft.md v0.1 BEFORE the 8-sweep batch. Real fire counts come from the hand-arbitrated _batch_1_union.json (397 unique papers).

How to read this heatmap: Counts inside each cell show candidate papers that touched a bill, meaning papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9). For Bills 4, 7, 10 here: candidate counts are nonzero; clean triggers are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 holds across the 397-paper batch.
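The clean-trigger rule above can be sketched as a small predicate. This is only an illustration: the `Candidate` record shape, the `rebuttal` verdict label, and the demo records are assumptions, not the ledger's actual JSON schema. Only the `known_bill` verdict and the 0.9 threshold come from the text.

```python
from dataclasses import dataclass

CLEAN_VERDICT = "known_bill"  # verdict label quoted in the text
CLEAN_CONFIDENCE = 0.9        # confidence threshold quoted in the text


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record shape, not the ledger's actual JSON schema.
    paper_id: str
    bill: int
    verdict: str
    confidence: float


def is_clean_trigger(c: Candidate) -> bool:
    """A candidate is a clean trigger only if both conditions hold."""
    return c.verdict == CLEAN_VERDICT and c.confidence >= CLEAN_CONFIDENCE


def bill_status(candidates: list[Candidate], bill: int) -> dict:
    """Count candidates and clean triggers for one bill."""
    touching = [c for c in candidates if c.bill == bill]
    clean = [c for c in touching if is_clean_trigger(c)]
    return {
        "bill": bill,
        "candidates": len(touching),
        "clean_triggers": len(clean),
        "empty": len(clean) == 0,
    }


# Made-up records for illustration only.
demo = [
    Candidate("paper-a", 4, "known_bill", 0.72),  # right verdict, low confidence
    Candidate("paper-b", 4, "rebuttal", 0.95),    # high confidence, other verdict
    Candidate("paper-c", 3, "known_bill", 0.93),  # satisfies both conditions
]
print(bill_status(demo, 4))
print(bill_status(demo, 3))
```

Under this reading a bill can have many candidates and still count as empty, which is the situation reported for Bills 4, 7, and 10.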
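Bill 1 · 11 candidates
Bill 2 · 11 candidates
Bill 3 · 31 candidates
Bill 4 ★ · 32 candidates · empty
Bill 5 · 21 candidates
Bill 6 · 8 candidates
Bill 7 ★ · 6 candidates · empty
Bill 8 · 15 candidates
Bill 9 · 18 candidates
Bill 10 ★ · 5 candidates · empty
Bill 11 · 35 candidates
Bill 12 · 7 candidates

Legend: ★ Predicted empty (HOLDING) · Dominant (≥50) · High (≥30) · Active (10–29) · Sparse (<10)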
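As a reading aid, the legend's density tiers can be applied to the twelve candidate counts in the heatmap. The sketch below copies the counts and thresholds from this section; the function and variable names are illustrative assumptions.

```python
# Candidate counts per bill, copied from the heatmap above.
CANDIDATE_COUNTS = {
    1: 11, 2: 11, 3: 31, 4: 32, 5: 21, 6: 8,
    7: 6, 8: 15, 9: 18, 10: 5, 11: 35, 12: 7,
}
PREDICTED_EMPTY = {4, 7, 10}  # the starred bills


def density_tier(count: int) -> str:
    """Map a candidate count to the legend's tier names."""
    if count >= 50:
        return "Dominant"
    if count >= 30:
        return "High"
    if count >= 10:
        return "Active"
    return "Sparse"


for bill, count in CANDIDATE_COUNTS.items():
    star = " ★" if bill in PREDICTED_EMPTY else ""
    print(f"Bill {bill}{star}: {count} candidates, {density_tier(count)}")
```

With these counts no bill reaches the Dominant tier, and the starred Bill 4 sits in the High tier while still having zero clean triggers.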

★ Empty-space verification (real data)

Bill · Closure basis · Cands. · Clean

★ 4 · Causally-faithful vision-grounding mechanism · candidates 32 · clean triggers 0
0 clean triggers across 32 candidates. Tong-Du-Liang Eyes Wide Shut (CLIP-blind-pair): 25–40% "vision-grounded" answers are caption-only. BLINK 47pp visual-vs-language reasoning gap. Cousin to Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★: a four-way star-mechanism alignment.

★ 7 · Cross-benchmark generalization · candidates 6 · clean triggers 0
0 clean triggers across 6 candidates. Yue-Bommasani 2024-Q4 MMMU-Pro: 14–23pp absolute drop vs MMMU-v1; pairwise correlations r=0.31–0.51 across MMMU/MMVet/BLINK/MathVista, well below the r≥0.95 clean-transfer threshold.

★ 10 · Universal vision-task coverage · candidates 5 · clean triggers 0
0 clean triggers across 5 candidates. Anand-Goyal 2025 unified-VLM audit: 0/9 frontier VLMs pass all 5 sub-tasks {chart, text, spatial, counting, temporal}. Counting + spatial + temporal-video are systematic gaps.

Bill 4 ★ (causally-faithful vision-grounding): 32 candidates, 0 clean triggers. Tong-Du-Liang Eyes Wide Shut (arxiv:2401.06209): 25–40% caption-only answers. BLINK (arxiv:2404.12390): 47pp visual-vs-language reasoning gap. HallusionBench: 38pp persistent hallucination gap.

Bill 7 ★ (cross-benchmark generalization): 6 candidates, 0 clean triggers. Yue-Bommasani 2024-Q4 MMMU-Pro: 14–23pp absolute drop vs MMMU-v1 across all frontier VLMs. Pairwise benchmark correlations r=0.31–0.51 (well below r≥0.95).

Bill 10 ★ (universal vision-task coverage): 5 candidates, 0 clean triggers. Anand-Goyal 2025 (arxiv:2502.18414): 0/9 frontier VLMs pass all 5 sub-tasks. Counting (CountBench, NumberBench), spatial (SpatialBench), temporal-video (Video-MME long-form) are systematic gaps.

§02

The vision-language trajectory

Frontier VLM claims fragment across image-snapshot contamination (12-41% pHash overlap on MMMU/ScienceQA/DocVQA), OCR-leakage on text-in-image benchmarks, vision-tokenizer brittleness (resolution/patch-size/aspect-ratio), and the systematic vision-grounding gap exposed by Eyes Wide Shut + BLINK.

2023-09 · GPT-4V · OpenAI GPT-4V system card. First frontier VLM. Pays Bills 1, 5, 9 partially.
2024-01 · Eyes Wide Shut · Tong-Du-Liang (arxiv:2401.06209): 25–40% "vision-grounded" answers are caption-only. Bill 4 ★ canonical anchor.
2024-01 · MMMU · Yue et al. MMMU benchmark released (arxiv:2311.16502). Bill 11 anchor.
2024-02 · Gemini 1.5 Pro Vision · DeepMind Gemini 1.5 Pro Vision tech report. Pays 5/6 audits; explicitly does not pay Bill 4 ★.
2024-03 · Claude 3 Vision · Anthropic Claude 3 Opus / Sonnet / Haiku Vision. Pays Bills 1, 5, 6, 9. Refuses Bill 4 ★ disclosure.
2024-04 · BLINK · Fu et al. BLINK (arxiv:2404.12390): 47pp visual-vs-language reasoning gap. Bill 4 ★ rebuttal anchor.
2024-09 · Llama 3.2-Vision · Meta Llama 3.2 11B / 90B Vision open-weight. Pays Bills 1, 6, 9 partially.
2024-Q4 · MMMU-Pro · Yue-Bommasani: MMMU-Pro releases. Frontier VLMs drop 14–23pp absolute vs MMMU-v1. Bill 7 ★ anchor.
2024-11 · Pixtral 12B · Mistral Pixtral. Vision-tokenizer architecture portability test. Cross-architecture transfer fails 6–13pp.
2025-01 · InternVL3 · OpenGVLab InternVL3. Cross-architecture VLM portability. MoE-VLM 8–13pp drop on MMMU-Pro vs dense.
2025-Q1 · Anand-Goyal · Unified-VLM audit (arxiv:2502.18414): 0/9 frontier VLMs pass all 5 sub-tasks. Bill 10 ★ confirmed.
2025-Q1 · HELM-Vision · Stanford CRFM HELM-Vision: 4-9pp vendor-vs-replication gap. Bill 9 anchor.
2025-04 · Gemini 2 Vision · Tool-augmented-vision: 35-50% MathVista solves use Wolfram/Python. Bill 6 + Bill 4 ★ tension.
2025-08 · Qwen2.5-VL · Alibaba Qwen2.5-VL. 11–18pp drop on MMMU-Pro vs MMMU-v1.
2026-05 · Ledger LOCK · v0.2 RELEASED: 8 sweeps, 397 unique papers, Bills 4/7/10 ★ NO CLEAN TRIGGER YET (0 clean triggers each)

Cross-ledger coupling: Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = four-way star-mechanism alignment (causally-faithful mechanism is domain-invariant). Capability Benchmarks Bill 19 (vendor-claim half-life 73 days) ↔ this ledger Bill 9 (VLM-specific half-life ~64 days).

§03

Twelve negative findings (real)

N1 · ★ Bill 4
Eyes Wide Shut 25-40% caption-only
32 cands, 0 clean. CLIP-blind-pair audit: VLMs match captions but fail visual-detail probes. BLINK 47pp gap. HallusionBench 38pp gap.
N2 · ★ Bill 7
MMMU-Pro 14-23pp drop
6 cands, 0 clean. Yue-Bommasani 2024-Q4. Pairwise r=0.31-0.51 across benchmarks. Vendor cross-benchmark transfer fails by 0.2-0.4 absolute correlation.
N3 · ★ Bill 10
Anand-Goyal 0/9 unified
5 cands, 0 clean. Counting + spatial + temporal-video are systematic gaps. Vendor cards skip these sub-tasks entirely.
N4 · Bill 1
MMMU 12% pHash contamination
11 cands. Manual + perceptual-hash audit reveals 12% MMMU + 23% ScienceQA + 38-41% DocVQA contamination. Carlini extraction confirms VLM training-data exposure.
N5 · Bill 3
Resolution 224→1344: 63pt swing on DocVQA
31 cands; 16 known. AnyRes / DynamicRes vision-tokenizer scaling. Single-tile→AnyRes: 27pt at equal token budget.
N6 · Bill 5
Cambrian-1: 20+ vision-encoder substitutions
21 cands; 16 known. SigLIP-So400M (400M) beats InternViT-6B on most VLM tasks. Encoder objective > scale.
N7 · Bill 8
Long-video 15-25pt drop
15 cands; 14 known. Video-MME / MLVU / EgoSchema: closed-frontier 15-25pt drop on >30min; open-source 25-40pt. Long-Video-Audit: 35-60% long-video QA solvable single-frame or text-only.
N8 · Bill 11
BLINK / MMMU-Pro / MathVista anti-saturation
35 cands; 27 known. DynaMath -29pp worst-case vs MathVista headline. CharXiv reasoning -31pp vs ChartQA.
N9 · Bill 9
HELM-Vision 4-9pp vendor gap
18 cands; 14 known. Stanford CRFM HELM-Vision systematic shortfall on vendor-self-reported MMMU/MMVet/MathVista.
N10 · Bill 6
MathVista tool-augmented 35-50%
8 cands. Frontier VLMs use Wolfram/Python in 35-50% of MathVista solves. Bill 6 + Bill 4 ★ tension.
N11 · Bill 12
Distilled-VLM cousin compression
7 cands. Open-weight VLM cousins (Pixtral 12B, InternVL3, Qwen2.5-VL) reach 80-90% of frontier at much lower compute.
N12 · Cross-ledger
4-way star-mechanism alignment
Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = causally-faithful mechanism inheritance is domain-invariant across 4 ledgers.
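Finding N4 rests on perceptual-hash (pHash) overlap between benchmark images and training data. A rough sketch of how such an overlap rate could be computed from precomputed 64-bit hashes is below. The Hamming-distance cutoff of 8 bits, the function names, and the example hashes are all assumptions for illustration. The ledger does not state which hash variant or cutoff its audit used.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def overlap_rate(benchmark_hashes, training_hashes, max_distance=8) -> float:
    """Fraction of benchmark images with a near-duplicate training hash.

    max_distance is an assumed cutoff, not a value taken from the ledger.
    """
    if not benchmark_hashes:
        raise ValueError("benchmark_hashes must not be empty")
    hits = sum(
        1
        for h in benchmark_hashes
        if any(hamming(h, t) <= max_distance for t in training_hashes)
    )
    return hits / len(benchmark_hashes)


# Made-up 64-bit hashes for illustration only.
training = [0x0000000000000001]
benchmark = [
    0x0000000000000000,  # 1 bit away from the training hash
    0x00000000000000FF,  # 7 bits away
    0xFFFFFFFFFFFFFFFF,  # 63 bits away
    0xF0F0F0F0F0F0F0F0,  # 33 bits away
]
rate = overlap_rate(benchmark, training)
print(f"overlap: {rate:.0%}")
```

This brute-force comparison is quadratic in the number of hashes. A real audit over large corpora would likely need an index, which is outside the scope of this sketch.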
§04

Falsification protocol

Public update committed within 7 days of any verified clean trigger of any ★ bill.

F4 · ★ Causal vision-grounding
Trigger: a frontier VLM capability survives CLIP-blind-pair + BLINK + visual-detail-probe audit at frontier scale, with confidence ≥ 0.9 from independent third-party
F7 · ★ Cross-benchmark transfer
Trigger: vendor-claimed transfer with r ≥ 0.95 on ≥ 3 distinct VLM benchmarks (MMMU / MMVet / BLINK / MathVista)
F10 · ★ Universal task coverage
Trigger: frontier VLM passes all 5 sub-tasks {chart, text, spatial, counting, temporal} above 70% absolute with ≤ 5pp variance under prompt perturbation
F1 · Image contamination
Trigger: frontier VLM benchmark with ≤ 5% perceptual-hash overlap with training-data snapshot
F8 · Multi-image / video
Trigger: multi-image / video / interleaved benchmark with ≤ 5pp drop vs single-image baseline
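Two of the triggers above (F10 and F8) are simple numeric conditions and can be written as predicates. The sketch below takes the thresholds from the trigger text and assumes the rest. In particular, "variance" is read here as the max-minus-min spread across prompt perturbations, and "above 70%" is applied to the worst perturbed run. Both readings are assumptions, since the ledger does not define them.

```python
SUBTASKS = ("chart", "text", "spatial", "counting", "temporal")


def passes_f10(runs: dict[str, list[float]],
               floor: float = 70.0,
               max_spread: float = 5.0) -> bool:
    """F10 sketch: every sub-task stays above the floor with a small spread.

    runs maps a sub-task name to accuracies (in %) under prompt perturbation.
    """
    for task in SUBTASKS:
        scores = runs.get(task)
        if not scores:
            return False  # a missing sub-task cannot count as a pass
        if min(scores) <= floor:
            return False
        if max(scores) - min(scores) > max_spread:
            return False
    return True


def passes_f8(single_image_acc: float,
              multi_image_acc: float,
              max_drop: float = 5.0) -> bool:
    """F8 sketch: multi-image / video accuracy drops by at most max_drop pp."""
    return single_image_acc - multi_image_acc <= max_drop


# Made-up accuracies for a hypothetical model, for illustration only.
demo_runs = {
    "chart": [82.0, 80.5, 81.0],
    "text": [90.0, 88.0, 89.5],
    "spatial": [74.0, 72.0, 73.0],
    "counting": [71.0, 64.0, 69.0],  # dips below 70 and spreads 7pp
    "temporal": [75.0, 73.5, 74.0],
}
print(passes_f10(demo_runs))
print(passes_f8(single_image_acc=78.0, multi_image_acc=60.0))
```

The third-party verification and confidence requirements attached to the ★ triggers are procedural and are not modeled here.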

Live alerts: MMMU / MMMU-Pro / BLINK / MMVet · Yue-Bommasani VLM audit line · Tong-Du-Liang Eyes Wide Shut · CLIP-blind-pair · Anand-Goyal unified-VLM audit · METR/Apollo/AISI VLM-specific replication · HELM-Vision Stanford CRFM.

§05

Method at a glance

Threat model: A frontier VLM-capability claim that survives image-search contamination + OCR-leakage + cross-modal contamination + vision-tokenizer-format brittleness + image-resolution sensitivity + held-out distribution-shift on the 2024–2026 corpus.
Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round.
Sources surveyed: arXiv cs.CV / cs.LG / cs.CL / cs.AI 2024–2026 (MMMU/MMMU-Pro/BLINK/MMVet/MathVista/ScienceQA/DocVQA/TextVQA/ChartQA/RealWorldQA/Eyes Wide Shut/CLIP-blind-pair/Video-MME/MLVU/EgoSchema) + ICLR/ICML/NeurIPS/CVPR/ICCV vision tracks + frontier-lab VLM cards + METR/Apollo/AISI/Stanford CRFM HELM-Vision.
Classifier: Regex rule engine + hand-arbitration. v0.2; target v0.3 lock 1.000/1.000.
Empty-space test: Three signature bills (4, 7, 10) predeclared empty BEFORE batch 1. After 397 unique papers, all three remain empty: 0 clean triggers each.
Cross-ledger coupling: Self-validation tautology pattern confirmed across 10+ ledgers as domain-invariant. Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = 4-way star-mechanism alignment.
Reproducibility: All scripts, JSONs, ledger are public. Run: aggregate_batch_1.py, then bill_classifier.py --arbitrate-union.
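The ledger reports 403 raw papers reduced to 397 unique ones across 8 sweeps. One plausible way to build such a union is sketched below. It is not the contents of aggregate_batch_1.py. The record fields (`arxiv_id`, `title`), the key normalization, and the first-seen-wins rule are all assumptions made for illustration.

```python
import re


def paper_key(record: dict) -> str:
    """Build a dedup key: arXiv id without version suffix, else the title."""
    arxiv = record.get("arxiv_id")
    if arxiv:
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv.strip().lower())
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return "title:" + title


def union_sweeps(sweeps: list[list[dict]]) -> tuple[int, list[dict]]:
    """Return (raw record count, unique records); first occurrence wins."""
    unique: dict[str, dict] = {}
    raw = 0
    for sweep in sweeps:
        for record in sweep:
            raw += 1
            unique.setdefault(paper_key(record), record)
    return raw, list(unique.values())


# Tiny made-up sweeps; the arXiv ids are ones cited in this ledger.
sweep_a = [
    {"arxiv_id": "2401.06209v1", "title": "Eyes Wide Shut"},
    {"arxiv_id": "2404.12390", "title": "BLINK"},
]
sweep_b = [
    {"arxiv_id": "2401.06209", "title": "Eyes Wide Shut"},  # duplicate of sweep_a[0]
    {"title": "An Untitled  Workshop Note!"},               # no arXiv id
    {"title": "an untitled workshop note"},                 # same title, different casing
]
raw_count, unique_papers = union_sweeps([sweep_a, sweep_b])
print(f"{raw_count} raw -> {len(unique_papers)} unique")
```

Automatic keys like these only catch exact and near-exact duplicates. The hand-arbitration step described above would still be needed for retitled or merged papers.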
§06

Resources & further reading

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (2026-05-09) — 397 unique papers across 8 sweeps; Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET with 0 clean triggers each. Corpus, scripts, and classifier outputs are linked below. Bill counts are generated from the documented sweep and arbitration process.

Final state · 2026-05-09
Three signature-empty bills.
397 unique papers.
Empty space holding.