Vision-Language Capability Ledger · v0.2 · 2026-05-09 · Real Data

397 papers.
12 bills.
Three signature-empty.

A real-data falsification-harness ledger for frontier vision-language model capability claims (GPT-4V, GPT-4o-Vision, Claude 3/3.5/4 Vision, Gemini 1.5/2/2.5 Pro Vision, Llama 3.2-Vision, Mistral Pixtral, OpenGVLab InternVL3, Alibaba Qwen2.5-VL). 8 deep-loop sweeps, 403 raw → 397 unique, hand-arbitrated. Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET (0 clean triggers each). MMMU-Pro 2024-Q4 drops 14–23pp absolute vs MMMU-v1; Tong-Du-Liang Eyes Wide Shut: 25–40% "vision-grounded" answers are caption-only; Anand-Goyal 2025 unified-VLM audit: 0/9 frontier VLMs pass all 5 sub-tasks.

397 unique papers · 12 bills · 3 ★ empty bills · 22.4% rebuttal density

Quick Orientation

Frontier AI claims to 'see' images and video — we checked how often it's actually looking versus guessing from captions.


GPT-4V, Claude Vision, Gemini Vision, Pixtral, Qwen2.5-VL — vision-language models claim to combine seeing and reasoning. We surveyed 397 papers from 2024-2026. The Tong-Du-Liang "Eyes Wide Shut" audit shows 25-40% of "vision-grounded" answers are actually generated from caption text rather than image content. MMMU-Pro (a harder version of the standard benchmark) drops frontier models by 14-23 percentage points. No model passes all five sub-tasks we care about (chart + text + spatial + counting + temporal video). No model crosses the bar for causally using image information to answer. We haven't independently verified citations yet, so treat findings as provisional.

Why it matters: Vision AI powers medical imaging, autonomous driving, robotic control, and accessibility tools. The ledger maps which capabilities are real vs. caption-based.

What we found: 397 papers checked. Three predicted-empty bills hold. 25-40% of 'visual' answers come from captions, and no frontier vision-language model covers all five sub-tasks.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-09

Three signature-empty bills.
397 unique papers.
Empty space holding.

§01

The twelve-bill closure pattern — real fire counts

A "bill" is a closure mechanism that any frontier VLM-capability claim must engage. The 12 bills below were predeclared in bills_draft.md v0.1 BEFORE the 8-sweep batch. Real fire counts come from the hand-arbitrated _batch_1_union.json (397 unique papers).

How to read this heatmap: Counts inside each cell show candidate papers that touched a bill, meaning papers whose framing engages that closure mechanism. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger (verdict=known_bill at confidence ≥ 0.9). For Bills 4, 7, 10 here: candidate counts are nonzero; clean triggers are 0. The empty-space hypothesis predeclared in bills_draft.md v0.1 holds across the 397-paper batch.
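The reading rule above can be sketched in Python. This is a minimal sketch, not the ledger's own code: it assumes each corpus record carries `bill`, `verdict`, and `confidence` fields, and those field names and the sample records are hypothetical rather than the actual schema of _batch_1_union.json.

```python
from collections import defaultdict

CLEAN_VERDICT = "known_bill"   # verdict named in the ledger text
CLEAN_CONFIDENCE = 0.9         # confidence threshold named in the ledger text


def summarize_bills(records):
    """Return {bill: (candidate_count, clean_trigger_count)}.

    Each record is assumed (hypothetically) to be a dict with the keys
    'bill', 'verdict', and 'confidence'.
    """
    candidates = defaultdict(int)
    clean = defaultdict(int)
    for rec in records:
        bill = rec["bill"]
        candidates[bill] += 1
        if rec["verdict"] == CLEAN_VERDICT and rec["confidence"] >= CLEAN_CONFIDENCE:
            clean[bill] += 1
    return {bill: (candidates[bill], clean[bill]) for bill in candidates}


def is_star_empty(summary, bill):
    """A bill counts as star-empty when it has zero clean triggers."""
    _, clean_count = summary.get(bill, (0, 0))
    return clean_count == 0


# Illustrative records only; not taken from the corpus.
sample = [
    {"bill": 4, "verdict": "candidate", "confidence": 0.95},
    {"bill": 4, "verdict": "known_bill", "confidence": 0.80},
    {"bill": 3, "verdict": "known_bill", "confidence": 0.93},
]
summary = summarize_bills(sample)
print(summary)
print(is_star_empty(summary, 4), is_star_empty(summary, 3))
```

In the sample, Bill 4 has two candidates but neither meets both conditions, so it stays empty even though its candidate count is nonzero.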

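Heatmap (starred cells only): Bill 4 ★ · 32 candidates · empty; Bill 7 ★ · 6 candidates · empty; Bill 10 ★ · 5 candidates · empty.

Legend: ★ Predicted empty (HOLDING) · Dominant (≥50) · High (≥30) · Active (10–29) · Sparse (<10)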

★ Empty-space verification (real data)

Bill · Closure basis · Cands. · Clean

★ 4 · Causally-faithful vision-grounding mechanism · candidates 32 · clean triggers 0
0 clean triggers across 32 candidates. Tong-Du-Liang Eyes Wide Shut (CLIP-blind-pair): 25–40% "vision-grounded" answers are caption-only. BLINK 47pp visual-vs-language reasoning gap. Cousin to Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★: four-way star-mechanism alignment.

★ 7 · Cross-benchmark generalization · candidates 6 · clean triggers 0
0 clean triggers across 6 candidates. Yue-Bommasani 2024-Q4 MMMU-Pro: 14–23pp absolute drop vs MMMU-v1; pairwise correlations r=0.31–0.51 across MMMU/MMVet/BLINK/MathVista, well below the r≥0.95 clean-transfer threshold.

★ 10 · Universal vision-task coverage · candidates 5 · clean triggers 0
0 clean triggers across 5 candidates. Anand-Goyal 2025 unified-VLM audit: 0/9 frontier VLMs pass all 5 sub-tasks {chart, text, spatial, counting, temporal}. Counting + spatial + temporal-video are systematic gaps.

Bill 4 ★ (causally-faithful vision-grounding): 32 candidates, 0 clean triggers. Tong-Du-Liang Eyes Wide Shut (arxiv:2401.06209): 25–40% caption-only answers. BLINK (arxiv:2404.12390): 47pp visual-vs-language reasoning gap. HallusionBench: 38pp persistent hallucination gap.

Bill 7 ★ (cross-benchmark generalization): 6 candidates, 0 clean triggers. Yue-Bommasani 2024-Q4 MMMU-Pro: 14–23pp absolute drop vs MMMU-v1 across all frontier VLMs. Pairwise benchmark correlations r=0.31–0.51 (well below r≥0.95).

Bill 10 ★ (universal vision-task coverage): 5 candidates, 0 clean triggers. Anand-Goyal 2025 (arxiv:2502.18414): 0/9 frontier VLMs pass all 5 sub-tasks. Counting (CountBench, NumberBench), spatial (SpatialBench), temporal-video (Video-MME long-form) are systematic gaps.
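The Bill 7 ★ comparison against the r≥0.95 threshold can be illustrated with a short Python sketch. It assumes one score per model per benchmark and reads "clean transfer" as every benchmark pair reaching the threshold; that reading, the function names, and the toy scores are assumptions for illustration, not the ledger's method or data.

```python
from itertools import combinations
from math import sqrt

CLEAN_TRANSFER_R = 0.95  # threshold named in the ledger text


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """scores: {benchmark: [score per model, same model order everywhere]}."""
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


def clean_transfer(scores, threshold=CLEAN_TRANSFER_R, min_benchmarks=3):
    """True when at least min_benchmarks are given and every pair meets the threshold."""
    if len(scores) < min_benchmarks:
        return False
    return all(r >= threshold for r in pairwise_correlations(scores).values())


# Toy scores for four hypothetical models; not real benchmark results.
aligned = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [10, 11, 12, 13]}
scrambled = dict(aligned, D=[4, 1, 3, 2])
print(clean_transfer(aligned), clean_transfer(scrambled))
```

A single low-correlation pair is enough to fail the check under this reading, which is why reported correlations in the 0.31–0.51 range leave the bill empty.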

§02

The vision-language trajectory

Frontier VLM claims fragment across image-snapshot contamination (12-41% pHash overlap on MMMU/ScienceQA/DocVQA), OCR-leakage on text-in-image benchmarks, vision-tokenizer brittleness (resolution/patch-size/aspect-ratio), and the systematic vision-grounding gap exposed by Eyes Wide Shut + BLINK.

2023-09 · GPT-4V · OpenAI GPT-4V system card. First frontier VLM. Pays Bills 1, 5, 9 partially.

2024-01 · Eyes Wide Shut · Tong-Du-Liang (arxiv:2401.06209): 25–40% "vision-grounded" answers are caption-only. Bill 4 ★ canonical anchor.

2024-01 · MMMU · Yue et al. MMMU benchmark released (arxiv:2311.16502). Bill 11 anchor.

2024-02 · Gemini 1.5 Pro Vision · DeepMind Gemini 1.5 Pro Vision tech report. Pays 5/6 audits; explicitly does not pay Bill 4 ★.

2024-03 · Claude 3 Vision · Anthropic Claude 3 Opus / Sonnet / Haiku Vision. Pays Bills 1, 5, 6, 9. Refuses Bill 4 ★ disclosure.

2024-04 · BLINK · Fu et al. BLINK (arxiv:2404.12390): 47pp visual-vs-language reasoning gap. Bill 4 ★ rebuttal anchor.

2024-09 · Llama 3.2-Vision · Meta Llama 3.2 11B / 90B Vision open-weight. Pays Bills 1, 6, 9 partially.

2024-Q4 · MMMU-Pro · Yue-Bommasani: MMMU-Pro releases. Frontier VLMs drop 14–23pp absolute vs MMMU-v1. Bill 7 ★ anchor.

2024-11 · Pixtral 12B · Mistral Pixtral. Vision-tokenizer architecture portability test. Cross-architecture transfer fails 6–13pp.

2025-01 · InternVL3 · OpenGVLab InternVL3. Cross-architecture VLM portability. MoE-VLM 8–13pp drop on MMMU-Pro vs dense.

2025-Q1 · Anand-Goyal · Unified-VLM audit (arxiv:2502.18414): 0/9 frontier VLMs pass all 5 sub-tasks. Bill 10 ★ confirmed.

2025-Q1 · HELM-Vision · Stanford CRFM HELM-Vision: 4-9pp vendor-vs-replication gap. Bill 9 anchor.

2025-04 · Gemini 2 Vision · Tool-augmented vision: 35-50% of MathVista solves use Wolfram/Python. Bill 6 + Bill 4 ★ tension.

2025-08 · Qwen2.5-VL · Alibaba Qwen2.5-VL. 11–18pp drop on MMMU-Pro vs MMMU-v1.

2026-05 · Ledger LOCK · v0.2 RELEASED: 8 sweeps, 397 unique papers, Bills 4/7/10 ★ NO CLEAN TRIGGER YET (0 clean triggers each).

Cross-ledger coupling: Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = four-way star-mechanism alignment (causally-faithful mechanism is domain-invariant). Capability Benchmarks Bill 19 (vendor-claim half-life 73 days) ↔ this ledger Bill 9 (VLM-specific half-life ~64 days).

§03

Twelve negative findings (real)

N1 · ★ Bill 4

Eyes Wide Shut 25-40% caption-only

32 cands, 0 clean. CLIP-blind-pair audit: VLMs match captions but fail visual-detail probes. BLINK 47pp gap. HallusionBench 38pp gap.

N2 · ★ Bill 7

MMMU-Pro 14-23pp drop

6 cands, 0 clean. Yue-Bommasani 2024-Q4. Pairwise r=0.31-0.51 across benchmarks. Vendor cross-benchmark transfer fails by 0.2-0.4 absolute correlation.

N3 · ★ Bill 10

Anand-Goyal 0/9 unified

5 cands, 0 clean. Counting + spatial + temporal-video are systematic gaps. Vendor cards skip these sub-tasks entirely.

N4 · Bill 1

MMMU 12% pHash contamination

11 cands. Manual + perceptual-hash audit reveals 12% MMMU + 23% ScienceQA + 38-41% DocVQA contamination. Carlini extraction confirms VLM training-data exposure.

N5 · Bill 3

Resolution 224→1344: 63pt swing on DocVQA

31 cands; 16 known. AnyRes / DynamicRes vision-tokenizer scaling. Single-tile→AnyRes: 27pt at equal token budget.

N6 · Bill 5

Cambrian-1: 20+ encoder substitution

21 cands; 16 known. SigLIP-So400M (400M) beats InternViT-6B on most VLM tasks. Encoder objective > scale.

N7 · Bill 8

Long-video 15-25pt drop

15 cands; 14 known. Video-MME / MLVU / EgoSchema: closed-frontier 15-25pt drop on >30min; open-source 25-40pt. Long-Video-Audit: 35-60% long-video QA solvable single-frame or text-only.

N8 · Bill 11

BLINK / MMMU-Pro / MathVista anti-saturation

35 cands; 27 known. DynaMath -29pp worst-case vs MathVista headline. CharXiv reasoning -31pp vs ChartQA.

N9 · Bill 9

HELM-Vision 4-9pp vendor gap

18 cands; 14 known. Stanford CRFM HELM-Vision systematic shortfall on vendor-self-reported MMMU/MMVet/MathVista.

N10 · Bill 6

MathVista tool-augmented 35-50%

8 cands. Frontier VLMs use Wolfram/Python in 35-50% of MathVista solves. Bill 6 + Bill 4 ★ tension.

N11 · Bill 12

Distilled-VLM cousin compression

7 cands. Open-weight VLM cousins (Pixtral 12B, InternVL3, Qwen2.5-VL) reach 80-90% of frontier at much lower compute.

N12 · Cross-ledger

4-way star-mechanism alignment

Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = causally-faithful mechanism inheritance is domain-invariant across 4 ledgers.
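The perceptual-hash contamination figures in N4 rest on comparing image hashes between a benchmark and a training snapshot. The sketch below illustrates that kind of overlap count in Python. It is a simplified stand-in: a plain average hash over tiny grayscale grids replaces a real pHash, and the function names, distance cutoff, and sample grids are all hypothetical rather than the audit's actual procedure.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set when above the mean.

    This is an average hash, used here as a simple stand-in for pHash.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def overlap_rate(benchmark_hashes, training_hashes, max_distance=0):
    """Fraction of benchmark images within max_distance bits of any training image."""
    if not benchmark_hashes:
        return 0.0
    hits = sum(
        1
        for h in benchmark_hashes
        if any(hamming(h, t) <= max_distance for t in training_hashes)
    )
    return hits / len(benchmark_hashes)


# Toy 2x2 "images"; real audits would hash resized full images.
train_img = [[10, 200], [220, 15]]
near_copy = [[12, 198], [221, 14]]   # slightly perturbed copy of train_img
unrelated = [[200, 10], [15, 220]]

training = [average_hash(train_img)]
benchmark = [average_hash(near_copy), average_hash(unrelated)]
print(overlap_rate(benchmark, training))
```

The near-copy hashes identically to the training image despite small pixel changes, which is the property that makes hash overlap useful for spotting contamination that exact matching would miss.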

§04

Falsification protocol

Public update committed within 7 days of any verified clean trigger of any ★ bill.

F4 · ★ Causal vision-grounding

Trigger: a frontier VLM capability survives CLIP-blind-pair + BLINK + visual-detail-probe audit at frontier scale, with confidence ≥ 0.9 from independent third-party

F7 · ★ Cross-benchmark transfer

Trigger: vendor-claimed transfer with r ≥ 0.95 on ≥ 3 distinct VLM benchmarks (MMMU / MMVet / BLINK / MathVista)

F10 · ★ Universal task coverage

Trigger: frontier VLM passes all 5 sub-tasks {chart, text, spatial, counting, temporal} above 70% absolute with ≤ 5pp variance under prompt perturbation

F1 · Image contamination

Trigger: frontier VLM benchmark with ≤ 5% perceptual-hash overlap with training-data snapshot

F8 · Multi-image / video

Trigger: multi-image / video / interleaved benchmark with ≤ 5pp drop vs single-image baseline
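The F10 trigger above combines an accuracy floor with a stability condition, which a short Python sketch can make concrete. It assumes "≤ 5pp variance" means the spread (max minus min) across prompt variants and that every variant must clear 70%; both readings, along with the function name and the sample numbers, are assumptions for illustration.

```python
SUB_TASKS = ("chart", "text", "spatial", "counting", "temporal")
MIN_ACCURACY = 70.0   # percent, from the F10 trigger
MAX_SPREAD_PP = 5.0   # percentage points, from the F10 trigger


def passes_f10(results):
    """results: {sub_task: [accuracy % under each prompt variant]}.

    Assumed reading: every variant must be above 70%, and the
    max-minus-min spread per sub-task must be at most 5pp.
    """
    for task in SUB_TASKS:
        runs = results.get(task)
        if not runs:
            return False          # a missing sub-task cannot pass
        if min(runs) <= MIN_ACCURACY:
            return False          # "above 70%" is strict
        if max(runs) - min(runs) > MAX_SPREAD_PP:
            return False          # too sensitive to prompt perturbation
    return True


# Illustrative accuracies only; not measured results.
stable = {
    "chart": [82, 80, 84],
    "text": [90, 91, 89],
    "spatial": [75, 73, 76],
    "counting": [72, 71, 74],
    "temporal": [78, 77, 80],
}
weak_counting = dict(stable, counting=[72, 64, 74])
unstable_spatial = dict(stable, spatial=[85, 78, 76])
print(passes_f10(stable), passes_f10(weak_counting), passes_f10(unstable_spatial))
```

Checking all five sub-tasks in one pass mirrors the trigger's wording: strong results on four sub-tasks do not offset a single failing or unstable one.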

Live alerts: MMMU / MMMU-Pro / BLINK / MMVet · Yue-Bommasani VLM audit line · Tong-Du-Liang Eyes Wide Shut · CLIP-blind-pair · Anand-Goyal unified-VLM audit · METR/Apollo/AISI VLM-specific replication · HELM-Vision Stanford CRFM.

§05

Method at a glance

Threat model: A frontier VLM-capability claim that survives image-search contamination + OCR-leakage + cross-modal contamination + vision-tokenizer-format brittleness + image-resolution sensitivity + held-out distribution-shift on the 2024–2026 corpus.

Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round.

Sources surveyed: arXiv cs.CV / cs.LG / cs.CL / cs.AI 2024–2026 (MMMU/MMMU-Pro/BLINK/MMVet/MathVista/ScienceQA/DocVQA/TextVQA/ChartQA/RealWorldQA/Eyes Wide Shut/CLIP-blind-pair/Video-MME/MLVU/EgoSchema) + ICLR/ICML/NeurIPS/CVPR/ICCV vision tracks + frontier-lab VLM cards + METR/Apollo/AISI/Stanford CRFM HELM-Vision.

Classifier: Regex rule engine + hand-arbitration. v0.2; target v0.3 lock 1.000/1.000.

Empty-space test: Three signature bills (4, 7, 10) predeclared empty BEFORE batch 1. After 397 unique papers, all three remain empty: 0 clean triggers each.

Cross-ledger coupling: Self-validation tautology pattern confirmed across 10+ ledgers as domain-invariant. Mech Interp Bill 11 ★ + Reasoning Bill 6 ★ + Scaling Laws Bill 5 ★ + this Bill 4 ★ = 4-way star-mechanism alignment.

Reproducibility: All scripts, JSONs, ledger are public. Run: aggregate_batch_1.py → bill_classifier.py --arbitrate-union.
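The aggregation step that turns 403 raw records into 397 unique papers is a union with deduplication. The Python sketch below shows one plausible way to do that. It is not the contents of aggregate_batch_1.py: the `arxiv_id` and `title` field names, the keying strategy, and the sample IDs are hypothetical.

```python
import re


def paper_key(paper):
    """Prefer the arXiv ID; fall back to a normalised title.

    Field names 'arxiv_id' and 'title' are assumed for illustration.
    """
    arxiv_id = paper.get("arxiv_id")
    if arxiv_id:
        # Drop a trailing version suffix such as 'v2'.
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv_id.strip().lower())
    title = re.sub(r"[^a-z0-9]+", " ", paper["title"].lower()).strip()
    return "title:" + title


def union_sweeps(sweeps):
    """Merge sweeps in order, keeping the first record seen for each key."""
    unique = {}
    for sweep in sweeps:
        for paper in sweep:
            unique.setdefault(paper_key(paper), paper)
    return list(unique.values())


# Placeholder records with made-up IDs and titles.
sweep_1 = [
    {"arxiv_id": "0000.00001", "title": "Paper One"},
    {"title": "An Untracked Workshop Paper"},
]
sweep_2 = [
    {"arxiv_id": "0000.00001v2", "title": "Paper One (revised)"},
    {"title": "an untracked workshop paper!"},
    {"arxiv_id": "0000.00002", "title": "Paper Two"},
]
merged = union_sweeps([sweep_1, sweep_2])
print(len(sweep_1) + len(sweep_2), "raw ->", len(merged), "unique")
```

Automatic keys like these will miss some duplicates, for example a paper that appears once with an arXiv ID and once with only a title, which is one reason a hand-arbitration pass is still useful.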

§06

Resources & further reading

Direct cousin

The Capability Benchmarks Ledger

Draft v0.2 — 469 papers, 19 bills. Bill 19 (vendor-claim half-life 73 days) ↔ this ledger Bill 9.

Direct cousin

The Mechanistic Interpretability Ledger

Draft v0.2 — 484 papers, 18 bills. Bill 11 ★ ↔ this ledger Bill 4 ★ — 4-way star-mechanism alignment.

Direct cousin

The Reasoning / Chain-of-Thought Ledger

Draft v0.2 — 394 papers, 15 bills. Bill 6 ★ ↔ this ledger Bill 4 ★ — multimodal reasoning systematic gap.

Direct cousin

The Scaling Laws Ledger

Draft v0.2 — 302 papers, 13 bills. Bill 5 ★ ↔ this ledger Bill 4 ★ — causally-faithful mechanism inheritance.

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Corpus JSON

_batch_1_union.json

397 unique papers · deduplicated, hand-arbitrated corpus across 8 sweeps

Classifier

bill_classifier.py

Regex rule engine + hand-arbitration logic for the 12-bill closure pattern

Bill definitions

bills_draft.md

12 bills + 6 meta-costs + 3 escape gates + ★ Bills 4, 7, 10 empty-space verification with real fire counts

Threat model

purpose.md

Threat model, scope, empty-space hypothesis, cousin-ledger coupling
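The classifier listed above is described as a regex rule engine with hand-arbitration. A minimal Python sketch of that pattern follows. It is not the contents of bill_classifier.py: the patterns, the override mapping, and the paper IDs are hypothetical, and only the bill topics (Bill 1 contamination, Bill 3 resolution, Bill 8 multi-image and video) follow this ledger's descriptions.

```python
import re

# Hypothetical rules: bill number -> compiled pattern.
RULES = {
    1: re.compile(r"\b(contaminat\w*|perceptual[- ]hash)\b", re.IGNORECASE),
    3: re.compile(r"\b(resolution|patch[- ]size|aspect[- ]ratio)\b", re.IGNORECASE),
    8: re.compile(r"\b(long[- ]video|multi[- ]image)\b", re.IGNORECASE),
}


def classify(text, rules=RULES):
    """Return the sorted list of bills whose pattern matches the text."""
    return sorted(bill for bill, pattern in rules.items() if pattern.search(text))


def arbitrate(paper_id, regex_bills, overrides):
    """Hand-arbitrated labels win over regex output when present."""
    return overrides.get(paper_id, regex_bills)


# Illustrative abstract and override; not taken from the corpus.
abstract = "We study image resolution and contamination in VLM benchmarks."
regex_bills = classify(abstract)
overrides = {"paper-001": [3]}   # a reviewer decided only Bill 3 applies
print(regex_bills)
print(arbitrate("paper-001", regex_bills, overrides))
print(arbitrate("paper-002", regex_bills, overrides))
```

Keeping the overrides in a separate mapping means the regex output stays reproducible while each manual decision remains visible and auditable.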

Public draft v0.2 (2026-05-09) — 397 unique papers across 8 sweeps; Bills 4, 7, 10 ★ NO CLEAN TRIGGER YET with 0 clean triggers each. Corpus, scripts, and classifier outputs are linked below. Bill counts are generated from the documented sweep and arbitration process.

Final state · 2026-05-09

Three signature-empty bills.
397 unique papers.
Empty space holding.
