# Vision-Language Capability Ledger — Purpose

## Threat model (one paragraph)

Demonstrate a frontier vision-language model capability claim, made for a
model that explicitly markets vision-grounded capability (GPT-4V,
GPT-4o-Vision, Claude 3 / 3.5 Vision, Gemini 1.5 / 2 Pro Vision, Llama
3.2-Vision, Mistral Pixtral, OpenGVLab InternVL3, Alibaba Qwen2.5-VL), that
survives six closure audits on the 2024–2026 corpus: **(1) image-search /
web-snapshot contamination, (2) OCR-extracted-text leakage, (3)
vision-tokenizer-format brittleness (resolution, patch-size, aspect-ratio),
(4) cross-VLM-architecture portability, (5) multi-image / video / interleaved
generalization, (6) held-out anti-saturation construction (MMMU-Pro / BLINK /
MMVet held-out by design).** A clean trigger additionally requires
independent third-party verification (METR / Apollo / AISI / Stanford CRFM
HELM-vision) within 6 months.
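
The clean-trigger rule above can be read as a simple conjunction: all six audits pass, and a listed third party verifies in time. The following is a minimal sketch of that reading, not the ledger's actual tooling. The `Claim` record, the audit keys, and the choice to count "6 months" as 183 days from the claim date are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set

# Hypothetical short keys for the six closure audits listed in the threat model.
AUDITS = (
    "contamination",       # (1) image-search / web-snapshot contamination
    "ocr_leakage",         # (2) OCR-extracted-text leakage
    "tokenizer_format",    # (3) resolution, patch-size, aspect-ratio brittleness
    "cross_architecture",  # (4) cross-VLM-architecture portability
    "multi_image_video",   # (5) multi-image / video / interleaved generalization
    "held_out",            # (6) held-out anti-saturation construction
)

# Verifiers named in the threat model.
VERIFIERS = {"METR", "Apollo", "AISI", "Stanford CRFM HELM-vision"}

# Assumption: "within 6 months" is measured from the claim date, as 183 days.
VERIFICATION_WINDOW_DAYS = 183


@dataclass
class Claim:
    """Hypothetical record for one vendor capability claim."""
    model: str
    claimed_on: date
    audits_passed: Set[str] = field(default_factory=set)
    verified_by: Optional[str] = None
    verified_on: Optional[date] = None


def is_clean_trigger(claim: Claim) -> bool:
    """Return True only if all six audits pass and verification is timely."""
    if not set(AUDITS) <= claim.audits_passed:
        return False
    if claim.verified_by not in VERIFIERS or claim.verified_on is None:
        return False
    elapsed = (claim.verified_on - claim.claimed_on).days
    return 0 <= elapsed <= VERIFICATION_WINDOW_DAYS
```

A claim that passes five audits, or that is verified after the window closes, is not a clean trigger under this sketch.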

## Why this ledger exists

VLM capability claims drive multimodal product deployment, agentic robotic
control, and safety-evaluation infrastructure. The **MMMU-Pro 2024-Q4 audit**
shows frontier VLMs dropping 14–23pp absolute relative to MMMU-v1;
**Tong-Du-Liang Eyes Wide Shut 2024** finds that VLMs match captions but fail
visual-detail probes (in the CLIP-blind-pair audit, 25–40% of
"vision-grounded" answers are caption-only; BLINK shows a 47pp gap between
visual and language reasoning). The half-life of vendor claims on VLM
benchmarks is a close analogue of the text-only 73-day half-life.

## Empty-space hypothesis (predeclared)

We predict that **no 2024–2026 paper triggers Bills 4, 7, or 10 cleanly**:

- **Bill 4 ★** — Causally-faithful vision-grounding mechanism survives
  intervention experiments. Direct cousin to Mech Interp Bill 11 ★ + Reasoning
  Bill 6 ★ + Scaling Laws Bill 5 ★. Eyes Wide Shut + CLIP-blind-pair audits
  predict 0 clean triggers.
- **Bill 7 ★** — Cross-benchmark generalization (MMMU ↔ MMVet ↔ BLINK ↔
  MathVista). The Yue-Bommasani 2024-Q4 MMMU-Pro audit predicts pairwise
  correlations of r = 0.31–0.51, well below the r ≥ 0.95 clean-transfer
  threshold.
- **Bill 10 ★** — Universal vision-task coverage (chart + text + spatial +
  counting + temporal). Anand-Goyal 2025 is predicted to find that 0/9
  frontier VLMs pass all 5 sub-tasks; counting, spatial, and temporal-video
  are systematic gaps.
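
The Bill 7 and Bill 10 criteria above lend themselves to a mechanical check. The sketch below is one possible reading, under stated assumptions: Bill 7 triggers only if every benchmark pair reaches Pearson r ≥ 0.95 across models, and Bill 10 triggers if at least one model passes all five sub-tasks. The function names, the input shapes, and the per-sub-task pass/fail encoding are hypothetical; the ledger does not specify them.

```python
from itertools import combinations
from math import sqrt

CLEAN_TRANSFER_R = 0.95  # clean-transfer threshold stated for Bill 7
SUBTASKS = ("chart", "text", "spatial", "counting", "temporal")  # Bill 10


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bill7_triggers(scores):
    """scores maps benchmark name -> per-model scores, in the same model order.

    Assumed reading: every benchmark pair must reach the threshold.
    """
    return all(
        pearson(scores[a], scores[b]) >= CLEAN_TRANSFER_R
        for a, b in combinations(scores, 2)
    )


def bill10_triggers(passed):
    """passed maps model name -> set of sub-tasks that model passes.

    Assumed reading: one model covering all five sub-tasks is a trigger.
    """
    return any(set(SUBTASKS) <= tasks for tasks in passed.values())
```

Under this reading, pairwise correlations in the predicted r = 0.31–0.51 range leave Bill 7 untriggered, and a model missing any one of counting, spatial, or temporal leaves Bill 10 untriggered.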

## Authorship

Kevin Russell (Project 42).
Pre-publication draft, do not cite without permission.

## Status

Stage 1 (SCOPE) — purpose written. Target v0.2 lock 2026-Q3.
