# Vision-Language Capability Ledger — Bills Draft (v0.1)

> Stage 2 (BILLS) artifact. Pre-sweep. **12 candidate bills + 6 meta-costs +
> 3 escape gates**, with **3 bills marked ★ predicted-empty** (4, 7, 10).

## The twelve bills

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | **Image-search / web-snapshot contamination** | Test images verified absent from the training-data snapshot (perceptual-hash and CLIP-embedding overlap checks). | |
| 2 | **OCR-extracted-text leakage** | ScienceQA / DocVQA / TextVQA / ChartQA results shown free of OCR-corpus contamination. | |
| 3 | **Vision-tokenizer-format brittleness** | Image-resolution / patch-size / aspect-ratio ablation. 9–21pp absolute accuracy spread with the model held fixed. | |
| 4 | **★ Causally-faithful vision-grounding mechanism** | VLM trace generates (not describes) the answer under intervention experiments. Tong-Du-Liang Eyes Wide Shut: 25–40% caption-only answers; BLINK 47pp visual-vs-language gap. | ★ |
| 5 | **Cross-VLM-architecture portability** | Same capability on ≥2 of {CLIP-ViT, SigLIP, Pixtral-conv, InternVL-MoE, Qwen-VL}. | |
| 6 | **Tool-augmented-vision decomposition** | MathVista / ChartQA solves separated from Wolfram/Python tool-pipeline. | |
| 7 | **★ Cross-benchmark generalization (MMMU ↔ MMVet ↔ BLINK ↔ MathVista)** | Vendor transfer claim with r ≥ 0.95 across ≥3 benchmarks. Yue-Bommasani 2024-Q4 MMMU-Pro: 14–23pp drop vs MMMU-v1. | ★ |
| 8 | **Multi-image / video / interleaved generalization** | Multi-image / video / interleaved benchmark with ≤5pp drop vs single-image baseline. | |
| 9 | **Vendor-self-eval independence** | Vendor VLM result reproduced by METR / Apollo / AISI / Stanford CRFM HELM-vision. | |
| 10 | **★ Universal vision-task coverage** | Frontier VLM passes all 5 sub-tasks {chart, text, spatial, counting, temporal-video}. Anand-Goyal 2025: 0/9 unified pass. | ★ |
| 11 | **Anti-saturation construction** | Held-out by design (BLINK), held-out re-eval (MMMU-Pro), iterative reframing. | |
| 12 | **Distilled-cousin / open-weight VLM audit** | Frontier-VLM teacher result reproduced on its open-weight distilled cousin. | |
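
Several bills above state numeric pass criteria (bills 3, 5, 7, 8, 10). A minimal sketch of how those checks might be encoded follows. All function names are hypothetical, and the reading of bill 7 as "every benchmark pair correlates at r ≥ 0.95" is an assumption, since the table does not say how the correlation is aggregated.

```python
from itertools import combinations
from math import sqrt

# Architecture families listed in bill 5 and sub-tasks listed in bill 10.
ARCHITECTURES = {"CLIP-ViT", "SigLIP", "Pixtral-conv", "InternVL-MoE", "Qwen-VL"}
SUBTASKS = {"chart", "text", "spatial", "counting", "temporal-video"}


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant scores carry no correlation evidence
    return cov / (sx * sy)


def spread_pp(acc_by_setting):
    """Bill 3: max-minus-min accuracy (pp) across input settings for one model."""
    vals = list(acc_by_setting.values())
    return max(vals) - min(vals)


def pays_bill_5(archs_with_capability):
    """Bill 5: capability shown on at least 2 of the listed architectures."""
    return len(set(archs_with_capability) & ARCHITECTURES) >= 2


def pays_bill_7(scores_by_benchmark, r_min=0.95):
    """Bill 7 (assumed reading): >=3 benchmarks, every pair at r >= r_min."""
    if len(scores_by_benchmark) < 3:
        return False
    return all(
        pearson_r(scores_by_benchmark[a], scores_by_benchmark[b]) >= r_min
        for a, b in combinations(scores_by_benchmark, 2)
    )


def pays_bill_8(single_image_acc, multi_image_acc, max_drop_pp=5.0):
    """Bill 8: multi-image accuracy within max_drop_pp of the single-image baseline."""
    return single_image_acc - multi_image_acc <= max_drop_pp


def pays_bill_10(passed_subtasks):
    """Bill 10: all five listed sub-tasks passed."""
    return SUBTASKS <= set(passed_subtasks)
```

Each score list passed to `pays_bill_7` is assumed to hold one accuracy per model, in the same model order across benchmarks.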

## Six meta-costs

| # | Meta-cost | Description |
|---|---|---|
| M1 | **Pre-frontier VLM** | Results only on pre-frontier models (≤8B), e.g. LLaVA-1.5 / MiniGPT-4. |
| M2 | **Single-resolution-only** | One fixed input resolution. |
| M3 | **Single-vision-encoder-only** | CLIP-only / SigLIP-only. |
| M4 | **Frame-extracted-only** | Multi-image/video reduced to single frame. |
| M5 | **Image-only / no-OCR-audit** | No OCR-leakage assessment. |
| M6 | **Implementation-specific** | Specific encoder / fusion-scaffold / tool-pipeline required. |

## Three escape gates

1. **G1 (Methodology):** proposes a new vision-encoder / fusion-mechanism / benchmark on a toy setting.
2. **G2 (Negative-result / rebuttal):** reports a closure failure on a prior VLM claim.
3. **G3 (Theoretical-construction):** contributes a vision-language theoretical analysis.

## Iteration plan

- **Batch 1 (8 sweeps, target ≥350 papers):**
  - sweep_301: VLM vendor cards (GPT-4V, Claude Vision, Gemini Vision, Llama Vision, Pixtral, InternVL, Qwen-VL)
  - sweep_302: MMMU / MMMU-Pro / BLINK / MMVet / MathVista construction + audits
  - sweep_303: Image-search contamination + OCR-leakage audits
  - sweep_304: Multi-image / video / interleaved benchmarks
  - sweep_305: Eyes Wide Shut / CLIP-blind-pair / vision-grounding mechanism
  - sweep_306: Cross-VLM-architecture portability (CLIP-ViT / SigLIP / Pixtral / InternVL)
  - sweep_307: Tool-augmented vision (MathVista Wolfram/Python)
  - sweep_308: VLM negative-results / rebuttals + Anand-Goyal unified audit
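
The image-search contamination audit targeted by sweep_303 (bill 1) relies on perceptual-hash overlap. A minimal sketch of one such check follows, using an average hash over a small grayscale grid. It assumes the image has already been converted to grayscale and downsampled (for example to 8×8) by some image library. The `max_bits` threshold and all function names are illustrative assumptions, and the CLIP-embedding half of the check is not shown.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel exceeds the grid mean.

    `gray` is a 2D list of grayscale values, assumed already downsampled.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(test_gray, train_hashes, max_bits=5):
    """Flag a test image whose hash is within max_bits of any training hash.

    max_bits=5 is an assumed threshold, not a value taken from the ledger.
    """
    h = average_hash(test_gray)
    return any(hamming(h, t) <= max_bits for t in train_hashes)


# Usage: a gradient image, a uniformly brightened copy, and an inverted copy.
gray = [[r * 8 + c for c in range(8)] for r in range(8)]
brighter = [[p + 10 for p in row] for row in gray]
inverted = [[63 - p for p in row] for row in gray]
train_hashes = [average_hash(gray)]
```

With these inputs, `is_near_duplicate(brighter, train_hashes)` flags the brightened copy, while the inverted image is not flagged. That second case illustrates why a hash check alone can miss transformed duplicates and is paired with an embedding-overlap check in bill 1.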

## Status

Stage 2 (BILLS) — bills_draft.md complete. Next: dispatch 8 parallel sweep
agents.
