The Twelve-Bill Closure Pattern for Frontier RAG / Retrieval-Augmented Generation Claims (2024–2026)

A real-data falsification harness for 2024–2026 frontier RAG capability claims, built on a 247-paper corpus (Stage 3.5 verification pending), with three signature-empty bills and the sweep 1006 discovery of the B7 → B8 commercialization-vs-research RESCOPING, which became the new emergent bridge in cross_ledger_bridges.
Kevin Russell · ProjectForty2 · 2026-05-15 · Public draft v0.2.

Abstract

The RAG / Retrieval Ledger audits 247 unique papers across 8 deep-loop sweeps (2024–2026) on frontier RAG capability claims — Anthropic Claude with retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro / Sonar, LangChain agents, LlamaIndex frameworks, Haystack, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent, DeepSeek with retrieval. The harness consists of 12 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).

Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (causally-faithful citation mechanism — retrieved citation causally generates the answer per intervention experiments), Bill 8 ★ (cross-corpus generalization — RAG system trained on one corpus transfers to ≥2 distinct held-out corpora with ≤15pp drop), and Bill 11 ★ (universal RAG-task coverage — frontier RAG passes all 5 sub-tasks above clean threshold).

Sweep 1006 (Chinese RAG offerings) produced an unexpected RESCOPING discovery. Bill 7 was originally posed as a Western-vs-Chinese geopolitical-framework-balance audit, but the dominant cluster boundary is actually commercialization-vs-research: open-source frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB) on one side and closed-cloud vendor products (Anthropic retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent) on the other. Western and Chinese systems cluster together within each side; the clusters do not split geopolitically. This RESCOPING produced the new B8 (commercialization-vs-research-artifact axis) emergent bridge in cross_ledger_bridges, which has current batch-3 support across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12) and hardware_inference (the clearest current open-vs-closed signal in the 19-ledger atlas).

After batch 1 (247 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5. Bill 5 has 21 candidates and 0 clean triggers (the ALCE / FacTool / RAGAS / Citation-Bench audit infrastructure is observational, not interventional); Bill 8 has 18 candidates and 0 clean triggers (vendor systems report training-corpus performance; the cross-corpus drop is ≥20pp where independently measured); Bill 11 has 24 candidates and 0 clean triggers (vendor systems excel at 1–3 sub-tasks and degrade on the rest). Ledger v0.2 is LOCKED, with the Stage 3.5 verification queue pending.

12+6 · Bills + meta-costs

3★ · Empty bills

247 · Unique papers

B7→B8 · RESCOPED

§01 Threat model + bills (predeclared in v0.1)

Threat model (verbatim from purpose.md): Demonstrate a frontier RAG (Retrieval-Augmented Generation) capability claim — on a system marketing retrieval-augmented frontier reasoning — that survives six closure audits on the 2024–2026 corpus: (1) retrieval-vs-generation decoupling, (2) citation-faithfulness audit, (3) retrieval-corpus-contamination, (4) query-decomposition audit, (5) multi-document synthesis generalization, (6) held-out-corpus construction (BEIR refresh / MTEB rolling).

#	Bill (v0.1)	What gets paid	Star
1	Retrieval-vs-generation decoupling	Retrieval-only score, generation-only score, joint score reported separately
2	Citation-faithfulness audit	Cited passages causally support generated answer (not post-hoc-attached)
3	Retrieval-corpus-contamination	Test queries absent from corpus before training cutoff
4	Query-decomposition audit	Multi-hop queries decomposed and verified at each hop
5	Causally-faithful citation mechanism	Retrieved citation causally generates answer per intervention experiments	★
6	Multi-document synthesis	≥3 distinct sources synthesized; contradiction-handling explicit
7	Geopolitical → commercialization RESCOPED	Originally Western-vs-Chinese; rescoped to commercialization-vs-research after sweep 1006
8	Cross-corpus generalization	Trained-on-one-corpus transfers to ≥2 distinct held-out corpora with ≤15pp drop	★
9	Held-out post-2024 retrieval benchmark	BEIR / MTEB / RAGAS rolling refresh; queries unanswerable by training cutoff
10	Vendor-self-eval independence	Stanford CRFM HELM-RAG, MTEB, BEIR-Plus reproduction
11	Universal RAG-task coverage	5 sub-tasks: single-doc QA + multi-doc synthesis + contradiction + citation precision + factual recall	★
12	Geopolitical-framework-balance audit	Open-source (LangChain, LlamaIndex) vs closed Chinese vendor (Tongyi-RAG, Qwen-Agent). B7 bridge test seeded the rescoping

Six meta-costs (M1–M6): M1 (pre-2024), M2 (single-corpus), M3 (closed-vendor-only — no API access for evaluation), M4 (single-query-type), M5 (no-citation-eval — observational not intervention-verified), M6 (implementation-specific).

Three escape gates: G1 (methodology), G2 (negative-result / rebuttal), G3 (theoretical-construction).
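The bookkeeping described above (12 bills, six meta-costs, three escape gates) can be sketched as a small data model. This is a minimal illustration, not the ledger's actual classifier: the `Candidate` class, the function names, and the rule that a candidate is a clean trigger only when it pays no meta-cost and takes no escape gate are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional

# Meta-cost and gate labels as listed in the ledger text.
META_COSTS = {
    "M1": "pre-2024",
    "M2": "single-corpus",
    "M3": "closed-vendor-only",
    "M4": "single-query-type",
    "M5": "no-citation-eval",
    "M6": "implementation-specific",
}
ESCAPE_GATES = {
    "G1": "methodology",
    "G2": "negative-result / rebuttal",
    "G3": "theoretical-construction",
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one paper proposed against one bill."""
    paper_id: str
    bill: int
    meta_costs: FrozenSet[str] = frozenset()
    gate: Optional[str] = None

    def __post_init__(self) -> None:
        if not 1 <= self.bill <= 12:
            raise ValueError(f"bill must be 1..12, got {self.bill}")
        unknown = set(self.meta_costs) - set(META_COSTS)
        if unknown:
            raise ValueError(f"unknown meta-costs: {sorted(unknown)}")
        if self.gate is not None and self.gate not in ESCAPE_GATES:
            raise ValueError(f"unknown escape gate: {self.gate}")


def is_clean_trigger(candidate: Candidate) -> bool:
    # Assumed rule: clean means no meta-cost paid and no escape gate taken.
    return not candidate.meta_costs and candidate.gate is None


def clean_trigger_counts(candidates: Iterable[Candidate]) -> Dict[int, int]:
    """Count clean triggers per bill; bills with none are omitted."""
    return dict(Counter(c.bill for c in candidates if is_clean_trigger(c)))
```

Under this reading, a ★ bill "HOLDS empty" when `clean_trigger_counts` has no entry for it, however many candidates were examined.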

Sweep 1006 RESCOPING DISCOVERY · the new B7/B8 emergent bridge

Sweep 1006 pointed Opus research agents at the literature on Chinese RAG offerings (Tongyi-RAG, Qwen-Agent, DeepSeek-with-retrieval) to test the original B7 prediction (Western open-source frameworks dominate; Chinese vendors don't ship comparable frameworks). The geopolitical framing was wrong. The dominant cluster boundary is commercialization-vs-research: open-source (LangChain, LlamaIndex, Haystack, DSPy) on one side, closed-cloud vendor products (Anthropic, OpenAI Assistants v2, Google NotebookLM, Perplexity, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent) on the other. Western and Chinese systems cluster together within each side; the clusters do not split geopolitically.

This RESCOPING produced the new B8 (commercialization-vs-research-artifact axis) emergent bridge in cross_ledger_bridges, which has current batch-3 support across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12 BIPOLAR signal) and hardware_inference (strong open-vs-closed separation in the current cross-ledger atlas). The RAG ledger seeded a structural finding that is now evidence-bearing in the cross-ledger bridge atlas.

§02 Findings: three ★ bills HOLD empty pre-Stage-3.5

★ Bill 5 — Causally-faithful citation mechanism · 21 candidates · 0 clean

Bill 5 ★ requires a RAG system whose retrieved citation causally generates the answer per intervention experiments — not post-hoc-attached. ALCE / FacTool / RAGAS / Citation-Bench post-hoc audit infrastructure dominates the candidate pool. Of 21 batch-1 candidates, 0 clean triggers — every candidate either pays M5 (no-citation-eval — citation reported but not intervention-verified) or M3 (closed-vendor-only — no API access for intervention experiments). The closure requires intervention experiments showing the cited passage causally generates the answer; current audits are observational. Direct cousin to LLM / Reasoning / VLM / Bio causal-mechanism bridges in the cross_ledger_bridges meta-ledger — Bill 5 here is a node in cross_ledger_bridges Bridge 1 (causally-faithful mechanism empty across LLM-centric domains).
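The difference between an observational audit and an intervention experiment can be made concrete with a passage-ablation check. The sketch below is a toy under stated assumptions: `intervention_check`, the two generator functions, and the exact-string comparison of answers are hypothetical stand-ins, not the ledger's protocol or any benchmark's API. A real experiment would need a semantic comparison of answers and API access to the system under test.

```python
from typing import Callable, Dict, List, Sequence

# A generator maps (query, retrieved passages) to an answer string.
Generator = Callable[[str, Sequence[str]], str]


def ablate(passages: Sequence[str], index: int) -> List[str]:
    """Return the passages with the one at `index` removed."""
    return [p for i, p in enumerate(passages) if i != index]


def intervention_check(
    generate: Generator, query: str, passages: Sequence[str], cited_index: int
) -> Dict[str, bool]:
    """Remove passages one at a time and observe whether the answer moves."""
    baseline = generate(query, passages)
    without_cited = generate(query, ablate(passages, cited_index))
    controls = [
        generate(query, ablate(passages, i))
        for i in range(len(passages))
        if i != cited_index
    ]
    return {
        # Causal citation: removing the cited passage should change the answer.
        "cited_removal_changes_answer": without_cited != baseline,
        # Control: removing uncited passages should leave the answer intact.
        "control_removals_preserve_answer": all(a == baseline for a in controls),
    }


def grounded_generator(query: str, passages: Sequence[str]) -> str:
    """Toy system that copies its answer from a retrieved passage."""
    for passage in passages:
        if "capital" in passage:
            return passage
    return "unknown"


def post_hoc_generator(query: str, passages: Sequence[str]) -> str:
    """Toy system that answers from memory and attaches a citation afterwards."""
    return "Paris is the capital of France."


PASSAGES = [
    "The Seine flows through Paris.",
    "Paris is the capital of France.",
    "France borders Spain.",
]
grounded = intervention_check(grounded_generator, "Capital of France?", PASSAGES, 1)
post_hoc = intervention_check(post_hoc_generator, "Capital of France?", PASSAGES, 1)
```

Both toy systems would look identical to an observational audit, since each returns a correct answer alongside a supporting passage. Only the ablation separates them: `grounded` reports that removing the cited passage changes the answer, while `post_hoc` reports that it does not.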

★ Bill 8 — Cross-corpus generalization · 18 candidates · 0 clean

Bill 8 ★ requires a RAG system trained on one corpus to transfer to ≥2 distinct held-out corpora with ≤15pp drop on the BEIR / MTEB / RAGAS rolling-refresh held-out benchmarks. Of 18 batch-1 candidates, 0 clean triggers — every candidate pays M2 (single-corpus) or M3 (closed-vendor-only) on the cross-corpus axis. Vendor systems report performance on training-corpus distribution; cross-corpus drop is consistently ≥20pp where independently measured. The closure remains open: independent third-party evaluation infrastructure for cross-corpus reproduction (HELM-RAG / Stanford CRFM / METR / AISI) is the precondition for a clean trigger.
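The Bill 8 criterion is simple arithmetic once scores exist, and a short sketch makes the threshold explicit. The function name, the score scale (percentage points, 0–100), and the reading that at least two held-out corpora must each stay within the 15pp budget are assumptions for illustration; the corpus names in the usage note are placeholders.

```python
from typing import Mapping


def passes_bill8(
    in_corpus_score: float,
    held_out_scores: Mapping[str, float],
    max_drop_pp: float = 15.0,
    min_corpora: int = 2,
) -> bool:
    """Check transfer to >= min_corpora held-out corpora within max_drop_pp.

    Scores are assumed to be percentages, so differences are percentage points.
    """
    within_budget = [
        name
        for name, score in held_out_scores.items()
        if in_corpus_score - score <= max_drop_pp
    ]
    return len(within_budget) >= min_corpora


def worst_drop_pp(in_corpus_score: float, held_out_scores: Mapping[str, float]) -> float:
    """Largest drop in percentage points across the held-out corpora."""
    return max(in_corpus_score - score for score in held_out_scores.values())
```

For example, `passes_bill8(80.0, {"corpus_a": 70.0, "corpus_b": 66.0})` holds (drops of 10pp and 14pp), while `passes_bill8(80.0, {"corpus_a": 70.0, "corpus_b": 58.0})` does not, because only one corpus stays within budget. A drop of 20pp or more, as reported where independent measurements exist, fails for that corpus.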

★ Bill 11 — Universal RAG-task coverage · 24 candidates · 0 clean

Bill 11 ★ requires a frontier RAG system to clear all 5 sub-tasks {single-doc QA + multi-doc synthesis + contradiction-resolution + citation-precision + factual-recall-at-scale} above clean threshold without per-task tuning. Of 24 batch-1 candidates, 0 are clean triggers. Vendor systems excel at 1–3 sub-tasks (typically single-doc QA + multi-doc synthesis) and degrade on contradiction-resolution / citation-precision / factual-recall-at-scale. M3 (closed-vendor-only) and M5 (no-citation-eval) are the meta-costs paid across the column. The structural barrier: per-task balanced training is hard to ship without sacrificing per-task peak performance.
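The all-five-sub-tasks requirement amounts to a conjunction over per-task scores, sketched below. The text does not state the numeric clean threshold, so `threshold` is left as a required parameter; the function names and the convention that a score exactly at threshold passes are assumptions.

```python
from typing import List, Mapping

# The five sub-tasks named in Bill 11.
SUB_TASKS = (
    "single-doc QA",
    "multi-doc synthesis",
    "contradiction-resolution",
    "citation-precision",
    "factual-recall-at-scale",
)


def failed_sub_tasks(scores: Mapping[str, float], threshold: float) -> List[str]:
    """Return the sub-tasks scoring below threshold, in SUB_TASKS order."""
    missing = [task for task in SUB_TASKS if task not in scores]
    if missing:
        # An unreported sub-task cannot count as cleared.
        raise ValueError(f"no score reported for: {missing}")
    return [task for task in SUB_TASKS if scores[task] < threshold]


def passes_bill11(scores: Mapping[str, float], threshold: float) -> bool:
    """True only when every one of the five sub-tasks clears the threshold."""
    return not failed_sub_tasks(scores, threshold)
```

A system that is strong on single-doc QA and multi-doc synthesis but weak on the other three yields a three-item failure list, which is the pattern the batch-1 candidates show.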

Bill 5 = Bridge 1 node · cross-ledger coupling evidence-bearing

Bill 5 (causally-faithful citation) is a node in cross_ledger_bridges Bridge 1 (causally-faithful mechanism empty across 11+ LLM-centric domains). The 21 candidates pay M5 (observational not intervention-verified) — the same pattern observed in mech_interp Bill 13 (causal abstraction), reasoning_cot Bill 14 (faithful CoT), and other LLM-centric ledgers. The grounded-reward exception (B9, scientific_discovery autonomous-lab subset) does NOT extend to RAG citation faithfulness — the intervention experiments are not grounded in physical reward.

§03 Verification methodology: v0.2 LOCKED, Stage 3.5 queue pending

Ledger v0.2 was LOCKED before the cross-ledger Stage 3.5 verification rule was established (2026-05-15). The lock criteria (classifier benchmark + watch-list + falsification protocol + author-activity panel + boxed declaration) were satisfied at lock time. Independent arXiv-ID + abstract verification of the 247-paper corpus, starting with the priority ★-bill candidate pool, has not yet been done.

Stage 3.5 verification queue · 2026-05-15

The "★ HOLDS empty" claim is provisional pending Stage 3.5. The queue holds a priority pool of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spot checks, all dispatched against the arXiv API. The methodology lesson from sibling ledgers applies: Robotics_Embodied 9/9 hallucinated, RL-from-Rewards 60% on flagged IDs, Spacetime_Discreteness priority-pool source-ID failures. Expected outcome here: a non-trivial hallucination rate that requires charitable substitution from cited author corpora.
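A first pass over the verification queue can reject malformed identifiers before any network call. The sketch below checks arXiv identifier syntax and builds a request URL for the public arXiv API `id_list` query. The function names are hypothetical, the IDs in the usage note are placeholders rather than corpus entries, and a well-formed ID proves nothing about existence: the returned Atom feed still has to be fetched and its title and abstract compared against the claimed paper.

```python
import re
from typing import Sequence
from urllib.parse import urlencode

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version suffix (e.g. v2).
NEW_STYLE_ID = re.compile(r"^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive[.SC]/YYMMNNN, optional version suffix.
OLD_STYLE_ID = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")

ARXIV_API = "http://export.arxiv.org/api/query"


def is_wellformed_arxiv_id(identifier: str) -> bool:
    """Syntax check only; does not confirm that the paper exists."""
    return bool(NEW_STYLE_ID.match(identifier) or OLD_STYLE_ID.match(identifier))


def arxiv_query_url(identifiers: Sequence[str]) -> str:
    """Build one API request covering every well-formed ID in the batch."""
    valid = [i for i in identifiers if is_wellformed_arxiv_id(i)]
    if not valid:
        raise ValueError("no well-formed arXiv IDs in batch")
    params = {"id_list": ",".join(valid), "max_results": len(valid)}
    # Keep commas and slashes literal so the ID list stays readable.
    return f"{ARXIV_API}?{urlencode(params, safe=',/')}"
```

For instance, `is_wellformed_arxiv_id("2413.00001")` is false because month 13 does not exist, which is the kind of cheap signal that flags a hallucinated ID before the abstract check.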

The empty-space hypothesis is less sensitive to typical source-ID errors because the closure mechanisms are structural (causally-faithful citation requires intervention experiments, cross-corpus generalization requires multi-corpus evaluation infrastructure, universal RAG-task coverage requires per-task balanced training). The Bill 7 RESCOPING discovery is similarly an architectural finding from sweep 1006 and is less dependent on individual source-ID verification.

Cross-ledger methodology rule · v2026-05-15

The rule requires independent arXiv-ID + abstract verification before any breach commitment. It rests on three complementary methodological tools: (1) classifier rule-tightening, (2) sweep-time arXiv verification, (3) Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection. This ledger was LOCKED at v0.2 before the rule was established and is being re-audited.

§04 Lock conditions and queue

v0.2 LOCKED status (2026-05-15): classifier benchmark satisfied (✓); watch-list ≥30 entries with cadences (✓); falsification protocol ≥10 conditions, including F5, F8, F11, F-Bill7-rescope, F-Citation-intervention, and F-Independence (✓); author-activity panel mapped: Lewis, Karpukhin, Khattab, Izacard, Lin, Yang-Zhilin, Asai, Borgeaud, Glass, Manakul, Niu, Es, Anthropic, LangChain-team, LlamaIndex-team, DSPy (✓); boxed declaration (✓). The Stage 3.5 verification queue is pending, with a re-audit under the v2026-05-15 cross-ledger rule.

Bill 7 RESCOPING propagation: the RAG sweep 1006 finding seeded the new B7/B8 commercialization-vs-research-artifact axis emergent bridge in cross_ledger_bridges. Batch 3 inheritance (sweep_708 2026-05-14) confirmed the B8 emergent bridge cleanly across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12 BIPOLAR) and hardware_inference (strong 0/N + B7/B8 separation in the current cross-ledger atlas). The RAG ledger is the SEED point for what is now an evidence-bearing cross-ledger bridge.

Monthly cadence: vendor system-card revisions + open-source RAG framework releases. Quarterly: HELM-RAG / Stanford CRFM / BEIR / MTEB rolling refresh + ALCE / FacTool / RAGAS / Citation-Bench updates. Triggered: any HELM-RAG publication of independent reproductions across all 5 sub-tasks (would close Bill 10 and create the precondition for an F11 trigger); any intervention-experiment infrastructure published at NeurIPS / ICML / ICLR 2026–2027 enabling causal verification of RAG citation faithfulness at scale.

Bridge-test status · cross_ledger_bridges B7 RESCOPED + B8 emergent (validated)

This ledger SEEDED the B7 → B8 RESCOPING via sweep 1006. The new B8 (commercialization-vs-research-artifact axis) emergent bridge is currently supported in:

multimodal_generation (377 papers): 78 closed-cloud Bill 9 vs 74 open-source Bill 12 — strong BIPOLAR signal observed.
hardware_inference (291 papers): vLLM/SGLang/llama.cpp/MLX (open) vs Groq/Cerebras/SambaNova/Etched (closed), separable along the cluster boundary under the current annotations in the cross-ledger atlas.

Falsification of B8 here would require a 2024–2026 RAG framework that crosses the commercialization-vs-research cluster boundary cleanly (e.g., a closed-cloud vendor open-sourcing core RAG primitives or an open-source framework reaching closed-cloud-product capability parity).