The RAG / Retrieval Ledger audits 247 unique papers across 8 deep-loop sweeps (2024–2026) on frontier RAG capability claims — Anthropic Claude with retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro / Sonar, LangChain agents, LlamaIndex frameworks, Haystack, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent, DeepSeek with retrieval. The harness consists of 12 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).
Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (causally-faithful citation mechanism — retrieved citation causally generates the answer per intervention experiments), Bill 8 ★ (cross-corpus generalization — RAG system trained on one corpus transfers to ≥2 distinct held-out corpora with ≤15pp drop), and Bill 11 ★ (universal RAG-task coverage — frontier RAG passes all 5 sub-tasks above clean threshold).
Sweep 1006 (Chinese RAG offerings) produced an unexpected RESCOPING discovery: Bill 7 was originally posed as a Western-vs-Chinese geopolitical-framework-balance audit; the dominant cluster boundary is actually commercialization-vs-research. Open-source frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB) on one side; closed-cloud vendor products (Anthropic retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent) on the other. Western and Chinese cluster within each side; the cluster does not split geopolitically. This RESCOPING produced the new B8 (commercialization-vs-research-artifact axis) emergent bridge in cross_ledger_bridges, which has current batch-3 support across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12) and hardware_inference (clearest current open-vs-closed signal in 19-ledger atlas).
After batch 1 (247 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5. Bill 5 has 21 candidates and 0 clean triggers (ALCE / FacTool / RAGAS / Citation-Bench audit infrastructure is observational, not intervention); Bill 8 has 18 candidates and 0 clean triggers (vendor systems report training-corpus performance; cross-corpus drop ≥20pp where independently measured); Bill 11 has 24 candidates and 0 clean triggers (vendor systems excel at 1–3 sub-tasks and degrade on the rest). The ledger v0.2 LOCKED with Stage 3.5 verification queue pending.
Threat model (verbatim from purpose.md): Demonstrate a frontier RAG (Retrieval-Augmented Generation) capability claim — on a system marketing retrieval-augmented frontier reasoning — that survives six closure audits on the 2024–2026 corpus: (1) retrieval-vs-generation decoupling, (2) citation-faithfulness audit, (3) retrieval-corpus-contamination, (4) query-decomposition audit, (5) multi-document synthesis generalization, (6) held-out-corpus construction (BEIR refresh / MTEB rolling).
| # | Bill (v0.1) | What gets paid | Star |
|---|---|---|---|
| 1 | Retrieval-vs-generation decoupling | Retrieval-only score, generation-only score, joint score reported separately | |
| 2 | Citation-faithfulness audit | Cited passages causally support generated answer (not post-hoc-attached) | |
| 3 | Retrieval-corpus-contamination | Test queries absent from corpus before training cutoff | |
| 4 | Query-decomposition audit | Multi-hop queries decomposed and verified at each hop | |
| 5 | Causally-faithful citation mechanism | Retrieved citation causally generates answer per intervention experiments | ★ |
| 6 | Multi-document synthesis | ≥3 distinct sources synthesized; contradiction-handling explicit | |
| 7 | Geopolitical → commercialization RESCOPED | Originally Western-vs-Chinese; rescoped to commercialization-vs-research after sweep 1006 | |
| 8 | Cross-corpus generalization | Trained-on-one-corpus transfers to ≥2 distinct held-out corpora with ≤15pp drop | ★ |
| 9 | Held-out post-2024 retrieval benchmark | BEIR / MTEB / RAGAS rolling refresh; queries unanswerable by training cutoff | |
| 10 | Vendor-self-eval independence | Stanford CRFM HELM-RAG, MTEB, BEIR-Plus reproduction | |
| 11 | Universal RAG-task coverage | 5 sub-tasks: single-doc QA + multi-doc synthesis + contradiction + citation precision + factual recall | ★ |
| 12 | Geopolitical-framework-balance audit | Open-source (LangChain, LlamaIndex) vs closed Chinese vendor (Tongyi-RAG, Qwen-Agent). B7 bridge test seeded the rescoping | |
Six meta-costs (M1–M6): M1 (pre-2024), M2 (single-corpus), M3 (closed-vendor-only — no API access for evaluation), M4 (single-query-type), M5 (no-citation-eval — observational not intervention-verified), M6 (implementation-specific).
Three escape gates: G1 (methodology), G2 (negative-result / rebuttal), G3 (theoretical-construction).
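One way to read the meta-costs is as flags attached to each candidate paper: a candidate counts as a clean trigger only if it pays none of M1–M6. The sketch below encodes that reading. The names `Candidate` and `classify` are illustrative and not part of the ledger tooling, and the rule that a gated paper is routed out of the trigger count is an assumption about how the escape gates are applied.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class MetaCost(Enum):
    M1 = "pre-2024"
    M2 = "single-corpus"
    M3 = "closed-vendor-only"
    M4 = "single-query-type"
    M5 = "no-citation-eval"
    M6 = "implementation-specific"


class EscapeGate(Enum):
    G1 = "methodology"
    G2 = "negative-result / rebuttal"
    G3 = "theoretical-construction"


@dataclass
class Candidate:
    paper_id: str                      # hypothetical identifier for the paper
    bill: int                          # bill number (1-12) the paper is a candidate for
    costs: Set[MetaCost] = field(default_factory=set)
    gate: Optional[EscapeGate] = None  # set when the paper exits via an escape gate


def classify(candidate: Candidate) -> str:
    """Return 'escaped:<gate>', 'clean', or 'pays:<costs>' for one candidate."""
    # Assumption: a gated paper is not counted as a trigger at all.
    if candidate.gate is not None:
        return f"escaped:{candidate.gate.name}"
    # A clean trigger pays no meta-cost.
    if not candidate.costs:
        return "clean"
    return "pays:" + ",".join(sorted(cost.name for cost in candidate.costs))
```

Under this reading, a Bill 5 candidate flagged with both M3 and M5 classifies as `pays:M3,M5`, which matches how the ★-bill pools are described in this ledger.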
Sweep 1006 dispatched Opus research agents at the Chinese RAG offering literature (Tongyi-RAG, Qwen-Agent, DeepSeek-with-retrieval) to test the original B7 prediction (Western open-source frameworks dominate; Chinese vendors don't ship comparable frameworks). The geopolitical framing was wrong. The dominant cluster boundary is commercialization-vs-research: open-source (LangChain, LlamaIndex, Haystack, DSPy) on one side, closed-cloud vendor products (Anthropic, OpenAI Assistants v2, Google NotebookLM, Perplexity, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent) on the other. Western and Chinese cluster within each side; the cluster does not split geopolitically.
This RESCOPING produced the new B8 (commercialization-vs-research-artifact axis) emergent bridge in cross_ledger_bridges, which has current batch-3 support across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12 BIPOLAR signal) and hardware_inference (strong open-vs-closed separation in the current cross-ledger atlas). The RAG ledger seeded a structural finding that is now evidence-bearing in the cross-ledger bridge atlas.
Bill 5 ★ requires a RAG system whose retrieved citation causally generates the answer per intervention experiments — not post-hoc-attached. ALCE / FacTool / RAGAS / Citation-Bench post-hoc audit infrastructure dominates the candidate pool. Of 21 batch-1 candidates, 0 clean triggers — every candidate either pays M5 (no-citation-eval — citation reported but not intervention-verified) or M3 (closed-vendor-only — no API access for intervention experiments). The closure requires intervention experiments showing the cited passage causally generates the answer; current audits are observational. Direct cousin to LLM / Reasoning / VLM / Bio causal-mechanism bridges in the cross_ledger_bridges meta-ledger — Bill 5 here is a node in cross_ledger_bridges Bridge 1 (causally-faithful mechanism empty across LLM-centric domains).
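The gap between an observational audit and an intervention experiment can be illustrated with a passage-ablation check: rerun generation with the cited passage removed and see whether the answer changes. This is a minimal sketch of one possible intervention, not the ledger's closure protocol. The `generate` callable is a hypothetical stand-in for the RAG system under test, and exact string comparison is a placeholder for a real answer-equivalence metric.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (query, retrieved passages) -> answer text.
Generate = Callable[[str, Sequence[str]], str]


def citation_is_causal(
    generate: Generate,
    query: str,
    passages: Sequence[str],
    cited_index: int,
) -> bool:
    """Return True if removing the cited passage changes the answer."""
    baseline = generate(query, passages)
    # Intervention: drop only the cited passage, keep the rest in order.
    ablated: List[str] = [p for i, p in enumerate(passages) if i != cited_index]
    return generate(query, ablated) != baseline


# Two toy systems for illustration only.
def grounded(query: str, passages: Sequence[str]) -> str:
    # Answers only from retrieved text.
    return "Paris" if any("Paris" in p for p in passages) else "unknown"


def parametric(query: str, passages: Sequence[str]) -> str:
    # Ignores retrieval entirely; any citation it attaches is post-hoc.
    return "Paris"
```

An observational audit would score both toy systems identically, since each returns an answer the cited passage supports. Only the ablation separates them, which is the distinction M5 is meant to capture.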
Bill 8 ★ requires a RAG system trained on one corpus to transfer to ≥2 distinct held-out corpora with ≤15pp drop on the BEIR / MTEB / RAGAS rolling-refresh held-out benchmarks. Of 18 batch-1 candidates, 0 clean triggers — every candidate pays M2 (single-corpus) or M3 (closed-vendor-only) on the cross-corpus axis. Vendor systems report performance on training-corpus distribution; cross-corpus drop is consistently ≥20pp where independently measured. The closure remains open: independent third-party evaluation infrastructure for cross-corpus reproduction (HELM-RAG / Stanford CRFM / METR / AISI) is the precondition for a clean trigger.
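The Bill 8 criterion reduces to simple arithmetic once scores are available. The sketch below assumes scores are on a 0–100 scale (so differences are percentage points) and reads the criterion as "at least two held-out corpora each show a drop of at most 15pp"; the function name and argument layout are illustrative.

```python
from typing import Dict


def passes_bill8(
    source_score: float,
    heldout_scores: Dict[str, float],
    max_drop_pp: float = 15.0,
    min_corpora: int = 2,
) -> bool:
    """Check the cross-corpus criterion under the reading stated above."""
    # Drop in percentage points relative to the training-corpus score.
    drops = {name: source_score - score for name, score in heldout_scores.items()}
    within_budget = [name for name, drop in drops.items() if drop <= max_drop_pp]
    return len(within_budget) >= min_corpora
```

With illustrative numbers, a system scoring 80 on its training corpus and 70 and 66 on two held-out corpora passes (drops of 10pp and 14pp), while 70 and 58 fails because the second drop is 22pp.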
Bill 11 ★ requires a frontier RAG system to clear all 5 sub-tasks {single-doc QA + multi-doc synthesis + contradiction-resolution + citation-precision + factual-recall-at-scale} above clean threshold without per-task tuning. Of 24 batch-1 candidates, 0 clean triggers. Vendor systems excel at 1–3 sub-tasks (typically single-doc QA + multi-doc synthesis) and degrade on contradiction-resolution / citation-precision / factual-recall-at-scale. M3 (closed-vendor-only) and M5 (no-citation-eval) pay the column. The structural barrier: per-task balanced training is hard to ship without sacrificing per-task peak performance.
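The all-five-sub-tasks requirement can be sketched as a conjunction over per-task scores. The ledger does not state the clean threshold in this section, so it is left as a required parameter here. Treating a score equal to the threshold as passing is an assumption, as is treating a missing score as a failure.

```python
from typing import Dict, List

SUB_TASKS = (
    "single-doc QA",
    "multi-doc synthesis",
    "contradiction-resolution",
    "citation-precision",
    "factual-recall-at-scale",
)


def failing_subtasks(scores: Dict[str, float], threshold: float) -> List[str]:
    """List sub-tasks below the threshold; an unreported sub-task counts as failing."""
    return [t for t in SUB_TASKS if scores.get(t, float("-inf")) < threshold]


def passes_bill11(scores: Dict[str, float], threshold: float) -> bool:
    """A clean trigger must clear every sub-task with one shared threshold."""
    return not failing_subtasks(scores, threshold)
```

Returning the list of failing sub-tasks, rather than only a boolean, makes it easy to see the pattern described above, where a system clears single-doc QA and multi-doc synthesis but falls short on the remaining three.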
Bill 5 (causally-faithful citation) is a node in cross_ledger_bridges Bridge 1 (causally-faithful mechanism empty across 11+ LLM-centric domains). The 21 candidates pay M5 (observational not intervention-verified) — the same pattern observed in mech_interp Bill 13 (causal abstraction), reasoning_cot Bill 14 (faithful CoT), and other LLM-centric ledgers. The grounded-reward exception (B9, scientific_discovery autonomous-lab subset) does NOT extend to RAG citation faithfulness — the intervention experiments are not grounded in physical reward.
The ledger v0.2 LOCKED before the cross-ledger Stage 3.5 verification rule was established (2026-05-15). The lock criteria (classifier benchmark + watch-list + falsification protocol + author-activity panel + boxed declaration) were satisfied at lock time. Within the 247-paper corpus, the priority ★-bill candidate pool has not yet been independently verified by arXiv ID + abstract.
The "★ HOLDS empty" claim is provisional pending Stage 3.5. The verification queue is a priority pool of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spotchecks, dispatched against the arXiv API. The methodology lesson from sibling ledgers applies: Robotics_Embodied 9/9 hallucinated, RL-from-Rewards 60% on flagged IDs, Spacetime_Discreteness priority-pool source-ID failures. Expected outcome here: a non-trivial hallucination rate that requires charitable substitution from cited author corpora.
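The shape of the Stage 3.5 queue (10 IDs per ★ bill plus 20 spotchecks, 50 in total) can be sketched as a seeded sampling step followed by a syntactic pre-check. The function names and the random sampling strategy are assumptions, not a description of the actual pipeline. The regular expression covers only new-style arXiv identifiers (`YYMM.NNNNN`, optionally versioned), and a match says nothing about whether the paper exists or matches its claimed abstract.

```python
import random
import re
from typing import Dict, List

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_id(candidate_id: str) -> bool:
    """Syntactic check only; existence still has to be confirmed remotely."""
    return ARXIV_ID.match(candidate_id) is not None


def build_queue(
    star_pools: Dict[int, List[str]],
    other_ids: List[str],
    per_bill: int = 10,
    spotchecks: int = 20,
    seed: int = 0,
) -> List[str]:
    """Sample the priority pool per star bill, then add corpus-wide spotchecks."""
    rng = random.Random(seed)  # fixed seed keeps the queue reproducible
    queue: List[str] = []
    for bill in sorted(star_pools):
        pool = star_pools[bill]
        queue += rng.sample(pool, min(per_bill, len(pool)))
    # Spotchecks are drawn from IDs not already in the priority pool.
    chosen = set(queue)
    remaining = [i for i in other_ids if i not in chosen]
    queue += rng.sample(remaining, min(spotchecks, len(remaining)))
    return queue
```

Fixing the seed means a re-audit can regenerate the same queue and compare verification outcomes ID by ID.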
The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (causally-faithful citation requires intervention experiments, cross-corpus generalization requires multi-corpus evaluation infrastructure, universal RAG-task coverage requires per-task balanced training). The Bill 7 RESCOPING discovery, by contrast, is an architectural finding from sweep 1006 — less dependent on individual source-ID verification.
Independent arXiv-ID + abstract verification is required before any breach commitment. Three complementary methodological tools support it: (1) classifier rule-tightening, (2) sweep-time arXiv verification, (3) Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection. This ledger LOCKED v0.2 before the rule was established and is being re-audited.
v0.2 LOCKED status (2026-05-15): classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (F5, F8, F11, F-Bill7-rescope, F-Citation-intervention, F-Independence, ✓), author-activity panel mapped (Lewis, Karpukhin, Khattab, Izacard, Lin, Yang-Zhilin, Asai, Borgeaud, Glass, Manakul, Niu, Es, Anthropic, LangChain-team, LlamaIndex-team, DSPy, ✓), boxed declaration (✓). Stage 3.5 verification queue pending — re-audit under v2026-05-15 cross-ledger rule.
Bill 7 RESCOPING propagation: the RAG sweep 1006 finding seeded the new B7/B8 commercialization-vs-research-artifact axis emergent bridge in cross_ledger_bridges. Batch 3 inheritance (sweep_708 2026-05-14) confirmed the B8 emergent bridge cleanly across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12 BIPOLAR) and hardware_inference (strong 0/N + B7/B8 separation in the current cross-ledger atlas). The RAG ledger is the SEED point for what is now an evidence-bearing cross-ledger bridge.
Monthly cadence: vendor system-card revisions + open-source RAG framework releases. Quarterly: HELM-RAG / Stanford CRFM / BEIR / MTEB rolling refresh + ALCE / FacTool / RAGAS / Citation-Bench updates. Triggered: any HELM-RAG publication of independent reproductions across all 5 sub-tasks (would close Bill 10 and create the precondition for an F11 trigger); any intervention-experiment infrastructure published at NeurIPS / ICML / ICLR 2026–2027 enabling causal verification of RAG citation faithfulness at scale.
This ledger SEEDED the B7 → B8 RESCOPING via sweep 1006. The new B8 (commercialization-vs-research-artifact axis) emergent bridge is currently supported in:
multimodal_generation (377 papers): 78 closed-cloud Bill 9 vs 74 open-source Bill 12 — strong BIPOLAR signal observed.
hardware_inference (291 papers): vLLM/SGLang/llama.cpp/MLX (open) vs Groq/Cerebras/SambaNova/Etched (closed), separable along the cluster boundary under the current annotations in the cross-ledger atlas.
Falsification of B8 here would require a 2024–2026 RAG framework that crosses the commercialization-vs-research cluster boundary cleanly (e.g., a closed-cloud vendor open-sourcing core RAG primitives or an open-source framework reaching closed-cloud-product capability parity).