# RAG / Retrieval-Augmented Ledger — Bills Draft (v0.1)

> **12 bills + 6 meta-costs + 3 escape gates**, ★ at 5, 8, 11.

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | Retrieval-vs-generation decoupling | Retrieval-only score, generation-only score, joint score reported separately. | |
| 2 | Citation-faithfulness audit | Cited passages causally support generated answer (not post-hoc-attached). | |
| 3 | Retrieval-corpus-contamination | Test queries verified absent from the corpus as it stood before the training cutoff. | |
| 4 | Query-decomposition audit | Multi-hop queries decomposed and verified at each hop. | |
| 5 | **★ Causally-faithful citation mechanism** | Intervention experiments show that the retrieved citation causally drives the generated answer. Direct cousin to LLM/Reasoning/VLM/Bio causal mechanism bridges. Predicted empty. | ★ |
| 6 | Multi-document synthesis | ≥3 distinct sources synthesized; contradiction-handling explicit. | |
| 7 | Strong-baseline classical-IR comparison | RAG beats BM25 (classical) and ColBERT (neural late-interaction) retrieval baselines at equivalent latency. | |
| 8 | **★ Cross-corpus generalization** | A system trained on one corpus transfers to ≥2 distinct held-out corpora with ≤15 pp drop. Predicted empty. | ★ |
| 9 | Held-out post-2024 retrieval benchmark | BEIR / MTEB / RAGAS rolling refresh; queries unanswerable from information available before the training cutoff. | |
| 10 | Vendor-self-eval independence | Independent reproduction (Stanford CRFM HELM-RAG, MTEB, BEIR-Plus, RAG-Plus). | |
| 11 | **★ Universal RAG-task coverage** | Frontier RAG passes all 5 sub-tasks {single-doc QA, multi-doc synthesis, contradiction-resolution, citation-precision, factual-recall-at-scale}. Predicted empty. | ★ |
| 12 | Geopolitical-framework-balance audit | Compares Western open-source (LangChain, LlamaIndex, Haystack) vs Chinese vendor offerings (Tongyi-RAG, Qwen-Agent, DeepSeek-with-retrieval). Bridge-test for B7. | |
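
The pass criteria for the ★ bills 8 and 11 are simple enough to state as code. The following is a minimal sketch, not a reference implementation: the function names, the score scale (percentages, 0 to 100), and the reading of Bill 8 as "at least two held-out corpora each within 15 pp of the in-corpus score" are assumptions made for illustration.

```python
# Thresholds copied from the table (Bill 8).
MAX_DROP_PP = 15.0   # maximum allowed drop, in percentage points
MIN_HELD_OUT = 2     # minimum number of held-out corpora that must qualify

# The five sub-tasks named in Bill 11.
SUBTASKS = (
    "single-doc QA",
    "multi-doc synthesis",
    "contradiction-resolution",
    "citation-precision",
    "factual-recall-at-scale",
)


def passes_bill_8(in_corpus_score: float, held_out_scores: dict) -> bool:
    """Bill 8 check (assumed reading): at least MIN_HELD_OUT held-out corpora
    score within MAX_DROP_PP percentage points of the in-corpus score."""
    within = [
        name
        for name, score in held_out_scores.items()
        if in_corpus_score - score <= MAX_DROP_PP
    ]
    return len(within) >= MIN_HELD_OUT


def passes_bill_11(subtask_passed: dict) -> bool:
    """Bill 11 check: every one of the five sub-tasks is reported and passed.
    A missing sub-task counts as a failure."""
    return all(subtask_passed.get(name, False) for name in SUBTASKS)
```

Treating a missing sub-task as a failure keeps Bill 11 conservative: a system that simply does not report one of the five cannot pay the bill by omission.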

## Six meta-costs

pre-2024 / single-corpus / single-query-type / closed-vendor-only / no-citation-eval / implementation-specific.

## Three escape gates

G1 methodology / G2 negative-result / G3 theoretical.

## Iteration plan (8 sweeps)
- 1001: Frontier RAG system cards (Anthropic retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro/Sonar, You.com, Phind, Glean)
- 1002: Citation-faithfulness audits (ALCE / FacTool / RAGAS / Citation-Bench)
- 1003: Retrieval-corpus contamination + held-out benchmarks (BEIR refresh, MTEB rolling, MS MARCO leakage audits)
- 1004: Open-source RAG frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB)
- 1005: Multi-document synthesis + multi-hop QA (HotpotQA, MuSiQue, FanOutQA, MultiHopRAG)
- 1006: Chinese RAG offerings (Tongyi-RAG, Qwen-Agent retrieval, DeepSeek with retrieval) for B7 bridge test
- 1007: Independent third-party RAG audits (HELM-RAG, Stanford CRFM, METR retrieval, AISI evals)
- 1008: RAG negative results / hallucination / contradiction / failure modes
