RAG / Retrieval Ledger · v0.2 · 2026-05-15 · LOCKED

247 papers.
12 bills + 6 meta-costs.
Three signature-empty + Bill 7 rescoped.

A real-data falsification harness for 2024–2026 frontier RAG (retrieval-augmented generation) claims, covering Anthropic Claude with retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro / Sonar, LangChain agents, LlamaIndex frameworks, Haystack, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent, and DeepSeek with retrieval. ★ Bills 5 (causally-faithful citation), 8 (cross-corpus generalization), and 11 (universal RAG-task coverage) HOLD pre-Stage-3.5. Bill 7 was PROFOUNDLY RESCOPED from a geopolitical axis to a commercialization-vs-research axis (a sweep 1006 finding that became the new B7/B8 emergent bridge).

247 unique papers · 12+6 bills + meta-costs · 3 ★ empty bills · v0.2 locked + rescoped

Quick Orientation

When an AI cites a source, can you actually trust the citation? We surveyed every claim.

Retrieval-Augmented Generation (RAG) means an AI looks up documents before answering, so it can cite its sources — Perplexity, NotebookLM, Anthropic retrieval, OpenAI Assistants. We surveyed 247 papers from 2024-2026. No frontier RAG system has yet shown that the cited source actually caused the answer (rather than being attached after the fact). None transfer cleanly from one document corpus to another. None pass all five sub-tasks we care about (single-doc QA, multi-doc synthesis, contradiction resolution, citation precision, factual recall at scale). One side finding: open-source frameworks (LangChain, LlamaIndex) and closed products (Perplexity, NotebookLM) split cleanly on independent evaluation discipline. Independent citation verification is still in progress.

Why it matters: RAG is now the default way to ground AI answers in real sources. If the citations don't actually drive the answers, the trust signal is theater.

What we found: 247 papers checked. Three predicted-empty lines hold: no RAG system shows causally-faithful citation, cross-corpus generalization, or universal sub-task coverage.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-15
Three signature-empty bills.
One bill rescoped.
247 unique papers across 8 sweeps.
§01

The twelve-bill closure pattern for RAG / retrieval claims

Bills are the closure mechanisms any 2024–2026 frontier RAG capability claim must engage. The 12 bills below were predeclared in bills_draft.md v0.1 before any sweep ran, calibrated to the structure of the RAG literature (vendor system cards, citation-faithfulness audits, retrieval-corpus contamination, multi-document synthesis, classical-IR baselines, held-out post-cutoff retrieval benchmarks, Chinese RAG offerings, RAG failure modes). Bills 5, 8, 11 are ★ — empty-space hypothesis predicts no clean trigger without paying a meta-cost. Bill 7 became a primary RESCOPING discovery: sweep 1006 of Chinese RAG offerings revealed the geopolitical framing was wrong; the dominant axis is commercialization-vs-research, which seeded the new B7/B8 emergent bridge in cross_ledger_bridges.

How to read this heatmap. Counts inside each cell show candidate papers that touched a bill. A starred bill is "★ empty" only if no candidate survives closure review as a clean trigger. Bill 7 (purple border) is RESCOPED: the geopolitical Western-vs-Chinese framing was wrong; the dominant axis turned out to be commercialization-vs-research, which became the new B8 emergent bridge in cross_ledger_bridges.

Bill    Candidates   Label
1       28
2       35
3       19
4       14
5★      21           empty?
6       31
7       17
8★      18           empty?
9       22
10      16
11★     24           empty?
12      42           rescoped

Legend: ★ Predicted empty (HOLDING pre-Stage-3.5) · RESCOPED (commercialization axis) · High (30-39) · Active (10-29)
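
The cell counts can be tallied directly. The short Python sketch below copies the twelve counts from the heatmap and sums them; the variable names are illustrative and are not taken from the ledger's own scripts. The total exceeds the 247 unique papers, which is what one would expect if a single paper can touch more than one bill (an interpretation, not something the ledger states explicitly).

```python
# Candidate counts per bill, copied from the heatmap cells in this section.
HEATMAP = {
    1: 28, 2: 35, 3: 19, 4: 14, 5: 21, 6: 31,
    7: 17, 8: 18, 9: 22, 10: 16, 11: 24, 12: 42,
}
STARRED = (5, 8, 11)      # bills predeclared empty
UNIQUE_PAPERS = 247       # corpus size stated by the ledger

# Total number of bill touches across all cells.
total_touches = sum(HEATMAP.values())

# Candidate pool sizes for the starred bills only.
starred_pool = {bill: HEATMAP[bill] for bill in STARRED}

print(total_touches, UNIQUE_PAPERS)   # 287 247
print(starred_pool)                   # {5: 21, 8: 18, 11: 24}
```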

★ Empty-space census (HOLDS pre-Stage-3.5 verification)

★ Bill 5 · Closure basis: Causally-faithful citation mechanism (intervention-verified)
Candidates: 21 · Clean: 0
21 candidates. ALCE / FacTool / RAGAS / Citation-Bench post-hoc audit infrastructure dominates the candidate pool. 0 clean triggers: every candidate pays either M5 (no-citation-eval: citation reported but not intervention-verified) or M3 (closed-vendor-only). The closure requires intervention experiments showing the cited passage causally generates the answer; current audits are observational.

★ Bill 8 · Closure basis: Cross-corpus generalization (≤15pp drop on ≥2 held-out corpora)
Candidates: 18 · Clean: 0
18 candidates. BEIR / MTEB / RAGAS rolling-refresh + multi-corpus evaluation suites engaged. 0 clean triggers: every candidate pays M2 (single-corpus) or M3 (closed-vendor-only) on the cross-corpus axis. Vendor systems report performance on training-corpus distribution; cross-corpus drop is consistently ≥20pp where independently measured.

★ Bill 11 · Closure basis: Universal RAG-task coverage (5 sub-tasks: single-doc QA + multi-doc synthesis + contradiction-resolution + citation-precision + factual-recall-at-scale)
Candidates: 24 · Clean: 0
24 candidates. The closure requires the same RAG system to clear all 5 sub-tasks above clean threshold without per-task tuning. 0 clean triggers: vendor systems excel at 1–3 sub-tasks (typically single-doc QA + multi-doc synthesis) and degrade on contradiction-resolution / citation-precision / factual-recall-at-scale. M3 (closed-vendor-only) and M5 (no-citation-eval) are the meta-costs paid in this column.

Bill 5 ★ (causally-faithful citation): 21 candidates, 0 clean. ALCE / FacTool / RAGAS / Citation-Bench dominate the audit infrastructure. Closure requires intervention experiments; current audits are observational. Direct cousin to LLM / Reasoning / VLM / Bio causal-mechanism bridges in the cross_ledger_bridges meta-ledger.

Bill 8 ★ (cross-corpus generalization): 18 candidates, 0 clean. Vendor systems report training-corpus performance; cross-corpus drop is consistently ≥20pp where independently measured.

Bill 11 ★ (universal RAG-task coverage): 24 candidates, 0 clean. Vendor systems excel at 1–3 sub-tasks and degrade on the rest. M3 + M5 dominate the meta-cost paid.
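
The census rule above (a starred bill holds empty as long as every candidate pays at least one meta-cost) can be sketched as a small check. This is a minimal illustration, not the ledger's classifier: the `Candidate` type, the function names, and the `example-*` IDs are hypothetical placeholders, and only the meta-cost codes M2, M3, and M5 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    paper_id: str                        # placeholder label, not a real arXiv ID
    bill: int                            # bill the paper was filed against
    meta_costs: frozenset = frozenset()  # e.g. {"M2"}, {"M3"}, {"M5"}


def clean_triggers(candidates, bill):
    """Return the candidates for `bill` that pay no meta-cost at all."""
    return [c for c in candidates if c.bill == bill and not c.meta_costs]


def holds_empty(candidates, bill):
    """A starred bill holds empty while it has zero clean triggers."""
    return not clean_triggers(candidates, bill)


# Hypothetical pool: every candidate pays at least one meta-cost.
pool = [
    Candidate("example-a", 5, frozenset({"M5"})),  # no-citation-eval
    Candidate("example-b", 5, frozenset({"M3"})),  # closed-vendor-only
    Candidate("example-c", 8, frozenset({"M2"})),  # single-corpus
]

print(holds_empty(pool, 5))   # True
```

Under this sketch, a single candidate with an empty `meta_costs` set would be enough to flip the bill from empty to triggered.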

Bill 7 RESCOPING · the new B8 commercialization-vs-research bridge

Bill 7 was originally posed as a Western-vs-Chinese geopolitical-framework-balance audit (LangChain / LlamaIndex / Haystack vs Tongyi-RAG / Qwen-Agent / DeepSeek-with-retrieval). Sweep 1006 pointed Opus research agents at the literature on Chinese RAG offerings and surfaced an unexpected finding: the geopolitical framing was the wrong axis. The dominant cluster boundary is commercialization-vs-research, with open-source frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB) on one side and closed-cloud vendor products (Anthropic retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro, You.com, Phind, Glean, Tongyi-RAG, Qwen-Agent) on the other. Western and Chinese systems cluster within each side; the clusters do not split geopolitically.

This RESCOPING produced the new B8 (commercialization-vs-research-artifact axis) emergent bridge in cross_ledger_bridges, which has current batch-3 support across multimodal_generation (78 closed Bill 9 vs 74 open Bill 12 — bipolar signal) and hardware_inference (a strong open-vs-closed separation in the current cross-ledger atlas). The RAG ledger seeded a structural finding that is evidence-bearing in the cross-ledger bridge atlas.

Stage 3.5 verification queue · v0.2 LOCKED with verification pending

v0.2 LOCKED with verification status: partial. The ledger reached lock criteria before the cross-ledger Stage 3.5 verification rule was established (2026-05-15). The priority ★-bill candidate pool within the 247-paper corpus has not yet been independently verified by arXiv ID and abstract, so the "★ HOLDS empty" claim is provisional pending Stage 3.5. The queue holds a priority pool of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spotchecks, queued for dispatch against the arXiv API.
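
The queue sizing described above (10 priority IDs per ★ bill plus 20 spotchecks) can be sketched as follows. Everything beyond those numbers is an assumption: the ledger does not say how the priority IDs are ordered or how spotchecks are sampled, so this sketch takes the first 10 per bill and draws the spotchecks with a seeded random sample. All IDs are placeholders, and no arXiv API call is made.

```python
import random

STAR_BILLS = (5, 8, 11)
PER_BILL = 10      # priority IDs per starred bill
SPOTCHECKS = 20    # sweep-health spotchecks


def build_queue(star_pool, other_ids, seed=0):
    """Return (priority, spot): priority is a list of (bill, id) pairs,
    spot is a reproducible random sample of the remaining corpus IDs."""
    rng = random.Random(seed)
    priority = [
        (bill, pid)
        for bill in STAR_BILLS
        for pid in star_pool[bill][:PER_BILL]
    ]
    spot = rng.sample(other_ids, min(SPOTCHECKS, len(other_ids)))
    return priority, spot


# Placeholder IDs sized to the census candidate counts (21 / 18 / 24).
star_pool = {
    bill: [f"bill{bill}-cand{i:02d}" for i in range(n)]
    for bill, n in ((5, 21), (8, 18), (11, 24))
}
other_ids = [f"paper{i:03d}" for i in range(100)]  # hypothetical remainder

priority, spot = build_queue(star_pool, other_ids)
print(len(priority), len(spot))   # 30 20
```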

The empty-space hypothesis is less sensitive to typical source-ID errors because the closure mechanisms here are structural (causally-faithful citation requires intervention experiments, cross-corpus generalization requires multi-corpus evaluation infrastructure, universal RAG-task coverage requires per-task balanced training) rather than contingent on individual paper IDs. The Bill 7 rescoping discovery is likewise an architectural finding from sweep 1006 and is less dependent on individual source-ID verification.

§02

Frontier RAG vendors and frameworks in scope

The ledger tracks RAG capability claims across closed-cloud vendor products and open-source frameworks. Bill 7 RESCOPING surfaced commercialization-vs-research as the dominant cluster boundary; Bills 5, 8, 11 are predicted-empty across both clusters.

Anthropic · Claude with retrieval · closed cloud
OpenAI · Assistants v2 + file_search · closed cloud
Google · NotebookLM + Vertex Search · closed cloud
Perplexity · Pro / Sonar · closed cloud
LangChain · Open-source agent + retrieval · open framework
LlamaIndex · Open-source RAG primitives · open framework
Haystack / DSPy · Open-source RAG + DSL · open framework
Tongyi / Qwen-Agent · Alibaba Chinese RAG · closed cloud

The frontier RAG literature splits sharply between closed-cloud vendor products (Anthropic, OpenAI, Google, Perplexity, You.com, Phind, Glean, Tongyi, Qwen-Agent) and open-source frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB). Bill 12 (commercialization-vs-research RESCOPED) catches the dominant cluster boundary. Bill 9 (vendor-self-eval-independence) catches closed-cloud reporting gaps.

§03

Method at a glance

Threat model: Demonstrate a frontier RAG (Retrieval-Augmented Generation) capability claim, on a system marketing retrieval-augmented frontier reasoning, that survives six closure audits: (1) retrieval-vs-generation decoupling, (2) citation-faithfulness audit, (3) retrieval-corpus-contamination, (4) query-decomposition audit, (5) multi-document synthesis generalization, (6) held-out-corpus construction (BEIR refresh / MTEB rolling).
Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round (v0.2 LOCKED).
Sources surveyed: Vendor system cards (Anthropic Claude with retrieval, OpenAI Assistants v2, Google NotebookLM, Perplexity Pro / Sonar, You.com, Phind, Glean) + citation-faithfulness audits (ALCE, FacTool, RAGAS, Citation-Bench) + retrieval-corpus contamination (BEIR refresh, MTEB rolling, MS MARCO leakage) + open-source frameworks (LangChain, LlamaIndex, Haystack, DSPy, Verba, Letta, R2R, ChromaDB) + multi-document synthesis (HotpotQA, MuSiQue, FanOutQA, MultiHopRAG) + Chinese RAG offerings (Tongyi-RAG, Qwen-Agent, DeepSeek with retrieval) + independent third-party audits (HELM-RAG, Stanford CRFM, METR retrieval, AISI evals) + RAG failure modes / hallucination / contradiction.
Bridge-test focus: Tests cross_ledger_bridges B7. Original B7 prediction: Western open-source RAG frameworks dominate; Chinese vendors don't ship comparable frameworks. Result: B7 RESCOPED. Sweep 1006 surfaced commercialization-vs-research as the dominant cluster boundary; geopolitical framing was wrong. This RESCOPING seeded the new B8 emergent bridge across the atlas.
Empty-space test: Three signature bills (5, 8, 11) predeclared empty in v0.1 BEFORE batch 1 sweeps. After 247 unique papers, all three ★ bills HOLD pre-Stage-3.5: 21 / 18 / 24 candidates respectively, 0 clean triggers across all three. Bill 5 (causally-faithful citation) is direct cousin to LLM / Reasoning / VLM / Bio causal-mechanism bridges in cross_ledger_bridges.
Verification rule: Stage 3.5 verification queue pending; v0.2 LOCKED before the cross-ledger rule was established. Priority pool of ~30 ★-bill candidate IDs + 20 sweep-health spotchecks. Closure mechanisms are structural; the empty-space hypothesis is less sensitive to typical source-ID errors. The Bill 7 RESCOPING is architectural and less dependent on individual source-ID verification.
Cross-ledger coupling: cross_ledger_bridges B7 RESCOPED + B8 emergent (this ledger seeded the discovery). mech_interp Bill 13 (causal abstraction) cousin to Bill 5 here. reasoning_cot Bill 14 (faithful CoT) cousin to causally-faithful citation closure here.
Reproducibility: Scripts, JSONs, ledger public. Run order: sweep dispatcher → bill_classifier.py → ledger populator → atlas review pipeline. Stage 3.5 verifier dispatch in queue.

§04

Falsification protocol

Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F5, F8, or F11. Independent arXiv-ID + abstract verification (Stage 3.5) is mandatory before any trigger fires.

F5 · ★ Causally-faithful citation
Trigger: a 2024–2026 paper demonstrating a RAG system whose retrieved citation causally generates the answer per intervention experiments (not post-hoc-attached) — verified by an independent third-party (HELM-RAG / Stanford CRFM / METR / AISI / ALCE) within 6 months — verified arXiv ID + no M3 / M5 paid.
F8 · ★ Cross-corpus generalization
Trigger: a 2024–2026 RAG system trained on one corpus that transfers to ≥2 distinct held-out corpora with ≤15pp drop on the BEIR / MTEB / RAGAS rolling-refresh held-out benchmarks — verified by an independent third-party — verified arXiv ID.
F11 · ★ Universal RAG-task coverage
Trigger: a 2024–2026 frontier RAG system passing all 5 sub-tasks {single-doc QA + multi-doc synthesis + contradiction-resolution + citation-precision + factual-recall-at-scale} above clean threshold without per-task tuning — independent reproduction — verified arXiv ID.
F-Bill7-rescope
Soft trigger: a 2024–2026 paper presenting a RAG framework that crosses the commercialization-vs-research cluster boundary (e.g., a closed-cloud vendor open-sourcing core RAG primitives or an open-source framework reaching closed-cloud-product capability parity) would re-test the B7 RESCOPING and could push it back toward partial validation of the geopolitical framing.
F-Citation-intervention
Soft trigger: an intervention-experiment infrastructure published at NeurIPS / ICML / ICLR 2026–2027 enabling causal verification of RAG citation faithfulness at scale would lower the F5 trigger barrier from "single-paper demonstration" to "published methodology."
F-Independence
Soft trigger: HELM-RAG / Stanford CRFM publication of independent reproductions across all 5 sub-tasks would close Bill 10 (vendor-self-eval-independence) and create the precondition for an F11 trigger.
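
The numeric parts of the F8 and F11 triggers above can be written as simple predicates. This is a hedged sketch only: the function names and the example scores are hypothetical, the "clean threshold" for F11 is not specified in the ledger and is therefore left as a parameter, and the non-numeric requirements (independent third-party verification, verified arXiv ID, meta-costs) are deliberately not modelled.

```python
SUB_TASKS = (
    "single-doc QA",
    "multi-doc synthesis",
    "contradiction-resolution",
    "citation-precision",
    "factual-recall-at-scale",
)


def f8_condition(in_corpus_score, held_out_scores, max_drop_pp=15.0, min_corpora=2):
    """Numeric part of F8: at least `min_corpora` held-out corpora whose score
    drops by no more than `max_drop_pp` percentage points from the in-corpus score.
    `held_out_scores` maps corpus name -> score in percentage points."""
    passing = [
        name for name, score in held_out_scores.items()
        if in_corpus_score - score <= max_drop_pp
    ]
    return len(passing) >= min_corpora


def f11_condition(scores, threshold, per_task_tuning):
    """Numeric part of F11: all five sub-tasks above `threshold` with no per-task
    tuning. `threshold` is an assumed parameter; a missing sub-task counts as a fail."""
    if per_task_tuning:
        return False
    return all(scores.get(task, float("-inf")) > threshold for task in SUB_TASKS)


# Hypothetical scores, for illustration only.
print(f8_condition(80.0, {"corpus_a": 70.0, "corpus_b": 66.0}))   # True
print(f8_condition(80.0, {"corpus_a": 70.0, "corpus_b": 55.0}))   # False
```

Reading "≥2 held-out corpora" as a count of passing corpora is one interpretation; a stricter reading would require every evaluated held-out corpus to stay within the 15pp bound.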

Live triggered watchlist: HELM-RAG / Stanford CRFM quarterly releases · BEIR / MTEB rolling refresh · ALCE / FacTool / RAGAS / Citation-Bench updates · HotpotQA / MuSiQue / FanOutQA / MultiHopRAG benchmark refreshes · METR / AISI RAG audits. Monthly cadence: vendor system-card revisions + open-source RAG framework releases. Quarterly: HELM-RAG + benchmark refreshes.

§05

Resources & further reading

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (2026-05-15) — 247 unique papers across 8 sweeps; ★ Bills 5, 8, 11 HOLD pre-Stage-3.5. Bill 7 PROFOUNDLY RESCOPED to commercialization-vs-research axis (sweep 1006 finding, became new B7/B8 emergent bridge). Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual RAG / retrieval-augmented literature, not from a template.

v0.2 LOCKED · B7 RESCOPED · 2026-05-15
Three signature-empty bills.
One bill rescoped + one new bridge.
Empty space HOLDS pending verification.