← Ledger / Hardware Inference Stack Ledger · v0.2 · 2026-05-15 · Ledger populated · 19th of 19 ledgers · Strong 0/N signal in atlas

291 papers.
13 bills + 6 meta-costs.
STRONG 0/N signal in 19-ledger atlas.

A real-data falsification harness for 2024–2026 frontier inference-stack capability claims — vLLM, SGLang, TensorRT-LLM, Triton, NVIDIA Dynamo, Groq, Cerebras WSE-3 / CS-3, SambaNova SN40L, Tenstorrent Wormhole, Lightmatter, Etched Sohu, Rain.ai, Amazon Trainium / Inferentia 3, Google TPU v5p / v6 / Ironwood, Microsoft Maia 100/200, OpenAI custom silicon, AMD MI300X / MI350, NVIDIA B100/B200/H200, Huawei Ascend, Cambricon, Biren + framework optimizations (speculative decoding, MoE serving, KV-cache compression, FP8/FP4 quantization, ONNX runtime, llama.cpp, ggml, MLX, Ollama, LM Studio). ★ Bills 5, 8, 11 HOLD pre-Stage-3.5 with 0/34, 0/38, 0/20 firings — the strong 0/N signal in the current cross-ledger atlas.

291 unique papers · 13 bills + 6 meta-costs · 3 ★ empty bills · 0/92 combined ★ firings
19th ledger · ProjectForty2 · strong signal in atlas
A strong closed-vs-open signal in the current cross-ledger atlas. vLLM / SGLang / llama.cpp / MLX (open) vs Groq / Cerebras / SambaNova / Etched (closed) creates a cluster boundary that is strongly separable under the current annotations, which makes this ledger a strong test of the cross_ledger_bridges B7 RESCOPED and B8 emergent commercialization-vs-research bridges.
Quick Orientation

Specialized AI chips promise faster, cheaper inference — we checked how the benchmark wins translate to real workloads.


Groq, Cerebras, SambaNova, TPU, Trainium, vLLM, SGLang: every inference chip and framework markets dramatic speed and cost wins. We surveyed 291 papers from 2024–2026. In none of them does the same model run across different vendor stacks within 10% of the same speed under matched configuration. No INT4 or FP4 quantization keeps 95% of the original capability under independent third-party benchmarking. No single inference framework runs all four frontier open-source models cleanly. Side finding: open-source frameworks (vLLM, SGLang, llama.cpp, MLX) split cleanly from closed-vendor benchmarks (Groq, Cerebras, SambaNova) on what gets reported and how. Independent citation verification is still in progress.

Why it matters: Datacenter purchases worth billions of dollars hinge on these benchmark wins. The ledger maps which ones survive a fair fight.

What we found: 291 papers checked. Three predicted-empty lines hold (0/34, 0/38, 0/20), the strong signal across the entire atlas. Strong commercialization-vs-research split too.

Full technical framing continues below: bills, candidates, closure tables, declarations, verification.

Ledger declaration · 2026-05-15
Three signature constructions HOLD with STRONG 0/N signal.
291 unique papers across 8 sweeps.
Strong B7/B8 separation in the current cross-ledger atlas.
§01

The thirteen-bill closure pattern for inference-stack claims

Bills are the closure mechanisms any 2024–2026 frontier inference-stack capability claim must engage. The 13 bills below were predeclared in bills_draft.md v0.1 before any sweep ran, calibrated to the structure of the inference-stack literature (benchmark-vs-real-workload, cost-per-token transparency, quantized-vs-FP16 fidelity, batch-vs-streaming, cross-vendor stability, KV-cache / speculative-decoding fidelity, strong-baseline comparison, INT4/FP4 quantization, vendor-self-eval independence, commercial availability vs research-preview, universal inference-platform coverage, commercialization-axis, safety / model-extraction / power-efficiency). Bills 5, 8, 11 are ★.

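Bill 1 · 31
Bill 2 · 28
Bill 3 · 22
Bill 4 · 17
Bill 5 ★ · 34 · empty?
Bill 6 · 19
Bill 7 · 14
Bill 8 ★ · 38 · empty?
Bill 9 · 29
Bill 10 · 21
Bill 11 ★ · 20 · empty?
Bill 12 · 42 · strong
Bill 13 · 16

Legend: ★ Predicted empty (STRONG 0/N signal: 0/34, 0/38, 0/20) · B7 strong test (commercialization-vs-research) · High (25-50) · Active (10-24)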
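The legend bands can be reproduced from the per-bill counts with a few lines of Python. This is a minimal sketch, not the ledger's own classifier: the counts are copied from the figures above, while the function name `band`, the label strings, and the precedence order (★ first, then the B7 test bill, then the numeric bands) are assumptions made for illustration.

```python
# Per-bill counts copied from the figures above.
BILL_COUNTS = {1: 31, 2: 28, 3: 22, 4: 17, 5: 34, 6: 19, 7: 14,
               8: 38, 9: 29, 10: 21, 11: 20, 12: 42, 13: 16}
STAR_BILLS = {5, 8, 11}  # predeclared as predicted-empty
B7_TEST_BILL = 12        # commercialization-axis bill


def band(bill: int, count: int) -> str:
    """Return the legend band for one bill (precedence order is an assumption)."""
    if bill in STAR_BILLS:
        return "predicted-empty"
    if bill == B7_TEST_BILL:
        return "b7-strong-test"
    if 25 <= count <= 50:
        return "high"
    if 10 <= count <= 24:
        return "active"
    return "unbanded"


bands = {bill: band(bill, count) for bill, count in BILL_COUNTS.items()}
for bill, label in bands.items():
    print(f"Bill {bill:>2}: {BILL_COUNTS[bill]:>2}  {label}")
```

Under these assumptions Bills 1, 2 and 9 land in the High band and Bills 3, 4, 6, 7, 10 and 13 in the Active band.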

★ Empty-space census (STRONG 0/N signal in 19-ledger atlas)

Bill · Closure basis · Cands. · Clean

★ 5 · Cross-vendor benchmark stability (≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras under matched config) · 34 · 0
34 candidates, 0 clean. The closure requires the same frontier model evaluated under matched batch size / context length / quantization to clear ≤10% TTFT and tokens/sec variance across the dominant inference frameworks. 0 clean triggers: every benchmark-comparison paper pays either M1 (vendor-supplied benchmark only), M2 (cherry-picked workload: single prompt / short context / no contention), or M5 (marketing-grade tokens/sec: theoretical peak, not sustained). Cross-vendor reproductions consistently report 30–60% variance.

★ 8 · INT4 / FP4 / INT8 quantization retains ≥95% of FP16 capability under independent third-party eval · 38 · 0
38 candidates, 0 clean; the largest active ★ bill. The closure requires INT4 / FP4 / INT8 quantized frontier models to clear ≥95% of FP16 capability on independent third-party evaluation (METR / Apollo / AISI quantization audits). 0 clean triggers: every candidate shows degradation ≥5% on at least one capability axis, often substantial (10–20% on reasoning / math / code). Power-of-2 narrowing (FP16 → FP8 → FP4) compounds the degradation.

★ 11 · Universal inference-platform coverage (single open-source framework runs Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2 at ≤20% performance variance) · 20 · 0
20 candidates, 0 clean. The closure requires a single open-source framework (vLLM, SGLang, llama.cpp, MLX) to run all four frontier open-weight models at ≤20% throughput / TTFT variance on the same hardware. 0 clean triggers: every framework specializes (vLLM excels on Llama / Qwen, SGLang on long-context, llama.cpp on Apple Silicon, MLX on M-series). The 20-paper candidate pool is structurally bounded.
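The Bill 5 and Bill 11 closure bases above both reduce to a spread check across matched measurements. The sketch below shows one way such a check could look. The ledger does not define "variance" precisely, so the reading used here, (max − min) / mean per metric, is an assumption, as are the function names and the example numbers, which are hypothetical and not taken from any surveyed paper.

```python
from statistics import mean


def relative_spread(values: list[float]) -> float:
    """(max - min) / mean: one possible reading of 'variance' in the closure basis."""
    if len(values) < 2:
        raise ValueError("need at least two measurements to compare")
    return (max(values) - min(values)) / mean(values)


def clears_closure(measurements: dict[str, dict[str, float]], threshold: float) -> bool:
    """True only if every metric stays within the threshold across all entries.

    `measurements` maps an entry name (a stack for Bill 5, a model for Bill 11)
    to its metrics; every entry must report the same metric names.
    """
    metrics = {name for per_entry in measurements.values() for name in per_entry}
    return all(
        relative_spread([per_entry[name] for per_entry in measurements.values()]) <= threshold
        for name in metrics
    )


# Hypothetical matched-config measurements, for illustration only.
example = {
    "stack_a": {"ttft_ms": 120.0, "tokens_per_s": 95.0},
    "stack_b": {"ttft_ms": 180.0, "tokens_per_s": 140.0},
    "stack_c": {"ttft_ms": 150.0, "tokens_per_s": 110.0},
}
print(clears_closure(example, threshold=0.10))  # Bill 5 threshold
print(clears_closure(example, threshold=0.20))  # Bill 11 threshold
```

For Bill 11 the keys would be the four models run on one framework and one machine rather than the stacks. The check says nothing about whether the configurations were actually matched or independently verified, which the closure also requires.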

Bill 5 ★ (cross-vendor stability): 34 candidates, 0 clean. Cross-vendor reproductions consistently report 30–60% variance. M1 (vendor-supplied) + M2 (cherry-picked) + M5 (marketing-grade) dominate the meta-cost paid.

Bill 8 ★ (INT4/FP4 quantization fidelity): 38 candidates, 0 clean — the largest active ★ bill. Quantization degrades FP16 capability by 5–20% on at least one axis. Most-likely-to-eventually-trigger ★ bill pending METR / Apollo / AISI Stage 3.5 verification — the engineering trajectory toward FP4 with maintained capability is plausible.

Bill 11 ★ (universal inference-platform coverage): 20 candidates, 0 clean. Frameworks specialize structurally — vLLM excels on Llama / Qwen, SGLang on long-context, llama.cpp on Apple Silicon, MLX on M-series.

Combined: 0/92 firings across all three ★ bills — the strong 0/N signal in the current cross-ledger atlas.
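The combined figure is a plain sum over the three ★ bills. As a small worked check, with the census numbers copied from this section and the variable names chosen only for illustration:

```python
# bill -> (candidates, clean triggers), copied from the ★ census in this section.
STAR_CENSUS = {5: (34, 0), 8: (38, 0), 11: (20, 0)}

total_candidates = sum(cands for cands, _ in STAR_CENSUS.values())
total_clean = sum(clean for _, clean in STAR_CENSUS.values())

# A ★ bill "holds" while it has no clean trigger.
holding = [bill for bill, (_, clean) in STAR_CENSUS.items() if clean == 0]

print(f"{total_clean}/{total_candidates}")  # 0/92
print(holding)                              # [5, 8, 11]
```

A single verified clean trigger on any of the three bills would take that bill out of `holding` and change the combined numerator.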

B7 RESCOPED + B8 emergent · strong separation in 19-ledger atlas

This ledger is a strong test of the rescoped B7 commercialization-vs-research axis. Bill 12 (commercialization-axis) fires 42 times, making it the dominant non-★ bill, and creates a clean cluster boundary:

Open cluster: vLLM (open-source serving framework), SGLang (open-source long-context), llama.cpp (open-source CPU/Apple), MLX (Apple ML framework, open), AMD ROCm (open compiler stack). Reproducible benchmarks, commodity hardware, community-driven evaluation.

Closed cluster: Groq (custom LPU + closed runtime), Cerebras (CS-3 wafer + closed compiler), SambaNova (SN40L + closed runtime), Etched Sohu (custom transformer ASIC), Lightmatter (photonic compute), Rain.ai. Vendor-supplied benchmarks, proprietary stacks, NDA-restricted preview tiers.

The cluster boundary is strongly separable under the current annotations: cleaner than RAG (mixed open + closed cloud), cleaner than multimodal_generation (78-vs-74 bipolar), and stronger than the comparison ledgers currently sampled. This is a strong signal in the cross-ledger bridge atlas; B8 (commercialization-vs-research-artifact axis) is currently supported here.

Stage 3.5 verification queue · ledger populated, verification pending

Status: ledger populated; verification status: partial. The ledger ran 8 deep-loop sweeps covering 291 unique papers and produced the strong 0/N signal in the current cross-ledger atlas (0/34, 0/38, 0/20; combined 0/92 across all three ★ bills). The Stage 3.5 verification queue is pending: a priority pool of ~30 ★-bill candidate IDs (especially Bill 8 INT4/FP4 quantization) plus 20 sweep-health spotchecks, dispatching against the arXiv API.

The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (cross-vendor stability requires standardized matched-config benchmarks, INT4/FP4 fidelity requires per-bit-width quantization-aware training + independent evaluation, universal framework coverage requires per-framework engineering investment), not contingent on individual paper IDs. The B7/B8 commercialization-vs-research finding is architectural and less dependent on individual source-ID verification.

§02

Frontier inference frameworks and silicon in scope

The ledger tracks frontier inference-stack capability claims across open-source frameworks, closed-vendor accelerators, cloud-vendor silicon, and GPU-vendor frontier products. The open-vs-closed cluster boundary creates the strong B7/B8 signal in the current cross-ledger atlas.

vLLM · Open-source LLM serving · PagedAttention
SGLang · Open-source long-context · RadixAttention
llama.cpp / ggml · Open-source CPU/Apple · GGUF format
MLX · Apple ML framework (open) · unified memory
Groq · Custom LPU + closed runtime · tokens/sec leader
Cerebras · CS-3 wafer + closed compiler · single-wafer scale
SambaNova / Etched · SN40L + Sohu transformer ASIC · closed stack
NVIDIA · H200 / B100 / B200 + TensorRT · dominant GPU vendor
AMD / Intel · MI300X / MI350 + ROCm · GPU + open ROCm
Cloud silicon · Trainium / TPU / Maia · cloud-native closed
Chinese silicon · Huawei Ascend / Cambricon / Biren · B7 third corner
Quantization · FP8 / FP4 / INT4 / INT8 · Bill 8 ★ axis

The frontier inference-stack literature splits across three corners that test whether the rescoped B7 holds at the hardware layer: (1) Western open (vLLM, SGLang, AMD ROCm, llama.cpp, MLX), (2) Western closed (Groq, Cerebras, SambaNova, Etched, Lightmatter), (3) Chinese hardware (Huawei Ascend, Cambricon, Biren). The open / closed cluster boundary dominates the geopolitical clustering — three corners collapse to two on the dominant axis.

§03

Method at a glance

Threat model: Demonstrate a frontier inference-stack capability claim, on a system marketing high-throughput / low-latency / low-cost inference of frontier models, that survives six closure audits: (1) benchmark-versus-real-workload (TTFT, tokens/sec at contended load), (2) cost-per-token transparency (true unit economics disclosed), (3) frontier-model fidelity audit (quantized inference doesn't silently degrade quality), (4) batch-vs-streaming behavior, (5) commercial availability vs research-preview gap, (6) closed-vendor benchmark cherry-pick audit.
Deep loops: 8 sweeps × 5–10 parallel Opus research agents per sweep × 1 batch round (Stage 5 POPULATE).
Sources surveyed: Inference frameworks (vLLM, SGLang, TensorRT-LLM, Triton, NVIDIA Dynamo, llama.cpp, MLX, Ollama) + closed-vendor hardware (Groq, Cerebras, SambaNova, Lightmatter, Etched Sohu, Rain.ai) + cloud-vendor inference silicon (AWS Trainium / Inferentia, Google TPU v5p / v6 / Ironwood, Microsoft Maia, OpenAI custom) + GPU-vendor frontier (NVIDIA H200 / B100 / B200, AMD MI300X / MI350, Chinese Huawei Ascend / Cambricon / Biren) + quantization fidelity audits (INT4 / FP4 capability evals, METR / Apollo / AISI quantization studies) + speculative decoding + KV-cache compression fidelity papers + B7 bridge test + independent third-party benchmark reproductions + cost-transparency studies.
Bridge-test focus: This ledger is a strong test of the rescoped B7 commercialization-vs-research axis. vLLM/SGLang/llama.cpp/MLX (open) vs Groq/Cerebras/SambaNova/Etched (closed) creates a cluster boundary that is strongly separable under the current annotations in the current cross-ledger atlas. Strong B7/B8 separation observed; B9 NULL check matched the current annotations (power/thermal/latency constrain inference but don't ground).
Empty-space test: Three signature bills (5, 8, 11) predeclared empty in v0.1 BEFORE batch 1 sweeps. After 291 unique papers, all three ★ bills HOLD with STRONG 0/N signal in current cross-ledger atlas: 0/34, 0/38, 0/20. Bill 8 (INT4/FP4 quantization) is the largest ★ bill at 38 candidates and the most-likely-to-eventually-trigger pending METR/Apollo/AISI Stage 3.5 verification.
Verification rule: Stage 3.5 verification queue pending. Priority pool of ~30 ★-bill candidate IDs (especially Bill 8 INT4/FP4 quantization fidelity claims) + 20 sweep-health spotchecks. Closure mechanisms are structural; the empty-space hypothesis is less sensitive to typical source-ID errors. The B7/B8 architectural finding is less dependent on individual source-ID verification.
Cross-ledger coupling: cross_ledger_bridges B7 RESCOPED + B8 emergent (this ledger is the strong test). compute_governance shares the disclosure-axis structural concerns. open_weight Bill 8 cousin to this ledger Bill 11 on cross-framework reproducibility. scaling_laws shares the throughput-vs-capability axis.
Reproducibility: Scripts, JSONs, ledger public. Run order: sweep dispatcher → bill_classifier.py → ledger populator → atlas review pipeline. Stage 3.5 verifier dispatch in queue.
§04

Falsification protocol

Each ★ bill becomes a checkable trigger condition. Public update committed within 7 days of any verified clean trigger of F5, F8, or F11.

F5 · ★ Cross-vendor stability
Trigger: a 2024–2026 paper demonstrating the same frontier model evaluated under matched batch / context / quantization clearing ≤10% TTFT and tokens/sec variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras — verified by an independent third-party (METR / Apollo / AISI) within 6 months.
F8 · ★ INT4/FP4 quantization fidelity
Trigger: a 2024–2026 INT4 / FP4 / INT8 quantized frontier model retaining ≥95% of FP16 capability across reasoning + math + code + multilingual + safety axes simultaneously — verified by an independent third-party (METR / Apollo / AISI quantization audit). Most-likely-to-eventually-trigger ★ bill.
F11 · ★ Universal framework coverage
Trigger: a 2024–2026 single open-source framework (vLLM, SGLang, llama.cpp, MLX, or new) running Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2 at ≤20% throughput / TTFT variance on the same hardware — verified by an independent third-party.
F-B7-collapse
Soft trigger: a 2024–2026 closed-vendor accelerator (Groq, Cerebras, SambaNova, Etched) open-sourcing core compiler / runtime, OR an open-source framework (vLLM / SGLang) reaching closed-vendor accelerator throughput parity. Either would collapse the B7/B8 cluster boundary that is strongly separable under the current annotations.
F-Cost-transparency
Soft trigger: a frontier vendor (Groq / Cerebras / OpenAI / Anthropic) publishing true $/Mtok unit economics with compute + memory + power + amortization breakdown. Would close Bill 2 (cost-per-token transparency) — currently every closed vendor pays M3 (closed weights + closed runtime + closed eval) on this axis.
F-Cherry-pick
Soft trigger: an independent third-party benchmark publication establishing matched-workload / matched-config standards across the dominant inference frameworks. Would close Bill 1 (benchmark-vs-real-workload) and create the precondition for an F5 trigger.
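The word "simultaneously" in the F8 trigger matters: the ≥95% floor applies to the worst axis, not to the average. A minimal sketch of that distinction follows, assuming higher-is-better scores on each axis. The function names and all scores are hypothetical, and the independent third-party verification step is not modeled.

```python
AXES = ("reasoning", "math", "code", "multilingual", "safety")
RETENTION_FLOOR = 0.95  # quantized score must be at least 95% of the FP16 score


def retention(fp16: dict[str, float], quantized: dict[str, float]) -> dict[str, float]:
    """Per-axis ratio of quantized score to FP16 score (higher-is-better assumed)."""
    return {axis: quantized[axis] / fp16[axis] for axis in AXES}


def f8_fires(fp16: dict[str, float], quantized: dict[str, float]) -> bool:
    """True only if every axis clears the floor at the same time."""
    return min(retention(fp16, quantized).values()) >= RETENTION_FLOOR


# Hypothetical benchmark scores, for illustration only.
fp16_scores = {"reasoning": 80.0, "math": 70.0, "code": 75.0,
               "multilingual": 82.0, "safety": 90.0}
int4_scores = {"reasoning": 78.0, "math": 60.0, "code": 73.0,
               "multilingual": 81.0, "safety": 89.0}

ratios = retention(fp16_scores, int4_scores)
average = sum(ratios.values()) / len(ratios)
print(f"average retention: {average:.3f}")
print(f"worst axis: {min(ratios, key=ratios.get)} at {min(ratios.values()):.3f}")
print(f"F8 fires: {f8_fires(fp16_scores, int4_scores)}")
```

In this made-up example the average retention clears 95% while the math axis does not, so the trigger does not fire. That is the pattern the Bill 8 census describes: degradation of at least 5% on at least one capability axis.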

Live triggered watchlist: METR / Apollo / AISI quantization audits · MLPerf Inference quarterly results · Independent benchmark reproductions (Artificial Analysis, etc.) · vLLM / SGLang / TensorRT-LLM major releases · Groq / Cerebras / SambaNova / Etched product launches · NVIDIA / AMD / Chinese silicon roadmap announcements. Monthly cadence: framework releases + accelerator launches. Quarterly: MLPerf + independent benchmark reproductions.

§05

Resources & further reading

§R

Reproducibility & data

Every empirical claim resolves to public data. Run the classifier, regenerate the heatmap, audit the corpus, file a falsification.

Public draft v0.2 (2026-05-15) — 291 unique papers across 8 sweeps; ★ Bills 5, 8, 11 HOLD pre-Stage-3.5 with the STRONG 0/N signal in the current cross-ledger atlas (combined 0/92 firings). Strong B7/B8 commercialization-vs-research separation in the current cross-ledger atlas. Real-data output from real Opus research-agent sweeps; bill counts and ★ positions emerge from the actual frontier inference-stack literature, not from a template.

Ledger populated · Stage 3.5 queue pending · 2026-05-15
Three signature constructions HOLD with STRONG 0/N.
Strong B7/B8 separation in 19-ledger atlas.
Empty space HOLDS pending verification.