Public draft · v0.2 · May 2026 · Hardware Inference Stack Ledger

★ Public Draft · v0.2 · Real Data · 19th of 19 · STRONG 0/N in atlas · 8 Sweeps · 291 Papers

The Thirteen-Bill Closure Pattern for Frontier Inference-Stack Capability Claims (2024–2026)

A real-data falsification harness for 2024–2026 frontier inference-stack capability claims with 291-paper empirical verification, three signature-empty bills, and the STRONG 0/N signal observed in the current cross-ledger atlas (combined 0/92 ★-bill firings).
Kevin Russell · ProjectForty2 · 2026-05-15 · Public draft v0.2.

Abstract

The Hardware Inference Stack Ledger audits 291 unique papers across 8 deep-loop sweeps (2024–2026) on frontier inference-stack capability claims — vLLM, SGLang, TensorRT-LLM, Triton, NVIDIA Dynamo, Groq, Cerebras WSE-3 / CS-3, SambaNova SN40L, Tenstorrent Wormhole, Lightmatter, Etched Sohu, Rain.ai, Amazon Trainium / Inferentia 3, Google TPU v5p / v6 / Ironwood, Microsoft Maia 100/200, OpenAI custom silicon, AMD MI300X / MI350, NVIDIA B100 / B200 / H200, Huawei Ascend, Cambricon, Biren + framework optimizations (speculative decoding, MoE serving, KV-cache compression, FP8/FP4 quantization, ONNX runtime, llama.cpp, ggml, MLX, Ollama, LM Studio).

The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3). Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (cross-vendor benchmark stability — ≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras under matched config), Bill 8 ★ (INT4 / FP4 / INT8 quantization fidelity — retains ≥95% FP16 capability under independent third-party eval), and Bill 11 ★ (universal inference-platform coverage — single open-source framework runs Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2 at ≤20% performance variance).

After batch 1 (291 unique papers across 8 sweeps), all three ★ bills HOLD pre-Stage-3.5 with the STRONG 0/N signal in the current cross-ledger atlas: 0/34, 0/38, 0/20 — combined 0/92 firings across all three ★ bills. Bill 8 (INT4/FP4 quantization fidelity) is the largest active ★ bill at 38 candidates and the most-likely-to-eventually-trigger pending METR / Apollo / AISI Stage 3.5 verification.

This ledger is a strong test of the rescoped B7 commercialization-vs-research axis. The split between open stacks (vLLM / SGLang / llama.cpp / MLX) and closed stacks (Groq / Cerebras / SambaNova / Etched) produces a cluster boundary that is strongly separable under the current annotations, which makes the ledger a strong test of the cross_ledger_bridges B7 RESCOPED + B8 emergent commercialization-vs-research bridges in the current cross-ledger atlas. The B9 NULL check matched the current annotations (power / thermal / latency constrain inference but don't ground).

13+6 Bills + meta-costs · 3 ★ Empty bills · 0/92 STRONG 0/N in atlas · 291 Unique Papers

§01 Threat model + bills (predeclared in v0.1)

Threat model (verbatim from purpose.md): Demonstrate a frontier inference-stack capability claim — on a system marketing high-throughput / low-latency / low-cost inference of frontier models — that survives six closure audits on the 2024-2026 corpus: (1) benchmark-versus-real-workload audit (TTFT, tokens/sec at contended load), (2) cost-per-token transparency (true unit economics disclosed), (3) frontier-model fidelity audit (quantized inference doesn't silently degrade quality), (4) batch-vs-streaming behavior, (5) commercial availability vs research-preview gap, (6) closed-vendor benchmark cherry-pick audit.

#	Bill (v0.1)	What gets paid	Star
1	Benchmark-vs-real-workload audit	TTFT / tokens-per-sec at contended load, not synthetic single-request
2	Cost-per-token transparency	True $/Mtok disclosed (compute + memory + power + amortization)
3	Quantized-vs-FP16 fidelity gap	Quantized inference doesn't silently degrade frontier-model quality
4	Batch-vs-streaming behavior	Throughput claims hold under streaming with TTFT constraints
5	Cross-vendor benchmark stability	≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras	★
6	KV-cache / speculative-decoding fidelity	Cache compression / speculative decoding doesn't degrade output distribution
7	Strong-baseline comparison	Open-source (vLLM, SGLang) vs vendor-closed at matched compute
8	INT4 / FP4 quantization fidelity	INT4 / FP4 / INT8 retains ≥95% FP16 capability under independent third-party eval	★
9	Vendor-self-eval independence	Independent reproduction of Groq / Cerebras / SambaNova benchmark claims
10	Commercial availability vs research-preview	Claimed capability actually purchasable, not waitlisted / NDA / research-only
11	Universal inference-platform coverage	Single open-source framework runs all four frontier opens at ≤20% perf variance	★
12	Commercialization-vs-research axis (STRONG B7/B8)	Open (vLLM, SGLang, llama.cpp, MLX) vs closed (Groq, Cerebras, SambaNova). B7 bridge test, strong in atlas
13	Safety / model-extraction / power-efficiency	Model extraction via inference API, power/W metrics, sustainability claims

Six meta-costs (M1–M6): M1 (vendor-supplied benchmark only — no independent reproduction), M2 (cherry-picked workload — single prompt / short context / no contention), M3 (closed weights + closed runtime + closed eval — triple opacity), M4 (hardware NDA / restricted-tier preview), M5 (marketing-grade tokens/sec — theoretical peak, not sustained), M6 (compiler-tuned vs untuned comparison — adversarial vendor-tuned vs default-baseline).
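
One way to read the bill findings is that a candidate counts as a clean trigger only if it meets the bill's threshold and pays none of the six meta-costs. The Python sketch below encodes that reading. The `MetaCost` enum, the `is_clean_trigger` helper, and the rule itself are illustrative assumptions, not code from the ledger's harness.

```python
from enum import Enum


class MetaCost(Enum):
    """The six meta-costs, with short labels paraphrased from the ledger."""
    M1 = "vendor-supplied benchmark only"
    M2 = "cherry-picked workload"
    M3 = "closed weights + closed runtime + closed eval"
    M4 = "hardware NDA / restricted-tier preview"
    M5 = "marketing-grade tokens/sec"
    M6 = "compiler-tuned vs untuned comparison"


def is_clean_trigger(meets_threshold, meta_costs_paid):
    """Assumed rule: clean means the threshold is met and no meta-cost is paid."""
    return bool(meets_threshold) and len(meta_costs_paid) == 0


# Hypothetical candidate: it reports numbers inside the threshold, but only
# from a vendor-supplied, single-prompt benchmark.
print(is_clean_trigger(True, {MetaCost.M1, MetaCost.M2}))
```

Under this reading, a paper that reports threshold-clearing numbers still does not fire a ★ bill while any meta-cost is attached to it.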

§02 Findings: STRONG 0/N signal in 19-ledger atlas

★ Bill 5 — Cross-vendor benchmark stability · 34 candidates · 0 clean

Bill 5 ★ requires the same frontier model evaluated under matched batch size / context length / quantization to clear ≤10% TTFT and tokens/sec variance across the dominant inference frameworks (vLLM, SGLang, TensorRT-LLM, Groq, Cerebras). Of 34 batch-1 candidates, 0 clean triggers — every benchmark-comparison paper either pays M1 (vendor-supplied benchmark only), M2 (cherry-picked workload — single prompt / short context / no contention), or M5 (marketing-grade tokens/sec — theoretical peak, not sustained). Cross-vendor reproductions consistently report 30–60% variance.
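
The ledger does not define how "variance" is computed, so the sketch below assumes relative spread, (max - min) / mean, across frameworks measured under a matched configuration. The function names and the sample numbers are hypothetical and serve only to show the shape of the check.

```python
from statistics import mean


def relative_spread(values):
    """Return (max - min) / mean, one assumed reading of "variance"."""
    values = list(values)
    if len(values) < 2:
        raise ValueError("need measurements from at least two frameworks")
    return (max(values) - min(values)) / mean(values)


def clears_bill_5(ttft_ms, tokens_per_sec, limit=0.10):
    """True if both metrics stay within `limit` spread across frameworks."""
    return (relative_spread(ttft_ms.values()) <= limit
            and relative_spread(tokens_per_sec.values()) <= limit)


# Hypothetical matched-config measurements, not data from the corpus.
ttft_ms = {"vLLM": 100.0, "SGLang": 104.0, "TensorRT-LLM": 98.0}
tokens_per_sec = {"vLLM": 1000.0, "SGLang": 1400.0, "TensorRT-LLM": 1200.0}

print(clears_bill_5(ttft_ms, tokens_per_sec))  # False: throughput spread is about 33%
```

Both metrics must pass: here TTFT spread is about 6%, but the throughput spread alone is enough to fail the bill.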

★ Bill 8 — INT4 / FP4 / INT8 quantization fidelity · 38 candidates · 0 clean

Bill 8 ★ requires INT4 / FP4 / INT8 quantized frontier models to retain ≥95% of FP16 capability on independent third-party evaluation (METR / Apollo / AISI quantization audits). At 38 candidates it is the largest active ★ bill. Of the 38 batch-1 candidates, 0 are clean triggers: every candidate degrades by more than 5% on at least one capability axis, often substantially (10–20% on reasoning / math / code). Each halving of bit width (FP16 → FP8 → FP4) compounds the degradation. Bill 8 is also the ★ bill most likely to eventually trigger, pending METR / Apollo / AISI Stage 3.5 verification, because the engineering trajectory toward FP4 with maintained capability is plausible.
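
Bill 8's threshold can be read as a per-axis retention ratio: the quantized score divided by the FP16 score, which must stay at or above 0.95 on every capability axis. The sketch below assumes that reading. The axis names, scores, and function names are hypothetical.

```python
def retention_by_axis(fp16_scores, quantized_scores):
    """Quantized score as a fraction of the FP16 score, per capability axis."""
    return {axis: quantized_scores[axis] / fp16_scores[axis]
            for axis in fp16_scores}


def clears_bill_8(fp16_scores, quantized_scores, floor=0.95):
    """True only if every axis retains at least `floor` of FP16 capability."""
    retention = retention_by_axis(fp16_scores, quantized_scores)
    return all(ratio >= floor for ratio in retention.values())


# Hypothetical benchmark scores, not data from the corpus.
fp16 = {"reasoning": 80.0, "math": 60.0, "code": 70.0}
int4 = {"reasoning": 78.0, "math": 51.0, "code": 68.0}

for axis, ratio in retention_by_axis(fp16, int4).items():
    print(f"{axis}: {ratio:.1%}")
print(clears_bill_8(fp16, int4))  # False: math retains only 85% of FP16
```

Checking each axis separately, instead of an averaged score, matches the finding that candidates fail on at least one axis even when others hold up.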

★ Bill 11 — Universal inference-platform coverage · 20 candidates · 0 clean

Bill 11 ★ requires a single open-source framework (vLLM, SGLang, llama.cpp, MLX) to run all four frontier open-weight models (Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2) at ≤20% throughput / TTFT variance on the same hardware. Of 20 batch-1 candidates, 0 clean triggers — every framework specializes (vLLM excels on Llama / Qwen, SGLang on long-context, llama.cpp on Apple Silicon / CPU, MLX on M-series). The 20-paper candidate pool is structurally bounded.
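
Bill 11 combines a coverage condition (all four models run) with a spread condition (≤20%). The sketch below rests on two assumptions: that per-model results are normalized so they are comparable across models (here, a hypothetical fraction of a per-model reference throughput on the same hardware), and that "variance" means (max - min) / mean. Framework data and names are illustrative.

```python
REQUIRED_MODELS = ("Llama 4", "DeepSeek V3", "Qwen 3", "Mistral Large 2")


def clears_bill_11(normalized_perf, limit=0.20):
    """Check coverage of all four models, then the spread across them.

    `normalized_perf` maps model name -> performance normalized to a
    per-model reference (an assumed convention). Unsupported models are absent.
    """
    if any(model not in normalized_perf for model in REQUIRED_MODELS):
        return False  # coverage fails: at least one model does not run
    values = [normalized_perf[model] for model in REQUIRED_MODELS]
    spread = (max(values) - min(values)) / (sum(values) / len(values))
    return spread <= limit


# Hypothetical frameworks, for illustration only.
specialist = {"Llama 4": 1.00, "DeepSeek V3": 0.55,
              "Qwen 3": 0.95, "Mistral Large 2": 0.70}
partial = {"Llama 4": 0.98, "Qwen 3": 0.97}

print(clears_bill_11(specialist))  # False: spread across models exceeds 20%
print(clears_bill_11(partial))     # False: two required models are missing
```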

STRONG 0/N signal · combined 0/92 across all three ★ bills

Combined: 0/92 firings across all three ★ bills (0/34 + 0/38 + 0/20) — the STRONG 0/N signal in the current cross-ledger atlas. Compare with sibling ledgers:

spacetime_discreteness: 4 ★ bills, 13 + 9 + 16 + 19 candidates, 0/57 firings.
multimodal_generation: 3 ★ bills, 0/8 + 0/9 + 0/39 = 0/56 firings (also clean, but a smaller pool).
scientific_discovery: 3 ★ bills, 10/29 + 0/17 + 0/36 = 10/82 firings (Bill 4 PARTIAL with 10 autonomous-lab triggers).

Hardware inference is the only ledger where the combined ★-bill candidate pool exceeds 90 with zero clean triggers across all three bills. This is a strong signal in the current cross-ledger atlas for B5 (the 0/N audit pattern across forensic researchers).
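
The tallies in this comparison can be re-derived mechanically. The sketch below sums the (firings, candidates) pairs quoted in this section and then filters for ledgers with zero firings and a pool above 90. It covers only the four ledgers cited here, not the full 19-ledger atlas, and the `hardware_inference` key is a label chosen for illustration.

```python
# (firings, candidates) per ★ bill, as quoted in this section.
STAR_BILLS = {
    "hardware_inference": [(0, 34), (0, 38), (0, 20)],
    "spacetime_discreteness": [(0, 13), (0, 9), (0, 16), (0, 19)],
    "multimodal_generation": [(0, 8), (0, 9), (0, 39)],
    "scientific_discovery": [(10, 29), (0, 17), (0, 36)],
}


def combined(pairs):
    """Sum firings and candidate counts across a ledger's ★ bills."""
    firings = sum(fired for fired, _ in pairs)
    candidates = sum(pool for _, pool in pairs)
    return firings, candidates


totals = {ledger: combined(pairs) for ledger, pairs in STAR_BILLS.items()}

for ledger, (firings, candidates) in totals.items():
    print(f"{ledger}: {firings}/{candidates}")

# Ledgers with zero firings and a combined pool above 90 candidates.
clean_and_large = [ledger for ledger, (firings, candidates) in totals.items()
                   if firings == 0 and candidates > 90]
print(clean_and_large)  # ['hardware_inference']
```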

B7 RESCOPED + B8 emergent · strong separation in 19-ledger atlas

Bill 12 (commercialization-axis) fires 42 times — the dominant non-★ bill — and creates a clean cluster boundary:

Open cluster: vLLM (open-source serving), SGLang (open-source long-context), llama.cpp (open-source CPU/Apple), MLX (Apple ML framework, open), AMD ROCm (open compiler stack). Reproducible benchmarks, commodity hardware, community-driven evaluation.

Closed cluster: Groq (custom LPU + closed runtime), Cerebras (CS-3 wafer + closed compiler), SambaNova (SN40L + closed runtime), Etched Sohu (custom transformer ASIC), Lightmatter (photonic compute), Rain.ai. Vendor-supplied benchmarks, proprietary stacks, NDA-restricted preview tiers.

The cluster boundary is strongly separable under the current annotations — cleaner than RAG (mixed open + closed cloud), cleaner than multimodal_generation (78-vs-74 bipolar), cleaner than every other ledger in the 19-ledger atlas. B8 (commercialization-vs-research-artifact axis) is empirically validated here.

§03 Verification methodology: Stage 3.5 queue pending · Bill 8 highest priority

Stage 3.5 verification queue · 2026-05-15

The "★ HOLDS empty" claim is provisional pending Stage 3.5. The queue holds a priority pool of ~30 ★-bill candidate IDs (especially Bill 8 INT4/FP4 quantization fidelity claims, where METR / Apollo / AISI audits are the closure precondition) plus 20 sweep-health spot checks dispatched against the arXiv API. The methodology lesson from sibling ledgers applies here: Robotics_Embodied and RL-from-Rewards had checked-source failures, and Spacetime_Discreteness had priority-pool source-ID failures.

The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (cross-vendor stability requires standardized matched-config benchmarks; INT4/FP4 fidelity requires per-bit-width quantization-aware training + independent evaluation; universal framework coverage requires per-framework engineering investment). The B7/B8 commercialization-vs-research finding is architectural and less dependent on individual source-ID verification.

Cross-ledger methodology rule · v2026-05-15

Rule: independently verify each arXiv ID and its abstract before committing to any breach. Three complementary methodological tools support this: classifier rule-tightening, sweep-time arXiv verification, and Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
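
A cheap first step in sweep-time verification is an offline format check on each cited ID before any lookup. The sketch below assumes the current arXiv identifier scheme (`YYMM.NNNNN` with an optional `vN` suffix, in use since 2015) and the ledger's 2024–2026 window. `arxiv_id_in_window` is a hypothetical helper, not part of the ledger's tooling. It does not confirm that a paper exists or that its abstract matches the claim; that still requires a lookup and a read.

```python
import re

# New-style arXiv ID: two-digit year, month 01-12, five-digit sequence
# number, optional version suffix, optional "arXiv:" prefix.
ARXIV_ID = re.compile(
    r"^(?:arXiv:)?(\d{2})(0[1-9]|1[0-2])\.(\d{5})(?:v[1-9]\d*)?$",
    re.IGNORECASE,
)


def arxiv_id_in_window(raw, first_year=2024, last_year=2026):
    """True if `raw` is well-formed and its year falls inside the window."""
    match = ARXIV_ID.match(raw.strip())
    if match is None:
        return False
    year = 2000 + int(match.group(1))
    return first_year <= year <= last_year


# Made-up IDs that exercise each branch of the check.
for candidate in ["2405.01234", "arXiv:2501.12345v2", "2313.00001", "1706.03762"]:
    print(candidate, arxiv_id_in_window(candidate))
```

IDs that fail this check (a malformed month, or a year outside the corpus window) can be routed straight to manual review instead of spending a lookup on them.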

§04 Lock conditions and queue

v0.2 status (2026-05-15): Stage 5 (POPULATE) complete with 8 sweeps × 291 papers (✓), classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (F5, F8, F11, F-B7-collapse, F-Cost-transparency, F-Cherry-pick, ✓), author-activity panel mapped (vLLM, SGLang, NVIDIA-TensorRT / Triton / Dynamo, Groq, Cerebras, SambaNova, Tenstorrent, Lightmatter, Etched-Sohu, Rain-AI, Amazon-Trainium, Google-TPU, Microsoft-Maia, OpenAI-Silicon, AMD-MI300, NVIDIA-B200, Huawei-Ascend, Cambricon, Biren, llama-cpp, MLX, Ollama, METR, Apollo, AISI, ✓). Stage 3.5 verification queue pending with priority on Bill 8 INT4/FP4 quantization claims.

Cadence. Monthly: framework releases (vLLM, SGLang, TensorRT-LLM) + accelerator launches (Groq, Cerebras, SambaNova, Etched). Quarterly: MLPerf Inference results + independent benchmark reproductions (Artificial Analysis, etc.) + METR / Apollo / AISI quantization audits. Triggered: any closed-vendor accelerator open-sourcing its core compiler / runtime, OR any open-source framework reaching closed-vendor accelerator throughput parity (either event would collapse the B7/B8 cluster boundary, which is strongly separable under the current annotations).

Bridge-test status · cross_ledger_bridges B7 RESCOPED + B8 emergent (STRONG test)

This ledger is a strong test of the cross_ledger_bridges B7 RESCOPED + B8 emergent commercialization-vs-research bridges in the current cross-ledger atlas. Strong signals observed:

Combined 0/N signal: 0/92 across all three ★ bills (0/34 + 0/38 + 0/20).
Cluster separation: vLLM/SGLang/llama.cpp/MLX (open) vs Groq/Cerebras/SambaNova/Etched (closed) — strongly separable boundary under the current annotations.
B5 contribution: Strong contribution to the cross-ledger B5 (0/N audit pattern) bridge.

Falsification of B5 / B7 / B8 here would require simultaneous collapse of (a) one of the three ★ bills via clean trigger, AND (b) the open-vs-closed cluster boundary via cross-cluster framework convergence. Both would require evidence not currently present in the sampled corpus.