The Hardware Inference Stack Ledger audits 291 unique papers across 8 deep-loop sweeps (2024–2026) on frontier inference-stack capability claims — vLLM, SGLang, TensorRT-LLM, Triton, NVIDIA Dynamo, Groq, Cerebras WSE-3 / CS-3, SambaNova SN40L, Tenstorrent Wormhole, Lightmatter, Etched Sohu, Rain.ai, Amazon Trainium / Inferentia 3, Google TPU v5p / v6 / Ironwood, Microsoft Maia 100/200, OpenAI custom silicon, AMD MI300X / MI350, NVIDIA B100 / B200 / H200, Huawei Ascend, Cambricon, Biren + framework optimizations (speculative decoding, MoE serving, KV-cache compression, FP8/FP4 quantization, ONNX runtime, llama.cpp, ggml, MLX, Ollama, LM Studio).
The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3). Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 5 ★ (cross-vendor benchmark stability — ≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras under matched config), Bill 8 ★ (INT4 / FP4 / INT8 quantization fidelity — retains ≥95% FP16 capability under independent third-party eval), and Bill 11 ★ (universal inference-platform coverage — single open-source framework runs Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2 at ≤20% performance variance).
After batch 1 (291 unique papers across 8 sweeps), all three ★ bills HOLD pre-Stage-3.5, giving a strong 0/N signal in the current cross-ledger atlas: 0/34 (Bill 5), 0/38 (Bill 8), and 0/20 (Bill 11), for a combined 0/92 firings. Bill 8 (INT4/FP4 quantization fidelity) is the largest active ★ bill at 38 candidates and the most likely to eventually trigger, pending METR / Apollo / AISI Stage 3.5 verification.
This ledger is a strong test of the rescoped B7 commercialization-vs-research axis. The split between vLLM / SGLang / llama.cpp / MLX (open) and Groq / Cerebras / SambaNova / Etched (closed) creates a cluster boundary that is strongly separable under the current annotations, which makes it a strong test of the cross_ledger_bridges B7 RESCOPED and B8 emergent commercialization-vs-research bridges. The B9 NULL check matched the current annotations (power / thermal / latency constrain inference but don't ground).
Threat model (verbatim from purpose.md): Demonstrate a frontier inference-stack capability claim — on a system marketing high-throughput / low-latency / low-cost inference of frontier models — that survives six closure audits on the 2024-2026 corpus: (1) benchmark-versus-real-workload audit (TTFT, tokens/sec at contended load), (2) cost-per-token transparency (true unit economics disclosed), (3) frontier-model fidelity audit (quantized inference doesn't silently degrade quality), (4) batch-vs-streaming behavior, (5) commercial availability vs research-preview gap, (6) closed-vendor benchmark cherry-pick audit.
| # | Bill (v0.1) | What gets paid | Star |
|---|---|---|---|
| 1 | Benchmark-vs-real-workload audit | TTFT / tokens-per-sec at contended load, not synthetic single-request | |
| 2 | Cost-per-token transparency | True $/Mtok disclosed (compute + memory + power + amortization) | |
| 3 | Quantized-vs-FP16 fidelity gap | Quantized inference doesn't silently degrade frontier-model quality | |
| 4 | Batch-vs-streaming behavior | Throughput claims hold under streaming with TTFT constraints | |
| 5 | Cross-vendor benchmark stability | ≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras | ★ |
| 6 | KV-cache / speculative-decoding fidelity | Cache compression / speculative decoding doesn't degrade output distribution | |
| 7 | Strong-baseline comparison | Open-source (vLLM, SGLang) vs vendor-closed at matched compute | |
| 8 | INT4 / FP4 quantization fidelity | INT4 / FP4 / INT8 retains ≥95% FP16 capability under independent third-party eval | ★ |
| 9 | Vendor-self-eval independence | Independent reproduction of Groq / Cerebras / SambaNova benchmark claims | |
| 10 | Commercial availability vs research-preview | Claimed capability actually purchasable, not waitlisted / NDA / research-only | |
| 11 | Universal inference-platform coverage | Single open-source framework runs all four frontier opens at ≤20% perf variance | ★ |
| 12 | Commercialization-vs-research axis STRONG B7/B8 | Open (vLLM, SGLang, llama.cpp, MLX) vs closed (Groq, Cerebras, SambaNova). B7 bridge test — strong in atlas | |
| 13 | Safety / model-extraction / power-efficiency | Model extraction via inference API, power/W metrics, sustainability claims | |
Six meta-costs (M1–M6): M1 (vendor-supplied benchmark only — no independent reproduction), M2 (cherry-picked workload — single prompt / short context / no contention), M3 (closed weights + closed runtime + closed eval — triple opacity), M4 (hardware NDA / restricted-tier preview), M5 (marketing-grade tokens/sec — theoretical peak, not sustained), M6 (compiler-tuned vs untuned comparison — adversarial vendor-tuned vs default-baseline).
Bill 5 ★ requires the same frontier model evaluated under matched batch size / context length / quantization to clear ≤10% TTFT and tokens/sec variance across the dominant inference frameworks (vLLM, SGLang, TensorRT-LLM, Groq, Cerebras). Of 34 batch-1 candidates, 0 clean triggers — every benchmark-comparison paper either pays M1 (vendor-supplied benchmark only), M2 (cherry-picked workload — single prompt / short context / no contention), or M5 (marketing-grade tokens/sec — theoretical peak, not sustained). Cross-vendor reproductions consistently report 30–60% variance.
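The Bill 5 ★ condition can be sketched as a small check. The ledger does not define how "variance" is computed, so this sketch assumes relative spread, (max - min) / mean, across stacks. The `Measurement` schema and the function names are hypothetical, and the numbers are illustrative rather than corpus data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    """One stack's result for the same model under matched config (hypothetical schema)."""
    stack: str                            # e.g. "vLLM", "SGLang"
    ttft_ms: float                        # time to first token
    tokens_per_sec: float                 # sustained throughput
    meta_costs: frozenset = frozenset()   # e.g. {"M1", "M5"}; empty means none paid


def relative_spread(values):
    """ASSUMPTION: 'variance' is read as (max - min) / mean; the ledger does not define it."""
    values = list(values)
    return (max(values) - min(values)) / mean(values)


def bill5_clean_trigger(measurements, threshold=0.10):
    """True only if both metrics stay within the threshold and no meta-cost is paid."""
    if len(measurements) < 2:
        return False  # cross-vendor stability needs at least two stacks
    if any(m.meta_costs for m in measurements):
        return False  # paying M1 / M2 / M5 (or any meta-cost) rules out a clean trigger
    ttft_ok = relative_spread(m.ttft_ms for m in measurements) <= threshold
    tps_ok = relative_spread(m.tokens_per_sec for m in measurements) <= threshold
    return ttft_ok and tps_ok


# Illustrative numbers only, not taken from the corpus.
runs = [
    Measurement("vLLM", 200.0, 1000.0),
    Measurement("SGLang", 210.0, 1050.0),
    Measurement("TensorRT-LLM", 260.0, 1400.0),
]
print(bill5_clean_trigger(runs))  # False: TTFT spread is about 27%
```

A different reading of "variance" (for example, coefficient of variation) would change only `relative_spread`; the meta-cost gate stays the same.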
Bill 8 ★ requires INT4 / FP4 / INT8 quantized frontier models to clear ≥95% of FP16 capability on independent third-party evaluation (METR / Apollo / AISI quantization audits). It is the largest active ★ bill. Of 38 batch-1 candidates, 0 are clean triggers: every candidate shows degradation ≥5% on at least one capability axis, often substantial (10–20% on reasoning / math / code). Power-of-2 narrowing (FP16 → FP8 → FP4) compounds the degradation. Bill 8 is also the ★ bill most likely to eventually trigger, pending METR / Apollo / AISI Stage 3.5 verification, because the engineering trajectory toward FP4 with maintained capability is plausible.
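A minimal sketch of the Bill 8 ★ check follows. It assumes capability is reported as one higher-is-better score per axis with a positive FP16 baseline, and that the ≥95% floor applies to every axis (the per-axis reading suggested by "at least one capability axis"). The function names and scores are hypothetical.

```python
def retention(quantized: dict, fp16: dict) -> dict:
    """Per-axis fraction of the FP16 score retained by the quantized model."""
    return {axis: quantized[axis] / fp16[axis] for axis in fp16}


def failing_axes(quantized: dict, fp16: dict, floor: float = 0.95) -> list:
    """Axes where the quantized model retains less than the floor."""
    return [axis for axis, r in retention(quantized, fp16).items() if r < floor]


def bill8_clean_trigger(quantized, fp16, independent_eval: bool, floor: float = 0.95) -> bool:
    """ASSUMPTION: a clean trigger needs an independent eval AND every axis at or above the floor."""
    return independent_eval and not failing_axes(quantized, fp16, floor)


# Illustrative scores only, not taken from the corpus.
fp16_scores = {"reasoning": 80.0, "math": 70.0, "code": 60.0}
int4_scores = {"reasoning": 78.0, "math": 59.5, "code": 58.0}

print(failing_axes(int4_scores, fp16_scores))                                # ['math']
print(bill8_clean_trigger(int4_scores, fp16_scores, independent_eval=True))  # False
```

If degradation composes multiplicatively across narrowing steps (an assumption, not a corpus result), two steps that each retain 97% retain about 94.1% overall, which falls below the 95% floor even though each step passes on its own.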
Bill 11 ★ requires a single open-source framework (vLLM, SGLang, llama.cpp, MLX) to run all four frontier open-weight models (Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2) at ≤20% throughput / TTFT variance on the same hardware. Of 20 batch-1 candidates, 0 clean triggers — every framework specializes (vLLM excels on Llama / Qwen, SGLang on long-context, llama.cpp on Apple Silicon / CPU, MLX on M-series). The 20-paper candidate pool is structurally bounded.
Combined: 0/92 firings across all three ★ bills (0/34 + 0/38 + 0/20), a strong 0/N signal in the current cross-ledger atlas. Compare with sibling ledgers:
spacetime_discreteness: 4 ★ bills, 0/13 + 0/9 + 0/16 + 0/19 = 0/57 firings.
multimodal_generation: 3 ★ bills, 0/8 + 0/9 + 0/39 = 0/56 firings (also clean but smaller pool).
scientific_discovery: 3 ★ bills, 10/29 + 0/17 + 0/36 = 10/82 firings (Bill 4 PARTIAL with 10 autonomous-lab triggers).
Hardware inference is the only ledger where the combined ★-bill candidate pool exceeds 90 with zero clean triggers across all three bills. This is a strong signal in the current cross-ledger atlas for B5 (the 0/N audit pattern across forensic researchers).
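The tallies quoted above can be recomputed with a short script. This is a sketch covering only the four ledgers named in this section, not the full atlas; `StarTally`, `zero_firing_over`, and the `hardware_inference` identifier are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StarTally:
    """★-bill results for one ledger as (firings, candidates) pairs."""
    ledger: str
    per_bill: tuple

    @property
    def firings(self) -> int:
        return sum(f for f, _ in self.per_bill)

    @property
    def candidates(self) -> int:
        return sum(n for _, n in self.per_bill)


# Counts as stated in this section.
TALLIES = [
    StarTally("hardware_inference", ((0, 34), (0, 38), (0, 20))),
    StarTally("spacetime_discreteness", ((0, 13), (0, 9), (0, 16), (0, 19))),
    StarTally("multimodal_generation", ((0, 8), (0, 9), (0, 39))),
    StarTally("scientific_discovery", ((10, 29), (0, 17), (0, 36))),
]


def zero_firing_over(tallies, min_pool):
    """Ledgers with zero ★ firings and a combined candidate pool above min_pool."""
    return [t.ledger for t in tallies if t.firings == 0 and t.candidates > min_pool]


for t in TALLIES:
    print(f"{t.ledger}: {t.firings}/{t.candidates}")
print(zero_firing_over(TALLIES, 90))  # ['hardware_inference']
```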
Bill 12 (commercialization-axis) fires 42 times — the dominant non-★ bill — and creates a clean cluster boundary:
Open cluster: vLLM (open-source serving), SGLang (open-source long-context), llama.cpp (open-source CPU/Apple), MLX (Apple ML framework, open), AMD ROCm (open compiler stack). Reproducible benchmarks, commodity hardware, community-driven evaluation.
Closed cluster: Groq (custom LPU + closed runtime), Cerebras (CS-3 wafer + closed compiler), SambaNova (SN40L + closed runtime), Etched Sohu (custom transformer ASIC), Lightmatter (photonic compute), Rain.ai. Vendor-supplied benchmarks, proprietary stacks, NDA-restricted preview tiers.
The cluster boundary is strongly separable under the current annotations — cleaner than RAG (mixed open + closed cloud), cleaner than multimodal_generation (78-vs-74 bipolar), cleaner than every other ledger in the 19-ledger atlas. B8 (commercialization-vs-research-artifact axis) is empirically validated here.
The "★ HOLDS empty" claim is provisional pending Stage 3.5. The verification queue consists of a priority pool of ~30 ★-bill candidate IDs (especially Bill 8 INT4/FP4 quantization fidelity claims, where METR / Apollo / AISI audits are the closure precondition) plus 20 sweep-health spotchecks dispatched against the arXiv API. The methodology lesson from sibling ledgers applies here (Robotics_Embodied and RL-from-Rewards checked-source failures, plus Spacetime_Discreteness priority-pool source-ID failures).
The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms are structural (cross-vendor stability requires standardized matched-config benchmarks; INT4/FP4 fidelity requires per-bit-width quantization-aware training + independent evaluation; universal framework coverage requires per-framework engineering investment). The B7/B8 commercialization-vs-research finding is architectural and less dependent on individual source-ID verification.
Independent arXiv-ID and abstract verification is required before any breach commitment. Three complementary methodological tools apply: classifier rule-tightening, sweep-time arXiv verification, and Stage 3.5 verification. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection.
v0.2 status (2026-05-15): Stage 5 (POPULATE) complete with 8 sweeps covering 291 unique papers (✓), classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (including F5, F8, F11, F-B7-collapse, F-Cost-transparency, F-Cherry-pick, ✓), author-activity panel mapped (vLLM, SGLang, NVIDIA-TensorRT / Triton / Dynamo, Groq, Cerebras, SambaNova, Tenstorrent, Lightmatter, Etched-Sohu, Rain-AI, Amazon-Trainium, Google-TPU, Microsoft-Maia, OpenAI-Silicon, AMD-MI300, NVIDIA-B200, Huawei-Ascend, Cambricon, Biren, llama-cpp, MLX, Ollama, METR, Apollo, AISI, ✓). Stage 3.5 verification queue pending with priority on Bill 8 INT4/FP4 quantization claims.
Cadence: monthly for framework releases (vLLM, SGLang, TensorRT-LLM) and accelerator launches (Groq, Cerebras, SambaNova, Etched); quarterly for MLPerf Inference results, independent benchmark reproductions (Artificial Analysis, etc.), and METR / Apollo / AISI quantization audits; triggered on any closed-vendor accelerator open-sourcing its core compiler / runtime, OR any open-source framework reaching closed-vendor accelerator throughput parity (either event would collapse the B7/B8 cluster boundary that is strongly separable under the current annotations).
This ledger is a strong test of the cross_ledger_bridges B7 RESCOPED + B8 emergent commercialization-vs-research bridges in the current cross-ledger atlas. Strong signals observed:
Combined 0/N signal: 0/92 across all three ★ bills (0/34 + 0/38 + 0/20).
Cluster separation: vLLM/SGLang/llama.cpp/MLX (open) vs Groq/Cerebras/SambaNova/Etched (closed) — strongly separable boundary under the current annotations.
B5 contribution: Strong contribution to the cross-ledger B5 (0/N audit pattern) bridge.
Falsification of B5 / B7 / B8 here would require simultaneous collapse of (a) one of the three ★ bills via clean trigger, AND (b) the open-vs-closed cluster boundary via cross-cluster framework convergence. Both would require evidence not currently present in the sampled corpus.
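The conjunction above can be written as a predicate. This sketch assumes that condition (b) maps onto the two "Triggered" events listed under Cadence (a closed vendor open-sourcing its core compiler / runtime, or an open framework reaching throughput parity); that mapping and all names here are hypothetical.

```python
def star_bill_triggered(firings_by_bill: dict) -> bool:
    """Condition (a): at least one ★ bill has a clean trigger."""
    return any(count > 0 for count in firings_by_bill.values())


def boundary_collapsed(closed_vendor_open_sourced: bool, open_at_parity: bool) -> bool:
    """Condition (b), ASSUMED to be either Cadence trigger event occurring."""
    return closed_vendor_open_sourced or open_at_parity


def bridges_falsified(firings_by_bill, closed_vendor_open_sourced, open_at_parity) -> bool:
    """Falsification requires (a) AND (b) at the same time."""
    return star_bill_triggered(firings_by_bill) and boundary_collapsed(
        closed_vendor_open_sourced, open_at_parity
    )


# State reported in this ledger: 0 firings on Bills 5, 8, 11 and no boundary event.
current = {5: 0, 8: 0, 11: 0}
print(bridges_falsified(current, False, False))  # False
```

Under this reading, a clean trigger on one ★ bill alone is not enough; the boundary must also collapse.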