# Hardware Inference Stack Ledger — Bills Draft (v0.1)

> **13 bills + 6 meta-costs + 3 escape gates.** ★ marks bills 5, 8, and 11, each predicted empty.

| # | Bill | What gets paid | ★ |
|---:|---|---|:---:|
| 1 | Benchmark-vs-real-workload audit | TTFT / tokens-per-sec measured under contended load, not synthetic single-request. | |
| 2 | Cost-per-token transparency | True $/Mtok disclosed (compute + memory + power + amortization). | |
| 3 | Quantized-vs-FP16 fidelity gap | Quantized inference doesn't silently degrade frontier-model quality. | |
| 4 | Batch-vs-streaming behavior | Throughput claims hold under streaming with TTFT constraints. | |
| 5 | **★ Cross-vendor benchmark stability** | Same frontier model ≤10% TTFT / throughput variance across vLLM / SGLang / TensorRT-LLM / Groq / Cerebras under matched config. Predicted empty. | ★ |
| 6 | KV-cache / speculative-decoding fidelity | Cache compression / speculative decoding doesn't degrade output distribution. | |
| 7 | Strong-baseline comparison | Open-source (vLLM, SGLang) vs vendor-closed at matched compute. | |
| 8 | **★ Quantization fidelity audit (INT4/FP4)** | INT4/FP4 (and INT8) inference retains ≥95% of FP16 capability under independent third-party eval. Predicted empty. | ★ |
| 9 | Vendor-self-eval independence | Independent reproduction of Groq / Cerebras / SambaNova benchmark claims. | |
| 10 | Commercial availability vs research-preview | Claimed capability actually purchasable, not waitlisted / NDA / research-only. | |
| 11 | **★ Universal inference-platform coverage** | A single open-source framework runs Llama 4 + DeepSeek V3 + Qwen 3 + Mistral Large 2 at ≤20% performance variance. Predicted empty. | ★ |
| 12 | Commercialization-vs-research axis | Open (vLLM, SGLang, llama.cpp, MLX) vs closed (Groq, Cerebras, SambaNova). *B7 bridge test — purest.* | |
| 13 | Safety / model-extraction / power-efficiency audit | Model extraction via inference API, perf/W metrics, sustainability claims. | |
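
The numeric thresholds in bills 2, 5, 8, and 11 can be sketched as simple checks. This is a minimal illustration, not a defined scoring method: the ledger does not say how "variance" is measured, so the sketch assumes a relative spread, (max − min) / min, and all names (`HourlyCost`, `usd_per_mtok`, `relative_spread`, `passes_spread`, `passes_retention`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HourlyCost:
    """Hourly cost components for bill 2, in USD (hypothetical structure)."""
    compute_usd: float
    memory_usd: float
    power_usd: float
    amortization_usd: float

    def total(self) -> float:
        return (self.compute_usd + self.memory_usd
                + self.power_usd + self.amortization_usd)


def usd_per_mtok(cost: HourlyCost, sustained_tokens_per_sec: float) -> float:
    """Bill 2: dollars per million tokens at a sustained (not peak) rate."""
    if sustained_tokens_per_sec <= 0:
        raise ValueError("sustained throughput must be positive")
    tokens_per_hour = sustained_tokens_per_sec * 3600
    return cost.total() / tokens_per_hour * 1_000_000


def relative_spread(values: Sequence[float]) -> float:
    """Assumed reading of 'variance': (max - min) / min across platforms."""
    lowest = min(values)
    if lowest <= 0:
        raise ValueError("measurements must be positive")
    return (max(values) - lowest) / lowest


def passes_spread(values: Sequence[float], limit: float) -> bool:
    """Bill 5 uses limit=0.10; bill 11 uses limit=0.20."""
    return relative_spread(values) <= limit


def passes_retention(quantized_score: float, fp16_score: float,
                     floor: float = 0.95) -> bool:
    """Bill 8: the quantized score keeps at least `floor` of the FP16 score."""
    if fp16_score <= 0:
        raise ValueError("FP16 score must be positive")
    return quantized_score / fp16_score >= floor
```

For example, `passes_spread(ttft_ms_by_platform, 0.10)` would be applied once to TTFT and once to throughput for bill 5, with one measurement per runtime under matched config.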

## Meta-costs

| # | Meta-cost | Disqualifier |
|---:|---|---|
| M1 | Vendor-supplied benchmark only | No independent reproduction → disqualified. |
| M2 | Cherry-picked workload | Single prompt / short context / no contention → disqualified. |
| M3 | Closed weights + closed runtime + closed eval | Triple opacity → disqualified. |
| M4 | Hardware NDA / restricted-tier preview | Not commercially available → disqualified. |
| M5 | Marketing-grade tokens/sec | Theoretical peak, not sustained → disqualified. |
| M6 | Compiler-tuned vs untuned comparison | Vendor-tuned stack compared against an untuned default baseline → disqualified. |
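
One way to apply the meta-costs is as a screen that runs before any bill is scored. The sketch below assumes each piece of evidence has been hand-labeled with the boolean fields shown; the `Evidence` type, its field names, and `triggered_meta_costs` are hypothetical, not part of the ledger.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hand-labeled properties of one benchmark claim (hypothetical fields)."""
    independently_reproduced: bool
    contended_multi_prompt_workload: bool
    open_weights: bool
    open_runtime: bool
    open_eval: bool
    commercially_available: bool
    sustained_throughput: bool
    baseline_equally_tuned: bool


def triggered_meta_costs(e: Evidence) -> list[str]:
    """Return the codes of every meta-cost the evidence triggers."""
    checks = {
        "M1": not e.independently_reproduced,
        "M2": not e.contended_multi_prompt_workload,
        # M3 fires only on triple opacity: weights, runtime, and eval all closed.
        "M3": not (e.open_weights or e.open_runtime or e.open_eval),
        "M4": not e.commercially_available,
        "M5": not e.sustained_throughput,
        "M6": not e.baseline_equally_tuned,
    }
    return [code for code, hit in checks.items() if hit]


def is_disqualified(e: Evidence) -> bool:
    """Any single meta-cost disqualifies the evidence."""
    return bool(triggered_meta_costs(e))
```

Returning the full list of codes rather than a bare boolean keeps the reason for each disqualification visible in the ledger.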

## Escape gates

| # | Gate | Why allowed |
|---:|---|---|
| G1 | Pure research-preview chip demo | If labeled as such and no commercial claim attached — allowed (no claim → no bill). |
| G2 | Architectural taxonomy / survey | Not a capability claim — allowed if no benchmark assertion. |
| G3 | Negative-result / refutation paper | Counts as rebuttal evidence (closes bills) — allowed. |
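
The gate conditions can be sketched as a small classifier that runs before billing. The `Submission` type, its `kind` labels, and `escape_gate` are hypothetical names chosen for illustration; the conditions themselves follow the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    """One item under review (hypothetical structure)."""
    kind: str  # assumed labels: "chip_demo", "survey", "negative_result", "capability_claim"
    labeled_research_preview: bool = False
    makes_commercial_claim: bool = False
    asserts_benchmark: bool = False


def escape_gate(s: Submission) -> Optional[str]:
    """Return the gate that exempts the submission, or None if bills apply."""
    # G1: a demo labeled as research preview with no commercial claim attached.
    if (s.kind == "chip_demo" and s.labeled_research_preview
            and not s.makes_commercial_claim):
        return "G1"
    # G2: a taxonomy or survey that makes no benchmark assertion.
    if s.kind == "survey" and not s.asserts_benchmark:
        return "G2"
    # G3: negative results count as rebuttal evidence and are always allowed.
    if s.kind == "negative_result":
        return "G3"
    return None
```

A chip demo that attaches a commercial claim, or a survey that asserts a benchmark, falls through to `None` and is billed like any other capability claim.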

## Iteration plan (8 sweeps)

- **1301**: Inference frameworks (vLLM, SGLang, TensorRT-LLM, Triton, NVIDIA Dynamo, llama.cpp, MLX, Ollama)
- **1302**: Closed-vendor hardware (Groq, Cerebras, SambaNova, Lightmatter, Etched Sohu, Rain.ai)
- **1303**: Cloud-vendor inference silicon (AWS Trainium / Inferentia, Google TPU v5p/v6/Ironwood, Microsoft Maia, OpenAI custom)
- **1304**: GPU-vendor frontier (NVIDIA H200/B100/B200, AMD MI300X/MI350, Chinese vendors: Huawei Ascend / Cambricon / Biren)
- **1305**: Quantization fidelity audits (INT4/FP4 capability evals, METR / Apollo / AISI quantization studies)
- **1306**: Speculative decoding + KV-cache compression fidelity papers
- **1307**: B7 bridge test — open inference (vLLM/SGLang/llama.cpp/MLX) vs closed (Groq/Cerebras/SambaNova)
- **1308**: Independent third-party benchmark reproductions + negative results + cost-transparency studies
