The Multilingual / Low-Resource Ledger audits 299 unique papers across 8 deep-loop sweeps (2024–2026) on frontier multilingual / low-resource capability claims — NLLB-200, Aya-Expanse, Llama 3 / 3.1 / 4 multilingual, Qwen 2.5 / 3, Gemini, Claude, Mistral Saba, Apertus, Sea Mistral, NeMo Megatron multilingual, MaLA-500, Cendol, Crosslingual-Generalist. The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).
Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 4 ★ (low-resource deep-learning parity — ≤500K-sentence language reaches ≥80% of high-resource performance), Bill 7 ★ (cross-script generalization — same model passes Latin + CJK + Arabic + Devanagari + Brahmic with ≤10pp absolute gap), and Bill 10 ★ (universal multilingual coverage — vendor frontier model passes ≥150 of 200 Flores languages above 60% BLEU).
After batch 1 (299 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5. Bill 4 has 33 candidates and 0 clean triggers (every candidate pays M2 / M3 / M4); Bill 7 has 46 candidates and 0 clean triggers (Brahmic and Tifinagh are the weakest-coverage script families); Bill 10 has 145 candidates — the largest active bill — and 0 clean triggers (NLLB-200 reaches ~120/200 above 60% BLEU; Aya-Expanse ~80/200; closed-vendor models report subset evaluations only). The 75% rebuttal density on Bill 4 is the highest in the ledger.
The ledger v0.2 LOCKED before the cross-ledger Stage 3.5 verification rule (2026-05-15) was established. The "★ HOLDS empty" claim is provisional pending Stage 3.5 verification of the priority candidate pool, in line with the 2026-05-15 rule pioneered by Robotics_Embodied (9/9 hallucinated breach IDs), reinforced by RL-from-Rewards (60% on flagged IDs), and confirmed at Spacetime_Discreteness (priority-pool source-ID failures).
Threat model (verbatim from purpose.md): Demonstrate a frontier multilingual capability claim on a system marketing multilingual or low-resource competence (NLLB-200, Aya-Expanse, Llama 3 / 4 multilingual, Qwen 2.5/3, Claude / Gemini multilingual, Mistral Saba, Apertus, MaLA-500, Cendol, Crosslingual-Generalist) — that survives six closure audits on the 2024–2026 corpus: (1) low-resource-language sample-density audit, (2) cross-script generalization (Latin / CJK / Arabic / Cyrillic / Devanagari / Brahmic / Tifinagh), (3) translation-vs-generation decoupling, (4) dialect-and-register preservation, (5) post-training-language-drift audit, (6) held-out language benchmark construction (Flores-101 → Flores-200 → Flores-Plus refresh). A clean trigger requires independent third-party verification (Stanford HELM-Multilingual / Common Voice / MasakhaneNLP / SEACrowd / METR / AISI) within 6 months.
| # | Bill (v0.1) | What gets paid | Star |
|---|---|---|---|
| 1 | Sample-density audit | Per-language training-corpus size disclosed; minimum threshold for clean evaluation | |
| 2 | Tokenizer fertility per language | Tokens-per-word disparity (English vs low-resource) | |
| 3 | High-vs-low-resource gap audit | Quantified gap between top-10 and bottom-10 languages | |
| 4 | Low-resource deep-learning parity | ≤500K-sentence language reaches ≥80% of high-resource performance | ★ |
| 5 | Cross-domain transfer | Per-domain coverage across languages (news → biomedical → legal → conversational) | |
| 6 | Translation-vs-generation decoupling | MT capability separated from generation capability | |
| 7 | Cross-script generalization | Same model passes Latin + CJK + Arabic + Devanagari + Brahmic with ≤10pp gap | ★ |
| 8 | Dialect / register preservation | AAVE / Indian English / Singlish / Brazilian-vs-EU Portuguese / Maghrebi-vs-MSA Arabic | |
| 9 | Post-training-language-drift audit | Does instruction-tuning erode multilingual base-model competence? | |
| 10 | Universal multilingual coverage at frontier scale | ≥150 of 200 Flores languages above 60% BLEU | ★ |
| 11 | Vendor-self-eval independence | Stanford HELM-Multilingual / MasakhaneNLP / SEACrowd / Aya-Eval reproduction | |
| 12 | Held-out post-2024 benchmark construction | Flores-200 → Flores-Plus, AmericasNLP rolling refresh, Masakhane held-out | |
| 13 | Anti-saturation construction | Multilingual MMLU, XTREME-R; anti-contamination by design, rolling refresh | |
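To make the Bill 2 quantity concrete, here is a minimal sketch of a tokens-per-word (fertility) disparity measure. It assumes whitespace-delimited words, which does not hold for scripts such as CJK, and `toy_tokenize` is a hypothetical stand-in for a real model tokenizer, not anything used by the ledger.

```python
from typing import Callable, List, Sequence


def fertility(tokenize: Callable[[str], Sequence[str]], sentences: Sequence[str]) -> float:
    """Mean number of tokens per whitespace-delimited word over a sample."""
    n_words = sum(len(s.split()) for s in sentences)
    if n_words == 0:
        raise ValueError("sample contains no words")
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words


def fertility_disparity(
    tokenize: Callable[[str], Sequence[str]],
    low_resource_sample: Sequence[str],
    english_sample: Sequence[str],
) -> float:
    """Ratio above 1 means the low-resource sample needs more tokens per word."""
    return fertility(tokenize, low_resource_sample) / fertility(tokenize, english_sample)


def toy_tokenize(text: str, chunk: int = 3) -> List[str]:
    """Hypothetical tokenizer: cuts each word into pieces of at most `chunk` characters."""
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]
```

In practice `tokenize` would wrap the tokenizer of the model under audit, and both samples would be parallel text so that content is held constant.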
Six meta-costs (M1–M6): M1 (pre-2024 historical anchor), M2 (English-only-evaluation), M3 (Latin-script-only), M4 (single-task — MT only without generation), M5 (single-domain — news-only), M6 (implementation-specific — vendor proprietary).
Three escape gates: G1 (theoretical / proof paper), G2 (tooling / dataset construction), G3 (survey / review paper).
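The interaction between bills, meta-costs, and escape gates can be sketched as a small classifier. This is an illustration under assumptions: the `Candidate` fields, the outcome labels, and the order of the checks are hypothetical, and the third-party flag stands in for the independent-verification requirement stated in the threat model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

META_COSTS = {
    "M1": "pre-2024 historical anchor",
    "M2": "English-only-evaluation",
    "M3": "Latin-script-only",
    "M4": "single-task (MT only without generation)",
    "M5": "single-domain (news-only)",
    "M6": "implementation-specific (vendor proprietary)",
}
ESCAPE_GATES = {
    "G1": "theoretical / proof paper",
    "G2": "tooling / dataset construction",
    "G3": "survey / review paper",
}


@dataclass(frozen=True)
class Candidate:
    paper_id: str                      # placeholder identifier, not a real arXiv ID
    bill: int                          # bill number, 1 to 13
    meets_threshold: bool              # does the paper reach the bill's stated bar?
    meta_costs: FrozenSet[str] = frozenset()
    escape_gate: Optional[str] = None
    third_party_verified: bool = False

    def __post_init__(self) -> None:
        unknown = set(self.meta_costs) - set(META_COSTS)
        if unknown:
            raise ValueError(f"unknown meta-costs: {sorted(unknown)}")
        if self.escape_gate is not None and self.escape_gate not in ESCAPE_GATES:
            raise ValueError(f"unknown escape gate: {self.escape_gate}")


def classify(c: Candidate) -> str:
    """Assumed ordering: gates first, then threshold, then meta-costs, then verification."""
    if c.escape_gate is not None:
        return "escaped"
    if not c.meets_threshold:
        return "no-trigger"
    if c.meta_costs:
        return "paid-trigger"
    if not c.third_party_verified:
        return "unverified"
    return "clean-trigger"
```

Under this reading, a ★ bill holds empty as long as no candidate for it classifies as `"clean-trigger"`.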
This ledger tests the two-speed prediction of cross_ledger_bridges B2 (closure cycle 18mo → 3-4mo): vendor claims for low-resource languages have a longer half-life than claims for English. The 75% rebuttal density on Bill 4 is consistent with the prediction: low-resource claims are rebutted faster than they replicate. The 2027 audit cycle re-poll will test whether the two-speed pattern is temporally stable.
Bill 4 ★ requires a ≤500K-sentence low-resource language to reach ≥80% of high-resource (English) performance on the same model + same evaluation suite. NLLB-200 is the dominant cited model; Aya-Expanse, MaLA-500, Cendol, and SEACrowd-trained models also engage. Of 33 batch-1 candidates, 0 are clean triggers: every candidate pays at least one of M2 (English-only-evaluation), M3 (Latin-script-only), or M4 (MT only without generation). Sample-density disclosure (Bill 1) catches most candidates short of the closure threshold. The 75% rebuttal density on Bill 4 is the highest in the ledger.
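The Bill 4 threshold can be written as a single predicate. The sketch below assumes both scores come from the same model and the same evaluation suite and are on the same scale; the function and parameter names are illustrative only.

```python
def bill4_parity(
    train_sentences: int,
    low_resource_score: float,
    high_resource_score: float,
    max_sentences: int = 500_000,
    min_ratio: float = 0.80,
) -> bool:
    """True when a <=500K-sentence language reaches >=80% of the high-resource score."""
    if high_resource_score <= 0:
        raise ValueError("high-resource score must be positive")
    is_low_resource = train_sentences <= max_sentences
    ratio = low_resource_score / high_resource_score
    return is_low_resource and ratio >= min_ratio
```

Meeting this predicate is necessary but not sufficient for a clean trigger: a candidate that passes it while paying M2, M3, or M4 still does not count.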
Bill 7 ★ requires the same model to clear ≤10pp absolute gap across Latin + CJK + Arabic + Devanagari + Brahmic on a held-out evaluation. The largest cross-script audit corpus to date is XLM-R / NLLB / Aya-Expanse spanning ~25 scripts. Of 46 batch-1 candidates, 0 clean triggers — every candidate pays M3 (Latin-script-only baseline) or M4 (single-task). Brahmic and Tifinagh remain the weakest-coverage script families across all evaluated models. Cross-script clean trigger requires the same model to clear ≥80% on all five script families simultaneously.
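A minimal sketch of the Bill 7 check follows. It assumes per-script scores on a 0 to 100 scale and interprets the "absolute gap" as best score minus worst score across the five families; that interpretation and the function name are assumptions, while the ≤10pp and ≥80% bounds are taken from the text above.

```python
from typing import Mapping

SCRIPT_FAMILIES = ("Latin", "CJK", "Arabic", "Devanagari", "Brahmic")


def bill7_cross_script(
    scores: Mapping[str, float],
    max_gap_pp: float = 10.0,
    min_score: float = 80.0,
) -> bool:
    """True when all five families score >= min_score and best-minus-worst <= max_gap_pp."""
    missing = [s for s in SCRIPT_FAMILIES if s not in scores]
    if missing:
        raise KeyError(f"no score reported for: {missing}")
    values = [scores[s] for s in SCRIPT_FAMILIES]
    gap = max(values) - min(values)
    return gap <= max_gap_pp and min(values) >= min_score
```

Raising on a missing family, rather than skipping it, reflects the requirement that the same model be evaluated on all five families at once.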
Bill 10 ★ requires a vendor frontier model to pass ≥150 of 200 Flores languages above 60% BLEU. It is the largest active bill in the ledger, at 145 candidates. NLLB-200 reaches ~120/200 above the threshold; Aya-Expanse ~80/200; closed-vendor models (Claude, Gemini, GPT-4o multilingual) report subset evaluations only. Of 145 batch-1 candidates, 0 are clean triggers: every candidate pays M2 (English-only-evaluation reporting) or M5 (single-domain, news-only) on the universal-coverage axis. The Flores-Plus + AmericasNLP rolling refresh keeps the bar moving.
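The coverage count behind Bill 10 can be sketched as below. Treating "above 60% BLEU" as a strict inequality, and refusing to pass a subset evaluation via `require_full_eval`, are assumptions made for illustration; the names are hypothetical.

```python
from typing import Mapping, Tuple


def bill10_coverage(
    bleu_by_language: Mapping[str, float],
    threshold: float = 60.0,
    min_passing: int = 150,
    total_languages: int = 200,
    require_full_eval: bool = True,
) -> Tuple[int, bool]:
    """Return (languages strictly above threshold, whether the coverage bar is met)."""
    n_passing = sum(1 for score in bleu_by_language.values() if score > threshold)
    is_full_eval = len(bleu_by_language) >= total_languages
    meets_bar = n_passing >= min_passing and (is_full_eval or not require_full_eval)
    return n_passing, meets_bar
```

With `require_full_eval=True`, a report covering only a subset of languages cannot meet the bar regardless of its scores, which matches how subset-only vendor evaluations are treated above.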
A clean ★-bill trigger here would shift UNESCO / European Language Resources Association (ELRA) data-allocation cycles, Common Voice / Masakhane / SEACrowd / AmericasNLP rolling-refresh priorities, and vendor low-resource-language road-maps (NLLB-300 / Aya-Expanse 200+ / Apertus). The effect would be material, since it would reorganize multilingual-data investment for the next cycle, but it is not a binding-regulation lever. It is a cousin to Spacetime_Discreteness's funding-allocation lever rather than to Factorization's NIST PQC migration lever.
The ledger v0.2 LOCKED before the cross-ledger Stage 3.5 verification rule was established (2026-05-15). The lock criteria (classifier benchmark + watch-list ≥30 + falsification protocol ≥10 + author-activity panel + boxed declaration) were satisfied at lock time. The 299-paper corpus has not yet been independently arXiv-ID + abstract verified at the priority ★-bill candidate pool.
The "★ HOLDS empty" claim is provisional pending Stage 3.5. The priority pool consists of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spot checks, all dispatched against the arXiv API. The methodology lesson from sibling ledgers applies: Robotics_Embodied and RL-from-Rewards showed checked-source failures, and Spacetime_Discreteness showed priority-pool source-ID failures. Expected outcome here: a non-trivial hallucination rate that requires charitable substitution from cited author corpora.
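One way to assemble such a pool is sketched below. Seeded random sampling is an assumption (the ledger may prioritize candidates differently), and the regular expression only checks that an ID has the post-2007 arXiv shape `YYMM.NNNNN`. A well-formed ID can still point at the wrong paper, so this is a pre-filter ahead of the arXiv lookup and abstract comparison, not a substitute for them.

```python
import random
import re
from typing import Dict, List, Sequence, Tuple

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def build_priority_pool(
    candidates_by_bill: Dict[int, Sequence[str]],
    all_ids: Sequence[str],
    per_bill: int = 10,
    spot_checks: int = 20,
    seed: int = 0,
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Pick `per_bill` IDs per star bill, then `spot_checks` IDs from the rest of the corpus."""
    rng = random.Random(seed)  # fixed seed keeps the pool reproducible
    pool: Dict[int, List[str]] = {}
    for bill in sorted(candidates_by_bill):
        ids = sorted(set(candidates_by_bill[bill]))
        pool[bill] = rng.sample(ids, min(per_bill, len(ids)))
    already_chosen = {i for ids in pool.values() for i in ids}
    remaining = sorted(set(all_ids) - already_chosen)
    spot = rng.sample(remaining, min(spot_checks, len(remaining)))
    return pool, spot


def malformed_ids(ids: Sequence[str]) -> List[str]:
    """IDs that cannot be valid new-style arXiv identifiers."""
    return [i for i in ids if not ARXIV_ID.match(i)]
```

IDs that survive `malformed_ids` would then be looked up individually and their returned titles and abstracts compared against what the ledger recorded.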
The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms here are structural (low-resource parity is gated by sample-density disclosure, cross-script gated by tokenizer fertility, universal Flores coverage gated by training-data uniformity), not contingent on individual paper IDs. Even after charitable substitution, no real-paper substitute is expected to trigger Bills 4 / 7 / 10 cleanly — every substitute pays M2 or M4.
Stage 3.5 requires independent arXiv-ID + abstract verification before any breach commitment, a rule driven by the cross-ledger methodology learning of 2026-05-15. Three methodological tools complement each other: (1) classifier rule-tightening catches some false breaches; (2) sweep-time arXiv verification catches more; (3) Stage 3.5 catches a different class of residual errors. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection. This ledger LOCKED v0.2 before the rule was established and is being re-audited.
v0.2 LOCKED status (2026-05-15): classifier benchmark satisfied (✓); watch-list ≥30 entries with cadences (✓); falsification protocol ≥10 conditions: F4, F7, F10, F-Drift, F-Saturation, F-Independence (✓); author-activity panel mapped: NLLB-Team, Cohere-Aya, Conneau, Pires, Kreutzer, Goyal, Costa-jussà, Adelani, MasakhaneNLP, SEACrowd, AmericasNLP, Bender, Joshi, Khanuja, Mistral-Saba, Apertus (✓); boxed declaration (✓). Stage 3.5 verification queue pending; re-audit under the v2026-05-15 cross-ledger rule.
Monthly cadence: vendor system-card revisions + arXiv multilingual-NLP. Quarterly: Stanford HELM-Multilingual, Flores-Plus / AmericasNLP rolling refresh, Masakhane / SEACrowd held-out evaluations. Annual / event-triggered: Common Voice annual data drops, NLLB / Aya-Expanse / Apertus annual updates, METR / AISI multilingual safety audits. Triggered: any vendor frontier model whose multilingual capability claim is reproduced to within ≤5pp by an independent third party on full Flores-200; any 2024–2026 paper showing that instruction-tuning erodes >15pp of multilingual base-model competence.
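The two event triggers reduce to simple threshold tests. The sketch below assumes all scores are on the same percentage-point scale and measured on the same evaluation set; the function names are illustrative.

```python
def reproduction_trigger(
    vendor_score: float,
    third_party_score: float,
    tolerance_pp: float = 5.0,
) -> bool:
    """True when an independent result lands within `tolerance_pp` of the vendor claim."""
    return abs(vendor_score - third_party_score) <= tolerance_pp


def drift_trigger(
    base_model_score: float,
    instruction_tuned_score: float,
    max_erosion_pp: float = 15.0,
) -> bool:
    """True when instruction-tuning loses more than `max_erosion_pp` points versus the base model."""
    return (base_model_score - instruction_tuned_score) > max_erosion_pp
```

The drift test is one-sided on purpose: an instruction-tuned model that improves on its base does not fire the trigger.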
The full write-up, with sweep-level evidence, the paper-by-paper audit ledger, and v0.3 lock conditions, will be released as an arXiv preprint after Stage 3.5 verification completes on the priority candidate pool.
The 75% rebuttal density on Bill 4 is consistent with B2 (closure-cycle 18mo → 3-4mo) two-speed prediction: low-resource vendor-claim half-life is materially longer than English. Falsification of B2-here requires a 2024–2026 multilingual claim that reaches Bill 4 closure (≥80% of high-resource at ≤500K sentences) AND survives 12-month independent re-check at the same closure threshold without rebuttal. None observed in current corpus.