The Thirteen-Bill Closure Pattern for Frontier Multilingual / Low-Resource Capability Claims (2024–2026)

A real-data falsification harness for 2024–2026 frontier multilingual / low-resource capability claims with 299-paper empirical verification, three signature-empty bills, and a Stage 3.5 verification queue pending.
Kevin Russell · ProjectForty2 · 2026-05-15 · Public draft v0.2.


Abstract

The Multilingual / Low-Resource Ledger audits 299 unique papers across 8 deep-loop sweeps (2024–2026) on frontier multilingual / low-resource capability claims — NLLB-200, Aya-Expanse, Llama 3 / 3.1 / 4 multilingual, Qwen 2.5 / 3, Gemini, Claude, Mistral Saba, Apertus, Sea Mistral, NeMo Megatron multilingual, MaLA-500, Cendol, Crosslingual-Generalist. The harness consists of 13 bills, six meta-costs (M1–M6), and three escape gates (G1–G3).

Three bills were predicted ★ empty in v0.1 BEFORE batch 1 sweeps ran: Bill 4 ★ (low-resource deep-learning parity — ≤500K-sentence language reaches ≥80% of high-resource performance), Bill 7 ★ (cross-script generalization — same model passes Latin + CJK + Arabic + Devanagari + Brahmic with ≤10pp absolute gap), and Bill 10 ★ (universal multilingual coverage — vendor frontier model passes ≥150 of 200 Flores languages above 60% BLEU).

After batch 1 (299 unique papers across 8 sweeps), all three ★ bills HOLD empty pre-Stage-3.5. Bill 4 has 33 candidates and 0 clean triggers (every candidate pays M2 / M3 / M4); Bill 7 has 46 candidates and 0 clean triggers (Brahmic and Tifinagh are the weakest-coverage script families); Bill 10 has 145 candidates — the largest active bill — and 0 clean triggers (NLLB-200 reaches ~120/200 above 60% BLEU; Aya-Expanse ~80/200; closed-vendor models report subset evaluations only). The 75% rebuttal density on Bill 4 is the highest in the ledger.

Ledger v0.2 was LOCKED before the cross-ledger Stage 3.5 verification rule (2026-05-15) was established. The "★ HOLDS empty" claim is therefore provisional pending Stage 3.5 verification of the priority candidate pool, in line with the 2026-05-15 rule pioneered by Robotics_Embodied (9/9 hallucinated breach IDs), reinforced by RL-from-Rewards (60% on flagged IDs), and confirmed at Spacetime_Discreteness (priority-pool source-ID failures).

13 + 6 · Bills + meta-costs

3 ★ · Empty bills

299 · Unique papers

v0.2 · Locked

§01 Threat model + bills (predeclared in v0.1)

Threat model (verbatim from purpose.md): Demonstrate a frontier multilingual capability claim on a system marketing multilingual or low-resource competence (NLLB-200, Aya-Expanse, Llama 3 / 4 multilingual, Qwen 2.5/3, Claude / Gemini multilingual, Mistral Saba, Apertus, MaLA-500, Cendol, Crosslingual-Generalist) — that survives six closure audits on the 2024–2026 corpus: (1) low-resource-language sample-density audit, (2) cross-script generalization (Latin / CJK / Arabic / Cyrillic / Devanagari / Brahmic / Tifinagh), (3) translation-vs-generation decoupling, (4) dialect-and-register preservation, (5) post-training-language-drift audit, (6) held-out language benchmark construction (Flores-101 → Flores-200 → Flores-Plus refresh). A clean trigger requires independent third-party verification (Stanford HELM-Multilingual / Common Voice / MasakhaneNLP / SEACrowd / METR / AISI) within 6 months.

#	Bill (v0.1)	What gets paid	Star
1	Sample-density audit	Per-language training-corpus size disclosed; minimum threshold for clean evaluation
2	Tokenizer fertility per language	Tokens-per-word disparity (English vs low-resource)
3	High-vs-low-resource gap audit	Quantified gap between top-10 and bottom-10 languages
4	Low-resource deep-learning parity	≤500K-sentence language reaches ≥80% of high-resource performance	★
5	Cross-domain transfer	Per-domain coverage across languages (news → biomedical → legal → conversational)
6	Translation-vs-generation decoupling	MT capability separated from generation capability
7	Cross-script generalization	Same model passes Latin + CJK + Arabic + Devanagari + Brahmic with ≤10pp gap	★
8	Dialect / register preservation	AAVE / Indian English / Singlish / Brazilian-vs-EU Portuguese / Maghrebi-vs-MSA Arabic
9	Post-training-language-drift audit	Does instruction-tuning erode multilingual base-model competence?
10	Universal multilingual coverage at frontier scale	≥150 of 200 Flores languages above 60% BLEU	★
11	Vendor-self-eval independence	Stanford HELM-Multilingual / MasakhaneNLP / SEACrowd / Aya-Eval reproduction
12	Held-out post-2024 benchmark construction	Flores-200 → Flores-Plus, AmericasNLP rolling refresh, Masakhane held-out
13	Anti-saturation construction	Multilingual MMLU, XTREME-R; anti-contamination by design, rolling refresh
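Bill 2's tokens-per-word disparity can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the toy tokenizer are hypothetical, and whitespace word-splitting is an assumption that does not hold for scripts written without spaces (for example CJK), where a language-specific word segmenter would be needed.

```python
def fertility(text: str, tokenize) -> float:
    """Tokens per whitespace-delimited word for one text sample."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def fertility_disparity(low_resource_text: str, english_text: str, tokenize) -> float:
    """Ratio of low-resource fertility to English fertility (1.0 means parity)."""
    return fertility(low_resource_text, tokenize) / fertility(english_text, tokenize)


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer: cuts every word into 3-character pieces."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]
```

In practice `tokenize` would be the audited model's own tokenizer, applied to parallel text so that both samples carry the same content.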

Six meta-costs (M1–M6): M1 (pre-2024 historical anchor), M2 (English-only-evaluation), M3 (Latin-script-only), M4 (single-task — MT only without generation), M5 (single-domain — news-only), M6 (implementation-specific — vendor proprietary).

Three escape gates: G1 (theoretical / proof paper), G2 (tooling / dataset construction), G3 (survey / review paper).
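How a candidate paper moves through meta-costs and escape gates can be sketched as a small classifier. This is one reading of the harness, not the ledger's actual code: it assumes that any escape gate removes the paper from consideration, that paying any meta-cost disqualifies a clean trigger, and that the third-party verification required by the threat model is recorded as a boolean. The `Candidate` type and the status strings are hypothetical.

```python
from dataclasses import dataclass, field

META_COSTS = {"M1", "M2", "M3", "M4", "M5", "M6"}
ESCAPE_GATES = {"G1", "G2", "G3"}


@dataclass
class Candidate:
    paper_id: str
    bill: int
    meta_costs: set = field(default_factory=set)
    gates: set = field(default_factory=set)
    independently_verified: bool = False  # third-party check within 6 months


def classify(c: Candidate) -> str:
    """Return 'escaped', 'pays:<costs>', 'unverified', or 'clean'."""
    unknown = (c.meta_costs - META_COSTS) | (c.gates - ESCAPE_GATES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if c.gates:  # assumption: gated papers never count as triggers
        return "escaped"
    if c.meta_costs:  # assumption: any meta-cost blocks a clean trigger
        return "pays:" + ",".join(sorted(c.meta_costs))
    if not c.independently_verified:
        return "unverified"
    return "clean"
```

Checking gates before costs means a survey paper that also happens to be English-only is reported as escaped rather than as a paying candidate; the opposite order would be equally defensible.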

Bridge-test focus · cross_ledger_bridges B2 two-speed prediction

This ledger tests the two-speed prediction of cross_ledger_bridges B2 (closure cycle 18mo → 3-4mo): vendor claims about low-resource languages have a longer half-life than claims about English. The 75% rebuttal density on Bill 4 is consistent with the prediction: low-resource claims are rebutted faster than they replicate. The 2027 audit-cycle re-poll will test whether the two-speed pattern is temporally stable.

§02 Findings: three ★ bills HOLD empty pre-Stage-3.5

★ Bill 4 — Low-resource deep-learning parity · 33 candidates · 0 clean

Bill 4 ★ requires a ≤500K-sentence low-resource language to reach ≥80% of high-resource (English) performance on the same model and the same evaluation suite. NLLB-200 is the dominant cited model; Aya-Expanse, MaLA-500, Cendol, and SEACrowd-trained models also engage. Of 33 batch-1 candidates, 0 are clean triggers: every candidate pays at least one of M2 (English-only-evaluation), M3 (Latin-script-only), or M4 (MT only without generation). Sample-density disclosure (Bill 1) catches most candidates short of the closure threshold. The 75% rebuttal density on Bill 4 is the highest in the ledger.
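The Bill 4 threshold reduces to two comparisons, sketched below. The function name and parameters are hypothetical, and the sketch assumes both scores come from the same model on the same evaluation suite and use the same metric.

```python
def bill4_parity(low_score: float, high_score: float, train_sentences: int,
                 max_sentences: int = 500_000, min_ratio: float = 0.80) -> bool:
    """True if a low-resource language meets the Bill 4 parity threshold.

    low_score / high_score: same metric, same model, same evaluation suite.
    train_sentences: disclosed training-corpus size for the low-resource language.
    """
    if high_score <= 0:
        raise ValueError("high_score must be positive")
    is_low_resource = train_sentences <= max_sentences
    reaches_parity = low_score / high_score >= min_ratio
    return is_low_resource and reaches_parity
```

A candidate that does not disclose `train_sentences` cannot be evaluated at all, which is how Bill 1's sample-density disclosure acts as a gate on Bill 4.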

★ Bill 7 · Cross-script generalization · 46 candidates · 0 clean

Bill 7 ★ requires the same model to clear ≤10pp absolute gap across Latin + CJK + Arabic + Devanagari + Brahmic on a held-out evaluation. The largest cross-script audit corpus to date is XLM-R / NLLB / Aya-Expanse spanning ~25 scripts. Of 46 batch-1 candidates, 0 clean triggers — every candidate pays M3 (Latin-script-only baseline) or M4 (single-task). Brahmic and Tifinagh remain the weakest-coverage script families across all evaluated models. Cross-script clean trigger requires the same model to clear ≥80% on all five script families simultaneously.

★ Bill 10 — Universal multilingual coverage at frontier scale · 145 candidates · 0 clean

Bill 10 ★ requires a vendor frontier model to pass ≥150 of 200 Flores languages above 60% BLEU. At 145 candidates, it is the largest active bill in the ledger. NLLB-200 reaches ~120/200 above the threshold; Aya-Expanse reaches ~80/200; closed-vendor models (Claude, Gemini, GPT-4o multilingual) report subset evaluations only. Of 145 batch-1 candidates, 0 are clean triggers: every candidate pays M2 (English-only-evaluation reporting) or M5 (single-domain, news-only) on the universal-coverage axis. The Flores-Plus + AmericasNLP rolling refresh keeps the bar moving.
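The Bill 10 count can be sketched as below. The sketch assumes scores are keyed by language code and that any language a vendor does not report counts as not passing, which is why subset evaluations cannot clear the bar. Function and parameter names are hypothetical.

```python
def bill10_coverage(scores: dict[str, float], threshold: float = 60.0,
                    required: int = 150, total: int = 200) -> tuple[int, bool]:
    """Return (languages strictly above threshold, whether the Bill 10 bar is met).

    Languages missing from `scores` are treated as unreported, so they never pass.
    """
    if len(scores) > total:
        raise ValueError("more languages reported than the benchmark contains")
    passed = sum(1 for s in scores.values() if s > threshold)
    return passed, passed >= required
```

For example, a model that reports excellent scores on only 100 of the 200 languages returns `(100, False)` regardless of how high those scores are.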

Policy lever · linguistic-resource allocation

A clean ★-bill trigger here would shift UNESCO / European Language Resources Association (ELRA) data-allocation cycles, Common Voice / Masakhane / SEACrowd / AmericasNLP rolling-refresh priorities, and vendor low-resource-language road-maps (NLLB-300 / Aya-Expanse 200+ / Apertus). Material — would reorganize multilingual-data investment for the next cycle — but not a binding-regulation lever. Cousin to Spacetime_Discreteness's funding-allocation lever rather than Factorization's NIST PQC migration lever.

§03 Verification methodology: v0.2 LOCKED, Stage 3.5 queue pending

Ledger v0.2 was LOCKED before the cross-ledger Stage 3.5 verification rule was established (2026-05-15). The lock criteria (classifier benchmark + watch-list ≥30 + falsification protocol ≥10 + author-activity panel + boxed declaration) were satisfied at lock time. However, the priority ★-bill candidate pool within the 299-paper corpus has not yet been independently verified by arXiv ID and abstract.

Stage 3.5 verification queue · 2026-05-15

The "★ HOLDS empty" claim is provisional pending Stage 3.5. A priority pool of ~30 ★-bill candidate IDs (10 per ★ bill) plus 20 sweep-health spot checks is being dispatched against the arXiv API. The methodology lesson from sibling ledgers applies: Robotics_Embodied and RL-from-Rewards saw checked-source failures, and Spacetime_Discreteness saw priority-pool source-ID failures. The expected outcome here is a non-trivial hallucination rate that requires charitable substitution from cited author corpora.
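The queue described above (10 IDs per ★ bill plus 20 spot checks) can be sketched as a seeded draw. Drawing the priority IDs at random is an assumption; the ledger may rank candidates instead. `build_stage35_queue` and `arxiv_query_url` are hypothetical helpers, and the URL follows the public arXiv API's `id_list` query form without performing any network call.

```python
import random
from urllib.parse import urlencode


def build_stage35_queue(star_pools: dict[int, list[str]], all_ids: list[str],
                        per_bill: int = 10, spotchecks: int = 20,
                        seed: int = 0) -> tuple[list[str], list[str]]:
    """Draw per_bill IDs from each star-bill pool, then spot checks from the rest."""
    rng = random.Random(seed)  # fixed seed keeps the queue reproducible
    priority: list[str] = []
    for bill in sorted(star_pools):
        pool = sorted(set(star_pools[bill]) - set(priority))
        priority += rng.sample(pool, min(per_bill, len(pool)))
    remaining = sorted(set(all_ids) - set(priority))
    spot = rng.sample(remaining, min(spotchecks, len(remaining)))
    return priority, spot


def arxiv_query_url(arxiv_ids: list[str]) -> str:
    """Build one arXiv API request URL covering a batch of IDs (no request is sent)."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    return "http://export.arxiv.org/api/query?" + urlencode(params, safe=",")
```

Each returned entry would then be compared against the ledger's recorded title and abstract; an ID that resolves to an unrelated paper, or to none, is a source-ID failure.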

The empty-space hypothesis is less sensitive to typical source-ID errors: closure mechanisms here are structural (low-resource parity is gated by sample-density disclosure, cross-script gated by tokenizer fertility, universal Flores coverage gated by training-data uniformity), not contingent on individual paper IDs. Even after charitable substitution, no real-paper substitute is expected to trigger Bills 4 / 7 / 10 cleanly — every substitute pays M2 or M4.

Cross-ledger methodology rule · v2026-05-15

Independent arXiv-ID + abstract verification is required before any breach commitment. The rule follows from the cross-ledger methodology learning of 2026-05-15 and rests on three complementary methodological tools: (1) classifier rule-tightening catches some false breaches; (2) sweep-time arXiv verification catches more; (3) Stage 3.5 catches a different class of residual errors. All three are now mandatory pre-lock conditions for every ledger in the ProjectForty2 collection. This ledger was LOCKED at v0.2 before the rule was established and is being re-audited.

§04 Lock conditions and queue

v0.2 LOCKED status (2026-05-15): classifier benchmark satisfied (✓), watch-list ≥30 entries with cadences (✓), falsification protocol ≥10 conditions (F4, F7, F10, F-Drift, F-Saturation, F-Independence, ✓), author-activity panel mapped (NLLB-Team, Cohere-Aya, Conneau, Pires, Kreutzer, Goyal, Costa-jussà, Adelani, MasakhaneNLP, SEACrowd, AmericasNLP, Bender, Joshi, Khanuja, Mistral-Saba, Apertus, ✓), boxed declaration (✓). Stage 3.5 verification queue pending — re-audit under v2026-05-15 cross-ledger rule.

Monthly cadence: vendor system-card revisions + arXiv multilingual-NLP. Quarterly: Stanford HELM-Multilingual, Flores-Plus / AmericasNLP rolling refresh, Masakhane / SEACrowd held-out evaluations. Annual / event-triggered: Common Voice annual data drops, NLLB / Aya-Expanse / Apertus annual updates, METR / AISI multilingual safety audits. Triggered: any vendor frontier model whose multilingual capability claim is reproduced to within 5pp by an independent third party on full Flores-200; any 2024–2026 paper showing that instruction-tuning erodes >15pp of multilingual base-model competence.
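The two "Triggered" conditions can be sketched as predicates. The sketch assumes scores are percentages on a shared metric, that "full Flores-200" means all 200 languages were evaluated, and that the >15pp erosion is an absolute drop from base model to instruction-tuned model. Function names are hypothetical.

```python
def reproduction_trigger(vendor_score: float, independent_score: float,
                         languages_evaluated: int, tolerance_pp: float = 5.0,
                         full_benchmark: int = 200) -> bool:
    """True if a third party reproduces the vendor score within tolerance on the full set."""
    on_full_set = languages_evaluated >= full_benchmark
    within_tolerance = abs(vendor_score - independent_score) <= tolerance_pp
    return on_full_set and within_tolerance


def drift_trigger(base_score: float, tuned_score: float,
                  threshold_pp: float = 15.0) -> bool:
    """True if instruction-tuning drops multilingual competence by more than threshold_pp."""
    return (base_score - tuned_score) > threshold_pp
```

Either predicate returning true would put the paper or system card on the watch-list for review; neither on its own constitutes a clean trigger.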

The full paper, with sweep-level evidence, the paper-by-paper audit ledger, and v0.3 lock conditions, will be released as an arXiv preprint once Stage 3.5 verification completes on the priority candidate pool.

Bridge-test status · cross_ledger_bridges B2 two-speed

The 75% rebuttal density on Bill 4 is consistent with the B2 (closure cycle 18mo → 3-4mo) two-speed prediction: the vendor-claim half-life for low-resource languages is materially longer than for English. Falsifying B2 in this ledger requires a 2024–2026 multilingual claim that reaches Bill 4 closure (≥80% of high-resource performance at ≤500K sentences) AND survives a 12-month independent re-check at the same closure threshold without rebuttal. None has been observed in the current corpus.