WorldSpeech: A Multilingual Speech Corpus from Around the World

Published 9 May 2026 in cs.CL, cs.AI, and cs.LG | (2605.09167v1)

Abstract: Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages due to limited publicly available aligned data. To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.


Authors (4)

Summary

The paper presents a large-scale multilingual corpus with 65,000 hours of human-transcribed speech across 76 languages to address data scarcity in ASR.
The paper details an iterative alignment refinement method that boosts data retention by up to 201.1% for low-resource/non-Latin-script languages.
The paper demonstrates ASR fine-tuning improvements, achieving up to 91.7% WER reduction, validating the corpus's impact on speech recognition accuracy.

WorldSpeech: A Comprehensive Multilingual Speech Corpus for Global ASR

Introduction and Motivation

Automatic speech recognition (ASR) has advanced substantially for high-resource languages, largely because extensive, well-aligned audio-transcript datasets are available. For most of the world's languages, however, including many typologically distinct, low-resource, or dialectal varieties, publicly available high-quality aligned data remains insufficient. This scarcity slows progress in multilingual and cross-lingual ASR, limits the deployment of speech technologies across linguistic communities, and widens the digital divide in AI.

"WorldSpeech: A Multilingual Speech Corpus from Around the World" (2605.09167) directly addresses these limitations by constructing and publicly releasing a large-scale, multilingual corpus comprising 65,000 hours of human-transcribed, audio-aligned speech spanning 76 languages, with substantial coverage of dialectal and regional variation. Notably, for 48 of these languages, WorldSpeech constitutes the largest or first publicly available ground-truth aligned corpus.

Methods: Data Collection, Alignment, and Scaling

Heterogeneous Data Sourcing and Standardization

WorldSpeech draws from 79 parliamentary and public-domain sources spanning 82 countries: legislative proceedings, national and international broadcaster archives (notably RFA, VOA, and RFE/RL), audiobooks (LibriVox, Aozora), and trial transcripts. This heterogeneity required substantial engineering to standardize formats (audio to mono 24 kHz, transcripts to plain text) and to handle multi-script, multi-layout, and multilingual documents. OCR tools (Tesseract, Surya) were used where PDF text encodings were unusable, with additional language-specific normalization for script and orthographic idiosyncrasies. Intra-session code-switching, common in many parliamentary settings, was handled with automatic language detection and segment-level ASR token routing.
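
The audio standardization step (mono, 24 kHz) can be sketched as follows. This is a minimal illustration that assumes the `ffmpeg` command-line tool is installed; the source does not say which tool the authors used, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

TARGET_SAMPLE_RATE = 24_000  # Hz, the corpus sample rate stated in the text


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg invocation that writes mono 24 kHz audio.

    Assumption: ffmpeg is the conversion tool; the paper summary does not name one.
    """
    return [
        "ffmpeg",
        "-y",                             # overwrite dst if it already exists
        "-i", str(src),                   # any input format ffmpeg can decode
        "-ac", "1",                       # downmix to one channel
        "-ar", str(TARGET_SAMPLE_RATE),   # resample to 24 kHz
        str(dst),
    ]


def standardize(src: Path, dst: Path) -> None:
    """Run the conversion and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Keeping command construction separate from execution makes the conversion settings easy to inspect or log before processing thousands of files.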

Segmentation and Audio-Transcript Alignment

Raw long-form audio (1–10 hours) was segmented using Silero VAD to identify natural speech boundaries, targeting segment durations between 3 and 30 seconds. For each language, the ASR backbone was selected empirically using a 10-hour ablation: Whisper-large-v3-turbo sufficed for well-resourced and European languages, whereas MMS-1B with per-language adapters and community models were used for lower-resource settings.
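
One way to turn VAD output into segments in the 3–30 second range is to merge neighbouring speech intervals greedily. The sketch below takes speech intervals as `(start, end)` pairs in seconds, as a VAD such as Silero would produce after conversion from samples. The `max_gap` parameter and the rule of dropping out-of-range segments are assumptions made for illustration, not details from the paper.

```python
def pack_segments(speech, min_dur=3.0, max_dur=30.0, max_gap=1.0):
    """Greedily merge adjacent speech intervals into segments.

    speech: list of (start, end) tuples in seconds, sorted by start time.
    Intervals are merged while the pause between them is at most max_gap
    (an assumed parameter) and the merged span stays within max_dur.
    Segments outside [min_dur, max_dur] are dropped.
    """
    segments = []
    cur_start = cur_end = None

    def flush():
        # Keep the current segment only if its duration is in range.
        if cur_start is not None and min_dur <= cur_end - cur_start <= max_dur:
            segments.append((cur_start, cur_end))

    for start, end in speech:
        can_merge = (
            cur_start is not None
            and start - cur_end <= max_gap
            and end - cur_start <= max_dur
        )
        if can_merge:
            cur_end = end
        else:
            flush()
            cur_start, cur_end = start, end
    flush()
    return segments


speech = [(0.0, 2.0), (2.5, 6.0), (10.0, 11.0), (20.0, 55.0), (60.0, 64.0)]
print(pack_segments(speech))  # [(0.0, 6.0), (60.0, 64.0)]
```

In this toy input the first two intervals merge into a 6-second segment, the 1-second interval is too short, and the 35-second interval is too long, so both are discarded.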

Alignment used a two-stage character error rate (CER)-based search inspired by EuroSpeech [31], matching the ASR transcript of each audio segment against windows of the ground-truth transcript. Segments were retained if CER < 0.3, and each segment's CER was recorded as metadata for post-hoc quality filtering.
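
The core idea of CER-based matching can be illustrated with a simplified, single-stage window search. This is not the two-stage procedure from the paper, whose details are not given here; the exhaustive scan, the `slack` parameter, and the function names are assumptions. Only the CER definition and the 0.3 threshold come from the text.

```python
CER_THRESHOLD = 0.3  # retention threshold stated in the text


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hypothesis hyp against reference ref."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


def best_window(hyp: str, transcript_words: list[str], slack: int = 2):
    """Return (cer, start, end) of the transcript word window closest to hyp.

    Window sizes range over the hypothesis word count plus or minus slack.
    """
    n = len(hyp.split())
    best = (float("inf"), 0, 0)
    for size in range(max(1, n - slack), n + slack + 1):
        for start in range(len(transcript_words) - size + 1):
            ref = " ".join(transcript_words[start:start + size])
            score = cer(hyp, ref)
            if score < best[0]:
                best = (score, start, start + size)
    return best


transcript = "the committee will now resume its debate on the budget bill".split()
score, start, end = best_window("will now resume it debate", transcript)
keep = score < CER_THRESHOLD
print(transcript[start:end], keep)
```

An exhaustive scan like this is quadratic in transcript length, which is presumably why a practical pipeline narrows the search in stages.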

A key methodological contribution of WorldSpeech is the iterative alignment refinement loop. For languages where the first alignment pass retained little audio because the ASR model generalized poorly out of domain (often below 30% of available audio), the ASR model was fine-tuned on the first-pass aligned segments and the alignment process was re-run. This increased retained segment hours by 19.5% to 201.1% across languages, with the largest relative gains for non-Latin scripts and languages lacking any ASR-adapted resource. A third pass yielded marginal gains of less than 9%.
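
The refinement loop can be sketched as below. The `align` and `fine_tune` callables are hypothetical placeholders for the real alignment and training code, and the early-stopping rule is an assumption motivated by the reported sub-9% gains on a third pass, not a documented part of the pipeline.

```python
def refine_alignment(model, align, fine_tune, max_passes=3, min_gain=0.09):
    """Alternate alignment and fine-tuning until gains become marginal.

    align(model) -> (retained_hours, segments)      # hypothetical interface
    fine_tune(model, segments) -> improved model    # hypothetical interface
    Stops after max_passes alignment passes, or earlier once the relative
    gain in retained hours falls below min_gain (assumed stopping rule).
    """
    hours, segments = align(model)
    history = [hours]
    for _ in range(max_passes - 1):
        model = fine_tune(model, segments)
        new_hours, new_segments = align(model)
        # Relative gain: +201.1% means new_hours is about 3.011 * hours.
        gain = (new_hours - hours) / hours if hours > 0 else float("inf")
        history.append(new_hours)
        hours, segments = new_hours, new_segments
        if gain < min_gain:
            break
    return model, segments, history
```

The returned `history` lists retained hours per pass, so the relative gain of each pass can be reported alongside the final segment set.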

Corpus Composition and Scale

WorldSpeech delivers extensive depth and breadth relative to prior public corpora:

76 languages spanning all major language families and geographic regions, with 53 over 50h, 37 over 200h, 28 over 500h, and 24 exceeding 1,000h of aligned speech.
Significant dialectal coverage: multi-country variants for Spanish, Arabic, English, French, Hindi, and others.
Per-segment metadata: includes source, session, language code, duration, CER, and audio quality (DNSMOS-P.835 OVR and SNR).
Bias analysis: Sources skew toward formal-register speech (parliamentary, broadcast, read literature), although the inclusion of non-parliamentary and broadcaster sources broadens linguistic and sociolectal diversity.
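
The per-segment metadata lends itself to post-hoc quality filtering. The sketch below assumes a record layout with hypothetical field names (the released schema may differ) and uses illustrative thresholds that are not recommendations from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Field names are hypothetical; they mirror the metadata listed above.
    source: str
    session: str
    language: str
    duration_s: float
    cer: float          # alignment CER recorded at retention time
    dnsmos_ovr: float   # DNSMOS P.835 overall score
    snr_db: float


def filter_segments(segments, max_cer=0.1, min_dnsmos=3.0, min_snr_db=10.0):
    """Keep segments meeting all three quality thresholds (illustrative values)."""
    return [
        s for s in segments
        if s.cer <= max_cer and s.dnsmos_ovr >= min_dnsmos and s.snr_db >= min_snr_db
    ]


def total_hours(segments) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0
```

Tightening `max_cer` below the 0.3 retention threshold trades corpus size for alignment quality, a choice that can be made separately for each language.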

For 48 languages, including Kreol Seselwa, Lao, Burmese, and Armenian, and for several new dialectal categories, WorldSpeech is the largest or only open-access, ground-truth ASR corpus.

ASR Fine-Tuning Experiments and Results

Experimental Setup

The ASR utility of WorldSpeech was validated by fine-tuning whisper-large-v3-turbo on its aligned data for 11 languages that differ in typology and resource level. Training used AdamW, bf16 precision, and language-conditioned decoding heads, with training scripts and exhaustive compute records released for full transparency.

WER and CER Improvements

Fine-tuning with WorldSpeech yields an average relative WER reduction of 63.5%, with per-language WER/CER reductions frequently exceeding 70%. For low-resource languages with high baseline error rates (WER > 1.0, which is possible because insertions count as errors and can outnumber the reference words), fine-tuning brings WER below 1.0:

Samoan: WER from 4.72 to 0.39 (−91.7%)
Romansh: 1.31 to 0.17 (−87.5%)
Lao: 2.47 to 0.76 (−69.4%)
Burmese: 1.01 to 0.39 (−61.2%)
Georgian: 1.07 to 0.48 (−55.1%)
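
The two quantities used above can be made concrete with a short sketch: a word-level WER that can exceed 1.0 when the hypothesis contains many insertions, and the relative reduction between a baseline and a fine-tuned WER. The function names are illustrative. Percentages recomputed from the rounded WERs listed above may differ slightly from the reported figures, which were presumably derived from unrounded values.

```python
def word_edit_distance(ref_words, hyp_words) -> int:
    """Levenshtein distance over word sequences."""
    prev = list(range(len(hyp_words) + 1))
    for i, rw in enumerate(ref_words, start=1):
        cur = [i]
        for j, hw in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rw != hw),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit operations divided by reference word count."""
    ref_words = ref.split()
    return word_edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (baseline - finetuned) / baseline


# Two substitutions plus three insertions against a two-word reference.
print(wer("hello world", "hollow whirled peas and carrots"))  # 2.5

# Samoan figures from the list above.
print(round(relative_reduction(4.72, 0.39), 1))  # 91.7
```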

For moderate-baseline cases, WER decreases by 40–70% (e.g., Luxembourgish, Albanian, Armenian, Swahili). Performance curves show WER falling sharply over the first 200 hours of training data, with diminishing improvements past 500 hours, consistent with the diminishing returns typically observed when scaling ASR training data.

The iterative refinement scheme more than doubled, and in one case tripled, the aligned data volume for several low-resource/non-Latin-script languages (e.g., +150.2% Khmer, +201.1% Burmese). It does so by reusing existing human-transcribed data, with minimal additional human or computational supervision.

Limitations

Several caveats are inherent in both the corpus and methodology:

Demographic Bias: The speaker pool is predominantly adult, formally educated, and in public-facing institutional roles; spontaneous and informal registers are underrepresented, which may limit downstream transfer to conversational domains.
ASR Limitations for Alignment: Alignment efficacy remains bounded by initial ASR performance, especially for underrepresented scripts; iterative refinement mitigates but does not fully obviate this effect.
Source License Heterogeneity: Some configurations (notably RFA and RFE/RL derived sources) are provided as metadata-only datasets due to non-permissive redistribution terms, requiring user-side retrieval.

Implications and Prospective Directions

WorldSpeech shifts the practical and theoretical boundaries for multilingual ASR:

Scalable ASR Research: Enables data-efficient model adaptation, extensible pretraining, and robust benchmarking for 76 languages, many of which were previously absent from supervised speech research pipelines.
Dialect and Register Diversity: Supports finer-grained evaluation and dialect-robust speech modeling, essential for linguistic accessibility in multilingual societies.
Foundation for Cross-domain Transfer: The rich per-segment metadata and quality scores facilitate downstream domain adaptation and curriculum learning research.
Benchmarking and Methodological Validation: WorldSpeech provides a ground-truth test-bed for alignment, self-training, and quality-controlled ASR pipelines in the wild.

On a practical level, it de-risks the deployment of speech-enabled applications in the Global South and for minority languages. The public release of both corpus and codebase establishes a reproducible standard for future multilingual ASR resource development.

Conclusion

"WorldSpeech: A Multilingual Speech Corpus from Around the World" constitutes the most extensive public, human-verified, multi-domain aligned speech corpus to date for low- and mid-resource languages, with robust experimental validation of its utility for reducing ASR WER across diverse typologies. Its scalable alignment pipeline, innovative iterative refinement, and meticulous engineering of both data and metadata set a new benchmark for ASR resource development and accessibility. The corpus directly enables substantial improvements in ASR accuracy for multiple underserved languages, providing a critical infrastructure asset for inclusive speech technology research and deployment (2605.09167).
