- The paper presents a large-scale multilingual corpus with 65,000 hours of human-transcribed speech across 76 languages to address data scarcity in ASR.
- The paper details an iterative alignment refinement method that boosts data retention by up to 201.1% for low-resource/non-Latin-script languages.
- The paper demonstrates ASR fine-tuning improvements, achieving up to 91.7% WER reduction, validating the corpus's impact on speech recognition accuracy.
WorldSpeech: A Comprehensive Multilingual Speech Corpus for Global ASR
Introduction and Motivation
Automatic Speech Recognition (ASR) has advanced substantially for high-resource languages, largely because extensive, reliably aligned audio-transcript datasets are available. For most of the world's languages, including many typologically distinct, low-resource, or dialectal varieties, publicly available, high-quality aligned data remains insufficient. This scarcity slows progress in multilingual and cross-lingual ASR, limits the deployment of speech technologies across diverse linguistic communities, and helps sustain the AI digital divide.
"WorldSpeech: A Multilingual Speech Corpus from Around the World" (2605.09167) directly addresses these limitations by constructing and publicly releasing a large-scale, multilingual corpus comprising 65,000 hours of human-transcribed, audio-aligned speech spanning 76 languages, with substantial coverage of dialectal and regional variation. Notably, for 48 of these languages, WorldSpeech constitutes the largest or first publicly available ground-truth aligned corpus.
Methods: Data Collection, Alignment, and Scaling
Heterogeneous Data Sourcing and Standardization
WorldSpeech draws from 79 parliamentary and public-domain sources across 82 countries, including legislative proceedings, national and international broadcaster archives (notably RFA, VOA, and RFE/RL), audiobooks (LibriVox, Aozora), and trial transcripts. Source heterogeneity required substantial engineering for format standardization (audio to mono 24 kHz, transcripts to plain text) and tailored handling of multi-script, multi-layout, and multilingual documents. OCR tools (Tesseract, Surya) were used to handle problematic PDF encodings, with additional language-specific normalization for script and orthographic idiosyncrasies. Intra-session code-switching, common in many parliamentary settings, was resolved using automatic language detection and segment-level ASR token routing.
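The audio standardization step (mono, 24 kHz) can be illustrated with a minimal pure-Python sketch. The function names and the linear-interpolation resampler are assumptions made for illustration; the paper does not state which resampling method or tooling was used, and a production pipeline would normally use a band-limited resampler with an anti-aliasing filter.

```python
from typing import List, Sequence

TARGET_RATE = 24_000  # target sample rate stated in the text


def downmix_to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame into a single mono sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Naive linear-interpolation resampler (illustrative, no anti-aliasing)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, `resample_linear(downmix_to_mono(stereo_frames), 48_000)` would produce a mono 24 kHz sample list from 48 kHz stereo frames.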
Segmentation and Audio-Transcript Alignment
Raw long-form audio (1–10 hours) was segmented using Silero VAD to identify natural speech boundaries, targeting segment durations between 3 and 30 seconds. For each language, the ASR backbone was selected empirically via a 10-hour ablation: Whisper-large-v3-turbo sufficed for well-resourced and European languages, whereas MMS-1B per-language adapters and community models were used for lower-resource settings.
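A minimal sketch of how VAD output might be packed into 3–30 second segments follows. It assumes the VAD step has already produced a list of `(start, end)` speech intervals in seconds; the function name `pack_segments` and the `max_gap` pause threshold are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def pack_segments(speech: List[Interval], min_len: float = 3.0,
                  max_len: float = 30.0, max_gap: float = 0.5) -> List[Interval]:
    """Merge consecutive speech intervals into segments of min_len..max_len seconds.

    An interval is merged into the current segment when the result stays within
    max_len and either the segment is still shorter than min_len or the pause
    before the interval is shorter than max_gap (an assumed threshold).
    """
    segments: List[Interval] = []
    cur: Optional[Interval] = None
    for start, end in speech:
        if cur is None:
            cur = (start, end)
            continue
        fits = end - cur[0] <= max_len
        too_short = cur[1] - cur[0] < min_len
        short_pause = start - cur[1] < max_gap
        if fits and (too_short or short_pause):
            cur = (cur[0], end)          # extend across the pause
        else:
            segments.append(cur)         # close at a natural boundary
            cur = (start, end)
    if cur is not None:
        segments.append(cur)
    # Drop anything that still falls outside the target duration range.
    return [(s, e) for s, e in segments if min_len <= e - s <= max_len]
```

Closing segments only at pauses keeps cuts on speech boundaries; intervals that cannot reach the minimum length, or that exceed the maximum on their own, are discarded in this sketch rather than force-split.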
Alignment employed a two-stage character error rate (CER)-based search inspired by EuroSpeech [31], matching ASR transcripts of audio segments against windows of the ground-truth transcript. Segments were retained if CER < 0.3, and the CER was recorded as per-segment metadata for post-hoc quality filtering.
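The CER-based matching can be sketched as below. The text specifies only a two-stage CER search and the 0.3 retention threshold; the particular split used here (a coarse strided scan followed by a fine scan around the best coarse hit), the fixed window width, and all function names are assumptions for illustration.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hyp against ref."""
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(hyp, ref) / len(ref)


def align(hyp: str, transcript: str, threshold: float = 0.3,
          coarse_stride: int = 8) -> Optional[Dict[str, float]]:
    """Find the transcript window best matching the ASR hypothesis."""
    width = len(hyp)
    last = max(0, len(transcript) - width)

    def score(pos: int) -> float:
        return cer(hyp, transcript[pos:pos + width])

    # Stage 1 (assumed): coarse scan at a fixed stride.
    coarse = min(range(0, last + 1, coarse_stride), key=score)
    # Stage 2 (assumed): exhaustive scan around the best coarse position.
    lo, hi = max(0, coarse - coarse_stride), min(last, coarse + coarse_stride)
    best = min(range(lo, hi + 1), key=score)
    err = score(best)
    if err >= threshold:
        return None                      # segment is discarded
    return {"start": best, "end": best + width, "cer": err}
```

Returning the CER alongside the matched span mirrors the per-segment metadata described above, so stricter thresholds can be applied later without re-running alignment.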
Iterative Alignment Refinement
A significant methodological advancement in WorldSpeech is the iterative alignment refinement loop. For languages where the first-pass alignment yield was low because the ASR model generalized poorly out of domain (often below 30% of available audio), the ASR model was fine-tuned on first-pass aligned segments and the alignment process was re-run. This second pass increased retained segment hours by +19.5% to +201.1% across languages, with the largest relative gains occurring for non-Latin scripts and languages lacking any ASR-adapted resource. A third pass yielded diminishing (<9%) marginal returns.
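The refinement loop can be sketched as follows. The callables `align_pass` and `fine_tune` are hypothetical placeholders for the alignment and fine-tuning steps, and using the reported <9% third-pass gain as a stopping threshold (`min_gain`) is an assumption, not a rule stated in the paper.

```python
from typing import Any, Callable, List, Tuple

AlignFn = Callable[[Any], Tuple[List[Any], float]]  # model -> (segments, retained hours)
TuneFn = Callable[[Any, List[Any]], Any]            # (model, segments) -> new model


def relative_gain(before_hours: float, after_hours: float) -> float:
    """Relative increase in retained hours, e.g. 2.011 for +201.1%."""
    return (after_hours - before_hours) / before_hours


def refine(model: Any, align_pass: AlignFn, fine_tune: TuneFn,
           max_passes: int = 3, min_gain: float = 0.09) -> Tuple[List[Any], List[float]]:
    """Alternate fine-tuning and re-alignment until gains become marginal."""
    segments, hours = align_pass(model)
    history = [hours]
    for _ in range(max_passes - 1):
        if hours <= 0:
            break                        # nothing to fine-tune on
        model = fine_tune(model, segments)
        new_segments, new_hours = align_pass(model)
        gain = relative_gain(hours, new_hours)
        if new_hours > hours:
            segments, hours = new_segments, new_hours
            history.append(hours)
        if gain < min_gain:
            break                        # diminishing returns
    return segments, history
```

Note that a relative gain of +201.1% corresponds to roughly three times the first-pass volume, since the retained hours become `hours * (1 + 2.011)`.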
Corpus Composition and Scale
WorldSpeech delivers extensive depth and breadth relative to prior public corpora:
- 76 languages spanning all major language families and geographic regions, with 53 over 50h, 37 over 200h, 28 over 500h, and 24 exceeding 1,000h of aligned speech.
- Significant dialectal coverage: multi-country variants for Spanish, Arabic, English, French, Hindi, and others.
- Per-segment metadata: includes source, session, language code, duration, CER, and audio quality (DNSMOS-P.835 OVR and SNR).
- Bias analysis: Sources skew towards formal-register speech (parliamentary, broadcast, read literature), but the inclusion of non-parliamentary and broadcaster sources broadens linguistic and sociolectal diversity.
For 48 languages, including Kreol Seselwa, Lao, Burmese, and Armenian, and for several new dialectal categories, WorldSpeech is the largest or only open-access, ground-truth ASR corpus.
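The per-segment metadata lends itself to post-hoc quality filtering. The record below is a sketch: the field names and the filter thresholds are hypothetical and chosen only for illustration; the actual schema and column names in the released corpus may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Segment:
    source: str        # originating source, e.g. a parliament or broadcaster
    session: str       # session or recording identifier
    lang: str          # language code
    duration_s: float  # segment duration in seconds
    cer: float         # alignment CER recorded at retention time (< 0.3)
    dnsmos_ovr: float  # DNSMOS P.835 overall quality score
    snr_db: float      # signal-to-noise ratio in dB


def filter_segments(segments: Iterable[Segment], max_cer: float = 0.1,
                    min_ovr: float = 3.0, min_snr_db: float = 15.0) -> List[Segment]:
    """Keep segments that pass stricter, user-chosen quality thresholds."""
    return [s for s in segments
            if s.cer <= max_cer and s.dnsmos_ovr >= min_ovr and s.snr_db >= min_snr_db]


def total_hours(segments: Iterable[Segment]) -> float:
    """Sum segment durations and convert to hours."""
    return sum(s.duration_s for s in segments) / 3600.0
```

Because the CER is stored per segment, a user can trade volume for precision by tightening `max_cer` and then checking `total_hours` on the filtered subset.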
ASR Fine-Tuning Experiments and Results
Experimental Setup
The ASR utility of WorldSpeech was validated by fine-tuning whisper-large-v3-turbo on its aligned data for 11 languages that differ in typology and resource level. Training used AdamW, bf16, and language-conditioned decoding heads, with training scripts and exhaustive compute records released for full transparency.
WER and CER Improvements
Fine-tuning with WorldSpeech yields an average relative WER reduction of 63.5%, with per-language WER/CER reductions frequently exceeding 70%. For low-resource languages with high baseline error rates (WER > 1.0), fine-tuning brings WER below 1.0:
- Samoan: WER from 4.72 to 0.39 (−91.7%)
- Romansh: 1.31 to 0.17 (−87.5%)
- Lao: 2.47 to 0.76 (−69.4%)
- Burmese: 1.01 to 0.39 (−61.2%)
- Georgian: 1.07 to 0.48 (−55.1%)
For moderate-baseline cases, WER decreases by 40–70% (e.g., Luxembourgish, Albanian, Armenian, Swahili). Performance curves show steep WER reductions over the first 200 hours of training data, with diminishing improvements past 500 hours, consistent with observed scaling laws for ASR.
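The figures above rest on two simple computations, sketched here with hypothetical function names. WER is the word-level edit distance divided by the number of reference words, so it can exceed 1.0 when the hypothesis contains many inserted words; relative reduction is the drop in WER divided by the baseline WER.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, fine_tuned: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline."""
    return (baseline - fine_tuned) / baseline


# Samoan figures from the list above: 4.72 -> 0.39
samoan = relative_reduction(4.72, 0.39)  # about 0.917, i.e. the reported -91.7%
```

Recomputing from the rounded WERs shown in the list gives values within a few tenths of a percentage point of the reported reductions (for example, about 87.0% for Romansh against the reported 87.5%), presumably because the reported percentages were derived from unrounded error rates.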
Iterative Refinement Gains
The iterative refinement scheme demonstrated the ability to double or triple the aligned data volume for several low-resource/non-Latin-script languages (e.g., +150.2% Khmer, +201.1% Burmese), directly leveraging human-transcribed data with minimal additional human or computational supervision.
Limitations
Several caveats are inherent in both the corpus and methodology:
- Demographic Bias: The speaker pool is predominantly adult, formally educated, and in public-facing institutional roles; spontaneous and informal registers are underrepresented, which may compromise downstream transfer to conversational domains.
- ASR Limitations for Alignment: Alignment efficacy remains bounded by initial ASR performance, especially for underrepresented scripts; iterative refinement mitigates but does not fully obviate this effect.
- Source License Heterogeneity: Some configurations (notably RFA and RFE/RL derived sources) are provided as metadata-only datasets due to non-permissive redistribution terms, requiring user-side retrieval.
Implications and Prospective Directions
WorldSpeech shifts the practical and theoretical boundaries for multilingual ASR:
- Scalable ASR Research: Enables data-efficient model adaptation, extensible pretraining, and robust benchmarking for 76 languages, many of which were previously absent from supervised speech research pipelines.
- Dialect and Register Diversity: Supports finer-grained evaluation and dialect-robust speech modeling, essential for linguistic accessibility in multilingual societies.
- Foundation for Cross-domain Transfer: The rich per-segment metadata and quality scores facilitate downstream domain adaptation and curriculum learning research.
- Benchmarking and Methodological Validation: WorldSpeech provides a ground-truth test-bed for alignment, self-training, and quality-controlled ASR pipelines in the wild.
On a practical level, it de-risks the deployment of speech-enabled applications in the Global South and for minority languages. The public release of both corpus and codebase establishes a reproducible standard for future multilingual ASR resource development.
Conclusion
"WorldSpeech: A Multilingual Speech Corpus from Around the World" constitutes the most extensive public, human-verified, multi-domain aligned speech corpus to date for low- and mid-resource languages, with robust experimental validation of its utility for reducing ASR WER across diverse typologies. Its scalable alignment pipeline, innovative iterative refinement, and meticulous engineering of both data and metadata set a new benchmark for ASR resource development and accessibility. The corpus directly enables substantial improvements in ASR accuracy for multiple underserved languages, providing a critical infrastructure asset for inclusive speech technology research and deployment (2605.09167).