One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

Published 28 Apr 2026 in eess.AS and cs.CL | (2604.26136v1)

Abstract: Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.


Authors (2)

Summary

The paper introduces an ensemble distillation pipeline that leverages multiple teacher models to create high-quality synthetic data for scientific speech.
The paper demonstrates parameter-efficient domain adaptation using per-language LoRA modules to maintain high speaker similarity while lowering error rates.
The paper validates composite evaluation metrics with Whisper and ECAPA-TDNN, achieving robust performance across Arabic, Chinese, and French scientific content.

Cross-Lingual Voice Cloning for Scientific Speech: A Technical Perspective

Task Motivation and Problem Context

This work addresses a core problem in spoken language technology: generating cross-lingual synthetic speech that both preserves speaker identity and achieves high intelligibility for scientific content. The research is motivated by the need for academic voice cloning systems capable of conveying technical material—characterized by domain-specific terminology, code-switching, and distinct prosody—across diverse languages. General-purpose multilingual Text-to-Speech (TTS) foundation models, while effective on broad data distributions, tend to underperform for specialized scientific speech unless carefully domain-adapted.

Methodological Innovations

Ensemble Distillation on Domain-Specific Data

A central methodological contribution is the ensemble distillation pipeline, which uses multiple high-performing zero-shot voice cloning models to compensate for the lack of parallel training data in the scientific domain. Specifically, the authors synthesize target-language speech for each utterance in the ACL 60/60 academic corpus using three teacher models (OmniVoice, VoxCPM, Chatterbox). Each candidate is scored with a composite quality metric that averages an intelligibility term (derived from CER on Whisper transcripts) and a speaker-similarity term (cosine similarity between ECAPA-TDNN embeddings). Only the best-scoring synthetic sample for each utterance is retained for fine-tuning, resulting in a curated dataset that reflects both domain-specific linguistic complexity and high fidelity to speaker characteristics.
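The best-of-ensemble selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Candidate` class, the function names, and the toy scores are hypothetical, and the summary does not say how CER (lower is better) is combined with similarity (higher is better), so the sketch assumes CER is flipped to `1 - CER` before averaging.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    teacher: str  # teacher model that produced this synthetic sample
    cer: float    # character error rate of an ASR transcript (lower is better)
    sim: float    # cosine similarity of speaker embeddings (higher is better)


def composite_score(c: Candidate) -> float:
    # Assumption: CER is converted to an accuracy-like term so that both
    # components point in the same direction before they are averaged.
    intelligibility = max(0.0, 1.0 - c.cer)
    return 0.5 * (intelligibility + c.sim)


def select_best(candidates: list[Candidate]) -> Candidate:
    # Keep only the highest-scoring candidate for one utterance.
    return max(candidates, key=composite_score)


# Toy scores for a single utterance (illustrative values, not from the paper).
utterance_candidates = [
    Candidate("OmniVoice", cer=0.08, sim=0.74),
    Candidate("VoxCPM", cer=0.05, sim=0.70),
    Candidate("Chatterbox", cer=0.12, sim=0.76),
]
best = select_best(utterance_candidates)
print(best.teacher, round(composite_score(best), 3))
```

In this toy case no teacher wins on both terms at once, which is the situation where an equal-weight average is useful as a tie-breaker.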

Parameter-Efficient Domain Adaptation

Building atop the OmniVoice foundation model, the authors apply parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) modules. Unlike monolithic multilingual adapters, which can dilute language-specific phonological representations, separate LoRA modules are trained for Arabic, Chinese, and French. The base weights stay frozen and only low-rank updates to the self-attention and feed-forward network (FFN) weights are trained, which keeps computational cost low and reduces the risk of catastrophic forgetting. Training stability is further enhanced by integrating Rank-Stabilized LoRA (RSLoRA) and by employing a carefully managed learning rate schedule and an autoregressive cross-entropy objective over discrete audio tokens derived from a HIGGS-based tokenizer.
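To make the cost argument concrete, the sketch below counts trainable parameters for one adapted weight matrix and compares the standard LoRA scaling factor, alpha / r, with the rank-stabilized variant, alpha / sqrt(r). The layer width, rank, and alpha are hypothetical values chosen for illustration; the summary does not report the paper's actual hyperparameters.

```python
import math


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA learns A (rank x d_in) and B (d_out x rank); the base matrix is frozen.
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int, rank_stabilized: bool) -> float:
    # Standard LoRA scales the update B @ A by alpha / rank.
    # Rank-stabilized LoRA scales it by alpha / sqrt(rank) instead.
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


# Hypothetical settings, not taken from the paper.
d_model, rank, alpha = 1024, 16, 32.0

full_params = d_model * d_model
adapter_params = lora_trainable_params(d_model, d_model, rank)
fraction = adapter_params / full_params

print(f"trainable fraction per matrix: {fraction:.4f}")
print("LoRA scale:  ", lora_scale(alpha, rank, rank_stabilized=False))
print("RSLoRA scale:", lora_scale(alpha, rank, rank_stabilized=True))
```

With these assumed values the adapter trains about 3% of the parameters of the matrix it modifies, and the rank-stabilized factor shrinks more slowly as the rank grows.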

Experimental Pipeline and Numerical Results

Dataset Composition and Synthesis

The distilled training set comprises 1,404 utterances (468 per target language), generated through the best-of-ensemble strategy described above. Non-primary teacher models contributed over 26% of the final selected samples, which shows the value of aggregating across architectures.
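The dataset arithmetic and the teacher-contribution statistic can be reproduced with a short tally. In this sketch the list of winning teachers is toy data, the function name is hypothetical, and "primary" is assumed to mean OmniVoice (the model that is later fine-tuned); the summary does not define the term.

```python
from collections import Counter

LANGUAGES = ["Arabic", "Chinese", "French"]
UTTERANCES_PER_LANGUAGE = 468
total_utterances = UTTERANCES_PER_LANGUAGE * len(LANGUAGES)  # 1,404 as reported


def non_primary_share(winners: list[str], primary: str) -> float:
    # Fraction of selected samples that came from any teacher other than `primary`.
    if not winners:
        raise ValueError("winners must not be empty")
    return sum(1 for w in winners if w != primary) / len(winners)


# Toy selection log: which teacher won each utterance (not the paper's data).
winners = ["OmniVoice"] * 7 + ["VoxCPM"] * 2 + ["Chatterbox"] * 1

counts = Counter(winners)
share = non_primary_share(winners, primary="OmniVoice")
print(total_utterances, dict(counts), share)
```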

Comparative Evaluation

Performance is evaluated on both a four-speaker subset and a comprehensive blind test set derived from scientific publications and cross-accent English reference speakers. Key metrics are WER and CER for intelligibility and speaker similarity (SIM) for identity transfer. CER is emphasized for Chinese, where written text does not mark word boundaries. Several state-of-the-art baselines are compared, including Chatterbox, Qwen3-TTS, XTTS-V2, and VoxCPM2.
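For readers unfamiliar with these metrics, the following sketch shows how WER, CER, and cosine similarity are typically computed. It uses only the standard library; in the evaluation described above, the hypothesis text would come from a Whisper transcript and the vectors from ECAPA-TDNN speaker embeddings, neither of which is reproduced here. Stripping spaces before computing CER is an assumption of this sketch, and text normalization conventions vary between toolkits.

```python
import math


def edit_distance(ref: list, hyp: list) -> int:
    # Levenshtein distance between two token sequences, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level edits divided by the reference word count.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    # Character error rate; spaces are dropped first (assumed convention).
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(wer("the model clones the voice", "the model clone the voice"))
print(cer("语音克隆", "语音可隆"))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

CER needs no word segmentation, which is why it is the more natural intelligibility measure for Chinese text.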

Notable Quantitative Outcomes

French: Qwen3-TTS achieved the lowest error rates (WER = 0.050), but OmniVoice provided the highest SIM (0.753), indicating superior speaker preservation.
Chinese: Fine-tuned OmniVoice achieved the highest SIM (0.719 on the full blind set), though some baselines reached a lower CER on subsamples, highlighting the complexity of domain adaptation in tonal languages.
Arabic: Fine-tuned OmniVoice outperformed XTTS-V2 and VoxCPM2 on both CER and SIM (CER = 0.071, SIM = 0.723), demonstrating the efficacy of per-language adapters.

Fine-tuning consistently reduced WER and CER for all languages, with only marginal changes in SIM, confirming that the domain adaptation process did not degrade fidelity to the reference speaker. The results collectively demonstrate that parameter-efficient tuning on ensemble-distilled data is highly effective for domain and language specialization of large TTS models.

Theoretical and Practical Implications

The study reinforces the effectiveness of ensemble distillation for curating high-quality synthetic data in low-resource domains and demonstrates that PEFT methods, particularly LoRA variants, can efficiently specialize foundation TTS models for scientific content. The methodology is directly extensible to other domains where data scarcity and domain transfer are bottlenecks, e.g., legal or medical speech. By releasing their framework and training code, the authors facilitate reproducibility and provide a pipeline for further research.

A notable implication is the validation of composite automated metrics (Whisper for CER and ECAPA-TDNN for SIM) as scalable proxies for human judgment in voice cloning quality assessment. However, as the authors note, these metrics may not fully capture nuanced artifacts, suggesting avenues for integrating human perceptual evaluation in future work.

Limitations and Future Directions

The primary limitations are the dataset scale (1,404 utterances), which may constrain generalization and coverage of domain-specific pronunciations, and the reliance on automated metrics. The per-language adapter approach, while effective for specialization, increases the number of deployed models relative to unified architectures. Future work could assess the benefits of multilingual or semi-supervised adapters, explore semi-parametric few-shot adaptation, and undertake comprehensive human listening tests. The authors also highlight ethical concerns regarding voice cloning misuse, stressing the importance of watermarking and authentication.

Conclusion

This work demonstrates a robust and resource-efficient framework for cross-lingual voice cloning of scientific speech, combining ensemble data distillation and parameter-efficient per-language fine-tuning. The approach yields improvements in transcription accuracy across Arabic, Chinese, and French, while maintaining high speaker similarity. The technical strategies employed—ensemble distillation, PEFT adaptation, and composite quality metrics—offer a replicable path for domain adaptation of foundation TTS models, with direct applicability to both research and assistive technologies in specialized communication contexts.
