Multi-speaker Text-to-speech Training with Speaker Anonymized Data

Published 20 May 2024 in eess.AS, cs.CR, and cs.SD | (2405.11767v1)

Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the goodness of the SA system for multi-speaker TTS training.

Summary

  • The paper investigates whether multi-speaker TTS models can be trained on speaker-anonymized data, and how that choice trades off privacy against naturalness and speaker similarity.
  • It compares five anonymization methods, two signal processing-based and three DNN-based, applied to the VCTK dataset and used to train a VITS model.
  • Objective metrics (EER, WER, GVD, UTMOS) and subjective listening tests show that no single method is best on every metric, and that UTMOS and GVD are good indicators of downstream TTS performance.

Overview

Training speech generation models with vast amounts of data collected from multiple speakers is common practice. Such strategies, however, can lead to security and privacy concerns as models may memorize and inadvertently reveal sensitive biometric information. This paper explores whether multi-speaker text-to-speech (TTS) models can still perform well when trained on data that has undergone speaker anonymization (SA). Essentially, SA aims to hide the speaker's identity while retaining other key attributes of the speech.

Problem and Goals

The researchers here investigated training a multi-speaker TTS model using SA data to ensure privacy without sacrificing quality. The goal was twofold:

  1. Ensure the anonymization process meets specific criteria.
  2. Maximize the TTS performance evaluated through naturalness and speaker similarity.

To achieve this, they applied five different anonymization methods to a dataset and examined both the anonymized data and the resulting TTS model's performance.

Speaker Anonymization Methods

The paper utilized two types of anonymization techniques: signal processing-based and deep neural network (DNN)-based methods.

Signal Processing Methods

Two methods were employed here:

  1. Pitch Shift: Pitch shifting modifies the f0 sequence (related to the pitch) of the speech, either up or down.
  2. Spectral Envelope Modification: This method modifies the spectral envelope of the speech, changing its timbre to anonymize the speaker.
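
As a rough illustration of the pitch-shift idea, the sketch below scales an f0 track by a fixed number of semitones. It is not the paper's implementation: the function name `shift_f0`, the convention that 0 marks an unvoiced frame, and the example values are all assumptions made for this example. A full system would also need a vocoder to extract f0 and resynthesize the waveform.

```python
def shift_f0(f0_hz, semitones):
    """Scale voiced f0 values by a fixed number of semitones.

    Frames with f0 == 0 are treated as unvoiced and left unchanged
    (an assumed convention, common in vocoder-style f0 tracks).
    """
    ratio = 2.0 ** (semitones / 12.0)  # 12 semitones = one octave = factor 2
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Made-up f0 track in Hz; zeros stand for unvoiced frames.
f0_track = [0.0, 110.0, 112.0, 0.0, 220.0]
shifted_up = shift_f0(f0_track, 12)     # one octave up
shifted_down = shift_f0(f0_track, -12)  # one octave down
```

A positive `semitones` value raises the pitch and a negative value lowers it, matching the "either up or down" behavior described above.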

Deep Neural Network Methods

Three DNN-based methods were adopted:

  1. VPC'22 B1b: A baseline from the VoicePrivacy Challenge 2022. It extracts features such as f0 and linguistic representations from the input, then replaces the speaker representation with an average of speaker representations that are distant from the original, drawn from a specified pool.
  2. GAN-based Method: Utilizes a generative adversarial network to generate anonymized x-vectors that represent the speaker.
  3. NACLM (Neural Audio Codec Language Model): Represents the input speech as high-level semantic tokens and regenerates it with a neural audio codec language model conditioned on acoustic prompts, which supply the new voice.
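
The "average of distant speaker representations" step in the first method can be sketched as follows. This is a simplified illustration, not the challenge baseline's code: cosine distance, the two-dimensional toy embeddings, and the names `cosine_distance` and `pseudo_speaker` are assumptions chosen to keep the example self-contained.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pseudo_speaker(source, pool, n_farthest, n_average, rng):
    """Average a random subset of the pool vectors farthest from `source`."""
    # Rank pool vectors from most to least distant from the source speaker.
    ranked = sorted(pool, key=lambda v: cosine_distance(source, v), reverse=True)
    candidates = ranked[:n_farthest]
    # Randomly pick a subset of the distant candidates and average them.
    chosen = rng.sample(candidates, n_average)
    return [sum(v[d] for v in chosen) / n_average for d in range(len(source))]


# Toy 2-D "speaker embeddings" (made-up values; real x-vectors have hundreds of dimensions).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
source = [1.0, 0.0]
anon = pseudo_speaker(source, pool, n_farthest=2, n_average=2, rng=random.Random(0))
```

Here the two farthest pool vectors are both selected, so the pseudo-speaker is their mean. With a larger pool and `n_average < n_farthest`, the random subset makes the resulting voice vary between runs.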

Evaluation and Results

The authors evaluated the anonymized data, and the TTS models trained on it, using objective metrics such as Equal Error Rate (EER) and Word Error Rate (WER), and subjective ratings of naturalness and speaker similarity from human listeners. The metrics and main findings are summarized below:

Objective Results

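  • EER: The error rate of a speaker verification system run on the anonymized speech. It indicates how well the anonymization hides the speaker's identity; higher is better for privacy.
  • WER: The error rate of a speech recognizer run on the anonymized speech. It measures how much of the speech's utility is preserved; lower is better.
  • GVD (Gain of Voice Distinctiveness): Measures how well the differences between speakers' voices are preserved after anonymization. A value near 0 dB means distinctiveness is preserved, while a negative value means the anonymized voices have become harder to tell apart.
  • UTMOS: A data-driven model that predicts subjective naturalness ratings.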
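
To make the first two metrics concrete, here is a minimal sketch of how EER and WER can be computed from raw scores and transcripts. It is an illustration only, not the evaluation toolkit used in the paper: the threshold sweep is a simple approximation of EER, and all scores and sentences are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds and return the error rate where
    false acceptance and false rejection are closest to each other."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up verification scores: higher means "more likely the same speaker".
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

For anonymized speech, a higher `eer` means the verification system is confused more often (better privacy), while a lower `wer` means the recognizer still understands the content (better utility).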

Subjective Results

Listeners rated the naturalness and speaker similarity of both the anonymized data and the TTS system using that data. The study revealed that:

  • No single SA system was dominant across all metrics.
  • Deep neural network-based systems generally outperformed signal-processing based approaches.
  • The GAN-based method performed well in terms of TTS naturalness.
  • UTMOS and GVD emerged as strong indicators for TTS quality.
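
Since GVD is singled out as an indicator, a small sketch of how it can be computed may help. The code assumes the definition commonly used in VoicePrivacy-style evaluations, where GVD is the ratio, in decibels, of the "diagonal dominance" of a speaker-by-speaker voice similarity matrix after and before anonymization. The matrices below are made up, and the function names are hypothetical.

```python
import math


def diagonal_dominance(matrix):
    """Absolute difference between the mean diagonal (same-speaker) and
    mean off-diagonal (different-speaker) similarity."""
    n = len(matrix)
    diag_mean = sum(matrix[i][i] for i in range(n)) / n
    off_mean = sum(
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return abs(diag_mean - off_mean)


def gain_of_voice_distinctiveness(sim_original, sim_anonymized):
    """GVD in dB: 0 means distinctiveness is preserved, negative means loss."""
    ratio = diagonal_dominance(sim_anonymized) / diagonal_dominance(sim_original)
    return 10.0 * math.log10(ratio)


# Made-up 2-speaker similarity matrices (rows and columns index speakers).
sim_original = [[1.0, 0.2], [0.2, 1.0]]
sim_anonymized = [[0.6, 0.2], [0.2, 0.6]]
gvd = gain_of_voice_distinctiveness(sim_original, sim_anonymized)
```

In this toy case the anonymized voices are only half as distinguishable as the originals, which gives a GVD of roughly -3 dB.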

Implications

The implications are practical for both data privacy and TTS quality. The results suggest that training on anonymized data can help protect speaker privacy while keeping TTS performance usable, although no SA method was best on every metric. Because UTMOS (naturalness) and GVD (voice distinctiveness) computed on the anonymized data track downstream TTS performance, researchers can use them to screen anonymization methods before committing to a full TTS training run.

Future Directions

This work opens the door to further improving SA systems, focusing on:

  • Enhancing SA methods to improve both privacy and TTS performance.
  • Understanding the balance between anonymization and utility.
  • Expanding this approach to other speech generation tasks like speaker-adaptive TTS and speech enhancement.

One thought-provoking consideration raised by the authors is the need for a consensus on what constitutes a valid anonymization threshold, involving discussions beyond the technical community.

Overall, this paper provides valuable insights and metrics for future research aiming to balance privacy and quality in multi-speaker TTS systems.
