Multi-speaker Text-to-speech Training with Speaker Anonymized Data

Published 20 May 2024 in eess.AS, cs.CR, and cs.SD | (2405.11767v1)

Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the goodness of the SA system for multi-speaker TTS training.

Summary

The paper investigates whether multi-speaker TTS models trained on speaker-anonymized data can preserve naturalness and speaker similarity while protecting the identities of the training speakers.
It compares five anonymization methods, two signal processing-based and three DNN-based, applied to the VCTK dataset.
Objective metrics (EER, WER, GVD, UTMOS) and subjective listening tests show that DNN-based methods generally outperform signal processing-based ones, and that UTMOS and GVD are good indicators of downstream TTS performance.

Overview

Training speech generation models with vast amounts of data collected from multiple speakers is common practice. Such strategies, however, can lead to security and privacy concerns as models may memorize and inadvertently reveal sensitive biometric information. This paper explores whether multi-speaker text-to-speech (TTS) models can still perform well when trained on data that has undergone speaker anonymization (SA). Essentially, SA aims to hide the speaker's identity while retaining other key attributes of the speech.

Problem and Goals

The researchers investigated training a multi-speaker TTS model on SA data to protect privacy without sacrificing quality. The goal was twofold:

Ensure the anonymization process meets specific criteria.
Maximize the TTS performance evaluated through naturalness and speaker similarity.

To achieve this, they applied five different anonymization methods to a dataset and examined both the anonymized data and the resulting TTS model's performance.

Speaker Anonymization Methods

The paper utilized two types of anonymization techniques: signal processing-based and deep neural network (DNN)-based methods.

Signal Processing Methods

Two methods were employed here:

Pitch Shift: Shifts the f0 (fundamental frequency) sequence of the speech up or down, changing its perceived pitch.
Spectral Envelope Modification: This method modifies the spectral envelope of the speech, changing its timbre to anonymize the speaker.
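
As a rough illustration of the pitch shift idea, the sketch below scales an f0 contour by a number of semitones. It is a minimal example, not the paper's implementation: the function name `shift_f0`, the convention that unvoiced frames are stored as 0 Hz, and the toy contour are all assumptions made for illustration. A real system would also need to extract f0 from the waveform and resynthesize the audio, which is omitted here.

```python
def shift_f0(f0_hz, semitones):
    """Shift an f0 contour (in Hz) up or down by a number of semitones.

    Assumption: unvoiced frames are stored as 0 Hz and are left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)  # one semitone is a factor of 2^(1/12)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [0.0, 110.0, 220.0, 0.0]  # toy f0 values in Hz (hypothetical)
print(shift_f0(contour, 12))        # one octave up doubles every voiced frame
```

Shifting by a fixed number of semitones keeps the shape of the intonation contour and only moves its register, which is why the linguistic content stays intelligible while the voice sounds different.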

Deep Neural Network Methods

Three DNN-based methods were adopted:

VPC'22 B1b: This method extracts features like f0 and linguistic representations and anonymizes the speaker identity using an average of distant speaker representations from a specified pool.
GAN-based Method: Uses a generative adversarial network (GAN) to generate artificial x-vectors, which serve as the anonymized speaker representation.
NACLM (Neural Audio Codec Language Model): Anonymizes the speech by regenerating it from high-level semantic tokens and acoustic prompts.
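
The "average of distant speaker representations" idea can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: speaker representations are plain lists of floats, distance is cosine distance, and the pseudo-speaker is the mean of the `n_farthest` pool vectors. The function names and the toy vectors are hypothetical, and the actual baseline may use a different distance measure and add a random sub-selection step.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors (0 = same direction, 2 = opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def pseudo_speaker(source, pool, n_farthest):
    """Average the n_farthest pool vectors that are most distant from source."""
    ranked = sorted(pool, key=lambda v: cosine_distance(source, v), reverse=True)
    chosen = ranked[:n_farthest]
    return [sum(v[d] for v in chosen) / len(chosen) for d in range(len(source))]


source = [1.0, 0.0]                                       # toy speaker vector
pool = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]  # toy external pool
print(pseudo_speaker(source, pool, n_farthest=2))
```

Averaging several distant vectors, rather than picking a single one, avoids mapping the source onto any real speaker in the pool.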

Evaluation and Results

The authors evaluated the anonymized data with objective metrics, such as Equal Error Rate (EER) and Word Error Rate (WER), and with subjective ratings of naturalness and speaker similarity from human listeners. The findings are summarized below.

Objective Results

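EER: The equal error rate of a speaker verification system on the anonymized speech, indicating how well the speaker's identity is hidden. Higher is better.
WER: The word error rate of a speech recognizer on the anonymized speech, measuring how well the linguistic content (utility) is preserved. Lower is better.
GVD (Gain of Voice Distinctiveness): Measures how well the distinctiveness between different speakers' voices is preserved after anonymization. A value near 0 dB means distinctiveness is unchanged, and negative values mean it is reduced.
UTMOS: A data-driven model that predicts subjective naturalness ratings.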
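
To make the first two metrics concrete, here is a minimal sketch of how EER and WER are typically computed. It is not the evaluation code used in the paper: the function names, the threshold sweep over observed scores, and the toy inputs are assumptions for illustration, and production toolkits usually interpolate the EER more carefully.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate and false rejection rate are closest to each other."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # true speaker rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # wrong speaker accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy verification scores and transcripts (hypothetical values).
print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6]))
print(word_error_rate("the cat sat on mat", "the cat sit on"))
```

For anonymization, a high EER is desirable because it means the verifier can no longer tell speakers apart, whereas WER should stay low so that the content remains usable for TTS training.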

Subjective Results

Listeners rated the naturalness and speaker similarity of both the anonymized data and the TTS systems trained on that data. The study revealed that:

No single SA system was dominant across all metrics.
Deep neural network-based systems generally outperformed signal processing-based approaches.
The GAN-based method performed well in terms of TTS naturalness.
UTMOS and GVD, computed on the anonymized training data, emerged as good indicators of downstream TTS performance.

Implications

The implications are practical for both data privacy and TTS quality. Training on anonymized data can help protect user privacy without significantly degrading the model's performance. Key metrics like UTMOS for naturalness and GVD for voice distinctiveness can guide researchers in evaluating the effectiveness of anonymization methods before running extensive TTS training processes.
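
Because GVD is proposed as a screening metric, a small sketch of its computation may help. The summary does not give the formula, so the code below assumes the definition commonly used in VoicePrivacy Challenge evaluations: GVD is the ratio, in dB, of the "diagonal dominance" of the speaker similarity matrix after anonymization to that of the original speech. The function names and the toy similarity matrices are hypothetical.

```python
import math


def diagonal_dominance(sim):
    """Absolute difference between the mean diagonal (same-speaker) value and
    the mean off-diagonal (different-speaker) value of a similarity matrix."""
    n = len(sim)
    diag = sum(sim[i][i] for i in range(n)) / n
    off = sum(sim[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return abs(diag - off)


def gain_of_voice_distinctiveness(sim_original, sim_anonymized):
    """GVD in dB: 0 means distinctiveness is preserved, negative means reduced."""
    ratio = diagonal_dominance(sim_anonymized) / diagonal_dominance(sim_original)
    return 10.0 * math.log10(ratio)


# Toy 2-speaker similarity matrices (hypothetical values).
original = [[1.0, 0.0], [0.0, 1.0]]    # speakers are clearly distinct
anonymized = [[0.6, 0.5], [0.5, 0.6]]  # voices have become much more alike
print(gain_of_voice_distinctiveness(original, anonymized))
```

A strongly negative GVD suggests that the anonymized voices have collapsed toward each other, which would leave a multi-speaker TTS model with little speaker variety to learn from.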

Future Directions

This work opens the door to further improving SA systems, focusing on:

Enhancing SA methods to improve both privacy and TTS performance.
Understanding the balance between anonymization and utility.
Expanding this approach to other speech generation tasks like speaker-adaptive TTS and speech enhancement.

One thought-provoking consideration raised by the authors is the need for a consensus on what constitutes a valid anonymization threshold, involving discussions beyond the technical community.

Overall, this paper provides valuable insights and metrics for future research aiming to balance privacy and quality in multi-speaker TTS systems.
