RTCFake: Speech Deepfake Detection in Real-Time Communication

Published 26 Apr 2026 in cs.SD | (2604.23742v1)

Abstract: With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed \textit{RTCFake}, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that enforces models to learn platform-invariant semantic structural representations. In this paper, the RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set further includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The \textit{RTCFake} dataset is provided in the {https://huggingface.co/datasets/JunXueTech/RTCFake}.

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper presents a novel phoneme-guided consistency learning (PCL) framework that aligns offline and online phoneme embeddings for enhanced detection robustness.
It constructs a comprehensive dataset with over 600 hours of paired clean and RTC-transmitted speech across diverse platforms and noise conditions.
Experimental results demonstrate that PCL achieves an average EER of 5.81%, outperforming traditional frame-level and mixed training approaches.

RTCFake: Speech Deepfake Detection in Real-Time Communication

Motivation and Problem Formulation

Speech deepfakes, enabled by advances in TTS and VC models, present critical security risks in RTC applications. Contemporary SDD methods and datasets predominantly target offline or simulated transmission scenarios, which do not capture the signal perturbations and composite nonlinear distortions imposed by mainstream RTC platforms during real communication. These include not only codec compression but also built-in modules for noise suppression, echo cancellation, and packet loss, each acting as a black box and fundamentally altering the speech signal. As a result, existing SDD systems exhibit severe domain mismatch and lack generalization to real-world RTC conditions.

Figure 1: Prior datasets simulate only limited distortions, while the RTCFake pipeline captures the true black-box characteristics and complex distortions of online RTC platforms.

RTCFake Dataset Construction and Properties

The RTCFake dataset remedies these deficiencies by constructing paired offline (clean) and online (transmitted) speech data. Approximately 600 hours of speech—including both bona fide and deepfake utterances generated via state-of-the-art TTS/VC models across English and Chinese speakers—were transmitted over seven major RTC platforms (Zoom, QQ, WeChat, DingTalk, Lark, VooV, Telegram). The design ensures coverage of platform diversity, several noise scenarios, and speaker variations.

Figure 2: Amplitude spectrograms reveal major shifts in energy distribution, temporal misalignment, and noise artifacts, especially after transmission over real RTC platforms.

The collection pipeline (Figure 3) uses a two-device setup to record "ground-truth" and post-transmission versions of each utterance, ensuring high-fidelity offline-online pairing. Metadata annotations include speaker, platform, generation method, noise type, and text transcription, supporting systematic benchmarking and analysis.

Figure 3: The end-to-end dataset collection integrates offline speech composition, noise conditions, and authentic RTC platform transmission.

Dataset Statistics

The final corpus comprises over 600 hours, 307 speakers, and a large balanced set between bona fide and deepfake instances under both offline and online conditions. The evaluation set specifically includes unseen platforms and noise configurations, creating a rigorous and practical benchmark.

Black-box Distortions and Representation Consistency

Systematic analyses demonstrate that local frame-level acoustic cues are fragile under RTC platform perturbations and result in strong distribution shifts between offline and online signals. By contrast, phoneme-level representations—reflecting underlying linguistic content—exhibit higher cross-domain similarity (higher mean similarity, lower variance) and thus greater structural stability.

Figure 4: Phoneme-level representations display dramatically higher offline–online stability than frame-level features, motivating their use for robust representation learning.

Phoneme-Guided Consistency Learning (PCL)

The central methodological contribution is the PCL framework. During training, a pre-trained phoneme recognizer identifies segment boundaries; representations within each phoneme are pooled, and the model is constrained via a bidirectional consistency loss to align paired offline and online phoneme embeddings. The overall optimization combines this PCL term with standard classification (cross-entropy) loss, encouraging the detection model to prioritize platform-invariant semantic cues over fragile signal details.

Experimental Results

Baseline and Ablation Performance

Experiments leverage XLSR+AASIST as a strong backbone, evaluated under all train/test domain combinations (offline, online, their mix) and using both open domain and in-house datasets.

Models trained solely on open-source SDD datasets (e.g., ASVspoof, DFADD) generalize poorly to RTCFake, with EER exceeding 30–50% in RTC settings.
Domain shift is stark: offline-trained models give low EER on offline test data (<6%) but degrade by >2x on online data; the inverse holds for online-only training.
Simple mixed training improves robustness, but does not resolve platform variations or unseen noise.

The PCL training regime achieves the lowest and most stable EERs across both seen and unseen platforms, averaging 5.81%, outperforming classical frame-level consistency learning and mix-based strategies.

Figure 5: Across seen and unseen platforms, PCL outperforms both mix and frame-level consistency (FCL) strategies, especially under severe platform domain mismatch.

Cross-Platform and Noise Robustness

RTCFake includes challenging conditions, including fully unseen communication platforms and noise scenarios:

PCL reduces EER across all unseen platforms and is particularly resilient to those with nonstandard or stronger transmission-induced distortions (e.g., Telegram).
In unseen noise conditions (e.g., office, coffee, echo, rain), PCL offers substantial performance gains and EER stability over mixed and frame-level strategies.
Figure 6: PCL consistently yields lower mean EER and reduced variance compared to FCL, demonstrating hyperparameter stability and generalization under offline and online test conditions.

Methodological and Empirical Insights

Existing benchmarks and approaches are inadequate for deployment in authentic RTC security applications; datasets must incorporate realistic black-box platform distortions.
Frame-level feature reliance is brittle; robust SDD requires exploiting domains and feature sets aligned with the invariants that platforms preserve—i.e., phoneme-level semantics.
By injecting cross-domain phoneme-level consistency into the objective, models become far less sensitive to distribution shifts and artifact loss induced by aggressive RTC processing.

Limitations and Future Directions

Although RTCFake advances SDD evaluation toward real-world standards, further domain factors—such as device heterogeneity, varied user microphone/speaker setups, or extreme platform-side compression—not fully explored herein might reduce generalization. Ensuring generalization to even more adversarial settings and less cooperative user behaviors will demand adaptive and platform-agnostic methods leveraging wider context and multimodal data.

Conclusion

RTCFake establishes a new foundation for SDD system evaluation under black-box RTC platform conditions. The experimental findings show that enforcing representation invariance at the phoneme level is a highly effective regularization for robust and generalizable speech deepfake detection. The dataset and PCL framework provide a practical, rigorous testbed and methodology, anticipating future SDD systems resilient to both platform and noise variability.

Markdown Report Issue