- The paper presents a decision-theoretic uncertainty quantification framework that uses downstream classifier risk to validate generative domain adaptation outputs.
- It employs a 1D Pix2pix GAN for denoising and a modified AlexNet for AF classification, restoring performance even under significant noise.
- The study demonstrates that uncertainty-driven filtering effectively improves clinical reliability by mitigating artefacts from generative adaptation.
Trustworthy Deep Domain Adaptation for Wearable PPG Signal Analysis with Decision-Theoretic Uncertainty Quantification
Introduction and Motivation
The paper addresses the challenge of deploying deep learning systems for wearable photoplethysmography (PPG) signal analysis under domain shift, particularly when PPG signals are contaminated with noise not seen during model training. Traditional domain adaptation with deep generative models, such as GAN-based denoising, can transform test inputs so that their distribution resembles that of the training set. However, generative models are prone to producing artefacts and hallucinations, which raises concerns about the reliability and trustworthiness of downstream discriminative analyses, especially for clinical tasks like atrial fibrillation (AF) detection.
A prominent limitation is that standard uncertainty quantification (UQ) and calibration approaches often assume access to ground truth for evaluating uncertainty reliability. In real-world generative domain adaptation, ground truth may be unavailable or the quality assessment must be action-dependent (e.g., whether a denoised series leads to correct diagnosis). The paper proposes a decision-theoretic UQ (DTUQ) framework that formalizes the use of a downstream classifier as an uncertainty grounder and calibrator, providing a pragmatic metric for actionable trust in generative adaptation outputs.
Methodological Approach
Case Study: PPG Denoising for AF Classification
The work constructs a systematic case study on the Deepbeat dataset. A 1D variant of AlexNet, trained for binary AF classification, serves as the reference discriminative model. Test data are corrupted with additive Gaussian noise to induce a significant train-test distribution gap. Domain adaptation is performed by a 1D Pix2pix GAN with a UNet generator backbone, trained on paired noisy/clean PPG segments to map noisy inputs to their clean counterparts. Denoising outputs are then post-processed (clamped to [0,2]) to ensure physiologically meaningful values before evaluation.
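The noise injection and clamping steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the toy sawtooth signal, the noise level `sigma=0.8`, and the function names are all assumptions, and the noisy signal stands in for a generator output so that the clamp has something to act on.

```python
import random

def add_gaussian_noise(signal, sigma, rng):
    """Return a copy of `signal` with zero-mean Gaussian noise (std `sigma`) added."""
    return [x + rng.gauss(0.0, sigma) for x in signal]

def clamp(signal, lo=0.0, hi=2.0):
    """Clip every sample into [lo, hi]; [0, 2] is the range cited in the text."""
    return [min(max(x, lo), hi) for x in signal]

rng = random.Random(0)  # fixed seed so the example is repeatable

# Toy periodic signal in [1.0, 1.5); a placeholder, not real PPG data.
clean = [1.0 + 0.5 * ((i % 25) / 25.0) for i in range(100)]

# Hypothetical noise level chosen only to push some samples outside [0, 2].
noisy = add_gaussian_noise(clean, sigma=0.8, rng=rng)

# Stand-in for the post-processing applied to denoised outputs.
clamped = clamp(noisy)
```

In the actual study the clamp is applied to the GAN's denoised output rather than to the noisy input; the sketch only shows the two operations in isolation.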


Figure 1: Example PPG time series depicting noisy input, GAN denoised signal, and clean ground truth for both AF and non-AF segments.
Decision-Theoretic Uncertainty Quantification
Rather than conventional UQ metrics (e.g., mean squared error), DTUQ operationalizes uncertainty as the expected loss (decision cost) incurred by feeding a generated example to a real classifier. The decision cost is set to the misclassification loss: for a downstream classifier with predictive distribution p(y∣x) and chosen action a, the risk is ρ(a∣x) = 1 − p(a∣x), which is minimized by choosing a = argmax_y p(y∣x).
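The risk formula follows from taking the expectation of the 0-1 loss under the classifier's predictive distribution: the probability mass on every label other than a sums to 1 − p(a∣x). A minimal sketch of that computation is given below; the function names and the example probabilities are hypothetical and chosen only for illustration.

```python
def decision_risk(probs, action):
    """Expected 0-1 loss of committing to `action`: rho(a|x) = 1 - p(a|x)."""
    return 1.0 - probs[action]

def bayes_action(probs):
    """Action that minimises the 0-1 risk, i.e. argmax_y p(y|x)."""
    return max(range(len(probs)), key=lambda y: probs[y])

# Hypothetical classifier output for one denoised segment: [p(non-AF|x), p(AF|x)].
probs = [0.3, 0.7]

a = bayes_action(probs)          # index of the most probable class
risk = decision_risk(probs, a)   # residual probability of being wrong
```

Under this loss the risk of the best action is simply one minus the classifier's top probability, so a confident prediction on a denoised sample translates directly into a low decision cost.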
The per-instance predictive entropy of the downstream classifier on denoised samples serves as the uncertainty estimate. Calibration and reliability are evaluated via the Uncertainty Calibration Error (UCE) and per-class reliability diagrams.
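The summary does not spell out how entropy is normalized or how UCE is binned, so the sketch below rests on assumptions: entropy is scaled to [0, 1] by dividing by log(number of classes), and UCE is taken as the bin-size-weighted average gap between error rate and mean uncertainty over equal-width uncertainty bins. The bin count of 10 is likewise an illustrative choice.

```python
import math

def predictive_entropy(probs, normalize=True):
    """Shannon entropy of a categorical prediction; optionally scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs)) if normalize else h

def uce(uncertainties, errors, n_bins=10):
    """Binned calibration gap between uncertainty and error.

    `uncertainties` are assumed to lie in [0, 1]; `errors` are 0/1 flags
    (1 = misclassified). Each non-empty bin contributes
    |error rate - mean uncertainty|, weighted by its share of the samples.
    """
    n = len(uncertainties)
    bins = [[] for _ in range(n_bins)]
    for u, e in zip(uncertainties, errors):
        idx = min(int(u * n_bins), n_bins - 1)  # u == 1.0 goes in the last bin
        bins[idx].append((u, e))
    total = 0.0
    for b in bins:
        if not b:
            continue
        mean_u = sum(u for u, _ in b) / len(b)
        err_rate = sum(e for _, e in b) / len(b)
        total += len(b) / n * abs(err_rate - mean_u)
    return total
```

A reliability diagram plots the same per-bin quantities (mean uncertainty on one axis, error rate on the other), so the value returned here is a one-number summary of how far that curve departs from the diagonal.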


Figure 2: Reliability diagrams for raw, noisy, and GAN-denoised test sets, demonstrating per-class calibration and uncertainty reliability with respect to classifier performance.
Qualitative and Statistical Uncertainty Analysis
A scatterplot of predictive entropy on noisy versus denoised samples offers insight into whether uncertainty reflects true generative artefacts or only underlying measurement variance. The correlations between the two are moderate (Pearson 0.68, Spearman 0.59), which supports the claim that the uncertainty estimate is sensitive to the generative model's outputs.
Figure 3: Scatterplot comparing noisy and denoised predictive entropy, indicating that denoising often causes substantive changes in classifier uncertainty.
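For readers who want to reproduce this kind of comparison on their own entropy values, the two correlation coefficients can be computed as sketched below. The implementation is a plain-Python assumption for illustration (the paper's tooling is not stated); Spearman is computed as Pearson on average ranks, which handles ties.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Calling `pearson(noisy_entropy, denoised_entropy)` and `spearman(noisy_entropy, denoised_entropy)` on two hypothetical per-sample entropy lists would give the pair of coefficients reported in the text.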
Experimental Results
The key empirical findings are based on multiple classification performance metrics (AUC, F1, MCC, sensitivity, specificity, balanced accuracy) across unaugmented, noisy, and denoised test splits, as well as a low-uncertainty filtered subset of denoised predictions.
- Noise Injection: Artificial noise notably degrades all classifier metrics, with AUC declining from 0.84 (clean) to 0.75 (noisy).
- Denoising via GAN: Application of generative model adaptation regains substantial performance, restoring AUC to 0.80.
- Uncertainty-driven Filtering: Keeping only the 75% of denoised samples with the lowest entropy (the low-uncertainty subset) brings performance level with, or slightly above, the original clean set (AUC = 0.85). This empirically grounds uncertainty as a meaningful filter.
- Calibration Analysis: Reliability diagrams show moderate calibration: high classifier entropy generally coincides with high error, except for some bias in the lowest-entropy AF class predictions.
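The uncertainty-driven filtering step in the list above can be sketched as follows. The entropies, labels, and predictions are toy values constructed so that the two misclassified samples are also the most uncertain ones; they are not results from the paper, and the helper names are hypothetical.

```python
def filter_low_uncertainty(entropies, keep_fraction=0.75):
    """Return indices of the `keep_fraction` of samples with the lowest entropy."""
    n_keep = int(len(entropies) * keep_fraction)
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(order[:n_keep])

def accuracy(labels, preds, indices):
    """Fraction of correct predictions among the samples at `indices`."""
    return sum(labels[i] == preds[i] for i in indices) / len(indices)

# Toy per-sample entropies of the classifier on denoised inputs.
entropies = [0.10, 0.90, 0.30, 0.05, 0.60, 0.20, 0.95, 0.40]
labels = [0, 1, 0, 0, 1, 0, 1, 1]
preds = [0, 0, 0, 0, 1, 0, 0, 1]   # wrong at indices 1 and 6

kept = filter_low_uncertainty(entropies)             # 6 of 8 samples survive
acc_all = accuracy(labels, preds, range(len(labels)))
acc_kept = accuracy(labels, preds, kept)
```

In this toy case the filter removes exactly the misclassified samples; on real data the gain depends on how well entropy tracks error, which is what the calibration analysis measures.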
The uncertainty metrics thus serve a dual role: as actionable filters for data deployment and as indirect but effective QA mechanisms for generative pre-processing in domain adaptation pipelines.
Theoretical and Practical Implications
The formalization of downstream classifier uncertainty as an actionable trust metric, rooted in decision-theoretic risk, addresses limitations of standard generative model QA in data-limited and annotation-sparse medical settings. By shifting emphasis from input-reconstruction statistics to task-relevant cost (e.g., misclassification), the DTUQ framework both operationalizes data curation strategies and provides stronger support for regulatory or clinical deployment under uncertainty.
From a theoretical standpoint, the work reframes the evaluation of generative adaptation quality in terms of subjective expected loss, connectible to preferred actions rather than only marginal summary statistics. The approach is natural for healthcare and industrial monitoring, where ultimate decision accuracy and error cost dominate over pixel-wise or signal-wise similarity.
Future Directions
Extending DTUQ-based trustworthiness to more complex, multi-class problems, or toward richer uncertainty characterizations (e.g., disentangling aleatoric/epistemic sources) remains open. Experiments with more realistic noise models or real-world modalities beyond PPG would inform generality. Integrating DTUQ with selective classification or automated abstention mechanisms constitutes a promising avenue for robust AI deployment. Further, advances in per-sample utility learning or user-defined risk functions could personalize model adaptation reliability.
Conclusion
The paper robustly demonstrates that decision-theoretic uncertainty quantification, instantiated through downstream classifier predictive entropy and grounded by its relation to error cost, enables reliable assessment and actionable filtering of generative domain adaptation outputs in wearable PPG analysis. The methodology supports practical deployment by quantifying the trustworthiness of adapted examples in terms directly relevant to the target clinical task, offering a rigorous foundation for future extensions in trustworthy time series domain adaptation and uncertainty-aware AI systems.