Trustworthy deep domain adaptation for wearable photoplethysmography signal analysis with decision-theoretic uncertainty quantification

Published 19 Apr 2026 in cs.LG | (2604.17480v1)

Abstract: In principle, deep generative models can be used to perform domain adaptation; i.e. align the input feature representations of test data with that of a separate discriminative model's training data. This can help improve the discriminative model's performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.


Authors (1)

Ciaran Bench

Summary

The paper presents a decision-theoretic uncertainty quantification framework that uses downstream classifier risk to validate generative domain adaptation outputs.
It employs a 1D Pix2pix GAN for denoising and a modified AlexNet for AF classification, restoring performance even under significant noise.
The study demonstrates that uncertainty-driven filtering effectively improves clinical reliability by mitigating artefacts from generative adaptation.

Trustworthy Deep Domain Adaptation for Wearable PPG Signal Analysis with Decision-Theoretic Uncertainty Quantification

Introduction and Motivation

The paper addresses the challenge of deploying deep learning systems for wearable photoplethysmography (PPG) signal analysis under domain shift, particularly when PPG signals are contaminated with noise not seen during model training. Domain adaptation with deep generative models, such as GAN-based denoising, can transform test inputs so that they resemble the training distribution. However, generative models are susceptible to producing artefacts and hallucinations, which raises concerns about the reliability and trustworthiness of downstream discriminative analyses, especially for clinical tasks like atrial fibrillation (AF) detection.

A prominent limitation is that standard uncertainty quantification (UQ) and calibration approaches often assume access to ground truth for evaluating uncertainty reliability. In real-world generative domain adaptation, ground truth may be unavailable or the quality assessment must be action-dependent (e.g., whether a denoised series leads to correct diagnosis). The paper proposes a decision-theoretic UQ (DTUQ) framework that formalizes the use of a downstream classifier as an uncertainty grounder and calibrator, providing a pragmatic metric for actionable trust in generative adaptation outputs.

Methodological Approach

Case Study: PPG Denoising for AF Classification

The work constructs a case study on the Deepbeat dataset. A 1D variant of AlexNet, trained for binary AF classification, serves as the reference discriminative model. Test data are corrupted with additive Gaussian noise to induce a significant train-test distribution gap. Domain adaptation is performed by a 1D Pix2pix GAN with a UNet generator backbone, trained on paired noisy/clean PPG segments to map noisy inputs to their clean counterparts. Denoised outputs are then clamped to $[0,2]$ to ensure physiologically meaningful values before evaluation.
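The noise-injection and clamping steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the paper's code: the synthetic sine-like "PPG" segment, the noise level `sigma=0.3`, and the function names are all assumptions made for the example. Only the additive Gaussian noise and the $[0,2]$ clamp range come from the description above.

```python
import math
import random


def add_gaussian_noise(signal, sigma, rng):
    """Return a copy of `signal` with i.i.d. zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def clamp(signal, lo=0.0, hi=2.0):
    """Clip every sample to [lo, hi], as done to denoised outputs before evaluation."""
    return [min(max(x, lo), hi) for x in signal]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical clean segment: a sine wave standing in for a real PPG trace.
clean = [1.0 + 0.5 * math.sin(2 * math.pi * t / 50) for t in range(200)]

# Corrupt the test segment; sigma is an assumed value, not taken from the paper.
noisy = add_gaussian_noise(clean, sigma=0.3, rng=rng)

# Stand-in for a denoiser output that strays outside the valid range.
raw_output = [-0.3, 0.5, 2.7]
clamped = clamp(raw_output)
```

In a full pipeline the clamp would be applied to the generator's output rather than to a hand-written list; the list here only makes the effect of the clipping easy to see.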

Figure 1: Example PPG time series depicting noisy input, GAN denoised signal, and clean ground truth for both AF and non-AF segments.

Decision-Theoretic Uncertainty Quantification

Rather than scoring generated outputs with conventional ground-truth-based metrics (e.g., mean squared error), DTUQ operationalizes uncertainty as the expected loss (decision cost) incurred by passing a generated example to the downstream classifier. The decision cost is set to the misclassification (0-1) loss: for the downstream classifier's predictive distribution $p(y|x)$ and a predicted label (action) $a$, the risk is $\rho(a|x) = 1 - p(a|x)$, which is minimized by choosing $a = \arg\max_y p(y|x)$.
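As a short worked check of the risk expression, assume the misclassification loss $\ell(a, y) = \mathbb{1}[a \neq y]$ (the symbol $\ell$ is introduced here for illustration only). The expected loss of action $a$ is then

$$
\rho(a|x) = \sum_{y} p(y|x)\,\ell(a, y) = \sum_{y \neq a} p(y|x) = 1 - p(a|x).
$$

For a purely illustrative value, suppose the classifier assigns $p(\mathrm{AF}|x) = 0.8$ to a denoised segment. Predicting AF then carries risk $0.2$ and predicting non-AF carries risk $0.8$, so the risk-minimizing action is the most probable class.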

Predictive entropy (and specifically, per-instance entropy) of the downstream classifier on denoised samples serves as the uncertainty estimate, with calibration and reliability evaluated via the Uncertainty Calibration Error (UCE) and per-class reliability diagrams.
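A minimal sketch of how per-instance predictive entropy and a binned calibration error might be computed is given below. It follows one common formulation of UCE (bin samples by normalized entropy, then take the size-weighted average gap between mean error and mean uncertainty in each bin). The number of bins, the normalization by $\log K$, and all function names are assumptions; the paper's exact settings are not stated here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy scaled to [0, 1] by dividing by log(number of classes)."""
    return entropy(probs) / math.log(len(probs))


def uce(prob_rows, labels, n_bins=10):
    """Binned gap between mean error and mean normalized entropy (assumed UCE form)."""
    n = len(labels)
    bins = [[] for _ in range(n_bins)]
    for probs, y in zip(prob_rows, labels):
        u = normalized_entropy(probs)
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax class
        idx = min(int(u * n_bins), n_bins - 1)  # u == 1.0 goes in the last bin
        bins[idx].append((u, float(pred != y)))
    total = 0.0
    for b in bins:
        if b:
            mean_u = sum(u for u, _ in b) / len(b)
            mean_err = sum(e for _, e in b) / len(b)
            total += len(b) / n * abs(mean_err - mean_u)
    return total


# Illustrative (made-up) classifier outputs for four denoised segments.
probs = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90], [0.55, 0.45]]
labels = [0, 1, 1, 0]
per_sample_entropy = [entropy(p) for p in probs]
score = uce(probs, labels, n_bins=5)
```

A per-class reliability diagram would plot the same per-bin mean error against mean uncertainty, computed separately for samples of each class.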

Figure 2: Reliability diagrams for raw, noisy, and GAN-denoised test sets, demonstrating per-class calibration and uncertainty reliability with respect to classifier performance.

Qualitative and Statistical Uncertainty Analysis

A scatterplot of predictive entropy on noisy versus denoised samples probes whether the uncertainty reflects artefacts introduced by the generative model or only the underlying measurement variance. The correlation between the two is moderate (Pearson 0.68, Spearman 0.59), which supports the view that the uncertainty is sensitive to the generative model's outputs rather than being determined by the noisy input alone.
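The two correlation coefficients can be computed as in the sketch below, which uses only the standard library. Spearman correlation is obtained as the Pearson correlation of the ranks, with tied values given their average rank. The entropy values are invented for illustration and are not the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-sample entropies before and after denoising.
entropy_noisy = [0.62, 0.31, 0.69, 0.12, 0.45, 0.58]
entropy_denoised = [0.40, 0.35, 0.66, 0.05, 0.20, 0.61]
r_pearson = pearson(entropy_noisy, entropy_denoised)
r_spearman = spearman(entropy_noisy, entropy_denoised)
```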

Figure 3: Scatterplot comparing noisy and denoised predictive entropy, indicating that denoising often causes substantive changes in classifier uncertainty.

Experimental Results

The key empirical findings are based on multiple classification performance metrics (AUC, F1, MCC, sensitivity, specificity, balanced accuracy) across unaugmented, noisy, and denoised test splits, as well as a low-uncertainty filtered subset of denoised predictions.
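For reference, the threshold-based metrics in this list can all be derived from a binary confusion matrix, as in the minimal sketch below. The labels and predictions are invented, the function names are hypothetical, and the code assumes both classes appear in the data and at least one positive is involved (otherwise some denominators are zero). AUC is omitted because it requires continuous scores rather than hard predictions.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = AF as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, balanced accuracy, F1 and MCC."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up example: four segments, one AF segment missed.
result = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```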

Noise Injection: Artificial noise notably degrades all classifier metrics, with AUC declining from 0.84 (clean) to 0.75 (noisy).
Denoising via GAN: Application of generative model adaptation regains substantial performance, restoring AUC to 0.80.
Uncertainty-driven Filtering: Selecting the denoised samples with entropy in the lowest 75% (low-uncertainty subset) aligns performance with, or slightly above, the original clean set (AUC = 0.85). This empirically grounds uncertainty as a meaningful filter.
Calibration Analysis: Reliability diagrams indicate moderate calibration, with higher classifier entropy generally coinciding with higher error, except for some bias in the lowest-entropy AF class predictions.
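The uncertainty-driven filtering step in the list above can be sketched as follows: rank denoised samples by predictive entropy and keep the lowest 75%. The function name, the rounding rule for the cut-off, and the example entropies are assumptions made for illustration.

```python
def keep_low_uncertainty(entropies, keep_fraction=0.75):
    """Return the indices of the samples with the lowest entropy.

    Keeps floor(len(entropies) * keep_fraction) samples; the rounding rule
    is an assumption, since the exact cut-off convention is not specified.
    """
    n_keep = int(len(entropies) * keep_fraction)
    order = sorted(range(len(entropies)), key=entropies.__getitem__)
    return sorted(order[:n_keep])  # restore original sample order


# Made-up per-sample entropies for four denoised segments.
entropies = [0.9, 0.1, 0.5, 0.3]
kept = keep_low_uncertainty(entropies)

# Downstream metrics would then be computed only on the kept samples.
filtered_entropies = [entropies[i] for i in kept]
```

Samples outside the kept set would typically be flagged for review or discarded rather than classified automatically.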

The uncertainty metrics thus serve a dual role: as actionable filters for data deployment and as indirect but effective QA mechanisms for generative pre-processing in domain adaptation pipelines.

Theoretical and Practical Implications

The formalization of downstream classifier uncertainty as an actionable trust metric, rooted in decision-theoretic risk, addresses limitations of standard generative model QA in data-limited and annotation-sparse medical settings. By shifting emphasis from input-reconstruction statistics to task-relevant cost (e.g., misclassification), the DTUQ framework both operationalizes data curation strategies and provides stronger support for regulatory or clinical deployment under uncertainty.

From a theoretical standpoint, the work reframes the evaluation of generative adaptation quality in terms of subjective expected loss, tied to preferred actions rather than only to marginal summary statistics. The approach is a natural fit for healthcare and industrial monitoring, where final decision accuracy and error cost matter more than pixel-wise or signal-wise similarity.

Future Directions

Extending DTUQ-based trustworthiness to more complex, multi-class problems, or toward richer uncertainty characterizations (e.g., disentangling aleatoric/epistemic sources) remains open. Experiments with more realistic noise models or real-world modalities beyond PPG would inform generality. Integrating DTUQ with selective classification or automated abstention mechanisms constitutes a promising avenue for robust AI deployment. Further, advances in per-sample utility learning or user-defined risk functions could personalize model adaptation reliability.

Conclusion

The paper demonstrates, in a single PPG case study, that decision-theoretic uncertainty quantification, instantiated through downstream classifier predictive entropy and grounded by its relation to error cost, enables assessment and actionable filtering of generative domain adaptation outputs in wearable PPG analysis. The methodology supports practical deployment by quantifying the trustworthiness of adapted examples in terms directly relevant to the target clinical task, offering a foundation for future extensions in trustworthy time series domain adaptation and uncertainty-aware AI systems.
