Debiased Machine Learning for Conformal Prediction of Counterfactual Outcomes Under Runtime Confounding

Published 4 Apr 2026 in stat.ML and cs.LG | (2604.03772v1)

Abstract: Data-driven decision making frequently relies on predicting counterfactual outcomes. In practice, researchers commonly train counterfactual prediction models on a source dataset to inform decisions on a possibly separate target population. Conformal prediction has arisen as a popular method for producing assumption-lean prediction intervals for counterfactual outcomes that would arise under different treatment decisions in the target population of interest. However, existing methods require that every confounding factor of the treatment-outcome relationship used for training on the source data is additionally measured in the target population, risking miscoverage if important confounders are unmeasured in the target population. In this paper, we introduce a computationally efficient debiased machine learning framework that allows for valid prediction intervals when only a subset of confounders is measured in the target population, a common challenge referred to as runtime confounding. Grounded in semiparametric efficiency theory, we show the resulting prediction intervals achieve desired coverage rates with faster convergence compared to standard methods. Through numerous synthetic and semi-synthetic experiments, we demonstrate the utility of our proposed method.


Authors (4)

Summary

The paper introduces a debiased machine learning framework that adjusts for runtime confounding to ensure valid conformal prediction intervals for counterfactual outcomes.
It combines plug-in and weighting estimation strategies with DML to achieve root-n convergence and robust interval coverage under systematic covariate shifts.
Empirical results on synthetic and semi-synthetic data demonstrate that the proposed methods outperform naive approaches by maintaining near-nominal coverage even with unobserved confounders.


Introduction and Problem Formulation

This paper addresses the challenge of quantifying uncertainty in counterfactual predictions under runtime confounding—scenarios where only a subset of the necessary confounders is observed in the target population during deployment. The motivation arises from decision-support applications, such as individualized treatment effect estimation in precision medicine or personalized marketing, where counterfactual models are trained in a data-rich source population but must be applied to a target population with systematic covariate and confounder shortfalls. Existing conformal prediction approaches are limited by the assumption that all confounders are measured both at training and runtime, leading to miscoverage and invalid inference when this condition is violated.

The authors formalize the runtime confounding problem as a distributional shift involving two covariate sets: a fully observed subset $\bm V$ available in both populations, and an augmented set $(\bm V, \bm U)$, with $\bm U$ unobserved at runtime. They aim to develop conformal prediction procedures that guarantee valid prediction interval coverage for counterfactual potential outcomes $Y(a)$ in the target population, explicitly handling the absence of some confounders in the population of interest.
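As a reading aid, the coverage goal described above can be written as a single display. The notation here is an assumption and does not come from the summary: $S$ is a hypothetical population indicator ($S = 1$ source, $S = 0$ target), $\widehat{C}_a(\bm V)$ is the prediction set built from the runtime covariates only, and $1 - \alpha$ is the nominal level. The guarantee is taken to be marginal over the target population and may hold only asymptotically.

```latex
\[
P\bigl( Y(a) \in \widehat{C}_a(\bm V) \,\big|\, S = 0 \bigr) \;\ge\; 1 - \alpha .
\]
```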

The presented framework extends conformal prediction approaches for counterfactuals and individual treatment effect (ITE) estimation under covariate shift [lei2021conformal, yang2024doubly, alaa2023conformal], transfer learning-based causal inference [shyr2025multi, bica2022transfer], and data fusion strategies [degtiar2023review, graham2024towards]. The principal technical distinction is the explicit accommodation of runtime confounding within the conformal prediction paradigm, leveraging both modern semiparametric efficiency theory and debiased/multiply-robust machine learning. Importantly, the approach requires only that $\bm V$ captures all systematic source-target shifts in $Y(a)$, thereby relaxing the conventional requirement for complete confounder measurement in the target population.

Formal Framework and Methodology

The methodology synthesizes (i) identification and adjustment for dual sources of covariate shift (across treatment levels and populations), (ii) plug-in and weighted estimation strategies, and (iii) efficient, multiply-robust estimators via efficient influence curve (EIC) theory. The procedure is formally grounded in the following core assumptions: treatment positivity, consistency, unconfoundedness (in the source), and source-target exchangeability given $\bm V$. Violations of these assumptions, such as unmeasured confounders or mismatched sources and targets, can invalidate inference; the paper provides additional discussion and graphical representations of plausible and non-plausible settings.
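One common way to state these four assumptions formally is sketched below. The exact statements in the paper are not reproduced in this summary, so the display is an assumed rendering: $\bm X = (\bm V, \bm U)$ denotes the full confounder set, $A$ the treatment, and $S$ a hypothetical population indicator ($S = 1$ source, $S = 0$ target).

```latex
\begin{align*}
&\text{Consistency:} && Y = Y(a) \ \text{whenever } A = a, \\
&\text{Treatment positivity:} && P(A = a \mid \bm X, S = 1) > 0, \\
&\text{Unconfoundedness (source):} && Y(a) \perp\!\!\!\perp A \mid \bm X,\, S = 1, \\
&\text{Exchangeability given } \bm V\text{:} && Y(a) \perp\!\!\!\perp S \mid \bm V .
\end{align*}
```

Under this reading, unconfoundedness needs the full set $\bm X$ but only in the source, while the transfer to the target conditions on $\bm V$ alone, which is what makes the missing $\bm U$ at runtime tolerable.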

Central to the conformal prediction construction is the conformity score $R_a(Y,\bm V)$, suitable for settings with partial covariate observability. The authors derive new identification functionals for the required quantiles of these scores in the target population, showing that both regression-based (g-computation) and weighting-based (inverse probability with dual shift-correcting weights) approaches are justified. For quantile regression-based conformity scores under runtime confounding, they present new loss functions involving composite weights accounting for both treatment and sampling mechanisms.
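The weighting-based route can be illustrated with a minimal sketch. Everything named here is an assumption rather than the paper's implementation: the score is taken to be an absolute residual $|Y - \widehat\mu_a(\bm V)|$, `pi_a` stands for an estimated treatment propensity $P(A = a \mid \bm X, S = 1)$, `rho` for an estimated source-membership probability $P(S = 1 \mid \bm V)$, and the composite weight is the product of an inverse treatment propensity and the target-to-source odds. The weighted quantile is a plain normalized version without the extra test-point mass used in some weighted conformal methods.

```python
import numpy as np

def composite_weights(pi_a, rho):
    """Hypothetical dual-shift weight for source units that received treatment a.

    pi_a : estimated P(A = a | X, S = 1)   (treatment mechanism)
    rho  : estimated P(S = 1 | V)          (sampling mechanism)
    """
    pi_a = np.asarray(pi_a, dtype=float)
    rho = np.asarray(rho, dtype=float)
    return (1.0 / pi_a) * ((1.0 - rho) / rho)

def weighted_quantile(scores, weights, level):
    """Smallest score whose normalized cumulative weight reaches `level`."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = np.searchsorted(cdf, level, side="left")
    return s[min(idx, len(s) - 1)]

def absolute_residual_interval(mu_hat, q_hat):
    """Interval mu_hat(v) +/- q_hat for an absolute-residual score."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    return mu_hat - q_hat, mu_hat + q_hat

# Usage on made-up calibration scores from treated source units.
scores = np.array([0.2, 0.5, 0.9, 1.4])
w = composite_weights(pi_a=[0.5, 0.4, 0.6, 0.5], rho=[0.7, 0.5, 0.6, 0.4])
q_hat = weighted_quantile(scores, w, level=0.9)
lower, upper = absolute_residual_interval(mu_hat=[1.0, 2.0], q_hat=q_hat)
```

Units that are more typical of the target population, or that were unlikely to receive treatment $a$, count for more when the quantile is taken, which is how the two shifts are corrected in one pass.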

Debiased Machine Learning and Efficiency Results

Plug-in and standard weighting approaches are shown to be consistent but can converge at slow rates when high-dimensional flexible nuisance models are fit. The main technical innovation is the adaptation of DML/semiparametric efficiency theory, providing a multiply-robust estimator for the score quantile via the EIC. This estimator enjoys root-$n$ convergence under mild rate conditions, with the coverage error of prediction intervals shrinking at the parametric rate even when machine learning models are used, contingent on at least one estimator in each of two pairs of nuisance functions, the first being $(m_a, \kappa)$, converging consistently.
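To make the structure of such an estimator concrete, the display below sketches one standard EIC-based (one-step) estimate of the target-population distribution function of the score at a fixed threshold $q$. It is an assumed form derived from the assumptions named earlier, not the paper's stated expression. The readings of $m_a$ and $\kappa$ as nested regressions, the propensity $\pi_a(\bm X) = P(A = a \mid \bm X, S = 1)$, the participation model $\rho(\bm V) = P(S = 1 \mid \bm V)$ with odds $\omega(\bm V) = \{1 - \rho(\bm V)\}/\rho(\bm V)$, the indicator $S$, and the target sample size $n_0$ are all assumptions of this sketch.

```latex
\begin{align*}
m_a(\bm X; q) &= P\{ R_a(Y, \bm V) \le q \mid \bm X, A = a, S = 1 \}, \\
\kappa(\bm V; q) &= E\{ m_a(\bm X; q) \mid \bm V, S = 1 \}, \\
\widehat F(q) &= \frac{1}{n_0} \sum_{i=1}^{n} \Bigl[ (1 - S_i)\, \widehat\kappa(\bm V_i; q)
  + S_i\, \widehat\omega(\bm V_i)\, \{ \widehat m_a(\bm X_i; q) - \widehat\kappa(\bm V_i; q) \} \\
&\qquad\qquad + \frac{S_i\, 1\{A_i = a\}}{\widehat\pi_a(\bm X_i)}\, \widehat\omega(\bm V_i)\,
  \{ 1\{ R_a(Y_i, \bm V_i) \le q \} - \widehat m_a(\bm X_i; q) \} \Bigr].
\end{align*}
```

The score quantile is then the value of $q$ solving $\widehat F(q) = 1 - \alpha$. The two correction terms have mean zero when the regressions are correct, and they repair a misspecified regression when the weights are correct, which is the source of the robustness property.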

The DML estimator is implemented in a split conformal prediction algorithm, avoiding the computational intractability of repeated nuisance model re-fitting by localizing EIC estimation around a suitably consistent pilot estimate of the quantile, as prescribed in recent semiparametric theory [kallus2024localized].
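The localization idea can be sketched as follows, loosely in the spirit of localized DML and not as the paper's algorithm. The assumption is that the regressions `m_pilot` and `kappa_pilot` are fitted once, for the indicator that the score falls below a pilot quantile, and then held fixed while the doubly robust estimating equation is solved over the threshold $q$. All function and argument names are hypothetical, and the nuisance values are taken as given arrays (in practice they would come from cross-fitted models).

```python
import numpy as np

def localized_dr_quantile(scores, S, A, m_pilot, kappa_pilot, pi, rho, level, a=1):
    """Solve a doubly robust estimating equation for the score quantile.

    Nuisances are fixed at a pilot quantile, so no model is refit per candidate q.
    scores, A, m_pilot, pi : only used on source rows (S == 1); target rows may
                             hold any finite placeholder such as 1.0.
    kappa_pilot, rho       : functions of V, so defined on all rows.
    """
    scores, S, A = map(np.asarray, (scores, S, A))
    m_pilot, kappa_pilot, pi, rho = (
        np.asarray(x, dtype=float) for x in (m_pilot, kappa_pilot, pi, rho)
    )
    src, tgt = S == 1, S == 0
    n_tgt = float(tgt.sum())

    odds = np.where(src, (1.0 - rho) / rho, 0.0)        # target-to-source odds
    ipw = np.where(src & (A == a), odds / pi, 0.0)      # composite weight

    # Part of the estimating equation that does not depend on q.
    const = (
        np.where(tgt, kappa_pilot, 0.0)
        + np.where(src, odds * (m_pilot - kappa_pilot), 0.0)
        - ipw * np.where(src, m_pilot, 0.0)
    ).sum() / n_tgt

    # Part that depends on q: a weighted step function in the observed scores.
    use = ipw > 0
    order = np.argsort(scores[use])
    s = scores[use][order]
    cdf = const + np.cumsum(ipw[use][order]) / n_tgt

    idx = np.searchsorted(cdf, level, side="left")
    return s[idx] if idx < len(s) else np.inf
```

Because only the indicator term changes with $q$, the equation is a nondecreasing step function of the threshold and can be solved with one sort, which is the computational saving the localization is meant to deliver.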

Empirical Results

The empirical study includes both synthetic and semi-synthetic experiments, systematically evaluating interval coverage and efficiency under a controlled degree of runtime confounding, source-target sample size ratios, and varying sample sizes.

Figure 1: Experiment results showing empirical coverage and interval length for different sample sizes, highlighting the effect of runtime confounding (10 unmeasured confounders) on method performance.

Across all scenarios, the DML and weighted conformal approaches achieve empirical coverage close to the target rate (e.g., 90%), with DML intervals generally as narrow as or narrower than those of their competitors. By contrast, naively ignoring runtime confounding yields severe miscoverage, especially as the number of unmeasured confounders or the sample size increases. Stratified analysis by counterfactual outcome and runtime confounding severity further substantiates the robustness of the proposed strategies.
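The two metrics reported here, empirical coverage and interval length, are straightforward to compute in a simulation where the counterfactual outcome is known by construction. The helper below is a minimal sketch with hypothetical names; with real data $Y(a)$ is not observed for every unit, so this kind of check is only possible in synthetic or semi-synthetic settings.

```python
import numpy as np

def coverage_and_length(y_true, lower, upper):
    """Return (fraction of outcomes inside their interval, mean interval length)."""
    y_true, lower, upper = (
        np.asarray(x, dtype=float) for x in (y_true, lower, upper)
    )
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean(), (upper - lower).mean()

# Usage on four simulated target units with known counterfactual outcomes.
cov, length = coverage_and_length(
    y_true=[0.0, 1.0, 2.0, 3.0],
    lower=[-1.0, 0.0, 3.0, 2.0],
    upper=[1.0, 2.0, 4.0, 4.0],
)
```

A method is judged well calibrated when the first number is close to the nominal level, and more efficient when the second number is smaller at the same coverage.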

Figure 2: Empirical performance stratified by counterfactual outcome confirms coverage and efficiency for both potential outcomes.

Figure 3: The procedure remains robust to varying numbers of true runtime confounders (here, 5 confounders), maintaining nominal coverage rates as the sample size increases.

Figure 4: With 15 runtime confounders, the advantage of properly accounting for runtime confounding is further accentuated; naive procedures show increasing miscoverage.

In semi-synthetic settings based on the ACIC NSLM database, the DML and weighted approaches similarly maintain valid coverage, while unadjusted approaches exhibit miscoverage exacerbated by quantile-based conformity scoring or under severe confounding scenarios.

Figure 5: On semi-synthetic ACIC challenge data, both proposed methods achieve valid coverage and efficient intervals regardless of sample size, substantially outperforming methods that ignore runtime confounding.

Complementary analyses varying the proportion of source data, together with additional sensitivity checks, corroborate the method's robustness.

Figure 6: Empirical performance across varying proportions of source data indicates that both DML and weighting strategies are robust to this ratio.

Discussion and Implications

The paper establishes that appropriately adjusting for both runtime confounding and population shift is essential for valid conformal prediction intervals on counterfactual outcomes. The multiply-robust DML approach provides formal guarantees (asymptotic coverage, root-$n$ convergence, multiple robustness) under practical assumptions, with strong empirical validation.

Practically, this work enables deployment of data-driven decision-support tools (DSTs) in settings where runtime confounding is unavoidable, such as when privacy, cost, or logistics restrict the measurement of confounders at runtime. Theoretically, the results position semiparametric DML as fundamental to assumption-lean, efficient counterfactual prediction with quantifiable uncertainty.

Potential extensions include the development of formal sensitivity analysis to violations of exchangeability and unconfoundedness, support for continuous treatments and time-to-event outcomes, and adaptation to strict privacy constraints or federated/multi-source environments.

Conclusion

This paper presents a computationally and statistically efficient framework for conformal prediction of counterfactual outcomes under runtime confounding, advancing the state-of-the-art in both method development and theoretical understanding at the intersection of causal inference, machine learning, and uncertainty quantification. The proposed debiased machine learning and weighted conformal procedures deliver valid and efficient interval coverage under minimal assumptions, robust to practical complications inherent in modern deployment scenarios. These advances further enable the use of counterfactual DSTs and causal machine learning systems in real-world settings characterized by incomplete covariate information at deployment.
