TabSCM: A practical Framework for Generating Realistic Tabular Data

Published 24 Apr 2026 in cs.LG | (2604.22337v1)

Abstract: Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves those causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. Such assignments are achieved using conditional diffusion models for continuous variables as child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets, encompassing healthcare, finance, housing, environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583$\times$ faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.


Authors (4)

Summary

The paper presents a novel method, TabSCM, that employs structural causal models for realistic tabular data synthesis.
It uses a hybrid approach combining score-based diffusion for continuous variables and gradient-boosted trees for categorical variables.
Empirical evaluations across diverse datasets report strong fidelity, utility, and fairness, along with fast runtimes.

TabSCM: A Causally-Coherent and Efficient Framework for Realistic Tabular Data Generation

Introduction

The paper "TabSCM: A practical Framework for Generating Realistic Tabular Data" (2604.22337) introduces TabSCM, a method based on structural causal models (SCMs) for synthesizing high-quality, mixed-type tabular data. It targets core limitations of current state-of-the-art tabular data generators (diffusion models, GANs, and LLM-based approaches) by providing causal fidelity, mechanism-level interpretability, competitive utility and faithfulness, and efficient sampling. The method models and preserves explicit causal dependencies among variables, which matters in regulated domains (e.g., finance, healthcare) where valid counterfactual reasoning, fairness, and rule adherence are required. TabSCM starts from a completed partially directed acyclic graph (CPDAG) produced by a causal discovery algorithm, orients its edges into a DAG, and then fits each conditional assignment with either a conditional diffusion model or gradient-boosted trees, depending on variable type.

Model Formulation and Methodology

TabSCM uses an SCM formalism grounded in a DAG inferred from observed data. The pipeline has four steps: causal discovery, graph refinement, per-variable model fitting (structural assignments), and counterfactual intervention support. Graph refinement orients the edges of the CPDAG so that the result is a DAG and therefore admits a topological order. For root nodes, marginal distributions are estimated nonparametrically (via KDE for continuous variables or empirical frequencies for categorical ones). For each non-root node, TabSCM fits the conditional distribution given its parents using a score-based diffusion model (continuous children) or a gradient-boosted classifier (categorical children), so tables with any mix of continuous and categorical columns are covered.
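The root-node step can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the function names are hypothetical, and the Gaussian kernel with a normal-reference bandwidth is an assumed choice, since the summary only says "KDE".

```python
import math
import random
from collections import Counter


def fit_categorical(values):
    """Empirical frequencies for a categorical root node."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def sample_categorical(freqs, rng):
    """Draw one category in proportion to its fitted frequency."""
    categories = list(freqs)
    weights = [freqs[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def fit_gaussian_kde(values):
    """Keep the data and a rule-of-thumb bandwidth (assumed, not from the paper)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    bandwidth = 1.06 * std * n ** (-1 / 5)
    return {"data": list(values), "bandwidth": bandwidth}


def sample_gaussian_kde(kde, rng):
    """Sampling a Gaussian KDE: pick a data point, then add N(0, h^2) noise."""
    center = rng.choice(kde["data"])
    return rng.gauss(center, kde["bandwidth"])
```

Sampling from a KDE never requires evaluating the density itself, which is one reason root nodes are cheap to generate.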

Figure 1: Conceptual framework showing causal discovery, graph refinement, assignment learning, and counterfactual modeling in TabSCM.

Generation proceeds by ancestral sampling along the topological order: each variable is drawn only after its parents, which keeps records semantically and structurally valid. Counterfactual queries are natively supported: an intervention fixes a variable to a chosen value, and its descendants are then regenerated from the modified SCM.
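A toy sketch of ancestral sampling with an optional intervention follows. The three-node graph, the variable names, and the hand-written mechanisms are all hypothetical stand-ins for the fitted diffusion and gradient-boosted models; only the control flow is the point.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical DAG, written as node -> list of parents.
PARENTS = {"age": [], "income": ["age"], "approved": ["age", "income"]}


def sample_age(parents, rng):
    return rng.uniform(20, 70)


def sample_income(parents, rng):
    return 1000.0 * parents["age"] + rng.gauss(0.0, 5000.0)


def sample_approved(parents, rng):
    p = 0.9 if parents["income"] > 40000 else 0.2
    return int(rng.random() < p)


MECHANISMS = {
    "age": sample_age,
    "income": sample_income,
    "approved": sample_approved,
}


def sample_record(rng, do=None):
    """Ancestral sampling; variables listed in `do` are fixed, not sampled."""
    do = do or {}
    record = {}
    # TopologicalSorter expects node -> predecessors, so parents come first.
    for node in TopologicalSorter(PARENTS).static_order():
        if node in do:
            record[node] = do[node]  # intervention cuts the incoming edges
        else:
            parent_values = {p: record[p] for p in PARENTS[node]}
            record[node] = MECHANISMS[node](parent_values, rng)
    return record
```

Calling `sample_record(rng, do={"income": 100000.0})` leaves `age` untouched and regenerates only `approved`, the descendant of the intervened variable.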

Empirical Evaluation

Experiments span seven public datasets capturing both classification and regression tasks, with scales ranging from small to large and diverse domains (health, finance, housing, environment). A rigorous evaluation protocol assesses statistical fidelity (marginal density error, correlation error), downstream utility (AUC/RMSE of models trained on synthetic, tested on real data), privacy (distance to closest record, DCR), imperceptibility (C2ST), and higher-order faithfulness ($\alpha$-precision/$\beta$-recall).
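As an illustration of the privacy metric, the sketch below computes a distance to closest record for each synthetic row. Euclidean distance over already-scaled numeric rows is an assumption; the summary does not state which distance or preprocessing the paper uses.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Tiny hand-made example (hypothetical data).
real_rows = [(0.0, 0.0), (1.0, 1.0)]
synthetic_rows = [(0.0, 0.0), (1.0, 2.0)]
dcr = distance_to_closest_record(synthetic_rows, real_rows)
```

Here `dcr` is `[0.0, 1.0]`: the first synthetic row is an exact copy of a real record, which is the kind of memorization a low DCR flags.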

TabSCM achieves state-of-the-art or competitive results on each of the following axes:

Statistical fidelity: TabSCM achieves low error in both marginal distributions and correlation structures. It frequently matches or outperforms pure diffusion and deep generative models, and exhibits strong generalizability across domains.
Figure 2: Average density error and training runtime across seven real datasets for TabSCM and baselines.

Figure 3: Mean density error, correlation error, and AUC as a function of diffusion steps and training epochs.
Utility: Models trained on TabSCM-sampled data perform comparably to those trained on real data, as evidenced by high AUC or low RMSE, even under settings with severe class imbalance.
Privacy: TabSCM maintains higher DCR than many deep generative baselines, indicating a lower risk of synthetic data inversion or memorization.
Runtime: Because each variable is modeled separately and the stochastic conditionals are modular, TabSCM is up to $583\times$ faster than diffusion-only baselines and scales to large datasets.
Constraint satisfaction: TabSCM shows low domain rule violation rates, outperforming both deep and transformer-based generators on constraint-based validity checks.
Figure 4: Epoch/running time trade-offs for density/correlation error and downstream AUC.
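The constraint-satisfaction criterion in the list above can be made concrete as a rule-violation rate. The sketch below is an assumed formulation, and the two rules are hypothetical examples rather than the checks used in the paper.

```python
def violation_rate(records, rules):
    """Fraction of records that break at least one rule."""
    if not records:
        return 0.0
    violated = sum(
        1 for record in records if not all(rule(record) for rule in rules)
    )
    return violated / len(records)


# Hypothetical domain rules: each returns True when the record is valid.
RULES = [
    lambda r: r["age"] >= 0,
    lambda r: not (r["marital_status"] == "married" and r["age"] < 16),
]

sample = [
    {"age": 34, "marital_status": "married"},
    {"age": 12, "marital_status": "single"},
    {"age": 12, "marital_status": "married"},  # breaks the second rule
    {"age": 51, "marital_status": "single"},
]
rate = violation_rate(sample, RULES)
```

With one invalid record out of four, `rate` is `0.25`.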

Imbalanced Learning, Faithfulness, and Counterfactuals

TabSCM demonstrates robust performance in minority-class upsampling and imbalanced learning scenarios. By leveraging causal conditionals and sampling from generative mechanisms conditioned on parent sets, TabSCM restores fairness (low FNR/FPR) and recovers trust metrics comparable to those on balanced data, significantly outperforming heuristic oversampling (e.g., SMOTE, ADASYN) and other baselines.

Figure 5: False Negative/Positive Rate for synthetic upsampling; TabSCM restores fairness in highly imbalanced regimes.
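For reference, the two error rates used as fairness indicators here follow directly from the confusion matrix. The sketch assumes binary labels with 1 as the positive (minority) class; the example labels are made up.

```python
def fnr_fpr(y_true, y_pred):
    """Return (false negative rate, false positive rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Hypothetical predictions on an imbalanced test set (3 positives, 5 negatives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
fnr, fpr = fnr_fpr(y_true, y_pred)
```

In this example two of the three positives are missed (FNR of 2/3) while one of five negatives is flagged (FPR of 0.2), the typical pattern under class imbalance that upsampling aims to correct.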

In higher-order structure preservation, TabSCM comes close to diffusion-model performance in $\alpha$-precision and $\beta$-recall, indicating that its causal decomposition does not sacrifice joint distribution fidelity. Its samples are also among the hardest for trained C2ST detectors to distinguish from real data.
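A classifier two-sample test (C2ST) can be sketched without any ML library by using a leave-one-out 1-nearest-neighbour detector. This detector is a stand-in chosen for brevity; the summary does not say which classifier the paper trains.

```python
import math


def c2st_1nn_accuracy(real, synthetic):
    """Leave-one-out 1-NN accuracy at telling real (0) from synthetic (1) rows."""
    points = [(tuple(x), 0) for x in real] + [(tuple(x), 1) for x in synthetic]
    correct = 0
    for i, (x, label) in enumerate(points):
        # Nearest other point, excluding the query itself.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(x, points[j][0]),
        )
        correct += int(points[nearest][1] == label)
    return correct / len(points)
```

An accuracy near 0.5 means the detector cannot separate the two samples, while an accuracy near 1.0 means the synthetic rows are easy to spot. The quadratic cost makes this sketch suitable only for small samples.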

Interpretability and Mechanism-Auditing

The explicit factorization of the joint distribution allows forensic inspection, mechanism-level explanation, and stress testing. SHAP attributions computed on the same real test set agree closely whether the downstream model was trained on real or on TabSCM-synthesized data, which indicates that feature importances and real-world mechanisms are preserved.

Figure 6: SHAP value agreement between models trained on real versus TabSCM-generated synthetic data.

Figure 7: Mean absolute SHAP values for learned structural assignments show TabSCM recovers domain-consistent mechanisms.

Structural assignments can be independently interpreted and directly intervened upon, supporting counterfactual analysis and policy simulation at fine granularity. TabSCM generates plausible counterfactual examples with realistic attribute distributions, in contrast to predictor-driven recourse algorithms that may produce less plausible or semantically nonsensical counterfactuals under constraints.
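To show what a mechanism-based counterfactual looks like in the simplest case, the sketch below applies the textbook abduction, action, prediction recipe to a single additive-noise assignment. The additive form, the mechanism `f_income`, and the variable names are assumptions for illustration; TabSCM's learned diffusion and tree-based assignments are more complex, and this is not the paper's procedure.

```python
def f_income(age):
    """Hypothetical structural assignment: income = f(age) + noise."""
    return 1000.0 * age


def counterfactual_income(observed, new_age):
    """What would this individual's income be had their age been `new_age`?"""
    # Abduction: recover the individual's noise term from the observed record.
    noise = observed["income"] - f_income(observed["age"])
    # Action and prediction: set the new parent value, reuse the same noise.
    return f_income(new_age) + noise


observed = {"age": 40.0, "income": 45000.0}
cf_income = counterfactual_income(observed, new_age=50.0)
```

The recovered noise is 5000, so `cf_income` is `55000.0`: the individual keeps their own deviation from the mechanism, which is what separates a counterfactual from a fresh interventional sample.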

Figure 8: Marginal distributions for free attributes in counterfactual examples generated by DiCE (model-agnostic) and TabSCM (mechanism-based).

Practical and Theoretical Implications

Integrating SCMs into tabular data synthesis marks a substantive methodological advancement by merging the transparency and modularity of classical structural modeling with the expressivity and flexibility of modern score-based diffusion and ensemble methods. Practically, TabSCM addresses regulatory, auditing, and fairness requirements in sensitive application domains by providing inspection-ready, interpretable data generation, native support for interventions, and facilitation of robust policy scenario analysis. Its modular design enables efficient resource scaling and rapid iteration over heterogeneous data types. Theoretically, this work demonstrates that explicit causal factorization can yield competitive or superior fidelity and utility, and can be computationally advantageous compared to monolithic deep neural synthesis.

TabSCM's explicit mechanisms unlock extensions in fairness-aware data generation, transparent data debugging, and simulation-based counterfactual explainability. The framework is immediately extensible to more sophisticated causal discovery and refitting approaches (e.g., inclusion of latent confounders, time-dependent or hierarchical models). Furthermore, as regulatory frameworks (such as the EU AI Act) begin to require explainable synthetic data for high-stakes applications, TabSCM provides an actionable architecture template.

Conclusion

TabSCM establishes a new practical standard for causally sound, statistically faithful, and auditable tabular synthetic data generation. It shows that hybrid diffusion-ensemble conditional assignments within an SCM framework enable strong utility, low privacy risk, rapid sample generation, and robust, fair counterfactual modeling. By making the data-generating process transparent, modular, and intervention-ready, TabSCM is positioned as a foundational architecture for future practical and responsible synthetic data workflows.
