Revisiting Differentiable Structure Learning: Inconsistency of $\ell_1$ Penalty and Beyond
Abstract: Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While Ng et al. (2024) highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of $\ell_1$-penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on $\ell_0$-penalized likelihood with hard acyclicity constraint, where the $\ell_0$ penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.
- Vector autoregressions, policy analysis, and directed acyclic graphs: an application to the us economy. Journal of Applied Economics, 6(1):1–24, 2003.
- Dagma: Learning dags via m-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239, 2022.
- Alexis Bellot and Mihaela van der Schaar. Deconfounded score method: Scoring dags with dense unobserved confounding. arXiv preprint arXiv:2103.15106, 2021.
- Differentiable causal discovery under unmeasured confounding. In International Conference on Artificial Intelligence and Statistics, pp. 2314–2322. PMLR, 2021.
- Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865–21877, 2020.
- Multi-task learning of order-consistent causal graphs. Advances in Neural Information Processing Systems, 34:11083–11095, 2021.
- David Maxwell Chickering. Learning bayesian networks is np-complete. Learning from data: Artificial intelligence and statistics V, pp. 121–130, 1996.
- David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
- Large-sample learning of bayesian networks is np-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.
- James Cussens. Bayesian network learning with cutting planes. In 27th Conference on Uncertainty in Artificial Intelligence, pp. 153–160. AUAI Press, 2011.
- Global optimality in bivariate gradient-based dag learning. Advances in Neural Information Processing Systems, 36:17929–17968, 2023.
- PÂ ERDdS and AÂ R&wi. On random graphs i. Publ. math. debrecen, 6(290-297):18, 1959.
- Differentiable causal discovery under latent interventions. In Conference on Causal Learning and Reasoning, pp. 253–274. PMLR, 2022.
- Feddag: Federated dag structure learning. arXiv preprint arXiv:2112.03555, 2021.
- Missdag: Causal discovery in the presence of missing data with continuous additive noise models. Advances in Neural Information Processing Systems, 35:5024–5038, 2022.
- Categorical reparametrization with gumble-softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview. net, 2017.
- Unsuitability of NOTEARS for causal graph discovery when dealing with dimensional quantities. Neural Processing Letters, 54:1–9, 06 2022.
- Structural agnostic modeling: Adversarial learning of causal graphs. Journal of Machine Learning Research, 23(219):1–62, 2022.
- Exact bayesian structure discovery in bayesian networks. The Journal of Machine Learning Research, 5:549–573, 2004.
- Probabilistic graphical models: principles and techniques. MIT press, 2009.
- Gradient-based neural dag learning. arXiv preprint arXiv:1906.02226, 2019.
- High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15(88):3065–3105, 2014.
- Stable differentiable causal discovery. In Proceedings of the 41st International Conference on Machine Learning, 2024.
- Towards federated bayesian network structure learning with continuous optimization. In International Conference on Artificial Intelligence and Statistics, pp. 8095–8111. PMLR, 2022.
- On the role of sparsity and dag constraints for learning linear dags. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
- On the convergence of continuous constrained optimization for structure learning. In International Conference on Artificial Intelligence and Statistics, pp. 8176–8198. Pmlr, 2022a.
- Masked gradient-based causal structure learning. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pp. 424–432. SIAM, 2022b.
- Structure learning with continuous optimization: A sober look and beyond. In Causal Learning and Reasoning, pp. 71–105. PMLR, 2024.
- Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pp. 1595–1605. Pmlr, 2020.
- JÂ PEARL. Probabilistic reasoning in intelligent systems; network of plausible inference. Morgan Kaufmann, 1988, 1988.
- Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
- A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International journal of data science and analytics, 3:121–129, 2017.
- Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018.
- Beware of the simulated DAG! causal discovery benchmarks may be easy to game. In Advances in Neural Information Processing Systems, 2021.
- Colide: Concomitant linear dag estimation. arXiv preprint arXiv:2310.02895, 2023.
- The tetrad project: Constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.
- Gideon Schwarz. Estimating the dimension of a model. The annals of statistics, pp. 461–464, 1978.
- Finding optimal Bayesian networks by dynamic programming. Carnegie Mellon University. Center for Automated Learning and Discovery, 2005.
- An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62–72, 1991.
- Causation, prediction, and search. MIT press, 2001.
- Nts-notears: Learning nonparametric temporal dags with time-series data and prior knowledge. arXiv preprint arXiv:2109.04286, 2021.
- Use of directed acyclic graphs (dags) to identify confounders in applied health research: review and recommendations. International journal of epidemiology, 50(2):620–632, 2021.
- Algorithms for large scale markov blanket discovery. In FLAIRS, volume 2, pp. 376–81, 2003.
- Causal discovery from incomplete data: a deep learning approach. arXiv preprint arXiv:2001.05343, 2020.
- Sequential recommendation with causal behavior discovery. arXiv preprint arXiv:2204.00216, 2022.
- Dags with no fears: A closer look at continuous optimization for learning bayesian networks. Advances in Neural Information Processing Systems, 33:3895–3906, 2020.
- Stephen J Wright. Numerical optimization, 2006.
- Feature selection using stochastic gates. In International conference on machine learning, pp. 10648–10659. PMLR, 2020.
- Learning causal representations for robust domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 2021.
- Dag-gnn: Dag structure learning with graph neural networks. In International conference on machine learning, pp. 7154–7163. PMLR, 2019.
- Learning optimal bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48:23–65, 2013.
- Truncated matrix power iteration for differentiable dag learning. Advances in Neural Information Processing Systems, 35:18390–18402, 2022.
- Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018.
- Learning sparse nonparametric dags. In International Conference on Artificial Intelligence and Statistics, pp. 3414–3425. Pmlr, 2020.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.