
Score matching through the roof: linear, nonlinear, and latent variables causal discovery (2407.18755v1)

Published 26 Jul 2024 in stat.ML, cs.AI, and stat.ME

Abstract: Causal discovery from observational data holds great promise, but existing methods rely on strong assumptions about the underlying causal structure, often requiring full observability of all relevant variables. We tackle these challenges by leveraging the score function $\nabla \log p(X)$ of observed variables for causal discovery and propose the following contributions. First, we generalize the existing results of identifiability with the score to additive noise models with minimal requirements on the causal mechanisms. Second, we establish conditions for inferring causal relations from the score even in the presence of hidden variables; this result is two-faced: we demonstrate the score's potential as an alternative to conditional independence tests to infer the equivalence class of causal graphs with hidden variables, and we provide the necessary conditions for identifying direct causes in latent variable models. Building on these insights, we propose a flexible algorithm for causal discovery across linear, nonlinear, and latent variable models, which we empirically validate.

Authors (5)
  1. Francesco Montagna
  2. Philipp M. Faller
  3. Patrick Bloebaum
  4. Elke Kirschbaum
  5. Francesco Locatello

Summary

  • The paper demonstrates that the score function accurately identifies causal structure in additive noise models by linking sink node residuals to noise terms.
  • It extends theoretical foundations by using Hessian conditions to capture conditional independencies and determine Markov equivalence in latent variable scenarios.
  • AdaScore, the proposed algorithm, flexibly switches between modes to recover directed and bidirected edges in both linear and nonlinear causal discovery.

This paper, "Score matching through the roof: linear, nonlinear, and latent variables causal discovery" (Montagna et al., 26 Jul 2024), explores leveraging the score function $\nabla \log p(X)$ of observed variables for causal discovery, aiming to relax common strong assumptions and handle latent variables. The core idea is that the score function and its derivatives contain information about the underlying causal structure.

The paper makes two main theoretical contributions. First, it generalizes existing results to show that the score function identifies the causal structure for additive noise models (ANMs), including both linear and nonlinear mechanisms, with minimal assumptions on the form of the mechanisms or noise distributions beyond additivity and non-Gaussianity in the linear case. This extends prior work that was limited to specific linear or nonlinear settings. A key insight is that for a sink node $X_s$ (a variable with no children in the causal graph), the score component $\partial_{X_s} \log p(X)$ is a function of its noise term $N_s$. The paper shows that the residual $R_s = X_s - E[X_s \mid X_{\setminus s}]$ (the error from optimally predicting $X_s$ from all other variables) equals the noise term $N_s$ when $X_s$ is a sink. Thus, the score of a sink node can be predicted perfectly from its residual. Proposition 1 formalizes this: a node $X_j$ is a sink if and only if the mean squared error of predicting $\partial_{X_j} \log p(X)$ from $R_j$ is zero, i.e. $E\left[\left(E\left[\partial_{X_j} \log p(X) \mid R_j\right] - \partial_{X_j} \log p(X)\right)^2\right] = 0$. This condition can be used to identify sinks and, iteratively, the causal order in fully observed ANMs, as done in prior score-based algorithms such as NoGAM.
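The sink criterion can be illustrated numerically. The sketch below is not the paper's implementation: it uses a hand-picked linear Laplace ANM ($X_1 \to X_2$) where the true score components are known in closed form (an assumption made purely to avoid score estimation), OLS for the conditional expectations, and a small k-NN regression for the score-from-residual prediction. The sink's score is nearly perfectly predictable from its residual, while the non-sink's is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Linear non-Gaussian ANM: X1 -> X2 with Laplace noise (illustrative choice)
x1 = rng.laplace(size=n)
n2 = rng.laplace(size=n)
x2 = 2.0 * x1 + n2

# Analytical score components of the joint density, known here by construction:
#   d/dx2 log p(x) = -sign(x2 - 2 x1)
#   d/dx1 log p(x) = -sign(x1) + 2 sign(x2 - 2 x1)
s2 = -np.sign(x2 - 2.0 * x1)
s1 = -np.sign(x1) + 2.0 * np.sign(x2 - 2.0 * x1)

def ols_residual(y, x):
    """Residual of least-squares regression of y on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def knn_mse(r, s, k=20):
    """Split-sample MSE of k-NN regression of score component s on residual r."""
    half = len(r) // 2
    r_tr, s_tr, r_te, s_te = r[:half], s[:half], r[half:], s[half:]
    d = np.abs(r_te[:, None] - r_tr[None, :])
    idx = np.argpartition(d, k, axis=1)[:, :k]   # k nearest training residuals
    pred = s_tr[idx].mean(axis=1)
    return float(np.mean((pred - s_te) ** 2))

# Proposition 1: for the sink X2, the residual R2 = N2 predicts its score component
mse_sink = knn_mse(ols_residual(x2, x1), s2)
# For the non-sink X1, the residual carries insufficient information about its score
mse_nonsink = knn_mse(ols_residual(x1, x2), s1)
print(mse_sink, mse_nonsink)
```

In this toy setup `mse_sink` is close to zero while `mse_nonsink` stays bounded away from it, which is exactly the gap a finite-sample test would exploit.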

The second major contribution addresses the presence of unobserved variables (latent confounders). The paper demonstrates how the score can still provide information about the causal structure in this more complex setting. When latent variables are present, the observed variables $V$ are a marginalization of the full set of variables $X = V \cup U$ (observed and unobserved). The causal relationships among $V$ are represented by a marginal graph, often a maximal ancestral graph (MAG), which can contain directed ($\to$) and bidirected ($\leftrightarrow$) edges. The set of conditional independencies among $V$ corresponds to m-separation in the MAG, and graphs implying the same independencies form a Markov equivalence class represented by a partial ancestral graph (PAG). Proposition 2 shows that a vanishing cross-partial derivative of the log density of the observed variables, $\frac{\partial^2}{\partial V_i \partial V_j} \log p(V_Z) = 0$, is equivalent to the conditional independence $V_i \perp\!\!\!\perp V_j \mid V_Z \setminus \{V_i, V_j\}$, which in turn is equivalent to m-separation in the marginal MAG. This means the Hessian matrix of $\log p(V)$ can be used as an alternative to traditional conditional independence tests to identify the equivalence class (skeleton and v-structures) of the marginal MAG.
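In the linear Gaussian case this Hessian-independence correspondence is exact and easy to check: the Hessian of $\log p(x)$ is $-\Theta$ (the negative precision matrix) at every point, and a zero off-diagonal precision entry is the classical criterion for conditional independence given all remaining variables. A minimal sketch for the chain $X_1 \to X_2 \to X_3$ (coefficients chosen arbitrarily for illustration):

```python
import numpy as np

# Linear Gaussian chain X1 -> X2 -> X3 with unit noise variances
a, b = 0.8, -1.3
B = np.zeros((3, 3))
B[1, 0], B[2, 1] = a, b                    # structural coefficients
I = np.eye(3)
# X = (I - B)^{-1} N  =>  Sigma = (I-B)^{-1} (I-B)^{-T}
Sigma = np.linalg.inv(I - B) @ np.linalg.inv(I - B).T
Theta = np.linalg.inv(Sigma)               # Hessian of log p(x) is -Theta everywhere

print(np.round(Theta, 6))
# Theta[0, 2] vanishes: the cross-partial of log p in (x1, x3) is zero,
# matching X1 _||_ X3 | X2; Theta[0, 1] is nonzero, matching the X1 - X2 edge.
```

For non-Gaussian or nonlinear models the Hessian entry varies with $x$, which is why the finite-sample version of Proposition 2 tests whether the estimated cross-partial is zero across samples rather than at a single point.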

Furthermore, under the assumption of an additive noise model for the observed variables in which noise terms are recentered by the additive effects of latent variables (Assumption 1), Proposition 3 provides conditions for identifying direct causal effects between observed variables even in the presence of latent confounders. For two adjacent nodes $V_i, V_j$ in the marginal graph, a direct causal effect $V_i \to V_j$ exists and is unconfounded by latent variables if and only if predicting the score component $\partial_{V_j} \log p(V_Z)$ from the residual $R_j(V_Z) = V_j - E[V_j \mid V_{Z \setminus \{j\}}]$ (where $V_Z = \mathrm{Pa}_j \cup \{V_i, V_j\}$ and $\mathrm{Pa}_j$ denotes the observed parents of $V_j$ in the full graph) yields zero mean squared error. If the mean squared error is non-zero, then either $V_i$ is not a direct cause of $V_j$ or there is an unobserved confounding path affecting the relationship. This provides a score-based criterion to distinguish directed edges from bidirected edges (which represent latent confounding).

Building on these theoretical insights, the paper proposes AdaScore (Adaptive Score-based causal discovery), a flexible algorithm summarized in Algorithm 1. AdaScore can operate in different modes depending on the user's assumptions, outputting a Markov equivalence class (using Proposition 2), a directed acyclic graph (leveraging Proposition 1 like NoGAM), or a mixed graph accounting for latent variables. The mixed graph version aims to identify direct, unconfounded causal effects using Proposition 3 while relying on Proposition 2 for general conditional independencies.

The algorithm iteratively processes nodes. In the best case, it identifies "unconfounded sinks" using the Proposition 1 condition (generalized for latent variables) applied to the set of remaining variables. If such a sink $V_i$ is found (meaning $V_i$ has no children among the remaining nodes and no latent confounder paths connect it to other remaining nodes), its neighbors (identified using Proposition 2) are marked as its parents, and $V_i$ is removed. If no unconfounded sink is found among the remaining nodes, the algorithm selects an arbitrary node, identifies its neighbors using Proposition 2, and then attempts to orient edges using the Proposition 3 condition on subsets of the neighborhood. This involves checking whether predicting a node's score component from its residual (conditioned on a subset of neighbors) results in zero mean squared error. If $V_i \to V_j$ is identified (meaning $V_j$'s score is predictable from its residual given a parent set including $V_i$, but $V_i$'s score is not predictable from its residual given a parent set including $V_j$), the algorithm may prioritize exploring from $V_j$. Nodes with no outgoing directed edges identified this way are eventually removed. Finally, any remaining unoriented (bidirected) edges can be pruned using Proposition 2. The computational complexity is polynomial in the best case (when many unconfounded sinks are found iteratively) but can be exponential in the worst case due to checking subsets of neighbors, similar to constraint-based methods like FCI.
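The best-case control flow of this loop can be sketched with the statistical tests replaced by graph oracles. This is only a skeleton of the sink-removal iteration under those oracle assumptions, not AdaScore itself: `is_unconfounded_sink` stands in for the Proposition 1 test and `neighbors` for the Proposition 2 adjacency test, both answered from a known toy DAG.

```python
# Toy fully observed DAG: X0 -> X1 -> X2 and X0 -> X2
true_edges = {(0, 1), (0, 2), (1, 2)}

def is_unconfounded_sink(v, remaining):
    # Oracle stand-in for the Proposition 1 test: v has no children left
    return not any((v, w) in true_edges for w in remaining if w != v)

def neighbors(v, remaining):
    # Oracle stand-in for the Proposition 2 adjacency test
    return {w for w in remaining if (v, w) in true_edges or (w, v) in true_edges}

def best_case_loop(nodes):
    """Iteratively peel off unconfounded sinks, recording their parents."""
    remaining, edges = set(nodes), set()
    while remaining:
        sink = next(v for v in sorted(remaining)
                    if is_unconfounded_sink(v, remaining))
        edges |= {(p, sink) for p in neighbors(sink, remaining)}
        remaining.discard(sink)
    return edges

print(best_case_loop(range(3)))  # recovers the true edge set
```

With perfect oracles each iteration removes one sink, so the loop runs in polynomial time; the exponential worst case arises only when no unconfounded sink exists and subsets of neighborhoods must be searched.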

For practical implementation with finite samples, the theoretical conditions involving zero values are translated into statistical hypothesis tests. The score function and its Hessian are estimated using score matching techniques, such as the Stein gradient estimator. Hypothesis tests (e.g., t-tests for the mean of Hessian entries for Proposition 2, Mann-Whitney U-tests for comparing distributions of prediction errors for Proposition 3) are used to decide whether quantities are statistically different from zero or significantly different from each other. Cross-validation is employed to generate residuals to prevent overfitting. A post-processing pruning step based on CAM (Causal Additive Models) is also mentioned to refine directed edges.
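To make the score estimation step concrete, below is a minimal 1-D sketch of a Stein-type gradient estimator with an RBF kernel and a median-heuristic bandwidth, in the style of Li and Turner's estimator. It is an illustration under those assumptions, not the paper's implementation, and it checks itself against standard Gaussian samples, whose true score is $-x$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)  # samples from N(0, 1); true score is -x

def stein_score_1d(x, eta=0.1):
    """Stein gradient estimator sketch (RBF kernel, 1-D).

    Solves (K + eta I) g = -b, where b_j = sum_i dK(x_i, x_j)/dx_i,
    following Stein's identity E[s(x) K(x, y) + d/dx K(x, y)] = 0.
    """
    d = x[:, None] - x[None, :]
    sigma2 = np.median(np.abs(d)) ** 2 + 1e-12    # median heuristic bandwidth
    K = np.exp(-d ** 2 / (2.0 * sigma2))
    b = (K * (-d)).sum(axis=0) / sigma2           # b_j = sum_i K_ij (x_j - x_i) / sigma2
    return -np.linalg.solve(K + eta * np.eye(len(x)), b)

s_hat = stein_score_1d(x)
print(np.corrcoef(s_hat, -x)[0, 1])  # correlation with the true score -x
```

The regularizer `eta` trades bias for stability, mirroring the paper's point that all downstream zero-MSE conditions become hypothesis tests once the score is only estimated.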

The paper evaluates AdaScore using synthetic data generated with the causally library, simulating linear and nonlinear ANMs with and without latent variables under sparse and dense graph structures (Erdős–Rényi). Performance is measured using Structural Hamming Distance (SHD) and F1 score for skeleton recovery in non-additive settings. Experiments show that AdaScore's performance is generally comparable to or better than baselines like NoGAM, CAM-UV, RCD, and DirectLiNGAM, particularly excelling in nonlinear additive settings. While it does not always drastically outperform methods specialized for particular settings (like RCD for linear or CAM-UV for nonlinear models with latents), its strength lies in its broad theoretical guarantees covering linear, nonlinear, and latent variable ANMs, making it potentially more applicable when specific structural assumptions are uncertain. The empirical results also show consistent performance improvement with increasing sample size.

A limitation of the empirical evaluation is the reliance on synthetic data, a common practice in causal discovery due to the lack of suitable real-world benchmarks with known ground truth. The paper cautions that results on synthetic graphs might not fully capture performance in complex real-world scenarios.

In summary, this paper advances score-based causal discovery by extending its theoretical foundations to a broader class of additive noise models and establishing conditions for identifying both the equivalence class and direct, unconfounded causal effects in the presence of latent variables. The proposed AdaScore algorithm provides a flexible framework to leverage these results, adaptable to different modeling assumptions, offering a step towards causal discovery methods less dependent on strong, untestable prior knowledge.
