A novel constraint-based structure learning algorithm using marginal causal prior knowledge

Causal DAG models

Let \({\mathcal{G}}\left( {V,E} \right)\) be a DAG with a vertex set V and an edge set E. The undirected graph obtained by replacing all directed edges of \({\mathcal{G}}\) with undirected edges is the skeleton of \({\mathcal{G}}\). When there is an edge (undirected or directed) between two variables A and B in a graph \({\mathcal{G}}\), A and B are adjacent. When there is a directed edge \(A \to B\) in a graph \({\mathcal{G}}\), A is a parent of B; equivalently, B is a child of A. The parents, children, and adjacent vertices of variable A in a DAG \({\mathcal{G}}\) are denoted by \(pa\left( {A,{\mathcal{G}}} \right)\), \(ch\left( {A,{\mathcal{G}}} \right)\) and \(adj\left( {A,{\mathcal{G}}} \right)\), respectively. A sequence of vertices \(\left( {X_{i} ,X_{i + 1} ,X_{i + 2} , \ldots ,X_{j} } \right)\) is a directed (undirected) path if there is a directed (undirected) edge between every two consecutive vertices. A is a cause of B if there is a directed path from A to B. For simplicity, we use \(X_{i} \to X_{j}\) and \(X_{i} \dashrightarrow X_{j}\) to denote a directed edge and a directed path, respectively, and \(X_{i} \not\dashrightarrow X_{j}\) to denote that there is no directed path from \(X_{i}\) to \(X_{j}\). If there is a directed path from \(X_{i}\) to \(X_{j}\), \(X_{i}\) is an ancestor (indirect cause) of \(X_{j}\) and \(X_{j}\) is a descendant of \(X_{i}\), denoted by \(X_{i} \in an\left( {X_{j} ,{\mathcal{G}}} \right)\) and \(X_{j} \in de\left( {X_{i} ,{\mathcal{G}}} \right)\). A v-structure is a structure \(A \to B \leftarrow C\) with the constraint that \(A\) is not adjacent to \(C\); the vertex \(B\) is called a collider.

A DAG encodes the conditional independence (CI) relations induced by d-separation25. When two DAGs are Markov equivalent, their conditional independence relations are the same26,27. According to Pearl et al.28, two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. An equivalence class can be represented by a CPDAG \({\mathcal{G}}^{*}\). Given a consistent background knowledge set K with respect to \({\mathcal{G}}^{*}\), some undirected edges in \({\mathcal{G}}^{*}\) may be oriented, resulting in a partially directed graph \({\mathcal{H}}\). A maximal PDAG (MPDAG)13 can be constructed by orienting further undirected edges in \({\mathcal{H}}\) using Meek's criteria16.

Let \(P_{{\mathcal{G}}}\) be the joint distribution of a given DAG \({\mathcal{G}}\). The causal Markov assumption implies that \(P_{{\mathcal{G}}}\) factorizes as \(P\left( {x_{1} , \ldots ,x_{n} } \right) = \mathop \prod \limits_{i = 1}^{n} P\left( {x_{i} |pa\left( {x_{i} ,{\mathcal{G}}} \right)} \right)\). If the conditional independencies in \(P_{{\mathcal{G}}}\) and in the DAG \({\mathcal{G}}\) coincide, we say that \(P_{{\mathcal{G}}}\) is faithful to \({\mathcal{G}}\). In this paper, we also assume that there are no hidden variables and no selection bias.

PC algorithm

The PC (Peter-Clark) algorithm can learn the global structure of a high-dimensional CPDAG10. The PC algorithm uses CI tests to determine the skeleton and then employs Meek rules to establish the direction of edges. The PC-stable algorithm improves on the original PC algorithm by removing part of its order-dependence9. The PC-stable algorithm, outlined in Algorithm 1, includes three main steps. Step 1 uses CI tests to ascertain the skeleton of the DAG and the separating sets between vertices. Steps 2 and 3 then orient the remaining undirected edges. The PC-stable algorithm is able to take prior knowledge into consideration.
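As an illustration, the following minimal sketch shows how PC-stable can be run from the R package bnlearn with direct prior knowledge supplied as a whitelist and a blacklist of arcs; the built-in gaussian.test data set and the specific arcs chosen here are illustrative assumptions, not part of the experiments reported below.

library(bnlearn)

data(gaussian.test)  # built-in synthetic Gaussian data set (variables A-G)

## Direct prior knowledge: arcs that must (whitelist) or must not (blacklist)
## appear in the learned graph. Both arcs below are hypothetical examples.
wl <- data.frame(from = "A", to = "C")
bl <- data.frame(from = "F", to = "A")

cpdag <- pc.stable(gaussian.test, whitelist = wl, blacklist = bl,
                   test = "cor", alpha = 0.05)
cpdag          # learned (partially directed) graph
arcs(cpdag)    # its arcs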
The PC algorithm is computationally efficient and is capable of learning large CPDAGs with hundreds of variables29,30,31.

Algorithm 1. The PC-stable algorithm.

Conditional independence conditions for marginal causal relations

Prior knowledge constitutes a set of constraints during structure learning. In this study, we take both positive and negative marginal causal knowledge into consideration. A positive marginal causal relation between two vertices A and B indicates that there exists a directed path from A to B (i.e., \(A \in an\left( B \right)\)), denoted by \(A \dashrightarrow B\). Similarly, a negative marginal causal relation indicates that there is no directed path from A to B (i.e., \(A \notin an\left( B \right)\)), denoted by \(A \not\dashrightarrow B\). In structure learning algorithms, we use whitelist marginal prior causal knowledge and blacklist marginal prior causal knowledge to denote a positive and a negative marginal causal relation, respectively. In this section, we present several properties of marginal prior causal knowledge. First, the concept of the minimal separating set is introduced.
Definition 1
(minimal separating set). Given a directed graph \({\mathcal{G}}\left( {{\varvec{V}},{\varvec{E}}} \right)\), a pair of vertices \(X,Y \in {\varvec{V}}\), and a set \(S \subset {\varvec{V}}\) such that \(X,Y \notin S\) and \(X \bot Y|S\), we say that S is a minimal separating set of X and Y, denoted by \(msep\left( {X,Y} \right)\), if there is no proper subset \(S^{\prime} \subset S\) such that \(X \bot Y|S^{\prime}\).
Definition 1 means that a separating set is minimal if and only if none of its proper subsets separates the target node pair. Note that the separating sets output by PC-based algorithms are minimal separating sets of node pairs: PC-based algorithms progressively increase the size of the candidate separating sets during the adjacency phase and therefore return a minimal separating set. Minimal separating sets are important for linking marginal prior causal knowledge with statistical conditional independence tests, as stated in Lemma 1 below, due to Pearl32. A minimal separating set can also be obtained by shrinking any given separating set, as sketched after this paragraph.
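As a rough sketch (not the implementation used in the paper), a separating set can be shrunk towards a minimal one by repeatedly dropping any variable whose removal preserves the conditional independence. The helper name minimal_sepset is hypothetical, and bnlearn's ci.test with the Gaussian partial-correlation test "cor" (continuous data) is assumed.

library(bnlearn)

## Greedily shrink a separating set S for (x, y) until no single variable can
## be removed while preserving x _||_ y | S.
minimal_sepset <- function(x, y, S, data, alpha = 0.05) {
  repeat {
    removed <- FALSE
    for (z in S) {
      S.try <- setdiff(S, z)
      p <- if (length(S.try) == 0)
        ci.test(x, y, data = data, test = "cor")$p.value
      else
        ci.test(x, y, S.try, data = data, test = "cor")$p.value
      if (p > alpha) {        # x and y remain independent without z
        S <- S.try
        removed <- TRUE
        break
      }
    }
    if (!removed) return(S)   # no variable can be dropped: S is minimal
  }
}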
Lemma 1
Given a directed graph \({\mathcal{G}}\left( {{\varvec{V}},{\varvec{E}}} \right)\) and a pair of nodes \(X,Y \in {\varvec{V}}\), if S is a minimal separating set of X and Y, then \(S \cap \left( {an\left( X \right) \cup an\left( Y \right)} \right) = S\), i.e., every variable in S is an ancestor of X or of Y.
With Lemma 1 and the pairwise independencies provided by Koller and Friedman33, we can efficiently find a minimal separating set of a node pair X and Y when the marginal causal relation between X and Y is known, and we obtain the following theorems.
Theorem 1
Let \({\mathcal{G}}\) be a DAG and let X and Y be two distinct non-adjacent vertices in \({\mathcal{G}}\). If X is an ancestor of Y (marginal causal relation \(X \dashrightarrow Y\)) in \({\mathcal{G}}\), then every variable in the minimal separating set of X and Y is an ancestor of Y. In particular, when the minimal separating set S contains exactly one node, that node is both a descendant of X and an ancestor of Y.

Theorem 2
Let \({\mathcal{G}}\) be a DAG and let X and Y be two distinct non-adjacent vertices in \({\mathcal{G}}\). If X is not an ancestor of Y (marginal causal relation \(X \not\dashrightarrow Y\)) in \({\mathcal{G}}\), then no variable in the minimal separating set of X and Y is a descendant of X.
Theorems 1 and 2 indicate that marginal causal relations can be transformed into a set of necessary causal relations based on minimal separating sets, and minimal separating sets can be found using conditional independence tests on observational data. Thus, we can incorporate marginal causal relations with observational data before learning the whole PDAG. With Theorems 1 and 2, we can iteratively decompose the marginal causal relations into necessary causal relations.
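To make this decomposition concrete, the following sketch (our illustration, not the authors' code) turns a single marginal prior relation and the minimal separating set of the pair into the necessary causal relations given by Theorems 1 and 2; the function decompose_relation and the (from, to, type) encoding are hypothetical.

## Decompose one marginal prior relation between x and y into necessary causal
## relations, given the minimal separating set S of (x, y).
## type = "pos" encodes x -->> y (x is an ancestor of y);
## type = "neg" encodes that there is no directed path from x to y.
decompose_relation <- function(x, y, type, S) {
  out <- data.frame(from = character(0), to = character(0), type = character(0))
  if (length(S) == 0) return(out)
  if (type == "pos") {
    # Theorem 1: every node in S is an ancestor of y ...
    out <- rbind(out, data.frame(from = S, to = y, type = "pos"))
    # ... and if S has a single node, that node is also a descendant of x.
    if (length(S) == 1)
      out <- rbind(out, data.frame(from = x, to = S, type = "pos"))
  } else {
    # Theorem 2: no node in S is a descendant of x.
    out <- rbind(out, data.frame(from = x, to = S, type = "neg"))
  }
  out
}

## Example 1 below revisited: prior A -->> E with msep(A, E) = {C, D}
decompose_relation("A", "E", "pos", c("C", "D"))
## yields the necessary causal relations C -->> E and D -->> E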
Example 1
Figure 1 gives an example of Theorem 1. Figure 1a is the underlying true DAG \({\mathcal{G}}\). Given the complete undirected graph \({\mathcal{C}}\) in Fig. 1b and the prior knowledge that A is an ancestor of E (\(A \dashrightarrow E\)), we would like to find the corresponding necessary causal relations. Note that the true DAG is unknown. Suppose that the conditional independence tests state that A and E are independent given the separating set \(\left\{ {C,D} \right\}\), and that A and E are dependent given only C, only D, or \(\emptyset\), i.e., \(\left\{ {C,D} \right\} \in msep\left( {A,E} \right)\). As shown in Fig. 1c, based on Theorem 1, the necessary causal relations derived from the prior knowledge \(A \dashrightarrow E\) are that E is a descendant of both C and D (\(C \dashrightarrow E\) and \(D \dashrightarrow E\)).

Figure 1. An example to illustrate Theorem 1.
Example 1 shows that we can transform marginal prior causal knowledge into necessary causal relations. In addition to leveraging the conditional independence property of marginal causal relations, we also utilize the topological order property33 inherent in marginal causal relations. By applying the Causal Relation Scanning Algorithm (Algorithm 2) to the positive and negative marginal causal relations obtained through the conditional independence property, we derive a new topological order relation among the nodes.

Algorithm 2. Causal Relation Scanning Algorithm.

Marginal Prior Causal Knowledge PC (MPPC) algorithm

We now propose the Marginal Prior Causal Knowledge PC (MPPC) algorithm. The MPPC algorithm can be viewed as a modification of the PC-stable algorithm that incorporates marginal prior causal knowledge. The main modification is in the adjacency phase, encoded in the pseudo-code of Algorithm 3, lines 1–24. Originally, the adjacency phase consisted of creating a complete undirected graph and testing the conditional independence of every pair of vertices at a significance threshold \(\alpha\); the skeleton is estimated after this adjacency search. We modify the adjacency phase by adding a testing procedure for marginal causal relations based on the prior knowledge set \({\mathbf{K}} = \left\{ {K_{i} } \right\}_{i = 1}^{M}\) and Theorems 1 and 2, where each \(K_{i}\) is of the form \(X_{i} \dashrightarrow X_{j}\) or \(X_{i} \not\dashrightarrow X_{j}\), \(X_{i} ,X_{j} \in V\). That is, for every variable pair in K, we record the minimal separating set found by the conditional independence tests and add new marginal causal relations to K based on Theorems 1 and 2. After the adjacency phase, we orient the edges in the skeleton according to the causal relation set K under the acyclicity assumption: for any \(X_{i} \dashrightarrow X_{j}\) in K, if \(X_{i} - X_{j}\) is in the skeleton, then we orient \(X_{i} - X_{j}\) as \(X_{i} \to X_{j}\).

It is worth noting that Algorithm 3 applies not only to marginal prior causal knowledge, but also to a mixture of marginal and direct prior knowledge, since direct prior knowledge can be treated as marginal prior causal knowledge between adjacent vertices. If the prior knowledge set K includes a subset of direct prior knowledge, the final graph will contain directed arcs reflecting this direct prior knowledge.

For the v-structure phase, we find every unshielded triple \(X_{i} - X_{j} - X_{k}\) in the skeleton and test whether \(X_{j}\) is in a separating set of \(X_{i}\) and \(X_{k}\). If \(X_{j}\) does not appear in any separating set of \(X_{i}\) and \(X_{k}\) obtained in the adjacency phase, then \(X_{i} - X_{j} - X_{k}\) is oriented as a v-structure \(X_{i} \to X_{j} \leftarrow X_{k}\). Finally, we apply Meek rules in the orientation phase, as in the original PC-stable algorithm. These rules are applied repeatedly until no additional edges can be oriented, resulting in an MPDAG16,27.
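The post-adjacency orientation step can be sketched as follows (a schematic illustration under our own encoding, not the pseudo-code of Algorithm 3): every undirected skeleton edge whose endpoints appear as a positive relation in K is oriented accordingly, which is consistent with acyclicity; the remaining undirected edges are left for the v-structure and Meek-rule phases.

## Orient skeleton edges using the accumulated causal relation set K.
## 'skel' is a data frame of undirected edges (a, b); 'K' is a data frame of
## positive marginal relations (from, to). Returns the edges oriented by K and
## the edges that remain undirected.
orient_from_K <- function(skel, K) {
  directed <- data.frame(from = character(0), to = character(0))
  for (i in seq_len(nrow(K))) {
    hit <- which((skel$a == K$from[i] & skel$b == K$to[i]) |
                 (skel$b == K$from[i] & skel$a == K$to[i]))
    if (length(hit) == 1) {
      directed <- rbind(directed,
                        data.frame(from = K$from[i], to = K$to[i]))
      skel <- skel[-hit, ]
    }
  }
  list(directed = directed, undirected = skel)
}

## Toy usage with a hypothetical five-edge skeleton and three positive relations.
skel <- data.frame(a = c("A", "B", "B", "C", "D"),
                   b = c("B", "C", "D", "E", "E"))
K <- data.frame(from = c("A", "C", "D"), to = c("B", "E", "E"))
orient_from_K(skel, K)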
Example 2
Figure 2 gives an example of how MPPC works. Figure 2a shows the true DAG \({\mathcal{G}}\). Suppose that we have background knowledge stating that A is an ancestor of E (\(A \dashrightarrow E\)). The background knowledge and the complete undirected graph \({\mathcal{C}}\) are shown in Fig. 2b. Figure 2c–e show the adjacency step of the MPPC algorithm, where l is the number of vertices in the separating sets. From Fig. 2c to d, three edges are removed, and two causal relations (\(A \dashrightarrow B\) and \(B \dashrightarrow E\)) are added based on the prior knowledge \(A \dashrightarrow E\) and the conditional independence \(A \bot E|B\). From Fig. 2d to e, two additional causal relations are added in the same way based on Theorem 1. Figure 2f shows the maximal partially directed graph obtained by orienting the skeleton in Fig. 2e. In addition to \(\left\{ {A \to B,C \to E,D \to E} \right\}\), \(B - C\) and \(B - D\) are oriented as \(B \to C\) and \(B \to D\), respectively, since \(A \to B - C\) and \(A \to B - D\) are not v-structures according to the separating sets. Thus, Fig. 2f and the causal relations in Fig. 2e are the outputs of the MPPC algorithm.

Figure 2. An example to illustrate how MPPC (Algorithm 3) works.

Algorithm 3. MPPC algorithm.

Simulations

In the simulation experiments, we use the random.graph function in the R package bnlearn to randomly generate DAG structures and randomly choose \(p \in \left\{ {10,{ }20,{ }30} \right\}\) percent of the marginal causal relations according to the generated DAGs. For real-world networks, we use ECOLI70, MAGIC-IRRI and ARTH150 from the Bayesian Network Repository34. We then use linear structural equation models to generate synthetic data sets based on these structures. We compare the MPPC method with the PC-stable9 algorithm with only direct prior knowledge (denoted by PC(d_exp)), the PC-stable algorithm with marginal and direct prior knowledge (denoted by PC(all_exp)), the PC-PDAG algorithm proposed by Borboudakis and Tsamardinos12 (denoted by PC-PDAG), and the Max-Min Hill-Climbing algorithm35 (denoted by MMHC). We also use the MINOBSx24 method as a comparative approach in the instance analysis based on the ALARM network. Note that the PC-PDAG algorithm needs a PDAG as input; we therefore use the PDAG learned by the PC-stable algorithm without prior knowledge, as suggested by Borboudakis and Tsamardinos12. In all experiments, the PC-stable and MMHC algorithms are called from the R package bnlearn, and the PC-PDAG algorithm is called from the MATLAB function provided by Borboudakis and Tsamardinos12.

We generate random DAGs \({\mathcal{G}}\) with n nodes and mean in-degree d, where n is chosen from \(\left\{ {50,{ }100,{ }150,200,250} \right\}\) and d is chosen from \(\left\{ {1,{ }2,{ }3,{ }4} \right\}\). For a sampled \({\mathcal{G}}\left( {n,d} \right)\) graph, we draw an edge weight \(\beta_{ij}\) from a uniform distribution, \({\text{U}}\left( {0.8,1.6} \right)\), for each directed edge \(X_{i} \to X_{j}\) in \({\mathcal{G}}\). We use the following linear structural equation model to simulate data,

$$X_{j} = \mathop \sum \limits_{{X_{i} \in pa\left( {X_{j} } \right)}} \beta_{ij} X_{i} + \epsilon_{j} ,\;\;\;j = 1, \ldots ,n,$$

where \(\epsilon_{1} , \ldots ,\epsilon_{n}\) are independent \({\mathcal{N}}\left( {0,1} \right)\) noises36. We then generate \(N = 2000\) or \(N = 10{,}000\) samples for every simulated DAG. The marginal prior causal knowledge set K with respect to \({\mathcal{G}}\) was generated by randomly choosing \(p \in \left\{ {10,{ }20,{ }30} \right\}\) percent of the marginal causal relations according to \({\mathcal{G}}\), i.e., variable pairs \(\left( {X,Y} \right)\) such that X is an ancestor of Y (\(X \dashrightarrow Y\)) or X is not an ancestor of Y (\(X \not\dashrightarrow Y\)). For each setting, we generated 100 (\({\mathcal{G}},{\mathbf{K}}\)) pairs.
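A minimal sketch of this data-generating process under the stated assumptions (a random DAG from bnlearn's random.graph, edge weights from U(0.8, 1.6), standard normal noise); the helper name simulate_sem and the specific random.graph settings are our own illustrative choices, and in-degree control is omitted for brevity.

library(bnlearn)

set.seed(1)
nodes <- paste0("X", 1:50)
dag <- random.graph(nodes)   # random DAG structure over 50 nodes

## Simulate N samples from the linear SEM
##   X_j = sum_{X_i in pa(X_j)} beta_ij * X_i + eps_j,  eps_j ~ N(0, 1),
## with beta_ij drawn from U(0.8, 1.6).
simulate_sem <- function(dag, N) {
  ord <- node.ordering(dag)                      # topological order of the DAG
  X <- matrix(0, nrow = N, ncol = length(ord), dimnames = list(NULL, ord))
  for (v in ord) {
    X[, v] <- rnorm(N)                           # eps_j
    for (p in parents(dag, v)) {
      beta <- runif(1, 0.8, 1.6)                 # beta_ij
      X[, v] <- X[, v] + beta * X[, p]
    }
  }
  as.data.frame(X)
}

dat <- simulate_sem(dag, N = 2000)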
We use precision, recall, F1 score, Structural Hamming Distance (SHD) and the Constraint Satisfaction Rate22 (CSR) to evaluate and compare the performance of the different methods. For each (\({\mathcal{G}},{\mathbf{K}}\)) pair in the simulation and experimental studies, we computed the corresponding MPDAG and used it as the gold standard for comparison with the estimated network. Given that the outputs of the MMHC and MINOBSx algorithms are DAGs, we also reformulated the criterion for a True Positive: for an undirected edge in an MPDAG, an edge in the estimated network with either direction is considered a True Positive. Precision is defined as \({\text{Precision}} = \frac{{\text{True Positives}}}{{{\text{True Positives}} + {\text{False Positives}}}}\), recall is defined as \({\text{Recall}} = \frac{{\text{True Positives}}}{{{\text{True Positives}} + {\text{False Negatives}}}}\), and the F1 score is defined as \({\text{F}}1 = 2 \times \frac{{precision \times recall}}{{precision + recall}}\). The SHD between two PDAGs, as implemented in bnlearn35, measures the distance between the underlying true graph and the learned graph. The CPU time and the total number of statistical tests (number of calls to the conditional independence test) are used to evaluate algorithm efficiency.

Evaluation with real-world networks and dataset

In addition to the randomized networks, three large real-world Gaussian BNs from the bnlearn repository were also utilized: ECOLI7037, MAGIC-IRRI38 and ARTH15039. ECOLI70 is a network with 46 nodes and 70 arcs, representing the protein-coding genes of E. coli. MAGIC-IRRI is a large network with 64 nodes and 102 arcs, modeling multiple traits in plant genetics. The third network, ARTH150, is a large network with 107 nodes and 150 arcs, capturing gene expression and proteomics data of Arabidopsis thaliana. Each BN was used to generate data with a sample size of 2000. The marginal prior causal knowledge set K with respect to these networks was generated by randomly choosing \(p \in \left\{ {10,{ }20,{ }30} \right\}\) percent of the marginal causal relations. We generated 100 (observational data, knowledge) pairs from each of the three networks.

We use the dataset provided by Beinlich40 to compare the effectiveness and efficiency of the various structure learning methods, including the MINOBSx algorithm. ALARM is a medium-sized network with 37 nodes and 46 arcs. The marginal prior causal knowledge set K with respect to this network was generated by randomly choosing \(p \in \left\{ {5,{ }10,{ }20} \right\}\) percent of the marginal causal relations. To evaluate the performance of the different methods, we calculate the CSR, SHD and running time.

Ethics approval and patient consent statement

Ethical approval was not sought, because this study only involved public networks and data.
