Explainable drug repurposing via path-based knowledge graph completion

Knowledge graph drug repurposing background

Graphs are collections of objects (nodes) and the set of interactions (edges) between pairs of these objects30. Knowledge graphs (\(\mathscr {G}\)) are a particular type of multirelational graph in which the information is defined by a set of existing triples, each composed of a head node (h), a tail node (\(t^*\)) and a relation (r) that links them:$$\begin{aligned} (h,r,t^*) \in \mathscr {G} \end{aligned}$$
(1)
Drug repurposing on knowledge graphs can be seen as a task of link prediction, where we ask the graph which diseases a certain compound treats. We can understand the problem as a query that has to be solved by the graph. The query is composed of a “compound” c as the head, and the relation “treats”. The answer to this query is a disease d that can be treated with the compound, which is the tail of the triple.$$\begin{aligned} (c,treats,d^*) \in \mathscr {G} \end{aligned}$$
(2)
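To make the notation of Eqs. (1) and (2) concrete, the sketch below stores a toy knowledge graph as a set of triples and answers a (c, treats, ?) query by scanning them. It is a minimal, hypothetical example: the node names and triples are invented placeholders, and a real model would score unseen candidates rather than look up existing edges, which is what the probabilistic formulation below addresses.

```python
# Minimal sketch: a knowledge graph as a set of (head, relation, tail) triples
# and a "treats" query (c, treats, ?) answered by scanning those triples.
# All node names and triples are illustrative placeholders.

graph = {
    ("Epirubicin", "upregulates", "EGF"),
    ("EGF", "regulates", "BRAF"),
    ("BRAF", "is_associated_to", "Breast cancer"),
    ("Epirubicin", "treats", "Breast cancer"),
}

def answer_query(compound, graph):
    """Return the diseases d* such that (compound, treats, d*) is in the graph."""
    return {t for (h, r, t) in graph if h == compound and r == "treats"}

print(answer_query("Epirubicin", graph))   # {'Breast cancer'}
```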
The problem can be formulated in terms of the probability that the compound succeeds against the disease d, where the objective is that the answer equals the tail of the triple (\(d=d^*\)), conditioned on the existing graph:$$\begin{aligned} p(c,treats,d=d^*\mid \mathscr {G})= p(d\mid \mathscr {G},(c,treats))= p(d\mid \mathscr {G},q) \end{aligned}$$
(3)
where \(q=(c,treats)\) is the query.

Path-based drug repurposing

Path-based drug repurposing leverages the connectivity of the graph to predict the diseases that can be treated with a certain drug and also to provide a biological explanation of the prediction. These methods provide paths, which are sequences of nodes and relations that start at the head node of the query, in this case the compound, and follow different relations and nodes to arrive at the candidate disease. Another important concept is the metapath, a sequence of node types and relation types. For example, the path (Epirubicin \({\mathop {\longrightarrow }\limits ^{\text { upregulates } }}\) Gene EGF \({\mathop {\longrightarrow }\limits ^{\text { regulates } }}\) Gene BRAF \({\mathop {\longrightarrow }\limits ^{\text { is associated to } }}\) Breast cancer) is a particularisation of the metapath (Compound \({\mathop {\longrightarrow }\limits ^{\text { upregulates } }}\) Gene \({\mathop {\longrightarrow }\limits ^{\text { regulates } }}\) Gene \({\mathop {\longrightarrow }\limits ^{\text { is associated to } }}\) Disease). Metapaths are also called rules in certain contexts.

To generate paths, a strategy is needed. This strategy can be represented by a policy \(\mu\), which indicates the node that should follow the current node on the path. Paths can also be obtained from rules z (metapaths) that, when applied to the graph, generate paths. Either strategy conditions the probability of a candidate disease as:$$\begin{aligned} p(d\mid \mathscr {G},q,\mu ) \end{aligned}$$
(4)
in the case of policies, and$$\begin{aligned} p(d\mid \mathscr {G},q,z) \end{aligned}$$
(5)
in the case of rules.

The main reason to use paths is that predictions can be interpreted by healthcare professionals, which is necessary to validate candidate diseases for further research. Paths provide information about the side effects, targets, or anatomies involved in the biological mechanism.

We propose an architecture that generalises several methods for path-based graph completion, and we show the relation between them. We also present a mathematical formulation in Supplementary Data that unifies these models so that they can be understood as particularisations of the architecture described in this work.

The objective of the model is to predict diseases that can be treated by a certain compound, using paths to connect the compound to the disease. Figure 1 shows the training process, where the model is optimised to generate high-quality paths between the heads and tails of the queries. In the drug repurposing case, given a compound, the model generates paths that end in a set of candidate diseases. A score function evaluates the quality of the proposed diseases. The model learns to give high scores to diseases that are known to be treated with the compound. Therefore, other diseases that obtain high scores are good candidates for repurposing.

Figure 1. General architecture to train path-based drug repurposing models. The information is presented in the form of a knowledge graph composed of triples (in yellow). The input of the model is the graph and the query that has to be solved. The model (in blue) generates paths between compounds and different diseases. A score is computed to assess the quality of the paths and the diseases that they propose. The model is optimised (green block) so that the ground-truth diseases have the highest score.

Taking into account the concepts of path generator and score function, the probability of the candidate for repurposing can be parameterised by \(\theta\) and \(\omega\):$$\begin{aligned} p(d \mid \mathscr {G}, q) \rightarrow p_{\omega , \theta }(d \mid \mathscr {G}, q) \end{aligned}$$
(6)
where the path generation process is parameterised with \(\theta\) and the score function with \(\omega\).

This expression can be decomposed into two processes: path generation and reasoning prediction. The objective of the path generator is to obtain the paths from compounds to diseases, and the reasoning predictor uses those paths to answer queries. As explained before, paths can be generated using policies or rules:$$\begin{aligned}&p_{\omega , \theta }(d \mid \mathscr {G}, q) = \sum _{\mu } p_\omega (d \mid \mathscr {G}, q, \mu )p_{\theta }(\mu \mid \mathscr {G},q) \end{aligned}$$
(7)
$$\begin{aligned}&p_{\omega , \theta }(d \mid \mathscr {G}, q)=\sum _{z} p_\omega (d \mid \mathscr {G}, q, z)p_{\theta }(z \mid \mathscr {G},q) \end{aligned}$$
(8)
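To illustrate the rule-based decomposition in Eq. (8), the sketch below grounds two rules (metapaths) on a toy graph to reach candidate diseases and then marginalises the rule-conditioned probabilities over the rule distribution. The graph, node types, rules, rule probabilities, and the uniform stand-in for the reasoning predictor are all invented for illustration; they are not part of any of the models discussed here.

```python
# Sketch of Eq. (8): p(d | G, q) = sum_z p(d | G, q, z) * p(z | G, q), where each
# rule z is a metapath that is grounded on a toy graph. All values are made up.

node_type = {
    "Epirubicin": "Compound", "EGF": "Gene", "BRAF": "Gene", "TP53": "Gene",
    "Breast cancer": "Disease", "Melanoma": "Disease",
}
edges = {  # (head, relation, tail)
    ("Epirubicin", "upregulates", "EGF"),
    ("EGF", "regulates", "BRAF"),
    ("BRAF", "is_associated_to", "Breast cancer"),
    ("Epirubicin", "binds", "TP53"),
    ("TP53", "is_associated_to", "Breast cancer"),
    ("TP53", "is_associated_to", "Melanoma"),
}

# Rules z as metapaths: a sequence of node types and a sequence of relation labels.
rules = {
    "CuG-GrG-GaD": (["Compound", "Gene", "Gene", "Disease"],
                    ["upregulates", "regulates", "is_associated_to"]),
    "CbG-GaD":     (["Compound", "Gene", "Disease"],
                    ["binds", "is_associated_to"]),
}
p_rule = {"CuG-GrG-GaD": 0.7, "CbG-GaD": 0.3}   # stand-in for p_theta(z | G, q)

def ground_rule(start, rule):
    """Ground a metapath on the graph, returning the tail nodes it reaches."""
    types, relations = rule
    frontier = [start] if node_type[start] == types[0] else []
    for step, relation in enumerate(relations):
        frontier = [t for node in frontier
                    for (h, r, t) in edges
                    if h == node and r == relation and node_type[t] == types[step + 1]]
    return frontier

# Stand-in for the reasoning predictor p_omega(d | G, q, z): uniform over the
# diseases reached by each rule.
scores = {}
for name, rule in rules.items():
    reached = ground_rule("Epirubicin", rule)
    for disease in reached:
        scores[disease] = scores.get(disease, 0.0) + p_rule[name] / len(reached)

print(scores)   # approximately {'Breast cancer': 0.85, 'Melanoma': 0.15}
```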
Several models fit this description with minor modifications, as we will show in the next sections. There are different approaches: some models are based on rules, while others use reinforcement learning techniques. These approaches fit into the general model that we propose because they are based on similar ideas. Further details on this formulation can be found in Supplementary Data.

Fixed path generator

There are several ways to generate paths, and the simplest is to use a fixed path generator. A fixed generator can produce paths following different principles, for example, random walks. This is the approach followed in AnyBURL31, a bottom-up technique for efficiently learning logical rules from large knowledge graphs, inspired by classic bottom-up rule learning approaches. AnyBURL learns as many rules as possible by sampling random paths over a predetermined time interval. Each rule is then evaluated according to the rate of correct positive predictions among all inferred predictions to obtain the confidence of the rule. Particularising the rules on the graph for a given query generates paths between the compound and the candidate diseases. This is done in the path generator block in Fig. 1.

Several rules can generate the same candidate, so the scores of the individual rules must be aggregated to obtain the final score of a candidate. This corresponds to the prediction block in Fig. 1. There are three different approaches to determine the score of each candidate: Maximum score and Noisy-OR, originally proposed with AnyBURL in32, and Non-redundant Noisy-OR, proposed in a framework called SAFRAN33.

The optimisation in this case is very simple, as the only task is to keep the rules whose confidence is high enough to provide good predictions.

Figure 2. Description of AnyBURL based on our architecture. In this case, the input of the model is the whole graph (training set) including the ground truth. Paths are sampled based on random walks and used to generate rules using a bottom-up approach. The confidence of each rule is then computed, so that only rules with a confidence higher than a threshold are used for prediction. Rules are applied to the graph to obtain predictions, which are ranked using the confidence of the rules.
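As a rough illustration of how per-rule confidences can be aggregated into a candidate score, the sketch below implements the two aggregations originally proposed with AnyBURL, maximum score and Noisy-OR. The confidence values are invented, and the code is a simplified reading of the aggregation formulas, not AnyBURL's actual implementation.

```python
# Sketch: aggregating the confidences of all rules that predict the same
# candidate disease. Confidence values below are illustrative only.

from functools import reduce

# Confidences of the rules that fired for one candidate disease.
rule_confidences = [0.9, 0.6, 0.3]

# Maximum aggregation: the candidate is scored by its single best rule.
max_score = max(rule_confidences)

# Noisy-OR aggregation: the candidate is wrong only if every rule is wrong,
# assuming (naively) that the rules err independently.
noisy_or = 1.0 - reduce(lambda acc, c: acc * (1.0 - c), rule_confidences, 1.0)

print(f"max: {max_score:.2f}, noisy-OR: {noisy_or:.3f}")  # max: 0.90, noisy-OR: 0.972
```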
Reinforcement learning based path generator

The next step is to use path generators that can be updated from data to learn the best way to traverse the graph to make predictions. Some methods use reinforcement learning and model the trajectory on the graph as a Markov decision process. Starting from the head node, the agent learns to walk to the tail node, choosing intermediate nodes step by step and taking into account the path history. Paths are generated based on the policy \(\mu\), which is the strategy for traversing the graph to make good predictions.

This approach is followed in MINERVA28 and its variants29,34. In the drug repurposing context, the environment is the graph, and the possible actions from a certain node are all the links the agent can choose to reach the next node. The objective of the agent is to move from a compound node to a disease node that is linked through the relation “treats”. The state includes all nodes and relations traversed up to the current node, so the next action depends on the whole path. Moreover, it is necessary to define a reward function \(R\left( \pi ^n\mid q\right)\) that indicates whether the path \(\pi ^n\) provides good predictions or not.

In this case, as shown in Fig. 3, the main element is a policy generator that is trained to maximise the long-term reward. Paths are sampled from the generator to connect the compound to the candidate diseases. The prediction and the calculation of the score function are included in this block, because the score is directly related to the policy followed to generate the path, as shown in the Supplementary Data.

Figure 3. Description of MINERVA based on our architecture. The core of the algorithm is the policy generator, which is trained to obtain the best policy from the reward using policy search. Paths are sampled from the generator to obtain candidate diseases, which are ranked according to the path that proposes them.

The long-term reward over the policy is defined as:$$\begin{aligned} \mathbb {E}_{\pi ^n \sim \mu _\theta }\left[ R\left( \pi ^n\mid q\right) \right] \end{aligned}$$
(9)
where the policy generator is parameterised with \(\theta\) and \(\pi ^n\) represents a path sampled from the policy \(\mu _\theta\). After some manipulation, detailed in the Supplementary Data, the function that needs to be optimised is the following:$$\begin{aligned} \frac{1}{N}\sum _{n=1}^N R\left( \pi ^n\mid q\right) \log p_\theta (\pi ^n\mid \mathscr {G}, q) \end{aligned}$$
(10)
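A small sketch of the Monte-Carlo estimate in Eq. (10): sampled paths are weighted by their reward, and the reward-weighted log-probabilities are averaged. The paths, rewards, and path probabilities are invented placeholders; in MINERVA they would come from rolling out the learned policy on the graph.

```python
# Sketch of Eq. (10): (1/N) * sum_n R(pi_n | q) * log p_theta(pi_n | G, q).
# Rewards and path probabilities are illustrative stand-ins for policy rollouts.

import math

# Each entry: (reward of the sampled path, probability assigned to it by the policy)
sampled_paths = [
    (1.0, 0.20),   # path reached a disease actually treated by the compound
    (0.0, 0.10),   # path ended on an unrelated node
    (1.0, 0.05),
]

N = len(sampled_paths)
objective = sum(r * math.log(p) for r, p in sampled_paths) / N
print(f"policy objective estimate: {objective:.3f}")
```

Gradient ascent on this quantity with respect to the policy parameters increases the probability of the paths that received a reward.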
which averages over the paths \(\pi ^n\) sampled from the policy \(\mu _\theta\), weighted by the reward \(R\left( \pi ^n\mid q\right)\) of each path. N is the number of paths.

PoLo29 is a modification of MINERVA that focusses on drug repurposing. The model includes a term in the reward related to how similar the path is to a set of manually crafted metapaths considered reliable for repurposing. This model therefore relies on expert knowledge to improve the results of MINERVA.

Path generator using variational inference

Reasoning based on reinforcement learning has the problem that the action space is large and the reward is sparse, as few paths lead to the correct answer and a positive reward. For that reason, some models use rules as latent variables to make predictions. These rules (z) allow for the interpretability of the results and support the predictions. RNNLogic35 follows this approach.

The model includes a rule generator and a reasoning predictor that applies the rules to propose candidate answers for the query, as shown in Fig. 4. The rule generator returns a set of logic rules conditioned on the query, which are given to the reasoning predictor for query answering. The reasoning predictor computes the likelihood of the answer conditioned on the logic rules and the existing knowledge graph \(\mathscr {G}\), \(p_\omega (d\mid \mathscr {G}, q, z)\). At each training iteration, a few logic rules are sampled from the generator and fed into the reasoning predictor, which tries these rules for prediction. The distribution \(p(d\mid \mathscr {G},q)\) can be calculated according to Eq. (8) as:$$\begin{aligned} p_{\omega , \theta }(d \mid \mathscr {G}, q)=\sum _{z} p_\omega (d \mid \mathscr {G}, q, z) p_\theta (z \mid q) \end{aligned}$$
(11)
which is the objective function that has to be optimised by the whole model. This task is divided, as the generator and the predictor use different optimisation algorithms, but both contribute to a common goal.

Figure 4. Description of RNNLogic based on our architecture. In addition to the graph and the query, there is another input, which is a set of prior rules to initialise the generator. The model consists of a rule generator and a reasoning predictor. A set of rules is sampled and used for prediction. During training, the predictor is updated using maximum likelihood estimation (MLE). Combining information from the generation and the prediction, a score \(\mathscr {H}(\hat{z}_i)\) is computed for each rule, and it is used during the training of the generator, which is based on expectation maximisation.

The generator \(p_\theta (z\mid q)\) is updated using expectation maximisation (EM), and the optimisation of the predictor \(p_\omega (d \mid \mathscr {G}, q, z)\) is based on maximum likelihood principles (MLE). Given the graph and the query, a set of rules \(\hat{z} \sim p_\theta (z \mid q)\) is sampled and then used for prediction. Based on the results of the predictions, a score \(\mathscr {H}(\hat{z}_i)\) is calculated for each rule. It includes information from both the generation and prediction processes, so it is possible to identify the high-quality rules for prediction \(\hat{z}_I\).

To optimise the generator \(p_\theta (z \mid q)\), a set of high-quality rules \(\hat{z}_I\) is selected according to \(\mathscr {H}(\hat{z}_I)\). For each data instance, the set of rules \(\hat{z}_I\) is treated as part of the training data, and the generator is updated by maximising the logarithmic likelihood of \(\hat{z}_I\). Moreover, \(\mathscr {H}(\hat{z}_I)\) carries information on the quality of the rules, so it can also be included in the generator optimisation in the form of a weight for each rule:$$\begin{aligned} \mathscr {H}(\hat{z}_I) \log p_\theta \left( \hat{z}_I \mid q\right) =\sum _{\hat{z}_i \in \hat{z}_I} \mathscr {H}(\hat{z}_i) \log p_\theta \left( \hat{z}_i \mid q\right) \end{aligned}$$
(12)
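The sketch below spells out the weighted objective in Eq. (12): each selected high-quality rule contributes its log-probability under the generator, weighted by its score \(\mathscr {H}\). The rules, scores, and generator probabilities are illustrative numbers, not outputs of RNNLogic.

```python
# Sketch of Eq. (12): sum over selected rules of H(z_i) * log p_theta(z_i | q).
# Scores and generator probabilities below are made up for illustration.

import math

selected_rules = {             # high-quality rules z_I with their scores H(z_i)
    "CuG-GrG-GaD": 2.0,
    "CbG-GaD": 1.0,
}

p_theta = {                    # current generator probabilities p_theta(z_i | q)
    "CuG-GrG-GaD": 0.25,
    "CbG-GaD": 0.10,
}

objective = sum(score * math.log(p_theta[rule])
                for rule, score in selected_rules.items())
print(f"weighted log-likelihood of selected rules: {objective:.3f}")
```

Maximising this quantity with respect to the generator parameters increases the probability of the rules that the predictor found most useful.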
The function to be maximised is therefore the average of the log-probabilities of the selected rules, weighted by their scores.

Rules generate paths that end in candidate diseases, which are ranked according to a score computed from trainable parameters related to the importance of rules and paths. The score measures the reliability of the predictions and can be used in the drug repurposing case study to evaluate and interpret candidates for repurposing.

XG4Repo

We have developed XG4Repo, a ready-to-use framework for computational drug repurposing using knowledge graphs. This framework is capable of predicting candidate diseases for repurposing and providing informative explanations to help a human expert in the research of new treatments.

Our proposal is a particularisation of the described architecture that combines state-of-the-art methods for graph completion and optimises them for drug repurposing. Figure 5 shows the architecture of XG4Repo. The core of the framework is RNNLogic, because it provides informative rules and achieves good results. To initialise the generator, we use the rule miner in AnyBURL, as the generated rules are good for prediction tasks. These rules need to be processed to be readable by the generator and filtered to remove those that are not general enough.

We have adapted the graph completion task to repurposing. In conventional graph completion, the model is trained to predict queries that include every type of relation. In drug repurposing, we are only interested in the relation “treats”, so the model is specifically optimised to find diseases that can be treated by compounds. The training set only includes “compound treats disease” triples, but the whole graph can be traversed to find paths that include nodes and relations of any kind. Reducing the training set reduces computational time and resources, which is very attractive in the case of drug repurposing. As shown in Fig. 5, “compound treats disease” (CtD) triples are differentiated from the rest of the graph.

Figure 5. Description of the XG4Repo architecture. The first step of the process is to generate a set of prior rules using the AnyBURL rule miner. These rules are processed and used as priors in the generator. The model is trained using only the “compound treats disease” triples, so the computational complexity is reduced. Once the predictions are made, the rules and corresponding scores are stored in natural language, so they can be easily understood. Moreover, our framework can generate Cypher queries to obtain the paths in Hetionet given the rules. This adds interpretability to the predictions without adding extra storage requirements.

A key element in our design is a module for the interpretability of the results, where we can look up the predictions, the rules that support them, how important these rules are, and the paths that connect the compound to the disease. This is useful for experts interested in the repurposing task, as they get predictions and explanations in natural language. Moreover, the code generates Cypher queries to obtain the paths generated by any specific rule on Hetionet, which is more efficient than storing every path generated by the rules. The code of XG4Repo is ready to use and available at https://github.com/AnaJimBej/XG4Repo.

An important aspect of our interpretability-based contribution is that the explainability module is integrated with the prediction process. XG4Repo is not just a model that makes predictions; it is a framework that, starting from prior knowledge and a candidate drug, predicts the diseases the drug can treat and explains in natural language why it would work. This makes it usable by end users who want to initiate drug repurposing research.
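As an illustration of how a learned rule (metapath) can be turned into a Cypher query over a Hetionet-style Neo4j database, the sketch below assembles a MATCH pattern from a list of node labels and relationship types. The label and relationship-type names, as well as the helper function itself, are hypothetical; they do not necessarily match Hetionet's schema or XG4Repo's actual query builder.

```python
# Sketch: turning a rule (metapath) into a Cypher query string.
# Node labels and relationship types are assumed, Hetionet-like placeholders.

def rule_to_cypher(node_labels, rel_types, compound_name):
    """Build a MATCH query returning all paths that instantiate the rule
    and start at the given compound."""
    pattern = f"(n0:{node_labels[0]} {{name: $compound}})"
    for i, rel in enumerate(rel_types):
        pattern += f"-[:{rel}]->(n{i + 1}:{node_labels[i + 1]})"
    return f"MATCH p = {pattern}\nRETURN p LIMIT 25", {"compound": compound_name}

query, params = rule_to_cypher(
    node_labels=["Compound", "Gene", "Gene", "Disease"],
    rel_types=["UPREGULATES_CuG", "REGULATES_GrG", "ASSOCIATES_DaG"],
    compound_name="Epirubicin",
)
print(query)   # prints a MATCH ... RETURN query for the Epirubicin example above
```

The query string and its parameter dictionary could then be passed to a Neo4j driver session to retrieve the concrete paths that a given rule produces.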
Data

Hetionet18 was developed within the Rephetio project with the aim of creating a knowledge graph suitable for different tasks related to drug repurposing, and it is publicly available. This database was chosen because it is public and can therefore be used for comparison with other state-of-the-art methods. In addition, it has been used for similar tasks related to drug repurposing.

One notable aspect of Hetionet is its emphasis on incorporating multiple types of relations, such as drug-target interactions, gene-disease associations, and pathway connections. This comprehensive approach enables researchers to explore and prioritise potential drug repurposing opportunities, as well as gain insight into the underlying mechanisms of diseases.

Among the 2,250,197 triplets that make up the knowledge graph, only 755 correspond to “compound treats disease”. An 80-10-10% split was applied to divide these triplets into training, test, and validation sets, respectively. Of the 755 “compound treats disease” triplets, 598 are used to train the model, 82 for testing, and the remaining 75 for validation.

The model is specifically trained to predict “compound treats disease” relations. The rule generator learns the relations of every triplet in the graph. The rules generated to make the predictions include relations of all types, so the paths generated can include any of the nodes or relations that make up the graph.

The graph has been augmented with inverse relations, which, for each triplet, go from tail to head \((t, r^{-1}, h)\). This adds flexibility to the rules and allows more connections between the nodes.

Evaluation metrics

Once a model has been trained, it has to be evaluated. The output of the model is a score for each possible answer in the test set. During testing, we check that the disease of the triple being evaluated (\(d^*\)) receives a high score. Candidate diseases are ordered by decreasing score, and the rank is defined as the position of the ground-truth disease (\(d^*\)) in the list of candidates.

Based on the rank, several metrics are computed, which aggregate the performance of the model into a single number36. In this work, we have evaluated the models using mean reciprocal rank (MRR), Hits@1, Hits@3 and Hits@10.

The metrics calculated in this research are filtered, as described in36. Moreover, binomial proportion confidence intervals are applied to compare the performance of the models, as the test set does not have enough samples to use the Gaussian approximation. These intervals provide an interval estimate of a success probability p when only the number of experiments n and the number of successes \(n_s\) are known.
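As a simple illustration of the metrics described above, the sketch below computes MRR and Hits@k from a list of ranks, together with a Clopper-Pearson binomial confidence interval for a Hits@k proportion. The ranks are invented, and the choice of the Clopper-Pearson interval is an assumption made for illustration; the text above only states that binomial proportion confidence intervals are used.

```python
# Sketch: rank-based metrics (MRR, Hits@k) and a Clopper-Pearson binomial
# confidence interval for a Hits@k proportion. Ranks are illustrative only.

from scipy.stats import beta

ranks = [1, 3, 2, 15, 1, 7, 4, 1, 26, 2]          # rank of d* for each test triple

mrr = sum(1.0 / r for r in ranks) / len(ranks)
hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in (1, 3, 10)}

def clopper_pearson(n_s, n, alpha=0.05):
    """Exact binomial proportion confidence interval for n_s successes in n trials."""
    lower = 0.0 if n_s == 0 else beta.ppf(alpha / 2, n_s, n - n_s + 1)
    upper = 1.0 if n_s == n else beta.ppf(1 - alpha / 2, n_s + 1, n - n_s)
    return lower, upper

n = len(ranks)
n_s = sum(r <= 10 for r in ranks)                 # successes for Hits@10
print(f"MRR = {mrr:.3f}, Hits@1/3/10 = {hits[1]:.2f}/{hits[3]:.2f}/{hits[10]:.2f}")
print("95% CI for Hits@10:", clopper_pearson(n_s, n))
```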
