CMAGN: circRNA–miRNA association prediction based on graph attention auto-encoder and network consistency projection

Dataset

The original CMAs were retrieved from the circBank database (http://www.circbank.cn/) [17], a comprehensive database of annotated human circRNAs collected from different sources. The file named "circbank_miRNA_all_v1.zip" comprises a large number of CMAs, each consisting of one circRNA, one miRNA, and one score. Following the same procedure as previous studies [21,22,23, 25, 28,29,30], associations with scores no less than 1000 were selected, yielding 9589 CMAs covering 2115 distinct circRNAs and 821 distinct miRNAs; this dataset is the same as that used in those studies.

Efficient computational methods can be designed on the basis of these known CMAs. Formally, the associations can be represented by a circRNA–miRNA adjacency matrix \(AS \in \Re^{m \times n}\), where m is the number of miRNAs and n is the number of circRNAs (m = 821 and n = 2115 in this study). If the i-th miRNA and the j-th circRNA constituted an association, \(AS(i,j)\) was set to 1; otherwise, it was set to 0. A zero in \(AS\) does not necessarily indicate the absence of an association between the corresponding circRNA and miRNA; rather, it means that the association has not yet been observed or confirmed. The goal of this study was to design an efficient computational method to pinpoint latent CMAs in \(AS\). In addition to the adjacency matrix, a bipartite graph is a common way to represent CMAs. This graph has two node groups, one containing the circRNA nodes and the other the miRNA nodes. Each edge indicates a CMA; that is, a circRNA node and a miRNA node are connected if and only if they constitute a known association. In the following text, this bipartite graph is denoted by NCM and called the CMA network.

Another CMA dataset, collected by Wang et al. [25], was also employed in the present work; it can be obtained at https://github.com/1axin/KGDCMI. This dataset contains 9905 CMAs involving 2346 circRNAs and 962 miRNAs and has also been used in several CMA prediction studies [23, 27,28,29,30]. Here, it was adopted to further test the performance of CMAGN. To clearly distinguish the two datasets, this dataset was labeled CMA-9905, and the above-mentioned dataset was labeled CMA-9589.

Similarity network construction

The similarity between circRNAs or between miRNAs is essential information for constructing prediction methods, because similar circRNAs tend to be associated with similar miRNAs, and vice versa. The most basic similarity between circRNAs or miRNAs can be discerned from their sequences. Here, the Levenshtein distance was employed to measure the difference between two sequences and, in turn, the similarity between them.

CircRNA and miRNA sequences were downloaded from three databases: CircFunBase (https://bis.zju.edu.cn/CircFunBase/) [33], CircBase (http://www.circbase.org/) [18], and miRBase (http://www.mirbase.org/) [19]. The Levenshtein distance is an edit distance commonly used to measure the dissimilarity between two strings; it is defined as the minimum number of edit operations required to transform the source string into the target string, allowing only three single-character operations: insertion, deletion, and substitution. In this study, the circRNA and miRNA sequences were treated as strings when computing the Levenshtein distance between two circRNAs or two miRNAs.
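The computation can be illustrated with a short dynamic-programming routine (a generic sketch, not the authors' implementation):

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to transform `source` into `target`."""
    m, n = len(source), len(target)
    prev = list(range(n + 1))  # distances from the empty prefix of source
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]
```

For example, `levenshtein("ACGU", "AGU")` returns 1, because deleting the single character "C" transforms the first sequence into the second.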
For formulation, the distance between two circRNAs ci and cj was denoted by \(D_{circRNA} (c_{i} ,c_{j} )\), and that between two miRNAs mi and mj was denoted by \(D_{miRNA} (m_{i} ,m_{j} )\). The sequence similarity of two circRNAs or two miRNAs can then be computed as follows:

$$CSS(c_{i} ,c_{j} ) = \frac{l(c_{i} ) + l(c_{j} ) - D_{circRNA} (c_{i} ,c_{j} )}{l(c_{i} ) + l(c_{j} )}$$
(1)
$$MSS(m_{i} ,m_{j} ) = \frac{l(m_{i} ) + l(m_{j} ) - D_{miRNA} (m_{i} ,m_{j} )}{l(m_{i} ) + l(m_{j} )}$$
(2)
where \(CSS \in \Re^{n \times n}\) is the sequence similarity matrix of circRNAs; \(MSS \in \Re^{m \times m}\) is the sequence similarity matrix of miRNAs; l(ci) and l(cj) are the sequence lengths of circRNAs ci and cj, respectively; and l(mi) and l(mj) are the sequence lengths of miRNAs mi and mj, respectively.

The similarity networks for circRNAs and miRNAs were constructed on the basis of CSS and MSS, respectively, with the 2115 circRNAs and 821 miRNAs defined as nodes. The edges of the two networks were determined by the corresponding elements of CSS and MSS, and a threshold T was applied to both matrices to discard linkages with weak similarities. In CSS, if the element at the i-th row and j-th column was larger than the threshold, the i-th and j-th circRNAs were connected by an edge. The resulting similarity network for circRNAs is denoted by NcircRNA; the network for miRNAs, denoted by NmiRNA, was constructed in the same way.
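A minimal sketch of how Eqs. 1 and 2 and the thresholding step might be implemented (the threshold value 0.5 is a placeholder, and `levenshtein` is the routine sketched above):

```python
import numpy as np

def sequence_similarity_matrix(seqs):
    """Pairwise sequence similarity following Eqs. 1-2:
    (l_i + l_j - D(i, j)) / (l_i + l_j)."""
    n = len(seqs)
    S = np.eye(n)  # self-similarity is 1
    for i in range(n):
        for j in range(i + 1, n):
            total = len(seqs[i]) + len(seqs[j])
            S[i, j] = S[j, i] = (total - levenshtein(seqs[i], seqs[j])) / total
    return S

def similarity_network(S, T=0.5):
    """Unweighted similarity network: connect two nodes whenever their
    similarity exceeds the threshold T; self-loops are excluded."""
    A = (S > T).astype(int)
    np.fill_diagonal(A, 0)
    return A
```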
Representations of circRNAs and miRNAs

Correct and complete representations of circRNAs and miRNAs are essential for building efficient models to predict CMAs. Here, each representation contained two parts. The first part was derived from the adjacency matrix \(AS\), in which each row is the raw representation of one miRNA and each column is the raw representation of one circRNA. These raw representations were refined by GATE [31] to obtain high-level features of circRNAs and miRNAs. The second part was obtained from the CMA network NCM via a popular network embedding algorithm, node2vec [32]. The two parts capture different types of information: the first focuses on the local relationships of circRNAs or miRNAs, because the raw representations contain only information about their direct neighbors, whereas the second considers the entire topological structure of the CMA network NCM, including the global relationships of circRNAs or miRNAs. Combining the two parts therefore yields a more complete representation of circRNAs and miRNAs.

Representations of circRNAs and miRNAs yielded by GATE

The raw feature vectors of circRNAs derived from the adjacency matrix \(AS\) are rudimentary and can be improved by advanced computational methods. The similarity network NcircRNA built in Sect. "Similarity network construction" indicates the relationships between circRNAs, and fusing these two sources of information gives access to high-level features of circRNAs; the same holds for miRNAs. In this study, GATE [31] was employed to reconstruct the feature vectors of circRNAs and miRNAs, using the raw feature vectors derived from \(AS\) and the circRNA (miRNA) similarity network as input. The reconstructed feature vectors retain the essential information of the raw feature vectors together with the topological structure of the similarity networks.

GATE is an unsupervised learning method that learns new node representations from the topological structure of a network and the raw representations of its nodes. It involves two procedures, an encoder and a decoder, which are briefly described as follows.

Encoder. The encoder comprises several layers. The first layer takes the raw features of nodes as input, and each subsequent layer uses the output of the previous layer as input. For clarity, the raw feature vector of the i-th node is denoted by \(F_{i} = h_{i}^{(0)}\), and the output of the k-th layer for the i-th node is denoted by \(h_{i}^{(k)}\). The main idea of GATE is to generate a new feature vector for a given node by aggregating the feature vectors of its neighbors, assigning different weights to different neighbors. To determine these weights, GATE first calculates the correlation between any two nodes as

$$b_{ij}^{(k)} = Sigmoid\left( V_{s}^{(k)T} \sigma \left( W^{(k)} h_{i}^{(k - 1)} \right) + V_{r}^{(k)T} \sigma \left( W^{(k)} h_{j}^{(k - 1)} \right) \right)$$
(3)
where \(W^{(k)} \in \Re^{d(k) \times d(k - 1)}\) is a trainable weight matrix and \(V_{s}^{(k)} \in \Re^{d(k)}\) and \(V_{r}^{(k)} \in \Re^{d(k)}\) are trainable parameter vectors at the k-th layer; \(d(k)\) is the dimension of the output feature vectors of the k-th layer; and \(\sigma\) is an activation function. The above correlations are then normalized by the softmax function:

$$a_{ij}^{(k)} = Softmax(b_{ij}^{(k)} ) = \frac{\exp (b_{ij}^{(k)} )}{\sum\nolimits_{l \in N(i)} \exp (b_{il}^{(k)} )}$$
(4)
where \(N(i)\) stands for the closed neighbor set of the i-th node. The outcomes of Eq. 4 serve as the neighbors' weights, and the aggregation for the i-th node is conducted as

$$h_{i}^{(k)} = \sum\limits_{l \in N(i)} a_{il}^{(k)} \sigma (W^{(k)} h_{l}^{(k - 1)} )$$

(5)

where \(W^{(k)}\) is the same matrix as in Eq. 3. If the encoder contains L layers, then \(h_{i}^{(L)} = h_{i}\) is the encoder output for the i-th node. In this study, the feature vectors of circRNAs (miRNAs) derived from the adjacency matrix \(AS\) were used as the raw feature vectors of nodes and fed into the encoder of GATE. The output of the last layer was taken as the new feature vectors of circRNAs (miRNAs) and used in the subsequent model construction step. The reconstructed feature vectors of the i-th circRNA and miRNA are denoted by \(C_{i}^{g}\) and \(M_{i}^{g}\), respectively.
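One encoder layer can be sketched in NumPy as follows (a simplified illustration of Eqs. 3-5 rather than the reference GATE implementation; `neighbors[i]` is assumed to hold the closed neighbor set of node i):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_encoder_layer(H, W, v_s, v_r, neighbors, act=np.tanh):
    """H: (N, d_prev) previous-layer outputs; W: (d_k, d_prev) weights;
    v_s, v_r: (d_k,) attention vectors; act: the activation sigma."""
    Z = act(H @ W.T)  # sigma(W h) for every node, shape (N, d_k)
    H_new = np.zeros_like(Z)
    for i in range(H.shape[0]):
        nbrs = np.asarray(neighbors[i])
        b = sigmoid(Z[i] @ v_s + Z[nbrs] @ v_r)  # Eq. 3: correlations b_il
        a = np.exp(b - b.max())
        a /= a.sum()                             # Eq. 4: softmax weights a_il
        H_new[i] = a @ Z[nbrs]                   # Eq. 5: weighted aggregation
    return H_new
```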
Decoder. As an unsupervised learning method, GATE includes a decoder that recovers the raw features of the nodes, thereby testing the quality of the reconstructed feature vectors yielded by the encoder. The decoder consists of the same number of layers as the encoder. The encoder output serves as the decoder input, denoted by \(\tilde{h}_{i}^{(L)}\), and the k-th decoder layer maps \(\tilde{h}_{i}^{(k)}\) to \(\tilde{h}_{i}^{(k - 1)}\), which is computed using the following equations:

$$\tilde{b}_{ij}^{(k)} = Sigmoid\left( \tilde{V}_{s}^{(k)T} \sigma \left( \tilde{W}^{(k)} \tilde{h}_{i}^{(k)} \right) + \tilde{V}_{r}^{(k)T} \sigma \left( \tilde{W}^{(k)} \tilde{h}_{j}^{(k)} \right) \right)$$
(6)
$$\tilde{a}_{ij}^{(k)} = Softmax(\tilde{b}_{ij}^{(k)} ) = \frac{\exp (\tilde{b}_{ij}^{(k)} )}{\sum\nolimits_{l \in N(i)} \exp (\tilde{b}_{il}^{(k)} )}$$
(7)
$$\tilde{h}_{i}^{(k - 1)} = \sum\limits_{l \in N(i)} \tilde{a}_{il}^{(k)} \sigma (\tilde{W}^{(k)} \tilde{h}_{l}^{(k)} )$$
(8)
where \(\tilde{W}^{(k)} \in \Re^{d(k - 1) \times d(k)}\) is a trainable weight matrix and \(\tilde{V}_{s}^{(k)} \in \Re^{d(k - 1)}\) and \(\tilde{V}_{r}^{(k)} \in \Re^{d(k - 1)}\) are trainable parameter vectors; \(d(k - 1)\) is the dimension of the output feature vectors of the k-th decoder layer. The output of the last layer for the i-th node is defined as \(\tilde{F}_{i}\), i.e., \(\tilde{F}_{i} = \tilde{h}_{i}^{(0)}\).

Loss function. GATE takes the raw feature vector \(F_{i}\) of the i-th node as input; the encoder transforms \(F_{i}\) into \(h_{i}\), and the decoder recovers \(h_{i}\) into \(\tilde{F}_{i}\). GATE considers two types of loss to examine the quality of the reconstructed feature vector \(h_{i}\). The first is the distance between \(F_{i}\) and \(\tilde{F}_{i}\), measured by

$$\sum\limits_{i = 1}^{N} \left\| F_{i} - \tilde{F}_{i} \right\|_{2}$$
(9)
where N is the total number of nodes in the network. The connections in the network should also be considered, with neighboring nodes assigned similar feature vectors. The second type of loss is therefore formulated as

$$- \sum\limits_{i = 1}^{N} \sum\limits_{l \in N(i)} \log \left( \frac{1}{1 + \exp ( - h_{i}^{T} h_{l} )} \right)$$
(10)
where \(N(i)\) stands for the closed neighbor set of the i-th node. Accordingly, the final loss function combines the losses in Eqs. 9 and 10 as

$$\text{Loss} = \sum\limits_{i = 1}^{N} \left( \left\| F_{i} - \tilde{F}_{i} \right\|_{2} - \lambda \sum\limits_{l \in N(i)} \log \left( \frac{1}{1 + \exp ( - h_{i}^{T} h_{l} )} \right) \right)$$

(11)

where \(\lambda\) is a parameter balancing the two types of loss, \(F_{i}\) is the raw feature vector of the i-th node, \(\tilde{F}_{i}\) is the recovered feature vector (decoder output) of the i-th node, \(h_{i}\) is the new feature vector (encoder output) of the i-th node, and \(h_{l}\) is the new feature vector of the l-th node, a neighbor of the i-th node. High-quality new node representations are obtained by minimizing this loss.
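Under the same assumptions as the encoder sketch above, the loss of Eq. 11 could be evaluated as:

```python
import numpy as np

def gate_loss(F, F_rec, H, neighbors, lam=1.0):
    """Eq. 11: reconstruction error (Eq. 9) minus lambda times the
    graph-proximity term (Eq. 10, which is itself negated)."""
    recon = np.linalg.norm(F - F_rec, axis=1).sum()  # Eq. 9
    proximity = 0.0
    for i, nbrs in enumerate(neighbors):
        s = H[np.asarray(nbrs)] @ H[i]               # h_i^T h_l for each neighbor l
        proximity += np.log(1.0 / (1.0 + np.exp(-s))).sum()
    return recon - lam * proximity
```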
Representations of circRNAs and miRNAs yielded by node2vec

The above representations of circRNAs (miRNAs) were derived from their local relationships to miRNAs (circRNAs) and from the circRNA (miRNA) similarity network; the global relationships of circRNAs to miRNAs were ignored. Therefore, additional information was employed to evaluate the global associations between circRNAs and miRNAs. Known CMAs are essential for predicting novel associations, and using them to measure the associations between circRNAs and miRNAs can help in building efficient models. As mentioned in Sect. "Dataset", the known CMAs are represented by the CMA network NCM, from which representations of circRNAs and miRNAs can be derived. The powerful network embedding algorithm node2vec [32] was employed for this task.

Node2vec is an improved version of DeepWalk [34]. For a given network, it produces many paths starting from each node. Suppose the starting node is u and the current endpoint of the path is ni-1, the (i-1)-th node in this path. The path is extended to its i-th node, denoted by ni, by selecting one neighbor of ni-1. The neighbors are not selected with equal probability; the selection probability is determined by

$$P(n_{i} = w \mid n_{i - 1} = v) = \begin{cases} \pi_{vw} / Z & \text{if } w \text{ is adjacent to } v \\ 0 & \text{otherwise} \end{cases}$$
(12)
where \(\pi_{vw}\) is the transition probability from v to w, computed as

$$\pi_{vw} = \alpha_{pq} (t,w) \cdot w_{vw}$$
(13)
$$\alpha_{pq} (t,w) = \begin{cases} 1/p & \text{if } d_{tw} = 0 \\ 1 & \text{if } d_{tw} = 1 \\ 1/q & \text{if } d_{tw} = 2 \end{cases}$$

(14)

where \(w_{vw}\) is the weight of the edge connecting v and w, t is the (i-2)-th node in the path, and \(d_{tw}\) denotes the distance between t and w. The symbol Z in Eq. 12 is the sum of the transition probabilities from v to its neighbors. Evidently, the paths sampled by node2vec are produced by biased rather than uniform selection of the next node, which efficiently captures the structural traits of the network. The selection of the next node is repeated until the path reaches a predefined length. After the paths starting from each node have been sampled, the paths are treated as sentences and the nodes within them as words. This corpus is fed into word2vec with the skip-gram model to produce the feature vectors of the nodes.
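A single biased step of a walk can be sketched as follows (an illustrative re-implementation of Eqs. 12-14, not the SNAP code used in this study; `graph` is assumed to map each node to a dict of its weighted neighbors):

```python
import random

def next_node(graph, t, v, p=1.0, q=1.0):
    """Sample the next node of a node2vec walk at current node v,
    having arrived from the previous node t."""
    candidates = list(graph[v])
    weights = []
    for w in candidates:
        if w == t:               # d_tw = 0: step back to the previous node
            alpha = 1.0 / p
        elif w in graph[t]:      # d_tw = 1: w is also a neighbor of t
            alpha = 1.0
        else:                    # d_tw = 2
            alpha = 1.0 / q
        weights.append(alpha * graph[v][w])  # Eq. 13: pi_vw
    # Eq. 12: random.choices normalizes by Z, the sum of the weights
    return random.choices(candidates, weights=weights, k=1)[0]
```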
This study applied the node2vec program downloaded from https://snap.stanford.edu/node2vec/ to the CMA network NCM to generate the feature vectors of circRNAs and miRNAs. For formulation, the feature vectors of the i-th circRNA and miRNA are denoted by \(C_{i}^{n}\) and \(M_{i}^{n}\), respectively.

For the i-th circRNA, the two representations \(C_{i}^{g}\) and \(C_{i}^{n}\) were combined to generate its final representation; the same operation was conducted for each miRNA. The final representations of the i-th circRNA and miRNA, denoted by \(\hat{C}_{i}\) and \(\hat{M}_{i}\), respectively, were formulated as

$$\hat{C}_{i} = [C_{i}^{g} ,C_{i}^{n} ]$$
(15)
$$\hat{M}_{i} = [M_{i}^{g} ,M_{i}^{n} ]$$
(16)
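In matrix form, with the GATE and node2vec embeddings stored row-wise in hypothetical arrays `C_g`, `C_n`, `M_g`, and `M_n`, the concatenation amounts to:

```python
import numpy as np

C_hat = np.hstack([C_g, C_n])  # Eq. 15, one row per circRNA
M_hat = np.hstack([M_g, M_n])  # Eq. 16, one row per miRNA
```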
Reconstruction of similarity networks

With these new representations of circRNAs and miRNAs, corresponding informative similarity networks can be built. Here, the cosine kernel was adopted to measure the associations between circRNAs (miRNAs) on the basis of their new representations. For circRNAs ci and cj, their similarity was reformulated as

$$CS(c_{i} ,c_{j} ) = \frac{\hat{C}_{i} \cdot \hat{C}_{j}^{T} }{\left| \hat{C}_{i} \right| \cdot \left| \hat{C}_{j} \right|}$$
(17)
where CS is the new circRNA similarity matrix collecting the outcomes of Eq. 17. The circRNA similarity network was built using CS as its weighted adjacency matrix. The similarity network for miRNAs was reconstructed in the same way: for miRNAs mi and mj, their similarity was measured as

$$MS(m_{i} ,m_{j} ) = \frac{\hat{M}_{i} \cdot \hat{M}_{j}^{T} }{\left| \hat{M}_{i} \right| \cdot \left| \hat{M}_{j} \right|}$$

(18)

where MS is the miRNA similarity matrix storing the outcomes of Eq. 18. The miRNA similarity network was then established using MS as its weighted adjacency matrix.
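Both matrices follow the same cosine-kernel computation, which might be vectorized as below (`C_hat` and `M_hat` are the stacked final representations from Eqs. 15-16):

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity (Eqs. 17-18) between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return Xn @ Xn.T

# CS = cosine_similarity_matrix(C_hat)
# MS = cosine_similarity_matrix(M_hat)
```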
NCP

Given the above circRNA similarity network (represented by CS), miRNA similarity network (represented by MS), and CMA network (represented by the adjacency matrix AS), a simple and efficient network-based method named NCP (network consistency projection) was utilized to calculate association scores for circRNAs and miRNAs. NCP has been widely applied in association prediction [35,36,37,38,39]. The score yielded by NCP integrates two subscores, obtained by projecting the circRNA and miRNA similarity networks onto the CMA network. For circRNA ci and miRNA mj, the subscore for projecting the circRNA similarity network onto the CMA network was calculated as

$$CSP(c_{i} ,m_{j} ) = \frac{AS(i,:) \times CS(:,j)}{\left| AS(i,:) \right|}$$
(19)
where \(AS(i,:)\) represents the i-th row of AS, \(CS(:,j)\) denotes the j-th column of CS, and \(\left| AS(i,:) \right|\) is the length (norm) of the vector \(AS(i,:)\). Meanwhile, the subscore for projecting the miRNA similarity network onto the CMA network was computed as

$$MSP(c_{i} ,m_{j} ) = \frac{MS(i,:) \times AS(:,j)}{\left| AS(:,j) \right|}$$
(20)
where \(MS(i,:)\) is the i-th row of MS, \(AS(:,j)\) indicates the j-th column of AS, and \(\left| AS(:,j) \right|\) is the length of \(AS(:,j)\). Finally, NCP integrates the two subscores as follows:

$$NSP(c_{i} ,m_{j} ) = \frac{CSP(c_{i} ,m_{j} ) + MSP(c_{i} ,m_{j} )}{\left| CS(i,:) \right| + \left| MS(:,j) \right|}$$

(21)

where NSP is the final recommendation matrix collecting all the outcomes of Eq. 21.
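A matrix-form sketch of Eqs. 19-21 is given below; it follows one plausible reading of the paper's index conventions (rows of AS index miRNAs, columns index circRNAs), with a small epsilon guarding against empty rows or columns:

```python
import numpy as np

def ncp_scores(AS, CS, MS, eps=1e-12):
    """AS: (m, n) adjacency; CS: (n, n) circRNA similarity;
    MS: (m, m) miRNA similarity. Returns the (m, n) matrix NSP."""
    # Eq. 19: project the circRNA similarity network onto the CMA network
    CSP = (AS @ CS) / (np.linalg.norm(AS, axis=1, keepdims=True) + eps)
    # Eq. 20: project the miRNA similarity network onto the CMA network
    MSP = (MS @ AS) / (np.linalg.norm(AS, axis=0, keepdims=True) + eps)
    # Eq. 21: combine the projections, normalizing by the similarity norms
    circ_norm = np.linalg.norm(CS, axis=1)   # |CS(i, :)|
    mirna_norm = np.linalg.norm(MS, axis=0)  # |MS(:, j)|
    return (CSP + MSP) / (mirna_norm[:, None] + circ_norm[None, :] + eps)
```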
Outline of CMAGN

This work designed a new model named CMAGN for predicting CMAs; its architecture is illustrated in Fig. 1. CMAGN consists of five modules. First, the similarity networks for circRNAs and miRNAs were constructed on the basis of their sequence similarities in module A (Fig. 1A). Second, the raw feature vectors of circRNAs (miRNAs) were extracted from the circRNA–miRNA adjacency matrix and fed into GATE to obtain new representations of circRNAs (miRNAs) in module B (Fig. 1B). Third, the feature vectors of circRNAs and miRNAs were generated by applying node2vec to the CMA network in module C (Fig. 1C). Fourth, the two representations of circRNAs (miRNAs) were combined to reconstruct the circRNA (miRNA) similarity network in module D (Fig. 1D). Finally, NCP was applied to the reconstructed similarity networks and the CMA network to yield the recommendation matrix in module E (Fig. 1E).

Fig. 1 Overall procedure of CMAGN. A The circRNA (miRNA) similarity network is built on the basis of sequence similarities. B The similarity network and the raw representations of circRNAs (miRNAs) derived from the adjacency matrix are fed into GATE to generate new representations of circRNAs (miRNAs). C Representations of circRNAs (miRNAs) are extracted from the CMA network via node2vec. D The above two representations are combined to encode circRNAs (miRNAs) and reconstruct the circRNA (miRNA) similarity networks. E The reconstructed similarity networks and the association network are analyzed by NCP to produce a recommendation matrix.
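Putting the modules together, the overall flow of CMAGN can be summarized by the following sketch, which wires up the illustrative helpers from the preceding subsections (`gate_embed` and `node2vec_embed` are hypothetical wrappers standing in for the GATE and node2vec steps):

```python
import numpy as np

def cmagn(AS, circ_seqs, mirna_seqs, T=0.5):
    # Module A: sequence-based similarity networks
    N_circ = similarity_network(sequence_similarity_matrix(circ_seqs), T)
    N_mirna = similarity_network(sequence_similarity_matrix(mirna_seqs), T)
    # Module B: GATE embeddings from raw AS columns (circRNAs) / rows (miRNAs)
    C_g = gate_embed(AS.T, N_circ)   # hypothetical GATE wrapper
    M_g = gate_embed(AS, N_mirna)
    # Module C: node2vec embeddings from the bipartite CMA network
    C_n, M_n = node2vec_embed(AS)    # hypothetical node2vec wrapper
    # Module D: concatenate (Eqs. 15-16) and rebuild similarities (Eqs. 17-18)
    CS = cosine_similarity_matrix(np.hstack([C_g, C_n]))
    MS = cosine_similarity_matrix(np.hstack([M_g, M_n]))
    # Module E: network consistency projection (Eqs. 19-21)
    return ncp_scores(AS, CS, MS)
```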
