Drug repositioning based on residual attention network and free multiscale adversarial training

Data preparation

We employ two benchmark datasets established by previous investigators. The first is the F-dataset, which corresponds to Gottlieb's gold standard dataset [45]. It contains 1933 known drug-disease associations between 313 diseases collected from the OMIM database [46] and 593 drugs obtained from the DrugBank database [47]. The second is the C-dataset [24], which includes 2532 known associations between 409 diseases collected from the OMIM database and 663 drugs obtained from the DrugBank database. Table 7 summarizes the two benchmark datasets.

Table 7 Details of the two benchmark datasets

In this study, the drug structure similarity matrix Xdr is computed from the simplified molecular input line entry system (SMILES) chemical structures [48] as the Tanimoto index of the chemical fingerprints of each drug pair, calculated with the Chemistry Development Kit [49]. The disease semantic similarity matrix Xdi is computed from the semantic similarity of disease phenotypes, using information from the medical descriptions of each disease pair [50].

RAFGAE

After collecting the required data from different sources, we propose a prediction model with three modules to predict potential candidate diseases for drugs of interest. First, we design the Re_GAT framework, which captures global structural information from a bipartite network. Second, we employ graph autoencoders (GAEs) that use known drug-disease associations to simulate label propagation, guiding the prediction of unknown associations. Third, we apply the FMAT module for adversarial training to improve the input quality and increase the prediction accuracy. Figure 7 shows the overall workflow of RAFGAE.

Fig. 7 Flow chart of the RAFGAE calculation method

Re_GAT framework

Graph attention networks (GATs) use a self-attention hidden layer to assign different attention scores to different neighbors, which allows the features of neighboring nodes to be extracted more effectively. The initial input to the Re_GAT framework can be described as follows:

$$ h = \{ h_{1} ,h_{2} , \ldots ,h_{N} \} ,\quad h_{i} \in R^{F} $$
(1)
where N is the number of nodes, F is the feature dimension, and h_i ∈ R^F is the initial feature vector of node i. GATs compute an attention score reflecting the importance of each neighbor and then aggregate neighbor features according to that score. The attention score is calculated as follows:

$$ e_{ij} = \sigma \left( a^{T} \left[ Wh_{i} \parallel Wh_{j} \right] \right) $$
(2)
To account for the differing influence of different nodes, we normalize the attention scores with the softmax function:

$$ a_{ij} = {\text{softmax}}_{j} (e_{ij} ) = \frac{\exp (e_{ij} )}{\sum\nolimits_{k \in N_{i} } \exp (e_{ik} )} $$
(3)
By combining Formulas (2) and (3), the attention score can be expressed as:

$$ a_{ij} = \frac{\exp \left( \sigma (a^{T} [Wh_{i} \parallel Wh_{j} ]) \right)}{\sum\nolimits_{k \in N_{i} } \exp \left( \sigma (a^{T} [Wh_{i} \parallel Wh_{k} ]) \right)} $$
(4)
where aij is the attention score, W is a learnable linear transformation matrix, a is a learnable weight vector, σ(·) represents the LeakyReLU activation function, and ║ denotes the concatenation operation. After normalization, the final output feature is computed as:

$$ h_{i} = \sigma \left( \sum\limits_{j \in N_{i} } a_{ij} Wh_{j} \right) $$
(5)
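To make Eqs. (2)-(5) concrete, the following is a minimal NumPy sketch of a single dense attention layer. The helper names, the LeakyReLU slope, the masking constant and the tanh output nonlinearity are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a):
    """One dense graph attention layer following Eqs. (2)-(5).

    H   : (N, F)   node features
    adj : (N, N)   binary adjacency (adj[i, j] = 1 if j is a neighbor of i)
    W   : (F, Fp)  learnable linear transformation
    a   : (2*Fp,)  learnable attention weight vector
    """
    Wh = H @ W                                   # (N, Fp)
    Fp = Wh.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j])   (Eq. 2)
    e = leaky_relu(Wh @ a[:Fp, None] + (Wh @ a[Fp:, None]).T)
    e = np.where(adj > 0, e, -1e9)               # attend only to neighbors
    # softmax over each node's neighborhood      (Eq. 3)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # aggregate neighbor features                (Eq. 5); tanh as an illustrative sigma
    return np.tanh(alpha @ Wh)
```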
In this study, the drug-disease association matrix is denoted by A, where the rows represent drugs and the columns represent diseases: A(j, k) = 1 if drug j is associated with disease k and 0 otherwise. Matrix A and its transpose AT define the bipartite network G:

$$ G = \begin{bmatrix} 0 & A \\ A^{T} & 0 \end{bmatrix} \in R^{(N_{dr} + N_{di} ) \times (N_{dr} + N_{di} )} $$
(6)
We construct the initial input embedding H(0) as follows:

$$ H^{(0)} = \begin{bmatrix} X_{dr} & A \\ A^{T} & X_{di} \end{bmatrix} \in R^{(N_{dr} + N_{di} ) \times (N_{dr} + N_{di} )} $$
(7)
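As a small illustration of Eqs. (6) and (7), the bipartite adjacency G and the initial embedding H(0) can be assembled from the association and similarity matrices as sketched below; the function and argument names are ours.

```python
import numpy as np

def build_bipartite_inputs(A, X_dr, X_di):
    """Assemble the bipartite adjacency G (Eq. 6) and the initial embedding H0 (Eq. 7).

    A    : (N_dr, N_di) drug-disease association matrix
    X_dr : (N_dr, N_dr) drug structure similarity matrix
    X_di : (N_di, N_di) disease semantic similarity matrix
    """
    n_dr, n_di = A.shape
    G = np.block([[np.zeros((n_dr, n_dr)), A],
                  [A.T, np.zeros((n_di, n_di))]])
    H0 = np.block([[X_dr, A],
                   [A.T, X_di]])
    return G, H0
```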
Combined with the bipartite network adjacency matrix G above, the graph attention network is defined as:

$$ H^{(l)} = \sigma (GATs(G,H^{(l - 1)} )) $$
(8)
where H(l) is the node embedding of the l-th layer, l = 1, …, L, and GATs(·) denotes a single attention layer; the full Re_GAT framework consists of multiple attention layers. This study builds the Re_GAT framework around two main strategies for forward propagation: (I) initial and adaptive residual connections; and (II) attention-based layer aggregation.

To learn feature information from higher-order neighbors, multiple attention layers are typically stacked, which easily homogenizes node representations and thus leads to the oversmoothing problem. Residual connections, also known as skip connections, were first proposed in ResNet to ease the training of deep CNNs. Inspired by ResNet [51], recent studies have applied various residual connections to GATs to alleviate oversmoothing. Several studies have shown that residual connections are necessary for deep GATs [52], not only to alleviate oversmoothing but also to give GATs more stable gradients.

We therefore add H(0) and H(l-1) to the output of the l-th attention layer, weighted by the hyperparameters α and β, respectively; the initial skip connection and the adaptive skip connection together mitigate the oversmoothing problem and accelerate the convergence of the GATs. The GAT update of our model can be rewritten as:

$$ H^{(l)} = \sigma \left( GATs(G,H^{(l - 1)} ) \right) + \alpha H^{(0)} + \beta H^{(l - 1)} $$
(9)
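A minimal sketch of the forward propagation in Eq. (9), assuming each layer is a callable (for example, the gat_layer sketch above with its weights fixed); the default values of α and β are placeholders, not the tuned hyperparameters of the paper.

```python
def re_gat_forward(H0, G, layers, alpha=0.1, beta=0.5):
    """Stack attention layers with initial and adaptive residual connections (Eq. 9).

    layers : list of callables, each mapping (H, G) to the next node embedding
    """
    H_prev = H0
    per_layer = []                    # kept for the layer aggregation step below
    for gat in layers:
        H = gat(H_prev, G) + alpha * H0 + beta * H_prev
        per_layer.append(H)
        H_prev = H
    return per_layer
```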
Inspired by LAGCN [35], we observe that the embedding of each layer captures structural information from a different order of the heterogeneous network: the initial layer captures direct connections, whereas higher-order layers collect information about multihop neighbors through iterative embedding updates. To fuse the useful information from all GAT layers, we use an attention mechanism. Since the Re_GAT framework computes embeddings at different layers and these embeddings contain different information, we define the embedding of GAT layer l as:

$$ H_{l} = \begin{bmatrix} H_{l}^{dr} \\ H_{l}^{di} \end{bmatrix} \in R^{(N_{dr} + N_{di} ) \times k_{l} } $$
(10)
where H_l^dr ∈ R^(Ndr × kl) is the drug embedding at layer l and H_l^di ∈ R^(Ndi × kl) is the disease embedding at layer l. We use attention-based layer aggregation to integrate the embedding matrices of all L layers, with attention factors a_i and b_i calculated via Formulas (2), (3) and (4). The final fused embedding matrices are:

$$ C_{dr} = \sum\limits_{i = 1}^{L} a_{i} H_{i}^{dr} $$
(11)
$$ C_{di} = \sum\limits_{i = 1}^{L} b_{i} H_{i}^{di} $$
(12)
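The layer aggregation of Eqs. (11) and (12) is simply an attention-weighted sum of the per-layer embeddings; a minimal sketch, assuming the attention factors have already been computed:

```python
def aggregate_layers(layer_embeddings, attention_factors):
    """Fuse per-layer embeddings with attention factors (Eqs. 11 and 12)."""
    return sum(a * H for a, H in zip(attention_factors, layer_embeddings))
```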
Constructing the feature similarity graph

A previous study showed that a similarity graph constructed from drug and disease features can be used to propagate labels [53]. We use the features Cdr and Cdi to construct feature similarity graphs for drugs and diseases, respectively; these graphs are used for label propagation in the drug and disease spaces. The feature similarity graphs are constructed as follows. First, the Euclidean distances between nodes are calculated and ranked. Second, for each node i, its 10 nearest neighbors are selected. Third, the adjacency matrix M is defined with respect to the neighbor set N(i) of each node i: Mij = 1 if j belongs to N(i) and Mij = 0 otherwise. Finally, with ⊙ denoting the Hadamard product, the self-looped adjacency matrix of the similarity graph S is constructed as follows; applying this procedure to Cdr and Cdi yields the drug similarity graph Sdr and the disease similarity graph Sdi:

$$ S = M^{T} \odot M + I $$
(13)
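The construction described above can be sketched as follows, with one row of the feature matrix per node; k = 10 follows the text, and the mutual-neighbor filtering implements the Hadamard product of Eq. (13).

```python
import numpy as np

def feature_similarity_graph(C, k=10):
    """Build the self-looped similarity graph S of Eq. (13) from node features C."""
    n = C.shape[0]
    # pairwise Euclidean distances between node feature vectors
    diff = C[:, None, :] - C[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                 # a node is not its own neighbor
    # M[i, j] = 1 if j is among the k nearest neighbors of i
    M = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :k]
    M[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # Eq. (13): keep mutual neighbors (Hadamard product) and add self-loops
    return M.T * M + np.eye(n)
```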
Graph autoencoder

Previous studies have shown that a graph autoencoder (GAE) can simulate label propagation by iteratively propagating label information over a graph [54,55,56]. The association matrix A can be regarded as the initial label information. The initial label information and the similarity graph S calculated above are input to the GAE: the encoder produces a hidden representation Z, and the decoder outputs the score matrix F. The encoder of the GAE is defined as:

$$ GAE_{enc} (S,A) = \tanh (S \cdot {\text{ReLU}}(SA\Phi^{(0)} )\Phi^{(1)} ) $$
(14)
where Φ(0) and Φ(1) denote trainable weight matrices. We use two GAEs to propagate label information on the drug graph and the disease graph, yielding the drug hidden representation Zdr and the disease hidden representation Zdi:

$$ Z_{dr} = GAE_{enc} (S_{dr} ,A) $$
(15)
$$ Z_{di} = GAE_{enc} (S_{di} ,A^{T} ) $$
(16)
where Sdr and Sdi denote the drug similarity graph and the disease similarity graph, respectively, and A denotes the association matrix. The decoder of the GAE decodes the hidden representation and is defined as follows:

$$ GAE_{dec} (S,Z) = {\text{sigmoid}}(S \cdot {\text{ReLU}}(SZ\Phi^{(2)} )\Phi^{(3)} ) $$
(17)
The score matrices Fdr and Fdi are therefore obtained by decoding Zdr and Zdi, respectively:

$$ F_{dr} = GAE_{dec} (S_{dr} ,Z_{dr} ) $$
(18)
$$ F_{di} = GAE_{dec} (S_{di} ,Z_{di} ) $$
(19)
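The encoder and decoder of Eqs. (14)-(19) are two two-layer propagation rules; a minimal NumPy sketch, in which the weight matrices Φ(0)-Φ(3) are assumed to be given (learned elsewhere):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_encode(S, A, Phi0, Phi1):
    """Encoder of Eq. (14): Z = tanh(S . ReLU(S A Phi0) Phi1)."""
    return np.tanh(S @ relu(S @ A @ Phi0) @ Phi1)

def gae_decode(S, Z, Phi2, Phi3):
    """Decoder of Eq. (17): F = sigmoid(S . ReLU(S Z Phi2) Phi3)."""
    return sigmoid(S @ relu(S @ Z @ Phi2) @ Phi3)

# Drug and disease spaces (Eqs. 15, 16, 18 and 19):
#   Z_dr = gae_encode(S_dr, A,   Phi0_dr, Phi1_dr);  F_dr = gae_decode(S_dr, Z_dr, Phi2_dr, Phi3_dr)
#   Z_di = gae_encode(S_di, A.T, Phi0_di, Phi1_di);  F_di = gae_decode(S_di, Z_di, Phi2_di, Phi3_di)
```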
Since Fdr and Fdi are both low-rank matrices [57], they satisfy the rank subadditivity inequality:

$$ rank(\alpha F_{dr} + (1 - \alpha )F_{di}^{T} ) \le rank(F_{dr} ) + rank(F_{di}^{T} ) $$
(20)
By taking a linear combination of Fdr and Fdi, the final integrated score is obtained as follows:

$$ F = \alpha F_{dr} + (1 - \alpha )F_{di}^{T} $$
(21)
where α ∈ (0,1) represents the balance weight between the drug space and the disease space. The GAE reconstruction error is the cross-entropy loss between the final prediction and the true value:

$$ L_{r} = - \sum\limits_{i,j} A_{ij} \log F_{ij} $$
(22)
Because information from both the disease space and the drug space influences the predicted outcome, we use a cotraining approach to train the two GAEs. The cotraining loss Lco is defined as:

$$ L_{co} = \frac{1}{2}\left\| Z_{dr} Z_{di}^{T} - A \right\|_{F}^{2} $$
(23)
The combined loss function, where Lrdr and Lrdi denote the reconstruction errors of the GAEs in the drug space and the disease space, respectively, can then be written as:

$$ L = \alpha L_{rdr} + (1 - \alpha )L_{rdi} + L_{co} $$
(24)
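Putting Eqs. (21)-(24) together, the scoring and training objective can be sketched as below; the small constant inside the logarithm and the default value of α are our assumptions for illustration.

```python
import numpy as np

def integrated_score(F_dr, F_di, alpha=0.5):
    """Final prediction matrix of Eq. (21)."""
    return alpha * F_dr + (1 - alpha) * F_di.T

def rafgae_loss(A, F_dr, F_di, Z_dr, Z_di, alpha=0.5, eps=1e-8):
    """Combined training loss of Eq. (24).

    A    : (N_dr, N_di) association matrix
    F_dr : (N_dr, N_di) decoded scores in the drug space
    F_di : (N_di, N_dr) decoded scores in the disease space
    """
    L_r_dr = -np.sum(A * np.log(F_dr + eps))              # Eq. (22), drug space
    L_r_di = -np.sum(A.T * np.log(F_di + eps))            # Eq. (22), disease space
    L_co = 0.5 * np.linalg.norm(Z_dr @ Z_di.T - A) ** 2   # Eq. (23)
    return alpha * L_r_dr + (1 - alpha) * L_r_di + L_co   # Eq. (24)
```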
Free multiscale adversarial training

In this section, we investigate how to effectively improve the input quality through data augmentation [58]. When neural networks are trained, the quality of the data matters far more than its quantity. By searching for and removing the small perturbations that cause the classifier to fail, adversarial training can also be expected to benefit standard accuracy. Adversarial training is a well-studied method that increases the robustness and interpretability of neural networks, and when the data distribution is sparse and discrete, the beneficial effect of adversarial perturbations on generalizability is particularly prominent [59]. Inspired by this, we introduce free multiscale adversarial training (FMAT) to augment the node features [60].

Adversarial training first generates adversarial perturbations and then incorporates them into the training node features. Given a learning model fθ with parameters θ, we denote the perturbed feature as Hadv = H + δ. Adversarial learning follows the min-max formulation:

$$ \mathop {\min }\limits_{\theta } E_{(H,A)\sim D} \left[ \mathop {\max }\limits_{\left\| \delta \right\|_{p} \le \varepsilon } L(f_{\theta } (H + \delta ),A) \right] $$
(25)
where A represents the true labels, D the data distribution, L the objective loss function, ε the perturbation budget, and ║·║p an lp-norm distance measure. The saddle-point optimization problem can be solved with projected gradient descent (PGD) for the inner maximization and stochastic gradient descent (SGD) for the outer minimization. The perturbation δ is updated after each step, with ∏║δ║∞≤ε denoting projection onto the ε-ball under the l∞-norm:

$$ \delta_{t + 1} = \prod\nolimits_{\left\| \delta \right\|_{\infty } \le \varepsilon } \left( \delta_{t} + e \cdot sign(\nabla_{\delta } L(f_{\theta } (H + \delta_{t} ),A)) \right) $$
(26)
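A single inner-maximization step of Eq. (26) is a signed-gradient ascent followed by projection onto the l∞-ball; a sketch, where the gradient with respect to δ is assumed to be supplied by the surrounding autodiff framework:

```python
import numpy as np

def pgd_step(delta, grad_delta, eps, step_size):
    """One PGD update of the perturbation (Eq. 26)."""
    delta = delta + step_size * np.sign(grad_delta)   # ascend along the loss gradient
    return np.clip(delta, -eps, eps)                  # project onto ||delta||_inf <= eps
```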
The initial layer of the Re_GAT framework can then be rewritten as:

$$ H^{(1)} = \sigma (GATs(G,H^{(0)} + \delta_{t} )) $$
(27)
To effectively exploit the generalizability of adversarial perturbations and improve their diversity and quality, Chen et al. emphasized the importance of adapting to different types of data augmentation [61]. To achieve this, we introduce a 'free' training approach [62].

Computing δ with standard PGD is inefficient: an N-step update requires N full forward and backward passes to obtain the worst-case perturbation δN, yet the model weights θ are updated only once with δN, so training becomes N times slower. In contrast, 'free' training updates the model weights θ on the same backward pass that computes the gradient of δ, allowing weight updates to be carried out in parallel with perturbation updates. 'Free' training achieves robustness and accuracy comparable to standard adversarial training, while its training cost matches that of clean training. The 'free' strategy accumulates the gradient \(\nabla_{\theta } L\) in each iteration and updates the model weights θ with this gradient. During training, the model runs the inner loop T times, at each step computing the gradients of θt-1 and δt by taking a step along the average gradient at H(l) + δ0, …, H(l) + δT-1. Formally, the optimization step is

$$ \mathop {\min }\limits_{\theta } E_{(H,A)\sim D} \left[ \frac{1}{T}\sum\limits_{t = 0}^{T - 1} \mathop {\max }\limits_{\left\| \delta \right\|_{p} \le \varepsilon } L(f_{\theta } (H + \delta_{t} ),A) \right] $$
(28)
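Finally, a schematic of the 'free' training loop of Eq. (28). The model, gradient and weight-update callables, as well as the hyperparameter values, are placeholders; the point of the sketch is only that δ and θ are updated from the same forward-backward pass.

```python
import numpy as np

def free_adversarial_training(H0, A, model, loss_grads, update_weights,
                              T=4, eps=0.05, step_size=0.01, epochs=10):
    """'Free' multiscale adversarial training loop (Eq. 28).

    model(H_adv)        -> predictions
    loss_grads(pred, A) -> (grad_theta, grad_delta): gradients of the loss with
                           respect to the model weights and the perturbation
    update_weights(g)   -> applies one SGD step to the model weights
    """
    delta = np.zeros_like(H0)
    for _ in range(epochs):
        for _ in range(T):                                  # inner loop runs T times
            pred = model(H0 + delta)                        # one forward pass
            grad_theta, grad_delta = loss_grads(pred, A)    # one backward pass
            update_weights(grad_theta)                      # weight update reuses that pass
            delta = np.clip(delta + step_size * np.sign(grad_delta),
                            -eps, eps)                      # perturbation update (Eq. 26)
    return delta
```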
