Inferring gene regulatory networks with graph convolutional network based on causal feature reconstruction

Data set and evaluation indicators

The DREAM5 dataset provided by the DREAM Challenges [37] and the mouse dendritic cell (mDC) network [38] are used in this paper. Details of the datasets are presented in Table 1. The S.cerevisiae network has more genes but fewer samples and TFs, and its true-positive edges are fewer than its true-negative edges, which induces class imbalance.
Table 1 The details of the DREAM5 dataset.

The implementation of gene regulatory network inference by GCN link prediction involved several steps. The hyperparameter of the Gaussian kernel function was set experimentally. The numbers of Autoencoder hidden nodes were set to 805, 536 and 383, corresponding to the number of samples in the E.coli, S.cerevisiae and mDC networks, respectively. The Adam optimizer was used with a learning rate of 0.001, chosen experimentally. L2 regularization was additionally applied during training to prevent overfitting, with an L2 rate of 0.001, also chosen experimentally. The extracted features are then fed into the GCN. The dataset was divided into a training set (70%) and a test set (30%). The Adam optimizer parameters for the GCN are the same as those for the Gaussian-kernel Autoencoder.

In this paper, AUROC (Area Under the Receiver Operating Characteristic Curve) and AUPRC (Area Under the Precision–Recall Curve) are used as evaluation metrics for link prediction. AUROC is the area under the curve whose axes are the True Positive Rate (TPR) and the False Positive Rate (FPR), while AUPRC is the area under the curve whose axes are Precision and Recall.

$$TPR=\frac{TP}{TP+FN} \tag{9}$$

$$FPR=\frac{FP}{FP+TN} \tag{10}$$

$$Precision=\frac{TP}{TP+FP} \tag{11}$$

$$Recall=\frac{TP}{TP+FN} \tag{12}$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.

To verify the effectiveness of the method, three experiments are set up, which separately examine the proposed feature extraction method, causal feature reconstruction, and link prediction.

Experiment 1: validating the effectiveness of feature extraction methods

To evaluate the effectiveness of the proposed feature extraction method, we compared a range of input features: the original gene expression data, the sample expression features extracted solely by the Gaussian kernel function (GKF), the features obtained through singular value decomposition (SVD) [29], the features obtained through non-negative matrix factorization (NMF) [28], the fusion features extracted by the 1DCNN method [22], and the features extracted by the Gaussian-kernel Autoencoder (gAE). For the link prediction task, a two-layer GCN was selected as the network model, and the number of training iterations was determined per network (E.coli, S.cerevisiae and mDC). Comparative results are depicted in Figs. 5, 6, 7.

Fig. 5 Comparison of results from different feature extraction methods in the E.coli network.

Fig. 6 Comparison of results from different feature extraction methods in the S.cerevisiae network.

Fig. 7 Comparison of results from different feature extraction methods in the mDC network.

Figures 5, 6, 7 demonstrate that the original gene expression features resulted in the lowest AUROC and AUPRC metrics for the E.coli, S.cerevisiae and mDC networks. The NMF and SVD methods achieved higher metrics by compressing and filtering the expression data.
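As a concrete reference for the quantities in Eqs. (9) to (12) and for the two areas reported in Figs. 5, 6, 7, the following is a minimal pure-Python sketch. It is not the authors' evaluation code. The function names are hypothetical, AUROC is computed through its pairwise-ranking interpretation, and AUPRC is approximated by average precision, which is one common estimator and is assumed here rather than taken from the paper.

```python
from typing import Dict, Sequence


def confusion_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """TPR, FPR, precision and recall from confusion counts (Eqs. 9-12)."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # identical to TPR by definition
    }


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision (an assumed estimator of AUPRC): the mean of the
    precision values at the rank of each positive edge.
    Assumes at least one positive label and no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Toy candidate edges: 1 = true regulatory link, 0 = no link.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auroc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

The pairwise AUROC loop is quadratic in the number of edges, which is acceptable for illustration but would normally be replaced by a sort-based computation on networks of the size used here.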
Using the GKF, the AUROC and AUPRC metrics improved to 0.804 and 0.801 in the E.coli network, 0.801 and 0.711 in the S.cerevisiae network, and 0.656 and 0.642 in the mDC network. These metrics are higher than those obtained with the original data, 1DCNN, NMF, and SVD, indicating that separable features can improve the accuracy of inferring gene regulatory networks.

In the E.coli network, using the gAE to extract features achieved the highest AUROC and AUPRC metrics, surpassing the GKF method by approximately 3% in AUPRC. In the S.cerevisiae network, the AUROC obtained with gAE features was 5.8% higher than with the GKF; however, the AUPRC was slightly lower owing to the class imbalance of this network, which has more negative edges. In the mDC network, the gAE surpassed the GKF by approximately 13% in AUROC and 19% in AUPRC, achieving the highest values of both metrics. Therefore, compared with the GKF, the gAE is able to mine deeper, more complex features that improve prediction accuracy. The results show that the gAE feature extraction method is effective in the E.coli, S.cerevisiae and mDC networks, which provides a sound basis for the subsequent link prediction tasks.

To assess the reliability of the Gaussian-kernel Autoencoder in extracting separable features, and the effect of the Gaussian kernel parameter $\sigma$ (in Eq. (1)) on the separable features and prediction results, $\sigma$ was set to 0.1, 0.5, 1, 2, and 5 and tested with a two-layer GCN and a GCN based on causal feature reconstruction (CRGCN). The results are shown in Table 2.
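To make the role of $\sigma$ concrete, the sketch below evaluates a Gaussian kernel between toy expression profiles for the five tested values. Eq. (1) is not reproduced in this section, so the standard form $k(x,y)=\exp(-\lVert x-y\rVert^2/(2\sigma^2))$ is an assumption and the paper's scaling may differ. The function names and the toy profiles are hypothetical. The sketch only shows how $\sigma$ rescales pairwise similarities and does not reproduce the separability results of Table 2.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Assumed standard Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_matrix(profiles: Sequence[Sequence[float]], sigma: float) -> List[List[float]]:
    """Pairwise kernel values between all expression profiles."""
    return [[gaussian_kernel(p, q, sigma) for q in profiles] for p in profiles]


# Hypothetical expression profiles (three genes, two samples each).
toy_profiles = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]

# The sigma values examined in the text.
for sigma in (0.1, 0.5, 1.0, 2.0, 5.0):
    similarity = gaussian_kernel(toy_profiles[0], toy_profiles[1], sigma)
    print(f"sigma={sigma}: k(profile0, profile1)={similarity:.6f}")
```

Under this assumed form, the kernel value for a fixed pair of profiles increases monotonically with $\sigma$, so the parameter sets the distance scale at which two profiles still count as similar.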
Table 2 Comparison of the Gaussian kernel parameter.

Table 2 shows that the E.coli, S.cerevisiae and mDC networks all attain their highest AUROC and AUPRC metrics when the Gaussian kernel parameter $\sigma$ is set to 1.

The t-SNE method [39] is used to visualise the original features and the separable features. Figures 8, 9, 10, 11, 12 demonstrate that deep and separable features are extracted by the gAE. In each figure, blue represents the E.coli network, green the S.cerevisiae network, and orange the mDC network; in each sub-graph, the raw features are shown on the left and the separable features on the right.

Fig. 8 Visualization of the network features, when $\sigma =0.1$.

Fig. 9 Visualization of the network features, when $\sigma =0.5$.

Fig. 10 Visualization of the network features, when $\sigma =1$.

Fig. 11 Visualization of the network features, when $\sigma =2$.

Fig. 12 Visualization of the network features, when $\sigma =5$.

The parameter $\sigma$ determines the distribution of the data in feature space. A larger $\sigma$ maps the features into a sparser space and over-separates them; conversely, a smaller $\sigma$ maps the features into a denser space and leaves them unseparated. Both overly large and overly small values of $\sigma$ fail to extract separable features effectively, resulting in lower AUROC and AUPRC metrics on CRGCN and GCN. The Gaussian kernel extracts the most appropriately separable features when $\sigma =1$; therefore $\sigma =1$ is used in the subsequent experiments.

Overall, the method of converting gene expression data into separable expression features is effective. Additionally, using the Autoencoder to combine these two kinds of features can better preserve the underlying information of the original expression data.
This allows the graph neural network to obtain a more precise and comprehensive representation of node features during the node aggregation stage, ultimately enhancing the accuracy of the subsequent link prediction task.

Experiment 2: validating the effectiveness of causal feature reconstruction

To validate the effectiveness of the causal feature reconstruction method, SVD, NMF, GKF, and gAE are selected as the feature extraction methods. A two-layer GCN and a GCN based on causal feature reconstruction (CRGCN) are used as the network models for the link prediction task, and tested on the E.coli, S.cerevisiae and mDC networks.

Fig. 13 Comparison of results from different network models in the E.coli network.

Fig. 14 Comparison of results from different network models in the S.cerevisiae network.

Fig. 15 Comparison of results from different network models in the mDC network.

In Figs. 13, 14, 15, the first four groups display the results of the different feature extraction methods combined with GCN in the link prediction task, and the last four groups show the results of the same methods combined with CRGCN.

As shown in Figs. 13, 14, 15, both the AUROC and AUPRC metrics are substantially higher in the last four groups than in the first four. In the E.coli network, compared with the gAE-GCN method, the gAE-CRGCN method improved the AUROC by 9.5% and the AUPRC by 3.2%. In the S.cerevisiae network, the AUROC improved by 7.4% and the AUPRC by 26%.
Similarly, in the mDC network the AUROC improved by 17.3% and the AUPRC by 15.4%.

The results illustrate that causal feature reconstruction yields deeper causal features, which in turn improves the accuracy and precision of preferential connection prediction.

Overall, causal feature reconstruction enables the GCN model to obtain a more comprehensive representation of node features by enhancing the causal relationship between neighboring nodes at each order. When combined with an effective feature extraction method for gene expression data, it is able to capture deeper details from the gene expression features, ultimately improving the accuracy of link prediction.

Experiment 3: validating the effectiveness of link prediction using GCN based on causal feature reconstruction

To tune the model, learning rates of 0.01, 0.005, 0.001, 1e−4 and 1e−5 were tested; the results are shown in Figs. 16 and 17.

Fig. 16 The AUROC with different learning rates.

Fig. 17 The AUPRC with different learning rates.

Figs. 16 and 17 show that the model achieves the best AUROC and AUPRC when the learning rate is 0.001; therefore, a learning rate of 0.001 is used in subsequent experiments.

To further validate the reliability and effectiveness of the gAE-CRGCN, 10-fold cross-validation is performed on the E.coli, S.cerevisiae and mDC networks. The results are shown in Figs. 18, 19, 20.

Fig. 18 10-fold cross-validation on the E.coli network.

Fig. 19 10-fold cross-validation on the S.cerevisiae network.

Fig. 20 10-fold cross-validation on the mDC network.

Table 3 Comparison of Inference Results from Different Algorithms.

In the 10-fold cross-validation analysis, the E.coli, S.cerevisiae, and mDC networks are each divided into 10 equal folds.
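A minimal sketch of how such a 10-fold partition could be formed and iterated, with each fold held out once as the test set, is given below. It assumes that the items being split are labelled candidate edges (gene pairs) and that folds are drawn by a seeded random shuffle without stratification. The text does not specify either detail, and the function name `k_fold_indices` is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_items: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Fold sizes differ by at most one item when n_items is not a multiple of k.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must satisfy 2 <= k <= n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        yield train, test


# Toy usage: 25 hypothetical candidate edges split into 10 folds.
for fold_id, (train, test) in enumerate(k_fold_indices(25, k=10, seed=1)):
    print(fold_id, len(train), len(test))
```

For a class-imbalanced network such as S.cerevisiae, a stratified variant that keeps the positive-to-negative ratio similar in every fold may be preferable; that choice is not stated in the text and is left open here.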
The gAE-CRGCN is trained on 9 folds and tested on the remaining fold. The process is repeated 10 times, each time with a different test fold, which helps to assess the performance and generalisation ability of the gAE-CRGCN. Figs. 18, 19, 20 show that the performance of the gAE-CRGCN model is stable within certain intervals.

Table 3 displays the AUROC and AUPRC scores for both existing methods and the method proposed in this paper. As shown in Table 3, the SVM method [40] performs poorly on large-scale biological networks and is unable to learn complex regulatory relationships. The RF method [41] achieves better results by constructing multiple decision tree models to infer biological networks. However, its AUPRC for the S.cerevisiae network is only 0.691, which is lower than that of the GNN method. This indicates that it is difficult for the RF algorithm to reliably infer class-imbalanced networks. VGAE obtains new node feature representations by sampling from the distribution of node feature representations; however, data regeneration from the latent space suffers from KL vanishing [36], resulting in poor metrics. The GRGNN method combines the network skeleton predicted by known regulatory relationships, Pearson coefficients, and the network skeleton predicted by mutual information to obtain the input neighborhood matrix, and GRDGNN [42] additionally uses a multi-order neighborhood graph. GENELink [20] is built on the Graph Attention Network, which incurs greater computational and memory overhead than GCN owing to its graph-based attention mechanisms. GNNLink [23] is a GCN-based interaction encoder that infers the GRN by capturing interdependencies between neighbors in the network.

The time consumption (in seconds) of the graph-network-based methods is shown in Table 4.
Table 4 Time Consumption (second) of Methods Based on Graph Network.

Table 4 shows that the proposed method has the second lowest running time, which means that it achieves better performance at comparatively low computational cost.

Network inference for the E.coli, S.cerevisiae and mDC networks was completed using a Gaussian-kernel Autoencoder with GCN based on causal feature reconstruction (gAE-CRGCN). The GRGNN and GRDGNN methods achieve higher AUROC metrics on the E.coli and S.cerevisiae networks by attaching extra network skeletons and obtaining input neighborhood matrices, but at the cost of additional data requirements. The AUROC of the gAE-CRGCN method on the E.coli network was slightly lower than that of GRGNN; however, its AUPRC was 6% higher than that of GRGNN, and 4.1% and 5.5% higher than those of GNNLink and GENELink, which are the state-of-the-art methods. Similarly, the AUROC of the gAE-CRGCN method on the S.cerevisiae network was slightly lower than that of GRDGNN; however, its AUPRC was 2.8% higher than that of GRDGNN, and 6% and 4% higher than those of GNNLink and GENELink. AUPRC is the more valued metric in GRN inference. On the mDC network, the AUROC of the gAE-CRGCN method was 0.23% and 2% higher than those of GRGNN and GRDGNN, and its AUPRC was 8.6% and 3.5% higher than those of GNNLink and GENELink, achieving the highest metrics. The gAE-CRGCN achieved the highest AUPRC on all three datasets, indicating that the proposed method has better prediction accuracy owing to causal feature reconstruction and the Gaussian-kernel Autoencoder. The gAE-CRGCN method has no additional data requirements and improves the accuracy of node representations through causal reconstruction, so it can generate more accurate predictions for class-imbalanced gene regulatory networks, with improved recall and precision.

Fig. 21 Sub-graph of the E.coli inferred network.

Fig. 22 Sub-graph of the S.cerevisiae inferred network.

Fig. 23 Sub-graph of the mDC inferred network.

Sub-graphs are extracted from the inferred networks and visualised in Figs. 21, 22, 23 to show the details of the inferred networks. It can be seen that different GRNs have different densities of regulatory relationships. Figure 21 shows that a number of gene regulatory relationships in E.coli are dispersed among one another. As shown in Fig. 22a, some genes such as YLR121C, YJR141W, YNL156C, and YGR165W have comparatively many regulatory relationships, and as shown in Fig. 22b, gene YNL167C has the most regulatory relationships.

Overall, the gAE-CRGCN method has higher AUPRC scores, which implies that the model has better precision and is more suitable for inferring the GRN. The gAE-CRGCN method enhances the node aggregation at each order, resulting in more detailed and comprehensive node feature representations. This is achieved by combining the fusion features extracted by a Gaussian-kernel Autoencoder. The enhanced node feature representations lead to higher similarity in predicting link priority connections, ultimately improving the accuracy of network inference. The experiments confirm that the method proposed in this paper is effective.
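For readers unfamiliar with the node aggregation step that the GCN baseline performs, the sketch below implements one layer of the widely used symmetric-normalised propagation rule, $H' = \mathrm{ReLU}(\hat{D}^{-1/2}(A+I)\hat{D}^{-1/2} H W)$. This is an assumption about the baseline layer only. It does not implement the causal feature reconstruction of CRGCN, which is defined elsewhere in the paper. The function names, the toy graph, and the weight matrix are all hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Return D^-1/2 (A + I) D^-1/2 for an adjacency matrix with zero diagonal
    and non-negative entries."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during aggregation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj_norm: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One aggregation step: neighbour averaging, linear map, then ReLU."""
    projected = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in projected]


# Toy graph: gene 0 regulates gene 1 (treated as undirected for aggregation).
toy_adj = [[0.0, 1.0], [1.0, 0.0]]
toy_features = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical node features
toy_weights = [[1.0, 0.0], [0.0, 1.0]]    # identity weights for readability

hidden = gcn_layer(normalized_adjacency(toy_adj), toy_features, toy_weights)
print(hidden)
```

Stacking two such layers gives each node access to its second-order neighbours, which corresponds to the "each order" of aggregation mentioned in the text.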
