CGMega: explainable graph neural network framework with attention mechanisms for cancer gene module dissection

Overview of CGMega frameworkWe proposed a new framework, CGMega, for studying cancer gene modules based on graph attention and graph interpretation technologies (Fig. 1a). CGMega leverages a combination of multi-omics data across genomics, epigenomics, protein–protein interactions (PPIs), and especially 3D genome architecture. In CGMega, we first removed the potential effects of structural variation on Hi-C contact map and normalized it with iterative correction and eigenvector decomposition (ICE)41, and calculated the spatial distances between genes (see “Methods”). Then, singular value decomposition (SVD) was applied on the normalized matrix to get condensed Hi-C features (see “Methods”). Simultaneously, we calculated the SNV and CNV frequencies for each gene and calculated epigenetic densities within each gene promoter (see “Methods”). Then, we constructed a multi-omics information combination graph, in which the nodes represent genes and the edges are obtained from PPIs. The features of nodes are the concatenation of condensed Hi-C features, SNV and CNV frequencies, and epigenetic densities. Notably, based on the detailed evaluation in the following section, we deployed Hi-C data as node features instead of edge features. Further, we constructed a transformer-based GAT model to predict cancer genes in a semi-supervised manner (see “Methods”). Finally, CGMega implemented the model-agnostic approach GNNExplainer to detect cancer gene modules. GNNExplainer utilizes a masking approach to detect the compact subgraph structure and a small subset of node features that have a crucial role in GNN prediction40. Applying GNNExplainer, we identified the subset of genes that are most influential for the prediction of cancer genes, together forming the cancer gene modules (Fig. 1b). These genes are one-hop or two-hop neighbors to cancer genes, and GNNExplainer also provides important features for each gene. To examine the robustness of interpretation results in CGMega, we repeated GNNExplainer and obtained high consistent cancer gene modules (Supplementary Fig. S1a, S1b). In sum, the output of CGMega is the probability of each gene being a cancer gene and their influential genes interpreted from GATs. Gene-specific features are also assigned to these genes and together formed the gene modules.Fig. 1: Overview of CGMega framework.a CGMega pipeline. First, condensed Hi-C features were obtained by removing SV effects, ICE normalization, and SVD on raw Hi-C contact map step-by-step. Simultaneously, we calculated omics features, including SNVs and CNVs frequencies for each gene as well as epigenetic densities within each gene promoter. To combine multi-omics information, we created a graph, where nodes represent genes and edges are derived from CPDB PPIs. Node features were the concatenations of condensed Hi-C features and omics features. Next, a cancer gene prediction model consisting of two Graph Transformer layers, two Layer-Norm layers, one max-pooling layer, and two fully connected linear layers was constructed. Finally, the model-agnostic approach GNNExplainer was employed to detect cancer gene modules. b GNNExplainer interpretation. Given a gene (represented as a node in a graph), GNNExplainer identified a subgraph G that contains the relevant features crucial for the prediction. G is a connected subgraph where the gene nodes cover at most a two-hop region with no more than 20 edges.CGMega is effective in cancer gene predictionCGMega identified gene modules based on the accurate prediction of cancer genes, and we thus tested the performance of CGMega in cancer gene prediction on the MCF7 cell line (see “Methods”), a human breast cancer cell line with high-confidence multi-omics data. CGMega achieved 0.9140 AUPRC (Fig. 2a, Source Data file) and 0.9630 area under the receiver operating characteristic curve (AUROC) (Supplementary Fig. S2a). To demonstrate the advances of CGMega in cancer genes prediction task, we compared CGMega with various methods (see “Methods”), encompassing both universal models GCN, GAT, MLP, SVM, and as well as specific models designed for cancer gene classification, including MTGCN42, EMOGI25, and MODIG43. Most of the models were evaluated using the same input features, while SVM and MLP had additional PPI features generated by node2vec (N2V). By computing AUPRC, AUROC, accuracy (ACC), and F1 score, CGMega outperformed all other methods across these four metrics (Fig. 2b).Fig. 2: CGMega performance in cancer gene prediction task.a AUPRC on breast cancer cell line MCF7. b Methods comparison on MCF7 cell line. N2V represents node2vector. MLP and SVM were tested with and without (w/o) PPIs. c AUPRCs of non-pretrained CGMega and pretrained CGMega on datasets with different numbers of labeled genes. d Ablation experiments. AUPRCs of CGMega with random or without omics features and Hi-C features. e AUPRC of CGMega on Hi-C data under different resolutions and down-sampling rates. Source data are provided as a Source Data file.Accurately predicting cancer genes often necessitates a substantial number of labeled genes, a resource that is often limited in rare cancer research scenarios. Thus, it becomes crucial to leverage the knowledge acquired from well-studied cancer genes and apply it to the context of rare cancers, thereby enhancing their prediction. To this end, we adopted a two-step approach with CGMega. In the initial stage, CGMega was pretrained on the MCF7 cell line, allowing it to grasp fundamental patterns and characteristics prevalent in cancer genes. Following pretraining, we performed fine-tuning on other cancers, enabling CGMega to adapt and fine-tune its learned representations to the specific context of those rare cancers.To assess the performance of transfer learning, we conducted tests on the non-pretrained CGMega (trained from scratch) and the pretrained CGMega using all labeled genes (597 positives and 1839 negatives) on the K562 cell line. The pretrained CGMega demonstrated superior accuracy and F1 score, while also exhibiting comparable AUPRC and AUROC values (Supplementary Fig. S2b). Subsequently, we evaluated the non-pretrained CGMega and pretrained CGMega using downsampled labeled genes. Here, we also tested CGMega models without Hi-C features. As the number of labeled genes decreased, the performance of non-pretrained CGMega dropped sharply while the pretrained CGMega continued to have high performance (Fig. 2c, Source Data file). Moreover, the Hi-C features exhibited powerful improvements in prediction especially when the labeled genes were less than 200. Further, we compared the performance of few-shot transfer learning in CGMega with other methods, and pretrained CGMega had the highest value (Supplementary Fig. S2c).CGMega leverages 15-dimensional gene features including 10-dimensional omics features and 5-dimensional condensed Hi-C features derived from dimensionality reduction of the Hi-C data. We performed ablation experiments by removing or shuffling gene features (Supplementary Fig. S2d), and we observed that both omics and Hi-C features made contributions for model prediction (Fig. 2d). Moreover, CGMega with 5-dimensional condensed Hi-C-only features was not as good as CGMega with 10-dimensional omics features, suggesting that the structural features may have a compensatory effect on the quality of omics features.We tested CGMega on Hi-C data with different resolutions and read depths, CGMega maintained its stable performance using Hi-C data with resolutions from 5-kb to 25-kb, and the AUPRC slightly dropped while the Hi-C read depth decreased (Fig. 2e, Source Data file), demonstrating that our approach is robust in its adaptation to scenarios with lower data quality and holds promise for a wide range of application settings. We also tested CGMega on datasets with different ratios of positive to negative. CGMega can still achieve stable well performance with extreme ratios (Supplementary Fig. S2e) and is better compared to other methods (Supplementary Fig. S2f). In addition, CGMega is effective for majority of the well-known PPI datasets (Supplementary Fig. S2g) and achieved better than most other methods (Supplementary Fig. S2h). We observed that the relatively poor performance of CGMega on the Multinet PPI dataset was due to its severe sparsity, and the AUPRC increased from 0.8062 (Multinet) to 0.8991 (condensed Multinet, See “Methods”). Furthermore, CGMega was also evaluated on an external dataset built with entirely new data for MCF7 cell line and achieved stable performance (Supplementary Tables 1 and 2).CGMega provides a new strategy for multi-omics data integrationThe outperformance of CGMega benefits from the effective integration of multi-omics information, including genome, epigenome, PPIs, and especially the 3D genome architecture. Hi-C is currently the most widely used assay for investigating the 3D genome organization. However, measuring Hi-C data together with other omics data is often limited by its noise, sparsity, and variable resolution. To obtain the best performance on the cancer gene prediction task, we tested integration approaches with different Hi-C data embeddings (Fig. 3a).Fig. 3: Performance evaluation of multi-omics data integration approaches.a Design of multi-omics data integration. Left: Hi-C data were regarded as graph edges with two types. In the unweighted graph, edges were determined by the existence of Hi-C contacts or not. In the weighted graph, edge weights were interaction values calculated as log10 or tenth root of Hi-C contact maps. In either graph, node features were omics features. To combine Hi-C with PPIs, we performed GAT-based networks on three graph inputs, including (1) GAT on Hi-C graph alone, (2) two GATs on Hi-C and PPI, respectively, and then combined embeddings, and (3) first combined Hi-C graph and PPIs then performed GAT. Right: Hi-C data were regarded as node features. Raw and normalized Hi-C data as well as different dimensionality reduction methods were tested. Then, condensed Hi-C data were concatenated with omics features, and complete node features were formed. Graph edges were determined by PPIs. For either graph structure (left or right), a GAT-based cancer gene prediction model was made, as described in Fig. 1. b AUPRC of cancer gene prediction model with Hi-C input as edge features. c AUPRC of cancer gene prediction model with Hi-C input as node features. d AUPRC of cancer gene prediction model with raw and normalized Hi-C.Regarding Hi-C data as gene linkages. Molecular networks are important issues in biological studies2,11,29. For example, EMOGI has demonstrated the utility of PPIs in cancer gene prediction25. Hi-C data measure the interactions that connect different genomic loci and thus enables the construction of gene interaction networks. Using Hi-C contact maps, we constructed unweighted and weighted networks, respectively. In the unweighted network, interactions between genes were determined by the existence or nonexistence of contacts. For weighted networks, interaction values were log10 or tenth root of contact strength. Then, epigenetic information was assigned as gene features. Finally, we combined gene interaction network with the PPI network and constructed three types of graphs: a Hi-C only graph, a Hi-C/PPI independent graph, and a Hi-C/PPI combined graph. In these, nodes represent genes and the node features are epigenetic information. We trained GAT-based neural networks on these graphs. Among these methods, the Hi-C-only graph was ineffective for predicting cancer genes (AUPRC < 0.5). The Hi-C/PPI independent graph exhibits only a marginal improvement over the PPI-only strategy. It is solely when Hi-C is combined with PPI that a modest increase, of roughly half a point, is observed in the two-edge construction methods (Fig. 3b). This result does not offer compelling support for the inclusion of Hi-C as graph structure information within the model.Regarding Hi-C data as gene features. Hi-C data are intuitively used for measuring gene interactions. However, due to the noise and sparsity of Hi-C data, gene interaction networks based on Hi-C data tend to be incomplete and flawed. For this reason, we tested different methods of obtaining condensed Hi-C features, including Node2Vec, SVD, locally linear embedding (LLE), isometric feature mapping (ISOMAP), non-negative matrix factorization (NMF), and t-SNE. The condensed Hi-C features were concatenated with epigenetic information as gene features. PPI networks were still used to measure the interactions between genes. We also trained GAT-based neural networks on these graphs, and the situation improved significantly. Generally, incorporating Hi-C features using dimensionality reduction methods improved the prediction of cancer genes. The best-performing method, SVD, achieved an AUPRC of 0.9140, while Node2Vec, NMF, and t-SNE also demonstrated promising results (Fig. 3c). In addition, we compared the impact of different dimensions of condensed Hi-C features for model prediction (Fig. 3d). Combining the four metrics, all methods with the Hi-C feature received a performance improvement compared to those without the Hi-C feature (Supplementary Fig. S3a). SVD-based reduction of Hi-C data to a condensed five-dimensional feature set was found to be the optimal solution based on both results.Taken together, by systematically comparing different integration approaches with Hi-C data embedding, we showed that, in cancer gene prediction task, using Hi-C latent features as gene features outperforms measuring Hi-C data as the gene interactions directly. SVD is an effective dimensionality reduction method for combining Hi-C data with other omics data.Gene modules with multi-omics features in human breast cancer cell lineCGMega detects gene modules based on a model-agnostic neural network interpretation approach (Fig. 1b), and these gene modules consist of two parts: i) a core subgraph consisting of the most influential pairwise relationships for the prediction of cancer gene, and ii) 15-dimensional importance scores that quantify the contributions of each gene feature to cancer gene prediction. We applied CGMega to the human breast cancer MCF7 cell line and examined the modules of 358 known cancer genes. These cancer genes were not randomly scattered throughout gene modules; they tended to be co-located in the same modules (Supplementary Fig. S4a). This is consistent with previous reported as so-called disease modules8. Among these gene modules, TP53 showed the highest enrichment and participated in 139 cancer gene modules, followed by ESR1 (63 participations) and AKT1 (61 participations) (Fig. 4a). In addition to these well-known cancer genes, we observed another 12 highly module-participating genes such as XPO1, NCOR2, and PPM1A. These genes may be the collaborators of well-known cancer genes. We also examined the structural features of gene modules with respect to their graphical metrics, including transitivity, clustering coefficients, degree centrality, and betweenness centrality, and we found that the topological structure of cancer gene modules were significantly more consistent than that of non-cancer gene modules (P < 2.47e-5, paired t test) (Supplementary Fig. S4b).Fig. 4: Gene modules in breast cancer cell line.a Scatter plot shows the gene participation in cancer gene modules. In gene modules of 358 well-known breast cancer genes, 22 known cancer genes (blue dots) and another 12 genes (red dots) which were not known as breast cancer genes were highly involved (participated in over 20 cancer gene modules). Gray dots donated known cancer genes which were not highly involved in cancer gene modules. b In total, 347 positive cancer genes of breast cancer were generally divided into five clusters (by K-means clustering) based on feature importance scores. c An example for illustrating gene representation features (RFs). For a given gene, if a feature is assigned with an importance score (calculated by GNNExplainer) 10 times higher than the minimum score, it will be referred to as the RF of this gene. d Illustrations of BRCA1 and BRCA2 gene modules. e Western Blot analysis after 24 h, 48 h, and 72 h treatment. Each experiment was repeated three times independently. f, g Half maximal inhibitory concentration (IC50) value of olaparib treatment (f) and olaparib/RKI-1447 combination treatment (g) after 24 hr. h The inhibition rate of olaparib combined with RKI-1447 is significantly higher than that of olaparib alone after 24 h treatment. Paired t test of two-sided was used to analyze the mean inhibition rates from two groups and the P value = 0.0023. f–h Data are presented as mean values +/− SEM and n = 3 biologically independent experiments were carried out for each to derive statistics. Source data are provided as a Source Data file.Beyond the topology of gene modules, we next investigated the feature importance scores. CGMega utilized 15-dimensional multi-omics features as inputs and generated an importance score for each feature. It is necessary to examine whether the importance scores were just related to the corresponding input. We thus examined the distributions of these two values, and found that feature importance is irrelevant to input data (Pearson correlation coefficients r < 0.26, Supplementary Fig. S4c), suggesting that the importance score is the interpretation of neural networks instead of simple determination due to its input. Besides, the feature importance scores were not evenly distributed; instead, one or several features were dominant (Supplementary Fig. S4d). The feature importance scores measure the joint effect of multiple factors and help guide cancer gene classification (Fig. 4b, Source Data file). Many cancer-driven genes (class-5) were as reported to be dominated by genetic mutations. For genes in other classes, the five Hi-C features, condensed by SVD, have provided extended supplements based on their participation in each cluster: 1st, 4th and 5th Hi-C features showed joint effect with other regulatory factors on cancer driver genes (cluster-3), while 2nd and 3rd Hi-C features showed joint effect with genetic mutations (cluster-1). Some previous studies have verified these observations: (i) Gene MYB (in cluster-1) was reported to form fusion genes with NFIB due to the recurrent chromosomal translocation, which serves as a clear example of genotypic–phenotypic correlation for triple-negative breast cancer44. (ii) Dysregulation of gene ADIPOR1 (in cluster-3) is widely observed in many cancers, but its genomic alteration frequencies are low45. This is consistent with CGMega that attributes HiC-1, HiC-5, chromatin accessibility and active histone modification H3K4me3 to ADIPOR1. (iii) Similar to ADIPOR1, gene ALOX12 (in cluster-3) were significantly upregulated in multiple breast cancer cell lines, which protect breast cancer cell from chemotherapy-induced growth arrest and apoptosis46,47, suggesting the importance of transcription regulation to ALOX12. (iv) Despite these isolated evidences, we collected RNA-seq data of breast cancers from TCGA project and identified differentially expressed genes (DEGs). The proportion of DEGs is the highest in cluster-3 (Supplementary Fig. S4e). Based on CGMega prediction, Hi-C together with other active regulatory elements have joint effect on these genes.Based on the feature importance score, we proposed the representative features (RFs) as features that have top-ranked importance scores (Fig. 4c, see “Methods” for details). For example, the RF of the gene TP53 is SNV while gene PIK3R1’s RF is Hi-C. Generally, 1158 genes had only one RF and 149 genes had multiple RFs (Supplementary Fig. S4f). We next focus on the gene modules of BRCA1 and BRCA2, which are the most commonly encountered genes in breast cancer. As previously reported, these two cancer genes play different roles in the common pathway of genome protection48. We also observed topological differences between their gene modules. In brief, BRCA1, which is a pleiotropic DNA damage response (DDR) protein working in several stages of DDR, was also found to be widely connected with another 20 genes (Fig. 4d). By contrast, BRCA2, as a mediator of the core mechanism of homologous recombination (HR), was connected with other genes via ROCK2, an important gene that directly mediates HR repair49. Based on gene expression data from TCGA project, we found that ROCK2 expression was positively correlated with BRCA2 expression in breast tumor donors while there is no such correlation in normal breast tissue (Supplementary Fig. S4g). The co-expression of BRCA2 and ROCK2 in breast cancer suggest the joint effect in tumorigenesis, which may guide the effect enhancement of BRCA2 inhibitors on tumor cells. To test this hypothesis, we treated MCF7 cells with BRCA2 inhibitor olaparib50,51 and with both BRCA2 inhibitor olaparib and ROCK2 inhibitor RKI-144752. Western Blot results have demonstrated the protein inhibition after 24-, 48-, and 72-h treatment (Fig. 4e, Source Data file). Then, we determined the half maximal inhibitory concentration (IC50) of olaparib (Fig. 4f and Supplementary Fig. S5a, Source Data file) and olaparib/RKI-1447 combination (Fig. 4g and Supplementary Fig. S5b, Source Data file). IC50 value of inhibitors combination was lower than that of BRCA2 inhibitor alone. Moreover, the inhibition rates of olaparib combined with RKI-1447 were significantly higher than those of olaparib alone after 24-h treatment (Fig. 4h, P value = 0.0023, paired t test, Source Data file). But it was comparable between two groups after 48 h and 72 h treatment (Supplementary Fig. S5c). These results showed that the combination of BRCA2 and ROCK2 inhibitors was more effective than using BRCA2 inhibitor alone in inhibiting MCF7 tumor cells after 24-h treatment, suggesting a potential strategy for enhancing BRCA2 inhibitor sensitivity. In addition, SNV was the RF for both BRCA1 and BRCA2. We also observed a high-order gene module combined from the BRCA1 gene module and the BRCA2 gene module through three shared genes including TP53, SMAD3, and XPO1 (Supplementary Fig. S5d). Taken together, these indications mean that CGMega is capable of detecting the interpretable and high-order gene modules with multi-omics features.The complex gene module formed by ErbB familyThe ErbB family, including ERBB1 (also known as EGFR), ERBB2 (also known as HER2), ERBB3 (also known as HER3) and ERBB4, plays a central role in the tumorigenesis of many types of solid tumor. The members of the ErbB family are receptor tyrosine kinases (RTKs), which have an analogous structure53. However, their gene modules exhibit heterologous structures and none of these four ErbB genes were hubs in their gene modules (Fig. 5a). In the ERBB1 gene module, PTPRD located at the center and connected ERBB1 and other genes. The ERBB2 and ERBB4 gene modules shared the same center gene DLG2. NRG1 located at the center of the ERBB3 gene module and ERBB4 was also present in this gene module. We examined the representative features of the ErbB family. Hi-C and SNV were major RFs for ERBB2, ERBB3, and ERBB4 (Fig. 5b, Source Data file). The mechanisms of genetic alteration such as SNV in cancer development have been demonstrated previously54,55,56. The Hi-C features uncovered by CGMega suggest new insights regarding the signal and crosstalk between the ErbB family genes in the context of the chromatin structure in tumor progression.Fig. 5: Gene modules of ErbB family.a Illustrations show the gene modules of the ErbB family. Blue dots indicate query genes, namely ERBB1, ERBB2, ERBB3 and ERBB4. Yellow dots indicate genes located at the center of gene modules. Green boxes show the RFs of query genes. b Feature importance scores of the ErbB family. c The high-order gene module formed by the ErbB family gene modules. d Hypothetical model of the high-order gene module formed by the ErbB family gene modules in maintaining protein phosphorylation homeostasis. Details were described in the main text. Source data are provided as a Source Data file.In spite of the difference among the gene modules of the ErbB family, we observed several shared genes connecting the ErbB members and forming a complex module (Fig. 5c). NRG1, PPM1A, and DLG2 were key connectors in this high-order module. Previous studies have demonstrated the importance of these three genes for cancer development. NRG1 is a main physiological ligand to ErbB family and, together with ERBB2 and ERBB3, can form a potent pro-oncogenic heterocomplex57. DLG2 is a member of a family of membrane-associated guanylate kinase (MAGUK), and DLG2 overexpression will affect the level of protein phosphorylation58,59. The protein serine/threonine phosphatase PPM1A is a crucial regulator of cell cycle progression in triple-negative breast cancer60, and PPM1A is also an important factor in protein dephosphorylation61,62,63. By combining these isolated evidence with the high-order gene module, we proposed a hypothetical model of the gene module in maintaining protein phosphorylation homeostasis (Fig. 5d). The NRG1 ligand binds to homo- or hetero-dimers of ErbB proteins, leading to the activation of ErbB-mediated downstream signaling pathways that mediate the activity of serine/threonine (Ser/Thr) protein kinases. Ser/Thr protein kinases and proteins encoded by DLG2 modulate the phosphorylation of Ser/Thr proteins, while PPM1A mediates their dephosphorylation, together maintaining protein phosphorylation homeostasis.Gene module dissection in acute myeloid leukemia patientsWe applied CGMega to acute myeloid leukemia (AML), a myeloid neoplasm that is characterized by differentiation blockade and clonal proliferation of abnormal myeloblasts in the bone marrow64. We collected multi-omics data for eight AML patients from a previous study64. Unlike the case of those cell lines, gene modules are heterogeneous across different patients65, and the clinical course of AML is also highly heterogeneous66. Thus, we studied both patient-common and patient-specific cancer gene modules (Fig. 6a). First, we used CGMega to predict cancer genes and identified 2746 new genes in total (Supplementary Table 3). Among these, 396 were predicted to be cancer genes in all AML patients (referred as “candidate AML genes”, Supplementary Table 4). We next investigated gene functions and found that these candidate AML genes contained many essential genes and TFs, and the pan-cancer genes were significantly enriched in these 396 genes (P = 1.32e-22, hypergeometric test) (Fig. 6b). Moreover, Gene Ontology (GO) analysis showed that candidate AML genes together with known AML genes participated in 15 hematopoietic and blood diseases biology processes such as leukocyte migration and T-cell receptor signaling pathway (Fig. 6c). This enrichment could not be retrieved using known AML genes alone.Fig. 6: Gene modules in AML patients.a Application of CGMega on AML. Multi-omics data of eight AML patients were obtained from a previous study. b In total, 396 candidate AML genes contained essential genes, transcription factors, and pan-cancer genes. Gray boxes show genes in two categories. c Gene ontology (GO) enrichments. We performed GO analysis on 597 known AML genes (first line), and on 993 genes (597 known AML genes and 396 candidate AML genes), respectively. GO analysis was conducted using DAVID and GO terms with p value lower than 1e-5 were shown. d Illustration of DLX4 gene module. Blue dots indicate known AML genes, while yellow dots indicate candidate AML genes. e Illustration of KLF4 gene modules in separate patients. SETD7 was located at the center in Patient 168 while it was just a participant in Patient 027 and Patient 270. In other patients, SETD7 did not appear in KLF4 gene modules. f We totally identified 142 neighbor-cancer gene pairs, these gene pairs were conserved in over four AML samples. Gene pairs in red box were detected in all eight AML samples and gene pairs in yellow box were detected in seven AML samples. Source data are provided as a Source Data file.We then examined the AML gene modules. As with MCF7 cell line, cancer genes were also enriched in same AML gene module. This enrichment was observed not only in known AML genes but also in candidate AML genes (Supplementary Fig. S6a). In addition, 10.5% of these pairwise relationships in cancer gene modules were conserved over half of total patients. For example, in the DLX4 gene module, connections among DLX4, the known cancer gene ABL1, and four candidate AML genes (SP1, FYN, GRB2, and SMAD2) co-occurred in multiple patients (Fig. 6d). Beyond the enrichment and co-occurrence of AML gene modules, we observed that some candidate AML genes were shared by dozens of known AML gene modules (Supplementary Table 5). For example, ESR1 was predicted to be candidate AML gene and it existed in modules of various known AML genes, such as EGFR, PIK3CA, and FOS (Supplementary Fig. S6b). This hub location implies a high-order pattern of cancer gene modules. A total of 12 known driver genes and 5 candidate AML genes were identified as hub genes, which participate in more than 20 cancer gene modules (Supplementary Fig. S6c). Among these genes, EGFR, MYC, TP53, MAPK1, and PIK3R1 were well-known genes in cancer pathway67,68,69,70,71. EP300, CREBBP, and STAT3 were used as clinical testing gene panel for myeloid tumors72,73,74. The detection of these hub genes in all AML samples demonstrates the reliability of CGMega interpretation, and suggests the potential usage as AML gene panels of those five new hub genes, including ESR1, HDAC1, FYN, LYN, and GRB2.The AML patients used in this study come from seven different mutation types and CGMega achieved good performance (AUPRC = 0.8528 on average). We next identified patient-specific candidate AML genes for each patient (Supplementary Table 6). Examining the modules of these genes, we also observed patient-specific patterns. For example, in the KLF4 gene module drawn from patient 168, the candidate AML gene SETD7 connected KLF4 with other known AML genes including TP53, STAT3, DNMT1, PCNA, and MDM2. However, this two-hop gene module did not appear in other patients (Fig. 6e). Moreover, we found that the two-hop pattern was widespread in AML samples, covering about 1/3 of all AML gene modules (Supplementary Fig. S6d). The key neighbor genes, which formed neighbor-cancer gene pair in two-hop module (such as ROCK2-BRCA2 pair in BRCA2 gene module, Fig. 4d), provide new insights to understand tumorigenesis and drug combination strategy. We totally identified 142 such gene pairs, which were conserved in over four samples (half of the total AML samples), and found several pairs were highly conserved in all AML samples (Fig. 6f). We then performed GO analysis using both the cancer genes and key neighbor genes in these 142 pairs, and found that, different from cancer genes, the key neighbor genes were significantly enriched in signaling processes such as signal transduction and signaling pathway (Supplementary Fig. S6e), suggesting that genes participating in signal processes may be the regulator or collaborator of known cancer genes.

Hot Topics

Related Articles