Unraveling druggable cancer-driving proteins and targeted drugs using artificial intelligence and multi-omics analyses

Machine learning prediction model

The current study introduces innovative classification models designed to predict new druggable proteins that drive cancer. These predictions are based on three sets of protein sequence descriptors, namely amino acid composition (AC), di-amino acid composition (DC), and tri-amino acid composition (TC), calculated using Rcpi. These descriptors were chosen for their proven ability to capture information about protein sequences that is critical for predicting druggability [88].

AC represents a protein’s primary structure through the frequency of each amino acid within the sequence, helping to identify general trends and patterns associated with druggable proteins. DC captures the local interactions between pairs of amino acids, providing insight into the secondary structure and local folding patterns, which are crucial for understanding functional regions and binding sites. TC considers interactions between triplets of amino acids, offering a more detailed view of the amino acid sequence, which is essential for accurately predicting protein interactions with drugs and other ligands [30,89].

Focusing on these features ensures computational efficiency and reduces the risk of overfitting, which can occur with an excessive number of features. Our comprehensive benchmarking demonstrated that these descriptors consistently provided robust performance across various machine learning classifiers. While the inclusion of additional features, such as secondary structure elements or solvent accessibility, might offer incremental benefits, the chosen descriptors strike a balance between model performance, computational feasibility, and biological relevance. This balance allows for effective and interpretable predictions while maintaining the practicality of the computational framework.
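As a rough illustration of what the three descriptor sets contain, the sketch below computes AC, DC, and TC as normalized k-mer frequencies in plain Python. It is not the Rcpi implementation used in the study: the function name `composition` and the toy sequence are hypothetical, and Rcpi's exact normalization may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence: str, k: int) -> dict[str, float]:
    """Return normalized frequencies of all 20**k k-mers in `sequence`.

    k=1 corresponds to AC (20 values), k=2 to DC (400), k=3 to TC (8,000).
    Windows containing non-standard residues are skipped.
    """
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence (hypothetical, for illustration only).
seq = "MHMESSSHTDDHK"
ac, dc, tc = (composition(seq, k) for k in (1, 2, 3))
print(len(ac), len(dc), len(tc))
```

The vector lengths (20, 400, and 8,000) match the descriptor counts quoted in this section, which is why feature selection or PCA becomes relevant once TC is used.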
Furthermore, the identified amino acid sequence patterns will inform future studies on protein properties.

Subsequently, we used Jupyter notebooks built on Python and scikit-learn to construct 13 types of ML classifiers (GNB, KNN, LDA, SVM linear, SVM, LR, MLP, DT, RF, XGB, GB, AdaB, and Bagging), along with five types of feature selection methods with various parameters (Fig. 1). All scripts used the mean AUROC values from threefold cross-validation to quantify classification performance. We tested models using 20, 100, 200, and 400 features [30].

Figure 2 illustrates the AUROC values for classifiers using only 20 features: AC descriptors without feature selection, DC descriptors with LinearSVC feature selection (DC-LinearSVC20), PCA features from DC (DC-PCAn20), TC descriptors selected by SelectPercentile(f_classif, percentile = 0.25) (TC-Percn20), and TC descriptors selected with LinearSVC (TC-LinearSVC20). Notably, using only the 20 AC descriptors with SVM yielded an AUROC of 0.926. The best performance was achieved using SVM (RBF) with 20 PCA components from 400 DC descriptors, resulting in an AUROC of 0.958. Additional results can be found in Supplementary Table 6.

Figure 2. Mean AUROC values for classifiers obtained with 20 selected features (threefold CV).
GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, support vector machine with a linear kernel; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; AC, amino acid composition; DC, di-amino acid composition; TC, tri-amino acid composition.

Figure 3 displays AUROC values for classifiers using 100 features: a PCA transformation of the 400 DC descriptors (DC-PCAn100), TC descriptors selected with SelectPercentile(f_classif, percentile = 1.25) (TC-Perc1.25), TC descriptors selected with LinearSVC (TC-LinearSVC100), and 100 features selected by LinearSVC from 200 PCA components of the 8,000 TC descriptors (TC-PCA200LinearSVC100). Increasing the number of features from 20 to 100 (a fivefold increase) improved the AUROC to 0.976 using the same SVM (RBF) with TC-PCA200LinearSVC100 [30].

Figure 3. Mean AUROC values for classifiers based on 100 selected features (threefold CV). GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, support vector machine with a linear kernel; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; DC, di-amino acid composition; TC, tri-amino acid composition.

Figure 4 shows the AUROC values for classifiers using 200 selected features (double the previous 100): a PCA transformation of the 400 DC descriptors (DC-PCAn200), DC descriptors selected with SelectPercentile (DC-Perc50), 200 PCA components of the 8,000 TC descriptors (TC-PCAn200), TC descriptors selected with SelectPercentile(f_classif, percentile = 2.5) (TC-Perc2.5), and TC descriptors selected with LinearSVC (TC-LinearSVC200). The combination of PCA and SVM for DC-PCAn200 resulted in the best classifier, achieving an AUROC of 0.981 (Supplementary Table 6).
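The performance metric quoted throughout, the mean AUROC over three cross-validation folds, can be sketched in plain Python as follows. This is an illustrative re-implementation rather than the scikit-learn code used in the study; the function `auroc` and the fold data are hypothetical.

```python
from statistics import mean


def auroc(labels: list[int], scores: list[float]) -> float:
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out predictions for three folds:
# (true labels, classifier scores).
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.7, 0.6, 0.4, 0.2]),
    ([0, 1, 1, 0], [0.5, 0.5, 0.9, 0.1]),
]
fold_aurocs = [auroc(y, s) for y, s in folds]
mean_auroc = mean(fold_aurocs)
print(fold_aurocs, mean_auroc)
```

Averaging per-fold values, rather than pooling all predictions, is what allows a spread such as ± 0.003 to be reported alongside the mean.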
Further, using all 400 DC descriptors with SVM, the mean AUROC reached 0.982 ± 0.0021. Additionally, with 8,000 pure TC descriptors and SVM linear, the mean AUROC was 0.992 ± 0.0028. However, a classification model should avoid having more input features than data instances. We also sought to prioritize pure descriptors over PCA transformations. As a compromise, we selected the following as the best model for subsequent protein-related cancer predictions: 200 TC descriptors selected with LinearSVC, combined with a non-linear SVM classifier, giving an AUROC of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The list of the 200 selected features is available in the Jupyter notebooks.

Figure 4. Mean AUROC of classifiers based on 200 input features (threefold CV). GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, support vector machine with a linear kernel; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; DC, di-amino acid composition; TC, tri-amino acid composition.

Selected features analysis

The following is the list of selected features for the best model: NRA, QRA, INA, MCA, YEA, THA, CSA, VYA, KNR, WDR, TER, PQR, YGR, EHR, LIR, VSR, ERN, MDN, SDN, LHN, YIN, FFN, RSN, QSN, FWN, ACD, WCD, MED, CHD, SHD, MLD, SMD, WPD, SSD, HTD, DWD, VYD, KNC, NDC, IHC, VHC, GYC, MCE, NHE, ALE, HME, LPE, AWE, EYE, QYE, GVE, FVE, SAQ, FNQ, MDQ, PCQ, WEQ, RQQ, NGQ, HLQ, RMQ, DFQ, GPQ, DSQ, YSQ, AWQ, RVQ, QRG, HGG, TGG, KLG, NKG, FPG, SSG, RTG, PTG, IVG, CDH, FDH, PDH, TQH, KHH, FHH, IFH, NSH, WSH, FWH, WRI, NDI, EDI, FEI, WEI, WQI, MGI, PMI, AAL, EKL, IKL, FKL, GPL, ESL, DVL, MVL, VVL, GNK, HNK, HDK, HCK, EQK, DHK, QLK, EKK, SMK, FFK, QSK, EWK, AVK, WRM, WNM, REM, WQM, SHM, LLM, SMM, NFM, TSM, RWM, GYM, KYM, VYM, HVM, IVM, LDF, YQF, NGF, HGF, FWF, FAP, FNP, PEP, SQP, QGP, VHP, PLP, HKP, NPP, QPP,
STP, TTP, KWP, YWP, SRS, HDS, WDS, HCS, LES, DHS, SHS, PSS, SSS, LWS, LAT, DRT, GRT, IRT, INT, VQT, NLT, CLT, KKT, YTT, QWT, FYT, KCW, QGW, VGW, MIW, IKW, RFW, DFW, HVW, KVW, NRY, CHY, DMY, YPY, YAV, SRV, ENV, HNV, GEV, QGV, HGV, TGV, WHV, LLV, IMV, DSV, TSV, QYV.

The normalized importance for 10% of the selected features is presented in Table 1. The most important amino acid patterns for this classification are HME, NSH, SSS, HTD, DHK, ERN, NDI, DRT, VYD, FFN, SHM, NDC, RFW, WRI, GYC, MGI, PEP, GVE, DSQ, and LLV. The HME pattern is the most important feature for druggable proteins, while the NSH pattern has only half the importance of HME.

Table 1. Feature importance for 10% of the selected features of the best classification model.

In Table 2, the frequencies of the amino acids in all selected features demonstrate the importance of H (histidine), S (serine), D (aspartic acid), and Q (glutamine) in classifying druggable proteins. Additionally, H and S appear in the five most important tri-amino acid patterns.
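The kind of residue tally behind Table 2 can be sketched with a simple counter. The example below counts residues only in the 20 most important patterns named above, so its numbers are not those of Table 2, which covers all 200 selected features; the variable names are hypothetical.

```python
from collections import Counter

# The 20 most important tri-amino acid patterns, as listed in the text.
TOP_PATTERNS = [
    "HME", "NSH", "SSS", "HTD", "DHK", "ERN", "NDI", "DRT", "VYD", "FFN",
    "SHM", "NDC", "RFW", "WRI", "GYC", "MGI", "PEP", "GVE", "DSQ", "LLV",
]

# Concatenate the patterns and count each one-letter residue code.
residue_counts = Counter("".join(TOP_PATTERNS))
for residue, count in residue_counts.most_common(5):
    print(residue, count)
```

Running the same tally over the full 200-feature list in the Jupyter notebooks would be the way to check the frequencies reported in Table 2.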
The biological significance of the amino acids in these patterns is outlined below: (a) HME (histidine–methionine–glutamic acid): histidine is essential for protein synthesis and enzyme catalysis, methionine is the initiator amino acid for protein translation, and glutamic acid is involved in neurotransmission and protein folding; (b) NSH (asparagine–serine–histidine): asparagine is crucial for glycoprotein synthesis, and serine is involved in phosphorylation and protein structure; (c) SSS (serine–serine–serine): serine is essential for cell signaling, protein synthesis, and metabolism; (d) HTD (histidine–threonine–aspartic acid): threonine is important for protein stability and immune function, and aspartic acid contributes to protein structure and function; and (e) DHK (aspartic acid–histidine–lysine): lysine is essential for protein synthesis and collagen formation [5,90,91].

Table 2. Frequencies of the amino acids in the selected tri-amino acid groups for the best classification model.

Cancer-driving proteins

To predict their druggability with the best model, we converted the sequences of 2,339 cancer-driving proteins into the 200 selected TC descriptors. As a result, 2,080 (88.9%) of these cancer-driving proteins were predicted to have druggable activity (Fig. 5A and Supplementary Table 5). For validation, we compared the ChEMBL evidence scores of proteins involved in clinical trials [54] among four groups: the positive set of druggable proteins (mean score = 0.712), druggable cancer-driving proteins (class 1, mean score = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean score = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean score = 0.414). As expected, after Bonferroni correction there was no significant difference between the positive set and druggable cancer-driving proteins, nor between the negative set and ‘hard-to-drug’ cancer-driving proteins.
Interestingly, the same analysis did reveal a significant difference between druggable cancer-driving proteins (class 1) and ‘hard-to-drug’ cancer-driving proteins (class 0) (P < 0.001) (Fig. 5B). This indicates that druggable cancer-driving proteins are markedly better validated as potential targets than ‘hard-to-drug’ proteins, underscoring the relevance and accuracy of the classification method used. These findings validate the effectiveness of the prediction model in distinguishing between truly druggable targets and those that are more challenging to target therapeutically, highlighting its potential utility in the drug discovery process.

Figure 5. Target-disease evidence score for predicted druggable cancer-driving proteins. (A) A bean plot illustrating the distribution of prediction scores (mean = 0.796) for 2,339 (100%) cancer-driving proteins. Of these, 2,080 (88.9%) proteins were classified as druggable (class 1), while 259 (11.1%) were predicted as ‘hard-to-drug’ (class 0). (B) Bean plots of the distribution of ChEMBL evidence scores (https://www.ebi.ac.uk/chembl) [53], which represent the involvement of proteins in clinical trials, for four categories: the positive set of druggable proteins (mean = 0.712), druggable cancer-driving proteins (class 1, mean = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean = 0.414). (C) A heat map displaying druggable cancer-driving proteins in clinical trials with ChEMBL evidence scores exceeding 0.9. The map also incorporates the target-disease evidence scores from ten bioinformatic tools: Open Targets Genetics (https://genetics.opentargets.org) [55], ClinVar (germline) and ClinVar (somatic) (https://www.ncbi.nlm.nih.gov/clinvar/) [56,57],
Genomics England PanelApp (https://panelapp.genomicsengland.co.uk) [58], Cancer Gene Census (https://cancer.sanger.ac.uk/census) [59], IntOGen (https://www.intogen.org) [60], Cancer Biomarkers (http://www.cancergenomeinterpreter.org) [61], SLAPenrich (https://saezlab.github.io/SLAPenrich/) [62], Reactome (https://reactome.org) [63], and IMPC (http://www.sanger.ac.uk/resources/databases/phenodigm) [64]. (D) A heat map of druggable cancer-driving proteins not yet participating in clinical trials, with ChEMBL evidence scores equal to 0. It incorporates target-disease evidence scores from the same bioinformatic tools. (E) Box plots ranking the bioinformatic tools by their mean target-disease scores. This analysis focuses on druggable cancer-driving proteins that have not been part of clinical trials and have ChEMBL scores of 0.

Following the prediction and validation of the 2,080 druggable cancer-driving proteins, we extracted the target-disease evidence scores from the Open Targets Platform. This was done to prioritize the most relevant druggable cancer-driving proteins already involved in late-stage clinical trials (ChEMBL score > 0.9) and those not yet involved in clinical trials (ChEMBL score = 0) [52,53,54]. The target-disease evidence score encompassed data from various bioinformatic tools, including Open Targets Genetics [55], ClinVar (covering germline and somatic variants) [56,57], Genomics England PanelApp [58], Cancer Gene Census [59], IntOGen [60], the Cancer Biomarkers database [61], SLAPenrich [62], the Reactome Knowledgebase [63], and PhenoDigm [64]. This overall score, derived from an integration of these bioinformatic approaches, enabled us to identify proteins strongly associated with cancer traits. Of these, 52 were druggable cancer-driving proteins involved in late-phase clinical trials (Fig. 5C and Supplementary Tables 7 and 8), and 296 were druggable cancer-driving proteins not yet involved in clinical trials (Fig. 5D and Supplementary Tables 7 and 9).
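The two-way split described above (ChEMBL score > 0.9 versus ChEMBL score = 0, among proteins predicted as class 1) amounts to a simple filter. The sketch below uses hypothetical records and field names; it does not query the Open Targets Platform and is not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    symbol: str            # placeholder identifier, not a real gene
    predicted_class: int   # 1 = druggable, 0 = hard-to-drug
    chembl_score: float    # ChEMBL clinical-trial evidence score in [0, 1]


def split_by_trial_evidence(proteins):
    """Return (late-stage, not-in-trials) subsets of predicted druggable proteins."""
    druggable = [p for p in proteins if p.predicted_class == 1]
    late_stage = [p for p in druggable if p.chembl_score > 0.9]
    no_trials = [p for p in druggable if p.chembl_score == 0]
    return late_stage, no_trials


# Hypothetical records for illustration only.
proteins = [
    Protein("P1", 1, 1.0),
    Protein("P2", 1, 0.0),
    Protein("P3", 1, 0.5),
    Protein("P4", 0, 0.95),
]
late_stage, no_trials = split_by_trial_evidence(proteins)
print([p.symbol for p in late_stage], [p.symbol for p in no_trials])
```

Proteins with intermediate scores, such as the hypothetical P3, fall into neither subset, which is consistent with 52 and 296 together accounting for only part of the 2,080 predicted druggable proteins.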
Furthermore, the five bioinformatic approaches yielding the highest target-disease evidence scores for the 296 druggable proteins not yet in clinical trials were Cancer Gene Census (mean = 0.90), SLAPenrich (0.88), Reactome (0.84), Genomics England PanelApp (0.79), and Cancer Biomarkers (0.77) (Fig. 5E).

Drugs involved in late-phase clinical trials

Figure 6 presents an update on phase III and IV clinical trials involving drugs that target cancer-driving proteins, as cataloged by the Open Targets Platform [52]. The Sankey plot in the figure reveals a total of 257 clinical trial events, involving 94 drugs with 38 different mechanisms of action, which target 52 key cancer-driving proteins across 26 types of cancer (Supplementary Table 10). The drugs most frequently involved in these late-phase clinical trials were regorafenib, binimetinib, pazopanib, and sorafenib. The most common mechanisms of action in these trials included FGFR inhibitors, FLT3 inhibitors, MEK inhibitors, and EGFR inhibitors. The cancer-driving proteins most frequently targeted in the trials were GABRB2, MAP2K1, and MAP2K2. Additionally, the cancer types most commonly evaluated in these late-phase clinical trial events were liver cancer, lung cancer, breast cancer, leukemia, and colorectal cancer. This comprehensive therapeutic landscape has enabled us to identify key patterns and trends in cancer treatment research.

Figure 6. Panoramic landscape of the druggable cancer-driving proteins and the drugs currently in phase III and IV clinical trials. The Sankey plot displays the 257 late-stage clinical trial events. These encompass 52 druggable cancer-driving proteins (with ChEMBL evidence scores exceeding 0.9) that are targeted by 94 distinct drugs. These drugs operate through 38 different mechanisms of action and are tested across 26 cancer types. The proteins with the most clinical trial events are GABRB2 (n = 14), MAP2K1 (n = 10), and MAP2K2 (n = 10).
The drugs most frequently involved in trial events are regorafenib (n = 24), binimetinib (n = 10), and sorafenib (n = 10). The mechanisms of action most represented in trials are FGFR inhibitors (n = 37), FLT3 inhibitors (n = 24), and MEK inhibitors (n = 20). The cancer types with the highest number of clinical trial events were liver cancer (n = 40), lung cancer (n = 36), and breast cancer (n = 31). Data on clinical trials and mechanisms of action were taken from the Open Targets Platform (https://platform.opentargets.org/) [52] and the Drug Repurposing Hub (https://clue.io/repurposing) [44]. Lastly, Sankey plots were designed using the SankeyMATIC software (https://sankeymatic.com/ and https://github.com/nowthis/sankeymatic).

Shortest pathways to cancer hallmark phenotypes

After identifying 296 druggable proteins not yet involved in clinical trials, we conducted multi-omics analyses to prioritize the most relevant cancer-driving proteins as potential therapeutic targets across various cancer types [70,80,92,93,94]. In this context, we employed the CancerGeneNet software and found that 184 (62%) of these proteins showed distance scores indicative of their involvement in the shortest pathways leading to cancer hallmark phenotypes [66,67], as detailed in Supplementary Table 11. Figures 7A and B illustrate these druggable proteins and their shortest paths to cancer hallmarks. The top three hallmarks are cell proliferation (with a mean distance score of 1.27 and 154 proteins involved), cell differentiation (1.51; 160), and resistance to cell death (1.55; 157) (Supplementary Table 12). Using the Bonferroni correction for multiple comparisons, we observed that these druggable proteins had significantly shorter paths to these cancer hallmark phenotypes (P < 0.001).
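The Bonferroni adjustment itself is simple arithmetic: each raw p-value is multiplied by the number of comparisons and capped at 1. The sketch below shows only this adjustment step with hypothetical p-values; the text does not state which underlying statistical test produced the raw values, so none is assumed here.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.0001, 0.02, 0.4]
adjusted = bonferroni(raw)

# Apply the P < 0.001 threshold used in the text to the adjusted values.
significant = [p < 0.001 for p in adjusted]
print(adjusted, significant)
```

Because the multiplier grows with the number of comparisons, a difference that survives at P < 0.001 after adjustment is a comparatively conservative finding.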
These findings are relevant because the prioritized druggable proteins could be key targets for new therapeutic strategies aimed at processes such as cell proliferation or resistance to cell death.

Figure 7. Prioritization of key druggable cancer-driving proteins through multi-omics analyses. (A) Box plots display the mean distance scores of the shortest pathways associated with each cancer hallmark phenotype. The Bonferroni correction, a method for multiple comparison testing (P < 0.001), was employed to highlight significant differences among the cancer phenotypes. Analysis of the shortest paths to cancer hallmark phenotypes reveals that 184 druggable proteins are closely associated with cell proliferation, cell differentiation, resistance to cell death, glycolysis, metastasis, inflammation, genome instability, immortality, and angiogenesis. (B) Of these druggable proteins, 64 (34.8%) have the shortest paths to nine cancer hallmarks, 63 (34.2%) to eight hallmarks, 29 (15.8%) to seven hallmarks, 17 (9.2%) to six hallmarks, 6 (3.3%) to five hallmarks, 4 (2.2%) to four hallmarks, and 1 (0.5%) to one hallmark. These shortest paths to cancer hallmark phenotypes were analyzed using data from CancerGeneNet (https://signor.uniroma2.it/CancerGeneNet/) [66]. (C) A box plot shows the percentage chemistry-based score for the 184 druggable cancer-driving proteins. The ligandability analysis reveals that 79 (43%) of these proteins have high scores (> 69.9%). This chemistry-based score was analyzed using data from canSAR (http://cansar.icr.ac.uk) [72].
(D) This dot plot highlights the prioritization of 23 key druggable cancer-driving genes/proteins, identified based on a prediction of druggability higher than 0.7, a chemistry-based score above 70%, and unfavorable prognostic significance (significant log-rank P-value < 0.001) across 16 TCGA PanCancer types, according to data from the Human Protein Atlas platform (https://www.proteinatlas.org/) [74]. (E) Functional enrichment analysis of these 23 key druggable cancer-driving proteins is visualized through a Manhattan plot. This analysis demonstrates the most significant (Benjamini–Hochberg method, FDR q-value < 0.001) biological processes and Reactome signaling pathways involved in cancer. The enrichment analysis was conducted using g:Profiler software (https://biit.cs.ut.ee/gprofiler/gost) [78].

Chemistry-based score

canSAR is a comprehensive knowledgebase dedicated to drug discovery and offers an extensive structure-based ligandability assessment [72]. We therefore retrieved the chemistry-based scores for the previously prioritized 184 proteins. The mean chemistry-based score of these 184 proteins was 69.9%. In our analysis, we considered all proteins with a ligandability score higher than the mean (cutoff > 69.9%), encompassing all proteins with very high scores and the best proteins with high scores. This analysis enabled us to identify 79 (43%) druggable cancer-driving proteins with the highest ligandability, as shown in Fig. 7C and Supplementary Table 13. Ligandability analysis refers to a protein’s ability to bind efficiently to a drug.
High ligandability helps identify and prioritize proteins that can be effective targets for new drugs, thereby increasing the specificity of the drug’s action and reducing the time and cost associated with pharmaceutical development [95].

A pathology atlas for human cancer

We explored the Human Pathology Atlas, developed by the Human Protein Atlas program, and subsequently conducted a Kaplan–Meier analysis to examine the correlation between mRNA and protein expression and patient survival [74,75,76,77]. This analysis aimed to determine the prognostic significance of 79 highly ligandable, druggable cancer-driving genes/proteins (Supplementary Table 14). Our findings underscore the effectiveness of large-scale systems biology projects that utilize publicly available resources. In this study, we identified the 23 key druggable cancer-driving genes/proteins that demonstrated unfavorable prognostic significance (significant log-rank P-value < 0.001) across 16 TCGA PanCancer types. These genes/proteins were CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4 (Fig. 7D and Supplementary Table 15).

Functional enrichment analysis

We conducted a functional enrichment analysis of the 23 key druggable cancer-driving proteins using g:Profiler software [78]. The Manhattan plot enabled us to identify 64 GO biological processes [79] and 2 Reactome signaling pathways [63] (Fig. 7E and Supplementary Table 16). The most significant annotations, adjusted with the Benjamini–Hochberg correction and an FDR q-value < 0.001, included cell cycle, cell communication, phosphorylation, immune system process, programmed cell death, cell differentiation, cellular senescence, endocrine resistance, G1 phase, and cyclin D events in G1. Notably, these 23 key druggable cancer-driving proteins are involved in biological processes associated with various therapeutic strategies.
These strategies include the inhibition of cellular proliferation [96], the inhibition of phosphorylation [97], cancer immunotherapy [98], activation of programmed cell death [99], regulation of senescence [100], and evasion of endocrine resistance [101].

The oncogenic variome of key druggable cancer-driving proteins and their deleterious effects

Figure 8A presents the analysis of 22,320 variants using OncodriveMUT and boostDM to determine the oncogenic variome in the 23 key druggable cancer-driving genes. This analysis identified 1,598 oncogenic variants, with 11 (1%) being previously known and 1,578 (99%) newly predicted. The analysis of deleteriousness scores revealed that 252 (16%) of these oncogenic variants had very high CADD scores, 788 (49%) had high CADD scores, and 506 (32%) had medium CADD scores. The most common types of genetic alterations were missense variants (81%), followed by frameshift (6%) and stop-gained variants (5%). Figure 8B displays box plots that illustrate the deleteriousness scores of the oncogenic variants according to their consequence types. Stop-gained variants exhibited the highest mean CADD score (37.2), followed by splice donor (31.2), splice acceptor (30.9), missense (25.9), frameshift (25.8), start lost (21.2), stop lost (17.7), inframe deletion (17.4), splice region (16.8), and inframe insertion variants (16.7). Lastly, Fig. 8C presents bean plots that rank the key druggable cancer-driving genes based on the highest number of oncogenic variants and their deleteriousness scores (Supplementary Table 17).

Figure 8. Oncogenic variome. (A) Identification of oncogenic variants in the 23 key druggable cancer-driving genes using the OncodriveMUT and boostDM machine learning methods, followed by an examination of their CADD deleteriousness scores and consequence types. (B) Ranking of the consequence types based on the highest mean CADD scores.
(C) Bean plots illustrate the cancer-driving genes that possess the highest number of oncogenic variants, along with the CADD deleteriousness scores associated with these genes. The analysis of the oncogenic variome was conducted using the CGI platform (https://www.cancergenomeinterpreter.org) [61,83], while the assessment of their deleteriousness was carried out using the CADD tool (https://cadd.gs.washington.edu/) [84].

Identifying oncogenic variants in cancer-driving genes is crucial for developing targeted therapies [102,103,104,105]. These therapies are specifically designed to inhibit or modify the function of proteins produced by mutated genes, offering more effective treatment options with potentially fewer side effects compared with traditional chemotherapy [106]. Moreover, this approach enables personalized precision medicine. By understanding specific genetic and epigenetic alterations in a patient’s tumor, treatments can be tailored to target these changes [107,108,109,110]. In this context, the identification of oncogenic variants in druggable cancer-driving genes is a fundamental aspect of modern oncology, influencing everything from individual patient treatment to broader aspects of cancer research, ethnicity, and public health initiatives [106,111,112].

This integrative approach has identified 23 key druggable cancer-driving proteins (CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4), setting the stage for improved therapeutic targets that could significantly boost the efficacy of clinical trials.

Testing the model’s limitations

Like any model, ours has limitations when used for prediction. Due to the limited data on druggable proteins, all 666 druggable proteins were used as class 1 to train the model. This makes it impossible to obtain an external dataset with druggable proteins to confirm the predictive power of the best model.
One way to test the model’s limitations is to plot the best protein predictions within the space of the selected features, alongside the druggable proteins and hard-to-drug proteins. Since plotting in 200 dimensions (the number of selected features in the best model) is impractical, we approximate by transforming these 200 dimensions into just 2 PCA components for visualization. Class 0 descriptors (hard-to-drug proteins), class 1 descriptors (druggable proteins), and the descriptors corresponding to the 23 key druggable cancer-driving proteins (predicted proteins) were converted into standard units, as in the original dataset for TC descriptors, and transformed into 2 PCA components for visualization in Fig. 9. In the figure, druggable proteins are shown in blue, hard-to-drug proteins in red, and the best predicted proteins in green. The plot indicates that even though the negative class (class 0) contains phosphatase proteins, there is no clear separation between the training classes 1 and 0 within the space of the selected TC descriptors, indicating a complex descriptor space.

Figure 9. Principal component analysis to test the model’s limitations. This figure shows a plot of the best protein predictions within the space of the selected features, alongside the druggable proteins (class 1) and hard-to-drug proteins (class 0). Descriptors for class 1, class 0, and the 23 key druggable cancer-driving proteins (predicted proteins) were converted into standard units, as in the original dataset for TC descriptors, and transformed into 2 PCA components. Druggable proteins are shown in blue, hard-to-drug proteins in red, and the best-predicted proteins in green.

Prediction points that fall within regions containing mixed points (both class 1 and class 0 points) may be the most trustworthy.
In these regions, the model has been exposed to a more diverse dataset, enabling it to learn to better distinguish the patterns and characteristics that differentiate the two classes. Consequently, predictions in these mixed regions are more likely to be accurate and robust, as the model has learned more generalizable features for data classification. The majority of the predicted proteins are located in these mixed regions, suggesting they have a higher potential to be future drug targets. In the supplementary material, a researcher can choose another model with a mean AUROC value greater than a specific cutoff (e.g., 0.9), with fewer features and possibly a better PCA representation of the predictions. Future studies should use artificial intelligence and docking tools to predict a list of potential current drugs or new ligands.

Repurposing drugs and metabolites

An additional step to confirm the 23 key druggable cancer-driving proteins involves predicting interactions with ChEMBL-approved drugs (2,466 molecules with masses between 100 and 500) through drug repurposing [113,114]. Using pairs of drug SMILES codes and protein sequences as inputs, a deep learning model called PLAPT evaluated the binding affinity (or negative log10 affinity) [86]. PLAPT employs pre-trained transformers such as ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings. Supplementary Tables 18 and 19, along with the GitHub file titled Supplementary_interactions_gene-drug(byPLAPT), present the affinity values for each drug-protein pair.
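The phrase "negative log10 affinity" follows the usual convention that a smaller affinity constant (tighter binding) maps to a larger value. The sketch below illustrates that conversion and a ranking by it; it does not call PLAPT, and the pair names and affinity values are made up for illustration.

```python
import math


def neg_log10_affinity(affinity_molar: float) -> float:
    """Convert an affinity expressed in mol/L into its negative log10 value."""
    if affinity_molar <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(affinity_molar)


# Hypothetical drug-protein pairs with made-up affinities in mol/L.
pairs = {
    ("drug_A", "protein_X"): 1e-9,   # 1 nM
    ("drug_B", "protein_X"): 1e-6,   # 1 uM
    ("drug_C", "protein_X"): 5e-8,   # 50 nM
}

# Strongest predicted binders first: largest negative log10 affinity.
ranked = sorted(pairs, key=lambda pair: neg_log10_affinity(pairs[pair]), reverse=True)
print(ranked)
```

This is why the text can refer interchangeably to minimum affinities and maximum negative log10 affinities when naming the top candidates.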
The mean affinity values (minimum affinities or maximum negative log10 affinities) for all 23 proteins indicate that the top drugs clinically relevant to cancer treatment that can interact with these proteins include: mifepristone (targeting CASP8); pentostatin (BCL10, CASP8, CCNE1, and CDKN2A); afatinib (ACVR1, CDKN2C, and HRAS); alitretinoin (ACVR1, CDKN2C, HRAS, and PREX2); talazoparib (ACVR1, CDKN2C, and HRAS); alpelisib (ACVR1, CDKN2C, HRAS, NBN, PREX2, and SMARCA4); ulipristal acetate (ACVR1, ASXL1, CDKN2C, HRAS, NBN, PREX2, RB1, and SMARCA4); lorlatinib (ACVR1, ASXL1, ATG7, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1); piflufolastat (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1); pyrvinium pamoate (ASXL1, ATG7, BUB1B, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1); and tepotinib hydrochloride (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1) (Fig. 10).

Figure 10. Drug repurposing. The Sankey plot shows the best interactions between the 23 key druggable cancer-driving proteins and the clinically relevant ChEMBL-approved drugs for cancer treatment (https://www.ebi.ac.uk/chembl/) [113]. The interactions are based on binding affinity values greater than 0.9 for each drug-protein pair. Lastly, Sankey plots were designed using the SankeyMATIC software (https://sankeymatic.com/ and https://github.com/nowthis/sankeymatic).

Mifepristone, a progesterone receptor antagonist, has been explored for its potential in treating glioblastoma, breast cancer, and uveal melanoma due to its ability to act on multiple receptor types, including glucocorticoid and androgen receptors [115,116,117]. Pentostatin is a chemotherapy drug primarily used for treating hairy cell leukemia and T-cell prolymphocytic leukemia.
It is a purine analog that works by inhibiting the enzyme adenosine deaminase, crucial for DNA synthesis and cell replication, leading to the accumulation of deoxyadenosine triphosphate and ultimately causing cell death, particularly in rapidly dividing cancer cells118. Afatinib is an oral medication primarily used for treating non-small cell lung cancer. It functions as a tyrosine kinase inhibitor, targeting and blocking the EGFR protein as well as other members of the ErbB family, including HER2 and ErbB4119. Alitretinoin, a derivative of vitamin A, is used in cancer treatment primarily for Kaposi sarcoma. It binds to and activates retinoid receptors (RAR and RXR), which regulate gene expression involved in cell differentiation and proliferation, helping to inhibit the growth of Kaposi sarcoma cells120. Talazoparib works by inhibiting PARP enzymes, which play a crucial role in DNA repair. By blocking these enzymes, talazoparib prevents cancer cells from repairing their DNA, leading to cell death, especially in cells with BRCA1/2 mutations that already have compromised DNA repair mechanisms43,121. Alpelisib is an oral medication used in combination with fulvestrant to treat hormone receptor-positive, HER2-negative advanced or metastatic breast cancer with PIK3CA mutations. It works as a PI3K inhibitor, specifically targeting the alpha isoform of the enzyme, which is crucial in the PI3K/AKT signaling pathway involved in cancer cell growth and survival122. Ulipristal acetate is a modulator of the progesterone receptor, which is implicated in the proliferation and growth of certain cancer cells. It competes with progesterone, thereby inhibiting the progesterone-induced proliferation of breast cancer cells, making it a candidate for reducing breast cancer risk, especially in individuals with BRCA1/2 mutations123. Lorlatinib inhibits ALK and ROS1 kinases, which are involved in cancer cell growth and survival.
It is effective against multiple ALK mutations that confer resistance to first- and second-generation ALK inhibitors124. Piflufolastat F-18 binds to the prostate-specific membrane antigen (PSMA), a protein overexpressed on the surface of most prostate cancer cells. Once bound, the radioactive tracer emits positrons detected by a PET scanner, revealing the location of PSMA-positive lesions in the body125. Pyrvinium pamoate is an androgen receptor antagonist that targets multiple cellular pathways. It disrupts mitochondrial function by inhibiting electron transport chain complexes I and II, reducing mitochondrial fitness and increasing glycolysis, especially under hypoglycemic conditions often found in tumors. It also suppresses the WNT and Hedgehog signaling pathways, which are crucial for cancer cell proliferation and survival126,127,128,129. Lastly, tepotinib hydrochloride is a tyrosine kinase inhibitor targeting the MET receptor. By inhibiting this receptor, it interferes with cancer cell growth and survival pathways, which are crucial for the proliferation and metastasis of MET-altered cancer cells.

The last screening for interactions was conducted for the HRAS protein (P01112) using 217,776 molecules from the HMDB (see all affinities in Supplementary Table 20 and the GitHub file titled Supplimentary_affinities_hmdb_HRAS-P01112). Among the best potential interactions between HRAS and metabolites, the following were identified: cyanidin 5-O-beta-d-glucoside (HMDB0304305), chlorophyll (HMDB0303604), delphinidin 3-(3″-p-coumaroylglucoside) (HMDB0030099), cis-neoxanthin (HMDB0302969), verteporfin (HMDB0014603), pinotin A (HMDB0029240), benztropine (HMDB0014390), adapalene (HMDB0014355), inulin (HMDB0014776), and ceftriaxone (HMDB0015343). Future studies involving molecular docking, molecular dynamics, or other AI-based interaction prediction models will be needed to further confirm these interactions.
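The ranking and selection steps used throughout this section (averaging scores per drug across proteins, keeping pairs above a cutoff, and drawing them as a Sankey plot) can be sketched as follows. This is an illustrative outline, not the study's actual script: the toy scores, the function names, and the assumption that scores are on the same scale as the 0.9 cutoff quoted in the Fig. 10 caption are all hypothetical. The output lines follow SankeyMATIC's `Source [amount] Target` input syntax.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (drug, protein, score) records; values are made up for illustration.
RECORDS = [
    ("drug_X", "protein_A", 0.95),
    ("drug_X", "protein_B", 0.50),
    ("drug_Y", "protein_A", 0.91),
    ("drug_Y", "protein_B", 0.97),
]


def mean_score_per_drug(records):
    """Average each drug's scores over all proteins it was screened against."""
    scores_by_drug = defaultdict(list)
    for drug, _protein, score in records:
        scores_by_drug[drug].append(score)
    return {drug: mean(scores) for drug, scores in scores_by_drug.items()}


def rank_drugs(records):
    """Return drug names ordered from highest to lowest mean score."""
    means = mean_score_per_drug(records)
    return sorted(means, key=means.get, reverse=True)


def sankey_lines(records, threshold=0.9):
    """Format pairs scoring above the threshold as 'Source [amount] Target' lines."""
    return [
        f"{drug} [{score:.2f}] {protein}"
        for drug, protein, score in records
        if score > threshold
    ]


print(rank_drugs(RECORDS))
print("\n".join(sankey_lines(RECORDS)))
```

The printed lines can be pasted into the SankeyMATIC input box; the same ranking function would apply unchanged to a single-protein screen such as the HRAS metabolite list, where each molecule has only one score.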
