Machine learning optimized DriverDetect software for high precision prediction of deleterious mutations in human cancers

Selection of prediction tools used in this study

Seven prediction tools were used to provide features for training the DriverDetect algorithm (Table 1) (Fig. 1a, S1a). Population-based prediction tools, namely VEST (CRAVAT version 5.2.4, https://www.cravat.us/CRAVAT/)25,26 and the ensemble methods CONDEL (FannsDB version 2.0, https://bbglab.irbbarcelona.org/fannsdb/)27, PredictSNP (version 1.0, https://loschmidt.chemi.muni.cz/predictsnp1/)21, and MutPred (version 2.0, http://mutpred.mutdb.org/)22, were used to distinguish deleterious mutations from neutral ones (Fig. S1a). Cancer-based prediction tools FATHMM (version 2.3, https://fathmm.biocompute.org.uk/index.html)28 and CHASM (CRAVAT version 5.2.4, https://www.cravat.us/CRAVAT/)29,30, along with the ensemble method TransFIC (FannsDB version 2.0, https://bbglab.irbbarcelona.org/fannsdb/)31, were employed to differentiate driver mutations from passenger mutations (Fig. S1a). Breast cancer, which performed best in previous mutation prediction experiments32, was selected as the cancer type for CHASM, which requires one to be specified. These tools were also selected for their accessibility and usability: their results can be downloaded easily from web interfaces such as CRAVAT, which provides user-friendly front ends for both VEST and CHASM25,33.

Table 1. List of prediction tools used.

The output formats vary among the different prediction tools.
PredictSNP21, CONDEL27, FATHMM28, and TransFIC31 provide a prediction along with a score that reflects the confidence of the prediction, i.e. the likelihood that the mutation is deleterious or a driver (Fig. S1a). MutPred, VEST, and CHASM, on the other hand, do not provide predictions and instead yield scores, ranging from 0 to 1, for each mutation21,22,25,26,29. We adhered to the default ranges from previous literature, using 0.5 as the cutoff: scores above 0.5 were considered deleterious/driver mutations, and scores below were considered neutral/passenger mutations (Fig. S1b)21,22,25,27,28,29. TransFIC provides three different scores: siftTransFIC, maTransFIC, and pph2TransFIC. Using these scores, mutations were further classified into three categories: low impact (siftTransFIC < −1, pph2TransFIC < −1, maTransFIC < −1), medium impact (siftTransFIC from −1 to 2, pph2TransFIC from −1 to 1.5, maTransFIC from −1 to 2), and high impact (siftTransFIC > 2, pph2TransFIC > 1.5, maTransFIC > 2)31. Mutations labeled predominantly as “low impact” by the TransFIC scores were deemed passenger mutations, while those labeled “medium” or “high” impact were deemed driver mutations (Fig. S1b and S1c)31.

Curation of mutation datasets

To evaluate population-based algorithms, a training dataset comprising 763 deleterious and 607 neutral mutations, along with a testing dataset containing 100 deleterious and 100 neutral mutations, was curated from the ClinVar database34 (Fig. 2a and S2).
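The 0.5 score cutoff and the TransFIC impact thresholds described above can be expressed as a short helper. This is an illustrative sketch only: the function and variable names are ours and are not part of any tool's interface.

```python
def classify_by_cutoff(score, cutoff=0.5):
    """Binary call for tools (MutPred, VEST, CHASM) that emit a 0-1 score."""
    return "deleterious/driver" if score > cutoff else "neutral/passenger"

# Per-score TransFIC thresholds from the text:
# (upper bound of "low", lower bound of "high").
TRANSFIC_THRESHOLDS = {
    "siftTransFIC": (-1, 2),
    "pph2TransFIC": (-1, 1.5),
    "maTransFIC": (-1, 2),
}

def transfic_impact(name, score):
    """Label one TransFIC score as low, medium, or high impact."""
    low_ub, high_lb = TRANSFIC_THRESHOLDS[name]
    if score < low_ub:
        return "low"
    if score > high_lb:
        return "high"
    return "medium"
```

For example, a pph2TransFIC score of 1.6 exceeds its high-impact threshold of 1.5 and is labeled "high", while the same value would still be "medium" for siftTransFIC, whose threshold is 2.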
Curation was performed in accordance with the supervised learning approach established in machine learning and in the development of other prediction tools21,24,35,36. Mutations classified as “Pathogenic” or “Likely pathogenic” by multiple tools were considered deleterious, while those classified as “Benign” or “Likely benign” by multiple tools were annotated as neutral. The full lists of mutations used as training and testing data for the population-based algorithms are available in Supplementary Tables 1 and 2, respectively.

Fig. 2. Training and testing dataset split. (a) Split of 763 deleterious and 607 neutral mutations within the training dataset, and 100 deleterious and 100 neutral mutations within the testing dataset, for population-based algorithms. (b) Split of 709 driver and 910 passenger mutations within the training dataset, and 100 driver and 100 passenger mutations within the testing dataset, for cancer-based algorithms.

For the cancer-based algorithm, a training dataset of 709 driver and 910 passenger mutations and a testing dataset of 100 driver and 100 passenger mutations were curated from four databases, namely cBioPortal37,38, OncoKB39, IntOGen2, and GnomAD40 (Fig. 2b and S2). Curation likewise followed the supervised learning method established in machine learning and in the development of other prediction tools21,24,35,36. cBioPortal is an open-source resource of gene mutations and their categorization. OncoKB categorizes driver mutations based on clinical data. IntOGen collates mutations in tumor genomes and labels possible drivers. GnomAD processes raw data from large-scale sequencing projects2. The ClinVar database was used to assess the clinical significance of mutations in the training and testing datasets. Mutations categorized as “Benign” were regarded as passenger mutations. The total numbers of driver, passenger, deleterious, and neutral mutations provided by each database are listed in Fig. S2a and S2b.
Mutations obtained from the IntOGen database were used only in the training dataset, as the criteria for their inclusion in that database are based solely on bioinformatics analysis without clinical or experimental support (Fig. S2c)2. Mutations were selected from previously published cancer/driver genes16 for analyses with both algorithms, along with mutations from known breast cancer genes for validation of the cancer-based algorithm (Table S3). Mutations that were classified incorrectly by all prediction tools were excluded from the testing dataset. The full lists of mutations used in the training and testing data for the cancer-based algorithms can be found in Supplementary Tables 3 and 4, respectively.

All the databases used in this study are publicly available, with clear labeling and citations. In this way, we excluded mutations without any literature support. Mutations listed in the selected databases were used with clear consent from the original authors and were curated without gender or age bias to allow for a random and fair selection.

Combining results of different prediction tools

The raw scores generated by the population- and cancer-based tools are summarized in Supplementary Tables 5 and 6, respectively. The prediction results generated by the individual tools were combined using “either-or” and “at-least” methods. For “either-or”, consensus among all tools in the combination was required to categorize a mutation as deleterious/driver. For “at-least”, a specified minimum number of prediction tools had to identify the mutation as deleterious/driver for it to be categorized accordingly. For example, “at least 2 of PredictSNP, CONDEL, and MutPred” means that at least two of the three prediction tools must predict the mutation as deleterious.

TransFIC classifies candidate mutations using three parameters, namely siftTransFIC, pph2TransFIC, and maTransFIC.
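The “at-least” combination rule described above amounts to simple threshold voting over the per-tool binary calls, with “either-or” (full consensus) as the special case in which the threshold equals the number of tools. A minimal sketch, with illustrative names of our own:

```python
def at_least(calls, k):
    """Categorize a mutation as deleterious/driver when at least k of the
    per-tool boolean calls (True = deleterious/driver) agree."""
    return sum(calls) >= k

def either_or(calls):
    """Full consensus among all tools in the combination, as described
    in the text: the at-least rule with k equal to the number of tools."""
    return at_least(calls, len(calls))
```

For instance, “at least 2 of PredictSNP, CONDEL, and MutPred” with calls `[True, False, True]` categorizes the mutation as deleterious, while `either_or` on the same calls does not.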
Each parameter assigns a separate label of “low impact”, “medium impact”, or “high impact” (Fig. S1a and S1b)31. We adopted two approaches in this paper: “TransFIC Low” and “TransFIC High”. The “TransFIC Low” method classifies a candidate mutation as a passenger mutation if at least two of the three TransFIC parameters are labeled “low impact”, or if two parameters are labeled “medium impact” and one “low impact”. Mutations with any other combination of labels are considered driver mutations. Conversely, the “TransFIC High” method identifies a candidate mutation as a driver if at least two of the three TransFIC parameters are labeled “high impact”, or if two parameters are labeled “medium impact” and one “high impact”. The others are categorized as passenger mutations (Fig. S1c). We adopted the TransFIC Low approach, as it provides the most consistent results. The combined prediction results from the population-based and cancer-based algorithms were consolidated and compared with scores from DriverDetect, as shown in Supplementary Table 7.

Development and optimization of DriverDetect algorithm

The Tree-based Pipeline Optimization Tool (TPOT) Python library (version 0.3) was used to develop and optimize the DriverDetect algorithm reported herein. TPOT is an automated machine learning tool that tests candidate models and selects the most optimized model and parameters it can find. During the automation process, it selects a random classifier, or combination of classifiers, that it has not yet used and applies it to the training dataset. TPOT iteratively selects 100 different parameter sets for up to 100 different classifiers or combinations. Each time a parameter set is selected for a particular classifier, TPOT conducts a tenfold cross-validation to estimate the accuracy of the current algorithm (Fig. 1). This results in approximately 100,000 runs across 10,000 different algorithm configurations, ultimately yielding the algorithm that performs best during the search41.
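The tenfold cross-validation that is run for every candidate configuration can be illustrated with a plain index-splitting sketch. This mimics only the fold-construction step, not TPOT's internal code; the function name and seed are our own.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation:
    shuffle the n sample indices once, then rotate each fold out
    as the held-out test set in turn."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

A candidate classifier is fitted on each training split and scored on the corresponding held-out fold; the ten fold accuracies are then averaged to rank the candidate, as in Fig. 1.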
The raw scores from DriverDetect are summarized in Supplementary Table 7.

Training features

We used six training features for the population-based algorithms and seven for the cancer-based algorithms. PredictSNP generated two outputs per mutation: a categorization as either “deleterious” or “neutral”, and an expected accuracy ranging from 0 to 1. These two outputs were combined into a single score: mutations labeled “deleterious” were assigned the negative value of their expected accuracy (e.g., 0.7 became −0.7), whereas mutations labeled “neutral” retained their accuracy score.

The outputs from VEST and CHASM comprise scores and p-values, which indicate the probability that a passenger mutation is erroneously classified as a driver, along with false discovery rates (FDRs), which indicate the likelihood of a positive result being a false positive. Consequently, four new features, namely ‘VEST p’, ‘VEST FDR’, ‘CHASM p’, and ‘CHASM FDR’, were derived to represent the scores assigned to mutations after correction with the respective p-value or FDR. These scores were employed for training DriverDetect (Fig. 1a). Details of the correction formula, along with illustrative examples, are provided in Fig. S3, while the impact on the accuracy, F1 score, and Matthews correlation coefficient (MCC) of the raw VEST and CHASM scores is shown in Fig. S4a and S4b.

Metrics used for performance evaluation

The accuracy, F1 score, and Matthews correlation coefficient (MCC) were employed to compare the performance of each prediction tool, and of their combinations, with that of the DriverDetect algorithm on the testing datasets. The different prediction tools were compared using these parameters, with accuracy prioritized first, followed by F1 score and then MCC (Fig.
1b). Precision and recall (sensitivity) were also considered on a case-by-case basis, where necessary. These metrics were calculated with Microsoft Excel 365 (version 16.78.3; Microsoft, Redmond, WA, USA), using the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) as per the following formulae:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1\ \text{Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

|                 | Predicted positive  | Predicted negative  |
| Actual positive | True positive (TP)  | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN)  |

In cross-validation on the training dataset, single prediction tools and combinations of common prediction tools were used to assess the performance of DriverDetect by average accuracy (Fig. 1b), as previously reported41. Receiver operating characteristic (ROC) curves of TP rate against FP rate (%) were plotted with R (version 4.3.2; R Foundation for Statistical Computing, Vienna, Austria) to compare the performance of the different prediction tools at the same classification thresholds. The area under the curve (AUC) values of the ROC curves were used to compare the performance of the different machine-learning-based tools in driver mutation prediction42. McNemar’s test was used to evaluate significant improvements in accuracy and other parameters for DriverDetect, compared with single prediction tools or their combinations43.
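The formulae above translate directly into code. The following is a self-contained sketch using only the standard library; the helper name is ours, and it assumes a non-degenerate confusion matrix (no all-zero row or column, which would make the MCC denominator zero).

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1 score, and MCC
    from confusion-matrix counts, per the formulae in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}
```

For example, a perfect classifier on a balanced 200-mutation testing set (TP = 100, TN = 100, FP = 0, FN = 0) yields accuracy, F1, and MCC of 1.0.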
