A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data

This section consists of three blocks. First, we investigated the most informative molecular biomarkers based on the distribution of cases in each glioma grade and checked whether these results agree with the results of four feature ranking algorithms. The second block analyzes the performance of six standard prediction models and classifier ensembles for glioma grading. Finally, we apply some resampling techniques to handle class imbalance and verify whether this leads to increased performance.

Most informative features

Values in Table 6 show the distribution of cases according to glioma grades for the mutated molecular biomarkers (i.e., feature value = 1). As can be seen, IDH1 mutations are the most common, being detected in 404 patients (48.15% of the total cases studied). However, these mutations occur in 94.31% of cases with LGG and only in 5.69% of cases with GBM, confirming previous findings that this is a very informative molecular biomarker for glioma grading17,34,48: IDH1/2 mutations have been largely associated with grade II and III gliomas and secondary glioblastomas49. Looking at the biomarkers with 50 or more cases, similar conclusions can be drawn for the molecular biomarkers ATRX with 84.33% of LGG cases, PTEN (phosphatase and tensin homolog) with 82.27% of patients affected by GBM, and CIC (capicua transcriptional repressor) with 96.40% of LGG cases. In the case of biomarkers with a low percentage of patients, we find NOTCH1 (notch receptor 1) (100% of LGG), FUBP1 (far upstream element binding protein 1) (95.56% of LGG), IDH2 (isocitrate dehydrogenase 2) (91.30% of LGG), SMARCA4 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4) (85.19% of LGG) and RB1 (retinoblastoma transcriptional corepressor 1) (85.00% of GBM).

Table 6 Distribution of cases (N (%)) according to glioma grades based on mutated molecular biomarkers (bold values indicate the most discriminating molecular biomarkers, that is, those with the greatest difference between LGG cases and GBM cases).

To support the conclusions drawn from the values in Table 6, we ran four feature ranking algorithms on the normalized data set with the aim of checking which are the most informative predictors: information gain, Gini index, Chi-squared, and RF. Note that identifying the most informative clinical factors and glioma molecular biomarkers can be valuable in obtaining relevant biological information. On the other hand, in some practical cases, having small feature sets with high prediction accuracy can become paramount to minimize response time.

Information gain (infGain) estimates the relevance of a predictor based on the amount by which the entropy of the class decreases when considering that feature. The Gini index (Gini) estimates the distribution of a predictor in different classes and can be interpreted as a measure of impurity for a feature. Chi-squared (Chi2) measures the strength of the relationship between each variable and the class label. Note that Chi-squared applies to categorical predictors, and therefore, numerical attributes (as is the case for Age at diagnosis) must first be discretized into several intervals. In the case of RF as a feature ranking method, each tree in the forest calculates the importance of a predictor based on its ability to decrease the weighted impurity in the tree.
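For illustration, the sketch below shows how three of the four ranking criteria described above (information gain approximated by mutual information, Chi-squared, and random-forest impurity-based importance) can be computed with scikit-learn. The file name glioma.csv, the Grade column, and the forest size are placeholder assumptions and do not reproduce the exact pipeline used in this study.

```python
# Hedged sketch: scoring predictors with three of the ranking criteria described above.
# Assumes a pre-normalized table "glioma.csv" with a binary "Grade" column (0 = LGG, 1 = GBM);
# file and column names are placeholders, not the exact pipeline used in the paper.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, chi2
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("glioma.csv")
X, y = data.drop(columns=["Grade"]), data["Grade"]

# Information gain approximated by the mutual information between each feature and the class.
info_gain = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Chi-squared expects non-negative categorical (or discretized) inputs, e.g. Age binned beforehand.
chi2_score = pd.Series(chi2(X, y)[0], index=X.columns)

# Random-forest importance: mean decrease in weighted (Gini) impurity across the trees of the forest.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_score = pd.Series(rf.feature_importances_, index=X.columns)

# Rank each criterion (1 = most informative) so the rankings can later be fused.
rankings = pd.DataFrame({"infGain": info_gain, "Chi2": chi2_score, "RF": rf_score}).rank(ascending=False)
print(rankings.sort_values("infGain").head())
```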
Table 7 Results of feature ranking methods. The ranking of each feature is shown in brackets (bold values indicate the five most informative predictors based on the multiple intersection method).

Since each feature ranking algorithm could yield different results (rankings), fusing them using a multiple intersection method was necessary to find out which features got the highest rankings in the output of the four algorithms. Thus, looking at the rankings of each algorithm, it was possible to determine the five most relevant attributes, while there were discrepancies in establishing the most informative attributes from the sixth position onwards. From the outputs of the multiple intersection method, Table 7 shows that the four feature ranking algorithms agreed in defining IDH1 as the most informative attribute, followed by Age at diagnosis, PTEN, CIC and ATRX. These results are interesting because they are consistent with the findings of various studies conducted in neuroscience and neuro-oncology17,34,48,49 in which the mutated molecular biomarkers that best discriminate LGG from GBM were determined, as reported in Table 6. The relevance of this lies in the fact that feature selection or ranking algorithms could be used to discover the molecular biomarkers with the greatest discriminating power instead of other methods that are more expensive, time-consuming and difficult to carry out.

We ran multidimensional scaling50 to visualize in Fig. 3 the samples from both classes as a function of the attribute Age at diagnosis against each of the four most informative biomarkers (IDH1, PTEN, CIC, and ATRX). Each blue dot represents an LGG sample, and each red dot a GBM sample. The regions belonging to each class are shaded in blue or red depending on whether they correspond to the LGG or GBM class, respectively. These graphs allow us to see how the age of the patients and the mutations are related to the grade of glioma. For example, Fig. 3a reveals that most LGG cases involve IDH1 mutations and occur at younger ages than GBM cases. For PTEN (Fig. 3b), LGG occurs when there is no mutation, while GBM does not appear to depend on this molecular biomarker, since approximately the same number of GBM cases is seen with and without PTEN mutations.

Figure 3 Scatter plot of Age at diagnosis (X-axis) vs. the most informative molecular biomarkers (Y-axis).

Results of the prediction models

Table 8 reports the results of each of the six evaluation metrics achieved by the prediction models applied to the normalized data set (with all predictors) using the experimental protocol described above. The results revealed that RF was the best performing model, although closely followed by CatBoost and SVM. In contrast, kNN, MLP and LR obtained the lowest values regardless of the performance evaluation metric used.

Table 8 Prediction performance of the machine learning models (the best values are in bold).

For a better analysis of these results, we performed a pairwise comparison of models using a correlated Bayesian t-test51 for each evaluation metric to check whether the difference in scores between each pair of models was significant or not. Unlike the frequentist correlated t-test, where the inference is a p-value, the inference of the Bayesian t-test is a posterior probability. Additionally, this test takes into account the correlation and the uncertainty (i.e., the standard error) of the results generated by cross-validation.
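The sketch below illustrates the posterior probability computation behind such a correlated Bayesian t-test, assuming paired per-fold scores and the usual correlation heuristic for k-fold cross-validation; the fold scores are illustrative values, not results taken from Table 8.

```python
# Hedged sketch of a correlated Bayesian t-test over paired cross-validation scores;
# the scores below are illustrative, not the values reported in the paper.
import numpy as np
from scipy import stats

def correlated_bayesian_ttest(scores_a, scores_b, rho):
    """Posterior probability that model A scores higher than model B.

    rho is the correlation induced by overlapping training folds,
    usually n_test / (n_test + n_train), i.e. 1/k for k-fold CV.
    """
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    # Corrected variance accounts for the correlation between folds.
    corrected_var = (1.0 / n + rho / (1.0 - rho)) * diffs.var(ddof=1)
    posterior = stats.t(df=n - 1, loc=diffs.mean(), scale=np.sqrt(corrected_var))
    return 1.0 - posterior.cdf(0.0)

# Example with 10-fold CV accuracies of two hypothetical models.
acc_rf  = np.array([0.88, 0.90, 0.86, 0.89, 0.91, 0.87, 0.90, 0.88, 0.89, 0.90])
acc_knn = np.array([0.84, 0.86, 0.83, 0.85, 0.87, 0.82, 0.86, 0.84, 0.85, 0.86])
print(correlated_bayesian_ttest(acc_rf, acc_knn, rho=1 / 10))
```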
The outputs of the statistical test are summarized in Table 9, where the number in a cell denotes the probability that the model corresponding to the row had a significantly higher score (posterior probability greater than 0.5) than the model corresponding to the column. Values in this table indicate that the results obtained by RF and CatBoost were significantly better than those of kNN, MLP, and LR, regardless of the metric used. When comparing RF and CatBoost with SVM, it can be seen that the differences were not statistically significant when using Prec (0.492 and 0.399) and Spec (0.460 and 0.416). Finally, the posterior probabilities of RF being significantly better than CatBoost revealed that the performance differences between both ensembles were very small, so one should not conclude that RF performed better than CatBoost.

Table 9 Pairwise comparison of models.

Figure 4 plots the ROC curves for the RF and CatBoost ensembles separately for each of the two classes (LGG and GBM). The diagonal dotted line represents the behavior of a random classifier, while the full diagonal line represents iso-performance in the ROC space, so that all the points on the line give the same profit/loss. The closer to the top and further to the left this full diagonal line is, the better the classifier result. The AUC was 0.923 for RF and 0.924 for CatBoost, that is, the difference between both classifiers was negligible.

Figure 4 ROC curves for the classifier ensembles.

Figure 5 shows the confusion matrix corresponding to each of the six prediction models. Although the imbalance ratio of the data set is moderately low (1.38), the confusion matrices allow us to examine the behavior of the models in each of the classes, that is, to analyze the number of successes and errors individually by class in order to identify whether or not there were differences between predicting samples belonging to the majority class and samples of the minority class. Thus, it can be observed that the three models with the best performance (RF, CatBoost and SVM) made fewer errors than the other three classifiers (kNN, MLP and LR) on the minority class (GBM). In contrast, the number of misclassifications on the majority class (LGG) was similar for all classifiers.

Figure 5 Confusion matrices of the classifiers.

Explainability of predictions

Due to the “black box” nature of most machine learning models, one of the main problems is their insufficient interpretability, that is, the difficulty in understanding the predictions they make. To address this limitation, some methodologies belonging to the eXplainable Artificial Intelligence (XAI)52 paradigm have been proposed in order to provide a reasonable understanding of the output of machine learning models. In particular, we analyzed the effect of the attributes on the prediction performance using two explainability approaches: global feature importance and SHAP.

Global feature importance estimates the contribution of each individual feature to the prediction by measuring the increase in the prediction error of the model after permuting the feature values across the data set, which breaks the relationship between the feature and the target variable44,53. A feature is important if permuting its values increases the model error, while a feature is of little or no importance if permuting its values does not change the error of the model.
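A minimal sketch of this permutation-based importance, scored with the AUC as in Fig. 6, is given below using scikit-learn; the data loading, model choice and split are hypothetical placeholders rather than the exact configuration used in this work.

```python
# Hedged sketch of permutation-based global feature importance, scored with AUC.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = pd.read_csv("glioma.csv")                     # hypothetical file name
X, y = data.drop(columns=["Grade"]), data["Grade"]   # hypothetical column name

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Permute each feature n_repeats times and measure the drop in held-out AUC.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=30, random_state=0)
top5 = pd.Series(result.importances_mean, index=X.columns).nlargest(5)
print(top5)  # the five most important features, as plotted in Fig. 6
```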
Bar charts in Fig. 6 show the feature importances in descending order for each classifier, indicating that the IDH1 biomarker was the most important attribute contributing to the target variable (i.e., glioma grade), regardless of the model used. The second most important feature was Age at diagnosis in all cases except when applying the MLP neural network (note that even in this case the attribute Age at diagnosis was the third most important). It is worth highlighting that these results mostly agree with those reported in Table 7, where these two features were also identified as the most relevant when applying the multiple intersection method.

Figure 6 Feature importance of the top 5 variables according to the AUC of the model.

It should be noted that the global feature importance approach reveals the absolute importance of each attribute, but it does not indicate the direction of the change given by the permutations, that is, it does not report whether the feature increases or decreases the prediction performance of the model. To overcome this limitation, we also employed the SHAP method introduced by Lundberg and Lee54, which is based on the principles of cooperative game theory and can provide broad explanations of model predictions at both local and global levels. This method computes Shapley values, which quantify the average marginal contribution of a feature to the prediction made by the model after considering all possible combinations with other features55; that is, it provides information about whether the influence of each feature on the prediction value of the model is positive (increase) or negative (decrease). The Shapley value of a feature is calculated as the difference between the prediction when the feature is present and the prediction when the feature is absent.

Figure 7 shows the SHAP summary plot for each model, which represents the positive or negative impact of each feature on the prediction of one class. The X-axis is the Shapley value, which denotes how much the features contribute to the prediction of a patient diagnosed with GBM across all possible combinations. A value less than 0 indicates a negative contribution (i.e., low importance for the prediction of the minority class GBM), a value equal to 0 indicates no contribution, and a value greater than 0 indicates a positive contribution (i.e., high importance for the prediction). The left vertical axis (Y-axis) lists the features ranked in descending order of their relevance to the prediction of class GBM, while the right vertical axis indicates the value of the features from lowest to highest. Each dot represents the Shapley value of a sample (patient) plotted horizontally and is colored red or blue depending on whether the feature value is high or low, respectively.

Figure 7 SHAP summary plot for each model.

From these plots, it can be seen that Age at diagnosis was the most important feature for the prediction of class GBM when the kNN and LR models were used, and the second most relevant with the rest of the classifiers. Samples with higher values of this feature (red) had higher Shapley values, meaning that they contributed to the prediction of class GBM, whereas lower values of this attribute (blue) contributed against the prediction of this class. The IDH1 biomarker contributed the most to the prediction of the GBM class when using the MLP, SVM, RF and CatBoost models. As IDH1 is a categorical attribute, its impact on the prediction depends on its value (0 = non-mutated, 1 = mutated). Thus, it can be seen that the non-mutated value of this biomarker contributed to the prediction of the GBM class, while the mutated value contributed negatively.
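Such SHAP summary plots can be produced along the following lines; this sketch assumes a tree ensemble, the hypothetical glioma.csv file used in the previous sketches, and a class indexing that may vary with the installed shap version.

```python
# Hedged sketch of SHAP explanations for a tree ensemble; data loading and
# class indexing are assumptions and may differ with the shap version installed.
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("glioma.csv")                     # hypothetical file name
X, y = data.drop(columns=["Grade"]), data["Grade"]   # 0 = LGG, 1 = GBM (assumed encoding)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Older shap releases return a list of per-class arrays, newer ones a 3-D array;
# in both cases the slice below is assumed to select the GBM (positive) class.
gbm_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(gbm_shap, X)
```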
Addressing class imbalance

Considering the differences in misclassifications between the majority class and the minority class, we decided to address the class imbalance in order to see if any performance improvement could be obtained. It is well known that training a machine learning algorithm with imbalanced data can favor the majority class, typically leading to higher misclassification rates over the minority class (GBM). Among the various strategies to address imbalanced data, resampling techniques are by far the most widely used approach because they have been proven to be efficient, are classifier-independent, and can be easily implemented for any problem56. They are designed to change the composition of the training data set by adjusting the number of majority and/or minority samples until both classes are represented by an approximately equal number of samples. Many researchers have argued that over-sampling is generally superior to under-sampling because under-sampling algorithms can discard potentially useful data and increase classifier variance57. It should be noted that, to avoid overoptimistic results, resampling should be applied only to the training set, not to the entire data set58. In the case of over-sampling, for instance, this means that the testing samples are neither over-sampled nor seen by the machine learning model during training.

Experiments in this section were carried out with two resampling algorithms, as sketched below. The first is an over-sampling algorithm proposed by Chawla et al.59 called SMOTE, which generates artificial samples of the minority class (GBM) by interpolating existing samples that are close together. It first finds the k nearest minority-class neighbors of each minority sample, and then synthetic samples are generated in the direction of some or all of those nearest neighbors; depending on the amount of over-sampling required, a certain number of samples are randomly chosen from the k nearest neighbors. The second is random under-sampling (RUS), which balances the data set by randomly removing samples that belong to the over-sized class (LGG).
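The resampling step can be sketched as follows with imbalanced-learn; the file name, column name and split are placeholder assumptions, and, following the protocol described above, both samplers are applied to the training partition only.

```python
# Hedged sketch of the resampling step; SMOTE/RUS are applied only to the training
# split, never to the test samples. Names are placeholders, not the paper's pipeline.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

data = pd.read_csv("glioma.csv")                     # hypothetical file name
X, y = data.drop(columns=["Grade"]), data["Grade"]   # 0 = LGG (majority), 1 = GBM (minority)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE: synthesize minority (GBM) samples by interpolating between nearest neighbors.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_train, y_train)

# RUS: randomly drop majority (LGG) samples until both classes have equal size.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

print(y_train.value_counts(), y_smote.value_counts(), y_rus.value_counts(), sep="\n")
```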
Table 10 reports the performance results obtained after preprocessing the normalized data set with SMOTE and RUS. The first issue worth mentioning is that over-sampling performed better than under-sampling, except when Recall was used. Secondly, unlike the results obtained with the normalized data set without preprocessing (Table 8), the best model after up-sampling the data set was SVM, although the differences with respect to RF and CatBoost were negligible.

Table 10 Prediction performance of the machine learning models using the resampled data sets (the best values are in bold).

To check whether or not the differences in the means of the results obtained with the normalized training set without preprocessing and those preprocessed with over-sampling and under-sampling were significant, a two-tailed t-test60 was performed for a significance level of 5% (α = 0.05); the t-values and p-values are shown in Table 11. When comparing the means of Table 8 with those of over-sampling (upper part of Table 10), we found that the differences were statistically significant in all cases, except when using specificity to evaluate the performance of the models. On the other hand, when comparing them with the means obtained with under-sampling (bottom of Table 10), we found that the precision and specificity on the non-preprocessed set were significantly better than those on the under-sampled set. Therefore, despite the low imbalance ratio, the test indicated that over-sampling the normalized data set with the SMOTE algorithm is advisable to increase the performance of the prediction models.

Table 11 Statistical comparison between the non-preprocessed data set and the resampled data sets. The first line of each method is the t-value, and the second line corresponds to the p-value (italic values indicate no significant differences, while underlined values indicate that the results without resampling were better than those with resampling).

As further confirmation of the findings obtained with SMOTE, Fig. 8 plots the precision-recall curves for the best prediction models (SVM, RF and CatBoost) when applied to the original training sets and the resampled training sets. The area under the precision-recall curve was 0.838, 0.860 and 0.872 for SVM, 0.873, 0.910 and 0.904 for RF, and 0.872, 0.908 and 0.898 for CatBoost using the original, over-sampled and under-sampled training sets, respectively. These values confirm some performance improvement as a result of addressing class imbalance with SMOTE.

Figure 8 Precision-recall curves for SVM and the classifier ensembles applied to the original training sets (a–c), the over-sampled training sets (d–f), and the under-sampled training sets (g–i).

The last experiment focused on analyzing the behavior of the prediction models on the up-sampled data set using the feature vector with the five most relevant attributes according to the multiple intersection method. Table 12 shows that the best performing models were LR and SVM, which is quite surprising because these results differ from those obtained on the data set containing all attributes. On the other hand, when comparing the results of the upper part of Table 10 with those of Table 12, one can see that the performance of all the prediction models worsened when applied to the reduced sets. To check whether or not the differences were statistically significant, we again ran a two-tailed t-test for a significance level of 0.05: t-value = −6.898545, p-value = 0.00098.

Table 12 Prediction performance of the machine learning models on the over-sampled data set using the top five attributes (the best values are in bold).
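The paired two-tailed comparisons reported above can be computed along these lines with SciPy; the score vectors below are illustrative placeholders (e.g., one mean metric value per model), not the actual values behind Tables 10, 11 and 12.

```python
# Hedged sketch of the two-tailed paired t-test used for comparing result sets;
# the score vectors are hypothetical, not the values reported in the paper.
import numpy as np
from scipy import stats

scores_all_features  = np.array([0.87, 0.88, 0.83, 0.84, 0.82, 0.86])  # hypothetical
scores_top5_features = np.array([0.84, 0.85, 0.82, 0.83, 0.83, 0.85])  # hypothetical

t_value, p_value = stats.ttest_rel(scores_all_features, scores_top5_features)
print(f"t = {t_value:.3f}, p = {p_value:.5f}")  # two-tailed by default, compare with alpha = 0.05
```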
