Comparative analysis of feature selection techniques for COVID-19 dataset

A total of 4778 COVID-19 patients were included in the study, which examined 116 routine clinical, laboratory, and demographic features. The fatality rate in this cohort was 22% (N = 1050; 59.6% male). The mean ± SD age for the deceased and surviving groups was 70.8 ± 15.6 and 58.3 ± 16.9 years, respectively.

After creating a 70%/30% train/test split, we found that 22% of the instances in the training dataset belong to the “Death” class and 78% are labeled “Alive”, indicating a moderate class imbalance. To address this, we applied rebalancing techniques [35] before running the machine learning algorithms, so that the minority class would not be under-represented in the training set, and then used the balanced dataset to make predictions. Using the ROSE package [36], we combined oversampling of the minority class with replacement and undersampling of the majority class without replacement, balancing the class distribution while maintaining the integrity of the dataset. The resulting data had the same size as the original dataset, with a 1:1 ratio of “Death” to “Alive”.

Number of selected features in every FS method

Using the VIF, the dataset’s features were reduced to 109 predictors.
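The combined rebalancing step described above can be sketched in a few lines. The paper used R's ROSE package; the following is a minimal Python illustration of the same idea (the function and variable names are mine, not the paper's):

```python
import random

def balance_classes(rows, labels, minority="Death", majority="Alive", seed=42):
    """Balance a binary dataset to a 1:1 class ratio while keeping the
    original dataset size: oversample the minority class with replacement
    and undersample the majority class without replacement.

    Illustrative sketch of the combined over/undersampling idea; it does
    not reproduce ROSE's smoothed-bootstrap behaviour.
    """
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_rows = [r for r, y in zip(rows, labels) if y == majority]
    target = len(rows) // 2  # half the original size per class (1:1 ratio)

    # Oversample the minority with replacement, undersample the majority
    # without replacement.
    sampled_min = [rng.choice(minority_rows) for _ in range(target)]
    sampled_maj = rng.sample(majority_rows, target)

    balanced_rows = sampled_min + sampled_maj
    balanced_labels = [minority] * target + [majority] * target
    return balanced_rows, balanced_labels
```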
The CMIM approach identified six key features, whereas the correlation filter method found significant correlations for eight numerical features and strong Phi correlation coefficients for three categorical features. The Gini impurity embedded method, using 700 trees with five random variables each, selected twenty important features and achieved a minimum out-of-bag error rate of 5.29%. The Hybrid ABL method eliminated 39 features based on ANOVA results; further reduction through backward selection and Lasso left 15 predicting features. The hybrid Boruta-VI method initially deemed 55 attributes significant, which were then narrowed down to 20 based on importance scores across eleven prediction models (Supplementary Tables S1–S4, Supplementary Figs. S1, S2).

Comparison of selected features

The Jaccard index [37] was used to compute the similarity between pairs of FS methods. As anticipated, the Jaccard index is high for Hybrid Boruta-VI vs. MDG Random Forest (0.6) and Correlation vs. CMIM (0.54), while Hybrid ABL shows low similarity with the other FS methods (Fig. 1).

Figure 1. The similarity of the features selected by each feature selection method.

The most important features included in every FS method were age and neutrophil count (NEUT), followed by oxygen saturation (O2sat), albumin, UREA, and blood urea nitrogen (BUN) (Table 1).

Table 1. Selected features in every FS method.

To investigate the relationship between the occurrence of Death and the predictors included in each FS method, we employed multivariable binary logistic regression analysis. In the Hybrid ABL dataset, at the 5% significance level, all predictors except CKD, FBS, and ESR were significantly associated with Death (p-values < 0.05).
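The Jaccard index used above to compare feature sets admits a very short implementation. The feature lists below are hypothetical, for illustration only, not the paper's actual selections:

```python
def jaccard_index(features_a, features_b):
    """Jaccard similarity between two feature sets:
    |A ∩ B| / |A ∪ B|, ranging from 0 (disjoint) to 1 (identical)."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical feature lists (illustrative only): 6 shared features out
# of a union of 10 gives a Jaccard index of 0.6.
boruta_vi = {"age", "NEUT", "O2sat", "ALBUMIN", "UREA", "BUN", "CR", "LDH"}
mdg       = {"age", "NEUT", "O2sat", "ALBUMIN", "UREA", "BUN", "WBC", "PT"}
print(round(jaccard_index(boruta_vi, mdg), 2))  # prints 0.6
```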
Similarly, in the correlated dataset, all predictors except LYMPHH, BUN, and CR were significantly associated with Death at the 5% level (p-values < 0.05). In the CMIM dataset, all predictors except BUN were significantly associated with Death at the 5% level. For the MDG dataset, all predictors excluding WBC, CR, LDH, PT, HB, and TIBC were significantly associated with Death at the 5% level. Lastly, in the Boruta-VI dataset, all predictors except CR and UREA were significantly associated with Death at the 5% level (p-values < 0.05) (Table 2).

Table 2. The odds of Death for each predictor included in each feature selection method.

When the binary logistic regression models fitted to each FS dataset were compared with the anova() function, significant differences in predictive performance with respect to Death were observed at the 5% significance level. Specifically, the Hybrid Boruta-VI and MDG models were significantly superior to the CMIM, correlated, and Hybrid ABL models in predicting Death (Pr(>Chi) < 2.2e−16). On the other hand, the CMIM and Hybrid ABL models (Pr(>Chi): 26.525) and the correlated and Hybrid ABL models (Pr(>Chi): −159.16) were deemed statistically equivalent in predictive performance for Death, and the Hybrid Boruta-VI and MDG models likewise showed no significant difference (Pr(>Chi): −124.74) at the 5% significance level.

To establish relationships between the selected features and the outcome class, and to explore potential interactions among these features, a graphical representation of the correlation matrix was generated for the Hybrid Boruta-VI dataset using the corrplot package [38]. This visualization highlighted the most correlated variables, with correlation coefficients represented by color gradients.
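The correlation matrix underlying this visualization can be computed directly. A minimal pure-Python sketch follows; the paper used R's corrplot, which additionally drops correlations with p > 0.01, and that significance test is omitted here:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of {name: values}.
    Returns {(name_a, name_b): r}. Illustrative only: unlike corrplot,
    no p-value filtering is applied."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}
```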
Positive correlations were depicted in shades of blue and negative correlations in varying intensities of red; the intensity of the color and the size of the circles indicated the strength of the correlation coefficients, providing a visually intuitive representation of the relationships between variables. Correlations with a p-value exceeding 0.01 were considered statistically insignificant and were not included in the graphical representation. Of particular significance within the correlation matrix is the robust positive correlation of 0.63 between UREA and CR. Additionally, the coefficient of 0.48 between UREA and BUN suggests a moderate positive association between these variables. These correlations substantiate the close interrelation among UREA, BUN, and CR, underlining their relevance in assessing kidney function within the dataset. Furthermore, the moderate positive correlations between UREA and both P (0.42) and PROBNP (0.30) point to potential connections between UREA levels and phosphate levels as well as heart function, PROBNP being a recognized marker of cardiac strain. These findings suggest interdependencies among these variables, hinting at physiological relationships that could offer insight into the mechanisms underlying the observed trends (Fig. 2).

Figure 2. Correlation between the selected features and the outcome class in the Hybrid Boruta-VI dataset.

Evaluation of classification models’ performance with different feature selection methods

Figure 3 and Supplementary Table S5 show a rigorous evaluation of model performance using 10 repeats of tenfold cross-validation for each algorithm across the feature selection methods.
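The repeated cross-validation scheme can be sketched as follows. The `fit`/`predict` hooks are illustrative placeholders, not the paper's actual models, and the sketch omits the stratification and tuning that a full R/caret-style pipeline would add:

```python
import random
from statistics import mean, stdev

def repeated_kfold_accuracy(X, y, fit, predict, k=10, repeats=10, seed=0):
    """Estimate a classifier's accuracy with `repeats` repetitions of
    k-fold cross-validation (mirroring the paper's 10 x tenfold scheme).
    `fit(X_train, y_train)` returns a model; `predict(model, x)` returns
    a label. Returns (mean accuracy, standard deviation) over all folds."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # disjoint folds of the shuffle
        for f in folds:
            test = set(f)
            Xtr = [X[i] for i in idx if i not in test]
            ytr = [y[i] for i in idx if i not in test]
            model = fit(Xtr, ytr)
            correct = sum(predict(model, X[i]) == y[i] for i in f)
            scores.append(correct / len(f))
    return mean(scores), stdev(scores)
```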
The accuracy outputs, presented with confidence intervals, show that the Random Forest algorithm consistently outperformed the other algorithms in accuracy across all FS methods.

Figure 3. Comparison of the performance of machine learning algorithms for different feature selection methods using 10 repeats of tenfold cross-validation.

To delve further into the models’ predictive power, ROC curves were generated to summarize the trade-off between the true positive and false positive rates. The Random Forest, C5.0, GBM, XGBoost, and Bagged CART models exhibited ROC curves that trended towards the upper left corner, indicating favorable sensitivity and specificity across the probability thresholds within our resampling method (Fig. 4).

Figure 4. ROC curves for the cross-validated models on the data from the different feature selection methods.

Calibration plots for the Random Forest algorithm (our best predictive model) across the feature selection methods (Fig. 5) indicated minimal bias, with point estimates closely centered on zero for each method (calibration intercepts obtained from the val.prob function of the rms library [5] in R).

Figure 5. The performance of the Random Forest model with different FS methods using 10 repeats of tenfold cross-validation.

The models’ performance was further assessed through external validation on independent test datasets distinct from the training data. Table 3 shows that the predictive performance of the Classification and Regression Trees (CART) model was notably low, whereas the Random Forest model performed best across all feature selection methods. Additionally, Fig. 6 highlights the consistently strong discrimination, calibration, and predictive performance of the Random Forest model across all feature selection methods.

Table 3. Performance comparison between six FS methods for different classifiers.

Figure 6. The performance of the Random Forest model with different feature selection methods on the test data (external validation).

When the F1-score, accuracy, area under the curve (AUC), and precision of the machine learning models are compared across feature selection methods in Fig. 7, Hybrid Boruta-VI exhibits competitive performance but does not consistently outperform the other methods on every metric. Specifically, Hybrid Boruta-VI achieves a commendable F1-score, although comparable or marginally higher scores are reached by other approaches such as MDG and no feature selection (Without FS). In terms of accuracy, Hybrid Boruta-VI performs well, but similar or higher accuracy is observed with methods such as Correlation and Without FS. Moreover, the AUC values obtained with Hybrid Boruta-VI are not the highest among the compared techniques; methods such as Correlation and Without FS show superior AUC scores.
Regarding precision, Hybrid Boruta-VI gives good results, yet alternative methods such as Correlation and Without FS offer comparable or better precision.

Figure 7. Comparison of the F1-score, accuracy, AUC, and precision of the machine learning models across different feature selection methods.

In light of these nuanced findings, statistical analyses were conducted to thoroughly evaluate the efficacy of Hybrid Boruta-VI against the other feature selection methods, and to examine the performance of the Random Forest model relative to the other models. The roc.test function from the pROC R package [39] was used to conduct DeLong’s test, which compares two ROC curves and evaluates the statistical significance of the difference in AUC between two models. In Table 4, we compare the AUCs of the ML algorithms under pairs of FS methods. The results indicate a notable difference in AUC between the hybrid Boruta-VI method and the other FS techniques for most ML algorithms. In particular, significant differences were identified for the models with high AUC values: Random Forest, C5.0, GBM, XGBoost, and Bagged CART. Comparing hybrid Boruta-VI with the absence of feature selection (all features), the AUC values for C5.0, GBM, XGBoost, and Bagged CART differed significantly. However, when comparing hybrid Boruta-VI with the MDG feature selection method, the AUC values were not significantly different. When contrasting hybrid Boruta-VI with the CMIM approach, noteworthy differences in AUC were observed specifically for the C5.0 and GBM models. For hybrid Boruta-VI versus Hybrid ABL, significant disparities in AUC were identified for the Bagged CART and GBM models. Lastly, for Hybrid Boruta-VI versus the correlated FS method, a significant difference in AUC was noted only for the GBM model.
Notably, with the Random Forest model, the difference in AUC for Boruta-VI was not significantly different from zero in comparison with the other FS methods, except for CMIM.

Table 4. Comparison of the AUCs of the machine learning models for pairs of feature selection methods with DeLong’s test (roc.test).

Table 3 and Fig. 6 present comparative performance metrics indicating that the Random Forest algorithm consistently achieved the highest accuracy across all FS methods. To rigorously assess the statistical significance of these results, we conducted hypothesis tests comparing the performance of Random Forest with the other machine learning algorithms on the Hybrid Boruta-VI dataset (Table 5).

Table 5. Statistical tests for pairwise performance comparison between different models and Random Forest on the hybrid Boruta-VI dataset.

The chi-square test results in Table 5 provide strong evidence that the predictions of each model are not attributable to random chance, indicating a meaningful relationship between the predictors and the outcomes. Subsequently, McNemar’s chi-squared test revealed a statistically significant difference in the paired predictions of Random Forest, Bagged CART, Decision Tree (C5.0), and Extreme Gradient Boosting (XGBoost). The notably low p-values from these tests reinforce that this differentiation is not arbitrary, affirming a substantial association between the prediction sets of the models under examination. Moreover, the comparison of area under the curve (AUC) values among the Random Forest, Bagged CART, and Decision Tree (C5.0) models, with their relatively large p-values, suggests that their predictive performance is not significantly distinct.
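McNemar's test on paired predictions reduces to the two discordant cell counts of the paired contingency table; a minimal sketch (the function name and interface are illustrative):

```python
import math

def mcnemar_test(b, c):
    """McNemar's chi-squared test with continuity correction for paired
    binary predictions: `b` = cases model 1 classified correctly and
    model 2 incorrectly, `c` = the reverse. Returns (statistic, p).
    The p-value uses the chi-squared distribution with 1 df, which
    equals erfc(sqrt(x / 2))."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

With many discordant pairs split unevenly (say 10 vs. 20), the test flags a real difference between the two models; a near-even split yields a large p-value.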
Consequently, based on our analytical findings, we can confidently assert that Random Forest outperforms ten alternative models: Generalized Linear Model, LDC, Regularized Regression, k-Nearest Neighbors, Naive Bayes, SVM, CART, Stochastic Gradient Boosting, Neural Network, and XGBoost. While Random Forest demonstrates superior performance over the majority of the models considered, it does not exhibit a statistically significant advantage over the Bagged CART and C5.0 models. Furthermore, the most important predictors identified by the Random Forest model on the Boruta-VI dataset include age, O2sat, UREA, ALBUMIN, CR, and LDH (Fig. 8, Supplementary Fig. S3).

Figure 8. Feature importance of the Random Forest model on the Boruta-VI dataset.
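Importance scores like those reported in Fig. 8 can also be obtained model-agnostically via permutation importance. The paper reports the Random Forest's own importance measure; the sketch below merely approximates that idea for any fitted `predict` function, and all names are illustrative:

```python
import random

def permutation_importance(X, y, predict, seed=0):
    """Permutation importance: shuffle one feature column at a time and
    record the resulting drop in accuracy. `X` is a list of feature rows,
    `predict(row)` returns a label for one row. A larger drop means the
    feature matters more to the model's predictions."""
    rng = random.Random(seed)
    base = sum(predict(row) == t for row, t in zip(X, y)) / len(y)
    importances = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature-outcome association
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        acc = sum(predict(row) == t for row, t in zip(Xp, y)) / len(y)
        importances.append(base - acc)  # accuracy drop for feature j
    return importances
```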
