Predicting stroke occurrences: a stacked machine learning approach with feature selection and data preprocessing | BMC Bioinformatics

The growth of the global population has coincided with a concerning surge in brain stroke cases, with a notable increase in annual fatalities by 2023. As stroke-related deaths rise, addressing this crisis has become increasingly urgent, and the trend has propelled stroke research to the forefront of medical exploration.

Machine learning algorithms have shown promise in stroke prediction by analyzing extensive datasets encompassing demographic information, medical histories, and markers such as age, blood pressure, and glucose levels [1, 2]. However, deploying these algorithms in clinical settings presents challenges that must be addressed. One significant concern is bias embedded in the training data, which can lead to skewed predictions and inequitable healthcare outcomes [3]. Biases may arise from incomplete or unrepresentative datasets, socioeconomic factors, or disparities in healthcare access.

To mitigate these challenges, ensemble learning methods such as stacking have emerged as a robust approach. Our stacking model integrates the predictions of two base classifiers, random forest and decision tree, through a K-nearest neighbors (KNN) final estimator to enhance predictive accuracy and robustness. By combining multiple classifiers, stacking can mitigate the impact of biases inherent in individual models and improve the generalization capability of the overall predictive system.

Additionally, principal component analysis (PCA) is a dimensionality reduction technique that transforms complex datasets into a lower-dimensional space while retaining most of the essential information. PCA simplifies data representations by linearly transforming the original features into orthogonal features known as principal components, which are ordered by the variance they explain.
By identifying the eigenvectors and eigenvalues of the covariance matrix of the data, PCA captures the directions of maximum variance and their respective magnitudes [4,5,6]. PCA finds applications in various domains, including data visualization, noise reduction, and feature extraction [7, 8]. By applying diverse machine learning algorithms and ensemble learning strategies to the predictive analysis of ischemic brain stroke, the proposed research achieves a predictive accuracy of 98.6%.

Ensemble learning has become a focal point in the machine learning and computational intelligence fields because it offers a way to enhance prediction accuracy by pooling multiple classifiers. While initially used to improve classification accuracy, ensemble methods have evolved to tackle a wide range of real-world problems such as adapting to changing concepts, correcting errors, selecting the most relevant features, learning incrementally, and estimating confidence levels. Researchers have studied various fusion techniques and the components that make up ensembles in depth, leading to significant advancements in recent years [9,10,11,12].

The benefits of this research are multifaceted. Methodologically, it enhances prediction accuracy by combining multiple machine learning algorithms and uses data efficiently through proper preprocessing and dimensionality reduction. Clinically, this method enables early detection of high-risk individuals, allowing for timely intervention and better resource allocation, and supports personalized medicine by tailoring treatment plans to individual risk profiles.
Additionally, the approach aids research by elucidating key risk factors, driving further investigation into stroke prevention and treatment. Overall, this method contributes to early detection and prevention efforts, improving patient outcomes and addressing stroke-related healthcare challenges [13, 14].

This paper seeks to bridge the gap between machine learning and brain stroke identification. By harnessing ensemble methods and classifier fusion, it aims not only to improve predictive accuracy but also to streamline the early identification of strokes. If successful, these advancements could improve medical practice, paving the way for more effective interventions and ultimately saving lives.

Motivation

We propose an approach to stroke prediction that leverages advanced machine learning techniques and introduces a novel stacking methodology. Our contribution lies in showing the robust performance of this stacking technique across a range of crucial healthcare metrics. We demonstrate the potential of the proposed approach to enhance patient outcomes and healthcare management strategies.

Literature survey

Stroke prediction research has advanced significantly through the application of machine learning (ML) techniques, contributing to improved accuracy and timely interventions. This review synthesizes findings from recent studies on ML approaches for stroke prediction, emphasizing algorithmic performance, feature selection methodologies, model interpretability, and key results.

In [15], a stroke detection algorithm is presented that employs various ML classifiers such as Naïve Bayes, logistic regression, XGBoost, and support vector machines (SVM). Notably, the SVM outperformed the other models, achieving an accuracy of 98.6% and a precision of 99.9%.
However, the paper lacks explicit discussion of feature selection and data preprocessing strategies.

In [16], researchers develop an ML-based stroke prediction algorithm that uses readily available data from patients’ hospital presentations and investigate the impact of social determinants of health (SDoH) variables. The study reports high sensitivity and reasonable specificity for the algorithm, with significant improvements upon the inclusion of individual-level SDoH features. Importantly, the experimental results show that ML classifiers consistently outperform logistic regression, with the AUC improving from 0.694 to 0.823 when SDoH features are included.

Moreover, [17] employs logistic regression (LR) with recursive feature elimination (RFE) to predict stroke and transient ischemic attack (TIA) diagnosis, highlighting the predictive utility of patient-reported symptoms. The ML techniques achieve an AUC exceeding 0.94 for stroke outcome prediction, with notable enhancements upon incorporating follow-up data.

In [18], the stacking classification method emerges as the superior approach, performing well across multiple metrics, including an AUC of 98.9% and an accuracy of 98%. The study underscores the efficacy of a stacking ensemble comprising base classifiers such as naive Bayes and random forests, with a logistic regression meta-classifier.

Additionally, [19] explores the interpretability of ML models for stroke prediction using the SHAP and LIME techniques. Random Forest emerges as the top-performing algorithm with an accuracy of 90.36%, followed closely by the XGB Classifier with an accuracy of 89.02% [20,21,22].

In [23], ML is applied to predict early signs of ischemic stroke in emergency settings, although its predictive accuracy is constrained, as measured by the area under the receiver operating characteristic curve (AUC).
The study highlights the superior predictive power of the XGBoost-based model for pre-screening ischemic stroke, emphasizing in particular the effectiveness of ML-based models that use clinical laboratory features. The XGBoost-based model attains the highest accuracy in predicting ischemic stroke and is validated across multiple datasets. It also achieves high average sensitivities and specificities across the training, internal validation, and external validation datasets, indicating its reliability for screening patients with ischemic stroke.

In [24], deep learning models are employed to forecast major adverse cerebrovascular events (MACEs) following acute ischemic stroke (AIS), providing personalized outcome predictions at the individual level. By leveraging clinical data and brain imaging, these models exhibit enhanced predictive accuracy for MACEs after AIS. Notably, deep learning techniques such as DeepSurv and Deep-Survival-Machines surpass traditional survival models. Furthermore, the study provides comprehensive validation results, including AUC values and performance metrics such as sensitivity, specificity, classification accuracy, precision, F1 score, and log loss across the training, internal validation, and external validation datasets. These results underscore the reliability and robustness of deep learning models in predicting outcomes for AIS patients, offering valuable insights for clinical decision-making and patient management [21, 25,26,27].

The reviewed literature, summarized in Table 1, highlights the diverse ML approaches used in stroke prediction and the results they achieve. These findings underscore the potential of ML techniques to enhance stroke risk assessment, thereby facilitating proactive interventions and improving patient outcomes.
However, further research is warranted to address challenges related to feature selection, model interpretability, and real-world validation.

Table 1 Summary of machine learning approaches for stroke prediction

Aim

This research aims to develop a new approach to the predictive analysis of ischemic brain stroke with machine learning techniques. Initially, the study focuses on using preference algorithms to discern the key traits with several machine learning techniques such as logistic regression, support vector machine, decision tree, and K-nearest neighbor. We used PCA to reduce the dimensionality of the dataset.

The contributions of our study are as follows:

Demonstrated the effectiveness of Principal Component Analysis (PCA) in optimizing model accuracy for stroke prediction.

Identified an optimal PCA configuration, specifically with 16 components, achieving a significant improvement in predictive performance.

Implemented a stacking ensemble method combining Random Forest, Decision Tree, and K-Nearest Neighbors (KNN), resulting in a high accuracy of 98.6%.

Showcased the potential of advanced machine learning techniques in enhancing stroke risk assessment and guiding preventive healthcare strategies.
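
The configuration named in the contributions above (PCA with 16 components feeding a stacking ensemble of Random Forest and Decision Tree base classifiers with a KNN final estimator) could be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic data from make_classification stands in for the stroke dataset, and the standardization step, the train/test split, the number of trees, the number of neighbors, and the 5-fold internal cross-validation are all assumed values chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with 20 numeric features replaces the stroke dataset.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base classifiers and final estimator as named in the text;
# all hyperparameters below are assumptions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=KNeighborsClassifier(n_neighbors=5),
    cv=5,  # out-of-fold base predictions are used to train the KNN
)

# Standardize (assumed step), project onto 16 principal components, then stack.
model = make_pipeline(StandardScaler(), PCA(n_components=16), stack)
model.fit(X_train, y_train)

# Fraction of variance explained by each component, in decreasing order.
explained = model.named_steps["pca"].explained_variance_ratio_
accuracy = model.score(X_test, y_test)
print(f"variance retained by 16 components: {explained.sum():.3f}")
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Placing the scaler and PCA inside the pipeline means both are fitted on the training split only, which avoids leaking test-set statistics into the projection. The accuracy printed here reflects the synthetic data and says nothing about the 98.6% reported for the actual study.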

The subsequent sections of this paper are organized as follows: in Sect. 2, we elaborate on the feature selection method and the classifiers. Following that, in Sect. 3, we present the experiments and results of our study, including a comparative analysis of the proposed model against other state-of-the-art methods.
