Improving prediction of blood cancer using leukemia microarray gene data and Chi2 features with weighted convolutional neural network

This section details the dataset, techniques, and methods applied to blood cancer prediction. The complete, step-by-step proposed methodology is shown in Fig. 1.

Figure 1 Proposed methodology diagram.

Description of dataset
This study uses the Leukemia_GSE28497 dataset23. It contains 281 samples in total, 22,283 genes (features), and seven classes. The class-wise count of the dataset is given in Table 2.

Table 2 Class-wise count of the dataset.

Table 3 provides a sample from the original dataset, illustrating the raw data obtained with the microarray gene technique. The class attribute describes the blood cancer type, while the remaining columns hold the gene properties used to distinguish affected individuals from unaffected individuals. The values of these features are determined by a microarray test. Each row describes a single instance and contains information about the blood sample measured by the microarray.

Table 3 Sample from the dataset.

Data preprocessing
The dataset was made available via the National Centre for Biotechnology Information (NCBI) website. Subsequently, preprocessing steps are applied to enhance the effectiveness of the learning models24. This includes resampling to balance the dataset with the SMOTE-Tomek technique, which generates data for the minority classes, as detailed in Table 4.

Table 4 Count of the target after SMOTE-Tomek.

The Chi-square method is then used for feature selection to reduce complexity. The top 300 gene features retained for model fitting are given in Table 5.

Table 5 Count of features after the Chi-2 feature extraction technique.

The number of selected features is guided by empirical findings. The preprocessed dataset is then divided into training and testing sets in an 85:15 ratio, as given in Table 6.

Table 6 Training-testing ratio of all features.

With 85% dedicated to training, the models receive ample data for effective learning, given the dataset's overall size. After training, the remaining 15% is used to assess model performance with measures such as F1 score, recall, accuracy, and precision. This preprocessing ensures the dataset is well prepared for training and evaluation, enhancing the learning models' performance.

SMOTE-Tomek
SMOTE-Tomek is a technique that addresses the issue of class overlap by combining SMOTE and Tomek Links. SMOTE introduces new synthetic instances randomly for the minority classes, while Tomek Links identifies pairs of instances from different classes25. If the distance between a new instance and either member of a selected pair is smaller than the distance between the pair itself, the chosen pair is eliminated. In essence, Tomek Links can eliminate instances situated at the boundaries of multiple classes by considering the adjacencies between instances near these boundaries. It targets instances that appear to belong to multiple classes as a result of SMOTE, thereby eliminating those responsible for overlap. However, a removed instance is not necessarily adjacent to its proper class, which can increase the chance of introducing unwanted noise.

Chi-square
Chi-square is a non-parametric statistical method used for feature selection, specifically for selecting the top n features, and it is widely employed in data analysis tasks. Chi-2 assesses the independence between a given term and the presence of a specific class. For a given document D, the score for each term is estimated and ranked using Eq. (1)
$$\begin{aligned} X^2(D,t,c)=\sum _{e_t\in \left\{ 0,1\right\} } \sum _{e_c\in \left\{ 0,1\right\} } \frac{(N_{e_t e_c}-E_{e_t e_c})^2}{E_{e_t e_c}} \end{aligned}$$
(1)
In this context, let N represent the observed frequency and E denote the expected frequency. The variable \(e_t\) is assigned a value of 1 when the term t is present and 0 otherwise, and \(e_c\) is assigned a value of 1 if the document belongs to class c and 0 otherwise. A significant Chi2 score for a feature indicates that the null hypothesis H0 of independence (i.e., that the document class does not influence the term's frequency) should be rejected, implying interdependence between the feature and the class. Consequently, in such instances, it is advisable to select the microarray gene feature for model training.

Supervised machine learning models
This study uses a variety of machine learning models, such as KNN, RF, LR, ETC, SVC, ADA, NB, DT, and the proposed approach, WVCNN, to predict blood cancer. Table 7 provides implementation details and hyperparameter settings for all of these machine learning models. To find the best parameters, grid search is used: different values for each parameter are tried within a specified range, and the model's performance is evaluated for each combination. Every parameter goes through this procedure, and the values that optimize the model's performance are selected at the end.

Table 7 Hyper-parameter tuning of all supervised learning models.

Random forest
RF is a model that achieves high predictive accuracy by using an ensemble of decision trees26 and combining their results. It uses a technique called bagging, where multiple decision trees are trained on different subsets of the data. In each subset, the training data is sampled with replacement, so some data points may be repeated, and the size of each sample matches the size of the original training data. When making predictions, RF and other ensemble classifiers follow similar processes for combining their constituent trees. A key difficulty in developing these models is deciding which attribute forms the main decision (split) point at each step; the final prediction is obtained by aggregating the outputs of the B individual trees, as given in Eq. (2).$$\begin{aligned} F(x_t)= \frac{1}{B}\sum _{i=1}^{B} F_i (x_t) \end{aligned}$$
(2)
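A minimal sketch of the pipeline described above, assuming scikit-learn and imbalanced-learn are used (the paper does not name its implementation libraries): SMOTE-Tomek resampling, Chi-square selection of the top 300 gene features, an 85:15 train-test split, and grid-search tuning of the RF classifier. The data loader and the parameter grid are illustrative placeholders, not the exact values from Table 7.

```python
# Illustrative preprocessing and tuning pipeline (library choices are assumptions).
from imblearn.combine import SMOTETomek
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, 22283) non-negative microarray expression matrix, y: class labels.
X, y = load_leukemia_gse28497()  # hypothetical loader for the NCBI dataset

# 1) Balance the classes with SMOTE-Tomek (Table 4).
X_bal, y_bal = SMOTETomek(random_state=42).fit_resample(X, y)

# 2) Keep the top 300 genes ranked by the Chi-square score of Eq. (1) (Table 5).
selector = SelectKBest(score_func=chi2, k=300)
X_sel = selector.fit_transform(X_bal, y_bal)

# 3) 85:15 training-testing split (Table 6).
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y_bal, test_size=0.15, stratify=y_bal, random_state=42)

# 4) Grid search over candidate RF hyper-parameters (illustrative grid).
param_grid = {"n_estimators": [100, 200, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```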
Logistic regression
LR is a mathematical method that processes information from one or more variables to find a solution27. It is specifically used for predicting probabilities of class membership when dealing with categorical target variables. LR uses a logistic function to assess the likelihood of a relationship between the dependent variable and one or more independent variables, which makes it a highly suitable learning model when the target variable is categorical. Equation (3) shows the logistic function used by LR$$\begin{aligned} f(v)=\frac{L}{1+ e^{-m(v-v_0)}} \end{aligned}$$
(3)
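A small numeric sketch of the generalized logistic function in Eq. (3), reading L as the curve's maximum value, m as the steepness, and v0 as the midpoint (an interpretation consistent with the standard generalized logistic; the paper does not define these symbols explicitly):

```python
import numpy as np

def logistic(v, L=1.0, m=1.0, v0=0.0):
    """Generalized logistic of Eq. (3): f(v) = L / (1 + exp(-m * (v - v0)))."""
    return L / (1.0 + np.exp(-m * (v - v0)))

# With L = 1, m = 1, v0 = 0 this reduces to the standard sigmoid used by LR.
print(logistic(0.0))  # 0.5
print(logistic(2.0))  # ~0.88
```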
Support vector classifier
Classification involves organizing a dataset into categories using specific criteria so as to provide a more meaningful grouping28. The classification approach known as SVC is based on the support vector methodology. SVC's primary goal is to identify the best-fitting hyperplane that effectively separates the given data. Once the hyperplane has been constructed, features can be fed into the classifier to obtain the predicted class. This algorithm is well suited to various applications, including the present one, as it can be employed in many different contexts.

K-nearest neighbors
KNN is a fundamental machine learning model used for both regression and classification problems29,30. It assigns a data point to a class based on its closest neighbors, determining proximity through a distance measure. In this experiment, KNN proves effective when k is set to five (k = 5), meaning that the model considers the five nearest neighbors and selects a class according to the closest or majority vote among them.

Naive Bayes
The NB algorithm is a classification algorithm based on the Bayes theorem31. It is a supervised learning algorithm known for its efficiency and scalability when trained with a limited amount of data. As a probabilistic classifier, NB predicts the likelihood that an item belongs to a specific class. A key assumption of the NB classifier is that each feature's likelihood is independent of the others and that the features do not overlap. This means that every attribute plays an equal role in determining whether a sample is a member of a certain class. The NB classifier is simple to use, computes rapidly, and performs well on large, high-dimensional datasets.

Extra trees classifier
The ETC operates similarly to the random forest, with the main difference lying in the tree-building process32. Every decision tree in ETC is built from the original training sample. At each node, the Gini index is used to find the optimum split, chosen from a random sample of k candidate features. This results in multiple decision trees built from these random feature samples, which helps ensure the trees are not highly correlated. The decision tree algorithm is versatile, suitable for both categorical and numerical data, and performs effectively in various scenarios.

Decision tree
The DT is a widely used and powerful data-mining technique that has been extensively developed and tested by numerous researchers33. Despite its effectiveness, it is essential to acknowledge the presence of data errors during the learning process. Consequently, working with substantial volumes of data is crucial to creating a decision tree algorithm capable of producing a straightforward tree structure with high classification accuracy. In this study, the leukemia microarray dataset described above is used to implement the decision tree algorithm. The strategic division of the data significantly impacts the accuracy of the tree, with different decision criteria used for classification and regression tasks. The entropy for a single attribute is expressed mathematically by Eq. (4)$$\begin{aligned} Entropy(S)= \sum _{i=1}^{c} - P_i\log _2 P_i \end{aligned}$$
(4)
Entropy is expressed mathematically for multiple attributes by Eq. (5)$$\begin{aligned} Entropy(T,X) = \sum _{c \in X} P(c) E(c) \end{aligned}$$
(5)
IG is defined mathematically by Eq. (6)$$\begin{aligned} Information\ Gain (T,X) = Entropy(T) - Entropy(T,X) \end{aligned}$$
(6)
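A short sketch of Eqs. (4)-(6), computing the entropy of a label set, the conditional entropy under a split on an attribute, and the resulting information gain. This is a generic illustration with made-up values, not code from the paper:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Eq. (4): Entropy(S) = -sum_i P_i * log2(P_i)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(attribute, labels):
    """Eq. (5): Entropy(T, X) = sum over values c of X of P(c) * Entropy(T_c)."""
    total = len(labels)
    ent = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        ent += (len(subset) / total) * entropy(subset)
    return ent

def information_gain(attribute, labels):
    """Eq. (6): IG(T, X) = Entropy(T) - Entropy(T, X)."""
    return entropy(labels) - conditional_entropy(attribute, labels)

# Tiny hypothetical example: a binary gene indicator against a two-class label.
gene_high = [1, 1, 0, 0, 1, 0]
label = ["ALL", "ALL", "AML", "AML", "ALL", "AML"]
print(information_gain(gene_high, label))  # 1.0: this split separates the classes perfectly
```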
AdaBoost
This ensemble learning model uses a method called boosting to train weak learners, such as decision trees. ADA, short for adaptive boosting, is historically important34: it was the first boosting algorithm that could adapt to its weak learners. The ADA method combines weak learners by training them sequentially on re-weighted copies of the original dataset, so that each successive learner concentrates on hard data points and outliers. It works as an overall model, using N iterations of weak learners trained on the same set of features and combined with different weights. ADA is supported by strong theoretical justification, and many experiments have shown that it often outperforms other machine learning systems.

The study used the ADA method with different hyperparameter values, tuned to obtain high accuracy. The 'n_estimators' value was set to 300, meaning that ADA used 300 weak learners to make predictions. It is important to point out the difference between RF and ADA: RF uses bagging, while ADA uses a boosting strategy. Boosting combines weak learners into one strong learner by weighting them in a specific way. 'Random_state' is another parameter used; it controls the randomness of the sampling during training.

Artificial neural networks
Artificial neural networks (ANNs) are computational models inspired by the human brain's neural networks, designed to recognize patterns and solve complex problems35. ANNs consist of interconnected layers of neurons, including input, hidden, and output layers, where each neuron processes input data and passes it through an activation function. These models are widely used in applications such as image and speech recognition, natural language processing, and predictive analytics due to their ability to learn and generalize from data. Training an ANN involves adjusting the weights of the connections between neurons using algorithms like backpropagation to minimize error. Despite their power, ANNs can require large amounts of data and computational resources to achieve high performance.

Multi-layer perceptron
Multi-layer perceptron (MLP) is a type of artificial neural network that consists of multiple layers of neurons, typically including an input layer, one or more hidden layers, and an output layer30. Each neuron in an MLP uses a nonlinear activation function to process input data, allowing the network to capture complex patterns and relationships in the data. MLPs are particularly effective for supervised learning tasks such as classification and regression. The backpropagation algorithm is commonly used to train MLPs by adjusting the weights of the connections to minimize the error between the predicted and actual outputs. MLPs have been successfully applied in various fields, including image and speech recognition, finance, and healthcare.

Proposed methodology
The architecture of the proposed technique using the weighted CNN for blood cancer prediction is depicted in Fig. 2. The relevant details of how it works are provided in the subsequent sections.

Figure 2 Weighted voting ensemble using convolutional neural networks.

Weighted voting ensemble
A weighted voting ensemble (W.V.E.) of CNN models is essential for improving robustness and prediction accuracy36. Ensemble learning techniques are particularly beneficial in prediction tasks where accuracy and reliability are critical.
This is because these approaches combine the benefits of numerous CNN models, each of which has a unique advantage in capturing different aspects of the data. In addition to improving accuracy, this variety strengthens predictions against noise and outliers, since models that are more susceptible to them can be assigned lower weights. These ensembles also promote improved generalization, mitigate overfitting, lower prediction variance, and balance the biases of individual models. Their versatility and capacity to fine-tune model weights make them a flexible tool for modeling intricate decision boundaries and adapting to shifting data patterns37. In summary, weighted voting ensembles of CNN models are a useful tool for achieving more precise and reliable predictions in a variety of applications. In this paper, an ensemble model, WVCNN, is proposed, derived from three 1-D CNN models, to attain high accuracy and leverage the robust decision-making strength of multiple models for enhanced cancer prediction. The structures of the three contributing CNN models are described below38.

ECN-1 (ensemble convolutional network 1)
Ensemble convolutional network 1 (ECN-1) is a proposed sequential convolutional model fed with a 1-D feature vector v' of dataset D. As shown in Table 8, the ECN-1 model is composed of a 1-D convolution layer with a filter number Fn of 4 and a kernel size Ks of 3. Except for the last layer, every layer of the model uses ReLU as its activation function. The feature extraction layers are followed by the final dense layer, which has as many neurons as there are classes being trained on, with a Softmax activation to obtain a probability distribution over the class labels. This dense layer has six neurons.

ECN-2 (ensemble convolutional network 2)
To obtain class probability values, the second sequential convolutional model, ECN-2, is fed the feature vector v'. Table 8 details the model design, which includes a 1-D convolutional layer with Fn and Ks both set to 3. To prevent over-fitting, a flatten layer, a dense layer with eight neurons, and a dropout layer with a rate of 0.3 are added next. The final layer is a dense layer with as many neurons as the number of classes.

Table 8 CNN ensemble models' recommended architecture.

ECN-3 (ensemble convolutional network 3)
To obtain classification scores, the proposed sequential network ECN-3 receives the input feature vector v', just like the first two models. The model's layer structure consists of a 1-D convolutional layer with Fn = 5 and Ks = 3, with ReLU serving as the activation function, as shown in Table 8. The next set of layers uses a softmax classifier and consists of a flatten layer and an output dense layer with as many neurons as there are classes to be predicted. Table 9 provides the parametric setting for every model in the ensemble. Adam is utilized as the optimizer for all models in the ensemble because of its faster computation and simpler parametric configuration requirements. The gradient's scaling term (\(\beta _2\)) and momentum (\(\beta _1\)) are set at 0.999 and 0.9, respectively, while the learning rate (\(\alpha\)) is set at 0.001. Likewise, a tiny positive number is assigned to the epsilon value to prevent division by zero.

Table 9 Parametric configuration of the proposed WVCNN ensemble models.

Weighted voting convolutional neural network
The final prediction values are obtained by an ensemble voting regimen that uses the output probabilities of the three proposed models, ECN-1, ECN-2, and ECN-3.
Since each model's contribution to the final probability depends on a preset weight value, the agreed-upon class probabilities represent the outcome of the vote from each model37,39. In our case, the weights W1, W2, and W3 for the proposed models ECN-1, ECN-2, and ECN-3, as shown in Fig. 2, are set to 0.4, 0.3, and 0.3, respectively. For a given input feature vector V, the final probability vectors of the models ECN-1, ECN-2, and ECN-3 are denoted by p1, p2, and p3. The weighted average of the probabilities over all three models is given in Eq. (7).$$\begin{aligned} \hat{P}_{ens}=\sum _{i=1}^{n}W_ip_i \end{aligned}$$
(7)
Choosing the class with the greatest likelihood is done by Eq. (8)$$\begin{aligned} P_{out}=argmax(\hat{P}_{ens}) \end{aligned}$$
(8)
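A condensed sketch of the proposed ensemble, assuming a TensorFlow/Keras implementation (the paper does not name its framework). It builds the three 1-D CNNs with the filter counts, kernel size, and Adam settings described above (the exact layer details are in Tables 8 and 9; the class count follows the dataset description) and combines their softmax outputs with the weights of Eqs. (7) and (8):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATURES, N_CLASSES = 300, 7  # top-300 genes; class count per the dataset description

def build_ecn(filters, extra_dense=None, dropout=None):
    """Generic builder for ECN-1/2/3: Conv1D -> (optional dense/dropout) -> softmax."""
    m = models.Sequential([
        layers.Input(shape=(N_FEATURES, 1)),
        layers.Conv1D(filters, kernel_size=3, activation="relu"),
        layers.Flatten(),
    ])
    if extra_dense:
        m.add(layers.Dense(extra_dense, activation="relu"))
    if dropout:
        m.add(layers.Dropout(dropout))
    m.add(layers.Dense(N_CLASSES, activation="softmax"))
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 beta_1=0.9, beta_2=0.999),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return m

ecn1 = build_ecn(filters=4)                              # Fn = 4, Ks = 3
ecn2 = build_ecn(filters=3, extra_dense=8, dropout=0.3)  # Fn = 3, dense(8), dropout 0.3
ecn3 = build_ecn(filters=5)                              # Fn = 5, Ks = 3

def wvcnn_predict(ecn_models, X, weights=(0.4, 0.3, 0.3)):
    """Eqs. (7)-(8): weighted average of the class probabilities, then argmax."""
    p_ens = sum(w * m.predict(X, verbose=0) for w, m in zip(weights, ecn_models))
    return np.argmax(p_ens, axis=1)

# Usage (after fitting each ECN on training data reshaped to (n, 300, 1)):
# y_pred = wvcnn_predict([ecn1, ecn2, ecn3], X_test.reshape(-1, N_FEATURES, 1))
```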
Evaluation parameters
Performance measures for the machine learning models are evaluated, including recall, precision, F1 score, and accuracy. All these metrics are computed from the confusion matrix, a tabular tool that shows how well the model performs in classifying the test data.

Accuracy
The accuracy score reflects how closely the model's predictions match the actual results and quantifies the model's capability to make correct predictions. It is computed by dividing the number of correct predictions by the total number of predictions. An ideal model achieves a perfect accuracy score of 1, while the lowest possible score is 0. Accuracy is defined in Eq. (9)$$\begin{aligned} Accuracy (A)=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(9)
where,

True positive (TP): when a patient's real label is ‘Healthy’ and the model correctly predicts them to be ‘Healthy’. This shows that the patient's actual condition and the model's prediction match.

True negative (TN): when the model accurately predicts a patient as ‘Un-Healthy’ and the actual label is ‘Un-Healthy’. Like TP, TN signifies that the model's prediction aligns with the patient's actual condition.

False positive (FP): when the model incorrectly predicts a patient as ‘Healthy’ while the true label is ‘Un-Healthy’. This represents a scenario where the model fails to identify an infected patient, leading to a false sense of normalcy.

False negative (FN): when the model erroneously predicts a patient as ‘Un-Healthy’ when the true label is ‘Healthy’. FN indicates that the model misclassifies a healthy patient as infected, resulting in unnecessary concern or treatment.
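As a small worked illustration of these four quantities under the paper's convention (‘Healthy’ as the positive class), consider a hypothetical binary example with made-up labels:

```python
import numpy as np

# Hypothetical binary example: 1 = 'Healthy' (positive), 0 = 'Un-Healthy' (negative).
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

TP = int(((y_pred == 1) & (y_true == 1)).sum())  # predicted Healthy, actually Healthy
TN = int(((y_pred == 0) & (y_true == 0)).sum())  # predicted Un-Healthy, actually Un-Healthy
FP = int(((y_pred == 1) & (y_true == 0)).sum())  # predicted Healthy, actually Un-Healthy
FN = int(((y_pred == 0) & (y_true == 1)).sum())  # predicted Un-Healthy, actually Healthy
print(TP, TN, FP, FN)  # 3 3 1 1
```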

Precision
Precision is the number of correctly predicted positive cases relative to all cases predicted as positive. The best precision score a model can achieve is 1, and the lowest is 0. Precision is calculated using Eq. (10)$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(10)
Recall
Recall, also known as the True Positive Rate (TPR) or Sensitivity, indicates how successfully the classifier can recognize all positive samples. It is calculated by dividing the number of True Positives (TP) by the sum of TP and False Negatives (FN). A model can have a maximum recall score of 1 and a minimum of 0. Eq. (11) is used to calculate recall$$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
(11)
F1 score
The balance between recall and precision is represented by the F1 score, also known as the F measure. It captures the trade-off between these two metrics, with scores ranging from 0 to 1. The F1 score is calculated using Eq. (12)$$\begin{aligned} F1-Score= 2 \times \frac{Precision\times Recall}{Precision+Recall} \end{aligned}$$
(12)
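Continuing the hypothetical counts from the earlier example, Eqs. (9)-(12) can be evaluated directly; for the seven-class leukemia task these metrics are typically averaged across classes (e.g., macro-averaged), and scikit-learn can compute that if it is the library in use. A minimal sketch:

```python
# Hypothetical counts from the example above, plugged into Eqs. (9)-(12).
TP, TN, FP, FN = 3, 3, 1, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)                 # 0.75
precision = TP / (TP + FP)                                 # 0.75
recall = TP / (TP + FN)                                    # 0.75
f1 = 2 * precision * recall / (precision + recall)         # 0.75
print(accuracy, precision, recall, f1)

# Multi-class case (assuming scikit-learn), macro-averaged over the seven classes:
# from sklearn.metrics import precision_recall_fscore_support
# precision_recall_fscore_support(y_true, y_pred, average="macro")
```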
