A gene selection algorithm for microarray cancer classification using an improved particle swarm optimization

Datasets

In this manuscript, eight microarray datasets covering various types of cancer are used: Leukemia, Brain Cancer, Colon Cancer, SRBCT (Small Round Blue Cell Tumors), Lung Cancer, Lymphoma, 11_Tumors, and Diffuse Large B-Cell Lymphoma (DLBCL). These datasets are publicly available and were sourced from the GEMS system (http://www.gems-system.org); they are described in detail in Table 2. Table 1 defines the parameter settings. The selected datasets span a diverse range of cancers, which helps demonstrate the generalizability and robustness of the gene selection technique across cancer types. Each of these cancers has a distinct gene expression profile, making them well suited for evaluating how effectively gene selection methods identify relevant biomarkers.

Fivefold cross-validation is used in this manuscript because it is generally regarded as a good compromise between the bias and variance of the performance estimate. Each fold trains on 80% of the samples, so enough training data remains to keep the estimate from being overly pessimistic (low bias), while averaging over five held-out folds keeps the estimate reasonably stable (low variance). Compared to leave-one-out cross-validation or higher k-fold values, fivefold cross-validation is computationally less intensive, making it suitable for large microarray datasets. It is also a widely accepted method in the bioinformatics and machine learning literature, providing a balance between reliability and computational demand.

Table 1 Parameters used for the proposed method.

The parameter tuning process for SIW-APSO involves setting and optimizing the values of key parameters to achieve the best performance of the algorithm. The parameters, highlighted in blue in Table 1, are discussed below together with how they were chosen and their impact on the performance of SIW-APSO.
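As a rough illustration of where the Table 1 settings enter a particle swarm update, the following minimal sketch implements a textbook PSO step with a constant inertia weight. It is not the authors' self-inertia-weight adaptive scheme and it does not perform gene selection: the toy `sphere` objective, the initialization range, and all function names are assumptions made for this example only.

```python
import random

# Settings as reported for Table 1 in the text.
SWARM_SIZE = 30
ITERATIONS = 100
C1, C2 = 2.0, 2.0          # acceleration coefficients
INERTIA = 0.9              # constant here; the paper adapts it (not shown)
V_MAX, V_MIN = 4.0, -4.0   # velocity limits


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def update_particle(position, velocity, personal_best, global_best, rng):
    """One textbook PSO step for a single particle (lists of floats)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        v_next = INERTIA * v + C1 * r1 * (p - x) + C2 * r2 * (g - x)
        v_next = clamp(v_next, V_MIN, V_MAX)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


def run_pso(objective, dim, seed=0):
    """Minimize objective over dim dimensions; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    # Assumed initialization range [-1, 1]; the source does not state one.
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(SWARM_SIZE)]
    velocities = [[0.0] * dim for _ in range(SWARM_SIZE)]
    personal_best = [p[:] for p in positions]
    personal_cost = [objective(p) for p in positions]
    best = min(range(SWARM_SIZE), key=personal_cost.__getitem__)
    global_best, global_cost = personal_best[best][:], personal_cost[best]
    for _ in range(ITERATIONS):
        for i in range(SWARM_SIZE):
            positions[i], velocities[i] = update_particle(
                positions[i], velocities[i], personal_best[i], global_best, rng)
            cost = objective(positions[i])
            if cost < personal_cost[i]:
                personal_best[i], personal_cost[i] = positions[i][:], cost
                if cost < global_cost:
                    global_best, global_cost = positions[i][:], cost
    return global_best, global_cost


def sphere(point):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(x * x for x in point)


best_position, best_cost = run_pso(sphere, dim=3)
```

In a gene selection setting the objective would instead score a candidate gene subset, for example by cross-validated classifier accuracy; that part is omitted here because the source does not give its exact form.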
A swarm size of 30 is a common choice that balances diversity and computational efficiency. The number of iterations is set to 100 to balance solution quality and computational time. C1 and C2 are typically set to the same value (2) to balance exploration (C1) and exploitation (C2). If C1 is much larger than C2, the particles may explore too much without converging; conversely, if C2 is much larger than C1, the particles may converge prematurely without adequately exploring the search space. An inertia weight of 0.9 favors exploration in the initial stages of the optimization process. The velocity limits, Vmax = 4 and Vmin = −4, ensure that particles explore the search space effectively without oscillating too much.

The experimental study has been conducted on the eight microarray datasets shown in Table 2, including SRBCT, Lung cancer, Colon cancer, Brain cancer, Leukemia, and Lymphoma.

Table 2 Eight microarray dataset genes.

The proposed method, self-inertia weight PSO with ELM, classifies the eight microarray datasets to verify the selected gene subsets. Each experiment is repeated 500 times, and the mean accuracies and standard deviations are listed in Table 3.

Table 3 Classification accuracy using various gene subsets.

The prediction ability of the selected gene subsets

The predictive accuracy of the selected gene subsets is confirmed using the proposed SIW-APSO-ELM method; the accuracy and standard deviation over the 500 runs are shown in Table 3. The proposed method obtains high accuracy on brain cancer, reaches 100% accuracy in several cases (see Table 3), and obtains excellent accuracy on the colon data. These results indicate that SIW-APSO-ELM can select highly valuable prognostic genes. Table 4 shows the best accuracy of the proposed SIW-APSO-ELM method using fivefold validation on the eight cancer datasets.

Table 4 Comparison of the gene selection algorithms using particular datasets with fivefold validation.

Table 5 gives the KCNN classification accuracy with different parameters. We achieved 97% on Colon and 100% on SRBCT, which are good results for the parameters used. We achieved 100% accuracy on Lymphoma, while accuracy was slightly lower on 11_Tumors, at approximately 97%.

Table 5 KCNN classification accuracy.

Table 6 gives the SVM classification accuracy with different parameters. We achieved 97% on Colon and 100% on SRBCT when we applied BPSO-GCS-ELM. When we applied SIW-APSO-ELM, we achieved 100% accuracy on Lymphoma, while accuracy was slightly lower on 11_Tumors, at approximately 99%.

Table 6 SVM classification accuracy.

Table 7 reports the accuracy, sensitivity, and specificity obtained with different parameters; the values differ from dataset to dataset. On Colon we achieved 96% accuracy, 93% sensitivity, and 92% specificity.

Table 7 Classification sensitivity and specificity with different gene subsets.

Table 8 compares the proposed method with a modified version of the Moth Flame algorithm (MMFA). On the Lung dataset, MMFA reaches an accuracy of 94%, while the SIW-APSO-ELM approach reaches 97%.

Table 8 Comparison of the proposed algorithm with modified Moth Flame Algorithm.

Table 9 lists the genes selected by the proposed method on the Colon dataset, with a description of each selected gene.

Table 9 The proposed method with selected genes on Colon datasets.

Table 10 lists the genes selected by the proposed method on the Leukemia dataset, with a description of each selected gene.

Table 10 Proposed method with selected genes on leukemia datasets.

Table 11 lists the genes selected by the proposed method on the Lymphoma dataset, with a description of each selected gene.

Table 11 The proposed method with selected genes on the Lymphoma dataset.

Table 12 lists the genes selected by the proposed method on the SRBCT dataset, with a description of each selected gene and its link to the sample classes. Table 4 presents the experimental results alongside the latest gene selection methods. It demonstrates that the proposed technique outperforms the other PSO [26] variants and other standard gene selection approaches such as IBPSO, SVM [27], IG-GA [28], EPSO [29], BPSO-GCS-ELM [30], and mABC [31], because SIW-APSO-ELM simplifies the gene selection procedure. It selects the smallest gene subset pool from the primary gene pool by updating the global position and using extreme learning to choose the best dense gene subsets from the accurate gene pool. From Table 3, SIW-APSO-ELM chooses nearly the same number of genes as the former approaches on Leukemia, SRBCT, and DLBCL. At the same time, it selects the largest number of genes on the colon, brain, and lung data, among others. ELM attains 100% accuracy on the Leukemia, DLBCL, and SRBCT datasets compared to other state-of-the-art methods.

Table 12 The proposed method with selected genes on SRBCT dataset.

Figure 2 shows the fivefold CV accuracy on the training data versus the iteration number of the proposed algorithm. With its help, we verified that the proposed method mitigates premature convergence. The graph in Fig. 2 further shows that the convergence rate for the four datasets Brain Cancer, Lymphoma, 11_Tumors, and Colon is strongly affected by the training examples and improves over time as the model is trained.

Figure 2 Fivefold CV accuracy on the training data versus the iteration number of the proposed algorithm.

Comparison with other classification models

The proposed approach has been evaluated against several state-of-the-art models, including IBPSO [32], SVM [33], IG_GA [34], EPSO [35], BPSO-GCS-ELM [36], mABC [37], and intelligent models [38]. The comparison is based on the classification results and the number of genes, irrespective of data processing and classification methods. The results on the eight datasets are reported in Table 4. From Table 3, the SIW-APSO-ELM method achieves 100% accuracy on the Leukemia, DLBCL, and SRBCT data with the genes selected by the other methods. Table 8 shows that the proposed algorithm performs much better than the latest algorithm because it simplifies the gene selection procedure [39]. It selects the smallest gene subset pool from the primary gene pool by updating the global position and using extreme learning to choose the best dense gene subsets from the accurate gene pool.

Comparison of the SIW-APSO-ELM method with KCNN and SVM

The proposed method performs strongly compared with other gene selection methods, such as BPSO-GCS-ELM, on public microarray data. SIW-APSO-ELM is compared with BPSO-GCS-ELM on the eight datasets using ELM and KCNN, and the corresponding results are recorded in Tables 5 and 6. From Tables 5 and 6, KCNN and SVM obtain a fivefold CV accuracy of 100% on the Leukemia and SRBCT data. For both the colon cancer and lung cancer data, the accuracy of the KCNN and SVM models is higher with the genes identified by SIW-APSO-ELM than with those identified by BPSO-GCS-ELM.
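The fivefold CV accuracies quoted in this comparison can be estimated along the following lines. This is a minimal sketch under stated assumptions: the stratified fold assignment, the `fit` and `predict` callables, and the toy data are hypothetical and stand in for whichever classifier (ELM, KCNN, or SVM) and dataset are being evaluated; the source does not describe its exact splitting procedure.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    slot = 0  # running counter so fold sizes stay balanced across classes
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[slot % k].append(index)
            slot += 1
    return folds


def cross_validate(samples, labels, fit, predict, k=5, seed=0):
    """Return the held-out accuracy of each fold.

    fit(train_samples, train_labels) -> model and predict(model, sample) -> label
    are supplied by the caller (hypothetical interface for this sketch).
    """
    accuracies = []
    for held_out in stratified_folds(labels, k, seed):
        if not held_out:  # can only happen with fewer than k samples
            continue
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


# Toy usage: each sample is its own label, so a pass-through "classifier"
# is always right. Real use would plug in a trained model instead.
toy_labels = [0] * 10 + [1] * 10
fold_accuracies = cross_validate(
    toy_labels, toy_labels,
    fit=lambda train_samples, train_labels: None,
    predict=lambda model, sample: sample,
)
mean_accuracy = mean(fold_accuracies)
```

Reporting the mean over folds, optionally with the standard deviation, matches the way the accuracies are summarized in the text.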
In the case of brain cancer, the SIW-APSO-ELM technique achieves higher accuracy than the BPSO-GCS-ELM method with ELM, and it likewise yields higher accuracy with KCNN. The findings presented in Tables 4 and 5 demonstrate that the suggested method surpasses the BPSO-GCS-ELM method, as well as other PSO variants, in terms of performance. The BPSO-GCS-ELM technique chooses a smaller number of genes than the proposed method on the Colon, Brain cancer, and Lung datasets, as well as on the 11_Tumors and DLBCL datasets.

Table 7 shows the accuracy, sensitivity, and specificity on the eight datasets. From Tables 5 and 6, we can observe that the proposed method outperforms BPSO-GCS-ELM. From Fig. 3, the classification accuracy on Leukemia, DLBCL, and SRBCT reaches 100%. Figure 4 shows the ranks of genes from five independent runs on two datasets, DLBCL and 11_Tumors, to assess the reproducibility of the proposed method.

Figure 3 Classification accuracy of the proposed method on the datasets: (a) Leukemia, DLBCL, 11_Tumors, Brain cancer; (b) SRBCT, Lung, Colon, Lymphoma.

Figure 4 Ranks of genes from five independent runs on the two datasets.

Figure 3 shows the classification accuracy of the proposed method on the different datasets. It shows that some of the selected gene subsets converge better than others. These variations are attributed to the different parameters used for different datasets, which can affect the convergence rate.

Biological analysis of the selected gene subsets

Biological analyses of the different datasets have been conducted. The five most frequently selected genes are listed in Tables 9, 10, 11, and 12.
Table 12 shows the five genes most frequently selected by the proposed method on the SRBCT dataset. Gene X03934, listed in Table 10, is critical for classification. From Table 9, the genes H06524, H20709, T94579, T92451, and K03474 were also selected. From Table 11, the gene U66559_at is also an essential gene for anaplastic lymphoma kinase. Figure 4 shows the ranks of the selected genes from five independent runs on the two datasets.
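The notion of "most frequently selected" genes across independent runs can be made concrete with a simple count. The sketch below assumes each run yields a list of selected gene identifiers; the function name and the placeholder identifiers (`gene_a`, `gene_b`, `gene_c`) are hypothetical and do not correspond to any result in the tables.

```python
from collections import Counter


def top_selected_genes(runs, top_k=5):
    """Count in how many runs each gene was selected and return the top_k.

    runs is a list of gene-identifier lists, one list per independent run.
    A gene is counted at most once per run.
    """
    counts = Counter(gene for selected in runs for gene in set(selected))
    return counts.most_common(top_k)


# Placeholder identifiers only; not data from the paper.
example_runs = [
    ["gene_a", "gene_b"],
    ["gene_a", "gene_c"],
    ["gene_a", "gene_b"],
]
top_two = top_selected_genes(example_runs, top_k=2)
```

Genes that appear near the top in every run suggest a reproducible selection, which is the property that the rank plots over five independent runs are meant to assess.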
