Optimized classification of diabetes using dynamic waterwheel plant optimization algorithm

Dataset

The publicly available diabetes dataset on Kaggle contains information essential for studying and analyzing the factors and features associated with diabetes. The dataset tested in this work27 consists of 768 records, each described by several clinical and demographic features, including patient age, blood pressure, body mass index (BMI), and glucose level, together with a target variable that indicates whether diabetes was diagnosed within five years. The size of this dataset makes it suitable for exploratory data analysis, feature engineering, and the creation of predictive models that identify potential risk factors related to diabetes. Machine learning algorithms and statistical approaches applied to this dataset can reveal patterns, correlations, and predictive elements, and these insights can contribute significantly to advances in diabetes diagnosis, prevention, and treatment techniques. Furthermore, the dataset provides a platform for developing and validating prediction models that can aid healthcare practitioners in the early identification of, and intervention for, persons at risk of developing diabetes. Figure 1 shows the histograms of the features of interest in the adopted diabetes dataset.

Fig. 1. Histograms of diabetes dataset features.

Data preprocessing

Preprocessing alters the data to improve the accuracy of a machine learning model and allows the model to be constructed more effectively. In this work, missing values are replaced with the mean of the available data for the corresponding feature. Using Eq. (1), the dataset features are normalized to the range 0–1 so that all variables are on the same scale.

$$x^{*} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
(1)
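As a minimal sketch, the preprocessing step can be expressed in Python as below; the file name diabetes.csv and the column name "Outcome" are assumed placeholders, not taken from the original description.

```python
import pandas as pd

# Load the diabetes dataset; the file name is an assumed placeholder.
df = pd.read_csv("diabetes.csv")

# Replace missing values with the mean of the available data in each column.
features = df.drop(columns=["Outcome"])  # "Outcome": assumed target column name
features = features.fillna(features.mean())

# Eq. (1): min-max normalization, scaling every feature to the range [0, 1].
normalized = (features - features.min()) / (features.max() - features.min())
```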
K-nearest neighbors classifier

K-nearest neighbors (KNN) is one of the most straightforward machine learning techniques. It may be applied to regression as well as classification, but it is most frequently used for classification tasks, in which data inputs are categorized into predetermined classes. KNN is a nonparametric method, meaning that it makes no assumptions about the underlying distribution of the data. It is also referred to as a lazy learner algorithm because it does not learn from the training data immediately; instead, it stores the dataset and defers computation until classification time. For a given test sample, the algorithm identifies the k training points closest to it in the feature space, where k is an integer and closeness is measured by the distance from the test sample. Once the k nearest neighbors have been identified, the test sample is assigned to the most prevalent class among them. In other words, once the training phase is complete, any new observation is placed into the category most common among the training samples that resemble it. Increasing the number of neighbors increases the complexity of the classifier model, but it does not guarantee that additional samples will be correctly categorized and can reduce the accuracy of the KNN classifier.
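A minimal usage sketch with scikit-learn follows; the synthetic data, the choice of k = 5, and the train/test split ratio are illustrative assumptions, not values taken from this study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 768 normalized records and their labels.
rng = np.random.default_rng(42)
X = rng.random((768, 8))           # features scaled to [0, 1]
y = rng.integers(0, 2, size=768)   # binary diabetes diagnosis

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# k = 5 is illustrative; in this work the KNN parameters are tuned by DWWPA.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```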
Dynamic waterwheel plant algorithm (DWWPA)

The waterwheel plant algorithm (WWPA) is a population-based method that iteratively produces an acceptable solution by exploiting the search capacity of its population members over the space of potential solutions. Each waterwheel in the WWPA population carries its own values of the problem variables, so each waterwheel represents a candidate solution that can be formally expressed as a vector. Waterwheel plants are predators with a keen sense of smell that lets them detect the insects they prey on; when an insect enters the attack region, the waterwheel locates it precisely and then attacks and hunts it. The first stage of the WWPA population update replicates this behavior: modeling the attack on the insect produces large fluctuations in a waterwheel's position inside the search area, which enhances the exploration capability of the WWPA, helps it recognize the promising region, and lets it escape local optima. After capture, the insect is transferred to a feeding tube and consumed; a simulation of this behavior forms the foundation of the second stage of the WWPA population update.

This second stage increases the WWPA's exploitation power during the local search, because the model of transporting the insect to the feeding tube produces small changes in the position of the waterwheel in the search space, so the algorithm converges toward better solutions near those already discovered. At the start of a WWPA run, a random site is selected for each waterwheel in the population; whenever the objective function value is better at a new location, the waterwheel is moved to that location. During this initial stage, the starting positions of the waterwheels within the search domain are selected randomly and represented by a matrix \(P\) as follows:

$$P = \begin{bmatrix} p_{1,1} & \cdots & p_{1,j} & \cdots & p_{1,m} \\ p_{2,1} & \cdots & p_{2,j} & \cdots & p_{2,m} \\ \vdots & & \vdots & & \vdots \\ p_{i,1} & \cdots & p_{i,j} & \cdots & p_{i,m} \\ \vdots & & \vdots & & \vdots \\ p_{N,1} & \cdots & p_{N,j} & \cdots & p_{N,m} \end{bmatrix}$$
(2)
Here, \(N\) and \(m\) denote the number of waterwheels and the number of problem variables, respectively. Each element \(p_{i,j}\) is computed as \(p_{i,j} = lb_{j} + r_{i,j}\cdot(ub_{j} - lb_{j})\), where \(i\) ranges from 1 to \(N\), \(j\) from 1 to \(m\), \(r_{i,j}\) is a uniformly distributed random number between 0 and 1, and \(lb_{j}\) and \(ub_{j}\) are the lower and upper bounds of the \(j\)-th variable, respectively. \(P\) is known as the population matrix of waterwheel locations, where \(P_{i}\) denotes the \(i\)-th waterwheel (a candidate solution). Each waterwheel adopts a unique strategy for addressing the problem, so the objective function of each can be assessed individually through the vector:

$$f = \begin{bmatrix} f(P_{1}) \\ f(P_{2}) \\ \vdots \\ f(P_{i}) \\ \vdots \\ f(P_{N}) \end{bmatrix}$$
(3)
where \(f\) represents the vector of all objective function values, and \(f_{i}\) is the value calculated for the \(i\)-th waterwheel. The selection of optimal solutions is based primarily on these objective function evaluations, with the best solutions corresponding to the highest values and the worst to the lowest. The stochastic movement of waterwheels over the search space is designed to explore varying solutions dynamically.
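As a minimal sketch of Eqs. (2) and (3), assuming a generic maximization objective (the quadratic used here is only a stand-in):

```python
import numpy as np

def initialize_population(N, m, lb, ub, rng):
    """Eq. (2): p_ij = lb_j + r_ij * (ub_j - lb_j) with r_ij ~ U(0, 1)."""
    return lb + rng.random((N, m)) * (ub - lb)

def evaluate(P, objective):
    """Eq. (3): vector of objective values, one per waterwheel."""
    return np.array([objective(p) for p in P])

rng = np.random.default_rng(0)
lb, ub = np.zeros(5), np.ones(5)               # bounds of the m = 5 variables
P = initialize_population(10, 5, lb, ub, rng)  # N = 10 waterwheels
f = evaluate(P, lambda p: -np.sum(p**2))       # stand-in objective (higher is better)
```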
A new location for a waterwheel is computed using:

$$W = r_{1}\cdot\left(P(t) + 2K\right), \quad P(t+1) = P(t) + W + \left(2K + r_{2}\right)$$
(4)
Here, \(r_{1}\) and \(r_{2}\) are random numbers within \([0, 2]\) and \([0, 1]\), respectively, and \(K\) is an exponential random variable in the range \([0, 1]\). \(W\) denotes a vector representing the circular search area around the waterwheel. If a waterwheel does not find a superior location after three trials, its position is adjusted using:

$$P(t+1) = \text{Gaussian}\left(\mu_{p}, \sigma\right) + \left(\frac{r_{1}\left(P(t) + 2K\right)}{W}\right)$$
(5)
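A sketch of this exploration move in Python, under the stated distributions; the stagnation counter and the elementwise division in the Gaussian fallback follow Eqs. (4) and (5) literally, and the parameter handling is an assumption.

```python
import numpy as np

def exploration_step(p, stalled, K, rng):
    """Exploration move: Eq. (4), with the Eq. (5) fallback after 3 stalls."""
    r1 = rng.uniform(0, 2, size=p.shape)
    r2 = rng.uniform(0, 1)
    W = r1 * (p + 2 * K)               # Eq. (4): circular search vector
    if stalled < 3:
        return p + W + (2 * K + r2)    # Eq. (4): candidate position
    # Eq. (5), taken literally: Gaussian term plus r1 * (P + 2K) / W.
    mu, sigma = p.mean(), p.std() + 1e-12
    return rng.normal(mu, sigma, size=p.shape) + r1 * (p + 2 * K) / W
```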
Furthermore, the relocation of the waterwheel to a potentially better position is modeled by:

$$W = r_{3}\cdot\left(K\,P_{\text{best}}(t) + r_{3}P(t)\right), \quad P(t+1) = P(t) + KW$$
(6)
where \(r_{3}\) is a random variable within \([0, 2]\), and \(P_{\text{best}}\) is the best solution found so far. To avoid being trapped in local minima, a mutation mechanism is applied if no improvement is observed after three iterations:

$$P(t+1) = (r_{1} + K)\sin\left(\frac{f}{c}\,\theta\right)$$
(7)
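The exploitation move and its mutation fallback can be sketched as follows; the per-component application of Eq. (6) and the uniform draw for \(\theta\) in Eq. (7) are assumptions made for illustration.

```python
import numpy as np

def exploitation_step(p, p_best, stalled, K, rng):
    """Exploitation move: Eq. (6), with the Eq. (7) mutation after 3 stalls."""
    if stalled < 3:
        r3 = rng.uniform(0, 2, size=p.shape)
        W = r3 * (K * p_best + r3 * p)   # Eq. (6): pull toward the best solution
        return p + K * W
    # Eq. (7): sinusoidal mutation; f, c ~ U(-5, 5), theta assumed ~ U(0, 2*pi).
    r1 = rng.uniform(0, 2)
    f, c = rng.uniform(-5, 5), rng.uniform(-5, 5)
    theta = rng.uniform(0, 2 * np.pi)
    return (r1 + K) * np.sin(f / c * theta) * np.ones_like(p)
```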
This framework outlines how the waterwheels are moved algorithmically through the search space to find optimized solutions over time.

Algorithm 1. The proposed DWWPA algorithm.

In this setup, \(f\) and \(c\) are random variables that can take values within the range \([-5, 5]\). Additionally, the value of \(K\) diminishes exponentially according to the following formula:

$$K = \left(1 + \frac{2k^{2} + T}{T_{\text{max}} + F}\right)$$
(8)
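A small sketch of this schedule, evaluating Eq. (8) as printed; reading both \(k\) and \(T\) as the current iteration counter and \(F\) as a constant offset are assumptions made purely for illustration.

```python
def k_schedule(t, T_max, F=1.0):
    """Eq. (8) as printed: K = 1 + (2*t**2 + t) / (T_max + F).

    Interpreting k and T as the iteration counter t, and F as a constant
    offset, are illustrative assumptions.
    """
    return 1.0 + (2.0 * t**2 + t) / (T_max + F)
```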
The dynamic waterwheel plant algorithm (DWWPA) optimizes the parameters of the KNN classifier through its dynamic methodology. Initially, the algorithm assigns a fitness value to each candidate in the population, and the candidate with the highest fitness value is taken as the best solution. The population is then split into two groups with different strategic purposes: exploration and exploitation. Members of the exploitation group aim to emulate the best solution, while the exploration group investigates potential alternatives in its vicinity. Over time, the composition of these groups evolves dynamically. To ensure a balanced approach, the algorithm starts with the population evenly divided between exploration and exploitation, each group comprising 50% of the total population. This approach is detailed in Algorithm 1.

Feature selection

A wide variety of problematic issues arise in the computational diagnosis of diabetes. One of the most prevalent is the quality of the medical data: missing data entries and improper weighting can lead to inaccurate medical diagnoses, so data preparation must be carried out with extreme caution. Computational efficiency on enormous datasets and the precision attained are two further common difficulties. The amount of data has grown exponentially as a result of the tremendous advances in data storage and computing capacity over the past several decades. Numerous tests and clinical examinations are required to identify and diagnose people with diabetes, and the enormous volume and high dimensionality of patient data significantly hinder the ability to recognize patterns in it. Removing redundant and noisy features from huge datasets can reduce the number of medical tests performed on patients and lessen the load placed on healthcare professionals, particularly in locations with few healthcare resources. A smaller number of features also means less time to train machine learning models and less time to obtain prediction results. Although of utmost importance for patient safety, the accuracy of the machine learning models currently in use is not adequate. Eliminating duplicate features can improve the performance of machine learning models, since such features can be misleading, particularly for instance-based models such as KNN. Retaining uninformative variables can also lead to overfitting. Identifying the most important features can therefore support reliable and early prediction of diabetes. Since diabetes prediction usually involves high-dimensional data and requires high prediction accuracy, this paper proposes a novel wrapper method, denoted the binary dynamic waterwheel plant algorithm (bDWWPA). In this binary version, the same technique is deployed for feature selection in a more straightforward process.
After the continuous solution is found by the DWWPA (Algorithm 2), it is converted to binary via the following sigmoid function, where \(S_{\text{best}}\) denotes the best continuous solution from the continuous DWWPA:

$$\text{Binary} = \begin{cases} 1 & \text{if Sigmoid}\left(S_{\text{best}}\right) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
(9)
$$\text{Sigmoid}\left(S_{\text{best}}\right) = \frac{1}{1 + e^{-10(S_{\text{best}} - 0.5)}}$$
(10)
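A direct sketch of Eqs. (9) and (10) in Python; applying the threshold independently to each component of the continuous best position vector is an assumption about how the mapping is used per feature.

```python
import numpy as np

def sigmoid(s_best):
    """Eq. (10): steep sigmoid centered at 0.5."""
    return 1.0 / (1.0 + np.exp(-10.0 * (s_best - 0.5)))

def binarize(s_best):
    """Eq. (9): a feature is selected (1) when its sigmoid value exceeds 0.5."""
    return (sigmoid(s_best) > 0.5).astype(int)

s_best = np.array([0.12, 0.81, 0.47, 0.66])  # example continuous best solution
print(binarize(s_best))                      # -> [0 1 0 1]
```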
Algorithm 2. The proposed Binary DWWPA (bDWWPA) algorithm.

The detailed analysis of the proposed bDWWPA algorithm is presented in the following:

A. Initialization (Lines 1–3)

Operations:

The algorithm begins by initializing the positions of the n plants and setting up the fitness function and other parameters.

An additional step is performed to binarize the solution space.

Time Complexity:

Initialization of plant positions and parameter setup takes O(n).

Binarization of the solution space also takes O(n), assuming that it involves converting the continuous positions into a binary format.

Overall, the complexity for the initialization step is O(n).

B. Main Loop (Lines 5–28)

The main loop runs until the maximum number of iterations T_max is reached; a minimal code sketch of this loop is given after block B.3. The loop contains the following operations:

B.1. Condition Checks (Lines 6–8 and 11–13)

Operations: The algorithm checks whether the best fitness has remained unchanged for three iterations and whether a random number r is less than 0.5.

Time Complexity: These are constant-time operations, O(1).

B.2. Exploration Group (Lines 9–13)

Operations:

If r < 0.5, the algorithm explores the search space by iterating over n1 solutions.

Positions are updated using Eq. (4), and if there is no change for three iterations, the Gaussian perturbation of Eq. (5) is applied.

Time Complexity:

Iterating over n1 solutions takes O(n1).

The update operations inside the loop are O(1).

Therefore, the complexity for this block is O(n1).

B.3. Exploitation Group (Lines 16–19)

Operations:

If r ≥ 0.5, the algorithm exploits current solutions by iterating over n2 solutions.

Similar to the exploration phase, positions are updated using Eq. (6), and if no change occurs for three iterations, the sinusoidal perturbation of Eq. (7) is applied.

Time Complexity:

Iterating over n2 solutions takes O(n2).

The update operations inside the loop are O(1).

Therefore, the complexity for this block is O(n2).
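Putting blocks B.1–B.3 together, the following Python skeleton mirrors the structure of the main loop (Lines 5–28 of Algorithm 2); the update functions are simplified stubs standing in for Eqs. (4)–(7), and the even split into n1 explorers and n2 exploiters reflects the 50/50 initialization described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, T_max = 20, 8, 100
P = rng.random((N, m))                   # positions of the n plants
fitness = lambda p: -np.sum(p**2)        # stand-in fitness (higher is better)
n1 = N // 2                              # 50/50 split: n1 explorers, n2 = N - n1

explore = lambda p: p + rng.normal(0, 0.1, m)   # stub for Eqs. (4)-(5)
exploit = lambda p, b: p + 0.1 * (b - p)        # stub for Eqs. (6)-(7)

best = max(P, key=fitness).copy()
best_fit, stalled = fitness(best), 0
for t in range(T_max):                   # Lines 5-28: main loop, T_max iterations
    r = rng.random()                     # B.1: condition checks, O(1)
    if r < 0.5:                          # B.2: exploration group, O(n1)
        for i in range(n1):
            P[i] = explore(P[i])
    else:                                # B.3: exploitation group, O(n2)
        for i in range(n1, N):
            P[i] = exploit(P[i], best)
    # C: recalculate fitness and track the best position, O(n)
    cand = max(P, key=fitness)
    if fitness(cand) > best_fit:
        best, best_fit, stalled = cand.copy(), fitness(cand), 0
    else:
        stalled += 1                     # feeds the "unchanged for 3 iterations" checks
```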

C. Update Parameters and Fitness Calculation (Lines 23–27)

Operations:

Parameters such as K, r, r1, r2, r3, f, and c are updated.

The fitness values are recalculated for each position, and the best position is updated.

The iteration counter t is incremented.

Time Complexity:

Updating parameters and recalculating fitness involves O(n) operations.

Finding the best position also takes O(n).

Thus, this block has a complexity of O(n).

D. Final Conversion to Binary (Line 29)

Operations:

The final best continuous solution is converted to binary form using the sigmoid mapping of Eqs. (9) and (10).

Time Complexity:

Consistent with the binarization step in the initialization, this conversion takes O(n).

E. Time Complexity

Main Loop Complexity: The main loop runs for T_max iterations. The most complex part of each iteration is either the exploration or exploitation group, both of which have a complexity of O(n) (assuming n1 and n2 can be as large as n). Including the fitness recalculations and parameter updates, each iteration has a complexity of O(n).

Total Complexity: Given that the main loop runs for T_max iterations, the overall time complexity of the bDWWPA algorithm is O(T_max * n).

F. Space Complexity

Operations:

The algorithm needs to store the positions of n plants and their corresponding fitness values, as well as other parameters and temporary variables.

Space Complexity: The space complexity remains O(n) as it is mainly determined by the number of plants being optimized.

From this analysis, the time complexity of the proposed bDWWPA is O(T_max * n), and the space complexity is O(n). This makes the algorithm scalable and efficient for binary optimization problems, with a time complexity that is linear in the number of iterations and the number of plants.
