Detection of Alcoholic EEG signal using LASSO regression with metaheuristics algorithms based LSTM and enhanced artificial neural network classification algorithms

This section presents the details of the dataset, clustering via feature extraction using LASSO regression, and feature selection using various metaheuristic algorithms.

Dataset details
Our study employed an alcoholic EEG dataset from the UCI KDD archive24, a well-known publicly available online repository at the University of California, Irvine. This dataset was chosen to explore the relationship between brain activity (reflected in EEG signals) and alcoholism. The dataset comprises EEG recordings from a total of 122 subjects (both normal and alcoholic) acquired with the standard 10/20 International electrode placement system. Electrode impedance was maintained below 5 kΩ to ensure good signal quality. Each subject underwent 120 trials involving different stimuli, and the EEG signals were captured by 64 electrodes at a sampling rate of 256 Hz with 12-bit resolution. However, raw EEG data can be contaminated by noise from muscle movements, eye blinks, and body sway. To address this issue, we applied a simple yet effective pre-processing technique, Independent Component Analysis (ICA), to remove these artifacts. This step is crucial because artifacts can significantly hinder the accuracy of classification algorithms, and clean EEG signals are essential for reliable alcohol level detection. Following pre-processing, the relevant 1D EEG recordings for both normal and alcoholic subjects were segmented into 2D matrices using the Short Time Fourier Transform (STFT)25 and stored in separate files, with each file containing 2560 data points. A simplified block diagram outlining our approach is provided in Fig. 1. The process involves pre-processing the EEG signals, feature extraction using LASSO regression, feature selection using metaheuristic algorithms such as Particle Swarm Optimization (PSO), Binary Coding Harmony Search (BCHS) and the Binary Dragonfly Algorithm (BDA), and finally classification using suitable algorithms to analyze alcohol levels from the EEG data.
Fig. 1 Simplified block diagram of the process.
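As a rough illustration of the STFT segmentation step described above, the sketch below converts one pre-processed (ICA-cleaned) 1D channel into a 2D time-frequency matrix with SciPy; the placeholder signal, window length and overlap are illustrative assumptions rather than the study's exact settings.

```python
import numpy as np
from scipy.signal import stft

fs = 256                   # sampling rate reported for the dataset (Hz)
n_samples = 2560           # data points stored per file, as described above

# Placeholder for one pre-processed (ICA-cleaned) EEG channel.
channel = np.random.default_rng(0).standard_normal(n_samples)

# Short Time Fourier Transform: 1D signal -> 2D time-frequency matrix.
# nperseg and noverlap are illustrative choices, not values from the study.
freqs, times, Z = stft(channel, fs=fs, nperseg=128, noverlap=64)
tf_matrix = np.abs(Z)      # magnitude spectrogram stored as the 2D segment

print(tf_matrix.shape)     # (frequency bins, time frames)
```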
Clustering via LASSO Regression technique for simplified feature extraction
LASSO regression, also known as L1 regularization, is a popular technique in machine learning used to estimate relationships between variables and make predictions26. It excels at balancing model simplicity and accuracy. LASSO achieves this by adding a penalty term to the standard linear regression objective, forcing some coefficients to become exactly zero. This feature selection capability makes LASSO particularly useful for identifying and discarding irrelevant variables. The general objective minimized by LASSO regression is represented as follows27:
$$\:RSS+\lambda\:*\sum_{i}\left|{\beta\:}_{i}\right|$$
(1)
Where \(\:RSS\) represents the Residual Sum of Squares, which reflects how well the model fits the data by measuring the total squared difference between the actual values and the values predicted by the model, and \(\:\lambda\:\) indicates the regularization parameter. LASSO regression is explained in detail step-by-step below:
Step 1
LASSO starts with a standard linear regression model assuming a linear relationship between features and the target variable. The standard linear regression equation for this case is as follows.
$$\:g=\:{\beta\:}_{0}+{\beta\:}_{1}{s}_{1}+{\beta\:}_{2}{s}_{2}+\cdots\:{+\beta\:}_{n}{s}_{n}+\epsilon\:$$
(2)
Where \(\:g\) indicates the dependent (target) variable, \(\:{\beta\:}_{0},{\beta\:}_{1},{\beta\:}_{2},\dots\:,{\beta\:}_{n}\) represent the coefficients of the standard linear regression, \(\:{s}_{1},{s}_{2},\dots\:,{s}_{n}\) signify the independent variables (features) and \(\:\epsilon\:\) indicates the error term of the standard linear regression.
Step 2
LASSO introduces a penalty term based on the absolute values of the coefficients. This term, multiplied by a tuning parameter (λ), discourages large coefficients. The L1 Regularization equation for this case is as follows.
$$\:{L}_{1}=\lambda\:*\left(\left|{\beta\:}_{0}\right|+\left|{\beta\:}_{1}\right|+\left|{\beta\:}_{2}\right|+\dots\:+\left|{\beta\:}_{n}\right|\right)$$
(3)

Step 3
LASSO aims to minimize the sum of squared errors between predicted and actual values while also minimizing the L1 penalty term.

Step 4
By incorporating the L1 penalty, LASSO shrinks coefficients towards zero. When λ is large enough, some coefficients become zero, effectively removing those variables from the model.

Step 5
The choice of λ is crucial. A larger λ leads to more coefficients being driven to zero, while a smaller λ allows more variables to have non-zero coefficients. LASSO regression offers a powerful approach for both prediction and feature extraction, especially valuable for high-dimensional datasets with many features. Therefore, clustering is performed using the LASSO regression algorithm as described next; a brief illustrative sketch is given below.
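As a minimal illustration of Steps 1–5 (not the study's exact pipeline), the sketch below fits scikit-learn's Lasso and keeps only the features with non-zero coefficients; the synthetic data, the construction of the target, and the mapping of λ = 0.5 to the alpha argument are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Illustrative stand-in: 64 channels, each treated as a sample with 2560 values.
X = rng.standard_normal((64, 2560))
# Hypothetical continuous target driven by the first five columns.
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(64)

# L1-regularized fit; the study's lambda = 0.5 is passed as alpha (assumption).
lasso = Lasso(alpha=0.5, max_iter=10_000)
lasso.fit(X, y)

# Features whose coefficients were driven exactly to zero are discarded.
selected = np.flatnonzero(lasso.coef_)
X_reduced = X[:, selected]
print(f"kept {selected.size} of {X.shape[1]} features")
```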
Our dataset contains 64 EEG channels, each with 2560 samples, resulting in a total of 163,840 data points. LASSO regression is employed to reduce the dimensionality of these signals. By applying LASSO, we achieve a tenfold reduction in the number of features per channel, leading to a compressed representation with 256 features per patient. In this case, the regularization parameter is λ = 0.5. To assess whether LASSO regression preserves the inherent non-linearity of the EEG signals, we analyze the distribution of the resulting features using a histogram plot (Fig. 2). A histogram visually depicts the frequency of occurrence for different data values. Our analysis suggests that the non-linear dynamics in the alcoholic EEG signals are reflected in the non-normal distribution observed in the histogram. Table 1 summarizes the average values of key statistical parameters (mean, variance, skewness, and kurtosis) calculated from the LASSO regression features for alcoholic EEG signals.
Fig. 2 Histogram of the LASSO regression features for alcoholic EEG cases.
Table 1 Average values of key statistical parameters of LASSO regression features for alcoholic patients.
Table 1 shows the average values of the LASSO regression features for alcoholic patients: the average mean indicates that the alcoholic signals are negatively skewed, the average variance indicates that the alcoholic signals are consistent, the average skewness indicates that the alcoholic signals are right skewed, and the average kurtosis shows lighter tails than a normal distribution. Relative to the dataset, the negative entropy value shows that there is low uncertainty. This relates to the variance, which is quite small and indicates that the data points are closely grouped around the mean. The energy value of the compressed features per patient shows that the signal possesses low overall power. This low energy value agrees with the small variance of the data, implying no large fluctuations in the signal, which remains steady. Finally, the LASSO regression features of alcoholic signals exhibit non-linearity; therefore, feature selection methodologies are utilized to further reduce the dimensionality of the signals. This helps to identify the most significant features for classification.

Feature selection via different metaheuristic algorithms
After using the LASSO regression approach for clustering, a number of metaheuristic methods are used to choose efficient features from the clusters. Particle Swarm Optimization (PSO), Binary Coding Harmony Search (BCHS) and the Binary Dragonfly Algorithm (BDA) are the metaheuristic algorithms examined here for feature selection.

Particle swarm optimization (PSO)
The PSO algorithm is a population-based search method inspired by the social behavior of bird flocks. PSO is computationally affordable, both in terms of speed and memory utilization, and only requires simple mathematical expressions28. Social interaction and learning from one another are the key components of PSO. Particles inside the swarm migrate to resemble their better neighbors depending on the knowledge they have acquired. Neighborhood formation shapes the PSO’s organizational structure, and neighbors can communicate with one another. The star topology, ring topology and wheel topology are among the several neighborhood types that have been identified and investigated. In this paper, we employ the PSO algorithm with a global best strategy, as shown below29.
Step 1: Initialization
The algorithm begins by creating a swarm of particles, denoted by \(\:Q\left(t\right)\). Each particle, represented by \(\:{Q}_{i}\:\epsilon\:Q\left(t\right)\), has a position \(\:{S}_{i}\left(t\right)\) randomly distributed within the search space (hyperspace) at the initial time step \(\:(t\:=\:0)\).

Step 2: Fitness evaluation
The performance of each particle is then evaluated using its current position \(\:{S}_{i}\left(t\right)\). This evaluation assigns a fitness score \(\:F\left({S}_{i}\left(t\right)\right)\) that reflects how good the particle’s position is in terms of solving the optimization problem.

Step 3: Update personal best
Each particle compares its current performance \(\:{S}_{i}\left(t\right)\) to its best performance encountered so far (personal best):
if \(\:F\left({S}_{i}\left(t\right)\right)<\:{q}_{id}\) then
i) \(\:{q}_{id}=\:F\left({S}_{i}\left(t\right)\right)\)
ii) \(\:{Q}_{i}=\:{S}_{i}\left(t\right)\)

If the current performance is better, the particle’s personal best position is updated.
Step 4: Update global best
All particles in the swarm can access information about the best performing particle found so far (global best). This allows the swarm to collectively learn and move towards promising areas of the search space.
if \(\:F\left({S}_{i}\left(t\right)\right)<\:{q}_{gd}\) then
i) \(\:{q}_{gd}=F\left({S}_{i}\left(t\right)\right)\)
ii) \(\:{Q}_{g}={S}_{i}\left(t\right)\)

Step 5: Velocity update
Based on the personal best position and the global best position, the velocity vector of each particle is updated. This velocity vector determines the direction and magnitude of movement for each particle in the next iteration.
$$\:{\mathcalligra{v}}_{id}\left(t+1\right)=\omega\:{\mathcalligra{v}}_{id}\left(t\right)+{\eta\:}_{1}*rand\left(\right)*\left({q}_{id}\left(t\right)-{s}_{id}\left(t\right)\right)+{\eta\:}_{2}*rand\left(\right)*\left({q}_{gd}\left(t\right)-{s}_{id}\left(t\right)\right)$$
(4)

In Eq. (4), \(\:\omega\:\) is the inertia weight, \(\:{\eta\:}_{1}\) and \(\:{\eta\:}_{2}\) are the acceleration coefficients and \(\:rand\left(\right)\) returns a uniform random number in \(\:\left[\text{0,1}\right]\). The second term on the right-hand side represents the cognitive component, while the final term signifies the social component.
Step 6
Move each particle based on its updated velocity.

i) \(\:{s}_{id}\left(t+1\right)={s}_{id}\left(t\right)+{\mathcalligra{v}}_{id}\left(t+1\right)\)
ii) \(\:t=\left(t+1\right)\)

Step 7
Repeat steps 2 through 6 until the stopping criterion is met. At this point, the farther a particle is from both the global best position and its own best solution, the larger the velocity pulling it back towards those solutions. A compact sketch of this update loop is given below.
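As an illustration only (not the feature-selection fitness actually used in the study), the sketch below implements the global-best PSO loop of Steps 1–7 with Eq. (4) for the velocity update; the sphere objective, swarm size and coefficient values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sphere(x):                     # illustrative fitness; the study's fitness differs
    return float(np.sum(x ** 2))

n_particles, n_dims, n_iters = 20, 10, 100
omega, eta1, eta2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (assumed values)

pos = rng.uniform(-1, 1, (n_particles, n_dims))   # Step 1: random initial positions
vel = np.zeros_like(pos)
pbest_pos = pos.copy()
pbest_val = np.array([sphere(p) for p in pos])    # Step 2: initial fitness
gbest_pos = pbest_pos[pbest_val.argmin()].copy()
gbest_val = pbest_val.min()

for _ in range(n_iters):
    for i in range(n_particles):
        fit = sphere(pos[i])                      # Step 2: fitness evaluation
        if fit < pbest_val[i]:                    # Step 3: personal best update
            pbest_val[i], pbest_pos[i] = fit, pos[i].copy()
        if fit < gbest_val:                       # Step 4: global best update
            gbest_val, gbest_pos = fit, pos[i].copy()
    r1 = rng.random((n_particles, n_dims))
    r2 = rng.random((n_particles, n_dims))
    vel = (omega * vel                            # Step 5: velocity update, Eq. (4)
           + eta1 * r1 * (pbest_pos - pos)
           + eta2 * r2 * (gbest_pos - pos))
    pos = pos + vel                               # Step 6: position update

print(gbest_val)
```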
Binary coding Harmony Search (BCHS)
A meta-heuristic optimization technique called harmony search (HS)30 mimics the process of musical improvisation, in which performers explore instrument pitches to discover the ideal harmony. Since each variable’s possible values are limited to 0 and 1, a binary coding approach replaces the float encoding technique of standard HS in this investigation. The optimization problem is described as follows30:
$$\:Minimizing\:f\left(g\right)\:\:subject\:to\:{g}_{i}\,\epsilon\left\{\text{0,1}\right\}\:\:\:\:{g}_{i}\,\epsilon \,g,\:i=\text{1,2},\dots\:,n$$
(5)
Where \(\:f\left(g\right)\:\:\)indicates the objective function, \(\:g\) represents the set of decision variables \(\:{g}_{i}\) and \(\:n\) is the total number of decision variables. The detailed implementation of BCHS is expressed as follows:

Step 1:

Initialize the algorithm’s parameters, which comprise the harmony memory size (HMS) and the harmony memory consideration rate (HMCR).

Step 2:

Initialize the harmony memory. Within the feasible solution space, the HM, represented by \(\:HM={\left[{g}^{1}{g}^{2}\dots\:.{g}^{HMS}\right]}^{T}\), is initialized at random, with the objective value stored alongside each harmony.
$$\:HM=\left[\begin{array}{ccc}{g}_{1}^{1}&\:\cdots\:&\:{g}_{n}^{1}\\\: \vdots &\:\cdots\:&\: \vdots \\\:{g}_{1}^{HMS}&\:\cdots\:&\:{g}_{n}^{HMS}\end{array}\right]\left[\begin{array}{c}f\left({g}^{1}\right)\\\:\vdots\\\:f\left({g}^{HMS}\right)\end{array}\right]$$
(6)

Step 3:

Improvise a new harmony. Using randomization and the harmony memory consideration rule, governed by the pre-defined HMCR, a new harmony, designated \(\:{g}^{{\prime\:}}=\left({g}_{1}^{{\prime\:}},{g}_{2}^{{\prime\:}},\dots\:,{g}_{n-1}^{{\prime\:}},\:{g}_{n}^{{\prime\:}}\right)\), is created. The pitch adjustment operator of the improvisation process is omitted in this investigation. The specific procedure is stated as follows:
$$\:{g}_{i}^{{\prime\:}}=\left\{\begin{array}{c}{g}_{i}^{{\prime\:}}\:\epsilon\left\{{g}_{i}^{1},{g}_{i}^{2},\dots\:{g}_{i}^{HMS}\right\};\:\:if\:U\left(\text{0,1}\right)\le\:HMCR\\\:{g}_{i}^{{\prime\:}}\epsilon\left\{\text{0,1}\right\};\:\:otherwise\end{array}\right.$$
(7)

Where \(\:{g}_{i}^{{\prime\:}}\) represents the \(\:{i}^{th}\) element of the new harmony candidate and \(\:U\left(\text{0,1}\right)\) is a uniform random number in \(\:\left[\text{0,1}\right]\).

Step 4:

Update the harmony memory. If the newly created \(\:{g}^{{\prime\:}}\) is better than the worst harmony in the HM, it replaces that harmony.

Step 5:

Check the termination criterion. The iterative search ends and the final result is produced once the stopping condition is met; otherwise, steps (3) and (4) are repeated. An illustrative sketch of this procedure is given below.
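As a minimal sketch only (not the study's implementation), the code below runs the BCHS steps with a made-up linear objective standing in for the classification-based fitness; the HMS, HMCR and iteration count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)

n_vars, HMS, HMCR, n_iters = 30, 10, 0.9, 200   # parameter values are assumptions
weights = rng.standard_normal(n_vars)           # stand-in objective weights

def objective(g):
    # Illustrative objective to minimize; the study's classification-based
    # fitness would replace this.
    return float(g @ weights)

# Step 2: random binary harmony memory and its objective values, Eq. (6).
HM = rng.integers(0, 2, size=(HMS, n_vars))
scores = np.array([objective(h) for h in HM])

for _ in range(n_iters):
    # Step 3: improvise a new harmony, Eq. (7) (no pitch adjustment).
    from_memory = HM[rng.integers(0, HMS, size=n_vars), np.arange(n_vars)]
    random_bits = rng.integers(0, 2, size=n_vars)
    new = np.where(rng.random(n_vars) <= HMCR, from_memory, random_bits)
    new_score = objective(new)
    # Step 4: replace the worst harmony if the new one is better.
    worst = scores.argmax()
    if new_score < scores[worst]:
        HM[worst], scores[worst] = new, new_score

# Step 5: after the stopping condition, return the best harmony found.
print(HM[scores.argmin()], scores.min())
```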

Binary Dragonfly Algorithm (BDA)
In the proposed study, the best features among those extracted from the EEG signals are chosen using the Binary Dragonfly Algorithm (BDA)31. Recent developments in metaheuristic swarm intelligence have led to the Dragonfly Algorithm, a successful solution to a number of continuous optimization problems, including machine learning optimization, the localization problem in networks, and the economic emission dispatch problem. BDA is well motivated for alcoholic EEG signal categorization and feature reduction: its competitive performance, flexibility to dataset characteristics, compatibility with binary-encoded alcoholic EEG data, and balanced emphasis on exploration and exploitation can enhance the accuracy and efficiency of EEG-based alcoholic detection systems. Exploration and exploitation are the two stages BDA uses to solve any given problem. BDA is a straightforward algorithm with few parameters that converges quickly to optimal solutions. The apparent randomness in the behavior of dragonflies, an intrinsic feature of many nature-inspired optimization methods, enables the algorithm to investigate many solutions and raises the probability of discovering globally optimal or nearly optimal solutions in intricate problem domains. Therefore, utilizing the binary form of the dragonfly method, selecting the best features from the alcoholic EEG signal feature space is characterized as a binary optimization problem in this study. A solution to the feature selection problem is represented as a vector of \(\:1s\) and \(\:0s\), where ‘0’ denotes that the corresponding feature is not selected and ‘1’ denotes that it is. Equation (8) describes the fitness of the feature selection problem in terms of classification performance and the number of selected features31.
$$\:Fitness=\alpha\:{\gamma\:}_{R}\left(G\right)+\delta\:\frac{\left|S\right|}{\left|N\right|}$$
(8)
Where \(\:\alpha\:\) is a weight parameter in the interval \(\:\left[\text{0,1}\right]\), \(\:\delta\:=\left(1-\alpha\:\right)\), \(\:{\gamma\:}_{R}\left(G\right)\) represents the classification error rate, \(\:\left|S\right|\) signifies the number of selected features and \(\:\left|N\right|\) indicates the total number of features extracted from the alcoholic EEG signals. The BDA pseudocode is as follows:
BDA pseudocode
Step 1: Initialize the population.
Step 2: For each iteration: calculate the fitness of each solution using Eq. (8) and update the positions.
Step 3: End For.
Step 4: Return the optimal solution.
Table 2 summarizes the average values of key statistical parameters, namely the t-test, the Friedman statistic test, the Mann-Whitney U test and the Z-score with their corresponding p-values, calculated from the PSO, BCHS and BDA features for alcoholic EEG signals.
Table 2 Average values of key statistical parameters of PSO, BCHS and BDA features for alcoholic EEG signals.
The \(\:p-value\) is a measure of the significance level that states the probability of observing an effect at least as large as the one found in the sample, given that the null hypothesis is true. In general, a small \(\:p\) is desirable: when \(\:p\) is close to zero the null hypothesis is rejected, and vice versa. Typically, if the \(\:p-value\) is:

i. \(\:p-value<\:0.05\): the finding is regarded as statistically significant, so the null hypothesis is rejected.
ii. \(\:p-value>\:0.05\): the outcome is not statistically significant, so the null hypothesis stands because it is not rejected.

The t-test is a hypothesis test conducted to compare the means of two samples, in particular the control and experimental samples. It is widely employed to establish the presence or absence of a statistically significant difference between the two groups’ means. The t-test results in a t-statistic and a \(\:p-value\), both of which are used to evaluate significance. Typically, if the t-statistic value is:

i. Between − 1 and 1: not significant.
ii. Between 1 and 2 (or − 1 to − 2): not statistically significant.
iii. Above 2: highly statistically significant.

As seen in Table 2, the results for the t-statistic with the corresponding \(\:p-value\) indicate that:

i. The average t-statistic value for PSO feature selection is 0.4534, with an associated \(\:p-value\) of 0.5553, indicating that the correlation is not statistically significant.
ii. The average t-statistic value for BCHS feature selection is 0.3105, with an associated \(\:p-value\) of 0.8725, indicating that the correlation is not statistically significant.
iii. The average t-statistic value for BDA feature selection is 4.2096E-39, with an associated \(\:p-value\) of less than 0.00001, signifying that the correlation is highly statistically significant.

The Friedman statistic is a non-parametric test used to compare the means of more than two samples when the data are not normally distributed. It is a variation of the Wilcoxon signed-rank test for related samples, developed for use with more than two groups. This test applies the rank-sum idea to more than two related groups to identify variations in the median values. It is often used in repeated measures designs, in which the same subjects are tested several times under various conditions. It yields a chi-square statistic, which is compared to the critical chi-square value for the respective degrees of freedom. If the calculated statistic is greater than the critical value, then the null hypothesis, which states that the median values of the groups are the same, is rejected, showing that the groups are significantly different. Typically, if the Friedman statistic value is:

i. Friedman statistic \(\:>\:10-15\): there are marked differences between the groups, and hence the null hypothesis should be rejected.
ii. Friedman statistic \(\:<\:5-6\): there is no difference between the groups, and therefore the null hypothesis cannot be rejected.
iii. Friedman statistic between 0 and 1: weak significance.

As demonstrated in Table 2, the Friedman statistic test values indicate that:

i. The Friedman statistic for PSO feature selection is 6.06, with an associated \(\:p-value\) of 0.1947, indicating that the correlation is not statistically significant.
ii. The Friedman statistic for BCHS feature selection is 0.8, with an associated \(\:p-value\) of 0.9385, indicating that the correlation is not statistically significant.
iii. The Friedman statistic for BDA feature selection is 40, with an associated \(\:p-value\) of less than 0.00001, signifying that the correlation is highly statistically significant.

The Mann-Whitney U test is another non-parametric statistical tool employed to compare the distributions of two independent groups. It is mostly used to test the null hypothesis that there is no difference between the median values of two given sets. Typically, if the U statistic value is:

i. A U statistic smaller than 100 corresponds to a statistically significant difference between the groups, relative to the group sizes.
ii. A large U value, for instance greater than 200, shows that the groups are significantly different, with the larger group tending to produce larger U values.
iii. A U value close to the sample size indicates that there is no significant difference between the two groups.

As seen in Table 2, the results of the Mann-Whitney U test indicate that:

i. The Mann-Whitney U test for PSO feature selection gives \(\:\text{U}\:=\:1268\): this value is quite large, which normally means that the two groups differ considerably.
ii. The Mann-Whitney U test for BCHS feature selection gives \(\:\text{U}\:=\:1225.5\): this value is almost equal to the total sample size, indicating that the two groups are almost similar and that the correlation is not statistically significant.
iii. The Mann-Whitney U test for BDA feature selection gives \(\:\text{U}\:=\:0\): this value shows that the groups are separated to the extent that all values in one group are much larger than the values in the other, demonstrating a highly significant difference between the groups.

The Z-score, also called the standard score, measures the deviation of an observation from the mean of a normally distributed dataset. In general, a Z-score equal to zero means the observation equals the average; a positive Z-score indicates an observation above the mean, with a value greater than one \(\:(+1)\) representing more than one standard deviation above the mean; and a negative Z-score indicates an observation below the mean, with a value less than minus one \(\:(-1)\) representing more than one standard deviation below the mean. As demonstrated in Table 2, the Z-score values show that:

i. The Z-score for PSO feature selection is \(\:-0.2142\), with a corresponding \(\:p-value\) of 0.8337, indicating that the correlation is not statistically significant.
ii. The Z-score for BCHS feature selection is \(\:0.4986\), with a corresponding \(\:p-value\) of 0.6171, indicating that the correlation is not statistically significant.
iii. The Z-score for BDA feature selection is \(\:8.7005\), with a corresponding \(\:p-value\) of 0.00001, signifying that the correlation is highly statistically significant and that the values are extreme outliers.
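For reference, a minimal sketch of how such tests could be computed with SciPy is given below; the arrays are random placeholders standing in for the PSO, BCHS and BDA feature values, not data from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical feature values for the two groups (alcoholic vs. normal).
alcoholic = rng.standard_normal(50)
normal = rng.standard_normal(50) + 0.5

# Two-sample t-test: t-statistic and p-value.
t_stat, t_p = stats.ttest_ind(alcoholic, normal)

# Friedman test needs three or more related samples of equal length.
third = rng.standard_normal(50)
fried_stat, fried_p = stats.friedmanchisquare(alcoholic, normal, third)

# Mann-Whitney U test for two independent groups.
u_stat, u_p = stats.mannwhitneyu(alcoholic, normal)

# Z-scores of one feature set relative to its own mean and standard deviation.
z_scores = stats.zscore(alcoholic)

print(t_stat, t_p, fried_stat, fried_p, u_stat, u_p, z_scores.mean())
```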

Table 2 presents the average statistical parameter analysis of the various metaheuristic feature selection algorithms, revealing that the BDA algorithm outperforms the other feature selection algorithms and yields more significant results, as evident from the average values of the statistical parameters. Figure 3 represents the normal probability distribution of the LASSO regression feature extraction based PSO features, Fig. 4 displays the normal probability distribution of the LASSO regression feature extraction based BCHS features and Fig. 5 shows the normal probability distribution of the LASSO regression feature extraction based BDA features.
Fig. 3 Normal probability distribution of the LASSO regression feature extraction based PSO features.
Fig. 4 Normal probability distribution of the LASSO regression feature extraction based BCHS features.
Fig. 5 Normal probability distribution of the LASSO regression feature extraction based BDA features.
The normal probability distribution plots in Figs. 3, 4 and 5 show a clear relationship among the PSO, BCHS and BDA features. The plot in Fig. 3 reveals non-linear patterns and substantial overlap, indicating a deviation from normality and suggesting that the data may not be linearly separable. Figures 4 and 5 exhibit the outliers of the features. The data points cluster tightly around a value of 0.45, indicating a strong correlation, which makes it straightforward to choose a target value for our classification models: because the central area of the plot is concentrated around 0.45, we set this value as the target for our classifiers.
Figure 6 represents the correlation plot of the LASSO regression feature extraction based PSO features, Fig. 7 displays the correlation plot of the LASSO regression feature extraction based BCHS features and Fig. 8 shows the correlation plot of the LASSO regression feature extraction based BDA features. A correlation plot demonstrates how multiple variables relate to one another. In this visualization, there are scatter plots for the pairs of variables, histograms for the individual variables, and correlation coefficients showing the intensity as well as the direction of the linear relationships. The correlation coefficients vary from − 1 to 1, where − 1 indicates a perfect negative correlation and + 1 a perfect positive correlation.
Fig. 6 Correlation plot of the LASSO regression feature extraction based PSO features.
In Fig. 6, the variable histograms show some variability, slightly above that of BCHS. The scatter plots reveal low positive and negative linear correlations between variable pairs. Notable correlations include var3 and var5, with a correlation coefficient of − 0.21, and var1 and var5 with − 0.20. Overall, the variables show moderate variation and generally weak correlations.
Fig. 7 Correlation plot of the LASSO regression feature extraction based BCHS features.
The histograms in Fig. 7 depict narrow distributions for each variable (var1 to var7). The scatter plots indicate weak linear relationships between variable pairs, with correlation coefficients ranging between − 0.16 and 0.24. Notable correlations are var3 and var5 (0.24) and var1 and var6 (0.16). Generally, there is little variation between the variables and their correlations are weak.
Fig. 8 Correlation plot of the LASSO regression feature extraction based BDA features.
As indicated in Fig. 8, with correlation coefficients in the − 0.55 to 1.00 range, the variables are significantly linearly related. Notable linear relationships are between var1 and var5 with a correlation coefficient of 1.00, var1 and var3 with 0.98, and var2 and var3 with − 0.95. The histograms show moderate dispersion for each variable. For the most part, the data points underscore both very strong decreases and increases between the variables.

Classification methodology
This research employed several classification models to analyze the effectiveness of the chosen features. These models included Long Short-Term Memory (LSTM) networks and Enhanced Artificial Neural Networks (EANN). Additionally, this paper investigated Support Vector Machines (SVM) with different kernel functions (linear, polynomial and RBF), Random Forest and an Artificial Neural Network for classification.

Proposed models
An enhanced four-layer neural network and an LSTM model were developed to improve the performance of the neural network architecture. Both models were implemented in Python, and a comparative analysis was conducted based on performance benchmarks.

LSTM
Long short-term memory (LSTM)32 is built on recurrent neural networks (RNNs), a class of deep learning algorithms. An external register or memory is not needed to store past outcomes, since the RNN is made up of recurrent structures that locally feed back the firing activity. Due to these recurrent structures, LSTM has low computational complexity. Figure 9 illustrates the internal architecture of the LSTM. LSTM operation is founded on the following operations33:
$$\:{x}_{t}=\sigma\:\left({W}_{x}\cdot\:\left[{b}_{t-1},{k}_{t}\right]\right)$$
(9)
$$\:{g}_{t}=\sigma\:\left({W}_{g}\cdot\:\left[{b}_{t-1},{k}_{t}\right]\right)$$
(10)
$$\:{\widehat{b}}_{t}=tanh\left(W\cdot\:\left[{{g}_{t}*b}_{t-1},{k}_{t}\right]\right)$$
(11)
$$\:{b}_{t}=\left(1-{x}_{t}\right)*{b}_{t-1}+{x}_{t}*{\widehat{b}}_{t}$$
(12)
Where \(\:\sigma\:\) indicates the sigmoid activation function, \(\:tanh\) represents the hyperbolic tangent activation function, \(\:W\) indicates the input weights and recurrent connections associated with the input, forget and output gates, \(\:{b}_{t}\) signifies the new cell state and \(\:{b}_{t-1}\) indicates the old cell state.
Fig. 9 LSTM internal architecture.
In an RNN, the learning process occurs in two phases: structure learning and parameter learning. Nodes incorporate membership functions based on input variables, typically employing Gaussian functions defined by mean and variance. One-dimensional membership functions are assigned through spatial and temporal firing mechanisms. Structure learning involves determining the conditions under which rules are generated and activated, requiring firing strengths above a specified threshold, usually between \(\:0\:and\:1\), for each input. Parameter learning follows structure learning and aims to minimize the error cost function effectively.

Enhanced artificial neural network
A modernized four-layer design, with ReLU, SeLU and ReLU activations in the first three layers and sigmoid in the last layer, improves the neural network’s effectiveness. It employs the Adam optimizer with a binary cross-entropy loss34. The number of hidden-layer neurons is not restricted by any predefined limit. The network is fully connected, with weights and thresholds tuned during training. Prior to classification, the input is normalized. The architecture consists of four fully connected layers, with dropout applied as needed. Kernels produce feature maps, whereas ReLU and radial basis operations extract non-linear features. Fully connected layers link each neuron to its neighboring layers. Our technique, which focuses on convolution layers and dropouts, surpasses LSTMs and standard neural networks. Pooling layers could be used to improve performance further. Figure 10 illustrates the architecture of the EANN, and the specifications of the EANN architecture components are shown in Table 3.
Fig. 10 Architecture of the EANN.
Table 3 Specifications of EANN architecture components.
The workflow is as follows. Preprocess the input data by normalizing its values, ensuring everything is on a similar scale before feeding it into the network. Optimize the network’s internal workings by adjusting weights and thresholds during training; these act like dials that control how the network processes information. Utilize a fully connected structure with four hidden layers, in which all neurons in each layer are connected to all neurons in the next layer, allowing complex information flow. Incorporate dropout layers at strategic points within the architecture; these temporarily remove some neurons during training, helping to prevent overfitting and improve generalization. Extract meaningful features from the data using convolutional layers, which apply filters (kernels) that slide across the input and identify important patterns. Employ ReLU activation functions in the hidden layers to introduce non-linearity, allowing the network to learn more complex relationships within the data. Unlike the LSTMs and traditional neural networks considered, our architecture, using only convolutional layers and dropouts, achieves better performance in this specific case. While pooling layers were not used here, they could be explored for potential performance improvements. An illustrative sketch of the two proposed models is given below.
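As an illustration only (not the exact configuration reported in the study), the following Keras-style sketch builds a small LSTM classifier and a four-layer dense network with ReLU/SeLU/ReLU activations, a sigmoid output, dropout, the Adam optimizer and binary cross-entropy, mirroring the description above; the layer widths, dropout rate and input shape are assumptions.

```python
from tensorflow.keras import layers, models

n_features = 256   # compressed features per patient (see LASSO section); shape assumed

# LSTM-based classifier: the features are treated as a length-256 sequence of scalars.
lstm_model = models.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.LSTM(64),                        # recurrent layer, Eqs. (9)-(12)
    layers.Dense(1, activation="sigmoid"),  # alcoholic vs. normal
])

# Enhanced ANN: four dense layers (ReLU, SeLU, ReLU, sigmoid) with dropout.
eann_model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),                    # dropout rate is an illustrative choice
    layers.Dense(64, activation="selu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

for model in (lstm_model, eann_model):
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()
```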
Conventional models
The benchmarking process employed conventional models such as ANN, Random Forest and SVM-RBF. The kernel methods in SVM-RBF give good classification performance, Random Forest uses ensembles of decision trees, while the ANN’s major strength is its ability to recognize patterns across its neural network layers. These benchmarks were then used for comparison with the proposed models, which performed better and gave more accurate results.

Artificial neural network (ANN)
An ANN classifier is a computational model inspired by the biological neural networks observed in the human brain. It is made up of interconnected components called neurons that cooperate to solve specific problems, in particular classification problems. An ANN comprises an input layer, an output layer and possibly several hidden layers35. After computing a weighted sum of inputs, the neurons in each layer apply a non-linear activation function. The output \(\:{s}_{q}\) of a neuron \(\:q\) in a hidden or output layer can be calculated as follows36:
$$\:{s}_{q}=f\left(\sum_{p=1}^{n}{w}_{pq}{g}_{p}+{b}_{q}\right)$$
(13)
Where \(\:{g}_{p}\) indicates the neuron’s input signals, \(\:{w}_{pq}\) represents the weight between input \(\:p\) and neuron \(\:q\), \(\:{b}_{q}\) indicates the bias of neuron \(\:q\), and \(\:f\left(\cdot\:\right)\) is the activation function. In this case, one hidden layer with ReLU activation and an output layer with sigmoid activation are utilized. Figure 11 illustrates the architecture of the feed-forward neural network.
Fig. 11 Architecture of the feed-forward neural network.
The ANN classifier is trained on a dataset, during which the weights \(\:{w}_{pq}\) and biases \(\:{b}_{q}\) are tuned in order to reduce the classification error. This work employs an ANN-RBF network architecture comprising 32 neurons in the input layer, 64 neurons in the hidden layer and a single-neuron output layer, which collectively achieve a remarkably low Mean Squared Error (MSE) for both the training and testing phases.

Support Vector Machine (SVM)
Suitable for both regression and classification applications, SVM is a sophisticated supervised machine learning technique37. The method identifies the feature-space hyperplane that best divides the classes. For both linear and non-linear classification scenarios, SVM can use a variety of kernel functions such as linear, polynomial and Gaussian. The linear kernel is the most basic kernel function. If the classes can be divided into different groups by a straight line, the data are linearly separable. The decision boundary is formed as a linear combination of the linear kernel function and the input characteristics, expressed mathematically as follows38:
$$\:G\left(p,q\right)=p\cdot\:q$$
(14)
$$\:f\left(p\right)=w\cdot\:p+b$$
(15)
Where \(\:p\:and\:q\) represent the input vectors, \(\:w\) indicates the weight vector and \(\:b\) represents the bias. For the linear kernel approach, an appropriate hyperparameter is selected through a random search. The polynomial kernel applies polynomial combinations of the input characteristics to produce more complex decision boundaries than the linear kernel, which makes it appropriate for non-linear data. The polynomial kernel function is characterized mathematically as follows:
$$\:G\left(p,q\right)={\left(p\cdot\:q+S\right)}^{d}$$
(16)
Where \(\:S\) is a constant that balances the impact of higher-order terms against lower-order ones, and d symbolizes the polynomial degree. In this case a polynomial degree of 1 is employed, and a grid search is used to tune the polynomial order when the polynomial kernel is applied. The Gaussian kernel, often referred to as the RBF kernel, is a popular option for SVMs because of its ability to deal with non-linear data by mapping the input characteristics into an infinite-dimensional space. The Gaussian kernel’s mathematical expression is as follows:
$$\:G\left(p,q\right)=exp\left(-\gamma\:{||p-q||}^{2}\right)$$
(17)
The parameter \(\:\gamma\:\) determines how widely the kernel spreads and how much each training sample influences the model. For hyperparameter selection with the Gaussian kernel, the gamma value is chosen from a range that begins at 0.2 and increases in steps of 0.2 up to 2.6. The testing process revealed that an MSE of 0.00000488 is attained at 250 iterations with a gamma value of 2.0, indicating a significant reduction in error. A brief sketch of these kernel configurations is given below.
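As a minimal illustration (not the study's exact setup), the sketch below configures the three SVM kernels described above with scikit-learn, including a grid search over the gamma range 0.2–2.6; the synthetic data and labels are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.standard_normal((122, 256))        # illustrative: 122 subjects x 256 features
y = rng.integers(0, 2, size=122)           # hypothetical alcoholic / normal labels

linear_svm = SVC(kernel="linear")                      # Eq. (14)
poly_svm = SVC(kernel="poly", degree=1, coef0=1.0)     # Eq. (16), degree d = 1
rbf_grid = GridSearchCV(                               # Eq. (17), gamma search 0.2-2.6
    SVC(kernel="rbf"),
    param_grid={"gamma": np.arange(0.2, 2.8, 0.2)},
    cv=5,
)

for clf in (linear_svm, poly_svm, rbf_grid):
    clf.fit(X, y)

print(rbf_grid.best_params_)
```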
Random Forest (RF)
In order to categorize the alcoholic signals, the feature values produced using distance metrics are given as input to a classification algorithm39, with the sole goal of categorizing the signals. Random forest classification methods generally depend on the classification outcomes of a number of different tree models. Each tree is given a random vector that is independent of the others and has the same distribution; consequently, the training data and the randomly allocated vector give each tree the support it needs to carry out the classification. The classification effectiveness is validated using the 10-fold cross-validation approach, and the method is then assessed using the classification performance benchmarks. The random forest classifier \(\:g\left(s\right)\) is specified mathematically as follows40:
$$\:g\left(s\right)=majority\_\:vote\left({g}_{1}\left(s\right),{g}_{2}\left(s\right),\dots\:,{g}_{M}\left(s\right)\right)$$
(18)
Where \(\:{g}_{p}\left(s\right)\) indicates the \(\:{p}^{th}\) decision tree’s prediction. The class that obtains the majority of votes from the various trees is ultimately selected as the classification output.
$$\:g\left(s\right)=\underset{D}{\text{argmax}}\sum_{p=1}^{N}\aleph\:\left({g}_{p}\left(s\right)=D\right)$$
(19)
Where \(\:\aleph\:\) represents the indicator function (true = 1; otherwise = 0) and the variable \(\:D\) represents the class label.

Evaluation scheme
The proposed model evaluation scheme is designed to fully test performance, compare it with existing conventional techniques and interpret the results through different performance benchmark measures. This comprehensive assessment ensures that our model is robust and generalizable.

Performance benchmarks analysis and model testing
Both the proposed LSTM and the enhanced artificial neural network models were implemented and tested in Python on a computer with a 2 GHz processor and 16 GB RAM, demonstrating their computational efficiency and accuracy. This research uses a step-by-step methodology to examine the EEG signals in classifying alcohol risk levels. First, feature extraction is performed using the LASSO regression method; feature selection then involves a set of metaheuristic algorithms such as PSO, BCHS and BDA; finally, classification is performed with multiple classifiers. The performance of this approach is evaluated using benchmark metrics such as Sensitivity \(\:{S}_{e}\), Specificity \(\:{S}_{p}\), Accuracy \(\:{A}_{c}\), Matthews Correlation Coefficient \(\:\left(MCC\right)\), Kappa Coefficient Analysis \(\:\left(KCA\right)\), Mean Squared Error \(\:\left(MSE\right)\), Good Detection Rate \(\:\left(GDR\right)\) and Error Rate \(\:{E}_{R}\). For reliable results, 10-fold cross validation is conducted, in which the dataset is divided into 10 equal parts; each iteration uses 70% of the data for training and the remainder for testing, and the performance metrics are finally averaged across all iterations. The Sensitivity \(\:{S}_{e}\), Specificity \(\:{S}_{p}\), Accuracy \(\:{A}_{c}\), MCC, KCA, GDR and Error Rate \(\:{E}_{R}\) are obtained from the confusion matrix using the following formulas41:
$$\:{S}_{e}=\:\frac{TP}{TP+FN}*100$$
(20)
$$\:{S}_{p}=\:\frac{TN}{TN+FP}*100$$
(21)
$$\:{A}_{c}=\:\frac{TP+TN}{TP+TN+FP+FN}*100$$
(22)
MCC is a benchmark metric used to determine the performance of a binary classifier. The scale varies between \(\:-1\:and\:1\); in this study \(\:0.1\:to\:0.4\) indicates a poor prediction and \(\:0.5\:to\:1\) represents a perfect prediction.
$$\:MCC=\:\frac{\left(TP\times\:TN\right)-\left(FP\times\:FN\right)}{\sqrt{\left(TP+FP\right)\times\:\left(TP+FN\right)\times\:\left(TN+FP\right)\times\:\left(TN+FN\right)}}*100$$
(23)
KCA is a benchmark statistic assessing the extent of agreement between the two classes (alcoholic and normal) in a binary classifier. It accounts for the amount of agreement that would be expected to occur by chance and therefore gives a better estimate of agreement than the percent agreement. The scale varies between − 1 and 1; in this study \(\:KCA<0\) indicates no agreement, \(\:KCA\:=\:0.1\:to\:0.4\) indicates moderate agreement and \(\:KCA\:=\:0.5\:to\:1\) represents almost perfect agreement.
$$\:KCA=\frac{{A}_{c}-{E}_{a}}{1-{E}_{a}}$$
(24)
Where \(\:{E}_{a}\) indicates the expected accuracy, which is expressed mathematically as follows:
$$\:{E}_{a}=\left(\frac{TP+FP}{N}\right)*\left(\frac{TP+FN}{N}\right)+\left(\frac{TN+FP}{N}\right)*\left(\frac{TN+FN}{N}\right)$$
(25)
$$\:GDR=\left(\frac{\left(TP+TN\right)-FP}{\left(TP+TN\right)+FN}\right)*100$$
(26)
$$\:{E}_{R}=\frac{FP}{(FP+TN)}*100$$
(27)
Where TP (True Positives) refers to actual positive cases that are correctly predicted as positive, TN (True Negatives) refers to negative cases that are correctly predicted as negative, FP (False Positives) refers to negative cases wrongly predicted as positive and FN (False Negatives) refers to positive cases mistakenly predicted as negative.
$$\:MSE=\frac{1}{K}\sum_{g=1}^{K}{\left({H}_{g}-{S}_{p}\right)}^{2}$$
(28)
In this study, \(\:{H}_{g}\) represents the observed EEG data values at a given time, while \(\:{S}_{p}\) represents the target values for each of the \(\:64\) models \(\:(p\:=\:1\:to\:64)\). With a total of 122 observations per patient \(\:\left(K\right)\), we utilized all distance features of the EEG data for both training and testing the classifiers. The training process entailed minimizing the Mean Square Error (MSE) values to the least amount of error. Interestingly, the majority of classifiers obtained zero training error, consistent with the hypothesis of maximum accuracy. A sketch of how the confusion-matrix-based benchmarks can be computed is given below.
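As an illustrative sketch only (the counts below are placeholders, not results from the study), the following code computes the confusion-matrix-based metrics of Eqs. (20)–(27) from TP, TN, FP and FN values; accuracy is used as a proportion inside the kappa formula, which is an interpretation of Eq. (24).

```python
import math

def benchmark_metrics(TP, TN, FP, FN):
    """Compute the confusion-matrix-based metrics of Eqs. (20)-(27)."""
    N = TP + TN + FP + FN
    Se = TP / (TP + FN) * 100                                  # Eq. (20) sensitivity
    Sp = TN / (TN + FP) * 100                                  # Eq. (21) specificity
    Ac = (TP + TN) / N * 100                                   # Eq. (22) accuracy
    MCC = ((TP * TN - FP * FN) /                               # Eq. (23)
           math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) * 100)
    Ea = ((TP + FP) / N) * ((TP + FN) / N) + ((TN + FP) / N) * ((TN + FN) / N)  # Eq. (25)
    KCA = (Ac / 100 - Ea) / (1 - Ea)                           # Eq. (24), accuracy as proportion
    GDR = ((TP + TN) - FP) / ((TP + TN) + FN) * 100            # Eq. (26)
    ER = FP / (FP + TN) * 100                                  # Eq. (27) error rate
    return dict(Se=Se, Sp=Sp, Ac=Ac, MCC=MCC, KCA=KCA, GDR=GDR, ER=ER)

# Placeholder counts for demonstration only.
print(benchmark_metrics(TP=55, TN=52, FP=8, FN=7))
```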
