Advanced differential evolution for gender-aware English speech emotion recognition

We conduct experiments to verify the superiority of the proposed ADE algorithm against the DE37, BBO_PSO38, and MA29 algorithms. BBO_PSO and MA are two state-of-the-art algorithms in emotion recognition: BBO_PSO focuses on emotion recognition alone, while MA classifies emotions based on gender. DE and ADE utilize the gender-emotion model shown in Fig. 1. Table 2 displays the main parameter settings of the compared algorithms.

Table 2 The main parameter settings.

The algorithms are executed 20 times, and their population size is fixed at 20. DE, BBO_PSO, and MA have a maximum of 100 iterations, while ADE has 200. To evaluate potential significant differences in the experimental results, we use the Wilcoxon rank-sum test and the Friedman test with a significance level of 0.05.

Objective function

Classification accuracy is the most important metric for SER algorithms, so it is used as the objective function in the experiments, as indicated in Eq. (5). Accuracy, weighted accuracy, and unweighted accuracy are common metrics for evaluating the classification ability of emotion recognition algorithms. Although weighted and unweighted accuracy better reflect a model's performance on imbalanced categories, our emotion datasets contain multiple emotion types, so overall accuracy tests the performance of the algorithms more comprehensively. Our objective is to maximize the recognition accuracy of the algorithms, which is achieved by evaluating accuracy directly. We analyze the confusion matrix to identify which emotions each algorithm classifies well. Additionally, we evaluate the algorithms by comparing recall, precision, F1-score, the number of selected features, and execution time.

$$\begin{aligned} accuracy = \dfrac{S1}{S1+S2} \end{aligned}$$
(5)
where S1 and S2 represent the numbers of correctly and incorrectly classified samples, respectively.

Experimental analysis

We use SVM as the classifier and employ 10-fold cross-validation to assess the performance of the algorithms. For ease of reading, the best experimental result obtained by the algorithms is marked in bold font.

Table 3 displays the average, minimum, and maximum recognition accuracy. ADE exhibits superior classification accuracy compared to DE in CREMA-D, EmergencyCalls, and RAVDESS, whereas DE outperforms ADE only in IEMOCAP-S1. This suggests the effectiveness of the proposed DE improvement. Additionally, ADE demonstrates better average accuracy than BBO_PSO, MA, and DE in CREMA-D, EmergencyCalls, and RAVDESS, while DE has the best average accuracy in IEMOCAP-S1. The overall performance of the algorithms on IEMOCAP-S1 is mediocre. The main reason is that this dataset contains the most emotional features and its sample distribution is uneven, which prevents the algorithms from building accurate prediction models. In EmergencyCalls, ADE achieves the best prediction accuracy, and its worst prediction value is also better than those of BBO_PSO, MA, and DE. In IEMOCAP-S1, ADE attains the highest classification accuracy of 0.5729, outperforming the other algorithms, while DE's worst prediction value of 0.5578 is superior to those of the other algorithms. In RAVDESS, ADE and DE outperform the comparison algorithms in the best and worst prediction values, respectively.

Table 3 The classification accuracies of the algorithms.

The Wilcoxon rank-sum test reveals that BBO_PSO is statistically indistinguishable from ADE in RAVDESS, and that DE and ADE perform similarly in IEMOCAP-S1. The average ranks of BBO_PSO, MA, DE, and ADE across CREMA-D, EmergencyCalls, IEMOCAP-S1, and RAVDESS are 3, 3.75, 2, and 1.25, respectively, with a p-value of 0.0440. The Friedman test therefore demonstrates that ADE performs best on the emotional datasets.
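The evaluation protocol above (Eq. (5) accuracy under 10-fold cross-validation with an SVM) can be sketched as follows. The feature matrix here is synthetic stand-in data, not the corpus features used in the experiments:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for an emotional-feature matrix X and labels y;
# the real experiments use acoustic features from the four corpora.
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)

# 10-fold cross-validated predictions with an SVM classifier.
y_pred = cross_val_predict(SVC(), X, y, cv=10)

# Eq. (5): accuracy = S1 / (S1 + S2), where S1 counts correctly
# classified samples and S2 counts incorrectly classified ones.
s1 = int(np.sum(y_pred == y))
s2 = int(np.sum(y_pred != y))
accuracy = s1 / (s1 + s2)
print(round(accuracy, 4))
```

In a feature selection run, each candidate feature subset would be scored by re-running this evaluation on the corresponding columns of X.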
MA, DE, and ADE are all gender-based emotion recognition algorithms, while BBO_PSO does not utilize gender information. As Table 3 shows, DE and ADE outperform BBO_PSO, which indicates that gender information can improve emotion recognition accuracy.

To further validate the efficiency of the algorithms, we analyze their precision, recall, and F1-score, as shown in Table 4. The algorithms are most effective in RAVDESS, and the results are comparable in CREMA-D and EmergencyCalls. Since some emotion classes in IEMOCAP-S1 have few samples, the algorithms cannot classify them correctly; consequently, precision and F1-score are unavailable for these classes, and their recall values are also low. ADE outperforms the comparison algorithms in precision, recall, and F1-score in CREMA-D, EmergencyCalls, and IEMOCAP-S1, but it is surpassed by BBO_PSO in RAVDESS. In RAVDESS, ADE beats BBO_PSO, MA, and DE in classification accuracy but lags in recall and precision. The optimization goal of ADE is overall classification accuracy rather than the recognition of each individual class. This may cause ADE to perform poorly on rare or borderline samples, missing some positive samples (lower recall) or misclassifying more negative samples (lower precision).

Table 4 The recall, precision, and F1-score of the algorithms.

Table 5 presents the running time and the number of selected features. BBO_PSO has a clear advantage in running time: it is more efficient than MA, DE, and ADE in EmergencyCalls, IEMOCAP-S1, and RAVDESS, while ADE achieves the shortest running time in CREMA-D. The time complexity of the SVM classifier is between O(\(D * T^2\)) and O(\(D * T^3\)), where D is the feature size and T is the number of samples; the computation time of feature selection algorithms is therefore dominated by the classifier.
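The per-class behavior described above, including precision and F1-score becoming undefined for classes that receive no predictions, can be reproduced in miniature. The labels below are hypothetical, not the paper's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels: class 3 is a rare emotion that is never predicted,
# mirroring the IEMOCAP-S1 classes whose precision/F1-score are unavailable.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 0])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 0])

cm = confusion_matrix(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2, 3], zero_division=0)

# Class 3 receives no predictions, so its precision (and hence F1-score)
# is undefined and reported as 0 here, while its recall is simply 0.
print(cm)
print(prec[3], rec[3], f1[3])
```

This is why a class with very few samples can show a low recall value even when its precision and F1-score cannot be computed.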
Although BBO_PSO employs more features than ADE, the time difference between them is marginal. ADE takes less time than DE on all four datasets. Because CREMA-D contains the largest number of samples and EmergencyCalls the smallest, the computation time of the algorithms on CREMA-D is considerably longer than on the other datasets. The algorithms draw from the same feature set in each dataset, so the numbers of features they select are similar. ADE utilizes the fewest features, while DE employs the most; note, however, that both ADE and DE explore twice the feature space of BBO_PSO and MA.

Table 5 The number of selected features and running time of the algorithms.

Discussion

The time complexity of ADE is \(O(G*N*dim+G*N*f)\), where f is the execution time of the objective function, and G and N represent the maximum iteration count and the population size. In feature selection, f dominates due to its high complexity, so the time complexity can also be represented as \(O(G*N*f)\).

Figure 3 depicts the confusion matrix of ADE. In CREMA-D, ADE recognizes Angry, Neutral, and Sad well, but for Disgust and Fearful, the Sad, Happy, and Angry classes greatly interfere with the accuracy. In EmergencyCalls, the recognition of Angry, Drunk, and Stressful is affected by the presence of Painful. ADE performs remarkably well in identifying Painful, Angry, and Drunk, but it struggles to distinguish Stressful. In IEMOCAP-S1, the numbers of samples for Surprised, Fearful, Other, and Disgust are relatively small; ADE finds it difficult to classify them correctly, although they do not affect the recognition of the other emotions. Recognition is complicated by the Neutral and Frustrated emotions, and ADE has the best accuracy in classifying the Sad and Excited emotions. In RAVDESS, ADE is the top performer in recognizing Calm, Angry, Fearful, Disgust, and Surprised.
However, the algorithm can easily mistake Neutral for Calm and Sad.

Figure 3 The confusion matrix of ADE.

In terms of acoustic features, males and females show distinct differences in the following aspects:

(1)

The mean of MFCC [3,4,6,7,8,14] and log Mel Freq. Band [6,7] in CREMA-D.

(2)

The mean and differentiation of MFCC [4,6,7,8,13,14], log Mel Freq. Band [0,7], and LSP Frequency [1] in EmergencyCalls.

(3)

The mean, variance, and second-order differentiation of MFCC [0,1,4,9,10,11,12], LSP Frequency [0], and PCM Loudness in IEMOCAP-S1.

(4)

The mean, variance, and differentiation of MFCC [3,6,7,8,9,13], F0 Envelope Quartile, and F0 by Sub-Harmonic Summation in RAVDESS.

Gender information in MFCC [3,6,7,8,9,13] differs significantly, and the mean, variance, and differentiation of these features also have statistical characteristics that impact recognition.

Recently, speech emotions have been recognized through CNNs and feature fusion39,40. Ref.39 claimed an accuracy of 58.62% in RAVDESS, which can be increased to 78.35% using data augmentation. By combining the frequency- and time-domain features of MFCC, Ref.40 reported that the recognition rates on IEMOCAP and RAVDESS can be improved to 74.62% and 86.11% using the resulting 23,712 features. In the future, we can utilize data augmentation and feature fusion methods to improve the classification performance of the ADE algorithm.
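Gender differences in feature statistics of the kind listed above can be checked with a standard two-sample test. The sketch below uses synthetic per-utterance means of a single MFCC coefficient; the variable names and values are illustrative assumptions, not corpus data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-utterance means of one MFCC coefficient for each gender;
# a real analysis would use the extracted features from each corpus.
mfcc3_male = rng.normal(loc=-2.0, scale=1.0, size=80)
mfcc3_female = rng.normal(loc=-1.2, scale=1.0, size=80)

# Welch's t-test: is the mean of this coefficient significantly
# different between male and female speakers at the 0.05 level?
t_stat, p_value = stats.ttest_ind(mfcc3_male, mfcc3_female, equal_var=False)
print(round(p_value, 4))
```

Running such a test per feature (with a multiple-comparison correction) is one way to decide which acoustic features carry usable gender information.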
