A cardiac audio classification method based on image expression of multidimensional features

The following provides a detailed introduction to the CACIEMDF algorithm framework that we have developed. Before delving into the algorithm, we first present the sources of the datasets, the preprocessing applied to them, and the 102-dimensional feature vector that we designed. Subsequently, three specific enhancements of our algorithm framework are introduced.

Datasets

We collected heart sound data from Yaseen201827 (Database1), the 2016 PhysioNet/CinC Challenge21 (Database2), and the 2011 PASCAL challenge dataset (Database3). The types and quantities of audio in each dataset are shown in Tables 8, 9 and 10.

Yaseen201827 provided Database1, which contains heart sound data divided into normal and abnormal classes. The data are further subdivided into five categories: one normal (N) category and four abnormal categories, namely aortic stenosis (AS), mitral stenosis (MS), mitral regurgitation (MR), and mitral valve prolapse (MVP). Detailed audio types and quantities are shown in Table 8.

Database2, used in the 2016 PhysioNet/CinC Challenge21, is composed of 6 different databases (a–f) with a total of 3240 heart sound recordings, of which 665 are normal and 2575 are abnormal. The duration of these recordings varies, from 5 s at the shortest to over 120 s at the longest. Heart sounds recorded from healthy subjects are labeled normal, and those from patients with confirmed heart disease are labeled abnormal. Most of these patients have a variety of heart diseases, especially heart valve defects and coronary artery disease. The abnormal heart sounds were recorded without further information on disease categories. All recordings in the dataset were resampled to 2000 Hz; see Table 9 for the specific audio types and quantities.

The data in the 2011 PASCAL Challenge dataset, Database3, come from two sources: (A) data collected from the public via the iStethoscope Pro iPhone app and (B) data from clinical trials in hospitals using DigiScope, a digital stethoscope; some of this audio contains excessive noise. Table 10 lists the audio types and quantities.

We selected all audio from Database1 and Database2, as well as the normal and abnormal audio from Database3, as our experimental dataset. It contains 3504 abnormal recordings, with the longest lasting 101.67 s and the shortest 0.85 s, and 1216 normal recordings, with the longest lasting 121.99 s and the shortest 0.76 s.

Data preprocessing

To balance the positive and negative samples, and because the periodicity of normal heart sounds is pronounced, we split each normal recording longer than 10 s into equal segments and treated the segments as new samples. Each segment still contains enough complete heart sound cycles, so the characteristic information of normal audio is retained while the number of samples is increased. After splitting, there are 3452 normal recordings, close to the number of abnormal samples, achieving data balance. The total number of normal and abnormal samples is 6956.
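As one plausible reading of this splitting rule, the following numpy sketch divides a long normal recording into equal segments; the floor-based segment count (so that each segment is at least 10 s long) is our assumption, not a detail given in the text:

```python
import numpy as np

def split_long_normal(signal, sr, max_sec=10.0):
    """Split a normal recording longer than max_sec seconds into equal-length
    segments, each still containing several complete cardiac cycles.
    Sketch of the balancing step described above; exact segment-count rule
    is an assumption."""
    n = len(signal)
    if n <= max_sec * sr:
        return [signal]
    n_seg = max(1, int(n // (max_sec * sr)))     # each segment >= max_sec long
    seg_len = n // n_seg                          # equal segment length (samples)
    return [signal[k * seg_len:(k + 1) * seg_len] for k in range(n_seg)]
```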
In our experiments, the positive and negative samples were divided into training, validation and test sets at a ratio of 8:1:1.

Because of differences in collection settings, equipment, noise and other factors among the three datasets, the audio amplitudes differ significantly. Therefore, before extracting features from the audio, the data are denoised, downsampled and normalized. The flow chart is shown in Fig. 6.

Fig. 6 Flow chart of data preprocessing.

In addition to the obvious periodic peaks, some recordings contain unusually sharp peaks, as shown in Fig. 7. Measuring each sample's peak amplitude relative to its standard deviation, we find that the sample with the sharp peak has a ratio as high as 94, whereas the average ratio over all samples is 10. Such sharp spikes are therefore rare, and we attribute them to noise rather than to heart murmurs.

Fig. 7 Visualization of audio with sharp noise peaks.

To remove the influence of these outliers, we calculate the divergence of the audio amplitude extremes as follows:$$\begin{aligned} ms_{i}=\frac{max((max_{i}-\mu _{i}),(\mu _{i}-min_{i}))}{std_i}, \end{aligned}$$
(1)
where \(i\) indexes the audio recordings, \(\mu\) is the mean amplitude of the sampling points and \(std\) is the standard deviation of the amplitude of the sampling points. According to the statistics, the mean value of \(ms_i\) is close to 10. Based on this, we calculate the upper and lower bounds of the audio amplitude as follows:$$\begin{aligned} upb_i = \mu _i+10*std_i \end{aligned}$$
(2)
$$\begin{aligned} lowb_i = \mu _i-10*std_i \end{aligned}$$
(3)
Amplitude values within the bounds are left unchanged; values outside the bounds are set to the corresponding boundary value.$$\begin{aligned} data_{ij}=\left\{ \begin{matrix} lowb_i & \text {if } data_{ij}\le lowb_i\\ data_{ij} & \text {if } lowb_i< data_{ij}\le upb_i\\ upb_i & \text {if } data_{ij}> upb_i \end{matrix}\right. \end{aligned}$$
(4)
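A minimal numpy sketch of this clipping step (Eqs. 1–4); the function name and the use of np.clip are our own, not the authors' implementation:

```python
import numpy as np

def clip_outliers(data, k=10.0):
    """Clamp amplitudes to [mu - k*std, mu + k*std] (Eqs. 2-4).
    k = 10 follows the observed mean of ms_i reported in the text."""
    mu, std = data.mean(), data.std()
    lowb, upb = mu - k * std, mu + k * std
    return np.clip(data, lowb, upb)   # values outside the bounds -> bound value
```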
The amplitude of the processed audio sampling points is shown in Fig. 8.

Fig. 8 Visualization of audio after outlier removal.

A comparison of Figs. 7 and 8 shows that the original amplitude plot contains obvious outliers, whose maximum value reaches 0.15. After clipping with the upper and lower bounds, the overall amplitude lies within [− 0.1, 0.1], which reduces the influence of extreme values. The upper and lower bounds of the amplitude vary slightly from sample to sample. To avoid any influence of the amplitude range on feature extraction, we use the following formula to scale the overall amplitude range to [− 1, 1]:$$\begin{aligned} norm\_data_i=\frac{data_i}{\max _{j}(|data_{ij}|)} \end{aligned}$$
(5)
where \(data_i\) is the sequence of sampling points of the \(i\)th sample and \(data_{ij}\) is the \(j\)th sampling point of the \(i\)th sample. \(|*|\) denotes the absolute value, and \(norm\_data_i\) is the scaled sequence of sampling points, as shown in Fig. 9.

Fig. 9 Visualization of the normalized audio.

Multidimensional feature extraction

Feature extraction is a key step in the diagnosis and recognition of heart sound signals. Since a heart sound is a time-dependent signal, time domain features such as kurtosis28,29, skewness28,29, mean energy30,31 and the short-time zero crossing rate32,33 are widely used. Frequency domain features have stronger discriminatory ability than time domain features34 and are better suited to identifying changes or patterns in heart sound signals. Both the spectral information of a single frequency and that of a frequency band can be used to help distinguish different types of heart sound signals.

In this paper, in addition to time domain and frequency domain features, statistical domain features are used to describe heart sound signals. We added higher-order cumulants, the absolute sum of first-order differences, and the first- to fourth-order autocorrelation coefficients to enrich the multidimensional feature vector of heart sounds. In summary, we designed a 102-dimensional feature vector that contains time domain (\(f_1 \sim f_{18}\)), frequency domain (\(f_{19} \sim f_{46},f_{56} \sim f_{102}\)), and statistical domain (\(f_{47} \sim f_{55}\)) features of heart sound signals. The specific meanings are shown in Table 11.

Table 11 Extracted features and their meanings.

The Mel-frequency cepstral coefficient (MFCC) is widely used in speech recognition, and its extraction process includes several key steps: preprocessing, fast Fourier transform (FFT), Mel filter bank, logarithm operation, discrete cosine transform (DCT), and dynamic feature extraction. The MFCC effectively represents the characteristics of speech signals in accordance with human hearing and achieves remarkable results in automatic speech and speaker recognition. Let \(C_{i,j}\) denote the \(i\)th MFCC of the \(j\)th frame, \(I\) the set of MFCC features, \(\mu _j\) the per-frame mean, and \(r=2\); \(\min _{i\in I}C_{i,j}\), \(\max _{i\in I}C_{i,j}\) and \(Skew_{i\in I}C_{i,j}\) denote the per-frame minimum, maximum and skewness, respectively. The MFCC-based features35 are calculated as follows:$$\begin{aligned} f_{37}:AV_{MFCC}=\frac{1}{N} { \sum \limits _{j=1}^{N}}\min _{i\in I}C_{i,j} \end{aligned}$$
(6)
$$\begin{aligned} f_{38}:\mu _{r}^{max,MFCC}=E(\max _{i\in I}C_{i,j}-\mu _j)^{r} \end{aligned}$$
(7)
$$\begin{aligned} f_{39}:\mu _{r}^{skew,MFCC}=E(Skew_{i\in I}C_{i,j}-\mu _j)^{r} \end{aligned}$$
(8)
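For illustration, a sketch of how \(f_{37}\sim f_{39}\) could be computed with librosa; the choice of n_mfcc = 13 and the use of scipy.stats.skew are our assumptions, not details given in the text:

```python
import numpy as np
import librosa
from scipy.stats import skew

def mfcc_aggregates(y, sr, n_mfcc=13, r=2):
    """Sketch of f37-f39 (Eqs. 6-8) under our reading of the formulas."""
    C = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, N frames)
    mu_j = C.mean(axis=0)                                 # per-frame mean over coefficients
    f37 = C.min(axis=0).mean()                            # average of per-frame minima
    f38 = np.mean((C.max(axis=0) - mu_j) ** r)            # r-th moment of per-frame maxima
    f39 = np.mean((skew(C, axis=0) - mu_j) ** r)          # same for per-frame skewness
    return f37, f38, f39
```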
The wavelet transform is often used to extract features of heart sound signals. On the one hand, it enables time-frequency analysis of heart sound signals with greater precision and accuracy than the traditional Fourier transform; on the other hand, the signal can be analyzed at different scales to capture both its details and its global characteristics. The heart sound signal is first decomposed with a Daubechies 4 (db4) discrete wavelet transform35. Let \(a_5\) denote the 5th-level approximation coefficients; \(d_3\), \(d_4\) and \(d_5\) the detail coefficients of levels 3, 4 and 5, respectively; \(\sigma ^{2}\) the variance; \(H_{q}(\cdot )\) the Rényi entropy with \(q=2\); \(H(\cdot )\) the Shannon entropy; and \(i\) the index of a coefficient within the coefficient sequence. The wavelet-based features are calculated as follows:$$\begin{aligned} f_{40}:H(a_5)=-\sum _{i}p(a_{5_{i}})lnp(a_{5_{i}}) \end{aligned}$$
(9)
$$\begin{aligned} f_{41}:H_q(d_5)=\frac{1}{q-1}ln(\sum _{i}p{(d_{5_i})^q}) \end{aligned}$$
(10)
$$\begin{aligned} f_{42}:H(d_4)=-\sum _{i}p(d_{4_i})lnp(d_{4_i}) \end{aligned}$$
(11)
$$\begin{aligned} f_{43}:\lambda (d_3)=log_2(\sigma ^2(d_3)) \end{aligned}$$
(12)
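A sketch of \(f_{40}\sim f_{43}\) (Eqs. 9–12) using pywt; treating the normalized squared coefficients as the probabilities \(p(\cdot )\) is our assumption:

```python
import numpy as np
import pywt

def wavelet_features(y):
    """Sketch of f40-f43 from a 5-level db4 decomposition."""
    a5, d5, d4, d3, _, _ = pywt.wavedec(y, 'db4', level=5)

    def prob(c):
        e = c ** 2
        return e / e.sum()                       # energy -> probability (assumption)

    p_a5, p_d5, p_d4 = prob(a5), prob(d5), prob(d4)
    f40 = -np.sum(p_a5 * np.log(p_a5 + 1e-12))   # Shannon entropy of a5
    f41 = np.log(np.sum(p_d5 ** 2)) / (2 - 1)    # entropy of d5 per Eq. (10), q = 2
    f42 = -np.sum(p_d4 * np.log(p_d4 + 1e-12))   # Shannon entropy of d4
    f43 = np.log2(np.var(d3))                    # log2 variance of d3
    return f40, f41, f42, f43
```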
The power spectral density (PSD) is an important physical quantity that characterizes the power distribution of a signal in the frequency domain. It reflects the power carried by the signal per unit frequency and is the energy representation of the signal in the frequency domain. It is very important for analyzing and recognizing signal characteristics; its calculation first applies a Fourier transform to the signal and then takes the squared modulus. Let \(P(f)\) denote the power spectral density, \(MPSD_{Centroid}\) the modified power spectral density centroid, and \(AUC_1\) and \(AUC_2\) the areas under the PSD curve over two specified frequency intervals. The PSD-based features35 are calculated as follows:$$\begin{aligned} f_{44}:MPSD_{Centroid}=\frac{\int fP(f)^2df}{\int P(f)^2df} \end{aligned}$$
(13)
$$\begin{aligned} f_{45}:AUC_1=\int _{0.7}^{0.8}P(f)df \end{aligned}$$
(14)
$$\begin{aligned} f_{46}:AUC_2=\int _{0.9}^{1}P(f)df \end{aligned}$$
(15)
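A sketch of \(f_{44}\sim f_{46}\) (Eqs. 13–15); using Welch's method for the PSD and normalizing frequency by the Nyquist rate, so that the integration limits [0.7, 0.8] and [0.9, 1.0] are meaningful, are both our assumptions:

```python
import numpy as np
from scipy.signal import welch

def psd_features(y, sr):
    """Sketch of f44-f46: modified PSD centroid and two band areas."""
    f, P = welch(y, fs=sr)
    fn = f / (sr / 2)                                      # normalized frequency in [0, 1]
    f44 = np.trapz(fn * P**2, fn) / np.trapz(P**2, fn)     # modified PSD centroid (Eq. 13)
    b1 = (fn >= 0.7) & (fn <= 0.8)
    b2 = (fn >= 0.9) & (fn <= 1.0)
    f45 = np.trapz(P[b1], fn[b1])                          # area over [0.7, 0.8] (Eq. 14)
    f46 = np.trapz(P[b2], fn[b2])                          # area over [0.9, 1.0] (Eq. 15)
    return f44, f45, f46
```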
In addition to the time domain and frequency domain features described above, this paper also introduces statistical domain features. The higher-order cumulants are the coefficients of the Taylor series expansion of the second characteristic function and can provide more information than traditional second-order statistics. Moreover, they are robust to Gaussian noise, so they can extract signal features effectively in the presence of noise interference and are especially useful for analyzing and processing non-Gaussian signals. The calculation proceeds as follows. The characteristic function of a random variable \(x\) can be written in terms of the Laplace variable \(s\):$$\begin{aligned} \phi (s)=\int _{-\infty }^{\infty } f(x)e^{sx}dx=E\left\{ e^{sx} \right\} \end{aligned}$$
(16)
Taking the \(k\)th derivative of the characteristic function, we obtain$$\begin{aligned} \phi ^k (s)=E\left\{ x^ke^{sx} \right\} \end{aligned}$$
(17)
The value at the origin is equal to the \(k\)th moment of \(x\)$$\begin{aligned} \phi ^k (0)=E\left\{ x^k \right\} =m_k \end{aligned}$$
(18)
\(\phi (s)\) is often referred to as the moment-generating function of \(x\), also known as the first characteristic function. \(\psi (s)=ln\phi (s)\) is known as the cumulant-generating function of \(x\), also called the second characteristic function. The \(k\)th cumulant \(c_k\) of a random variable \(x\) is defined as the value of the \(k\)th derivative of its cumulant-generating function \(\psi (s)\) at the origin:$$\begin{aligned} c_k=\frac{d^k\psi (s)}{ds^k} \Big |_{s=0} \end{aligned}$$
(19)
Since \(\phi (s)=e^{\psi (s)}\), \(\phi ^{\prime } (s)= \psi ^{\prime }(s) e^{\psi (s)}\) and \(\phi ^{\prime \prime } (s)= \left\{ \psi ^{\prime \prime }(s)+\left[ \psi ^{\prime }(s) \right] ^{2} \right\} e^{\psi (s)}\). Setting \(s=0\), we obtain \(\phi ^{\prime }(0)= \psi ^{\prime }(0)=m_{1}\) and \(\phi ^{\prime \prime } (0)= \left\{ \psi ^{\prime \prime }(0)+\left[ \psi ^{\prime }(0) \right] ^{2} \right\} =m_{2}\).The first and second cumulants can be obtained36,37.$$\begin{aligned} f_{47}:c_1=m_1 \end{aligned}$$
(20)
$$\begin{aligned} f_{48}:c_2=m_2-m_{1}^{2} \end{aligned}$$
(21)
The third and fourth cumulants can be obtained in a similar manner36.$$\begin{aligned} f_{49}:c_3=m_3-3m_2m_1+2m_{1}^{3} \end{aligned}$$
(22)
$$\begin{aligned} f_{50}:c_4=m_4-4m_3m_1-3m_2^2+12m_2m_{1}^{2}-6m_{1}^{4} \end{aligned}$$
(23)
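The moment-to-cumulant relations translate directly into code; a short sketch:

```python
import numpy as np

def cumulants(x):
    """First four cumulants f47-f50 (Eqs. 20-23) from the raw sample
    moments, using the standard moment-to-cumulant relations."""
    m1, m2, m3, m4 = (np.mean(x ** k) for k in (1, 2, 3, 4))
    c1 = m1
    c2 = m2 - m1**2
    c3 = m3 - 3*m2*m1 + 2*m1**3
    c4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4
    return c1, c2, c3, c4
```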
The absolute sum of first-order differences is a feature used in time-series analysis. It takes the absolute difference between consecutive data items in the series and accumulates these absolute values into a total. The specific calculation is as follows:$$\begin{aligned} f_{51}:ASC_1= { \sum _{i=1}^{n-1}} \left| x_{i+1}-x_i \right| \end{aligned}$$
(24)
In time-series analysis, the autocorrelation coefficient is a statistical index that measures the correlation between values of the same series at different time points; it quantifies the degree of correlation at different lags and thus reveals the inherent structure of the series. This paper extracts the autocorrelation coefficients of the first four lags, where \(\mu\) is the mean value and the lag \(l\) takes the values 1, 2, 3 and 4, yielding \(f_{52}\), \(f_{53}\), \(f_{54}\) and \(f_{55}\), respectively:$$\begin{aligned} R(l)=\frac{1}{n-l} { \sum _{i=1}^{n-l}}(x_i-\mu )(x_{i+l}-\mu ) \end{aligned}$$
(25)
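A numpy sketch of \(f_{51}\) (Eq. 24) and \(f_{52}\sim f_{55}\) (Eq. 25); the function name is our own:

```python
import numpy as np

def temporal_stats(x):
    """f51: absolute sum of first differences; f52-f55: autocorrelation
    coefficients at lags l = 1..4 per Eq. (25)."""
    f51 = np.sum(np.abs(np.diff(x)))
    mu, n = x.mean(), len(x)
    acf = [np.sum((x[:n - l] - mu) * (x[l:] - mu)) / (n - l) for l in range(1, 5)]
    return f51, acf
```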
To display the distribution of the 102-dimensional features more directly, we use t-distributed stochastic neighbor embedding (t-SNE) to visualize the multidimensional feature vectors. First, all samples (normal and abnormal) are input into the feature extraction module to obtain their multidimensional feature data. Then t-SNE is used to generate the feature distribution diagram in a low-dimensional space, as shown in Fig. 10, where black represents negative samples and red represents positive samples.

Fig. 10 Distribution of 102-dimensional features.

Figure 10 shows that the extracted 102-dimensional features have a very clear class interface, and the point clusters of the two colors are relatively concentrated (the negative samples cover several categories of heart sound abnormality, so the black clusters are divided into several parts). This indicates that the extracted features are highly correlated with the categories and discriminate well between positive and negative samples, which helps the network separate them correctly. Other relevant experimental results are shown in Tables 3, 4 and 5.

To understand the importance of different features, we divided the 102 dimensions into five groups: Fbank features, LFCC features, MFCC features, autocorrelation features and other features. The parentheses after each feature set indicate the range of the features. We used random forests to assess the importance of these five groups and drew an importance pie chart, as shown in Fig. 11. In assessing the importance of the 102 individual features, we found that the 86th and 87th features (Fbank), the 70th feature (LFCC), the 1st feature (kurtosis), and the 58th feature (entropy of the energy spectrum), among others, rank at the top. This suggests that these features are important for the heart sound classification task.

Fig. 11 Importance assessment of 102-dimensional features.

Table 12 Numerical statistics of some important features on the Yaseen dataset.

To analyze the interpretability of these important features, we selected the three most important of the 102 dimensions extracted from the Yaseen dataset for numerical statistics; the results are shown in Table 12. The Fbank feature simulates how the ear receives sound and emphasizes the low-frequency part of the heart sound signal, while the LFCC feature better characterizes the high-frequency information of the signal through a linear filter bank. Since normal heart sounds are low-frequency signals, whereas heart murmurs often contain substantial high-frequency content, the Fbank and LFCC features of normal samples are significantly smaller than those of abnormal samples. Kurtosis represents the steepness of the signal peak and also reflects the speed of signal change. AS is a heart murmur caused by aortic stenosis: the aortic valve cannot fully open during systole, so blood flowing through the narrowed valve produces high-frequency noise, which slows the signal change across S1-systole-S2.
Therefore, the kurtosis of AS is significantly smaller than that of the other categories. These important features are closely related to the pathology of heart sound signals, and the effectiveness of the extracted features is preliminarily demonstrated by the importance assessment experiments. In the future, we will further study the pathological interpretability of heart sound features.

Color image expression based on multidimensional feature vector projection

We propose a color image expression method (CACIEMDF) based on multidimensional feature vector projection. Our method first uses PCA dimensionality reduction and a convex hull algorithm to project the 102-dimensional feature vector into a 2-D image coordinate system, finding a coordinate representation for each feature component. Feature values projected to the same pixel position are summed. Then, the color image representation of the heart sound signal is obtained by filling the three channels of a color image with each one-dimensional feature component and its divergences to the two categories. Finally, a Gaussian model is used to render the image and enrich its content, so that the image no longer consists only of discrete points. The flowchart is shown in Fig. 12.

Fig. 12 Flowchart of the CACIEMDF.

Stacking effect of close-range features

To construct the projection space, the expanded sample feature set is first represented as a matrix with 102 rows (the feature dimension) and 6956 columns (the sample count). PCA dimensionality reduction transforms the feature data into a matrix with 102 rows and 2 columns, representing the relative positions of the 102 features. Then, the convex hull algorithm is used to find the smallest rectangle covering all feature points. By rotating this rectangle, a Cartesian coordinate system and the position of each feature in that system are obtained. Each feature can thus be mapped into a grayscale image through a pair of 2-D coordinates.

Fig. 13 Schematic diagram of the different filling methods. (a) Sequential filling. (b) Reverse filling. (c) PCA combined with the convex hull algorithm.

The two-dimensional data obtained by PCA dimensionality reduction are not affected by the feature input order: if the feature content is fixed and the input order is changed, the reduced data, and hence the feature coordinates, remain the same. In this case, the coordinate system and positions obtained by the convex hull algorithm are also fixed. In contrast, simply arranging the input features into a two-dimensional array is affected by their input order. The comparison is illustrated in Fig. 13.
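A sketch of this projection construction under our reading: PCA places each of the 102 features at a 2-D point, and the minimum-area enclosing rectangle is found by testing each convex hull edge direction. The grid size of 64 and the edge-search details are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull

def feature_coords(F, size=64):
    """Map each feature (row of F, shape (n_features, n_samples)) to
    integer pixel coordinates via PCA + minimum-area rectangle."""
    xy = PCA(n_components=2).fit_transform(F)        # one 2-D point per feature
    hull = ConvexHull(xy)
    best = None
    for e in hull.simplices:                         # try each hull edge direction
        d = xy[e[1]] - xy[e[0]]
        theta = np.arctan2(d[1], d[0])
        c, s = np.cos(theta), np.sin(theta)
        rot = xy @ np.array([[c, -s], [s, c]])       # rotate by -theta: edge -> x-axis
        area = np.ptp(rot[:, 0]) * np.ptp(rot[:, 1])
        if best is None or area < best[0]:
            best = (area, rot)
    rot = best[1]
    norm = (rot - rot.min(axis=0)) / np.ptp(rot, axis=0)   # scale to [0, 1]
    return np.floor(norm * (size - 1)).astype(int)   # pixel coordinates per feature
```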
Two or more features may fall at the same position when computing the coordinate representation of the feature vector components. DeepInsight18 took the average of the multiple feature values to fill such a pixel. If the positions of some features in the projection space are close enough to be represented by the same pixel coordinates, we believe these features can be stacked during image expression to achieve feature fusion. The results are shown in Fig. 14.

Fig. 14 Stacking effect of close-range features. (a) Fill results without the stacking effect. (b) Fill results with the stacking effect. (c) 3D grayscale distribution of (a). (d) 3D grayscale distribution of (b).

Figure 14a,b show feature images generated by filling the corresponding pixels with feature values. Most of each image is black background, corresponding to positions without feature filling; only a small number of pixels are filled with feature values, which appear as highlights of different gray levels. Because the readability of the feature images is limited by the number of features and gray levels, Fig. 14c,d show the corresponding three-dimensional grayscale distributions. Comparing the gray distributions at three positions (1, 2 and 3) in Fig. 14c,d shows that summing the values of multiple feature components that share the same coordinates (the stacking effect) makes those positions more prominent: their gray values are larger and visually more significant.

Color representation of the projected image

The images in Fig. 14 are grayscale, whereas existing deep learning networks perform particularly well on color image classification. Our CACIEMDF projects the feature vector into a three-channel projection space to obtain a color image expression, which exploits the recognition ability of deep networks. The value of each 1-D feature component and its divergences to the two categories are used to fill the three channels. Our method calculates the mean and standard deviation of each feature component over the positive and negative samples, respectively, and then computes the degree of dispersion as follows. Here \(f_n\) is the value of the \(n\)th feature, whose projection falls at image coordinates \((i,j)\); in \(img_{ij}^{s}\), \(s\) takes the values r, g and b, denoting the three channels; \(\mu _{n}^{+}\) and \(std_{n}^{+}\) are the mean and standard deviation of the \(n\)th feature over all normal samples, and \(\mu _{n}^{-}\) and \(std_{n}^{-}\) are the corresponding statistics over all abnormal samples.$$\begin{aligned} img_{ij}^r=f_n \end{aligned}$$
(26)
$$\begin{aligned} img_{ij}^g=\frac{f_n-\mu _{n}^{+}}{std_n^+} \end{aligned}$$
(27)
$$\begin{aligned} img_{ij}^b=\frac{f_n-\mu _{n}^{-}}{std_n^-} \end{aligned}$$
(28)
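A sketch of the three-channel filling (Eqs. 26–28), including the stacking of co-located features; array names and the grid size are our own:

```python
import numpy as np

def fill_color_image(feat, coords, mu_pos, std_pos, mu_neg, std_neg, size=64):
    """R: raw feature value; G: divergence to normal-class statistics;
    B: divergence to abnormal-class statistics. Features landing on the
    same pixel are summed (stacking effect)."""
    img = np.zeros((size, size, 3))
    for n, (i, j) in enumerate(coords):
        img[i, j, 0] += feat[n]                              # Eq. (26)
        img[i, j, 1] += (feat[n] - mu_pos[n]) / std_pos[n]   # Eq. (27)
        img[i, j, 2] += (feat[n] - mu_neg[n]) / std_neg[n]   # Eq. (28)
    return img
```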
The color representation of Fig. 14b is shown in Fig. 15.

Fig. 15 Color representation of projected images.

Comparing Figs. 15 and 14b shows that the color expression of the feature image is realized by filling the three channels with values differentiated by the category dispersion of the designed features. The filled pixels in Fig. 15 appear in blue and red; since different samples have different feature dispersions, the colors of their feature points also differ.

Because RGB images have only three channels, the above filling method is not directly suitable for multi-classification tasks, and our approach requires some changes there. First, the mean and standard deviation of the feature values are calculated for the samples of each class and then reduced to two dimensions using PCA. The rest of the process is consistent with the binary classification filling method, as shown in Equations (29) to (31).$$\begin{aligned} img_{ij}^r=f_n \end{aligned}$$
(29)
$$\begin{aligned} img_{ij}^g=\frac{f_n-\mu _{n}^{g}}{std_n^g} \end{aligned}$$
(30)
$$\begin{aligned} img_{ij}^b=\frac{f_n-\mu _{n}^{b}}{std_n^b} \end{aligned}$$
(31)
The feature means of the five categories are denoted \(\{ \mu _n^k, k = 0, 1, \ldots , 4 \}\), and the standard deviations \(\{ \text {std}_n^k, k = 0, 1, \ldots , 4 \}\). PCA dimensionality reduction is applied to the mean set and the standard deviation set separately to obtain \(\{ \mu _n^g, \mu _n^b \}\) and \(\{ \text {std}_n^g, \text {std}_n^b \}\). In this way, a color image representation can be realized for any number of categories.

Continuous expression of the feature projection realized by using the Gaussian extension method

Although Fig. 15 shows the color expression, the image still consists of discrete points, making it quite different from a natural image. Therefore, our method uses Gaussian kernel mapping to project the features into the image and superimpose them, so that the distribution of features in the image is no longer a set of discrete points. The continuous expression of Fig. 15 is shown in Fig. 16.

Fig. 16 Continuous expression of feature projection.

As the figure shows, the Gaussian extension makes the image content more continuous.
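A sketch of this Gaussian extension; the kernel width sigma is an illustrative assumption:

```python
import numpy as np

def gaussian_render(img, sigma=2.0):
    """Spread each filled pixel with a Gaussian kernel and superimpose the
    contributions, so the image no longer consists of isolated points."""
    size = img.shape[0]
    yy, xx = np.mgrid[0:size, 0:size]
    out = np.zeros_like(img)
    for i, j in zip(*np.nonzero(img.any(axis=2))):       # every filled pixel
        k = np.exp(-((yy - i) ** 2 + (xx - j) ** 2) / (2 * sigma ** 2))
        out += k[..., None] * img[i, j]                  # spread and superimpose
    return out
```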
