A blended framework for audio spoof detection with sequential models and bags of auditory bites

This research combines features extracted from raw audio files with time-frequency spectrograms that are interpolated to a lower-dimensional space using sequential models. This feature set of varied sizes is further regularised by clustering, using the feature-standardising procedure named Bags of Auditory Bites (BoAB). The transformed feature space is then mapped to predictions using a machine learning algorithm to classify between bonafide and spoofed speech. The proposed architecture was trained and tested on multiple public datasets to validate the effectiveness of the model.

Dataset

We chose the Logical Access (LA) and DeepFake (DF) partitions of the well-known ASVspoof 2021 challenge datasets15. The ASVspoof 2021 challenge is a community-led initiative to promote awareness of spoofing and to develop countermeasures. It involves three different tasks aimed at creating tools to distinguish genuine speech from spoofed speech, utilizing the largest open-access dataset available for evaluating anti-spoofing countermeasures. The LA evaluation data consists of a collection of genuine and synthetic utterances transmitted across Voice over Internet Protocol (VoIP) and the Public Switched Telephone Network (PSTN), which can be subject to encoding and transmission artifacts. These artifacts may be present in the recordings and can affect a system's ability to detect spoofing. The DF detection task is designed to identify artificial audio generated by combining natural and artificially created utterances from Text-to-Speech and Voice Conversion technologies. It is comparable to the LA task, although it does not include Speaker Verification during the generation process. The DF evaluation data is a mix of real and falsified audio recordings that have passed through a range of popularly used media storage formats. The ASVspoof 2019 LA assessment data and other sources have been used to create the 2021 DF evaluation data, which includes spoofing attacks generated by spoofing algorithms. The database contains a total of 181,567 recordings in the LA partition and 611,830 recordings in the DF detection partition, including genuine recordings from 1,066 speakers. The logical and deepfake attacks include 13 synthetic-speech or voice-conversion attacks, 2 microphone types (far-field and close-talking), 3 recording environments (studio, home, and telephone), and 8 attack scenarios (3 speaker-dependent, 2 speaker-independent, and 3 multi-speaker). Additionally, there are 4 difficulty levels (Easy, Medium, Hard, and Extreme) and 4 types of data (development, evaluation, blind, and unseen).

The total data was highly imbalanced, with only 11.31% belonging to the LA spoofed set and 3.75% belonging to the DF spoofed set. We therefore balanced the dataset to ensure uniformity in training, selecting 18,452 audio waveforms each from the bonafide and spoofed categories of the LA data, and 5,535 waveforms each from the bonafide and spoofed classes of the DF data.

Audio spoof detection framework

Figure 2: Overall architecture of the proposed Automated Spoof Detection System.

Figure 2 illustrates the overall framework of the proposed automated spoof detection system. We propose a novel technique for processing audio features of varied lengths m. The one-dimensional features \(m\in R^1\) and two-dimensional features \(m\times n \in R^2\) extracted from raw features and spectrograms are processed using the Bag of Auditory Bites (BoAB) algorithm.
The BoAB model, similar to the Bag of Words (BoW) model in Natural Language Processing, is created to balance the number of features for classification. The one-dimensional feature arrays of Zero Crossing Rate (ZCR), Spectral Centroids and Bandwidths of each audio sample are clustered into different auditory bins using the BoAB algorithm. The Mel-Frequency Cepstral Coefficient (MFCC), Constant-Q Cepstral Coefficient (CQCC) and Chromagram features are two-dimensional. Hence, each time series is processed separately using a deep sequential Bi-LSTM model in a many-to-one scheme that maps the set of coefficients at each time point to a single value38,39. The sequential model is trained to produce a one-dimensional array \(m\in R^1\) of values (representing the raw signal strengths) corresponding to each time point. These are then clustered into several bags of auditory bites, and the counts across the bags form a fixed \(K\times 1\) feature set for the entire audio sequence. The BoAB features of the six characteristics are fused into \(6K\times 1\) features and then learned by an Extreme Learning Machine (ELM) model that predicts the confidence of genuineness.

Feature extraction from speech signals

To build a powerful voice anti-spoofing system, a strong description must be formulated that captures both the artefacts of cloning algorithms and the dynamic features of the vocal tract of a human speaker in real audio. Audio feature extraction, a digital and analogue signal conversion process, is required to eliminate unwanted noise and balance out the time-frequency ranges. Spectrograms in the frequency domain are used to extract features from the audio signals40. A spectrogram is a visual graph that displays how the frequencies of a signal change over time. It comprises two axes, one for time and one for frequency, with the intensity or colour of each point representing the amplitude of a frequency at a given time. It is generated by dividing the audio data into brief intervals and computing the Discrete Fourier Transform (DFT) for each segment, determining the magnitude of the frequency spectrum. These magnitude spectra, illustrating the amplitudes of various frequency components, are subsequently organized over time to construct the two-dimensional spectrogram. From this spectrogram, features relevant to distinguishing human from machine-generated vocal tracts are extracted.

$$\begin{aligned} ZCR_n&= \sum _m|sign(s_m) - sign(s_{m-1})|\,win(n-m) \end{aligned}$$
(1)
$$\begin{aligned} SC_n&= \frac{\sum _m freq_{mn} s_m}{\sum _m s_m} \end{aligned}$$
(2)
$$\begin{aligned} SB_n&= \sqrt{\sum _m s_m(freq_{mn} - SC_n)^2} \end{aligned}$$
(3)
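For concreteness, these three per-frame descriptors can be computed with standard audio tooling. The following is a minimal sketch using librosa; the file path and default frame settings are illustrative, and librosa's implementations differ from Eqs. (1)–(3) in normalisation details while yielding the same per-frame quantities.

```python
import librosa

# Load a mono waveform at its native sampling rate (path is illustrative).
y, sr = librosa.load("LA_E_1000001.flac", sr=None)

# Each descriptor yields one value per analysis frame, i.e. shape (1, n_frames).
zcr = librosa.feature.zero_crossing_rate(y)                   # Eq. (1): sign-change rate per frame
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)      # Eq. (2): magnitude-weighted mean frequency
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)    # Eq. (3): spread of frequencies around the centroid

print(zcr.shape, centroid.shape, bandwidth.shape)
```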
Zero Crossing Rate (ZCR) measures the frequency of sign changes in the signal, producing one ZCR value per frame and resulting in a feature vector with the same length as the number of frames. In Eq. 1, \(ZCR_n\) measures the number of times a signal crosses the zero line during a given time frame n within window width \(\text {win}\). The signum function sign() produces \(-1\) if the signal strength \(s_m\) is less than 0, and \(+1\) if it is greater than 0. It is primarily employed for assessing human speech signals, effectively distinguishing speech from background noise, as speech usually has a higher ZCR than noise, and is a potent indicator for detecting audio manipulations.

Spectral Centroid (SC) calculates the weighted mean of frequencies in each frame, resulting in a feature vector with one SC value per frame. Calculated using Eq. 2, it identifies unique voice characteristics and audio spoofing by measuring the weighted mean of frequencies. Here, \(freq_{mn}\) is the frequency of bin m at frame n, weighted by the magnitude \(s_m\) of the Fourier transform at that bin. A spurious rise in the centroid may occur at the start of the signal due to undefined silence in the audio sequence; this can impact the authentication process by leading to false positives or negatives, increasing feature variability, and complicating feature extraction. Techniques such as trimming the initial silence or applying threshold-based techniques to ignore low-energy segments may attenuate these effects.

Spectral Bandwidth (SB) assesses the spread of frequencies around the centroid, producing one SB value per frame and creating a feature vector of similar length. The bandwidth of a signal is the range of frequencies in which it oscillates. It is determined by calculating the average distances between the upper and lower frequencies in a continuous range with respect to the spectral centroid. The values in the range are weighted and averaged by the signal strength with respect to the spectral centroid \(SC_n\) (Eq. 3). Analyzing spectral bandwidth helps identify discrepancies, modifications, or manipulations in the frequency spectrum, indicating the presence of audio spoofing or background-noise concealment.

Figure 3: Flow of extracting Chromagrams, MFCCs, and CQCCs.

The Chromagram, MFCC, and CQCC are considered for extracting two-dimensional features, as shown in Fig. 3. Chroma-based features, also known as pitch class profiles, measure the frequency of sound, allowing sounds to be categorized by pitch and enabling users to interpret the tonal quality of the signal. The Chromagram maps frequencies to 12 pitch classes, providing a 12-dimensional feature vector per frame and leading to a matrix with dimensions \(12\times (number\; of\; frames)\), which is particularly useful for detecting machine-generated and manipulated audio. Their static pitch differences are used for high-level semantic analysis.

Furthermore, Mel-Frequency Cepstral Coefficients (MFCCs) are the most commonly used feature coefficients in spoof detection. This approach involves comparing pitches on the Mel scale, designed to reflect how the human auditory system perceives sound. To calculate the MFCCs, an FFT (Fast Fourier Transform) or DFT (Discrete Fourier Transform) is applied to the speech signal, which creates an audio spectrum. This spectrum is then passed through a filter bank (typically triangular or Gaussian filters) to convert it to the Mel scale. The spectrum is then taken through a logarithmic transformation and a Discrete Cosine Transform (DCT) to generate the MFCCs. MFCC extraction typically results in 13 coefficients per frame, creating a matrix with dimensions \(13\times (number \;of \;frames)\).
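A compact sketch of the two-dimensional descriptors follows. The chromagram and MFCCs come directly from librosa; the CQCC-like block is a simplified stand-in (CQT magnitude, log power, DCT) that omits the uniform re-sampling step of the full CQCC pipeline described below, and the file path is illustrative.

```python
import numpy as np
import librosa
from scipy.fft import dct

y, sr = librosa.load("LA_E_1000001.flac", sr=None)      # illustrative path

chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # shape (12, n_frames): pitch-class profile
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape (13, n_frames)

# Simplified CQCC-style features: CQT -> log power -> DCT (uniform re-sampling omitted).
cqt_power = np.abs(librosa.cqt(y, sr=sr)) ** 2
cqcc_like = dct(np.log(cqt_power + 1e-10), axis=0, norm="ortho")[:13]   # (13, n_frames)
```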
The Constant-Q Transform (CQT) generates Constant-Q Cepstral Coefficients (CQCCs), which provide higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies. The extraction of these features starts with the application of the CQT, which transforms the time domain into the frequency domain, followed by taking the power of the spectrum and a logarithmic operation, uniform re-sampling to convert the geometrically spaced CQT bins to linearly spaced bins, and finally, the application of the DCT. They generally produce 13 coefficients per frame, forming a matrix with dimensions \(13\times (number\ of\ frames)\). These features are extensively used for LA and PA spoof detection in ASV systems and perform better than other extracted features.

Significance learning from sequential model

Long Short-Term Memory (LSTM) neural architectures are sequential models for learning and forecasting time-series data of varied lengths38. The models capture long-term, context-sensitive dependencies. Bidirectional LSTMs (Bi-LSTMs) utilise both past and future context to predict the next value in a sequence accurately. This model can learn the sequence patterns and use this knowledge to make reliable predictions. We used a Bi-LSTM to take in the cepstral coefficients at each timestamp t and predict the signal strength at the next timestamp \(t+1\). Signal strengths refer to the amplitude of the sampled, digitised audio sequence. The model is trained with input-output pairs constituting the cepstral coefficients at each time point t throughout the dataset and their corresponding raw signal strengths. The sequential model is designed with an embedding layer that projects the input at each time step to a vector of length 128, which is then passed through a Bidirectional LSTM of 64 units. The model is regularized with a dropout of 50%. The network was trained with a sigmoid activation function to predict a single value for every set of coefficients at each time point, signifying the relevance of the coefficients. The significance of predictions is directly related to their accuracy, with more accurate predictions holding higher significance and less accurate predictions indicating lesser significance. To maintain the integrity of both classes, the network was trained exclusively on the bonafide class, enabling us to assess the reproducibility of both classes. Additionally, this process facilitated the reduction of dimensionality from two dimensions to a single array of values, which represents the reproduced signal. Through this sequential model, new values are generated for each amplitude of the original signal, with a closer resemblance indicating more significant cepstral coefficients and a greater disparity indicating less significant coefficients.
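The following Keras sketch is one possible reading of this architecture; where the text is not explicit, the details are assumptions (a dense projection stands in for the 128-dimensional embedding, and the network returns one sigmoid value per frame).

```python
from tensorflow.keras import layers, models

N_COEFFS = 13  # cepstral coefficients per frame (e.g. MFCC or CQCC)

# Minimal sketch of the significance model; layer choices beyond the stated sizes
# (128-d projection, 64-unit Bi-LSTM, 50% dropout, sigmoid output) are assumptions.
model = models.Sequential([
    layers.Input(shape=(None, N_COEFFS)),                 # variable-length coefficient sequence
    layers.TimeDistributed(layers.Dense(128)),            # project each frame to a 128-d vector
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.5),                                  # 50% dropout regularisation
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),  # one value per frame
])
model.compile(optimizer="adam", loss="mse")

# Training pairs: coefficient sequences from bonafide audio only; targets are the
# normalised raw signal amplitudes at the corresponding frame times.
# X: (batch, frames, N_COEFFS), y: (batch, frames, 1)
```

Because the network is fitted only on bonafide speech, large gaps between its reproduced amplitudes and the true signal at test time indicate coefficients that deviate from genuine speech.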
Bag of Auditory Bites (BoAB)

Machine learning algorithms require feature attributes to be of the same dimension. Because audio signals vary in length, the extracted features also vary in length. The BoAB technique was designed to regularise the number of features to a defined set of values. It is built around the K-Means algorithm, which compresses and bounds the dimensionality to a finite count. BoAB employs the training set to acquire a vocabulary of auditory words, which are then transformed into codewords. Algorithm 1 describes the two-stage procedure of BoAB to regularise the number of features.

Algorithm 1

Firstly, the dictionary is created using k-means clustering to group similar auditory bites into k distinct clusters, with cluster centroids \(u_k\) representing the base codewords. Initially, k random datapoints are taken as the cluster centroids (codewords). We then compute the Euclidean distances between each datapoint \(m_j\) and the cluster centroids \(u_k\), and assign the \(j^{th}\) vector to the \(i^{th}\) bag with the minimal distance. Next, new centroids are calculated as the mean of all datapoints assigned to the cluster \(w_k\). U and V represent the cluster centroids of adjacent iterations; the procedure is repeated until U and V are equal. These k codewords in the dictionary form the new feature set. The second step is to create feature values for each training waveform by building histograms over the k codewords. \(c_{ky}\) is the number of features of the \(y^{th}\) datapoint clustered into the \(k^{th}\) bag, while \(c_K\) is the number of features of all datapoints in the dataset clustered into the entire K bins. For optimal performance, the data points are further standardized to \(C|_K\) by subtracting the mean and dividing by the standard deviation, resulting in a mean of 0 and a standard deviation of 1. The frequency of the auditory bites across the timestamps constitutes the feature values of the k-element feature set. The value of k is varied from 100 to 300 and tested with different machine learning algorithms.
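A minimal sketch of the two-stage procedure, using scikit-learn's KMeans, is shown below; the function and variable names are illustrative, and the standardisation is shown per histogram for brevity (the normalisation axis used in the study may differ).

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_boab_vocabulary(train_feature_arrays, k=200, seed=0):
    """Stage 1: learn k codewords (cluster centroids) from all training auditory bites."""
    all_values = np.concatenate([np.ravel(a) for a in train_feature_arrays]).reshape(-1, 1)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(all_values)

def boab_features(feature_values, vocabulary):
    """Stage 2: encode one waveform as a standardised histogram over the k codewords."""
    assignments = vocabulary.predict(np.ravel(feature_values).reshape(-1, 1))
    counts = np.bincount(assignments, minlength=vocabulary.n_clusters).astype(float)
    return (counts - counts.mean()) / (counts.std() + 1e-8)   # zero mean, unit standard deviation

# Usage: one K-length BoAB vector per descriptor; the six descriptors' vectors are
# concatenated into the 6K-dimensional input of the classifier.
# vocab = fit_boab_vocabulary([zcr_a, zcr_b, ...], k=200)
# x_zcr = boab_features(zcr_new, vocab)
```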
Learning models and evaluation

Extreme Learning Machines (ELMs)41 represent a class of artificial neural networks known for their high performance and robustness in classification tasks. Unlike traditional iterative training procedures, ELMs accelerate training by updating the network weights only once during the learning process. We trained and tested the BoAB features on an Extreme Learning Machine (ELM) to detect whether the incoming features were genuine or spoofed. ELMs are powerful tools for classification due to their non-linear mapping abilities and quick training times with strong generalization capabilities. They consist of a single hidden layer that captures the input data patterns simultaneously. ELMs are universal approximators with the ability to accurately mimic any continuous function, especially when they have a large number of hidden neurons in a single-hidden-layer structure. This study designed the ELM with 1000 hidden neurons to maintain parity and sufficiency with the number of input features. The two output neurons were coupled to a softmax activation function, which produces probability scores indicating the confidence with which the input stream belongs to either class. Additionally, various settings for k-Nearest Neighbour (kNN), Support Vector Machine (SVM), and Random Forest (RF) were tested for comparison.

Figure 4: Waveforms and the several descriptors extracted from the bonafide and spoofed audio of speaker LA_0046.

All models were evaluated in terms of accuracy, precision, and recall computed from the confusion matrices. Each row of the confusion matrix corresponds to an actual class, and each column corresponds to a predicted class. The accuracy of a model is the frequency with which it correctly assigns a sample to its representative class42. Precision is the proportion of correctly identified positive outcomes among all predicted positive outcomes, whereas recall is the ratio of correctly identified positive outcomes to all actual positive outcomes. The accuracy of a biometric system is measured using the Equal Error Rate (EER). It is the rate at which the two types of error (False Positives and False Negatives) occur equally, and it is determined by plotting the False Negative Rate (FNR) and the False Positive Rate (FPR) on the same graph; the point at which the two curves intersect is the EER, expressed as a percentage. A lower percentage denotes a more precise system.
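As an illustration of the classifier, the sketch below follows the common ELM formulation (random sigmoid hidden layer, output weights solved in closed form, softmax applied to the two outputs); the exact formulation used in the study may differ.

```python
import numpy as np

class SimpleELM:
    """Minimal ELM sketch: random hidden layer, single closed-form weight update."""

    def __init__(self, n_hidden=1000, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))     # sigmoid hidden activations

    def fit(self, X, y):
        # X: (n_samples, 6K) fused BoAB features; y: integer labels {0: spoof, 1: bonafide}
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.eye(2)[y]                                    # one-hot targets for the two classes
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets    # output weights, updated only once
        return self

    def predict_proba(self, X):
        logits = self._hidden(X) @ self.beta
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)               # softmax confidence scores
```

A small helper can then estimate the EER by locating the crossing of the FPR and FNR curves computed from the bonafide-confidence scores; this sketch uses scikit-learn's ROC curve, with the threshold grid determined by the scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, bonafide_scores):
    """EER: the error rate at which false positives and false negatives are equal."""
    fpr, tpr, _ = roc_curve(labels, bonafide_scores)   # labels: 1 = bonafide, 0 = spoof
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))              # closest crossing of the two error curves
    return 100.0 * (fpr[idx] + fnr[idx]) / 2.0         # EER as a percentage
```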
