Enhancing gait recognition by multimodal fusion of MobileNetV1 and Xception features via PCA for OaA-SVM classification

The proposed method employs a deep learning-based approach to identify humans. Figure 1 depicts the flow of the approach, which comprises extracting frames from videos, preprocessing the extracted frames, using transfer learning for feature extraction, concatenating the extracted features, reducing their dimensionality, and finally classifying the frames. These steps are discussed in detail in the following subsections.

Figure 1. Architecture of the proposed methodology.

Dataset
CASIA-B16, a large multiview gait database, was developed in January 2005. It covers the gait data of 124 subjects recorded from 11 viewpoints. The dataset is well known in gait research and accounts for the angle from which a person walks, the clothes a person wears, and the object a person carries. In addition to the raw video data, it provides individual silhouettes extracted from the video sequences. The data contain three walking variations: walking with a bag, walking in a coat, and normal walking. Each variation is captured from eleven distinct viewing angles between 0° and 180°.

In our research, frames with an original size of 240 × 320 were extracted from the videos and resized to 224 × 224. As shown in Fig. 2, the dataset was split 50–50 for training and testing using a holdout validation technique; this split was chosen because of the structure of the dataset. Anglewise analysis was performed for all 11 angles in the dataset. As shown in Fig. 3, color frames were used for the research, and empty frames were discarded during extraction. Table 2 provides details of the subset of the CASIA-B dataset16 used in this work.

Figure 2. Directory structure of the dataset after splitting.
Figure 3. Sample frames extracted from the CASIA-B database.
Table 2. Subset of the CASIA-B dataset used in this research.

Normalization
The first stage of our pipeline is the normalization of pixel values, a technique commonly used in data preprocessing. Normalization rescales a variable's values to a common scale so that they can be compared directly. Pixel normalization is an important preprocessing step in image processing that guarantees that pixel values fall within a given range, ensuring that the model receives inputs with a consistent scale and distribution. In an image, pixel values can range from 0 to 255, where 0 means total darkness and 255 means total brightness. We normalize these pixel values to lie between 0 and 1, which helps avoid numerical overflow and underflow and simplifies processing. Normalizing the image also improves the contrast between pixels, making details that were hard to see in the original image more visible. Normalization can be expressed mathematically as
$${x}_{normalized}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$
(1)
where \(x\) is the original pixel value, \({x}_{min}\) is the image's minimum pixel value, and \({x}_{max}\) is the image's maximum pixel value. The resulting \({x}_{normalized}\) value lies in the \(\left[0,1\right]\) range.

The rescale function commonly provided by image processing libraries such as OpenCV, scikit-image, or Keras can be used for this normalization. In this study, the ImageDataGenerator class from Keras is used to standardize the pixel values: its rescale parameter is set to 1.0/255.0, which scales the pixel values down to the range between 0 and 1. Normalization is an essential preprocessing step because it enables the model to process the input data efficiently, improves the precision of the model, and reduces numerical inaccuracies caused by fluctuations in pixel values.
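A minimal sketch of this rescaling step, assuming TensorFlow/Keras, is given below. The directory path is an illustrative placeholder; the batch size of 32 matches the setting reported later in this section.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1], i.e., Eq. (1) with
# x_min = 0 and x_max = 255. "dataset/train" is a hypothetical directory
# of extracted frames, one subfolder per subject.
datagen = ImageDataGenerator(rescale=1.0 / 255.0)
train_gen = datagen.flow_from_directory(
    "dataset/train",
    target_size=(224, 224),   # frames resized as described above
    batch_size=32,
    class_mode="categorical",
)
```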
Feature extraction
The next step is to use pretrained convolutional neural network (CNN) models to extract features from these images. CNNs automatically learn and extract features from raw pixel data, which makes them well suited to feature extraction tasks in image processing. Traditional fully connected neural networks require considerable computational power and memory to train because every neuron in one layer is linked to every neuron in the previous layer. CNNs, in contrast, use convolutional layers, which reduce the number of parameters and computations required for training by connecting each neuron in the current layer only to a small receptive field in the previous layer. The next subsection introduces CNNs and their main layers.

Convolutional neural network
A convolutional neural network (ConvNet) is a type of neural network that is particularly well suited to image processing tasks. A typical CNN consists of an input layer, several convolutional layers, pooling layers, and finally a fully connected layer.

Convolutional layer
The convolution process slides a kernel (also called a filter or pattern detector) across the input image, multiplies the kernel by the underlying input pixels, and sums the results. For a convolutional layer, the output feature map \(O\) is obtained by applying a convolution operation \(\star \) between the input image \(I\) and the filter \(F\), followed by an optional bias term \(b\) and an activation function \(\sigma \):
$$O=\sigma \left(I\star F+b\right)$$
(2)
The convolution operation \(\star \) is defined by sliding the filter over the input image and, at each position \(\left(m,n\right)\), computing the elementwise product and sum:
$${\left(I\star F\right)}_{m,n}={\sum }_{i=1}^{{F}_{h}}{\sum }_{j=1}^{{F}_{w}}{\sum }_{k=1}^{C}{I}_{m+i-1,n+j-1,k}{F}_{i,j,k}$$
(3)
where \({F}_{h}\) and \({F}_{w}\) are the height and width of the filter, respectively, and \(C\) is the number of channels in the input image. The height and width of the resulting feature map are given by:
$${O}_{h}=\frac{{I}_{h}+2{P}_{h}-{F}_{h}}{{S}_{h}}+1$$
(4)
$${O}_{w}=\frac{{I}_{w}+2{P}_{w}-{F}_{w}}{{S}_{w}}+1$$
(5)
where \({I}_{h}\) and \({I}_{w}\) are the height and width of the input image, respectively; \({P}_{h}\) and \({P}_{w}\) are the padding values along each dimension; and \({S}_{h}\) and \({S}_{w}\) are the stride values along each dimension.
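As a quick illustration of Eqs. (2), (4) and (5), the following sketch (assuming TensorFlow/Keras, not code from the paper) applies a 3 × 3 convolution with stride 1 and "same" padding to a 224 × 224 × 3 input; the spatial size is preserved because (224 + 2·1 − 3)/1 + 1 = 224.

```python
import tensorflow as tf

# A 3x3 filter, stride 1, "same" padding (P_h = P_w = 1) keeps the
# 224x224 spatial size; 32 filters give 32 output feature maps.
x = tf.random.uniform((1, 224, 224, 3))
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1,
                              padding="same", activation="relu")
print(conv(x).shape)  # (1, 224, 224, 32)
```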
Pooling layer
The pooling operation reduces the spatial dimensions of the feature maps and introduces some degree of translation invariance into the output. For a pooling layer, the output feature map \(O\) is obtained by applying a pooling function \(\mathcal{P}\) over nonoverlapping regions of size \({F}_{h}\times {F}_{w}\) in the input feature map \(I\):
$${O}_{m,n}=\mathcal{P}\left({I}_{m{S}_{h}:m{S}_{h}+{F}_{h},n{S}_{w}:n{S}_{w}+{F}_{w}}\right)$$
(6)
where \(\mathcal{P}\) can be either max pooling or average pooling:
$${\mathcal{P}}_{\text{max}}\left(X\right)=\text{max}\left(X\right)$$
(7)
$${\mathcal{P}}_{\text{avg}}\left(X\right)=\frac{1}{\left|X\right|}\sum \left(X\right)$$
(8)
The height and width of the output feature map are given by:$${O}_{h}=\frac{{I}_{h}-{F}_{h}}{{S}_{h}}+1$$
(9)
$${O}_{w}=\frac{{I}_{w}-{F}_{w}}{{S}_{w}}+1$$
(10)
where \({I}_{h}\) and \({I}_{w}\) are the dimensions of the input feature map, and \({S}_{h}\) and \({S}_{w}\) are the stride values along each dimension.
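The following sketch (again assuming TensorFlow/Keras, purely illustrative) mirrors Eqs. (6), (9) and (10): 2 × 2 max pooling with stride 2 halves each spatial dimension, since (112 − 2)/2 + 1 = 56.

```python
import tensorflow as tf

# 2x2 max pooling with stride 2 over a 112x112 feature map with 64 channels.
x = tf.random.uniform((1, 112, 112, 64))
pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
print(pool(x).shape)  # (1, 56, 56, 64)
```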
Batch normalization layer
Batch normalization (BN) normalizes the inputs to a neural network layer by subtracting the batch mean and dividing by the batch standard deviation. The output of this layer is calculated as follows:
$${\widehat{F}}_{k}=\frac{{F}_{k}-{\mu }_{k}}{\sqrt{{\sigma }_{k}^{2}+\epsilon }}\cdot {\gamma }_{k}+{\beta }_{k}$$
(11)
where \({F}_{k}\) is the activation of feature map \(k\), \({\widehat{F}}_{k}\) is its normalized counterpart, \({\mu }_{k}\) and \({\sigma }_{k}\) are the mean and standard deviation of \({F}_{k}\) over the batch, \(\epsilon \) is a small constant that prevents division by zero, and \({\gamma }_{k}\) and \({\beta }_{k}\) are learned scaling and shifting parameters.
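A minimal illustration of Eq. (11) with Keras (not taken from the paper's code) is shown below; BatchNormalization normalizes each feature map over the batch and applies the learned scale (gamma) and shift (beta).

```python
import tensorflow as tf

# Batch of 8 feature maps of shape 56x56x64; batch statistics are used
# because training=True is passed.
x = tf.random.uniform((8, 56, 56, 64))
bn = tf.keras.layers.BatchNormalization(epsilon=1e-3)
y = bn(x, training=True)
print(y.shape)  # (8, 56, 56, 64)
```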
Fully connected (FC) layer
In a fully connected layer, each neuron is connected to every neuron in the adjacent layer. The output vector \(O\) is obtained by applying a linear transformation to the input vector \(I\), followed by an optional bias term \(b\) and an activation function \(\sigma \):
$$O=\sigma \left(WI+b\right)$$
(12)
where \(W\) is a weight matrix of shape \(\left({O}_{d},{I}_{d}\right)\), \({O}_{d}\) is the output dimension, and \({I}_{d}\) is the input dimension.
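For completeness, a one-line Keras sketch of Eq. (12); the input and output dimensions here are arbitrary examples, not values from the paper.

```python
import tensorflow as tf

# Dense layer computing O = sigma(W I + b) for a 2048-dimensional input.
x = tf.random.uniform((1, 2048))
fc = tf.keras.layers.Dense(256, activation="relu")
print(fc(x).shape)  # (1, 256)
```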
Pretrained models
In machine learning and deep learning, transfer learning refers to acquiring knowledge while solving one problem and applying it to another, similar problem. The idea is to use a model trained on a large, complex dataset as a base for a related problem with a smaller dataset. By using pretrained models, we save the time and computational resources needed to train a model from scratch; instead, the features already learned by the model are reused to extract features for the task at hand. This approach is especially helpful when data or computational resources are limited.

MobileNetV128 and Xception29 are two pretrained CNN models trained on the ImageNet dataset. MobileNetV1 is a lightweight CNN designed for mobile devices, with fewer parameters than other CNN models. Xception, on the other hand, is a more complex CNN that has shown good performance in various image recognition tasks.

In this research, pretrained MobileNet and Xception models are used with their fully connected layers excluded. This allows features to be extracted from the intermediate layers of the CNN models instead of using the networks for classification. Table 3 gives the architecture of the MobileNetV1 model used in this research.

Table 3. Architecture of the MobileNetV1 model.

The Xception model is then extended with a global average pooling layer. Global average pooling computes the average of each feature map in the last convolutional layer, compressing the spatial dimensions of the feature maps and yielding a vector that encapsulates the key characteristics of the original image. The output of this global average pooling layer is taken as the Xception feature representation. Table 4 gives the architecture of the Xception model, including its layers, their output sizes, and the number of filters.

Table 4. Architecture of the Xception model.

New models for MobileNet and Xception are created by specifying their input and output layers. All layers of the MobileNet and Xception models are set to be nontrainable by fixing the "trainable" parameter to "False", which guarantees that the pretrained weights are not modified during feature extraction. In our feature extraction process, the global pooling layers of the MobileNet and Xception models consistently produced outputs of shape (1024) and (2048), respectively, across all viewing angles. This uniformity is intentional and follows from the chosen architectures: the pooling layers condense the feature maps into fixed-length vectors, ensuring consistent input dimensions for the subsequent stages of the gait recognition pipeline and contributing to the stability and generalization capability of the overall system.

The feature vectors are then obtained from the input images via the MobileNet and Xception models. The "predict" method of the Keras library is employed to obtain the features from the intermediate layers of the CNN models; its output is a matrix of feature vectors representing the important characteristics of the input images. Table 5 summarizes the sizes of the features extracted from MobileNetV1 and Xception across the viewing angles, while Fig. 4 shows the numbers of samples across the viewing angles, which are the same for both models. Figure 5 shows the time taken by both models to extract features from their training and testing datasets at each viewing angle.

Table 5. Features extracted from MobileNetV1 and Xception for various viewing angles.
Figure 4. Number of samples for MobileNetV1 and Xception.
Figure 5. Time taken for feature extraction at different angles.

Finally, the feature matrices are reshaped into a 2D array in which every row represents a single image and every column represents a feature. This produces a feature matrix that can be used as input to a machine learning model for training and testing.

Feature fusion
The next step after extracting features from the two pretrained models is to fuse these feature vectors. Feature fusion is a method for combining the characteristics captured by different neural networks or layers; by exploiting the distinct features recorded by each network, it seeks to improve the model's output. In our study, feature fusion is implemented through concatenation to enhance the accuracy of the classification model.

Concatenation combines two or more tensors along a designated axis. The resulting higher-dimensional tensor contains all the information from the input tensors. Concatenation can be applied to tensors of different shapes provided their shapes agree along every axis other than the concatenation axis.

Let the feature set generated by MobileNet have dimensionality \(M\) and the feature set generated by Xception have dimensionality \(N\). Concatenating the two feature sets creates a new feature set with dimensionality \(M+N\). Formally, let \(X\) be the feature set generated by MobileNet with dimensionality \(M\), and let \(Y\) be the feature set generated by Xception with dimensionality \(N\). \(X\) and \(Y\) are concatenated along the second axis (axis 1) to obtain a new feature set \(Z\) with dimensionality \(M+N\):
$$Z=\left[X,Y\right]$$
(13)
where \(\left[X,Y\right]\) denotes the concatenation of \(X\) and \(Y\) along axis 1.
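A condensed sketch of the extraction and fusion steps, assuming TensorFlow/Keras and NumPy, is shown below. The random placeholder batch and variable names are illustrative and not the authors' code; the include_top=False and pooling="avg" arguments correspond to excluding the fully connected layers and appending global average pooling as described above.

```python
import numpy as np
import tensorflow as tf

# Frozen ImageNet backbones used purely as feature extractors: global average
# pooling yields 1024-dimensional (MobileNetV1) and 2048-dimensional (Xception)
# vectors per image.
mobilenet = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
xception = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
mobilenet.trainable = False
xception.trainable = False

# "images" is a placeholder batch of normalized frames, shape (n, 224, 224, 3).
images = np.random.rand(4, 224, 224, 3).astype("float32")
f_mobilenet = mobilenet.predict(images)   # shape (n, 1024)
f_xception = xception.predict(images)     # shape (n, 2048)

# Feature fusion by concatenation along axis 1, Eq. (13): Z = [X, Y].
fused = np.concatenate([f_mobilenet, f_xception], axis=1)  # shape (n, 3072)
print(fused.shape)
```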
The MobileNet and Xception models can extract meaningful features from the input images because they have already been trained on a sizable image dataset. The features extracted by the two models are complementary because the two architectures encode different aspects of the image data, so concatenating them allows each model's distinctive features to enhance the classification model. The feature tensors extracted from MobileNet and Xception have different dimensions, so their shapes must be made compatible before concatenation. In our method, the feature tensors from both models are reshaped into 1D vectors using the reshape function. The size of the postfusion feature vector is 3072; this dimensionality reflects the combined information obtained through fusion and contributes to a comprehensive representation of gait patterns in our model.

Feature selection
The next step after fusing the extracted features is to reduce the dimensionality of the feature vector, and feature selection via principal component analysis (PCA)30 is a common practice. PCA reduces dimensionality while preserving as much of the data's natural variability as possible. Given a dataset \(X\) of \(n\) samples, each with \(p\) features, PCA produces a new set of \(p\) orthogonal features, known as principal components, each of which is a linear combination of the original features. The first principal component captures the largest share of the variance, the second principal component the second largest share, and so forth.

Figure 6 displays the cumulative explained variance of the principal components for our feature vector: the x-axis represents the number of principal components, while the y-axis represents the cumulative explained variance up to that number of components. This graph is valuable for selecting the number of principal components to retain for subsequent analysis.

Figure 6. Cumulative explained variance for different numbers of components.

PCA first computes the covariance matrix \(C\):
$$C=\frac{1}{n}{\left(X-\overline{X }\right)}^{T}\left(X-\overline{X }\right)$$
(14)
where \(\overline{X }\) is the mean of dataset \(X\) and \(n\) is the number of samples in \(X\). The next step is to compute the eigenvectors and eigenvalues of the covariance matrix \(C\). The eigenvectors indicate the directions along which the dataset varies the most, and the corresponding eigenvalues give the amount of variance along those directions. The eigenvectors and eigenvalues satisfy
$$Cv=\lambda v$$
(15)
where \(v\) is an eigenvector and \(\lambda \) is the corresponding eigenvalue. Finally, PCA selects the top \(k\) eigenvectors (those with the largest eigenvalues) and projects the dataset onto them to obtain a lower-dimensional representation:
$${X}_{pca}=X{V}_{k}$$
(16)
where \({V}_{k}\) contains the top \(k\) eigenvectors and \({X}_{pca}\) is the projected dataset.
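An illustrative use of scikit-learn's PCA on the fused vectors is sketched below; the placeholder matrix stands in for the real fused features. Passing a float to n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# "fused_train" is a placeholder matrix of shape (n_samples, 3072); in this
# work, retaining 90% of the variance corresponds to roughly 620 components.
fused_train = np.random.rand(500, 3072)
pca = PCA(n_components=0.9)
reduced_train = pca.fit_transform(fused_train)
print(reduced_train.shape, pca.explained_variance_ratio_.sum())
```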
The feature vector initially had 3072 dimensions; after applying PCA, it was reduced to 620 features. This reduction corresponds to a cumulative explained variance (CEV) of 0.9, indicating that the retained features capture 90% of the variance in the original data. The choice balances dimensionality reduction for computational efficiency against retaining sufficient information for robust gait pattern recognition. PCA was applied primarily to reduce the computational complexity associated with a high-dimensional feature space and to capture the most informative components. Although PCA does not have an inherent regularization effect like some other techniques, the dimensionality reduction indirectly helps mitigate overfitting by focusing on the most salient features.

Classifiers
Our proposed methodology was tested with two different machine learning classifiers: OaA-SVM and random forest. These classifiers are described below.

OaA-SVM
OaA-SVM, the one-against-all support vector machine, is a well-known multiclass classification technique that extends the conventional support vector machine (SVM) algorithm to handle more than two classes31. OaA-SVM trains one binary classifier per class, each intended to separate that class from all the others. During testing, the outputs of the binary classifiers are compared, and the class with the maximum score is chosen as the prediction. Figure 7 shows the architecture of the OaA-SVM used in our study.

Figure 7. Architecture of the OaA-SVM classifier.

OaA-SVM may be trained with a variety of kernels, which transform the data into a higher-dimensional space in which it can be separated more easily. Some of the most commonly used kernels are listed below.
1. Linear kernel: the most basic SVM kernel, which transforms the data linearly. It is appropriate when the data are linearly separable.
$$K\left({x}_{i},{x}_{j}\right)={x}_{i}^{T}{x}_{j}$$
(17)
2. Polynomial kernel: By transforming the original features of the data using a polynomial function, this kernel performs a nonlinear transformation. This approach is useful when the data have nonlinear boundaries.$$K\left({x}_{i},{x}_{j}\right)={\left(1+{x}_{i}^{T}{x}_{j}\right)}^{d}$$
(18)
Here, \(d\) denotes the degree of the polynomial.
3. Radial basis function (RBF) kernel: this kernel maps the data into a higher-dimensional space by applying a Gaussian function to the Euclidean distance between data points. It is useful when the data are not linearly separable and have complex boundaries.
$$K\left({x}_{i},{x}_{j}\right)={e}^{-\gamma {\Vert {x}_{i}-{x}_{j}\Vert }^{2}}$$
(19)
Here, \(\gamma \) is a hyperparameter that controls the width of the Gaussian kernel.
In our investigation, the one-against-all support vector machine (OaA-SVM) was used with the kernels above. A value of 2 was selected for 'C' with the linear kernel; the 'C' value for the RBF kernel was set to 10, and that for the polynomial kernel was set to 5. These choices were made to optimize the model's gait recognition performance, taking into account settings such as the batch size (32), the image size (224 × 224), and the feature dimensionality after PCA (620). The parameters were tuned to balance model complexity and accuracy; a configuration along these lines is sketched after the random forest description below. Section "Results and discussion" provides a detailed analysis of the experiments conducted with this classifier.

Random forest
Random forest (RF) is a widely used machine learning technique applicable to both classification and regression tasks19. RF builds an ensemble of decision trees, each trained on a different sample of the training data and a different subset of the input attributes, which helps the model generalize to new data. The final prediction combines the estimates of all the trees, which reduces the risk of overfitting and improves accuracy. RF handles high-dimensional data and nonlinear relationships between features well, and it also copes with outliers and missing data, making it a versatile tool. Its performance can be tuned through several hyperparameters, such as the number of trees, the maximum depth of each tree, and the number of attributes considered at each split. Owing to its speed and scalability, the random forest algorithm is well suited to processing large datasets.
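A minimal sketch of the two classifiers, assuming scikit-learn, is given below. It is not the authors' exact code: the one-against-all SVM is built from binary SVC classifiers using the kernels and C values reported above, and the random forest is included for comparison. The feature matrix and labels are random placeholders standing in for the PCA-reduced gait features and subject identities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder PCA-reduced features (620-dimensional) and subject labels
# (CASIA-B has 124 subjects).
X_train = np.random.rand(500, 620)
y_train = np.random.randint(0, 124, size=500)

# One-against-all SVMs with the kernels and C values reported above.
oaa_svm_linear = OneVsRestClassifier(SVC(kernel="linear", C=2))
oaa_svm_rbf = OneVsRestClassifier(SVC(kernel="rbf", C=10, gamma="scale"))
oaa_svm_poly = OneVsRestClassifier(SVC(kernel="poly", C=5))

# Random forest baseline; n_estimators is an illustrative choice.
rf = RandomForestClassifier(n_estimators=100, random_state=0)

oaa_svm_rbf.fit(X_train, y_train)
rf.fit(X_train, y_train)
print(oaa_svm_rbf.predict(X_train[:5]), rf.predict(X_train[:5]))
```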
