PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features

Hyper-parameters analysis

The objective of this section is to determine the optimal configuration values for the hyper-parameters employed in the topology of the Deep Neural Network (DNN). The critical hyper-parameters encompass the number of layers and neurons, seed, regularization techniques (L1 and L2), activation function, weight initialization, momentum, dropout, updater, iterations, learning rate, and optimizer, as indicated in Table 1. These parameters significantly influence the performance and behavior of the neural network. For instance, the number of layers and the number of neurons per layer directly affect the network's learning capacity and its ability to fit the training data. The seed is a predefined starting point for initializing or controlling random processes, ensuring reproducibility of computations and experiments. Regularization techniques such as L1 and L2 help prevent over-fitting by adding penalty terms to the loss function. Activation functions introduce non-linearity into each neuron, while weight initialization sets the initial values of the network parameters (weights and biases) before training.

Table 1 Optimal hyper-parameter values of the DNN model

Additionally, momentum enhances the optimization process by incorporating past gradient information to speed up convergence and improve stability during training. Dropout, a regularization technique, randomly drops a fraction of neurons during each training iteration. The updater is responsible for adjusting the model parameters, an iteration represents a single pass through the training data, the learning rate controls the size of the weight updates, and the optimizer is the overarching algorithm that determines how the weights are updated in each iteration. To assess the DNN's performance under various hyper-parameters, a grid search was employed to explore different combinations of parameters. The analysis focused on the hyper-parameters that most strongly influence the performance of the DNN model: the activation function, the learning rate, and the number of iterations.

We first conducted experiments to examine the effects of the activation function and the learning rate (LR). The results are given in Table 2. The table shows that the highest accuracy, 98.71%, is obtained by the DNN classifier at a learning rate of 0.1 with Tanh as the activation function. Accuracy improved as the learning rate was decreased toward 0.1; however, lowering it further (to 0.09 and 0.08) did not produce a significant additional improvement. Hence, the DNN model achieved its best accuracy at a learning rate of 0.1 with the Tanh activation function.

Table 2 Performance comparison of the DNN model under different grid-search settings

Second, we carried out experiments to test the performance of the DNN with different iteration counts during training. The results for the Tanh, ReLU, and Sigmoid activation functions show that the error loss stabilizes after 500 epochs. In this study, we used two activation functions: Tanh at the hidden layers and Softmax at the output layer, which assigns an input instance to the sumoylation or non-sumoylation class. The specific optimal parameters of the DNN used in this study are listed in Table 1.
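As an illustration of this tuning step, the minimal sketch below runs a grid search over the activation function, learning rate, and iteration count. It uses scikit-learn's MLPClassifier only as a stand-in for the paper's DNN; the feature matrix, labels, hidden-layer sizes, and parameter grid are placeholder assumptions, not the authors' exact configuration.

```python
# Minimal grid-search sketch, assuming placeholder data and an MLPClassifier
# stand-in for the DNN described in the text.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 120)            # placeholder for the optimized PsePSSM features
y = rng.randint(0, 2, size=200)   # placeholder sumoylation / non-sumoylation labels

param_grid = {
    "activation": ["tanh", "relu", "logistic"],   # "logistic" is scikit-learn's sigmoid
    "learning_rate_init": [0.1, 0.09, 0.08],      # learning rates explored around the optimum
    "max_iter": [500],                            # error loss reported to stabilize near 500 epochs
}

dnn = MLPClassifier(hidden_layer_sizes=(64, 32), solver="sgd",
                    momentum=0.9, alpha=1e-4, random_state=1)

search = GridSearchCV(dnn, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```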
Performance analysis of cross-validation scheme

In computational biology and bioinformatics, statistical learning models are rigorously evaluated using validation methods such as the jackknife, k-fold, and sub-sampling tests. Among these, k-fold cross-validation is particularly prevalent because of its unbiased results: it systematically partitions the data, ensuring a thorough evaluation and enhancing the reliability of statistical learning models across diverse biological applications. This study analyzed the proposed method's performance using fivefold and tenfold cross-validation (CV) tests. The results in Table 3 indicate that the PSSM-Sumo model achieved higher accuracy with tenfold CV (95.94%) than with fivefold CV (95.91%).

Table 3 Performance of the PSSM-Sumo model using both the full feature set and an optimized subset of features

The feature vector obtained through the PsePSSM method contained 1,090 features, which may include irrelevant, redundant, and noisy features. To obtain efficient features and reduce the dimensionality, we employed the feature selection method discussed in the "Feature selection" section, reducing the feature matrix from 1,090 × 1,560 to 120 × 1,560. The proposed model was evaluated with both the full and the optimized feature sets to ensure a thorough analysis of its capabilities; the experimental results are shown in Table 3.

Table 3 shows that the proposed model performs better with the optimized feature set than with the entire feature set. For instance, under tenfold cross-validation the model achieved an accuracy of 95.94% with the entire feature set, compared with an average accuracy of 98.71% with the optimized feature set. Similar improvements were obtained for the other performance metrics with the optimized feature set: specificity (99.68%), sensitivity (97.72%), F1 score (0.974), and MCC (0.974). Given the strength of the optimized feature vector and its prediction results under tenfold testing, we selected the optimized vector and the DNN classifier as our training model.

Moreover, the performance of the PSSM-Sumo model was assessed using the Area Under the Curve (AUC), a measure of binary classification accuracy in which higher values correspond to better performance. As depicted in Fig. 4, the PSSM-Sumo model achieved its highest AUC values, 0.996 with tenfold cross-validation and 0.992 with fivefold cross-validation, when using the efficient feature set. These outcomes confirm the superior predictive capability of the proposed model, particularly with tenfold cross-validation and the selected features. Additionally, a confusion matrix is presented in Fig. 5 to further examine the behavior of the proposed DNN when predicting with the optimized feature vector under tenfold cross-validation.

Fig. 4 Comparison of AUC using different cross-validation schemes

Fig. 5 Confusion matrix of the DNN model using optimized features
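The sketch below illustrates this evaluation pipeline under stated assumptions: SelectKBest merely stands in for the paper's feature selection method, the data are synthetic placeholders with the same shape (1,560 samples × 1,090 PsePSSM features), and an MLPClassifier stands in for the DNN. Placing selection inside the pipeline keeps it within each cross-validation fold, so the held-out fold does not leak into feature ranking.

```python
# Tenfold CV with in-fold feature reduction (1,090 -> 120 features); all
# components are illustrative stand-ins, not the authors' implementation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.RandomState(0)
X_full = rng.rand(1560, 1090)          # 1,560 samples x 1,090 PsePSSM features (placeholder)
y = rng.randint(0, 2, size=1560)       # placeholder class labels

model = make_pipeline(
    SelectKBest(score_func=f_classif, k=120),   # stand-in for the paper's feature selection
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh",
                  learning_rate_init=0.1, max_iter=500, random_state=1),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
acc = cross_val_score(model, X_full, y, cv=cv, scoring="accuracy")
auc = cross_val_score(model, X_full, y, cv=cv, scoring="roc_auc")
print(f"10-fold accuracy: {acc.mean():.4f}, AUC: {auc.mean():.4f}")
```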
Performance comparison of different classifiers

In this section, we compare the DNN model with well-known machine learning algorithms, namely K-Nearest Neighbor (KNN) [36], Random Forest (RF) [34], and Support Vector Machine (SVM) [25, 42]. The KNN algorithm, often used in image processing, is an instance-based learning technique that classifies instances on the basis of distances [34, 43]. RF is a popular supervised learning method for regression and classification tasks that builds a large number of decision trees from bootstrap samples. The SVM algorithm, known for its effectiveness in bioinformatics, determines an optimal hyperplane that separates the classes either linearly or non-linearly. This comparative study highlights the specific strengths and weaknesses of each approach. To ensure a fair assessment of all the learning algorithms, we used the same effective feature set, standard evaluation measures, and validation methods.

Table 4 presents the performance evaluations of the different algorithms. The DNN model performed significantly better than the other models. For instance, the DNN model achieved an average accuracy of 98.71%, whereas the SVM achieved only 95.32%. Similarly, in terms of the MCC criterion for model stability, the DNN achieved a top value of 0.974, well above the SVM's best value of 0.911. Across all performance measures, the KNN model performed the worst.

Table 4 Comparison of the proposed model with the machine learning models considered

These findings indicate that our proposed model outperforms traditional learning algorithms. Because sumoylation and non-sumoylation sites are highly similar, traditional machine learning algorithms struggle to classify them accurately. Such methods often rely on a single processing layer, which may be insufficient for handling non-linear datasets and significantly limits their performance. In contrast, our DNN model uses multiple hidden layers to transform the input data layer by layer, which enhances its ability to distinguish between similar sites and leads to superior performance compared with traditional methods. Additionally, Fig. 6 compares the performance of the DNN model with the traditional learning algorithms in terms of AUC. The figure shows that the proposed DNN model obtains the highest AUC value of all models: the DNN scored an AUC of 0.996, while the SVM, RF, and KNN algorithms recorded AUC values of 0.981, 0.978, and 0.948, respectively.

Fig. 6 AUC performance comparison of ML algorithms
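A minimal sketch of such a comparison is shown below, assuming a shared placeholder feature matrix and an identical tenfold protocol so that differences reflect the learner rather than the evaluation setup. The hyper-parameters of KNN, RF, SVM, and the MLP stand-in for the DNN are illustrative defaults, not the values used in the paper.

```python
# Compare several classifiers on the same (assumed) optimized feature set
# with the same tenfold cross-validation scheme.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.rand(1560, 120)            # placeholder optimized feature set
y = rng.randint(0, 2, size=1560)   # placeholder class labels

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=1),
    "SVM": SVC(kernel="rbf", probability=True, random_state=1),
    "DNN": MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh",
                         learning_rate_init=0.1, max_iter=500, random_state=1),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.4f}, AUC={auc:.4f}")
```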
Performance comparison with existing models

In this section, we compare our proposed model with the existing benchmark methods [23,24,25]. These recent methods build prediction models based on machine learning algorithms. The performance of our proposed model and of the existing benchmark models was evaluated on the benchmark dataset using tenfold cross-validation. To facilitate comparison, Table 5 lists the results obtained by the existing state-of-the-art methods. It can be observed from Table 5 that our proposed PSSM-Sumo model performs markedly better than the existing models. For instance, the proposed model yielded the highest accuracy of 98.71%, while the best current predictor (Deep-Sumo) achieved the second-highest success rate of 96.47%. Likewise, PSSM-Sumo obtained an MCC of 0.974, the highest value achieved and considerably better than Deep-Sumo's 0.929. These outcomes emphasize the superior performance of PSSM-Sumo compared with the existing models, with an average improvement in success rate of 10.46%.

Table 5 Performance comparison with the existing models

Performance analysis of classification learners using an independent dataset

In most cases, the generalization capability of a prediction model is examined on unseen data. Therefore, to test our proposed model we used an independent dataset, obtained from an 80% training / 20% testing split. The performance on the independent dataset is given in Table 6. Among the traditional algorithms, SVM achieved the best accuracy of 92.53%, with sensitivity, specificity, and MCC of 93.23%, 91.82%, and 0.861, respectively. The DNN obtained higher prediction outcomes, with an accuracy of 94.45%, a sensitivity of 96.87%, a specificity of 92.03%, and an MCC of 0.912.

Table 6 Comparison of the proposed model with machine learning models on the independent dataset
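The independent-test protocol can be sketched as follows, again under assumptions: a stratified 80/20 split of placeholder data, an MLPClassifier stand-in for the DNN, and sensitivity, specificity, and MCC derived from the confusion matrix.

```python
# Independent-test sketch: 80/20 stratified split, then accuracy,
# sensitivity, specificity, and MCC on the held-out 20%.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

rng = np.random.RandomState(0)
X = rng.rand(1560, 120)            # placeholder optimized feature set
y = rng.randint(0, 2, size=1560)   # placeholder class labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh",
                    learning_rate_init=0.1, max_iter=500, random_state=1)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_te, y_pred))
print("Sensitivity:", tp / (tp + fn))   # true positive rate
print("Specificity:", tn / (tn + fp))   # true negative rate
print("MCC        :", matthews_corrcoef(y_te, y_pred))
```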
