iDLB-Pred: identification of disordered lipid binding residues in protein sequences using convolutional neural network

The genomic data are converted into fixed-size vectors using statistical moments. Each moment provides distinct information about the nature of the data. Researchers have examined the moments of a variety of distributions to see whether they are suited to this procedure. The feature set includes the raw, central, and Hahn moments of the genomic data, which form the essential elements of the model's input vector. The features of genomic and proteomic sequences depend on the relative positions of their bases. As a result, several computational and mathematical models have been developed that use the correlated positioning of nucleotide bases in genomic sequences to build the feature vector. It is essential to proceed with a reliable and consistent feature set24. A genomic sequence S of length n is converted into a two-dimensional matrix S′ of size k × k that holds the same information as S in two-dimensional form, since Hahn moments require two-dimensional data25:
$$k = \sqrt{n}$$
(7)
$$S^{\prime} = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1n} \\ S_{21} & S_{22} & \cdots & S_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ S_{n1} & S_{n2} & \cdots & S_{nn} \end{bmatrix}$$
(8)
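As an illustration of Eqs. (7) and (8), the following minimal Python sketch arranges a numerically encoded sequence row by row into a k × k matrix. The function name `to_square_matrix`, the zero padding, and the rounding up of k when n is not a perfect square are assumptions made here for illustration; the text does not state how such sequences are handled.

```python
import math


def to_square_matrix(values, pad=0):
    """Arrange a length-n sequence row by row into a k x k matrix.

    k is sqrt(n) when n is a perfect square. Otherwise k is rounded up and
    the tail is filled with `pad` (an assumption, not taken from the paper).
    """
    n = len(values)
    k = math.isqrt(n)
    if k * k < n:  # n is not a perfect square
        k += 1
    padded = list(values) + [pad] * (k * k - n)
    return [padded[r * k:(r + 1) * k] for r in range(k)]


if __name__ == "__main__":
    for row in to_square_matrix([1, 2, 3, 4, 5, 6, 7, 8, 9]):
        print(row)
```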
A fixed-size feature vector is created from the resulting square matrix using statistical moments, which reduces its dimensionality26. For this study, raw moments, Hahn moments, and central moments are employed. The raw moments of order a + b are computed as follows:
$$U_{ab} = \sum_{e=1}^{n} \sum_{f=1}^{n} e^{a} f^{b} \delta_{ef}$$
(9)
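Eq. (9) can be sketched in a few lines of Python. The sketch assumes that δ_ef is the matrix entry in row e and column f, with both indices starting at 1; the function name `raw_moment` is illustrative and does not come from the paper.

```python
def raw_moment(matrix, a, b):
    """Raw moment U_ab: sum over all cells of e^a * f^b * delta_ef.

    `matrix` is a list of rows; e (row) and f (column) are 1-based,
    matching the summation limits in Eq. (9).
    """
    return sum(
        (e ** a) * (f ** b) * value
        for e, row in enumerate(matrix, start=1)
        for f, value in enumerate(row, start=1)
    )


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    # U00 is simply the sum of all entries.
    print(raw_moment(grid, 0, 0), raw_moment(grid, 1, 0), raw_moment(grid, 1, 1))
```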
Moments up to order 3 carry the significant information in the sequences; these are U00, U10, U11, U20, U02, U21, U12, U03 and U30. Computing the central moments also requires the centroid (x̄, ȳ) to be calculated23. The centroid is the central point of the data, and it is used to calculate the central moments:
$$v_{ab} = \sum_{e=1}^{n} \sum_{f=1}^{n} \left( e - \bar{x} \right)^{a} \left( f - \bar{y} \right)^{b} \delta_{ef}$$
(10)
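A minimal Python sketch of Eq. (10) follows. The text does not give a formula for the centroid, so the sketch assumes the usual definition from the first-order raw moments, x̄ = U10/U00 and ȳ = U01/U00; the function names are illustrative.

```python
def raw_moment(matrix, a, b):
    """Raw moment U_ab of Eq. (9), with 1-based row index e and column index f."""
    return sum(
        (e ** a) * (f ** b) * value
        for e, row in enumerate(matrix, start=1)
        for f, value in enumerate(row, start=1)
    )


def central_moment(matrix, a, b):
    """Central moment v_ab of Eq. (10), taken about the centroid."""
    total = raw_moment(matrix, 0, 0)
    if total == 0:
        raise ValueError("centroid is undefined when the entries sum to zero")
    # Assumption: centroid = first-order raw moments divided by U00.
    x_bar = raw_moment(matrix, 1, 0) / total
    y_bar = raw_moment(matrix, 0, 1) / total
    return sum(
        ((e - x_bar) ** a) * ((f - y_bar) ** b) * value
        for e, row in enumerate(matrix, start=1)
        for f, value in enumerate(row, start=1)
    )


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    print(central_moment(grid, 2, 0), central_moment(grid, 1, 1))
```

With this choice of centroid, the first-order central moments v10 and v01 vanish by construction, so they carry no information beyond what the raw moments already provide.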
The Hahn moments are computed on a discretized square input grid. They offer data regularity and reversibility, since the original data can be reconstructed using the inverse Hahn moments. Because the Hahn moments are reversible, the information carried over from the original sequences is retained and used to build the model's feature vector. The Hahn moment calculation is based on the following expression:
$$h_{n}^{x,y}(p,Q) = (Q+y-1)_{n} (Q-1)_{n} \times \sum_{z=0}^{n} (-1)^{z} \frac{(-n)_{z} (-p)_{z} (2Q+x+y-n-1)_{z}}{(Q+y-1)_{z} (Q-1)_{z}} \frac{1}{z!}$$
(11)
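The expression in Eq. (11) can be evaluated directly once the Pochhammer symbol is available. The sketch below assumes the rising-factorial definition (a)_z = a(a+1)…(a+z−1) with (a)_0 = 1, reads the prefactor as (Q+y−1)_n in line with the denominator, and restricts x and y to non-negative integers so that exact rational arithmetic can be used. The function names are illustrative and not taken from the paper.

```python
from fractions import Fraction
from math import factorial


def pochhammer(a, z):
    """Rising factorial (a)_z = a (a+1) ... (a+z-1), with (a)_0 = 1."""
    result = 1
    for i in range(z):
        result *= a + i
    return result


def hahn_polynomial(n, p, Q, x=0, y=0):
    """Evaluate h_n^{x,y}(p, Q) as written in Eq. (11).

    Assumes integer arguments with 0 <= n <= Q - 1 and x, y >= 0, which
    keeps every denominator in the sum non-zero.
    """
    if not 0 <= n <= Q - 1:
        raise ValueError("order n must satisfy 0 <= n <= Q - 1")
    prefactor = pochhammer(Q + y - 1, n) * pochhammer(Q - 1, n)
    total = Fraction(0)
    for z in range(n + 1):
        numerator = (
            pochhammer(-n, z)
            * pochhammer(-p, z)
            * pochhammer(2 * Q + x + y - n - 1, z)
        )
        denominator = pochhammer(Q + y - 1, z) * pochhammer(Q - 1, z) * factorial(z)
        total += Fraction((-1) ** z * numerator, denominator)
    return prefactor * total


if __name__ == "__main__":
    # First-order polynomial on a grid of size Q = 4 with x = y = 0.
    print([hahn_polynomial(1, p, 4) for p in range(4)])
```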
The Pochhammer symbol used in Eq. (11) is defined in terms of the Gamma operator, as described by Akmal et al.27. Using the equation above, the coefficients of the Hahn moments are normalized and defined as follows:
$$H_{pq} = \sum_{j=0}^{G-1} \sum_{i=0}^{G-1} \delta_{pq} \, h_{p}^{a,b}(j,Q) \, h_{q}^{a,b}(i,Q), \qquad p,q = 0,1,2,\dots,Q-1$$
(12)
Convolutional neural network
A Convolutional Neural Network (CNN) is a deep learning technique used to process and analyze data with a grid-like structure, such as images, videos, and time series data. CNNs are very effective in computer vision tasks such as image classification, object detection, and image segmentation. They are distinguished by their capacity to learn hierarchical representations from the input data automatically. This is done by convolutional layers, which convolve the input data with a set of learnable filters, or kernels. These filters capture local patterns and characteristics, such as edges or textures, at various spatial positions within the input. By employing multiple filters, a CNN can learn to recognize many features at the same time. CNNs also include pooling layers, which down-sample the feature maps created by the convolutional layers. Pooling reduces the spatial dimensions of the data while retaining the most important properties. This results in a more compact representation, allowing the network to concentrate on the most significant features of the data, as shown in Fig. 2. The extracted features are then passed to fully connected layers, which are standard neural network layers31,32,33.
Fig. 2 Workflow of Convolutional Neural Network.
Deep neural network (DNN)
A Deep Neural Network (DNN) is a form of artificial neural network made up of numerous layers of connected nodes called neurons34. DNNs are built to learn and express complicated relationships in data using several layers of non-linear transformations. The raw input data is received by the input layer of a DNN and then passed through a succession of hidden layers. Each hidden layer is made up of several neurons that process the incoming input.
The output of one layer is used as the input for the next layer, so that higher-level features and representations are extracted in stages35. The strength of DNNs is their capacity to learn abstract and hierarchical representations of the input data. As input flows through the network, each layer learns to extract increasingly complex and significant information from the output of the preceding layer. This hierarchical representation lets DNNs capture complicated patterns and relationships in data. A DNN is trained with backpropagation, a technique in which the network adjusts its weights and biases to minimize the discrepancy between the expected and actual output36. This optimization is typically carried out using gradient descent methods, which update the network parameters based on the gradients of the loss function with respect to the parameters, as illustrated in Fig. 3.
Fig. 3 Structure of Deep Neural Network.
Multilayer perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a feedforward artificial neural network with multiple layers of connected neurons. It is one of the most basic and extensively used neural network topologies. MLPs are mostly used for supervised learning applications such as classification and regression. An MLP is made up of an input layer, one or more hidden layers, and an output layer. Each layer consists of several artificial neurons, also known as perceptrons, as shown in Fig. 4. The neurons in one layer are fully connected to the neurons in the preceding and following layers. The weight associated with each connection determines its strength. During an MLP's forward pass, input data is fed into the input layer, and calculations are carried out layer by layer. Each neuron in a hidden layer receives input from the neurons in the preceding layer, applies an activation function to the weighted sum of its inputs, and generates an output.
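The per-neuron computation described above, an activation function applied to a weighted sum of inputs, can be sketched in plain Python. The weights, biases, and layer sizes below are arbitrary illustrative values, and the function names are hypothetical; none of them are taken from the study.

```python
import math


def relu(v):
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, v)


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum + bias) per neuron.

    `weights` holds one row of input weights for each neuron in the layer.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Forward pass: feed each layer's output into the next layer."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


if __name__ == "__main__":
    layers = [
        ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu),  # hidden layer, 2 neurons
        ([[1.0, 0.5]], [-1.0], sigmoid),                  # output layer, 1 neuron
    ]
    print(mlp_forward([1.0, 2.0], layers))
```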
This procedure continues until the output layer is reached, which produces the network's final output. MLPs are distinguished by non-linear activation functions such as sigmoid, tanh, or ReLU (Rectified Linear Unit), which introduce non-linearity into the network and allow it to learn complicated relationships in the data.
Fig. 4 Representation of Multi-Layer Perceptron.
Recurrent neural network (RNN)
A Recurrent Neural Network (RNN) is a form of artificial neural network that uses recurrent connections between neurons to process sequential input, as illustrated in Fig. 5. Unlike feedforward neural networks, RNNs have feedback connections that allow them to retain and use information from earlier time steps or inputs37. RNNs are distinguished by their capacity to capture temporal relationships and handle sequences of varying lengths. As a result, they are well suited to tasks involving time series data, natural language processing, speech recognition, and other sequential data analysis. Each neuron in an RNN has an internal memory state that acts as a hidden state or context vector. At each time step, the network takes an input and combines it with the previous hidden state to create a new hidden state38. Because of this recursive feedback loop, information can persist and flow through the network, allowing it to represent and capture long-term relationships in sequential data. Depending on the task, the output of an RNN can be generated at each time step or at the end. In sequence classification, for example, the output is often generated at the final time step, summarizing the whole sequence. In sequence generation, such as text generation, the network can produce an output at each time step, progressively generating a sequence.
Fig. 5 Working Diagram of Recurrent Neural Network.
Long short-term memory (LSTM)
Long Short-Term Memory (LSTM) is a variant of the Recurrent Neural Network (RNN) developed to address the difficulty of capturing long-term dependencies in sequential input, as represented in Fig. 6. LSTMs were introduced to address the vanishing gradient problem and to allow RNNs to learn and store knowledge over longer time intervals. The memory cell is the basic building block of an LSTM, and it is made up of multiple components: an input gate, a forget gate, a cell state, and an output gate. These components work together to control the flow of information within the LSTM. The input gate regulates how much of the incoming data should be stored in the cell state at each time step. It applies a sigmoid activation function to the current input and the previous hidden state, yielding a number between 0 and 1 that indicates the current input's contribution to the cell state. The forget gate determines which cell state information should be discarded. It processes the current input and the previous hidden state with a sigmoid activation function to determine how much of the cell state should be kept and how much should be dropped. The cell state acts as the LSTM's long-term memory. It is updated by combining the previous cell state with the input gate's contribution and removing the information marked for deletion by the forget gate. Finally, at each time step, the output gate decides the output of the LSTM. It takes into account the current input and the previous hidden state, producing a value between 0 and 1. This value is multiplied by the updated cell state after a tanh activation function has been applied to it, yielding the LSTM output for the current time step.
Fig. 6 Model of Long short-term memory.
Gated recurrent unit (GRU)
GRU is an abbreviation for "Gated Recurrent Unit." It is a recurrent neural network (RNN) architecture that was developed to improve on existing RNNs.
GRUs are intended to overcome the vanishing gradient problem that can arise while training deep neural networks, particularly on long sequences. A critical property of GRUs is their use of gating mechanisms, which allow the network to selectively update and reset its hidden state. The gating mechanism helps the network retain useful information from previous time steps while discarding irrelevant data. This enables GRUs to capture long-term relationships in sequential data more efficiently. A GRU unit typically has two major gates: an update gate and a reset gate, as shown in Fig. 7. The update gate controls how much of the old hidden state is kept and how much of the new candidate hidden state is introduced. The reset gate regulates how much of the previous hidden state should be forgotten. By adaptively updating and resetting the hidden state, GRUs can efficiently capture important information and propagate it over time. GRUs have a simpler design with fewer parameters than the more complicated LSTM.
Fig. 7 Architecture of Gated Recurrent Unit.
The proposed approach
In this study, the proposed system combines advanced feature formulation techniques, such as relative positioning, with a Convolutional Neural Network (CNN) to analyze sequences. The feature formulation involves two widely used techniques, composition-specific and position-variant, which extract detailed features from the sequences and capture both the compositional and positional information of amino acid residues. The core of our feature extraction process includes the Position Relative Incidence Matrix (PRIM) and the Reverse Position Relative Incidence Matrix (RPRIM). PRIM captures the positional interaction among amino acid residues in a polypeptide chain, resulting in a 20 × 20 matrix that provides a comprehensive positional overview.
The matrix records the positions and correlations of residues, generating 400 coefficients that are then reduced to a manageable set of 30 features through statistical moments. RPRIM extends this concept by reversing the sequence, revealing hidden homologous features and ensuring that no positional information is lost. Using PRIM and RPRIM together captures both forward and reverse positional relationships. Additionally, we use the Frequency Vector (FV) to calculate the frequency of each amino acid residue in a sequence, providing essential compositional data. To further enhance the positional information, the Accumulative Absolute Position Incidence Vector (AAPIV) and its reverse (RAAPIV) are used. These vectors split the sequence into quarters and accumulate positional data, offering a refined view of residue distribution. The extracted features are then fed into a CNN, which is well suited to capturing spatial hierarchies in data. CNNs consist of multiple layers that convolve the input features, enabling the network to learn intricate patterns. Combining a CNN with these feature formulation methods allows high-dimensional data to be processed efficiently, leading to better predictive performance. Our CNN model, trained on these extracted features, has demonstrated high accuracy and robustness. The integration of these feature extraction techniques with a CNN forms a powerful system for analyzing protein sequences. By capturing both compositional and positional information, our system can identify subtle patterns that traditional methods might overlook.
Performance assessment
After developing a machine learning computational model, it is critical to assess the model's performance to determine how well it solves the given problem.
This is accomplished with several performance evaluation metrics, which the preceding studies also rely on. The choice of metric depends on the sample classes and the classification problem. The performance evaluation metrics are derived from the confusion matrix39,40,41,42, which stores the counts of correct and incorrect predictions for each class and so compares the predicted outcomes with the actual ones. Each column of the confusion matrix reflects the actual class, whereas each row represents the predicted class. In the metrics below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$$
(13)
$$Sensitivity = \frac{TP}{TP+FN}$$
(14)
$$Specificity = \frac{TN}{TN+FP}$$
(15)
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{\left[TP+FP\right]\left[TP+FN\right]\left[TN+FP\right]\left[TN+FN\right]}}$$
(16)
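The four metrics can be computed directly from the confusion-matrix counts. The following minimal sketch uses the standard definitions of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP). The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, and the counts in the example are invented rather than results from the study.

```python
import math


def _safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + fn + tn),
        "sensitivity": _safe_div(tp, tp + fn),  # true positive rate
        "specificity": _safe_div(tn, tn + fp),  # true negative rate
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which makes it more informative when the positive and negative classes are imbalanced.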
