A hybrid residue-based sequential encoding mechanism with an XGBoost-improved ensemble model for identifying 5-hydroxymethylcytosine modifications

The 5-hydroxymethylcytosine (5hmC) RNA modification predictor involves several interrelated steps. It begins with extracting RNA sequences and applying feature extraction techniques. The extracted features are then analyzed and optimized using feature selection techniques. Lastly, predictive modeling employs different performance assessment parameters such as accuracy, sensitivity, specificity, Matthews correlation coefficient, and precision-recall scores. The complete framework of the proposed model, representing each phase, is provided in Fig. 1.
Fig. 1 Predictive architecture for identifying 5hmC modifications in RNA sequences. This figure depicts the architecture of the model used to predict 5-hydroxymethylcytosine (5hmC) modifications in RNA sequences. It outlines the steps from data input and feature extraction to the final machine learning model (XGBoost) used for prediction. The workflow highlights the integration of various computational techniques to identify 5hmC sites effectively.
Benchmark dataset
In bioinformatics and machine learning, the acquisition or selection of a valid benchmark dataset is an essential step in developing an intelligent computational model, and the choice of a suitable benchmark has a high impact on the model's performance. According to Chou's comprehensive review18,19, a valid and reliable benchmark dataset is always required for the design of a powerful and robust computational model. Hence, in this paper, we used the same benchmark datasets that were used in16. The selected datasets can be expressed in mathematical form using Eq. (1).$$R_{1} = R_{1}^{ + } + R_{1}^{ - }$$
(1)
where \(R_{1}\) represents the full set of 5hmC samples, \(R_{1}^{+}\) the positive 5hmC samples, and \(R_{1}^{-}\) the negative 5hmC samples. Firstly, the CD-HIT software was applied with a threshold of 80% to eliminate highly similar sequences. Secondly, we employed random sampling to select the same number of positive samples as negative samples, balancing the benchmark dataset. Finally, we obtained a benchmark dataset containing a total of 1324 sequences, of which 662 are 5hmC sequences and 662 are non-5hmC samples. To assess the generalization power of our proposed model, we employed an independent dataset: the original dataset was divided into two parts, with 80% of the samples allocated for training and 20% reserved for testing. The 20% used for testing was randomly selected to form the independent dataset, and the training phase then utilized the remaining 80% of the data. The independent dataset consists of 264 sequences, of which 132 are positive samples and 132 are negative samples.
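As a minimal sketch of the 80/20 split described above (assuming the balanced benchmark is already loaded as lists of sequences and labels; the stratification option is an assumption made here to preserve the reported 132/132 balance, and all variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# sequences: list of RNA sequence strings; labels: 1 for 5hmC, 0 for non-5hmC
def split_benchmark(sequences, labels, test_size=0.2, seed=42):
    """Randomly reserve 20% of the balanced benchmark as an independent test set."""
    return train_test_split(
        sequences, labels,
        test_size=test_size,   # 20% independent dataset
        stratify=labels,       # assumption: keep the 50/50 positive/negative balance
        random_state=seed,
    )

# X_train, X_test, y_train, y_test = split_benchmark(sequences, labels)
```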
Feature extraction techniques
To generate prominent, reliable, and variant statistical-based discriminative descriptors, several feature encoding approaches have been utilized for the formulation of protein, RNA, and DNA sequences20. A detailed overview of the proposed feature encoding schemes is presented in the sections below.
Mismatches
The Mismatch descriptor21 counts the occurrences of k-length neighboring nucleic acids that differ by at most m mismatches (m < k). Instead of analyzing only exact occurrences of k-mers, the profile allows up to m mismatches. For a 3-length subsequence "AAC" with at most one mismatch, we need to consider three cases, "–AC", "A–C", and "AA–", where "–" can be replaced by any nucleic acid residue. This descriptor is governed by two parameters: k denotes the k-mers that are considered "neighbors" in the analysis, and m is the number of mismatches allowed, a threshold that defines how flexible the matching process is22. With the mismatch threshold written as m, the descriptor \(C_{i,j}\) counts the occurrences of the ith k-mer type with j mismatches, and the full feature vector is formalized as in Eq. (2).$$C_{k,m} = \left( {\sum\limits_{j = 0}^{m} {C_{1,j} ,} \sum\limits_{j = 0}^{m} {C_{2,j} , \ldots ,} \sum\limits_{j = 0}^{m} {C_{{4^{k} ,j}} } } \right)$$
(2)
where \(C_{i,j}\) represents the number of occurrences of the ith k-mer type with j mismatches, \(i = 1,2,3, \ldots ,4^{k}\) and \(j = 0,1,2, \ldots ,m\).
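A minimal sketch of this counting scheme (an illustrative helper written for this description, not the authors' implementation, assuming k = 3 and m = 1 over the RNA alphabet):

```python
from itertools import product

ALPHABET = "ACGU"

def mismatch_profile(seq, k=3, m=1):
    """Count k-mers in seq, letting each canonical k-mer match windows with up to m mismatches."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for kmer in kmers:
            mismatches = sum(a != b for a, b in zip(window, kmer))
            if mismatches <= m:
                counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]  # 4**k dimensional feature vector

# Example: mismatch_profile("UCGUUCAUGG") returns a 64-dimensional count vector.
```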
Accumulated Nucleotide Frequency (ANF)
The Accumulated Nucleotide Frequency (ANF) encoding23 captures both the identity of each nucleotide and the nucleotide distribution along the RNA sequence24. The density \(d_{i}\) of the nucleotide at position \(i\) in the RNA sequence is computed using Eq. (3).$$d_{i} = \frac{1}{{\left| {S_{i} } \right|}}\sum\limits_{j = 1}^{i} {f(s_{j} )} ,\quad f(s_{j} ) = \left\{ {\begin{array}{*{20}l} 1 & {if\;s_{j} = q} \\ 0 & {otherwise} \\ \end{array} } \right.$$
(3)
Here, \(l\) denotes the sequence length, \(S_{i}\) is the prefix \(\left\{ {s_{1} ,s_{2} , \ldots ,s_{i} } \right\}\) of length \(\left| {S_{i} } \right| = i\), and \(q\) is the nucleotide at position \(i\), an element of the set (A, C, G, U). Consider the sequence "UCGUUCAUGG". For 'U', the density is 1 at position 1, 0.5 at position 4, 0.6 at position 5, and 0.5 at position 8. For 'C', the density is 0.5 at position 2 and 0.33 at position 6. For 'G', the densities at positions 3, 9, and 10 are 0.33, 0.22, and 0.3, respectively. Finally, 'A' at position 7 has a density of 0.14.
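A minimal sketch of the density computation (an illustrative helper, not the authors' code); it reproduces the worked example above:

```python
def anf_encoding(seq):
    """Accumulated nucleotide frequency: density of the nucleotide at each position,
    computed over the prefix ending at that position."""
    densities = []
    for i, nucleotide in enumerate(seq, start=1):
        prefix = seq[:i]
        densities.append(round(prefix.count(nucleotide) / i, 2))
    return densities

# anf_encoding("UCGUUCAUGG") -> [1.0, 0.5, 0.33, 0.5, 0.6, 0.33, 0.14, 0.5, 0.22, 0.3]
```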
Position-specific trinucleotide propensity based on single-strand (PSTNPSS)
PSTNPss, which stands for position-specific trinucleotide propensity based on single-strand, is a computational method developed to analyze the characteristics of single-stranded DNA or RNA. This approach, outlined in studies25,26, focuses on the statistical properties of trinucleotides within biological sequences27,28. With 64 possible trinucleotides (AAA, AAC, AAG, …, UUU), PSTNPss aims to capture each trinucleotide's position-specific propensity within a given sequence. To achieve this, PSTNPss utilizes a matrix representation, typically of dimensions 64 × (L − 2), where L represents the sequence length in bases. Each cell in this matrix corresponds to a specific trinucleotide at a particular position within the sequence. By analyzing the frequency and distribution of trinucleotides across different positions, PSTNPss provides insight into the positional preferences of trinucleotides along single-stranded DNA or RNA sequences, enabling researchers to uncover patterns indicating functional or structural elements within the genetic material.$$z = \left[ {\begin{array}{*{20}c} {z_{1,1} } & {z_{1,2} } & \cdots & {z_{1,L - 2} } \\ {z_{2,1} } & {z_{2,2} } & \cdots & {z_{2,L - 2} } \\ \vdots & \vdots & \ddots & \vdots \\ {z_{64,1} } & {z_{64,2} } & \cdots & {z_{64,L - 2} } \\ \end{array} } \right]$$
(4)
In the given formula:$$z_{i,j} = F^{ + } (3mer_{i} |j) - F^{ - } (3mer_{i} |j)$$
(5)

\(i\) ranges from 1 to 64, representing the 64 possible trinucleotides.

\(j\) ranges from 1 to \(L - 2\), where \(L\) is the sequence length, indicating the positions within the sequence.

\(F^{ + } (3mer_{i} |j)\) and \(F^{ - } (3mer_{i} |j)\) denote the frequencies of the ith trinucleotide (3meri) at the jth position in the positive \(\left( {S^{ + } } \right)\) and negative \(\left( {S^{ - } } \right)\) datasets, respectively.

For instance, 3mer1 corresponds to AAA, 3mer2 corresponds to AAC, and so on, up to 3mer64, which corresponds to UUU.

Therefore, the sample can be represented as follows:$$s = \left[ {\phi_{1} ,\phi_{2} , \ldots ,\phi_{L - 2} } \right]^{T}$$
(6)
where \(T\) denotes the transpose operator and \(\phi_{u}\) is defined as follows:$$\phi_{u} = \left\{ {\begin{array}{*{20}c} {z_{1,u} } & {when\,N_{u} N_{u + 1} N_{u + 2} = AAA} \\ {z_{2,u} } & {when\,N_{u} N_{u + 1} N_{u + 2} = AAC} \\ {} & \vdots \\ {z_{64,u} } & {when\,N_{u} N_{u + 1} N_{u + 2} = UUU} \\ \end{array} } \right.$$
(7)
The PSTNP descriptor, utilizing a statistical approach based on single-stranded DNA or RNA characteristics, has been successfully applied in predicting DNA N4-methylcytosine sites29. It involves calculating the frequency of each trinucleotide at different positions in the positive and negative datasets and representing them in a matrix, and it has shown strong predictive efficacy.
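A minimal sketch of how the z matrix and the per-sample PSTNPss vector could be computed from the positive and negative training sets (illustrative helpers assuming fixed-length RNA sequences over A, C, G, U; not the authors' implementation):

```python
from itertools import product

ALPHABET = "ACGU"
TRIMERS = {"".join(p): idx for idx, p in enumerate(product(ALPHABET, repeat=3))}

def position_frequencies(sequences, L):
    """F(3mer_i | j): frequency of each trinucleotide at each position j = 0..L-3."""
    freq = [[0.0] * (L - 2) for _ in range(64)]
    for seq in sequences:
        for j in range(L - 2):
            freq[TRIMERS[seq[j:j + 3]]][j] += 1.0 / len(sequences)
    return freq

def pstnp_matrix(pos_seqs, neg_seqs, L):
    """z_{i,j} = F+(3mer_i | j) - F-(3mer_i | j)."""
    f_pos, f_neg = position_frequencies(pos_seqs, L), position_frequencies(neg_seqs, L)
    return [[f_pos[i][j] - f_neg[i][j] for j in range(L - 2)] for i in range(64)]

def pstnp_encode(seq, z):
    """phi_u = z_{i,u} for the trinucleotide observed at position u of the sample."""
    return [z[TRIMERS[seq[u:u + 3]]][u] for u in range(len(seq) - 2)]
```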
Adaptive skip dipeptide composition (ASDC)
The adaptive skip dipeptide composition (ASDC) is an advanced form of dinucleotide composition designed to incorporate correlation information between adjacent and intervening residues30. In ASDC, the feature vector for a given sequence is represented as follows:$$ASDC = \left( {f_{v1} ,f_{v2} , \ldots ,f_{v16} } \right)$$
(8)
Each component of the ASDC vector is calculated as:$$f_{vi} = \frac{{\sum\limits_{g = 1}^{L - 1} {O_{i}^{g} } }}{{\sum\limits_{i = 1}^{16} {\sum\limits_{g = 1}^{L - 1} {O_{i}^{g} } } }}$$
(9)
where \(f_{vi}\) represents the occurrence frequency of the ith possible dinucleotide, counted over all residue pairs with up to \(L - 1\) intervening nucleotides. The ASDC descriptor has demonstrated successful applications in predicting anti-cancer peptides and cell-penetrating peptides31.
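A minimal sketch of the skip-dinucleotide counting behind Eq. (9) (an illustrative helper, not the authors' implementation):

```python
from itertools import product

ALPHABET = "ACGU"
DINUCS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 16 dinucleotides

def asdc_encoding(seq):
    """Adaptive skip dinucleotide composition: frequency of each dinucleotide over
    all ordered residue pairs (i, j) with j > i, i.e. with any number of skips."""
    counts = {d: 0 for d in DINUCS}
    for i in range(len(seq) - 1):
        for j in range(i + 1, len(seq)):
            counts[seq[i] + seq[j]] += 1
    total = sum(counts.values())
    return [counts[d] / total for d in DINUCS]

# asdc_encoding("UCGUUCAUGG") returns a 16-dimensional normalized vector.
```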
Dinucleotide-based auto covariance (DAC)
The dinucleotide-based auto covariance (DAC) encoding32,33 measures the correlation of the same physicochemical index between two dinucleotides separated by a lag distance along the sequence34,35. The DAC can be calculated as:$$DAC(u,lag) = \sum\limits_{i = 1}^{L - lag - 1} {\left( {\frac{{\left[ {P_{u} (R_{i} R_{i + 1} ) - P^{\prime}_{u} } \right]\left[ {P_{u} (R_{i + lag} R_{i + lag + 1} ) - P^{\prime}_{u} } \right]}}{{L - lag - 1}}} \right)}$$
(10)
where \(u\) is a physicochemical index, \(L\) is the length of the nucleotide sequence, \(P_{u} (R_{i} R_{i + 1} )\) is the numerical value of the physicochemical index \(u\) for the dinucleotide \(R_{i} R_{i + 1}\) at position \(i\), and \(P^{\prime}_{u}\) is the average value of the physicochemical index \(u\) along the whole sequence:$$P^{\prime}_{u} = \frac{{\sum\limits_{i = 1}^{L - 1} {P_{u} (R_{i} R_{i + 1} )} }}{{L - 1}}$$
(11)
The dimension of the DAC feature vector is \(N \times LAG\), where \(N\) is the number of physicochemical indices and LAG is the maximum lag \(\left( {lag = 1,2, \ldots ,LAG} \right)\).
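A minimal sketch of the DAC computation for a single physicochemical index (the index table passed in is a hypothetical placeholder; real dinucleotide property values would be substituted):

```python
def dac(seq, index_values, max_lag):
    """Dinucleotide auto covariance for one physicochemical index.

    index_values: dict mapping each dinucleotide (e.g. "AU") to its index value
    (placeholder; a real physicochemical table would be used in practice).
    """
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    values = [index_values[d] for d in dinucs]
    mean_u = sum(values) / len(values)                    # P'_u, Eq. (11)
    features = []
    for lag in range(1, max_lag + 1):
        n = len(values) - lag                             # L - lag - 1 terms
        cov = sum((values[i] - mean_u) * (values[i + lag] - mean_u)
                  for i in range(n)) / n
        features.append(cov)
    return features  # one value per lag; the full DAC vector spans N indices x LAG lags
```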
Fused feature vector
In this model, we applied five different feature encodings, namely Mismatches (MisM), ASDC, DAC, PSTNPSS, and ANF, to capture nucleotide-based features while preserving residue-ordering information. Moreover, to build a highly discriminative model representing multi-perspective features, we serially concatenated the extracted features into a single vector, compensating for the weaknesses of the individual feature vectors, as follows:$$Hybrid\_Vector = MisM \cup ASDC \cup DAC \cup ANF \cup PSTNPSS$$
(12)
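As a minimal sketch of the serial fusion in Eq. (12), using the illustrative encoder helpers sketched in the preceding sections (not the authors' code; dimensions shown in the comments are those of the sketches):

```python
import numpy as np

def hybrid_vector(seq, z_matrix, dac_index, max_lag):
    """Serially concatenate the five per-sequence encodings into one feature vector."""
    return np.concatenate([
        mismatch_profile(seq),            # 64 features (k=3, m=1)
        asdc_encoding(seq),               # 16 features
        dac(seq, dac_index, max_lag),     # LAG features per physicochemical index
        anf_encoding(seq),                # L features
        pstnp_encode(seq, z_matrix),      # L-2 features
    ])
```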
SHAP feature selection
It is not always straightforward to determine the biological significance of the selected descriptors, and machine learning algorithms are sometimes called "black boxes" due to their complex inner structure. Discussions of data shape in machine learning refer to the structure and size of the data used for various tasks. This aspect is crucial, since the dataset size and the way it is processed must be determined, and machine learning algorithms can differ in performance and usability depending on the dataset on which they are trained. Understanding the data shape is essential for preprocessing steps such as splitting the data into training and testing sets, normalization, and feature selection; properly shaping the data provides critical insights that guide decision-making, a key element of the data science pipeline. SHapley Additive exPlanations (SHAP) uses cooperative game theory to distribute credit among the contributions of input features in machine learning algorithms36,37,38. It assigns a quantitative value to each input feature, indicating how strongly that feature drives the prediction. SHAP compares the model output with and without the feature of interest; this difference shows the extent of the feature's influence on the result, and Eq. (13) gives the formal mathematical formulation, capturing the incremental effect of adding feature \(i\) to different subsets of features.$$SHAP_{i} (x) = \phi_{i} = \sum\nolimits_{{S \subseteq N\backslash \{ i\} }} {\frac{{\left| S \right|!\left( {\left| N \right| - \left| S \right| - 1} \right)!}}{{\left| N \right|!}}} \left[ {f(S \cup \left\{ i \right\}) - f(S)} \right]$$
(13)
where:

\(\phi_{i}\) represents the SHAP value for feature \(i\).

\(N\) is the set of all features.

\(S\) is a subset of features excluding feature \(i\).

\(f\left( S \right)\) is the model’s prediction given features in S.

\(f(S \cup \left\{ i \right\})\) is the model’s prediction given the features in S plus feature \(i\).

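A minimal sketch of how a SHAP-based ranking could be used to keep the top 64 features retained in this work (assuming a fitted XGBoost model and a training feature matrix; function and variable names are illustrative):

```python
import numpy as np
import shap

def select_top_features(model, X_train, top_k=64):
    """Rank features by mean absolute SHAP value and keep the top_k indices."""
    explainer = shap.TreeExplainer(model)           # tree-based explainer for XGBoost
    shap_values = explainer.shap_values(X_train)    # one SHAP value per sample and feature
    importance = np.abs(shap_values).mean(axis=0)   # global importance per feature
    return np.argsort(importance)[::-1][:top_k]

# selected = select_top_features(xgb_model, X_train)
# X_train_selected = X_train[:, selected]
```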
Figure 2 shows the main features selected for this study. Every row in the chart is a different feature, and each dot shows the SHAP value of that feature for a specific example. Red dots mean the feature’s value is high, and blue dots mean it is low. The horizontal axis shows the SHAP values, which indicate how much each feature influences the model’s predictions. A positive SHAP value means the feature increases the likelihood of the 5hmC (positive) class, whereas a negative value means it increases the likelihood of the non-5hmC (negative) class.
Fig. 2 SHAP-based feature selection on the hybrid features. This figure illustrates the SHAP analysis used to select important features from the hybrid feature set, identifying those that contribute most significantly to the model’s predictions.
In our study, we tested feature groups of varying sizes and found that the group with the top 64 features greatly improved the performance of our proposed model. Moreover, to thoroughly investigate the instance-based behavior of the extracted features, we performed a LIME analysis on a randomly selected instance after the SHAP analysis to predict the targeted classes, as shown in Fig. 3, where class 1 represents the positive class and class 0 represents the negative class.
Fig. 3 LIME analysis on randomly selected instances after SHAP feature selection. This figure presents the LIME analysis applied to randomly selected instances, following SHAP-based feature selection, to provide interpretability of the model’s predictions by highlighting the contribution of individual features.
Samples visualization via t-SNE
To investigate the effects of the extracted features, we used t-distributed stochastic neighbor embedding (t-SNE) to visualize the hybrid feature vector before and after applying feature selection, as shown in Fig. 4. The t-SNE maps revealed distinct clusters, facilitating the differentiation of positive and negative samples; false negative and false positive samples were notably situated between true negative and true positive samples, though their occurrences were infrequent. The t-SNE approach represents local structural information as well as global structural relationships39 and converts the high-dimensional feature vectors into a 2D space. In Fig. 4A, the hybrid features show some degree of overlap between positive and negative samples, which is somewhat effective but does not accurately separate the targeted classes. In Fig. 4B, however, the data samples of both classes are clearly separable, demonstrating the effectiveness of the SHAP-based optimal features in discriminating between 5hmC and non-5hmC compared to the hybrid features in Fig. 4A.
Fig. 4 t-SNE visualization of (A) hybrid training features and (B) SHAP-based selected features. This figure shows the t-SNE visualization for the two feature sets, illustrating the clustering and separation of data points in the different feature spaces.
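A minimal sketch of the t-SNE projection used for such visualizations (illustrative scikit-learn usage; the perplexity value is an assumption, not a reported setting):

```python
from sklearn.manifold import TSNE

def tsne_2d(features, seed=42):
    """Project a high-dimensional feature matrix to 2D for visualization."""
    return TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(features)

# embedding_hybrid   = tsne_2d(X_hybrid)     # hybrid (fused) features, as in Fig. 4A
# embedding_selected = tsne_2d(X_selected)   # SHAP-selected features, as in Fig. 4B
```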
XGBoost
XGBoost (extreme gradient boosting) is an optimized, distributed gradient boosting library designed for fast, flexible, and scalable machine learning. At its core, XGBoost uses gradient boosting, an ensemble learning approach that builds a predictive model by combining the results of several weak learners, usually decision trees40,41. XGBoost provides a dedicated matrix data structure (DMatrix) for efficient data storage and access during model training and evaluation. For regression tasks, the goal is to minimize the mean squared error (MSE) between the observed and forecast values, and the available loss functions include squared error, absolute error, and Huber loss. The objective and loss functions define the criteria for the training optimization process: the objective function measures the model as a whole, while the loss function compares predicted and actual values. Training and evaluation proceed iteratively by adding new trees to the ensemble; each tree is fitted sequentially to the negative gradient of the loss function, correcting the errors of the previous trees. Cross-validation is used to assess model quality by dividing the dataset into several subsets and using different combinations of those subsets to train and evaluate the model at each stage. When building an XGBoost classifier, a binary or multiclass classification objective such as logistic regression or softmax is typically chosen. Users can switch freely between the native XGBoost API and the scikit-learn-compatible API, providing compatibility with other machine learning libraries and allowing users to enjoy the functionality of both APIs according to their preference, as shown in Fig. 5.
Fig. 5 The workflow of the XGBoost algorithm. This figure depicts the step-by-step workflow of the XGBoost algorithm, including data input, the boosting process, and the final model output. It illustrates how the algorithm enhances prediction accuracy through iterative training and optimization.
XGBoost works by constructing a set of decision trees that succeed one another and, in the end, produce the predictions. It starts from a dataset with the required input features (X) and builds decision trees T1 to TK, each with parameters Φk; the purpose of each tree is to rectify the mistakes made by the previous tree. In each iteration, the algorithm computes the residuals of the previous model and fits a new tree to minimize these residuals. This procedure continues until a predefined number of trees (K) is reached. The algorithm then accumulates the separate models \(\left( {f_{k} \left( {X,\Phi_{k} } \right)} \right)\) proposed by the individual trees to create the final prediction. By sequentially revising the model predictions to address the errors of the previous iteration, XGBoost delivers highly accurate and robust forecasting for complex tasks. XGBoost has various applications in machine learning, especially in regression and classification tasks, and is well known for its efficiency, scalability, and high predictive performance.
Its logic is based on the concept of a tree ensemble, an iterative process of building decision trees in which each tree rectifies the errors made by the previous ones. It optimizes a differentiable loss function with gradient descent, making it resilient and performant even on high-dimensional datasets. XGBoost’s main objective is to minimize the loss function, which represents the divergence between the actual and forecasted values. Several lightweight weak learners (decision trees) are combined step by step into a strong learner (the ensemble model). The procedure sets the parameters of each tree according to the gradient of the loss function, allowing the model to correct itself at each iteration. This background and these objectives provide a good basis for understanding the algorithmic details behind XGBoost’s strong performance in machine learning applications, summarized in the listing below.
XGBoost Algorithm: Start with a dataset containing input features.
1. Initialize model parameters:
 • Set hyperparameters: learning rate \(\eta\), maximum depth of trees, and number of trees \(T\).
2. Prepare data:
 • Load the dataset: X (features) and y (labels); split the data into training and validation sets.
 • Initialize the ensemble: \(\hat{f}^{0} \left( x \right) = mean\left( y \right)\).
3. For each iteration \(t\):
 • Compute the negative gradient: \(g_{i} = \frac{\partial }{{\partial \hat{f}^{t - 1} (x_{i} )}}L(y_{i} ,\hat{f}^{t - 1} (x_{i} ))\).
 • Compute the second-order gradient (optional): \(h_{i} = \frac{{\partial^{2} }}{{\partial \hat{f}^{t - 1} (x_{i} )^{2} }}L(y_{i} ,\hat{f}^{t - 1} (x_{i} ))\).
 • Train a decision tree \(h_{t}\) to predict the negative gradient: \(h_{t} = \arg \min_{h} \sum\limits_{i = 1}^{N} {\left[ {g_{i} h(x_{i} ) + \frac{1}{2}h(x_{i} )^{2} + \Omega (h)} \right]}\).
 • Update the model: \(\hat{f}^{t} (x) = \hat{f}^{t - 1} (x) + \eta h_{t} (x)\).
4. Regularization:
 • Apply regularization techniques and tree pruning: \(\Omega \left( h \right)\).
 • Feature subsampling: \(\gamma\).
5. Evaluation:
 • Evaluate performance on the validation set using evaluation metrics: \(L(y,\hat{f}^{t} (x))\).
6. Stopping criteria:
 • Check convergence or the predefined number of iterations: if converged or \(t = T\), stop training.
7. Output model:
 • Return the final ensemble of decision trees: \(\hat{f}(x) = \sum\limits_{t = 1}^{T} {\eta h_{t} (x)}\).
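A minimal sketch of training and evaluating the classifier with the scikit-learn-compatible XGBoost API (hyperparameter values and the cross-validation fold count are illustrative placeholders, not the tuned settings reported in this work):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, matthews_corrcoef

# X_train, y_train, X_test, y_test: SHAP-selected features and labels (see above)
model = XGBClassifier(
    n_estimators=200,    # number of boosted trees T (illustrative)
    learning_rate=0.1,   # shrinkage eta applied to each tree's contribution
    max_depth=6,         # maximum depth of each tree
    eval_metric="logloss",
)

# k-fold cross-validation on the training split (fold count illustrative)
# cv_acc = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

# Fit on the training set and evaluate on the independent test set
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# print(accuracy_score(y_test, y_pred), matthews_corrcoef(y_test, y_pred))
```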
