Detecting depression severity using weighted random forest and oxidative stress biomarkers

Two classification models are proposed: a binary classification model to determine whether the patient has depression, and a multiclass classification model to detect the level of depression. Different combinations of factors were tested to determine the optimal combination of factors that contribute to the occurrence of depression, together with their importance. Biomarkers of oxidative stress [8-isoprostane, 8-OHdG, GSH, GSSG in addition to the glutathione ratio (GSH-GSSG)] were included in the first model as the main features. In the second model, all biomarkers, sociodemographic, genetic and health-related characteristics were included. Data imputation was performed following Jelinek et al.44, where classes are defined by the Cartesian product of class values, and incomplete information dismissal and data completion techniques are applied to reduce features and impute missing values. This method handles missing values in clinical data sets by considering all possible combinations of class values, filtering out incomplete information, and employing data completion to fill in missing values, thereby creating a more complete and reliable data set for analysis. In addition, it has proven effective in feature reduction and missing value imputation in similar clinical settings, improving the detection accuracy of data mining algorithms. Figure 3 shows a pictorial representation of the entire methodology.

Figure 3. Framework for detecting depression severity.

Python and the Scikit-learn library were used to implement the machine learning algorithms. First, the data set was split into training and testing data: 70% of the data set was used for training and 30% for testing. To tune the machine learning models, fivefold cross-validation with grid search over hyperparameter values was performed on the training set (“Multimedia Appendix 1”).
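The fivefold cross-validation with grid search described above can be illustrated with a small, dependency-free sketch. It is not the study's implementation (which used Scikit-learn); the helper names `k_fold_indices` and `grid_search_cv`, the toy threshold model and the toy data are all assumptions made for illustration.

```python
import random
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(X, y, fit, score, param_grid, k=5, seed=0):
    """Return (best_params, best_mean_score) over every combination in param_grid.

    fit(X_train, y_train, **params) must return a model;
    score(model, X_val, y_val) must return a number where higher is better.
    """
    folds = k_fold_indices(len(X), k, seed)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out, val_idx in enumerate(folds):
            # Train on every fold except the held-out one.
            train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
            model = fit([X[j] for j in train_idx], [y[j] for j in train_idx], **params)
            fold_scores.append(score(model, [X[j] for j in val_idx], [y[j] for j in val_idx]))
        avg = mean(fold_scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Hypothetical toy model: predict class 1 when the single feature exceeds `threshold`.
def fit_threshold(X_train, y_train, threshold):
    return threshold  # nothing is learned; the hyperparameter itself is the model


def accuracy(model, X_val, y_val):
    predictions = [1 if row[0] > model else 0 for row in X_val]
    return sum(p == t for p, t in zip(predictions, y_val)) / len(y_val)


# Made-up data: feature values 0..19, class 1 for values of 10 and above.
X = [[value] for value in range(20)]
y = [0] * 10 + [1] * 10

best_params, best_score = grid_search_cv(
    X, y, fit_threshold, accuracy, {"threshold": [2.5, 9.5, 15.5]}
)
print(best_params, best_score)
```

Only the training portion of the data would be passed to such a search, so that the held-out 30% remains untouched until the final evaluation.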
The optimal hyperparameters found for each model during the training phase were then used to evaluate the model’s performance on the testing set. The performance of the machine learning models was evaluated primarily using the AUC. Additional performance metrics such as recall, precision, F1 score, the area under the precision-recall curve (AUC-PR) and confusion matrices were also reported. SMOTE, weighted logistic regression (WLR) and weighted random forest (WRF) were implemented to improve detection performance given the class imbalance in the data set, achieving higher accuracy. To identify the relative importance of different factors, we determined the weight assigned to each factor, which reflects its contribution to the final detection. To evaluate the importance of features, we used the permutation importance technique available in the scikit-learn library40. Permutation importance measures the influence of an individual feature on the model’s performance by randomly permuting the values of that feature and measuring the subsequent change in performance. This provides valuable information on the contribution of each feature to the overall detection of the severity of depression45. In this study, we report feature importance for the model with the highest accuracy.

Data pre-processing

We extracted common features from the data sets and standardized them so that each variable had a mean of zero and a standard deviation of one. This standardization ensures a consistent scale across the data set, which is essential for many machine learning models. By eliminating the influence of varying scales, it improves model robustness and reduces sensitivity to scale variability.

After standardization, the data set was partitioned into a training set (70%) and a testing set (30%). The training set served as the basis for model development, while the testing set was used to evaluate the performance of the resulting trained model.
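The standardization and 70/30 partition described above can be sketched without external libraries as follows. The function names, the random seed and the two made-up biomarker columns are assumptions for illustration only; the study itself relied on Scikit-learn.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by 0 for constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def split_train_test(rows, labels, test_fraction=0.3, seed=42):
    """Shuffle the samples and hold out test_fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical toy data: two columns on very different scales, binary labels.
X = [[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 260.0], [5.0, 220.0],
     [6.0, 210.0], [7.0, 250.0], [8.0, 190.0], [9.0, 230.0], [10.0, 270.0]]
y = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

X_std = standardize(X)
X_train, X_test, y_train, y_test = split_train_test(X_std, y)
print(len(X_train), len(X_test))
```

The sketch follows the order given in the text (standardize, then split). After standardization both columns contribute on the same scale, which matters most for distance-based and margin-based models such as K-NN and SVM.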
To avoid overfitting of the machine learning models, fivefold cross-validation coupled with grid search across hyperparameter values was applied to the training set (see “Multimedia Appendix 1”) in order to determine the optimal configuration for each model.

Machine learning models

In this study, we used Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbour (K-NN), Support Vector Machine (SVM), Naïve Bayes (NB), and Artificial Neural Network (ANN). These models have shown high performance in previous studies focusing on predicting and detecting the severity of depression, as reported in the existing literature.

Logistic regression (LR)

LR is used for classification problems where the goal is to determine whether a new sample belongs to a particular class. It is designed for binary classification and can be generalized to multinomial outcomes11,46. It can also predict the probability of a specific class47. Mathematically, the logistic function is given by Eq. (1). There are variants of LR that help overcome the problem of overfitting, such as \(L_1\) and \(L_2\) regularization. \(L_1\) regularization, also known as Lasso regularization, adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function. \(L_2\) regularization, or Ridge regularization, instead adds a penalty proportional to the sum of the squared coefficients. In this study, both variants of LR were utilized.
$$\begin{aligned} F(x) = \frac{1}{1 + e^{-(\beta _{0} + \beta _{1}x_{1} + \beta _{2}x_{2} + \beta _{3}x_{3} + \cdots )}} \end{aligned}$$
(1)
where \( x_{1}, x_{2}, x_{3}, \ldots \) are the input variables, \( \beta _{0} \) is the intercept and \( \beta _{1}, \beta _{2}, \beta _{3}, \ldots \) are the coefficients, i.e., the slopes of the log-odds with respect to each input variable.

Random forest (RF)

RF is a predictive modeling tool that builds many decision trees and aggregates the predictions of the individual trees48. Consequently, it combines simplicity and flexibility to increase predictive accuracy49. The RF algorithm is stochastic. It randomly selects samples from the original data set and creates a decision tree. It then repeats this process, considering a random subset of variables from the original data set each time, resulting in a wide variety of trees. This variety makes RF more effective than an individual decision tree model50.

Furthermore, randomness in the generation of decision trees increases the generalizability of RF, so that the classifier is less likely to overfit51. Built-in cross-validation is another valuable characteristic of RF, and the model makes it possible to rank variables from the most to the least strongly associated with the outcome variable. However, to obtain high classification accuracy from the model, it is important to increase the amount of data so that different classes can be distinguished well from each other11.

K-nearest-neighbor (K-NN)

K-NN is a supervised machine learning algorithm that is widely used for both regression and classification. It is effective for data sets with linear or nonlinear relationships. It assumes that similar data points are close to each other. Consequently, K-NN classifies new data points based on their proximity to the most similar instances in the data set52. Three parameters are used: the number of neighbors required for classification, the distance metric, and the p value (the power parameter of the Minkowski distance)53.

Support vector machine (SVM)

SVM is a machine learning algorithm for regression and classification.
It has been widely used in the bioinformatics field for its effectiveness in handling high-dimensional data and its robustness when dealing with outliers54. This algorithm uses a hyperplane to classify future observations. The hyperplane can be represented as a line or a plane in multidimensional space, and it separates the data into the corresponding classes by maximizing the margin between the support vectors53.

In the context of SVM, consider a training data set containing \(n\) data points, denoted as \(\{(x_1, y_1), \ldots , (x_n, y_n)\}\), where each \(x_i\) is a feature vector in the input space associated with a binary output value \(y_i \in \{-1, +1\}\), for each \(i = 1, 2, \ldots , n\). The SVM optimization problem can then be expressed as follows (Eqs. 2–4):
$$\begin{aligned} \text {Objective: } \text {Minimize } \frac{1}{2} \Vert \beta \Vert ^2 + C \sum _{i=1}^{n} \xi _i \end{aligned}$$
(2)
$$\begin{aligned} \text {Subject to:} \quad&y_i (\langle x_i, \beta \rangle + \beta _0) \ge 1 - \xi _i, \quad i = 1, 2, \ldots , n \end{aligned}$$
(3)
$$\begin{aligned}&\xi _i \ge 0, \quad i = 1, 2, \ldots , n \end{aligned}$$
(4)
where \(C\) is a constant that penalizes errors, and \(\xi _i\) are slack variables that indicate the extent of the margin violation; if an instance is misclassified, then \(\xi _i > 1\).

Naïve Bayes (NB)

NB is a probabilistic supervised machine learning classification algorithm based on Bayes' theorem. It uses the conditional probability of the features given the value of the class variable. The algorithm determines the probability of each class from previously observed data, assuming that the features are independent53.

Artificial neural network (ANN)

ANN is a deep learning algorithm inspired by the structure and function of the human brain. It consists of interconnected layers, each consisting of neurons (as shown in Fig. 4). Each neuron incorporates an activation function55. The ANN in this study used three layers with the rectified linear unit (ReLU) activation function and was trained for 350 epochs.

Figure 4. ANN architecture with three inputs, one output and ReLU activation function.

Performance metrics

Accuracy

Accuracy is a performance metric that gives the proportion of predictions that were classified correctly. It can be expressed by Eq. (5).
$$\begin{aligned} \text {Accuracy} = \frac{\text {Number of Correct Predictions}}{\text {Total Number of Predictions}} \end{aligned}$$
(5)
Precision

Precision is the proportion of positive predictions that were correct, relative to the total number of positive predictions; it can be expressed by Eq. (6).
$$\begin{aligned} \text {Precision} = \frac{\text {True Positives (TP)}}{\text {True Positives (TP) + False Positives (FP)}} \end{aligned}$$
(6)
Recall

Recall is the ratio of correctly predicted positive outcomes to the total number of actual positive instances in a given class (Eq. 7)48.
$$\begin{aligned} \text {Recall} = \frac{\text {True Positives (TP)}}{\text {True Positives (TP) + False Negatives (FN)}} \end{aligned}$$
(7)
F1-score

The F1 score is often a more informative performance metric because it considers both recall and precision, particularly for imbalanced data (data with a nonuniform distribution of class labels). It is the harmonic mean of recall and precision53, as presented in Eq. (8).
$$\begin{aligned} \text {F1 score} = \frac{2 \times \text {Recall} \times \text {Precision}}{\text {Recall + Precision}} \end{aligned}$$
(8)
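As a worked illustration of the four metrics defined so far, the sketch below counts true and false positives and negatives for a binary problem and derives accuracy, precision, recall and the F1 score (the harmonic mean of precision and recall). The function name and the label vectors are made up for the example and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, so accuracy is 8/10 while precision and recall are both 3/4.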
Confusion matrix

The confusion matrix is a performance measurement tool that provides a breakdown of the predicted and actual outcomes of a classification model. It offers an in-depth understanding of the performance of any classification model, particularly in contexts where the implications of false positives and false negatives differ53.

Area under the ROC curve (AUC)

The area under the receiver operating characteristic curve (AUC-ROC) is a performance metric that quantifies the overall ability of a classification model to discriminate between classes. The ROC curve itself plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across various classification thresholds, providing a visual representation of the model’s performance. The AUC value, which represents the area under this curve, summarizes the discriminative ability of the model in a single number56. For multi-class classification models, the macro-average is employed as the performance metric. This approach calculates the average performance across all classes, treating each class with equal importance, which is particularly beneficial when dealing with imbalanced data sets.

Area under the precision–recall curve (AUC-PR)

The area under the precision–recall curve (AUC-PR) is an essential metric for assessing the performance of classifiers, especially in situations involving imbalanced data sets where the positive class is rare. In contrast to the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR), the precision–recall (PR) curve plots precision against recall. This focus makes the PR curve more informative for imbalanced data scenarios.
A higher AUC-PR value signifies superior performance, indicating the classifier’s ability to maintain both high precision and high recall across various thresholds57.

Statistical tests between different models

The Friedman test was used on the training data to evaluate whether accuracy differed significantly among the models and classifiers used in this study. The training data was split using tenfold cross-validation, and the resulting performance on the ten folds was analyzed using the Friedman test. The Friedman test58, a non-parametric statistical test, calculates a chi-square statistic and the corresponding p-value to determine whether the differences between the models are statistically significant. Subsequently, pairwise comparisons were made using the Conover post hoc test, implemented through the scikit_posthocs library, to assess significant differences between individual model pairs.

Class imbalance

In the development of prediction models within healthcare settings, mitigating class imbalance is crucial to ensure robust and impartial model performance. Class imbalance occurs when there is an uneven distribution of target classes in the data set, posing the risk that the model favors the majority class and overlooks significant minority classes. Accurate prediction or classification of depression is particularly important in clinical practice, influencing the determination of appropriate treatment strategies and optimal results.

Class weight

To this end, another approach is proposed to improve the accuracy of the predictive models and avoid the effect of class imbalance. This approach incorporates class weights: the algorithm is penalized for incorrect predictions, with a heavier penalty placed on misclassifying the minority class.
Each class is assigned a weight, with minority classes given larger weights (a higher misclassification penalty)59,60, as represented by Eq. (9) below:
$$\begin{aligned} W_j = \frac{n}{K \times n_j} \end{aligned}$$
(9)
where \(W_j\) is the weight of class \(j\), \(n\) is the total number of observations, \(K\) is the number of classes, and \(n_j\) is the number of observations in class \(j\).

Synthetic minority over-sampling technique (SMOTE)

SMOTE is a widely adopted method for addressing class imbalance in classification tasks. SMOTE tackles this issue by generating synthetic examples for the minority class rather than simply replicating existing instances. In doing so, SMOTE effectively increases the sample size of the minority class and promotes a more balanced class distribution. This synthetic augmentation helps mitigate the overfitting associated with random oversampling and improves the classifier’s ability to generalize61.
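Both remedies for class imbalance can be illustrated with a short, dependency-free sketch. The first function evaluates Eq. (9) directly. The second is a simplified SMOTE-style interpolation (pick a minority sample, pick one of its nearest minority neighbours, and place a new point at a random position between them) and is not a substitute for a library implementation such as the `SMOTE` class in imbalanced-learn. The function names, class counts and sample coordinates are assumptions made for the example.

```python
import math
import random
from collections import Counter


def balanced_class_weights(labels):
    """Eq. (9): W_j = n / (K * n_j) for every class j."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * n_j) for cls, n_j in counts.items()}


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest other minority samples by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical three-class label distribution: 80 / 15 / 5 observations.
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(labels)
print(weights)

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]]
synthetic = smote_sketch(minority, n_new=6, k=2)
print(synthetic)
```

With these counts the rarest class receives a weight of 100 / (3 × 5) ≈ 6.67 and the majority class 100 / (3 × 80) ≈ 0.42. This is the same formula that scikit-learn applies when `class_weight="balanced"` is requested.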
