Enhancing severe hypoglycemia prediction in type 2 diabetes mellitus through multi-view co-training machine learning model for imbalanced dataset

Objectives

The primary objective of this study is to develop a robust ML framework to accurately predict SH events in patients with T2DM. By applying a multi-view co-training approach to an imbalanced dataset, we aim to improve prediction accuracy by combining SSL and SL methods, leveraging both labeled and unlabeled data. Additionally, we seek to identify and use the most effective clinical and demographic features for predicting SH events, thus providing explainable artificial intelligence (XAI) insights and offering transparency and interpretability in the model’s predictions. The study also compares the performance of the proposed multi-view co-training model with conventional single-view co-training models and other existing models, highlighting improvements in specificity, sensitivity, and overall accuracy. Ultimately, this research offers practical guidelines for clinicians on choosing between models according to whether sensitivity or specificity is the priority in diagnosing SH, contributing to better patient management and early intervention strategies.

Dataset description

Study population

The design and outcomes of the ACCORD trial have been published previously37,38. Data from 10,251 enrolled participants with clinical diagnoses of T2DM were collected through the ACCORD study. Participants were mostly middle-aged and elderly patients, ranging in age from 40 to 82 years, with a mean age of 62.2 years and a mean diabetes duration of 10 years. Of the total participants, the majority were white (64.8%) and male (61.4%). In our study, after performing missing data imputation, we proceeded with the analysis using a total of 10,244 observations. Table 4 in the supplementary material (SM) displays the means ± standard deviations and percentages (%) for the selected variables in the ACCORD dataset.

Outcome and predictors

We determined the response variable of the ACCORD dataset according to Fig. 2A. In Fig. 2A, “Glucoselt50” is assigned a value of 1 if the blood glucose level is below 50 mg/dl, 2 if it is above 50 mg/dl, and 3 if no information is available. “Medical Assist” is assigned 1 if medical assistance is required and 0 if it is not. “Hospital Admit” is assigned a value of 1 if hospital admission is required, 2 if it is not required, and 3 if there is no information. Finally, the outcome is assigned as SH, non-SH, or Unknown based on the information provided by the patients. Patients with a blood glucose level below 50 mg/dl who required either medical assistance or hospital admission were assigned an SH event. Patients whose blood glucose was higher than 50 mg/dl and who required neither medical assistance nor hospital admission were assigned a value of 0, indicating a non-SH event. Patients for whom the blood glucose information was unknown, or for whom medical assistance was not required and the hospital admission status was unknown, were assigned as Unknown. The ethics committee of Beth Israel Deaconess Medical Center determined that this study was exempt from review and approval. To create predictors, we first, as in Ma et al.’s study39, created all 116 candidate risk features listed in SM Table 3. Then, as represented in Fig. 1B, we created the 116 relevant risk factors in the ACCORD dataset and selected the top 12 risk factors. The medical estimators of the ACCORD dataset were chosen as follows: hemoglobin A1c (HbA1c), fasting plasma glucose (FPG), general health check (g1check), diabetes education (g1diabed), nutritional education (g1nutrit), sulfonylurea, meglitinide, NPH or L (NPHL) insulin, regular insulin (Reg Insulin), long-acting insulin (La Insulin), other bolus insulin (Othbol Insulin), and premixed insulin.
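The outcome-assignment rules above can be written as a small decision function. This is an illustrative sketch (the function name is our own), using the integer codes exactly as defined for “Glucoselt50”, “Medical Assist”, and “Hospital Admit”:

```python
def label_outcome(glucose_lt50, medical_assist, hospital_admit):
    """Assign SH / non-SH / Unknown following the Fig. 2A rules.

    Codes as in the paper: glucose_lt50 (1 = below 50 mg/dl, 2 = above 50,
    3 = no information); medical_assist (1 = required, 0 = not required);
    hospital_admit (1 = required, 2 = not required, 3 = no information).
    """
    # SH: glucose below 50 and either medical assistance or hospital admission
    if glucose_lt50 == 1 and (medical_assist == 1 or hospital_admit == 1):
        return "SH"
    # non-SH: glucose above 50, no assistance required, no admission required
    if glucose_lt50 == 2 and medical_assist == 0 and hospital_admit == 2:
        return "non-SH"
    # everything else (missing glucose, unknown admission status, ...)
    return "Unknown"
```

For example, `label_outcome(1, 0, 1)` yields `"SH"` (low glucose plus hospital admission), while `label_outcome(3, 0, 3)` yields `"Unknown"`.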
As some of the features are longitudinal, we computed the means and standard deviations of the observations, which resulted in the following 17 variables: hba1c mean, hba1c std, fpg mean, fpg std, g1check mean, g1check std, g1diabed mean, g1diabed std, g1nutrit mean, g1nutrit std, sulfonylurea mean, meglitinide mean, nphl mean, reg insulin mean, la insulin mean, othbol insulin mean, and premix insulin mean (see SM Table 9). Additionally, the ACCORD dataset participants were followed for approximately 4 to 8 years15. We decided to work on the 2-year prediction because the least imbalanced rate is seen in the first year.

Unlabeled dataset

The ACCORD dataset contains 9068 unlabeled and 1176 labeled observations. We first worked with the labeled dataset alone but could not obtain significant results, so we subsequently included the unlabeled data in our analysis.

Views

In our study, we propose a multi-view co-training ML model as SSL. To begin the analysis, we started with three different views: glycemic variables (View 1): FPG, HbA1c; glycemic management and medications (View 2): g1check, g1diabed, g1nutrit, sulfonylurea, meglitinide, NPHL insulin, reg insulin, la insulin, othbol insulin, and premix insulin; demographic and clinical characteristics (View 3): years of diabetes, living alone, education level, body mass index (BMI), participant waist circumference (cm), race, age, and gender. We examined and compared these three views and ranked View 1 and View 2 as more effective.
Therefore, we generated two views for classification: glycemic variables (View 1) and glycemic management and medications (View 2).

Missing data imputation

In this study, we applied two different methods to handle missing data: last-observation-carried-forward (LOCF)40,41 for the time-series observations and median imputation for the non-time-series observations.

Feature selection and model validation

A feature selection algorithm identifies the most relevant variables in the input data and reduces it to a lower-dimensional dataset. Feature selection methods for classification tasks can be categorized into two groups42: expert knowledge-based feature selection methods, and automatic feature selection methods such as filter, wrapper, and embedded algorithms. In particular, we utilized the Boruta, MRMR, and LASSO methods as automatic feature selection algorithms. In the MRMR method, researchers must define the number of features to be selected in advance, so we needed to determine how many features the MRMR method should select. To do this, we let MRMR select from 1 to 17 features and calculated the AUC value for each count; the highest AUC was obtained with four features, as shown in SM Fig. 2. Furthermore, we not only evaluated the individual performance of these three feature selection methods but also considered the features selected by all of them as effective features. We incorporated the “consensus and majority vote feature selection” rule43 into our analysis, in which a feature is considered important if it is selected by all of the base feature selection methods in agreement.
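The two imputation strategies can be sketched as follows. The study was implemented in R; this is an illustrative Python version with hypothetical helper names, using `None` to mark missing values:

```python
from statistics import median

def locf(series):
    """Last-observation-carried-forward for one time-ordered series.

    Each missing value is replaced by the most recent observed value;
    values before the first observation stay missing.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def median_impute(values):
    """Replace missing entries of a non-time-series variable by its median."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]
```

For example, `locf([7.1, None, None, 6.8])` gives `[7.1, 7.1, 7.1, 6.8]`, and `median_impute([1.0, None, 3.0])` gives `[1.0, 2.0, 3.0]`.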
We have provided an explanation of the feature selection algorithms in the “Feature selection methods” section of the SM and listed all the expert knowledge-based selected features in Table 1.

Table 1. ACCORD dataset MD features selected by feature selection algorithms, including mean and standard deviation. Bold features are effectively selected by the consensus and majority vote rule.

Proposed methodology

Classification pipeline

All classifiers in this study were created using the caret package44 in the R programming language, version 4.1.3. We started by employing conventional ML algorithms on the entire labeled dataset. Throughout the study, the dataset is split into two sets (see Fig. 1A): a 20% sample of the data is used to test the classifier’s performance, while the remaining 80% is used to train the classifier. We built models using the training dataset and tested their performance with 5-fold cross-validation. We first assessed the performance of several distinct classifiers on the ACCORD dataset by calculating the classification accuracy. Specifically, we used only conventional machine learning methods, including logistic regression (LR), XGBoost, NB, support vector machine (SVM), and random forest (RF). We then further evaluated the performance of single-view and multi-view co-training models with the NB and RF models. Naive Bayes classifier: a classification algorithm that operates on Bayes’ theorem and involves probabilities. Random forest: an ensemble classification or regression method that uses decision tree algorithms45.

Single-view co-training model

As shown in Fig. 1C, the labeled data is split into training and test sets, and the model is initially trained (Step 1). Afterward, the trained model is used to estimate the unlabeled data (Step 2), and the most confident pseudo-labels, those with probability \(\Theta\) higher than 0.90, are selected. In the next step (Step 3), the pseudo-labeled data and the labeled data are concatenated.
The model then makes predictions on unseen test data (Step 4), and finally the results are evaluated (Step 5). Steps 1 through 5 are repeated until no new unlabeled data can be added. We also tested the heterogeneity of the data by applying cross-validation. We show the mean accuracy metrics of the test results for each iteration of single-view co-training with NB using MD features in SM Fig. 9.

Multi-view co-training model

Blum46 introduced the co-training algorithm, an SSL algorithm, and numerous studies have been conducted on this topic46,47,49. The multi-view co-training method uses both views in tandem to supplement a much smaller number of labeled examples with unlabeled data. Blum46 first defined the labeled set (L), the unlabeled set (U), and unlabeled pools (\(U’\)) (one created for each of View 1 and View 2), and set the number of iterations k. The input space is divided as \(X = X_1 \times X_2\), so that \(X_1\) and \(X_2\) correspond to two distinct, sufficient, and redundant views (View 1 and View 2) of X, and classifiers \(h_1\) and \(h_2\) are trained on their respective views using the labeled data (L). The co-training method then lets \(h_1\) label the p positive and n negative most confident examples from the unlabeled pool \(U’\) (for View 2) as pseudo-labels, and likewise lets \(h_2\) label the p positive and n negative most confident examples from the unlabeled pool \(U’\) (for View 1), which prevents over-training. Finally, the algorithm adds these confident labels to L and deletes them from \(U’\) (see Fig. 1D). Thus, the multi-view co-training method allows learning from a few labeled data together with unlabeled data.

Combining the views

Multi-view co-training is typically used to generate a larger labeled dataset rather than to perform classification directly. To use it as a classification tool, it is necessary to combine the views generated at the end of the iterations within the multi-view co-training.
There are many methods to combine the views, but we prefer the naive AND and OR rules to combine the final predictions coming from View 1 and View 2. The AND rule assigns the result as 1 if the results for the ith observation from both views are 1. The OR rule assigns the result as 1 if at least one of the results for the ith observation from the two views is 1.

Multi-view co-training algorithm steps

The proposed algorithm is a multi-view co-training machine learning model designed to predict SH in patients with T2DM. This approach leverages both labeled and unlabeled data, integrating SSL and SL techniques to enhance prediction accuracy, especially in imbalanced datasets. We show the algorithm steps below:
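The AND and OR fusion rules are straightforward element-wise operations on the two views’ binary predictions. A minimal sketch (the function name is our own):

```python
def combine_views(pred_view1, pred_view2, rule="AND"):
    """Fuse binary predictions from two co-trained views.

    AND: predict the positive class (1) only when both views predict 1.
    OR:  predict the positive class (1) when at least one view predicts 1.
    """
    if rule == "AND":
        return [int(a == 1 and b == 1) for a, b in zip(pred_view1, pred_view2)]
    if rule == "OR":
        return [int(a == 1 or b == 1) for a, b in zip(pred_view1, pred_view2)]
    raise ValueError("rule must be 'AND' or 'OR'")

v1 = [1, 0, 1, 0]  # predictions from View 1
v2 = [1, 1, 0, 0]  # predictions from View 2
print(combine_views(v1, v2, "AND"))  # -> [1, 0, 0, 0]
print(combine_views(v1, v2, "OR"))   # -> [1, 1, 1, 0]
```

The choice between the rules trades off specificity against sensitivity: AND demands agreement and so yields fewer positive calls, while OR flags a positive if either view does.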

Data preprocessing: Apply preprocessing steps as explained in the “Method” section.

Feature selection methods: Apply multiple feature selection techniques.

MD (Medical Selection Criteria): Based on clinical relevance and expert knowledge.

LASSO (Least Absolute Shrinkage and Selection Operator): Reduces the dimensionality by penalizing the absolute size of coefficients.

Boruta: A wrapper algorithm that selects all relevant features by comparing the importance of real features to shadow features.

MRMR (Minimum Redundancy Maximum Relevance): Selects features that are most relevant to the target variable while ensuring minimal redundancy among them.
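The consensus and majority vote rule over the base selectors can be sketched as a simple set aggregation. This is an illustrative Python version (the feature names in the example are placeholders, not the study’s actual selections):

```python
def consensus_and_majority(selected_by_method):
    """Aggregate the feature sets chosen by several base selectors.

    consensus: features selected by every method in agreement;
    majority:  features selected by more than half of the methods.
    """
    methods = [set(feats) for feats in selected_by_method.values()]
    all_feats = set().union(*methods)
    consensus = set.intersection(*methods)
    majority = {f for f in all_feats
                if sum(f in m for m in methods) > len(methods) / 2}
    return consensus, majority

picks = {  # illustrative selections only
    "Boruta": {"hba1c_mean", "fpg_mean", "sulfonylurea_mean"},
    "MRMR":   {"hba1c_mean", "fpg_mean", "fpg_std"},
    "LASSO":  {"hba1c_mean", "fpg_mean", "premix_insulin_mean"},
}
consensus, majority = consensus_and_majority(picks)
# here both rules agree on {"hba1c_mean", "fpg_mean"}
```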

Algorithm:

Input Dataset: ACCORD EHRs dataset.

Views: Create two distinct views for co-training: glycemic variables (View 1) and glycemic management and medications (View 2).

Multi-View Co-Training Procedure:

* Initialization:

Split the labeled dataset into training and test sets (80% training, 20% testing).

Initialize two models, one for each view.

* Iteration Process:

Train each model on its respective view using the labeled data.

Use each trained model to label the most confident unlabeled data points (pseudo-labeling).

Add these pseudo-labeled points to the training set.

Repeat the process until the stopping constraints are reached.

* Combination of Views: Fuse the final predictions from the two views using the AND or OR rule.

Model Evaluation:

Assess model performance using metrics such as Accuracy, Specificity, Sensitivity, Positive Predictive Value (PPV), and Negative Predictive Value (NPV).

Perform cross-validation to ensure robustness and avoid overfitting.
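The iteration process above can be sketched in code. The study’s models were built with the caret package in R; the following is a simplified Python illustration with a toy one-feature classifier, and it omits the per-round p positive / n negative selection limits and the replenished pool \(U’\) of the full Blum algorithm:

```python
class MeanThreshold:
    """Toy one-feature classifier: P(y=1) grows as x nears the class-1 mean."""
    def fit(self, X, y):
        self.m1 = sum(x for x, t in zip(X, y) if t == 1) / max(sum(y), 1)
        n0 = len(y) - sum(y)
        self.m0 = sum(x for x, t in zip(X, y) if t == 0) / max(n0, 1)
    def predict_proba(self, x):
        d0, d1 = abs(x - self.m0), abs(x - self.m1)
        return d0 / (d0 + d1) if d0 + d1 else 0.5

def co_train(clf1, clf2, X1, X2, y, U1, U2, theta=0.90, rounds=5):
    """Simplified multi-view co-training loop.

    Each round, both view models are (re)trained on the labeled pool; any
    unlabeled point that either model classifies with confidence > theta
    is pseudo-labeled and moved (with both of its views) into the pool.
    """
    X1, X2, y = list(X1), list(X2), list(y)
    pool = list(range(len(U1)))
    for _ in range(rounds):
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        taken = []
        for i in pool:
            for clf, view in ((clf1, U1), (clf2, U2)):
                p = clf.predict_proba(view[i])
                if max(p, 1 - p) > theta:
                    X1.append(U1[i]); X2.append(U2[i]); y.append(int(p >= 0.5))
                    taken.append(i)
                    break
        if not taken:
            break  # no confident pseudo-labels left to add
        pool = [i for i in pool if i not in taken]
    return clf1, clf2, y

# Tiny worked example: two labeled points per view, two unlabeled points
clf1, clf2, y_aug = co_train(MeanThreshold(), MeanThreshold(),
                             X1=[0.0, 1.0], X2=[0.0, 1.0], y=[0, 1],
                             U1=[0.05, 0.95], U2=[0.1, 0.9])
```

In this toy run both unlabeled points are confidently pseudo-labeled in the first round, so the labeled pool grows from two to four examples before the loop terminates.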
