Investigation of the risk factors associated with prediabetes in normal-weight Qatari adults: a cross-sectional study

Study population
We obtained cross-sectional clinical, anthropometric, and demographic data for 5,996 Qatari individuals aged 18 to 86 years (3,229 females and 2,771 males) from Qatar Biobank (QBB), a national institute that runs a well-phenotyped cohort and has been collecting data from the general population in Qatar since 2012¹⁸. Inclusion criteria were age over 18 years and HbA1c < 6.5%. Individuals with type 2 diabetes (HbA1c ≥ 6.5%) and pregnant women were excluded. The flowchart of data processing is shown in Fig. 1. The institutional review boards at the Qatar Biomedical Research Institute (IRB number: 2017–001) and QBB (IRB number: Ex-2018-Res-ACC-0123-0067) approved the current project. All participants gave written informed consent for their data and biospecimens to be used in medical research.

Fig. 1 Flowchart of data processing.

Anthropometric and clinical measures
The Qatar Biobank (QBB) provides 52 clinical measurements along with 9 additional measurements used to assess various aspects of health and physiology through medical tests and imaging, including grip strength, 12-lead ECG, ultrasound scan of the carotid arteries, Vicorder artery stiffness, retinal eye test, whole-body DXA scan, treadmill walking test, lung function, and MRI for eligible participants. For the purposes of this study we used only the 52 blood test measurements together with height, weight, body fat, blood pressure, and hip and waist measurements; consequently, data on 57 variables were requested. Due to space constraints, a comprehensive list of these variables is not provided here; for detailed descriptions, please refer to the QBB website: https://www.qatarbiobank.org.qa/participate/description-measurements.

BMI (kg/m2) was calculated as weight in kilograms (kg) divided by measured height in meters squared (m2). For variable categorization, well-accepted clinical guidelines were used when available. For BMI (in kg/m2), the Caucasian cut-offs were used, categorizing BMI into four groups: underweight (BMI < 18.5 kg/m2), normal (BMI 18.5–24.9 kg/m2), overweight (BMI 25–29.9 kg/m2), and obese (BMI ≥ 30 kg/m2). Specifically, we use two groups here: normal-weight (BMI 18.5–24.9 kg/m2) and overweight/obese (BMI ≥ 25 kg/m2), denoted NW and OWO, respectively.

Plasma samples of participants who had fasted for at least 6 h were handled according to a standard protocol within 2 h of blood collection. Fasting plasma glucose (FPG), HbA1c, triglycerides (TG), total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C) were analyzed with an automated biochemical analyzer at the central laboratories of Hamad Medical Corporation in Doha. PreD cases were defined as individuals with HbA1c between 39 mmol/mol (5.7%) and 47 mmol/mol (6.4%), whereas controls were those with HbA1c < 39 mmol/mol (5.7%).

Two further variables were derived: the homeostasis model assessment of insulin resistance (HOMA-IR) and the homeostasis model assessment of β-cell function (HOMA-B). With I0 the fasting insulin (µIU/mL) and G0 the fasting glucose (mmol/L), HOMA-IR was calculated as (I0 × G0)/22.5 and HOMA-B as ((20 × I0)/(G0 − 3.5))/100¹⁹.
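For concreteness, the following is a minimal R sketch of the derived variables described above, assuming a data frame `qbb` with columns `weight_kg`, `height_m`, `hba1c` (%), `insulin` (µIU/mL), and `glucose` (mmol/L); these column names are illustrative and not actual QBB field names.

```r
# Derived variables (illustrative column names, not QBB field names)
qbb$bmi <- qbb$weight_kg / qbb$height_m^2

# Two BMI groups used in this study: normal-weight (NW) vs overweight/obese (OWO)
qbb$bmi_group <- ifelse(qbb$bmi >= 18.5 & qbb$bmi < 25, "NW",
                        ifelse(qbb$bmi >= 25, "OWO", NA))

# HOMA indices from fasting insulin I0 (uIU/mL) and fasting glucose G0 (mmol/L)
qbb$homa_ir <- qbb$insulin * qbb$glucose / 22.5
qbb$homa_b  <- (20 * qbb$insulin) / (qbb$glucose - 3.5) / 100

# Exclude type 2 diabetes (HbA1c >= 6.5%) and label PreD cases vs controls
qbb <- qbb[qbb$hba1c < 6.5, ]
qbb$pred <- ifelse(qbb$hba1c >= 5.7, 1L, 0L)   # 5.7-6.4% = case, < 5.7% = control
```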
Training and validation populations
A 65/35 split was applied to the 1,160 samples. For training the machine learning (ML) models, we used a case–control design that included 109 cases and 645 healthy controls. To validate the models developed in the training stage, we used data from 59 cases and 347 healthy controls (see Fig. 1).

Statistical analysis
All statistical analyses were carried out using R version 3.32.1.1, with the R package "h2o" (version 3.17.0.4195) used to build the logistic regression and the other machine learning (ML) models. Variables with > 20% missing values were excluded. The unsaturated iron-binding capacity (UIBC) variable, although missing 21% of values, was kept because of its importance. All remaining variables, which had < 20% missing values, were imputed using the MICE package in R. Descriptive statistics were used to describe the baseline characteristics of participants. Continuous variables were expressed as means ± standard deviation (SD). The independent Student's t-test was used to compare means, whereas the χ2-test was used to compare proportions and to test the dependence between the prevalence of PreD and the different factors. Statistical significance for all tests was set at p < 0.05 (Tables 1 and 2).

Machine learning models
In this section, we employ a variety of machine learning algorithms, including deep learning (DL), gradient boosting machine (GBM), random forest (RF), and generalized linear models (GLM). As a baseline, we also use a logistic regression model (LR) because of its simplicity and ease of implementation, which makes it accessible to researchers with limited machine learning experience who wish to build their own prediction systems. The other machine learning models, in contrast, excel at capturing complex, non-linear relationships in the data, making them highly effective for nuanced pattern recognition. The package "h2o" (version 3.32.1.1)²⁰ was used for building the machine learning (ML) models.

Random forest
Random forest (RF) belongs to the class of ensemble-based supervised learning techniques. The random forest algorithm applies the general technique of bagging, or bootstrap aggregating, to decision tree learners. This bootstrapping procedure yields better model performance because it decreases the variance of the model without increasing its bias. Although each tree is a weak learner and sensitive to noise within its respective data, the average/majority vote of many trees is not, as long as the trees are not correlated; the bootstrap sampling de-correlates the trees by showing them different parts of the dataset. Random forests automatically rank the importance of variables in a classification problem by considering the average information gain corresponding to each variable across all trees. We used the R package caret to generate the random forest models²¹.

Gradient boosting machine
We used the gradient boosting machine (GBM), another ensemble technique, for building a predictive model. The principal idea behind this algorithm is to construct new base learners that are maximally correlated with the negative gradient of the loss function associated with the whole ensemble. We used the R package caret for building the GBM predictive model²¹.
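The manuscript cites both h2o and caret for model building; the following hedged sketch uses the h2o interface throughout for brevity. It assumes a data frame `qbb_model` holding the 57 candidate predictors plus the PreD label `pred` from the earlier sketch; the imputation settings, split, hyperparameters, and seeds are illustrative assumptions, not the published configuration.

```r
library(mice)
library(h2o)

# Impute the remaining missing values (< 20% missingness) with MICE
imp <- mice(qbb_model, m = 5, seed = 42)   # default methods, e.g. predictive mean matching
qbb_complete <- complete(imp)              # first completed data set

# Start a local h2o cluster and move the data into it
h2o.init()
hf <- as.h2o(qbb_complete)
hf$pred <- as.factor(hf$pred)              # binary PreD outcome

# 65/35 training/validation split
splits <- h2o.splitFrame(hf, ratios = 0.65, seed = 42)
train <- splits[[1]]
valid <- splits[[2]]

y <- "pred"
x <- setdiff(colnames(hf), y)

# Logistic regression baseline (binomial GLM)
glm_fit <- h2o.glm(x = x, y = y, training_frame = train, family = "binomial")

# Random forest and gradient boosting machine
rf_fit  <- h2o.randomForest(x = x, y = y, training_frame = train, ntrees = 500, seed = 42)
gbm_fit <- h2o.gbm(x = x, y = y, training_frame = train, ntrees = 200, learn_rate = 0.05, seed = 42)

# Validation AUC for each model
sapply(list(glm = glm_fit, rf = rf_fit, gbm = gbm_fit),
       function(m) h2o.auc(h2o.performance(m, newdata = valid)))
```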
Deep learning
Deep learning (DL) is a more complex and less interpretable machine learning technique, vaguely inspired by the information processing and communication patterns of biological nervous systems. Of late, deep learning based models have been successfully applied in computer vision, natural language processing, bioinformatics, and other fields. The problem of PreD identification is a classification problem: in the case of deep learning, we learn a non-linear mapping function t : x_i → y_i that takes as input the feature set x_i for a given sample and outputs a score y_i ∈ [0, 1]. In this work, t is a deep fully connected feed-forward neural network (DNN) that exploits the non-linear interactions between the input features to make its prediction. A feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function under certain mild assumptions on the activation function.
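A minimal sketch of such a fully connected feed-forward network with h2o.deeplearning(), reusing `x`, `y`, `train`, and `valid` from the previous sketch; the two-hidden-layer architecture, activation, and number of epochs are illustrative assumptions, not the authors' published configuration.

```r
# Fully connected feed-forward network (illustrative configuration)
dl_fit <- h2o.deeplearning(
  x = x, y = y,
  training_frame   = train,
  validation_frame = valid,
  activation = "Rectifier",   # non-linear activation in the hidden layers
  hidden     = c(64, 32),     # two hidden layers of 64 and 32 neurons
  epochs     = 50,
  seed       = 42
)

# Discrimination on the validation set
h2o.auc(h2o.performance(dl_fit, newdata = valid))
```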
Performance measure
Results are presented as odds ratios (OR) with associated 95% confidence intervals (CI) per 1-SD increase of the independent variables. The predictive value of each index for PreD was determined by the area under the curve (AUC) in receiver operating characteristic (ROC) curve analyses. The cut-off point was selected according to the Youden index (sensitivity + specificity − 1). Statistical significance was set at p < 0.05. To compare the performance of the machine learning models and logistic regression, we focused exclusively on the ROC curve and AUC.

Sensitivity (true positive rate)
Sensitivity is the proportion of actual positive cases that are correctly identified by the classifier. It is calculated as:

$$sensitivity=\frac{TP}{TP+FN}$$

Specificity (true negative rate)
Specificity is the proportion of actual negative cases that are correctly identified by the classifier. It is calculated as:

$$specificity=\frac{TN}{TN+FP}$$

where TP is the number of true positives, TN the number of true negatives, FN the number of false negatives, and FP the number of false positives.
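As a worked example, a small R helper (illustrative, not part of the published pipeline) that computes both rates from 0/1 vectors of observed labels and predicted classes:

```r
# Sensitivity and specificity from observed (0/1) labels and predicted (0/1) classes
confusion_rates <- function(truth, predicted) {
  tp <- sum(truth == 1 & predicted == 1)   # true positives
  tn <- sum(truth == 0 & predicted == 0)   # true negatives
  fp <- sum(truth == 0 & predicted == 1)   # false positives
  fn <- sum(truth == 1 & predicted == 0)   # false negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Example: classify at a 0.5 probability threshold
# confusion_rates(truth, as.integer(probs >= 0.5))
```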
ROC curve
To plot an ROC curve, we calculate the true positive rate (sensitivity) and the false positive rate (1-specificity) at various threshold settings. Then, we plot sensitivity on the y-axis against 1-specificity on the x-axis for each threshold setting. This gives us a curve that shows how sensitivity and specificity change with different threshold values.
AUC is then calculated by measuring the area under the ROC curve. A perfect classifier has an AUC close to 1, while a completely random classifier has an AUC close to 0.5. These metrics are particularly suitable for risk score prediction as they provide a comprehensive evaluation of the model's ability to discriminate between different risk levels, independent of any specific threshold. The ROC curve illustrates the trade-off between sensitivity and specificity, while the AUC quantifies the overall performance across all possible thresholds, ensuring a robust assessment of the model's predictive capabilities in identifying prediabetes risk.
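A hedged sketch of this evaluation using the pROC package (the manuscript does not name the package used for the ROC analyses, so the choice of pROC and the object names `truth` and `probs` are assumptions):

```r
library(pROC)

# `truth` = observed 0/1 PreD labels on the validation set,
# `probs` = predicted probabilities from one of the fitted models
roc_obj <- roc(response = truth, predictor = probs)

auc(roc_obj)      # area under the ROC curve
plot(roc_obj)     # sensitivity against 1 - specificity across all thresholds

# Cut-off maximizing the Youden index (sensitivity + specificity - 1)
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```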
