Predicting non-responders to lifestyle intervention in prediabetes: a machine learning approach

Sources of data and participants

The database used to develop the predictive algorithm was that of the International Center for the Assessment of Nutritional Status (ICANS, University of Milan, Milan, Italy), which contains data from a large ongoing open-cohort nutritional study. As part of the study protocol, all patients receive a full nutritional assessment at baseline, are prescribed a lifestyle intervention and, when indicated, also a pharmacological intervention, and a follow-up examination is scheduled. At follow-up, a more limited set of parameters is routinely collected to evaluate changes in weight, body composition, and laboratory exams. For the development of the algorithm, all patients with prediabetes enrolled between 2009 and the beginning of 2019 were included. The complete database includes 18,973 baseline observations and a total of 45,148 follow-up observations. In this study we included a total of 59 variables from the database.

Patients included in this study were self-referred patients seeking a weight-loss program, mainly resident in Milan or nearby cities, with a new or recent diagnosis of prediabetes. Eligibility criteria were: age ≥18 years; not pregnant and not nursing; no condition severely limiting movement and physical activity; no severe cardiovascular, neurological, endocrine, or psychiatric disorder; prescribed a lifestyle intervention only. The lifestyle intervention consisted of a hypocaloric omnivorous diet with a Mediterranean pattern, with macro- and micronutrient levels set according to the Italian recommended daily allowances [5]; physical activity recommendations were also provided according to the WHO physical activity guidelines [6].

The study complied with the principles established by the Declaration of Helsinki, and written informed consent was obtained from each subject. The ethics committee of the University of Milan (n. 6/2019) approved the study procedures.

Outcome and predictors

The outcome was normalization of glycemia within 1 year of starting the lifestyle intervention (dichotomous: fasting glucose <100 mg/dL); a minimal coding sketch follows the predictor list below. A total of 59 predictor variables were used in the analysis:

demographic data: age, sex, education, occupation, marital status

anthropometry: height, weight, arm length, arm circumference, wrist circumference, waist circumference, biceps skinfold, triceps skinfold, subscapular skinfold, suprailiac skinfold, arm muscle area, arm fat area, body density, fat mass, fat free mass

bioimpedance analysis: intracellular water, extracellular water

abdominal ultrasound: sternum subcutaneous adipose tissue, sternum visceral adipose tissue, abdomen subcutaneous adipose tissue, abdomen visceral adipose tissue

indirect calorimetry: oxygen consumption, carbon dioxide production, respiratory quotient, resting energy expenditure

medical history: family status, menstruation, pregnancies, diet status, diet history, physical activity, smoking, pharmacological treatments, clinical signs, weight history

vital signs: heart rate, systolic pressure, diastolic pressure

blood and urine exams: white blood cell count, red blood cell count, hemoglobin, mean corpuscular volume, glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, glutamic-pyruvic transaminase, glutamic-oxaloacetic transaminase, gamma-glutamyl transferase, thyroid stimulating hormone, creatinine, uric acid, urea
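As referenced above, the outcome is a simple dichotomization of follow-up fasting glucose. The snippet below is a minimal sketch of that coding, assuming a hypothetical table `follow_up` and an invented column `fasting_glucose_fu` (mg/dL); the actual ICANS database schema is not described in the paper.

```r
library(dplyr)

# Hypothetical follow-up records; table and column names are invented.
follow_up <- tibble(
  id = 1:4,
  fasting_glucose_fu = c(92, 104, 98, 110)  # mg/dL, within 1 year
)

follow_up <- follow_up %>%
  mutate(
    # Dichotomous outcome: fasting glucose normalized (< 100 mg/dL)
    response = factor(
      if_else(fasting_glucose_fu < 100, "normalized", "not_normalized"),
      levels = c("normalized", "not_normalized")
    )
  )
```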

Statistical and machine learning analysis methods

All patients eligible at the time of the study were included, and this determined the sample size (no a priori sample size calculations were made). For algorithms requiring complete data, we imputed missing data in the preprocessing phase using k-nearest neighbors imputation (Gower's distance, number of neighbors = 5).

Maximum predictive strength was sought by optimizing the correct classification fraction (CCF) and the area under the receiver operating characteristic curve (AUROC). Between accuracy and discrimination ability, accuracy was selected as the most relevant metric in the clinical setting (i.e., maximization of the CCF).

Several statistical and machine learning models were compared using 10-fold cross-validation resampling. For models requiring tuning parameters, a grid of several combinations of tuning parameters was tested via 10-fold cross-validation. Prior to model selection, per-model preprocessing steps were defined in order to guarantee the best predictive ability for each specific model. To capture uncertainty arising from non-deterministic data manipulation, all preprocessing steps were repeated in each cross-validation fold.

Principal component analysis (PCA) was employed as an optional preprocessing step aimed at reducing the dimensionality of the dataset. In these cases, PCA was used to transform the set of predictors into a smaller number of predictors designed to capture the maximum amount of information in the original variables. A potential benefit of this approach, beyond dimensionality reduction, is the production of statistically independent predictors, which can ameliorate the problem of inter-variable correlation in the dataset.

The following models were evaluated (a preprocessing and tuning sketch follows the list):

logistic regression

linear discriminant analysis

quadratic discriminant analysis

naive Bayes, tuned for kernel smoothness and Laplace correction

k-nearest neighbors, tuned for the number of nearest neighbors, the distance weighting function, and the Minkowski distance order

ridge regression and LASSO, tuned for the amount of regularization and the proportion of LASSO penalty

decision trees, tuned for tree depth, minimal node size, and cost-complexity parameter

bagged trees, tuned for the cost-complexity parameter used by CART models, the maximum tree depth, the minimum number of data points in a node required for the node to be split further, and the cost assigned to the class corresponding to the first factor level

random forest, tuned for the number of randomly selected predictors, the number of trees, and the minimal node size

boosted trees, tuned for the tree depth, the number of trees, the learning rate, the number of randomly selected predictors, the minimal node size, the minimum loss reduction, the proportion of observations sampled, and the number of iterations before stopping

linear support vector machine, tuned for cost and insensitivity margin

single layer neural network, tuned for the number of hidden units, the amount of regularization, and the number of epochs
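As a concrete illustration of the workflow described above (k-NN imputation, optional PCA, 10-fold cross-validation, and grid tuning), here is a minimal tidymodels sketch using the random forest as the example model. This is a sketch of the general approach, not the authors' actual code: the data frame, column names, injected missingness, and grid size are invented; the normalization step is an addition (PCA is scale-sensitive, and the paper does not list the exact per-model steps); and only two of the three random forest tuning parameters are tuned, for brevity.

```r
library(tidymodels)

# Toy stand-in for the ICANS baseline table (the real study used 59
# predictors); values and missingness below are purely illustrative.
set.seed(1)
train_data <- tibble(
  response = factor(sample(c("normalized", "not_normalized"), 200, replace = TRUE)),
  glucose  = rnorm(200, 100, 10),
  weight   = rnorm(200, 80, 15),
  waist    = rnorm(200, 95, 12),
  age      = rnorm(200, 50, 12)
)
train_data$weight[sample(200, 10)] <- NA  # simulate missing data

# Per-model preprocessing: k-NN imputation (recipes::step_impute_knn
# uses Gower's distance) with 5 neighbors, then optional PCA.
preproc <- recipe(response ~ ., data = train_data) %>%
  step_impute_knn(all_predictors(), neighbors = 5) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.9)

# One candidate model as an example: a random forest, here tuned only
# for the number of trees and the minimal node size.
rf_spec <- rand_forest(trees = tune(), min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

wf <- workflow() %>% add_recipe(preproc) %>% add_model(rf_spec)

# 10-fold cross-validation; preprocessing is re-estimated inside each
# fold, as in the paper, so imputation/PCA uncertainty is captured.
folds <- vfold_cv(train_data, v = 10)

rf_res <- tune_grid(
  wf,
  resamples = folds,
  grid = 10,
  metrics = metric_set(accuracy, roc_auc)  # CCF and AUROC
)

# The paper selects on accuracy (the correct classification fraction).
select_best(rf_res, metric = "accuracy")
```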

Sensitivity, specificity, positive predictive value, and negative predictive value were calculated for the best model as: sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); positive predictive value = TP/(TP + FP); negative predictive value = TN/(TN + FN); where FN denotes false negatives, FP false positives, TN true negatives, and TP true positives.

All statistical analyses were performed with R 4.1.1 [7]. Model preprocessing, tuning, resampling, and fitting were performed with the tidymodels package for R (for algorithm-specific packages see the appendix).
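As a worked example of these four formulas, the snippet below computes them from an invented confusion matrix; in the tidymodels ecosystem the equivalent yardstick functions are sens(), spec(), ppv(), and npv().

```r
# Invented confusion-matrix counts (not from the study):
TP <- 60; FN <- 15; TN <- 90; FP <- 20

sensitivity <- TP / (TP + FN)  # 60/75  = 0.800
specificity <- TN / (TN + FP)  # 90/110 ≈ 0.818
ppv         <- TP / (TP + FP)  # 60/80  = 0.750
npv         <- TN / (TN + FN)  # 90/105 ≈ 0.857
```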
