Key findings

Comparison of LR model and machine learning models in the analysis of DR-related factors

Based on four machine learning models, we identified six key factors related to DR: ALB_CR (Urinary albumin/creatinine ratio), UPR_24 (Twenty-Four Hours Urinary Protein), CP (C-peptide), HBA1C (Glycated hemoglobin A1c), BU (Blood urea), and SCR (Serum creatinine). The LR model found 27 related factors: HBA1C (Glycated hemoglobin A1c), NEPHROPATHY (Nephropathy), BP_HIGH (Systolic blood pressure), CEREBRAL_APOPLEXTY (Carotid artery stenosis), CHD (Coronary heart disease), ENDOCRINE_DISEASE (Endocrine disease), GLU_2H (2-h postprandial blood glucose), BU (Blood urea), HB (Hemoglobin), ALB_CR (Urinary albumin/creatinine ratio), AGE (Age), HYPERLIPIDEMIA (Hyperlipidemia), RENAL_FALIURE (Renal failure), LEADDP (Lower extremity arterial disease), MEN (Men), DIGESTIVE_CARCINOMA (Digestive carcinoma), GYNECOLGICAL_TUMOR (Gynecological tumor), LUNG_TUMOR (Lung tumor), OTHER_TUMOR (Other tumor), GLU (Glucose), UPR_24 (Twenty-Four Hours Urinary Protein), CP (C-peptide), PCV (Packed cell volume), ESR (Erythrocyte sedimentation rate), PT (Prothrombin time), and CRP (C-reactive protein). Of these 27 factors, five were also found by the machine learning models.

These results show that machine learning models such as GBDT can automatically learn from a large number of features and identify the most important factors, yielding a short list that is easier for doctors and researchers to attend to and interpret.
The LR model identified 27 disease-related factors, some of which may be redundant or less important. It did not isolate the most critical factors as accurately as the GBDT, but it provides a more comprehensive picture of disease-related factors.

Although machine learning models such as GBDT often outperform LR statistical models in prediction and can identify key factors, their internal working mechanism is complex and not as intuitive or easy to understand as that of LR models. This paper therefore adopts the SHAP feature importance analysis method to understand which features contribute more to the prediction results and how they affect them, which increases the interpretability of the GBDT.

Significant contributions in model development

Presently, there are two main categories of machine learning-based DR prediction models. The first category is built on fundus images. For instance, a study conducted by Indian researchers proposed a multi-path convolutional neural network (M-CNN) and a machine learning classifier for DR, using 36,769 fundus images for model development and validation. After feature extraction with the M-CNN approach, machine learning classifiers such as SVM, RF, and DT were employed to classify the images into various categories, achieving a DR prediction accuracy exceeding 90% [30]. Casanova et al. utilized graded fundus photography and systematic data from 3,443 participants in the ACCORD-Eye study, employing double cross-validation to evaluate RF and logistic regression, resulting in a DR prediction accuracy of 75% [31]. Despite their high performance, these models are mainly used in a few developed countries because they require specialist ophthalmologists and costly medical equipment [32].
Numerous developing countries are currently unable to utilize these models for DR screening.

The second category of DR prediction models relies on physiological and biochemical indicators to determine characteristic values. These models primarily use demographic data, medical history, and test results. Most published studies utilize SVM, artificial neural networks (ANN), RF, logistic regression, and decision trees for DR predictive classification. For example, Tsao et al. compared the performance of four machine learning algorithms using 10 features on 536 diabetic patients, employing fivefold cross-validation. Their findings indicated that SVM achieved 79.5% accuracy in DR classification, outperforming decision trees, artificial neural networks, and logistic regression [33]. Yao et al. used 530 Chinese residents (including 423 type 2 diabetic patients) as their study population and utilized univariable and multivariable logistic regression (MLR) to analyze the correlation between DR and biochemical metabolic parameters. Based on the MLR results, they developed a Back Propagation Artificial Neural Network (BP-ANN) model to classify the 423 patients, revealing that the AUC of the BP-ANN model surpassed that of the MLR model (0.84 vs. 0.77) [34].

A recent investigation, using data from the Korean National Health and Nutrition Examination Survey (KNHANES V-1 and KNHANES V-2 databases), compared learning models such as ridge, elastic net, and LASSO with conventional DR indicators. The LASSO-based sparse learning model demonstrated an AUC of 0.82 and an accuracy of 75.2%, proving effective in predicting DR. Furthermore, Blighe et al. undertook an environment-wide association study using NHANES data, analyzing over 400 laboratory parameters linked to DR for predictive purposes. They employed parallel univariable regression models, principal component analysis (PCA), penalized regression, and Random Forest for the selection of independent variables (features).
The RF model outperformed the others, with an AUC of 0.84 [35].

In comparison to other studies, our research offers several improvements. First, given the large number of features in the dataset, we employed recursive feature elimination for iterative model construction. This method eliminates irrelevant and redundant dimensions from the 85 features, retaining only those that are beneficial for learning the classification. It addresses overfitting, memory consumption, and time overhead while avoiding the subjectivity, inaccuracy, and unstable results that traditional dimensionality reduction approaches might introduce. Second, we explored the performance of four machine learning classification models with hyperparameter optimization and selected the GBDT model. This model shows high adaptability across various data types and outperforms other data mining techniques in medical classification tasks. To our knowledge, the application of GBDT to DR classification has not been documented in the existing literature. Third, when compared to the classical logistic regression model, our model identified six correlated factors for DR with an AUC of 0.8672, whereas the logistic regression model identified 28 correlated factors but achieved a lower AUC of 0.8341. This indicates that our model is both more parsimonious in the number of correlated factors and more accurate in classification.

Examination of influential factors

This research identified ALB_CR (Urinary albumin/creatinine ratio), UPR_24 (Twenty-Four Hours Urinary Protein), BU (Blood urea), HBA1C (Glycated hemoglobin A1c), SCR (Serum creatinine), and CP (C-peptide) as significant correlates of DR, with six pairs of factors demonstrating some interactive effect on the prevalence of DR.

ALB_CR and UPR_24 are commonly used to measure micro-urinary protein levels.
The presence of proteinuria, an essential marker of damage to the vascular endothelial system, often indicates widespread microangiopathy in the body [36]. Studies by Li Rui and Li Meifang suggested that diabetic patients with proteinuria had a higher incidence of DR than those without proteinuria. They reported a relative risk of 2.638 for the group with microproteinuria (i.e., higher ALB_CR, UPR_24) and 2.702 for the group with substantial albuminuria, and concluded that microprotein was closely associated with the development of DR [37,38].

Research conducted by Huang Shufang, Ai Wei, and Fan Ruilei suggests that an increased level of micro-urinary protein is an independent risk factor for DR. They proposed that testing for micro-urinary protein can serve as a predictive tool for the progression of DR, enabling early detection of diabetic microangiopathy and reducing the incidence of DR through timely clinical intervention [39,40,41]. Consistent with these reports, the SHAP summary and dependence plots in this study showed that elevated ALB_CR and UPR_24 were significantly associated with DR. Furthermore, the rise in DR incidence with increasing ALB_CR was more pronounced than that with increasing UPR_24.

Serum creatinine (SCR) is a byproduct of meat consumption and muscle tissue metabolism. Variations in SCR concentration are primarily determined by the glomerular filtration rate (GFR) and the filtering capacity of the kidneys. Elevated SCR levels often signify kidney damage, but SCR is an insensitive indicator of kidney parenchymal damage: it typically rises only after extensive damage affecting more than half of the kidney, so it does not reflect an early or mild decline in renal function. Nevertheless, our research revealed a significantly higher risk of DR when SCR exceeded 0.02 (60 μmol/L before data pre-processing), even within the normal range (44–133 μmol/L).
Therefore, monitoring SCR within the normal range can aid in the early detection of DR, rather than raising concerns only when SCR exceeds normal values. Studies have demonstrated a statistically significant difference in SCR levels between DR and non-DR (NDR) groups (p < 0.05) [42,43], supporting SCR as an independent risk factor for DR. Individuals with SCR levels exceeding 133 μmol/L are 2.006 times more likely to develop DR than those with normal SCR [44].

Blood urea (BU) is the primary end product of protein metabolism and is removed from the body through glomerular filtration. Research conducted by Wang Yangzhong, Liu Hongfang, and Ma Yue revealed that BU levels were significantly higher in the DR group than in the non-DR group (p < 0.05) and that BU levels in type 2 diabetic patients were associated with DR development [45]. The SHAP plots from our study showed that, as BU increased, its contribution in the model shifted from suppressing to promoting a positive (DR) prediction. When the BU index exceeded 0.15 (5 mmol/L before data pre-processing), the proportion of DR patients increased significantly, corroborating the findings of Song Yanan et al. [46].

UPR_24, ALB_CR, SCR, and BU are key renal function indicators closely associated with DR risk. This might be attributed to the similarities between the kidney and the retina, such as their origin, development, capillary network structure, and filtration barrier function [47]. DR and renal disease may share multiple pathogenic mechanisms. DR development could be influenced by activation of the renin-angiotensin system, impairment of renal function, and genetic, hemodynamic, and lipid-metabolism factors [48].
Moreover, pathogenesis may involve the accumulation of glycosylation end products, activation of the polyol pathway, oxidative stress, growth factors, endoplasmic reticulum stress, inflammatory mediators, and complement activation [49].

HBA1C not only reflects blood glucose control over time but also plays a role in DR onset and progression. Chronically high blood glucose is the primary driver of DR development. In vivo and in vitro studies have demonstrated that high glucose levels induce pericyte apoptosis [50]. Pericytes provide structural support to capillaries; their loss can lead to localized bulging of capillary walls, contributing to microaneurysm formation, which is the earliest clinical sign of DR [51]. Poor glycemic control may increase DR risk. Higher HbA1c levels in diabetic patients can exacerbate vessel wall damage and capillary occlusion through increased aggregation of erythrocytes, potentially causing tissue ischemia and hypoxia, which can trigger retinal metabolic disorders [52]. Our analysis of the SHAP plot data revealed a strong correlation between HBA1C and DR. The risk of DR increased when HBA1C exceeded 7% and also when it dropped below 6%. As HBA1C values decreased, the protective effect against DR weakened as well, elevating the likelihood of DR, consistent with previous studies [53]. Lower glycated hemoglobin is therefore not always beneficial for diabetic patients: levels maintained below 6% have been associated with more hypoglycemic episodes and higher mortality. It is thus essential to keep glycated hemoglobin under control, and some scholars recommend maintaining it between 6.5% and 7% for optimal management [54].

C-peptide (CP) serves as an indicator of the body's insulin secretion level and is a reliable gauge of the reserve function of pancreatic islet cells.
Numerous domestic and international studies have revealed that CP, an active hormone, binds specifically to endothelial cells in a stereospecific manner. In a high-glucose environment, it stimulates Na+-K+-ATPase on the cell membrane surface of endothelial cells, inhibits nuclear factor-κB, and activates endothelial nitric oxide synthase, leading to improved blood flow and vascular permeability within the retinal vasculature [55]. Furthermore, in diabetic patients, C-peptide activates AMP-activated protein kinase alpha to inhibit intracellular reactive oxygen species-mediated endothelial apoptosis, thereby improving endothelial dysfunction [56]. A study by Bo et al. observed that the lowest fasting CP levels in type 2 diabetic patients correlated with the highest incidence of DR at baseline and at follow-up, and that the risk of retinopathy correlated negatively with fasting CP levels [55]. A study by Wang et al. (2018) divided patients into four groups (Q1, Q2, Q3, and Q4) by fasting CP level [57]. The results indicated a progressive decrease in the prevalence of DR as fasting CP levels increased. Our SHAP plots also confirmed CP as a protective factor against DR: a lower CP value correlated with a higher likelihood of DR, whereas an increase in CP significantly decreased that likelihood.

The SHAP plot results in this study revealed a significant potential interaction between six pairs of characteristics (UPR_24 and ALB_CR, BU and ALB_CR, CP and ALB_CR, HBA1C and UPR_24, CP and UPR_24, and HBA1C and BU) on the prevalence of DR. In particular, the interaction between UPR_24 and ALB_CR became evident at lower ALB_CR values. The risk of developing DR increased with increasing UPR_24 levels. BU exhibited a partial interaction with ALB_CR: changes in BU did not significantly affect the predicted outcome, but diabetic patients with lower ALB_CR values had a higher risk of developing DR.
Moreover, CP displayed interactions with ALB_CR in patients with both higher and lower CP values, influencing the risk of DR. The difference in DR risk linked to UPR_24 depended on HBA1C levels, with higher HBA1C levels increasing the risk of DR at lower UPR_24 values. In the interaction between HBA1C and BU, the probability of DR increased when the HBA1C value was below 7% and the BU value was higher. Conversely, if the HBA1C value exceeded 7%, the likelihood of DR increased even with a lower BU value. These interactions suggest that paying careful attention to one indicator while testing another could provide a more comprehensive prediction of DR. Furthermore, this conclusion offers a new research direction for the pathological mechanism of DR in patients who maintain good control of one indicator.

Limitations and outlook

Despite the improvements and novel findings in this study, several limitations persist: (1) Our study used a dataset of 3000 records, which is a relatively small sample and is limited to Chinese patients. The scope of the study should be broadened to include diabetic populations from multiple countries, allowing our model to predict and diagnose DR more comprehensively. (2) Numerous fields in the sample had missing data, and filling them with mean values introduces a certain level of error, which affects the predictive performance of our model. Future research should consider cluster-based and model-based imputation to reduce this error and should select a dataset with fewer missing values. (3) The duration of disease, which correlates significantly with DR, was not included in our data. Hence, future data collection should include the duration of patients' diseases [58,59]. (4) The understanding of the impact of the interaction of two characteristics on DR prediction remains in the exploratory stage, with limited literature supporting most of these interactions.
Future research should further investigate these potential interactions and their implications for DR prediction and management.
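
As a rough illustration of limitation (2), the following Python sketch contrasts filling missing values with the overall mean against filling them with the mean of the record's own cluster. The records, cluster labels, and function names are hypothetical and invented for this sketch. They are not the study data and not the imputation procedure used in this work, and group-mean filling is only a simple stand-in for cluster-based imputation.

```python
from statistics import mean

# Synthetic, illustrative records: (cluster label, true value, observed value or None).
# All numbers are made up for this sketch and are NOT taken from the study data.
records = [
    ("A", 50.0, 50.0), ("A", 54.0, None), ("A", 58.0, 58.0), ("A", 62.0, None),
    ("B", 110.0, 110.0), ("B", 114.0, None), ("B", 118.0, 118.0), ("B", 122.0, 122.0),
]


def impute_global_mean(rows):
    """Fill every missing value with the mean of all observed values."""
    fill = mean(obs for _, _, obs in rows if obs is not None)
    return [obs if obs is not None else fill for _, _, obs in rows]


def impute_cluster_mean(rows):
    """Fill each missing value with the observed mean of its own cluster."""
    fills = {}
    for label in {lab for lab, _, _ in rows}:
        fills[label] = mean(
            obs for lab, _, obs in rows if lab == label and obs is not None
        )
    return [obs if obs is not None else fills[lab] for lab, _, obs in rows]


def mean_abs_error(rows, filled):
    """Mean absolute error, computed over the originally missing entries only."""
    errors = [
        abs(true - est)
        for (_, true, obs), est in zip(rows, filled)
        if obs is None
    ]
    return mean(errors)


global_mae = mean_abs_error(records, impute_global_mean(records))
cluster_mae = mean_abs_error(records, impute_cluster_mean(records))
print(f"global-mean MAE:  {global_mae:.2f}")
print(f"cluster-mean MAE: {cluster_mae:.2f}")
```

In this toy setting the two clusters have clearly different typical values, so the overall mean falls between them and fits neither, while the cluster mean stays close to the hidden true values. Whether a comparable gain would appear on real clinical data depends on how well the clusters capture the structure of the missing fields.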