Efficacy of automated machine learning models and feature engineering for diagnosis of equivocal appendicitis using clinical and computed tomography findings

Study design

This study was conducted as a single-center, observational, retrospective analysis from April 2011 to November 2019, with approval from the Institutional Review Board of Ajou University Hospital (IRB no. AJOUIRB-MDB-2021-291). Due to the retrospective nature of the study, the requirement for informed consent was waived by the Institutional Review Board of Ajou University Hospital.

To investigate the capabilities of AutoML models in predicting AA in patients with ambiguous CT findings, we followed the TRIPOD guidelines (as detailed in Online Supplementary Table S1) and established an AutoML framework [31]. All methods were performed in accordance with the ethical standards of the Declaration of Helsinki and the TRIPOD guidelines.

Enrolled patients

The study population comprised patients aged 15 years and older who underwent intravenously enhanced abdominal CT scans for the differential diagnosis of AA. In total, 335 patients had CT reports in which AA remained a differential diagnosis because of ambiguous (equivocal) CT findings. The CT report system classified the likelihood of AA into five categories, namely definitive “appendicitis,” “probable appendicitis,” “indeterminate” (equivocal CT findings), “probably not appendicitis,” and “normal appendix.” Of the 335 patients, 303 were included in the study, and 32 patients were excluded for incomplete medical records (Fig. 1).

Fig. 1. Study flowchart. AG, appendicitis group; CT, computed tomography; NAG, non-appendicitis group.

Imaging methods and interpretation

All CT scans were performed using a 16-slice multidetector CT scanner (Brilliance 16, Philips Healthcare, Eindhoven, Netherlands), with intravenous contrast material administration. No oral contrast medium was used. The scans covered the abdomen from the diaphragm to the symphysis pubis. The technical parameters for the scans included a collimation of 1.5 mm, a rotation time of 0.75 s, and a pitch of 1.188.
The images were reconstructed into axial and coronal sections with a thickness ranging from 3 to 5 mm. The tube voltage and current settings were 120 kVp and 150–300 mA, respectively. Contrast enhancement was achieved using iohexol (Omnipaque 350, GE Healthcare, Princeton, NJ, USA) and iopamidol (Pamiray 370, Dongkook Pharmaceutical, Seoul, Korea), administered 60 s post an initial dose of 2 mL/kg body weight. The contrast medium was infused at a rate of 4 mL/s through an antecubital vein. A retrospective analysis of abdominal CT scans with equivocal findings was performed by an abdominal radiologist with over 15 years of experience. The radiologist searched for CT signs indicative of AA, which included the cross-sectional outer diameter of the appendix, peri-appendiceal fat stranding or fluid, appendiceal wall enhancement, appendiceal and cecal wall thickening, intraluminal air, peri-cecal lymph nodes, and fluid-filled small bowel. The appendix diameter was measured on enhanced axial sections at the widest visible portion of the appendix.

Clinical data and conventional scoring system (AAS)

We examined patient demographics, clinical presentations, and laboratory findings, including blood and urine analyses, sourced from electronic medical records. Data collection was performed systematically by a single emergency medicine physician unrelated to this study. Patients with ambiguous CT results were divided into two cohorts based on their ultimate diagnosis: the appendicitis group (AG, N = 115) and the non-appendicitis group (NAG, N = 188).

Final diagnoses were determined as follows: for patients who underwent surgical intervention, the diagnosis of appendicitis was confirmed through histopathological examination, which showed transmural infiltration by neutrophils in the appendix.
For those who did not undergo surgery, a review of their medical records over a two-week period was conducted to establish a diagnosis. Patients who sought treatment at other facilities were contacted by telephone. Similarly, patients who were referred to other medical centers received follow-up calls to verify their diagnostic outcomes.

The AAS was selected for its superior performance in terms of the area under the receiver operating characteristic curve (AUROC) compared with other traditional scoring systems in prior research [4,32]. The criteria for the AAS included migratory pain to the right lower quadrant (RLQ) (2 points), direct RLQ tenderness (3 points for men and those over 50 years, 1 point for women aged 16–49 years), rebound tenderness (2 points for mild; 4 points for moderate to severe), elevated white blood cell (WBC) count (1 point for 7.2 ≤ WBC < 10.9 [× 10^9/L]; 2 points for 10.9 ≤ WBC < 14.0 [× 10^9/L]; 3 points for ≥ 14.0 [× 10^9/L]), and increased C-reactive protein (CRP) levels, with scores adjusted for symptom duration (< 24 h: 2 points for 4 ≤ CRP < 11; 3 points for 11 ≤ CRP < 25; 5 points for 25 ≤ CRP < 83; 1 point for CRP ≥ 83, and for symptoms lasting ≥ 24 h: 2 points for 12 ≤ CRP < 152; 1 point for CRP ≥ 152) [13].

Data preprocessing and feature engineering

Before the development of the models, all variables, with the exception of age and the duration from onset of pain to CT scan, were converted into nominal categories to enhance their practicality and comprehensibility. The AutoGluon model utilizing only clinical findings incorporated 10 clinical and 4 laboratory variables as input. When integrating clinical and CT findings, the model included 10 clinical variables, 4 laboratory variables, and 8 radiologic CT variables. Prior to the training phase for all algorithms, numerical data were standardized. One-hot encoding was used for handling categorical variables within the ML algorithms.
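
As a minimal sketch of the two preprocessing steps just named (standardization of numerical data and one-hot encoding of categorical variables), the following uses only the Python standard library. The function names, the toy values, and the use of the population standard deviation are assumptions made for illustration; the study does not state which implementation was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric column: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; other tools may use the sample SD
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Expand a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical toy records; names and values are illustrative, not study data.
ages = [19, 34, 52, 41]
rebound = ["none", "mild", "severe", "mild"]

age_z = standardize(ages)      # mean 0, SD 1 after scaling
rebound_oh = one_hot(rebound)  # keys: is_mild, is_none, is_severe
```

After scaling, the continuous variables are on a common scale, and each nominal category becomes its own binary column that any of the compared algorithms can consume.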
A body temperature of 37.3 ℃ or higher was classified as elevated. Laboratory reference values specific to our hospital were used, defining leukocytosis as a WBC count exceeding 10.3 × 10^9/L, an elevated neutrophil count as above 80% of the total WBC count, and an elevated C-reactive protein (CRP) level as above 8 mg/L. Additionally, an appendix diameter of 6 mm or more was considered significant.

The autofeat library was used for automatic feature engineering to generate a comprehensive set of features that capture complex non-linear relationships and interactions between variables. This procedure was executed using the AutoFeatRegressor class, with the verbose parameter set to 1 to enable real-time progress logging and the feateng_steps parameter set to 3, so that feature engineering proceeded in three steps. This methodology allows the model to identify and incorporate important non-linear patterns and interactions in the data, with the aim of improving the accuracy of disease diagnosis and prediction, particularly for structured data, which broadens its utility in medical data analysis (Supplementary code).

Datasets are available in the supplementary information files (Supplementary data and model information: Supplementary_model_information.zip). This file contains a pkl file storing global variables from the model development phase, the dataset used, and detailed information about the AutoGluon model. The README.txt file within the zip file explains the usage instructions and important variables.

AutoML model development

In this study, we used AutoGluon, an open-source AutoML framework developed by Amazon, to develop two models for diagnosing equivocal AA.
The first model, named “AutoGluon-clinical,” was based solely on clinical findings, while the second model, “AutoGluon-clinical-CT,” integrated both clinical and CT findings. Several key factors justified the choice of AutoGluon:

(1) AutoGluon automates essential aspects of ML, such as data preprocessing, feature engineering, model selection, and hyperparameter tuning. This significantly reduces the time and effort required for model development, enabling researchers without extensive expertise in ML algorithms or coding to construct high-performance models.

(2) The AutoGluon framework is designed to achieve high predictive performance by leveraging advanced ensemble techniques and comprehensive model evaluation. These models are particularly suited for clinical settings where diagnostic accuracy is paramount. AutoGluon efficiently handles various data types, making it suitable for the complex and diverse datasets typically encountered in medical research. Furthermore, AutoGluon performed extensive hyperparameter optimization and model evaluation, training various algorithms such as neural networks, random forests (RF), and gradient boosting, and combining them into a robust ensemble model. This approach maximized the predictive performance and generalization capability of the model.

(3) AutoGluon’s automated pipeline can be combined with automated feature engineering from the separate autofeat library, which creates sophisticated non-linear features that can enhance model performance. This capability is important for capturing complex patterns and interactions within the data, thereby improving the accuracy and robustness of diagnostic models.
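
For orientation, the settings reported in this Methods section (feateng_steps = 3 and verbose = 1 for AutoFeatRegressor; presets = “best_quality” and auto_stack enabled for TabularPredictor) might be wired together roughly as follows. This is a hedged sketch, not the authors’ supplementary code: the label column name, the choice of roc_auc as the evaluation metric, and the split into two helper functions are assumptions, and the text does not state how the autofeat output was passed to AutoGluon.

```python
# Settings as reported in the Methods; everything else below is an assumption.
AUTOFEAT_KWARGS = {"feateng_steps": 3, "verbose": 1}
FIT_KWARGS = {"presets": "best_quality", "auto_stack": True}


def engineer_features(X, y):
    """Return the predictor table X (a pandas DataFrame) expanded with autofeat features."""
    from autofeat import AutoFeatRegressor  # lazy import: optional dependency

    model = AutoFeatRegressor(**AUTOFEAT_KWARGS)
    return model.fit_transform(X, y)


def train_predictor(train_df, label="appendicitis"):
    """Fit an AutoGluon tabular ensemble on train_df.

    The label column name and eval_metric are hypothetical choices for this sketch.
    """
    from autogluon.tabular import TabularPredictor  # lazy import: optional dependency

    predictor = TabularPredictor(label=label, eval_metric="roc_auc")
    return predictor.fit(train_df, **FIT_KWARGS)
```

Keeping the two steps in separate functions makes it explicit that feature generation and model fitting are performed by different libraries.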

Prior to model development, we standardized numerical data and applied one-hot encoding to categorical variables to ensure compatibility with ML algorithms. Additionally, clinical and laboratory variables were transformed into nominal categories to enhance practical applicability and comprehensibility. Specific preprocessing steps included classifying body temperatures ≥ 37.3 ℃ as elevated, defining “leukocytosis” as a WBC count > 10.3 × 10^9/L, marking neutrophil counts > 80% of total WBC count as elevated, considering CRP levels > 8 mg/L as elevated, and treating an appendix diameter ≥ 6 mm as significant.

During the model development process, we used the TabularPredictor class from AutoGluon to build models optimized for tabular data analysis. The presets parameter was set to “best_quality” to leverage the most advanced modeling capabilities of the framework, and the auto_stack parameter was enabled to facilitate model stacking. Additionally, we used the autofeat library for automated feature engineering, which generates novel features by capturing complex non-linear relationships and interactions between variables. This process was conducted in three stages using the AutoFeatRegressor class, enhancing the model’s ability to detect intricate patterns within the data. The optimization process used the k-fold bootstrap aggregation (bagging) algorithm, with k values usually between 5 and 10 [29], to improve the model’s generalizability and limit overfitting.

For the analysis, individual ML models, including logistic regression (LR), least absolute shrinkage and selection operator (LASSO) regression, ridge regression, support vector machine (SVM), decision tree (DT), RF, and extreme gradient boosting (XGBoost), were used.
These models were implemented with the following R packages: “glmnet” for LR, LASSO, and ridge regression; “e1071” for SVM; “rpart” for DT; “randomForest” for RF; and “xgboost” for XGBoost.

The dataset was partitioned into training and test sets at a 7:3 ratio, comprising 213 individuals in the training set and 90 in the test set. The training set was dedicated to model training, while the test set was used to evaluate the model’s performance without any further adjustment of its parameters. Tenfold cross-validation (k = 10) was applied within the training set to develop and gauge the performance of the ML models.

Evaluation of performance of AutoML models, single ML model, and conventional scoring system

To determine the most effective modeling approach, the single ML model with the highest AUROC was identified. This model was subsequently compared with the AutoGluon-clinical model, AutoGluon-clinical-CT model, and AAS. Each model’s performance was assessed using a range of established evaluation metrics, including AUROC, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the F1 score. These metrics were calculated for each model, and ROC curves were plotted. The development of the AutoGluon models was performed using Python version 3.10.13. For the comparative analysis of models based on probabilities derived from the AutoGluon models, the R software (version 4.4.1) and its packages, including caret, epiR, and pROC, were used.

Feature importance

To evaluate feature importance in the model, the feature_importance method of AutoGluon’s TabularPredictor was used, which computes permutation importance. This method assesses the decrease in prediction performance of the model when the values of a single column are randomly shuffled row-wise. Features ranked highly in this assessment contribute significantly to the accuracy of AutoGluon.
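
The shuffling procedure described above can be illustrated with a small, library-free sketch. It is not AutoGluon’s implementation: the toy model, the column names, the use of accuracy as the score, and the number of repeats are assumptions chosen to keep the example short.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each column row-wise."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for col in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)  # break the link between this column and the label
            permuted = [{**r, col: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances[col] = sum(drops) / n_repeats
    return importances


# Hypothetical toy data: the "model" looks only at appendix diameter (cutoff 6 mm).
rows = [{"diameter_mm": d, "noise": d % 3} for d in range(1, 21)]
labels = [r["diameter_mm"] >= 6 for r in rows]


def toy_model(row):
    return row["diameter_mm"] >= 6


scores = permutation_importance(toy_model, rows, labels)
```

In this toy case the diameter column receives a positive score because shuffling it degrades accuracy, whereas the unused noise column scores exactly zero, since shuffling it cannot change any prediction.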
Features with non-positive importance scores contribute minimally to the model’s accuracy or may even detrimentally affect it if included in the dataset. Feature-importance scores indicate how much each variable affects the model’s accuracy, but they do not reveal the direction of its effect on predictions. Although the absolute value of the importance score is informative, the contributions of variables should be evaluated in terms of their relative magnitudes. The resulting DataFrame includes the feature names (index), estimated importance scores (importance), standard deviations of the scores (stddev), p-values indicating the statistical significance of feature importance, number of estimations used for scoring (n), and percentiles defining confidence intervals (p95_high, p95_low).

Statistical analysis

For the development and evaluation of AutoML and autoFE models, Python version 3.10.13 was used along with AutoGluon (version 1.1.1) and autofeat (version 2.1.2). Single ML models were developed using R (version 4.4.1). The analyses were conducted on a computer running Windows 11, powered by a 13th Generation Intel(R) Core(TM) i9-13900K CPU, and equipped with an NVIDIA GeForce RTX 2080 Ti GPU. For continuous variables, descriptive statistics were reported as mean ± standard deviation or median (interquartile range), depending on the results of normality tests. Categorical data were presented as counts and percentages. Continuous data were compared between the appendicitis group (AG) and the non-appendicitis group (NAG) using the independent t-test or the Mann–Whitney U test, depending on whether the data were normally distributed. The chi-square test was applied to compare the distribution of categorical data between the two groups. Statistical significance was established at a P-value less than 0.05.
Odds ratios and their 95% confidence intervals (CIs) were calculated using the R-package “stats.” ROC curves and AUROC analyses were performed using the R-package “pROC.” The optimal threshold for each model was determined using Youden’s index; the P values for the AUROC were calculated using the bootstrap resampling method with 1000 replicates. Additionally, accuracy, sensitivity, specificity, PPV, NPV, and the F1 score were determined using the R-packages “caret” and “epiR.”

A prior study indicated that a sample size of 230 subjects was necessary to achieve a precision of 0.10 for either sensitivity or specificity, with an alpha error of 0.05 and a power of 80%, assuming a sensitivity of 82.0%, a specificity of 53.9%, and an incidence of appendicitis of 24.8% at a cut-off value for the AAS [4,32]. In this study, a total sample size of 303 subjects was deemed sufficient to satisfy the desired statistical power requirements.
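
To make the evaluation quantities named in the statistical analysis concrete, the sketch below computes the AUROC, the Youden-optimal threshold, and the threshold-dependent metrics in plain Python. The study itself used the R packages pROC, caret, and epiR; this is only an illustration of the definitions, and the function names and toy labels and scores are assumptions.

```python
def confusion(labels, scores, threshold):
    """Counts (tp, fp, tn, fn) when score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, tn, fn


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(labels, scores):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(labels, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:  # on ties, the lowest threshold is kept
            best_t, best_j = t, j
    return best_t, best_j


def metrics(labels, scores, threshold):
    """Accuracy, sensitivity, specificity, PPV, NPV, and F1 at a given threshold."""
    tp, fp, tn, fn = confusion(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan"),
    }


# Hypothetical predicted probabilities for 3 positives and 6 negatives.
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold, j = youden_threshold(labels, scores)
result = metrics(labels, scores, threshold)
```

On these toy values the Youden-optimal threshold is 0.6, which yields a sensitivity of 1.0 and a specificity of 5/6; a bootstrap comparison of AUROCs, as used in the study, would resample the patients and repeat the auroc calculation on each replicate.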
