Comparison of machine learning models for the prediction of hypertension in transgender patients undergoing gynecologic surgery

Data

The American College of Surgeons National Surgical Quality Improvement Program (ACS NSQIP) database is a national surgical registry used to measure risk-adjusted outcomes of surgical procedures across multiple surgical specialties. Over 700 hospitals report over one million surgical cases a year in the NSQIP dataset. The data are audited for accuracy, and variables are collected prospectively by trained clinical reviewers.

Study population

Patients in the ACS NSQIP database who were coded as having a gynecologic or obstetric surgery between January 2005 and December 2019 and were coded as having a Male sex met the inclusion criteria for the transgender cohort for this study. Patients meeting the same surgical criteria who were coded as having a Female sex met the inclusion criteria for the cisgender cohort. This study was exempt from IRB review pursuant to section 4ii of the IRB Exemption requirements and Brown University’s Institutional Guidelines, and compliance with the ACS NSQIP data use agreement was required. The American College of Surgeons collects the ACS NSQIP data with informed consent and provides the data to medical researchers; therefore, there was no need to re-obtain patient consent.

Cohort development

We assembled 3 cohorts of patients for 2 main experiments (Table 1). The first cohort consists of all transgender patients who met the inclusion criteria (we will refer to this as the transgender cohort). The second cohort consists of a volume-matched and class ratio-matched sample of the cisgender and transgender patients available in the data (which we will refer to as the cisgender cohort). The goal for the cisgender cohort was to create a smaller dataset that emulated the observed proportions of transgender and cisgender patients in the NSQIP dataset.
The third cohort consists of all the transgender and cisgender patients who met the inclusion criteria of being recorded in the ACS NSQIP between January 2005 and December 2019 and coded as having an obstetric or gynecologic surgery (which we will refer to as the combined cohort).

Table 1 Cohort development breakdown

The cisgender cohort and transgender cohort were derived from the combined cohort. Cisgender patients were selected at random from the cisgender patients in the combined cohort, and transgender patients were selected at random from the transgender cohort, to create the volume- and ratio-matched cisgender cohort of predominantly cisgender patients.

In the cisgender cohort, the total number of patients selected was equal to the total number of patients in the transgender cohort, in order to keep the sample size consistent during model development. This cisgender cohort was intended to be a microcosm of the combined cohort and was therefore volume matched to the smaller sample size of the transgender cohort to allow a fairer comparison between ML models developed on the 2 cohorts. The ratio of cisgender to transgender patients in this cisgender cohort was based directly on the ratio of cisgender to transgender patients observed in the combined cohort, to emulate the ratios of transgender and cisgender patients observed in a real medical database. Therefore, this cohort was predominantly cisgender, owing to the lower representation of transgender patients in the ACS NSQIP, which reflects the lower proportion of transgender patients documented in most medical databases.

Outcome

The primary outcome variable analyzed was a diagnosis of hypertension severe enough to require medication, which may affect the patient’s risk for cerebrovascular, renal and cardiac disease.
To be documented as a positive, the patient’s hypertension must be recorded in their medical record and must be severe enough to warrant administration of antihypertensive medication (such as calcium channel blockers, diuretics, beta blockers, and ACE inhibitors) within 30 days prior to their index surgery, or during the time the patient is being considered as a candidate for surgery. Furthermore, the patient must have been receiving or required (if noncompliant) long-term treatment of their chronic hypertension exceeding 2 weeks to be coded as a yes for this outcome. Although this dataset consisted of surgical patients, this variable was recorded solely preoperatively, so it can be used to model and predict hypertension in nonsurgical candidates as well.

The class balance ratio for the hypertension outcome variable was kept consistent between the combined cohort and the volume-matched cisgender cohort. The ratio of cisgender patients with hypertension to cisgender patients without hypertension in the large, combined dataset was preserved in the development of the smaller, volume-matched cisgender cohort to emulate the real, observed distribution of hypertension cases in cisgender patients. For transgender patients in the volume-matched cisgender cohort, the ratio of transgender patients with hypertension to transgender patients without hypertension was kept equal to the ratio observed in the transgender cohort.

Machine learning models

Any patients carrying blank/NULL values for the outcome variable column were removed to eliminate uncertainty and inaccuracy from training. These patients were omitted from the analysis to avoid any ascertainment bias from erroneously classifying a positive case as a negative case or vice versa. The recording of these values is audited by the NSQIP and quality checked to ensure that they are accurately documented.
Then, remaining blank data were handled by multivariate iterative imputation in order to reduce bias in the data. Binary values that were imputed through multivariate imputation were rounded to the nearest whole number (0 or 1) to maintain medical consistency and interpretability within the data. The outcome variable was removed from the data frame prior to this process and was re-appended after imputation to avoid introducing inaccuracies in model development.

The cohort was split at the patient level such that no training data could appear in the testing set. All variables studied in the analysis were included in the model to maximize its predictive potential and to preserve intervariable correlations.

Individuals were selected randomly to assemble all cohorts. For each of the 3 cohorts, a 75–25% stratified train-test split was performed to preserve the hypertension class ratio between the training set and test set. The test set for all models developed on all cohorts was a set of 25% of the patients in the transgender cohort, distinct from the transgender patients in the training set. This was done to ensure that the predictive potential of all models, specifically for cardiovascular outcomes in transgender patients, was being evaluated and compared. The scikit-learn package’s train_test_split function was used as a random assortment algorithm to segment cohorts into training and testing sets to reduce bias. Blinding was not possible due to the need to develop ML models, but no patients were fully observed at the individual level, patient data in the NSQIP is de-identified, and aggregate patient data was stored in the form of variables to mitigate bias.

ML models were selected based on existing literature [2,12] and narrowed to supervised models due to their higher accuracy rates and the presence of labeled data in the training set.
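The preprocessing steps described in this section (dropping patients with a missing outcome, imputing predictors only, rounding imputed binary values, and a 75–25% stratified split) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the study's code: the helper name `impute_and_split`, the column layout, and the default `IterativeImputer` settings are assumptions, and the sketch splits a single cohort rather than reproducing the shared transgender test set.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split


def impute_and_split(X, y, binary_cols, seed=0):
    """Hypothetical helper: impute predictors, round binaries, stratified 75/25 split."""
    # Remove patients whose outcome is missing before anything else.
    keep = ~np.isnan(y)
    X, y = X[keep], y[keep].astype(int)

    # The outcome is never passed to the imputer, so it cannot leak into the
    # imputed predictors. Assumption: the imputer is fit on the whole cohort
    # before splitting, following the order of steps given in the text.
    imputer = IterativeImputer(random_state=seed)
    X_imp = imputer.fit_transform(X)

    # Imputed binary variables are rounded back to 0 or 1.
    X_imp[:, binary_cols] = np.clip(np.rint(X_imp[:, binary_cols]), 0, 1)

    # stratify=y preserves the outcome class ratio in both partitions.
    return train_test_split(
        X_imp, y, test_size=0.25, stratify=y, random_state=seed
    )


# Synthetic stand-in data (illustrative only): two continuous and one binary predictor.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(50, 10, n)
bmi = 0.1 * age + rng.normal(22, 3, n)
diabetes = rng.integers(0, 2, n).astype(float)
X = np.column_stack([age, bmi, diabetes])
X[rng.random(X.shape) < 0.1] = np.nan  # knock out about 10% of predictor values

y = np.zeros(n)
y[:40] = 1  # 20% positive class
rng.shuffle(y)

X_train, X_test, y_train, y_test = impute_and_split(X, y, binary_cols=[2])
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

Keeping the outcome out of the imputer mirrors the stated reason for removing it from the data frame first; the stratified split then keeps the hypertension rate nearly identical in the two partitions.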
ML model hyperparameters were optimized through a grid search, validated with 5-fold cross-validation, to obtain the hyperparameters yielding the best results on the testing set.

Variable importance

Variable importance was determined based on the model. For the random forest model, variable importance (VI) is determined using the mean decrease in Gini index/impurity. A higher mean decrease in the Gini index indicates more importance. For the logistic regression model, VI is found by taking the absolute value of the coefficients of the final model and ranking the coefficients by magnitude; a larger absolute coefficient indicates higher importance. For the XGBoost model, the importance of a variable within a single tree is calculated from the improvement in node purity it provides, and these importances are then summed over each boosting iteration. The VI averages the importances of each variable across all decision trees to formulate a ranking. For this model, we used the gain of each tree to formulate the importance rankings, where a larger gain indicates higher importance [12].

Statistical analysis

Descriptive statistical analysis was used to assess differences in the mean clinical features of the cisgender and transgender cohorts. Measurements were taken from distinct samples. Initial analysis was done by conducting an independent, one-way analysis of variance (ANOVA) test, equivalent to a two-tailed t-test when done for two independent groups, on every independent variable included in the models, segmented between the cisgender and transgender cohorts, to assess whether these features were represented more in the transgender or the cisgender cohort.

After ML models were developed on the cisgender, transgender, and combined cohorts, they were assessed on the testing set of transgender patients, distinct from the data used in model development, by calculating the area under the curve (AUC) of the model’s receiver operating characteristic (ROC), which was obtained through bootstrapping.
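As a rough illustration of a bootstrapped AUC, the sketch below computes the ROC AUC from its pairwise definition and derives a percentile interval by resampling the test set with replacement. The text does not specify which bootstrap variant was used, so the percentile method, the 1000 resamples, and the function names `roc_auc` and `bootstrap_auc_ci` are assumptions made for illustration; the data are synthetic.

```python
import random


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval (assumed variant)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        stats.append(roc_auc(ys, [y_score[i] for i in idx]))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return roc_auc(y_true, y_score), lo, hi


# Synthetic, imbalanced test set (illustrative only): about 20% positives.
rng = random.Random(1)
y_true = [1 if rng.random() < 0.2 else 0 for _ in range(100)]
y_score = [min(1.0, max(0.0, 0.3 * y + rng.gauss(0.4, 0.2))) for y in y_true]

auc, lo, hi = bootstrap_auc_ci(y_true, y_score)
print(f"AUC = {auc:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

With few positive cases, the resampled AUC values spread widely, which is one concrete way to see the minority-class sensitivity of this metric.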
The threshold-independent nature of the AUC as a measure of discrimination makes it a strong metric for our analysis. A salient limitation of using the ROC AUC for imbalanced datasets is its sensitivity to changes in predictions for the minority class. For example, if there are few patients in the positive class, then the AUC score may vary widely depending on how the model predicts for the positive class, which may not be indicative of how the model would prospectively perform given the real distribution.

Furthermore, AUC scores in imbalanced data may be artificially inflated because the false positive rate changes little, even when false positives are numerous, when the total number of true negatives is very large. This is why metrics like the F1 score, which accounts for precision (a quantity that is sensitive to false positives irrespective of the number of true negatives), help to better contextualize model performance. Because ROC AUC metrics can be affected by class imbalance present within the data, the unweighted F1 score and Matthews correlation coefficient (MCC) were also obtained for each model, along with a 95% confidence interval for each metric across each model. The MCC evaluates model performance by measuring the correlation between the model predictions and the true values, taking all four cells of the confusion matrix into account.

To compare the statistical significance of differences in performance between the ML models developed on the transgender, cisgender, and combined cohorts, 5 by 2 cross-validation hypothesis testing was used between the ML models developed on the transgender and cisgender cohorts and between the ML models developed on the transgender and combined cohorts [13]. Only ML models of the same type, developed on the different cohorts, were compared against each other.
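To make the 5 by 2 cross-validation comparison concrete, the sketch below computes the test statistic from ten per-fold score differences between two models (5 replications of 2-fold cross-validation). It assumes the commonly used 5x2cv paired t formulation, in which the first difference is divided by the pooled per-replication standard deviation and compared against a t distribution with 5 degrees of freedom; the exact variant used in the study is not stated, and the function name and the example differences are hypothetical.

```python
import math

# Two-sided 5% critical value of Student's t with 5 degrees of freedom.
T_CRIT_DF5 = 2.571


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic (assumed formulation).

    diffs[i] = (p1, p2): score of model A minus score of model B on
    fold 1 and fold 2 of replication i, for i = 0..4.
    """
    if len(diffs) != 5 or any(len(d) != 2 for d in diffs):
        raise ValueError("expected 5 replications of 2 folds each")

    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        # Variance estimate from the two folds of this replication.
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)

    denom = math.sqrt(sum(variances) / 5)
    if denom == 0:
        raise ValueError("zero variance across folds; statistic undefined")

    # Numerator: the difference from the first fold of the first replication.
    return diffs[0][0] / denom


# Hypothetical AUC differences between two models of the same type
# trained on different cohorts (illustrative numbers only).
example_diffs = [
    (0.020, 0.010),
    (0.015, 0.025),
    (0.030, 0.010),
    (0.020, 0.020),
    (0.010, 0.030),
]
t_stat = five_by_two_cv_t(example_diffs)
significant = abs(t_stat) > T_CRIT_DF5
print(f"t = {t_stat:.3f}, reject equal performance at 5%: {significant}")
```

Using a fixed critical value keeps the sketch dependency-free; an exact p value would come from the t distribution with 5 degrees of freedom in a statistics library.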
This hypothesis testing framework was chosen over other frameworks like ten-fold cross-validation due to its relatively lower Type I error, its ability to be modified to overcome lack of independence in the data, and its ability to obtain a strong estimate of the generalization error, and of the variance of the generalization error, between the performance of the 2 compared models. In 5 by 2 cross-validation, a paired t-test is conducted between the performance of the 2 models compared, and a p value is generated under the null hypothesis that both models perform equally well on their given dataset.

All analyses were conducted using the scikit-learn (version 0.24.2) and pandas (version 1.5.0) packages in Python (Python Software Foundation) and R 4.1.0.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
