Comparative study of machine learning and statistical survival models for enhancing cervical cancer prognosis and risk factor assessment using SEER data

Data sources and inclusion criteria
This study used the nationwide SEER database, which has been maintained by the National Cancer Institute in the United States since 1975 and is one of the largest and highest-quality cancer cohort resources; it is among the most widely used population-based databases for cancer research. The study included female patients diagnosed with cervical cancer between 2013 and 2015. By selecting a slightly older cohort, we ensured that the data had been thoroughly vetted and that all pertinent follow-up information was available, reducing the potential biases and inaccuracies that can arise when using more recent, but possibly incomplete, data. Restricting the cohort to a specific time frame also promotes consistency in treatment practices and diagnostic criteria, which change over time, making the results more comparable and reliable over the chosen period.

Patients with cervical cancer were identified by the value "Cervix Uteri" in the "Site recode ICD-O-3/WHO 2008" field. However, some SEER patients did not receive the recommended therapies because of patient preference or comorbidities, which may have shortened their survival and reduced the models' performance. According to the National Comprehensive Cancer Network guidelines, cervical cancer patients should receive surgery (local tumor excision such as cone biopsy, hysterectomy, modified hysterectomy, or radical hysterectomy) or radiotherapy, depending on clinical stage. Based on the treatment-related variables in SEER, patients who received surgery or radiotherapy were classified as having received standard treatment.

Data preprocessing and selection
Variables for each cervical cancer (CC) patient were collected and analyzed, including age at diagnosis, marital status, race, tumor stage, use of radiotherapy and chemotherapy, tumor size, TNM staging, household income, survival months, and vital status (alive/dead). Several columns in the selected instances had missing values. Missing values were replaced with the mean or mode, depending on the attribute type, and a few rows with a high number of missing values were removed. After preprocessing, the cervical cancer dataset contained 3810 instances, stored as CSV files. The data instances were separated into training and test sets for training and evaluating the prediction models.

The "Survival months" attribute, a continuous variable giving the number of months a patient survived from the date of diagnosis, represents the primary outcome of interest in this study: the patient's survival time. The "Vital status" attribute, a binary variable indicating whether the patient is alive (0) or dead (1), is used to determine the patient's survival status at the end of the follow-up period and to identify censored observations.
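The imputation step can be reproduced in a few lines of pandas; the following is a minimal sketch, in which the file names and the 50% row-missingness threshold are illustrative assumptions rather than values reported above.

```python
# Minimal preprocessing sketch in pandas. File names and the 50%
# missingness threshold are illustrative assumptions.
import pandas as pd

df = pd.read_csv("seer_cervical.csv")  # hypothetical export of the SEER cases

# Drop the few rows with a high number of missing values
df = df[df.isna().mean(axis=1) <= 0.5]

# Impute remaining gaps: mean for numeric columns, mode for categorical ones
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

df.to_csv("seer_cervical_clean.csv", index=False)
```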
Methods

Semi-parametric survival models
The Cox regression model is a semi-parametric survival model that does not require a fixed distribution for the baseline hazard, making it more flexible than fully parametric survival models. It combines parametric and nonparametric components, with a focus on estimating covariate hazard ratios14. The methodology for estimating the baseline hazard function includes15:
Partial Likelihood: The Cox model first estimates the regression coefficients (β) of the covariates using partial likelihood, which does not require the baseline hazard \(h_0(t)\) to be specified. This method leverages the ordering of events to estimate the effects of covariates without explicitly estimating \(h_0(t)\).

Baseline Hazard Estimation: After estimating the coefficients β, the baseline hazard function \(h_0(t)\) is derived using the Breslow estimator, which provides a step-function estimate of the cumulative baseline hazard \(H_0(t)\). The cumulative baseline hazard at each event time \(t_j\) is estimated as (a toy numerical sketch follows this list):
$$H_0(t_j) = \sum_{i \le j} \frac{d_i}{\sum_{l \in R(t_i)} \exp(x_l \beta)}$$
where \(d_i\) is the number of events at time \(t_i\) and \(R(t_i)\) is the risk set at time \(t_i\).

Hazard Function: The baseline hazard \(h_0(t)\) corresponds to the increments (the derivative with respect to time) of the cumulative baseline hazard \(H_0(t)\). The hazard rate \(h(t|X)\) is then assumed to equal this baseline hazard modified by p covariates \(X = (X_1, X_2, \dots, X_p)\) and their respective coefficients \(\beta = (\beta_1, \beta_2, \dots, \beta_p)\):
$$h(t\,|\,X) = h_0(t)\, e^{\beta^T X}$$

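To make the Breslow step concrete, the following is a toy numerical sketch in Python/NumPy, assuming a single covariate and a coefficient already obtained by partial likelihood; the data values are invented for illustration.

```python
# Toy numerical sketch of the Breslow estimator, assuming the coefficient
# beta has already been obtained by partial likelihood. Data are invented.
import numpy as np

time  = np.array([2.0, 3.0, 3.0, 5.0, 8.0])  # follow-up times (months)
event = np.array([1, 1, 0, 1, 0])            # 1 = death, 0 = censored
X     = np.array([[0.5], [1.2], [0.3], [2.0], [0.7]])
beta  = np.array([0.8])                      # assumed pre-estimated

risk = np.exp(X @ beta)                      # exp(x_l beta) for each subject

H0 = 0.0
for t in np.unique(time[event == 1]):        # distinct event times, ascending
    d = np.sum((time == t) & (event == 1))   # deaths d_i at time t
    denom = risk[time >= t].sum()            # sum over the risk set R(t)
    H0 += d / denom                          # Breslow increment
    print(f"H0({t}) = {H0:.4f}")
```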
Usually, the primary interest lies in estimating the coefficients β, as the baseline hazard \(h_0(t)\) is considered a nuisance parameter16,9. The estimation of \(h_0(t)\) is done after the covariate effects have been determined using Cox's partial likelihood.

Parametric survival models
The hazard rate in parametric survival models is typically assumed to have some shape (flat, monotonic, etc.). One advantage of using a parametric distribution is that the time to event can be predicted well beyond the period in which events were observed17. Among the popular parametric survival regression models, the Weibull regression model, a simple parametric model, was considered in this study. In the Weibull model, the baseline hazard function \(h_0(t)\) is assumed to follow a Weibull distribution, defined as:
$$h_0(t\,|\,\lambda, \alpha) = \alpha \lambda^{\alpha} t^{\alpha - 1}, \qquad \lambda, \alpha > 0;\ t \ge 0$$
It has two positive real parameters, λ and α, to be estimated from a training set. When covariates \(X = (X_1, X_2, \dots, X_p)\) are included, the hazard function \(h(t|X)\) is expressed as:
$$h(t\,|\,X) = \alpha \lambda^{\alpha} t^{\alpha - 1} \exp(\beta^T X), \qquad \lambda, \alpha > 0;\ t \ge 0$$
where β is the vector of coefficients for the covariates. This formulation allows the model to account for individual differences in risk based on covariate values.

Key differences between semi-parametric and parametric models:

In parametric models like the Weibull model, the baseline hazard function is specified as a parametric function (e.g., \(h_0(t) = \alpha \lambda^{\alpha} t^{\alpha - 1}\) for the Weibull distribution). This allows direct estimation of the baseline hazard parameters18.

In the semi-parametric Cox model, the baseline hazard function is not specified in advance, providing more flexibility and reducing the risk of model misspecification. However, this also means that \(h_0(t)\) must be estimated non-parametrically after the covariate effects are determined19. A minimal fitting sketch contrasting the two approaches follows this list.

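The following is a minimal side-by-side sketch using the lifelines Python package (an assumed choice; the study states only that R and Python were used). Note that lifelines fits its Weibull regression in the accelerated failure time form, which for the Weibull distribution is a reparameterization of the proportional hazards form given above.

```python
# Side-by-side sketch with the lifelines package (assumed available).
# Covariates are assumed to be numerically encoded already.
import pandas as pd
from lifelines import CoxPHFitter, WeibullAFTFitter

df = pd.read_csv("seer_cervical_clean.csv")  # hypothetical preprocessed file

# Semi-parametric: baseline hazard left unspecified
cph = CoxPHFitter()
cph.fit(df, duration_col="Survival months", event_col="Vital status")
cph.print_summary()

# Parametric: Weibull-distributed baseline hazard (AFT parameterization)
wbl = WeibullAFTFitter()
wbl.fit(df, duration_col="Survival months", event_col="Vital status")
wbl.print_summary()
```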
Random forests for survival analysis
Random Survival Forests (RSFs) are an ensemble tree method for the survival analysis of right-censored data, derived from random forests20. Random survival forests were chosen here because they can handle high-dimensional data, capture complex interactions between variables, and are resistant to overfitting, all while providing variable importance measures that aid interpretation, outperforming other machine learning survival techniques21. Random forests average over decision trees, which individually have high variance but can capture complex interactions; this approach turns weak individual trees into a strong ensemble22.

Randomness is introduced into RSFs in two ways: by drawing B bootstrap samples of patients, one per tree, and by selecting a random subset of variables at each node. To grow a survival tree, binary splitting is applied recursively to each region (node) on a specific predictor so as to maximize survival differences between daughter nodes while minimizing differences within them. Splitting ends when a stopping criterion is met, and the resulting nodes are called terminal nodes. The log-rank test by Segal23 and the log-rank score test by Hothorn and Lausen24 are the most widely used splitting criteria. Each terminal node should contain at least a set number of unique events.

Survival trees are based on the principle of event conservation: the sum of estimated cumulative hazard estimates over time equals the total number of deaths, so the total number of deaths is conserved within each terminal node11. Ensemble mortality, a predicted outcome for survival data based on the ensemble cumulative hazard function, plays a role similar to the Cox model's prognostic index. RSFs can handle both large sample sizes and large numbers of predictors. They assume no particular functional form, instead recursively dividing the data into higher- and lower-survival groups based on the features, which allows for complex interactions. Furthermore, they achieve remarkable stability by combining the results of multiple trees, although this combination reduces intuitive interpretability compared to a single tree. A minimal fitting sketch follows.
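A minimal RSF sketch with the scikit-survival package (assumed available); the hyperparameter values are illustrative, not the tuned values from this study.

```python
# Minimal RSF sketch with scikit-survival; hyperparameters are illustrative.
import pandas as pd
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

df = pd.read_csv("seer_cervical_clean.csv")  # hypothetical preprocessed file
X = df.drop(columns=["Survival months", "Vital status"])
y = Surv.from_arrays(event=df["Vital status"].astype(bool),
                     time=df["Survival months"])

rsf = RandomSurvivalForest(
    n_estimators=500,      # number of bootstrapped trees B
    min_samples_leaf=15,   # enforces a minimum terminal-node size
    max_features="sqrt",   # random subset of variables tried at each split
    random_state=0,
)
rsf.fit(X, y)
print(rsf.predict(X.iloc[:5]))  # ensemble mortality-style risk scores
```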
Model training
The split-sample approach was used, dividing the data randomly into a training set (2/3) and a test set (1/3) with equal event/censoring proportions. To fine-tune the Random Survival Forest, 5-fold cross-validation was applied within the training set, a common and recommended practice in the machine learning community: it enables fair comparisons of different models and methodologies while striking a reasonable balance between computational efficiency and performance-estimation accuracy25. The training data were divided into five folds; four folds were used to train a model and the remaining fold to validate its performance, and this process was repeated for all fold combinations. The hyper-parameters were tuned through grid search, and the final models' performance was evaluated on the test set. Analyses were carried out using R 3.5.3 and Python. A sketch of this procedure follows.
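Continuing the sketch above (X and y as before), the split-sample design and the 5-fold grid search might look as follows; the parameter grid is illustrative, not the grid used in the study.

```python
# Sketch of the split-sample design and the 5-fold grid search.
from sklearn.model_selection import GridSearchCV, train_test_split
from sksurv.ensemble import RandomSurvivalForest

# 2/3 training, 1/3 test, stratified to keep event/censoring proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y["event"], random_state=0)

param_grid = {"n_estimators": [100, 300, 500],   # illustrative grid
              "min_samples_leaf": [5, 15, 30]}
search = GridSearchCV(RandomSurvivalForest(random_state=0),
                      param_grid, cv=5)          # 5-fold cross-validation
search.fit(X_train, y_train)

print(search.best_params_)
print("Test C-index:", search.best_estimator_.score(X_test, y_test))
```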
Assessing predictive performance on test data
The models' predictive performance was evaluated using survival curves, the concordance index, the Integrated Brier Score (IBS), and the error rate. Survival curves visualize the probability of survival over time, allowing a direct comparison of different groups' survival experiences or of the predictive performance of different models. We used the log-rank test to compare the survival curves of the Weibull, Cox, and Random Survival Forest (RSF) models against the observed survival data. A low p-value (< 0.05) indicates a statistically significant difference between the survival curves, implying that the models predict survival differently. This method is especially useful for determining whether one model predicts survival outcomes significantly better than the others.

The concordance index (C-index) is a popular measure of model performance in survival contexts and was used to assess the models' discrimination: how well a model ranks patients' survival times according to their predicted risk scores. The C-index is a time-independent rank statistic26 measured on a scale from 0.5 to 1, where 0.5 indicates no discrimination and higher values indicate a better ability to distinguish patients with different survival times.

The Integrated Brier Score (IBS) is a comprehensive metric of the accuracy of probabilistic predictions. As an extension of the Brier score, it is based on the mean squared difference between predicted probabilities and actual outcomes, combining calibration and sharpness into a single score27. Specifically, the Brier score for binary outcomes is calculated as
$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N} (p_i - o_i)^2$$
where N is the number of predictions, \(p_i\) is the predicted probability of the positive outcome for instance i, and \(o_i\) is the actual binary outcome. The IBS averages this measure across multiple time points, providing a useful summary of a model's predictive performance that captures both the reliability of its probability estimates and its ability to discriminate confidently between outcomes28. Additionally, the models' performance was evaluated using the error rate, the percentage of incorrect predictions made by the model; a lower error rate indicates better accuracy. A sketch computing the C-index and IBS on the test set follows.
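Continuing the sketches above, the C-index and IBS could be computed on the test set as follows; the evaluation time grid is an illustrative choice.

```python
# Sketch of the evaluation metrics with scikit-survival.
import numpy as np
from sksurv.metrics import concordance_index_censored, integrated_brier_score

rsf = search.best_estimator_  # tuned model from the previous sketch

# Discrimination: C-index from predicted risk scores
risk = rsf.predict(X_test)
cindex = concordance_index_censored(y_test["event"], y_test["time"], risk)[0]
print("C-index:", cindex)

# Calibration and discrimination over time: Integrated Brier Score,
# evaluated on an illustrative grid spanning the test follow-up times
times = np.percentile(y_test["time"], np.linspace(10, 90, 20))
surv_prob = np.vstack([fn(times)
                       for fn in rsf.predict_survival_function(X_test)])
print("IBS:", integrated_brier_score(y_train, y_test, surv_prob, times))
```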
Interpretability of the models
Interpretability is important for the medical community. The Weibull model, the Cox proportional hazards model, and Random Survival Forests offer different approaches to analyzing time-to-event data, with varying degrees of interpretability. The Weibull model is valued for its parametric simplicity and clear interpretation via its scale and shape parameters, which represent the time scale and trend of the hazard rate, respectively29. In contrast, the Cox proportional hazards model, a semi-parametric approach, allows hazard ratios to be estimated without specifying the baseline hazard function, making it highly interpretable with respect to how covariates affect the hazard, subject to the proportional hazards assumption9.

Here, the Weibull model and the Cox proportional hazards model were examined separately. Taking advantage of its semi-parametric nature, the Cox model was applied without defining the baseline hazard function h(t), while the Weibull model was treated as a parametric model in which h(t) follows the Weibull distribution. We chose to analyze each model independently to emphasize its unique strengths, although the two can be combined by using a Weibull distribution as the baseline hazard in the Cox framework, allowing simultaneous estimation of the Weibull parameters and the Cox coefficients.

Random Survival Forests, on the other hand, take a non-parametric approach, handling complex interactions and non-linear relationships without predefined assumptions about the data structure, but at the expense of straightforward interpretability, owing to their ensemble nature and the complexity of the underlying decision trees21. While RSFs are useful tools for assessing variable importance and partial dependence, their overall complexity and the nature of ensemble methods make them less interpretable than simpler models. Efforts to improve the interpretability of RSFs, such as visualizing decision trees and employing supplementary interpretability techniques, can help bridge this gap, but they require more effort and expertise11,30 (a permutation-importance sketch is given below).

Thus, while the Weibull and Cox models provide more direct insight into the effects of covariates on survival times, Random Survival Forests provide robust predictive capability together with a more nuanced approach to interpreting variable importance and effects on survival outcomes.
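As one example of such a supplementary technique, permutation variable importance can be computed for the fitted RSF with scikit-learn's inspection module, continuing the sketches above.

```python
# Permutation variable importance for the fitted RSF (scikit-learn's
# inspection module scores scikit-survival estimators via their C-index).
from sklearn.inspection import permutation_importance

result = permutation_importance(rsf, X_test, y_test,
                                n_repeats=15, random_state=0)
ranking = sorted(zip(X_test.columns, result.importances_mean),
                 key=lambda pair: -pair[1])
for name, importance in ranking:
    print(f"{name}: {importance:.4f}")
```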
