CT-based radiomics for predicting breast cancer radiotherapy side effects

Clinical data collection and curation

The dataset consisted of 252 breast cancer patients who underwent radiotherapy between 2012 and 2016 at the Rechts der Isar University Hospital of the Technical University of Munich (TUM). For patient data acquired at TUM, retrospective analysis of patient records and data is generally permitted under Article 27 of the Bavarian Hospital Act (Bayerisches Krankenhausgesetz) from the Landeskrankenhausgesetz des Freistaates Bayern. Informed consent for treatment was obtained from every patient. Institutional Review Board (IRB) approval was obtained from the review board of TUM (reference number 466/16 s).

Clinical variables were defined based on a literature review of known clinical predictors from previous publications. Variable selection was further constrained by data availability, which hindered the assessment of other predictive factors [23,24]. The variables collected were: smoker status, chemotherapy received, radiotherapy boost, the maximum prescribed radiation dose expressed as equivalent dose in 2 Gy fractions (EQD2, \(\alpha/\beta = 3\)), total breast volume (TBV), and the two prediction targets: (i) moist cell epitheliolysis as a surrogate for Common Terminology Criteria for Adverse Events (CTCAE) grade 2 skin inflammation [25] (33 positive cases; henceforth referred to simply as moist epitheliolysis); and (ii) presence of any edema (26 positive cases).

Radiomics data collection and curation

Prior to radiotherapy (RT), planning CT images of the breast were acquired. Figure S1 shows the acquisition parameters for these CT images. Exclusion criteria encompassed breast implant and mastectomy cases. Two separate volumes of interest (VOIs) were segmented, creating two radiomics cohorts: TBV, containing radiomics information from the whole breast tissue; and glandular tissue (GT), containing radiomics information from the glandular tissue only. Patient outcome assessment was performed retrospectively by a medical student after thorough training by a radiation oncologist (JCP).
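For orientation, EQD2 is conventionally derived from the linear-quadratic model as EQD2 = D · (d + α/β) / (2 + α/β), where D is the total dose and d the dose per fraction. The sketch below applies this standard relation with α/β = 3 as in the clinical variable list above. Whether the authors used exactly this form is an assumption, and the function name and the two fractionation schedules are hypothetical illustrations, not study data.

```python
def eqd2(total_dose_gy, n_fractions, alpha_beta=3.0):
    """Equivalent dose in 2 Gy fractions under the linear-quadratic model."""
    dose_per_fraction = total_dose_gy / n_fractions
    return total_dose_gy * (dose_per_fraction + alpha_beta) / (2.0 + alpha_beta)


# Hypothetical schedules for illustration only (not taken from the study).
conventional = eqd2(50.0, 25)       # 2 Gy per fraction, so EQD2 equals the total dose
hypofractionated = eqd2(40.05, 15)  # 2.67 Gy per fraction, so EQD2 exceeds the total dose
print(conventional, round(hypofractionated, 2))
```

With 2 Gy fractions the conversion leaves the dose unchanged, which is a quick sanity check on any implementation.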
All methodology has been conducted in accordance with the relevant guidelines and regulations.

Segmentation of the volumes of interest was performed manually by NW using 3D Slicer [26]. GT was defined using the fast growcut function. BSpline interpolation was used to perform isotropic resampling to a voxel size of 1 × 1 × 1 mm. Image discretization was carried out with a fixed bin width of 10. Laplacian of Gaussian filtering was used for image reconstruction (sigma values of 1.0, 2.0, 3.0, 4.0 and 5.0).

Radiomics features were extracted and filtered from the CT images and both segmentations using the Python library PyRadiomics [27] (version 3.0.1; Python version 3.8.10). A total of 104 features were obtained for each of the radiomics cohorts, including first-order, shape, and texture features (the latter comprising "gray-level co-occurrence matrix", "gray-level size-zone matrix", "gray-level run-length matrix", "neighboring gray-tone difference matrix", and "gray-level dependence matrix" features). Figure 1 shows a diagram of the process used to collect clinical features, radiomics features, and side effects from the patients. Further, Fig. S2 shows the distribution of patients across all clinical features and side effects measured.

Fig. 1 Patient and data flowchart. The left and central branches show the clinical and radiomics features, respectively. The right branch shows the three RT side effects used as prediction targets.

Feature pre-processing and hyperparameter optimization

Repeated nested cross-validation was employed to train and validate the models.
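As a rough illustration of repeated nested cross-validation, the sketch below uses scikit-learn with the fold counts given in the statistical analysis section (5 outer folds, 4 inner folds). Several things here are assumptions rather than the authors' implementation: the stratified splitting, the synthetic data, the single logistic regression model, the small `C` grid, and the helper name `repeated_nested_cv`. Feature selection and class-imbalance correction are omitted for brevity, and only 2 repetitions are run instead of 50 to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (NOT the study data): 252 samples, roughly 13% positives.
X, y = make_classification(n_samples=252, n_features=20,
                           weights=[0.87, 0.13], random_state=0)


def repeated_nested_cv(X, y, n_repeats=2, n_outer=5, n_inner=4):
    """Return one balanced-accuracy score per outer test fold and repetition."""
    scores = []
    for rep in range(n_repeats):
        outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=rep)
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(X, y):
            # Scaling lives inside the pipeline, so it is re-fitted on each
            # inner training split and never sees validation or test data.
            pipe = Pipeline([
                ("scale", MinMaxScaler()),
                ("clf", LogisticRegression(max_iter=1000)),
            ])
            search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                                  scoring="balanced_accuracy", cv=inner)
            search.fit(X[train_idx], y[train_idx])
            y_pred = search.predict(X[test_idx])
            scores.append(balanced_accuracy_score(y[test_idx], y_pred))
    return np.array(scores)


scores = repeated_nested_cv(X, y)
print(len(scores), round(float(scores.mean()), 3))
```

Each repetition yields one refitted model per outer fold, so 50 repetitions with 5 outer folds would give 250 models.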
The radiomics features were normalized using min-max normalization, which rescales each feature to the [0, 1] range while preserving the shape of its original distribution.

For each cohort, features were selected and evaluated in two different ways. The first was a double Spearman rank correlation test: first within each dataset, with a cut-off value of 0.9 to remove redundant features; and then against each side effect prediction target, in order to keep the most relevant features. The second was minimum redundancy-maximum relevance (MRMR; version 1.0.2), which incorporates both tests in a single step [28]. In both cases, the information density, and therefore the number of features to select, was estimated using Principal Component Analysis (PCA). For the TBV radiomics feature set, an average of 23 and 39 features were selected when using MRMR and the double Spearman rank correlation test, respectively. For the GT radiomics feature set, an average of 26 and 44 features were selected when using MRMR and the double Spearman rank correlation test, respectively.

Before finding the optimal hyperparameter values, the class imbalance of the side effect prediction targets was corrected according to the level of disproportion. Moist epitheliolysis and edema had negative-to-positive class ratios of 6.64:1 and 8.69:1, respectively. Both were therefore corrected using a combination of synthetic minority over-sampling technique (SMOTE; imbalanced-learn library version 0.11.0) [29] to a ratio of 2:1, followed by random under-sampling of the majority class to a ratio of 1.25:1. The ratios for each step were chosen to balance the risk of excessive oversampling against the loss of too many samples during undersampling. Balanced accuracy (BA), which can handle the small residual class imbalance, was used as the optimization criterion for the hyperparameter values.
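The class ratios quoted above can be checked with a short calculation. The sketch below recomputes the initial negative-to-positive ratios from the reported case counts and then works out the class sizes after each resampling step. It is applied to the full cohort purely for illustration: in the study the correction was applied inside the cross-validation training splits, so the actual counts would be smaller. The helper name `resampled_counts` and the truncation to whole samples are assumptions. In imbalanced-learn, these targets would correspond to float `sampling_strategy` values of 0.5 and 0.8 (minority over majority), although the exact arguments used by the authors are not stated.

```python
def resampled_counts(n_neg, n_pos, smote_ratio=2.0, under_ratio=1.25):
    """Class sizes after each step, as (negatives, positives) tuples.

    Ratios are negative:positive, as quoted in the text. Counts are
    truncated to whole samples, which is an assumption of this sketch.
    """
    # Step 1: oversample positives until negatives:positives = smote_ratio.
    pos_after_smote = max(n_pos, int(n_neg / smote_ratio))
    # Step 2: undersample negatives until negatives:positives = under_ratio.
    neg_after_under = min(n_neg, int(pos_after_smote * under_ratio))
    return (n_neg, pos_after_smote), (neg_after_under, pos_after_smote)


n_total = 252  # patients in the dataset
for name, n_pos in [("moist epitheliolysis", 33), ("edema", 26)]:
    n_neg = n_total - n_pos
    after_smote, after_under = resampled_counts(n_neg, n_pos)
    print(name, round(n_neg / n_pos, 2), after_smote, after_under)
```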
Hyperparameter optimization was conducted using an exhaustive grid search, in which every combination of hyperparameter values is evaluated on the validation set of the innermost fold and the best-performing combination is kept.

Machine learning modeling

Four ML algorithms were implemented and evaluated: logistic regression (LR), used for its simplicity and efficiency in binary classification tasks with a low feature set dimensionality [30,31]; least absolute shrinkage and selection operator (LASSO), a variant of LR with an optimizable regularization term that can potentially better handle imbalanced datasets [32]; support vector machine (SVM), a highly flexible algorithm that supports multiple kernels and can capture non-linear relationships in the data [33]; and random forest classifier (RF), an ensemble learning method based on decision trees that is more robust to overfitting [34]. All models were imported from the Python library scikit-learn (version 1.0.2) [35]. These models were compared against clinical model baselines.

After comparing the four model types for each of the radiomics cohorts and feature selection types, the best models were retrained and optimized with clinical data added, in order to assess whether a combined model yields better performance in predicting the presence of any side effect. The workflow followed by the ML pipeline is shown in Fig. 2. In addition, larger reference images of the respective VOIs can be seen in Fig. S1.

Fig. 2 Workflow of the pipeline used in the study to analyze both clinical and radiomics data. On the left half of the workflow: clinical features were obtained from all patients with CT imaging available, the respective VOIs (TBV and GT) were segmented, and the subsequent radiomics features were extracted. On the right half of the workflow: for each evaluated dataset, a 50-repeat nested cross-validation was performed.
Within the inner fold, normalization, feature selection, and an exhaustive grid search for optimal hyperparameters were performed.

Feature selection was analyzed for all relevant models by estimating a score based on the feature importance assigned by the models and on how often each feature was selected. The resulting score is calculated as \(\mathrm{Score} = \mathrm{Feature\ Importance} / [(n + 1) - m]\), where n is the number of models and m is the number of times the feature has been chosen.

Finally, the correlation between the breast volume and the prediction probability of the best model was analyzed to study the overall impact of breast volume on the predictive value of radiomics features. An additional model, using the best-performing configuration, was evaluated after excluding radiomics features that were highly correlated with breast volume (Spearman correlation higher than 0.8). The objective was to assess the impact of volume-correlated features on the performance of radiomics models.

Statistical analysis

Training and validation of the different models were performed using 50 repetitions of nested cross-validation (5 outer folds, 4 inner folds). This resampling technique provides additional statistical robustness, resulting in 250 final models whose results were aggregated into the final test results.

In order to gather more information from the radiomics features, PCA was employed to estimate the information density within this dataset. The variance retained by the PCA components was used to understand the intrinsic dimensionality of the dataset. However, because PCA components are linear combinations of the original features and generally pack information more densely, they should not be used as a replacement for feature selection, only as an estimate.
The reason is the inherent added difficulty of tracing feature importance back to the original features.

Normalization, feature selection, and class imbalance correction were applied in the inner fold of the nested cross-validation, in order to avoid data leakage from any training split to the validation (inner fold) or test (outer fold) splits.

One of the two feature selection techniques used in this study is a double Spearman rank correlation test. This approach is intended to optimize feature selection by addressing redundancy and relevance in two distinct steps. First, redundancy is removed, so that features that do not provide additional information are eliminated. Second, the Spearman rank correlation test is applied again, this time between each remaining feature and the prediction target, keeping the features that are most relevant to the target.

The performance of the aggregated models was measured using a combination of metrics: BA, F1, precision, recall, specificity, area under the receiver operating characteristic curve (AUROC), and Matthews correlation coefficient (MCC). Metrics are given with 1.96 standard errors, corresponding to a 95% confidence interval. ROC curves were also used to evaluate the trade-off between sensitivity and specificity across different decision thresholds, and to assess how well each model discriminates between classes.
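To make the reporting convention concrete, the sketch below computes BA and MCC from confusion-matrix counts and then summarizes a set of per-fold scores as a mean with a half-width of 1.96 standard errors. The confusion counts are hypothetical, the function names are illustrative, and taking the standard error as the sample standard deviation divided by the square root of the number of scores is an assumption about how the interval was obtained.

```python
import math


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when a marginal sum is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mean_with_ci95(values):
    """Mean and 95% half-width (1.96 standard errors of the mean)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, 1.96 * math.sqrt(variance / n)


# Hypothetical (tp, tn, fp, fn) counts from three test folds, illustration only.
folds = [(5, 40, 4, 2), (4, 38, 6, 3), (6, 41, 3, 1)]
ba_scores = [balanced_accuracy(*f) for f in folds]
mean_ba, half_width = mean_with_ci95(ba_scores)
print(f"BA = {mean_ba:.3f} +/- {half_width:.3f}")
```

In the study the same summary would be taken over the 250 models produced by the repeated nested cross-validation rather than over three folds.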
