Development and performance evaluation of fully automated deep learning-based models for myocardial segmentation on T1 mapping MRI data

The study was conducted in three phases. In the first phase, the inclusion criteria for data collection were defined and the data were exported from the clinical systems. In the second phase, four different models based on a U-Net architecture were trained over several training epochs. In the third phase, the results of all models were evaluated on a test dataset and compared with the ground truth datasets of both raters. An overview of the study design is shown in Fig. 1.

Figure 1: Study design flowchart divided into three phases. Phase I (left) covers the definition of the data inclusion criteria as well as the data collection and annotation procedure performed by two independent raters, who created the ground truth dataset. Phase II (center) covers image preprocessing and file operations as well as the model specifications and training loops. Phase III (right) covers all post-processing procedures and the final analysis.

Patient selection and study design

We included 50 healthy participants, whose imaging data came from a previous prospective volunteer study20, and 75 patients who underwent a clinically indicated CMR at 1.5 T between 2020 and 2022 at our institution. All MRI examinations were performed on a 1.5 T scanner (Avanto fit; Siemens Healthineers, Erlangen, Germany). Details of the imaging protocols and of the T1 map generation have been published previously20. Patients were retrospectively selected by searching our institution's radiology information system. Patients were excluded for the following reasons: severe image artifacts (n = 12), pediatric patients (n = 3), number of short-axis T1 maps ≠ 3 (n = 4), and dataset errors (n = 4). Figure 2 shows the data collection process.

Figure 2: Description of the complementary dataset. The visualization shows how the two independently assessed datasets of healthy subjects (n = 50) and patients (n = 75) from routine care were assembled, how many views exist, and how the dataset is distributed across the partitions.

In the first step, MRI cine images were collected from the 50 healthy volunteers in two rounds, and the native T1 maps generated by the scanner were exported. The volunteer dataset consisted of 300 SAX and 100 4Ch images. In the second step, data from a further 75 patients (225 SAX images) were extracted from the clinical systems and segmented by two raters, R1 and R2. In addition, a subset of the patient dataset comprising 30 patients (n = 90 images) was segmented by both raters to evaluate interrater variability. Table 1 shows how each view is distributed across the training, validation, and test partitions.

Table 1: Data partitions overview.

Ethical approval and informed consent

This study was approved by the institutional review board (Ethics Committee, University Medical Center Rostock), and written informed consent was obtained from all 50 healthy volunteers, who were free from cardiovascular disease, prior to enrollment. For the additional 75 patients, the requirement for informed consent was waived owing to the retrospective nature of this part of the investigation. The study conformed to the ethical guidelines of the Declaration of Helsinki (2013 revision).

Manual image segmentation

Both datasets, the volunteer data and the patient data, were annotated independently by two experienced raters. Manual segmentation was performed using an open-source software solution (ITK-SNAP, version 4.0.0).

Image preprocessing

We selected a U-Net architecture to automatically segment the image data. U-Nets can be trained effectively on smaller training datasets when combined with data augmentation techniques such as mirroring, rotation, and scaling21. These properties are particularly helpful because, in the first preprocessing step before training, the resolution of the MRI images was scaled from 256 × 218 pixels (pixel size 1.41 × 1.41 mm) to a uniform format of 512 × 512 pixels (pixel size 0.70 × 0.60 mm). In practice, scaling converts the data into a uniform input format and thus makes the pipeline independent of image size and pixel spacing. We initially considered padding the images to avoid warping and to maintain square pixels; however, this reduced segmentation accuracy and was therefore not implemented.

To keep the residual interpolation error of each segmentation mask as small as possible, we evaluated several scaling methods (bilinear, Lanczos, bicubic, pixel-area relation, and nearest neighbor) and determined their influence on the error. For this purpose, mean T1 times were compared between the original images (256 × 218 pixels) and the scaled images (512 × 512 pixels). Nearest-neighbor interpolation was found to introduce the smallest error. No additional data augmentation techniques were used to increase the data volume. Grey-scale images were used without intensity normalization.
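As an illustration of this interpolation comparison, the sketch below rescales a T1 map with each of the five methods and measures the shift in mean myocardial T1. The paper does not name a resizing library; OpenCV is assumed here because its interpolation flags match the five methods listed, and `scaling_error`, `t1_map`, and `mask` are illustrative names.

```python
# Hypothetical sketch of the interpolation comparison; OpenCV (cv2) is an
# assumption, as the paper does not name the resizing library.
import cv2
import numpy as np

METHODS = {
    "nearest":  cv2.INTER_NEAREST,
    "bilinear": cv2.INTER_LINEAR,
    "bicubic":  cv2.INTER_CUBIC,
    "area":     cv2.INTER_AREA,      # "pixel-area relation"
    "lanczos":  cv2.INTER_LANCZOS4,
}

def scaling_error(t1_map: np.ndarray, mask: np.ndarray) -> dict:
    """Absolute change in mean T1 inside the myocardial mask when the
    256 x 218 map is rescaled to 512 x 512 with each method."""
    ref = t1_map[mask > 0].mean()
    errors = {}
    for name, flag in METHODS.items():
        scaled = cv2.resize(t1_map, (512, 512), interpolation=flag)
        # the mask itself is always resized with nearest neighbour
        # so that it stays strictly binary
        scaled_mask = cv2.resize(mask, (512, 512),
                                 interpolation=cv2.INTER_NEAREST)
        errors[name] = abs(scaled[scaled_mask > 0].mean() - ref)
    return errors
```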
Network architecture

Convolutional layers were used in both the encoder and the decoder, followed by Rectified Linear Unit (ReLU) or LeakyReLU activation functions. Each encoder sublayer was followed by a MaxPool layer to extract global features from the input images, at the cost of spatial information. In the decoder, transposed convolutions reconstructed the spatial resolution step by step and created detailed segmentation maps by transforming the activations of higher layers back to the original input size. Skip connections between the encoder and the decoder passed encoder features to the decoder; this combines local and global information and preserves fine details. The final layer had a single output channel to which the sigmoid activation function was applied to generate pixel class probabilities, allowing the creation of binary segmentation maps.

The effects of two activation functions were compared during model performance evaluation, as proposed in a previous work22. The model was therefore evaluated separately with each of these functions, which were implemented in the decoder block of the model. For the standard activation function ReLU, the output of a neuron equals its activation in the positive value range, so the output can grow arbitrarily with increasing activation; for activations x < 0, the neuron remains deactivated. In addition, LeakyReLU, a variant of ReLU, was used; it passes negative activations scaled by a factor α = 0.1.
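The paper does not publish its implementation; the following is a minimal sketch of how a decoder block with a switchable activation and the single-channel sigmoid output head could look in the Keras API of the TensorFlow version listed under Training. Filter counts, kernel sizes, and function names are assumptions.

```python
# Minimal sketch, not the authors' code: one U-Net decoder block with a
# switchable ReLU/LeakyReLU activation, plus the sigmoid output head.
from tensorflow.keras import layers

def decoder_block(x, skip, filters, use_leaky_relu=True):
    """Upsample, concatenate the encoder skip connection, then convolve."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])  # combine global and local features
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        # LeakyReLU passes negative activations scaled by alpha = 0.1
        x = layers.LeakyReLU(alpha=0.1)(x) if use_leaky_relu else layers.ReLU()(x)
    return x

def output_head(x):
    """Single output channel with sigmoid: per-pixel class probability."""
    return layers.Conv2D(1, 1, activation="sigmoid")(x)
```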
Training

To train the deep learning models, cardiac T1 maps of the left ventricle in short-axis view (SAX) and four-chamber view (4Ch), together with the corresponding segmentation masks, were used as input. The complementary dataset of 625 DICOM images was randomly divided into three independent subsets: 80% for training, 10% for interim validation (during training), and the remaining 10% for testing. Model training was performed on a workstation (Intel Core i7-12700 CPU, 2.10 GHz, 32 GB RAM; NVIDIA RTX A2000, 6 GB VRAM) with CUDA 11.6 and Python with TensorFlow 2.12.0, NumPy 1.23.5, pydicom 2.4.2, and MedPy 0.3.0.

During the training loop, the performance of each model was evaluated using binary (pixel-wise) accuracy and two overlap metrics. In particular, the metrics \(DSC\left({\text{S}}_{1},{\text{S}}_{2}\right)=2\left|{\text{S}}_{1}\cap {\text{S}}_{2}\right|/\left(\left|{\text{S}}_{1}\right|+\left|{\text{S}}_{2}\right|\right)\) and \(IOU\left({\text{S}}_{1},{\text{S}}_{2}\right)=\left|{\text{S}}_{1}\cap {\text{S}}_{2}\right|/\left|{\text{S}}_{1}\cup {\text{S}}_{2}\right|\) have been proposed in the relevant literature22,23,24 to measure the congruence of two pixel sets \({\text{S}}_{1}\) and \({\text{S}}_{2}\)25.

The training dataset was divided into batches of 16. A training loop of 5 to 70 epochs was implemented in steps of 5 epochs using the ADAM optimizer26 in combination with the variable activation layers and a decreasing learning rate.
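Both metrics translate directly into NumPy, as sketched below; the MedPy dependency listed above also ships reference implementations (medpy.metric.binary.dc and medpy.metric.binary.jc), although the paper does not state which was used.

```python
# Direct transcription of the two overlap metrics defined above;
# assumes non-empty binary masks.
import numpy as np

def dsc(s1: np.ndarray, s2: np.ndarray) -> float:
    """Soerensen-Dice coefficient of two binary masks."""
    s1, s2 = s1.astype(bool), s2.astype(bool)
    inter = np.logical_and(s1, s2).sum()
    return 2.0 * inter / (s1.sum() + s2.sum())

def iou(s1: np.ndarray, s2: np.ndarray) -> float:
    """Intersection over union (Jaccard index) of two binary masks."""
    s1, s2 = s1.astype(bool), s2.astype(bool)
    inter = np.logical_and(s1, s2).sum()
    return inter / np.logical_or(s1, s2).sum()
```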
Testing and performance evaluation

The mean T1 times of the ROIs were calculated for all trained models using the test dataset, which consisted of 63 images in different views (4Ch and SAX). This dataset can be regarded as an external validation set of previously unseen images. In addition, the absolute differences in T1 (ΔT1) between the predicted and the ground truth segmentations, as well as the metrics DSC and IOU, were determined.

The best model was defined as the one minimizing the error in T1 time quantification (\(\overline{\Delta \text{T}1}\)) while maximizing both segmentation accuracy metrics (\(\overline{\text{DSC}}\) and \(\overline{\text{IOU}}\)). For this purpose, a function \(D(\Delta \text{T}1,\text{DSC},\text{IOU})\) was defined in Eq. (1) and evaluated for all 56 models (four model variants × 14 epoch settings) on all 63 test images. For each model, the arithmetic averages \(\overline{\Delta \text{T}1}\), \(\overline{\text{DSC}}\), and \(\overline{\text{IOU}}\) were taken over the 63 absolute values. Because \(\overline{\Delta \text{T}1}\) should be as small as possible and \(\overline{\text{DSC}}\) and \(\overline{\text{IOU}}\) as large as possible, each model's averages were compared with the smallest \({\overline{\Delta \text{T}1}}_{j}\) and the largest \({\overline{\text{DSC}}}_{j}\) and \({\overline{\text{IOU}}}_{j}\) across all models, and the best-fitting model was determined as the one minimizing the least-squares distance

$${D}_{j}^{2}:={\left({\overline{\Delta \text{T}1}}_{j}-\underset{j}{\text{min}}\,{\overline{\Delta \text{T}1}}_{j}\right)}^{2}+{\left({\overline{\text{DSC}}}_{j}-\underset{j}{\text{max}}\,{\overline{\text{DSC}}}_{j}\right)}^{2}+{\left({\overline{\text{IOU}}}_{j}-\underset{j}{\text{max}}\,{\overline{\text{IOU}}}_{j}\right)}^{2},$$

(1)

where \(j=1,\dots ,56\) is the model number.
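Equation (1) reduces to a few lines of NumPy; in the sketch below, dt1, dsc, and iou stand for the per-model averages over the 63 test images (illustrative names).

```python
# Model selection per Eq. (1): distance of each model's averages to the
# ideal point (min dT1, max DSC, max IOU); the smallest distance wins.
import numpy as np

def select_best_model(dt1: np.ndarray, dsc: np.ndarray,
                      iou: np.ndarray) -> int:
    """dt1, dsc, iou: shape (56,), one average per model j.
    Returns the index j that minimizes D_j."""
    d_sq = ((dt1 - dt1.min()) ** 2
            + (dsc - dsc.max()) ** 2
            + (iou - iou.max()) ** 2)
    return int(np.argmin(d_sq))
```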
In the second step, the identified best model was compared with the ground truth segmentations of the first rater (R1) and the second rater (R2), as well as with the interrater variability, as shown in Fig. 3.

Figure 3: SAX ROI comparison between both raters and the prediction. Compared with the original T1 map, the masks of raters R1 and R2 and the predicted mask show slight differences, visualized in the lower rows as Err1 (prediction minus R1 reference) and Err2 (prediction minus R2 reference); binary masks (left) and intensity masks (right), zoom factor 2.3.

Statistical analysis

Geometrical segmentation accuracy was evaluated by the DSC (Sørensen–Dice coefficient) and the IOU (intersection over union). The error in quantitative mapping results was quantified as the error in T1 relaxation times (ΔT1). Bland–Altman analysis was performed to obtain the limits of agreement and the mean bias between model results. To compare model performance with interrater variability for the best model, differences in ΔT1, DSC, and IOU were compared between model vs. human and human vs. human (interrater variability) using the Mann–Whitney U test. A value of p < 0.05 was considered statistically significant. Statistical analyses were performed using the Python (3.11.2) libraries SciPy 1.10.1 and Seaborn 0.12.2.
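A minimal sketch of these two statistical steps with the SciPy version listed above; the array arguments stand for per-image measurements, and all function and variable names are illustrative.

```python
# Bland-Altman agreement and Mann-Whitney U comparison; variable names
# are illustrative, not taken from the paper.
import numpy as np
from scipy import stats

def bland_altman(a: np.ndarray, b: np.ndarray):
    """Mean bias and 95% limits of agreement of paired measurements."""
    diff = a - b
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, bias - half_width, bias + half_width

def model_vs_interrater(model_vs_human: np.ndarray,
                        human_vs_human: np.ndarray, alpha=0.05):
    """Two-sided Mann-Whitney U test, e.g. on per-image DSC values."""
    _, p = stats.mannwhitneyu(model_vs_human, human_vs_human,
                              alternative="two-sided")
    return p, p < alpha
```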
