Deep learning application of vertebral compression fracture detection using Mask R-CNN

Data source and preprocessing

The dataset used in this study consisted of lateral thoracolumbar radiographs of patients from Korea University Ansan Hospital. It contained 487 radiographs with fractures and 141 normal radiographs. Only X-rays confirmed as compression fractures based on MRI results were collected and labeled. The X-rays were de-identified before use, so that each patient's personal information was removed in accordance with ethical guidelines. Overall, 598 segmentation masks of marked fractures were extracted from the 487 lateral thoracolumbar X-rays and used to train and test each model.

A total of six MRI-based class labels were defined and their locations were marked during data preprocessing: T11, T12, L1, L2, L3, and L4 fractures. Two orthopedic experts labeled the location and the type of each vertebra using the open-source labeling software 'labelme', version 5.0.2 (https://github.com/labelmeai/labelme)16. Each polygon mask carried the fracture class (one of the six classes, T11-L4) and the coordinates of each vertex of the polygon outlining the identified fracture. Figure 1 shows an example of labeled data used in training; a sketch of converting such annotations into training masks follows the study settings below. Multiple VCFs were identified in approximately 20% of the patients and were labeled as separate polygons.

Figure 1. Example of labeled data. Each fracture is labeled with a polygon on the thoracolumbar radiograph based on the MRI results. Each polygon mask contains the x and y coordinates of its vertices. Each bounding box consists of the upper-left x and y coordinates, width, and height. The entire labeling process was conducted by two trained orthopedic experts.

Study settings

In this study, approximately 70% (346 radiographs) of the dataset was used to train the neural network, and approximately 15% each was allocated to validation (71 radiographs) and test data (70 radiographs). Training, validation, and test data were split in a stratified manner to preserve the class-wise distribution. Radiographs with no fractures were used only in the test phase. We used stochastic gradient descent with momentum as the optimization method, with a weight decay of 0.0001, a momentum of 0.9, and a decaying learning rate schedule. We used transfer learning17 to enhance model performance: each model was trained starting from weights pretrained on the COCO instance segmentation dataset18 (see the fine-tuning sketch below). Augmentation with horizontal flips and random rotations of up to 10 degrees was applied. A summary of our VCF dataset is given in Table 1.

Table 1. Summary of VCF dataset.
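As a concrete illustration of the preprocessing step described above, the sketch below converts one labelme annotation file into per-fracture binary masks and bounding boxes in the upper-left x, y, width, height convention of Fig. 1. It is a minimal example assuming labelme's standard JSON schema (a "shapes" list of labeled polygons plus image dimensions); the function and file name are hypothetical, not the exact pipeline used in the study.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

# Six MRI-confirmed fracture classes; assumes labelme labels use these exact strings.
CLASSES = ["T11", "T12", "L1", "L2", "L3", "L4"]

def labelme_to_masks(json_path):
    """Convert one labelme annotation file into per-fracture binary masks,
    class indices, and (x, y, width, height) bounding boxes."""
    with open(json_path) as f:
        ann = json.load(f)
    h, w = ann["imageHeight"], ann["imageWidth"]
    masks, labels, boxes = [], [], []
    for shape in ann["shapes"]:  # one polygon per marked fracture
        pts = [tuple(p) for p in shape["points"]]
        canvas = Image.new("L", (w, h), 0)
        ImageDraw.Draw(canvas).polygon(pts, outline=1, fill=1)
        mask = np.array(canvas, dtype=np.uint8)
        ys, xs = np.nonzero(mask)
        # Bounding box as upper-left x, y, width, height (as in Fig. 1).
        boxes.append((xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min()))
        masks.append(mask)
        labels.append(CLASSES.index(shape["label"]))
    return masks, labels, boxes
```

The study settings (COCO-pretrained weights, SGD with momentum 0.9 and weight decay 0.0001, a decaying learning rate) map onto a standard torchvision fine-tuning setup. The following sketch uses the ResNet-50-FPN variant that torchvision ships with COCO weights (the study used a ResNet-101 backbone), and the initial learning rate and schedule step are illustrative assumptions, as the paper does not report them.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 7  # six fracture classes (T11-L4) plus background

# Transfer learning: start from COCO instance segmentation weights.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads to predict our seven classes.
in_box = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_box, NUM_CLASSES)
in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, NUM_CLASSES)

# SGD with momentum 0.9 and weight decay 0.0001, as in the study settings;
# the initial learning rate (0.005) and step schedule are assumptions.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```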
Mask R-CNN

Mask R-CNN is an instance segmentation model based on the Faster R-CNN model19. Mask R-CNN20 introduced a segmentation branch, composed of four convolutions, one deconvolution, and one final convolution, to perform instance segmentation. Moreover, ROI Align was introduced to fix the information loss of ROI pooling caused by the misalignment between feature maps and ROIs (regions of interest), and significantly improved segmentation accuracy. The backbone of Mask R-CNN combines ResNet21 with a Feature Pyramid Network (FPN)22. The backbone uses residual learning to precisely extract object features, and a feature pyramid to fuse multi-scale features into high-quality feature maps. ROIs are then extracted from the feature maps by a region proposal network (RPN), and aligned and pooled by ROI Align. After the pooling layer, the model predicts segmentation masks using fully convolutional networks.

The structure of Mask R-CNN is shown in Fig. 2. Mask R-CNN has several applications in instance segmentation. It incorporated the structure of previous R-CNN models and improved on them to produce a faster, more accurate, and more effective instance segmentation model.

Figure 2. Mask R-CNN model architecture.

Backbone network

The backbone network was used to extract features from the input radiograph. We implemented ResNet101 with FPN as the backbone network to extract reliable feature maps. In the bottom-up pathway, the early ResNet layers extract low-level features such as corners and edges of the object, while deeper layers extract high-level features such as texture and color. In the top-down pathway, the FPN concatenates feature maps of different scales to better represent objects. The resulting feature maps were used by the RPN and ROI Align to generate candidate region proposals for detection. The structure of the backbone network is shown in Fig. 3.

Region proposal network and ROI Align

The RPN generates ROIs from the feature maps produced by the backbone network. A 3 × 3 convolutional layer scans the image in a sliding-window fashion to generate anchors for bounding boxes at different scales. Binary classification is performed to determine whether each anchor contains an object or represents background. The structure of the RPN is shown in Fig. 3. The bounding box regression generated samples and the intersection over union (IoU) was calculated for each: if a sample had an IoU higher than 0.7 with a ground truth box, it was defined as a positive sample, otherwise a negative sample (a sketch of this rule follows the mask prediction subsection). Non-maximum suppression (NMS) was applied to keep the regions with the highest confidence scores. The feature maps from the backbone network and the ROIs from the RPN were then passed to ROI Align for pooling, producing fixed-size feature vectors and feature maps. ROI Align was proposed to avoid the misalignment issue of the ROI pooling layer, which rounds the locations of the ROIs on the feature map; instead, bilinear interpolation is performed at the sample points in each grid cell before pooling.

Mask prediction

The feature vector output from the previous stage was used to calculate the class probabilities of each ROI for classification, and to train bounding box regressors that refine the location and size of the bounding box to accurately enclose each object. The mask branch predicted class-wise binary masks for each ROI using a fully convolutional network (FCN).

Figure 3. Backbone network and region proposal network. (a) Backbone network. Feature maps from ResNet are upsampled and resized with 1 × 1 convolutions to be concatenated with feature maps of different scales. (b) The region proposal network generates candidate regions for objects with a sliding window, referred to as anchor boxes, on the feature maps. Each anchor box undergoes both classification and bounding box adjustment.
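To make the RPN sampling rule above concrete, the following sketch labels anchors as positive when their best IoU with any ground truth box exceeds 0.7 and negative otherwise, exactly as described in the text, and notes torchvision's non-maximum suppression for the scored proposals. It is an illustrative fragment (boxes in x1, y1, x2, y2 form, at least one ground truth box assumed), not the internal Mask R-CNN implementation.

```python
import torch
from torchvision.ops import box_iou, nms

def label_anchors(anchors, gt_boxes, pos_thr=0.7):
    """Label each anchor 1 (positive) or 0 (negative) by its best IoU
    with any ground truth box, following the 0.7 rule in the text.
    (Faster R-CNN implementations usually also ignore anchors with
    intermediate IoU, a detail omitted here.)"""
    best_iou = box_iou(anchors, gt_boxes).max(dim=1).values
    return (best_iou > pos_thr).long()

# NMS keeps only the highest-confidence, non-overlapping proposals:
# keep_idx = nms(proposal_boxes, objectness_scores, iou_threshold=0.7)
```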
Evaluation metrics

True and false positives were defined by the value of the IoU. The IoU was calculated by dividing the overlap between the predicted area and the ground truth area by the area of their union. If the IoU of the predicted and actual regions exceeded a certain threshold, the detector's prediction was considered correct and defined as a True Positive (TP); if the IoU was below the threshold, the result was defined as a False Positive (FP). When the detector failed to predict a fracture, it was counted as a False Negative (FN). Specificity was calculated using the dataset containing no fractures: a True Negative (TN) was recorded when the detector predicted no fractures on a normal radiograph, and any false detection was counted as a False Positive (FP). We calculated sensitivity, specificity, accuracy, and F1-score from the resulting confusion matrix. Sensitivity is given by Eq. (1), specificity by Eq. (2), accuracy by Eq. (3), and F1-score by Eq. (4).

The detected regions were listed in order of confidence score and cumulative precision and recall values were computed. From these accumulated values we calculated a precision-recall curve and computed the AP as the area under it. The mean average precision (mAP) was calculated as the average AP score over the classes and used as the overall evaluation metric for comparing the DL models. AP scores were computed with Eq. (5).

$$Sensitivity = \frac{TP}{TP + FN}$$
(1)
$$Specificity = \frac{TN}{FP + TN}$$
(2)
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
(3)
$$F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
(4)
$$AP = \frac{1}{6}\sum_{confidence} Precision(Recall)$$
(5)
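For clarity, Eqs. (1)-(4) reduce to a few lines of arithmetic once the detection-level confusion counts are tallied. The helper below is a minimal sketch of that computation; the function name and arguments are illustrative, not part of the study's code.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy, and F1-score
    from detection-level confusion counts (Eqs. (1)-(4))."""
    sensitivity = tp / (tp + fn)                        # Eq. (1); also recall
    specificity = tn / (fp + tn)                        # Eq. (2)
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # Eq. (3)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (4)
    return sensitivity, specificity, accuracy, f1
```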
Ethical approval and consent to participate

This study was conducted in accordance with the Declaration of Helsinki. The study was approved by the Institutional Review Board of Korea University Ansan Hospital and conducted in accordance with the approved study protocol (IRB No. 2022AS0198). Due to the retrospective nature of the study, informed consent was waived by the Korea University Ansan Hospital Institutional Review Board and Ethical Committee.
