Performance enhancement of deep learning based solutions for pharyngeal airway space segmentation on MRI scans

The following experiments were conducted under the PyTorch framework. Models were trained until convergence. The reported dice scores were obtained by calculating the dice score for each MRI scan and then averaging across the MRI scans.

Configuration

Here, experiments were conducted to determine the configuration for later experiments. This configuration includes the capacity of the models, the data augmentation strategy, the preprocessing method, and the components of the models. The MRI images were resized from 320×290 to 320×288 to be compatible with the 4 layers of 2×2 max pooling. A poly learning rate schedule with the power set to 0.90, an initial learning rate of 0.001, and the Adam optimizer with \(\beta_{1} = 0.9, \beta_{2} = 0.999\) were used. With no augmentation, 120 epochs were sufficient for the 2d U-Net to converge. Each epoch consisted of passing each MRI scan in the training set through the network. With data augmentation, the models were trained for 240 epochs by default. If it was judged that a model had yet to converge, training was continued. Cases where training for more than 240 epochs resulted in improvement are marked with an asterisk (*). The difference between two results is considered statistically significant if the p-value of the corresponding two-sided t-test is less than or equal to 0.05.

Capacity and data augmentation strategy

To determine the best capacity and data augmentation strategy, a 2d U-Net was trained with different numbers of parameters and data augmentation strategies, keeping the preprocessing method and the components of the 2d U-Net fixed. For preprocessing, the MRI scans were normalized across the entire training batch, akin to batch normalization. For components, ReLU, residual connections, and batch normalization were used. 2d U-Nets with 1.7/7.0/28.0 million parameters were considered, corresponding to 16/32/64 filters in the first convolutional layer with subsequent layers adjusted following the pattern of the original U-Net in15. With regard to the augmentation strategy, no augmentation (No Aug), affine transformation (i.e. translate, rotate, and flip) (Aug 1), affine transformation and Gaussian noise (Aug 2), and affine transformation, Gaussian noise, and elastic deformation (Aug 3) were considered. All possible combinations of data augmentation strategy and capacity were tested. The results are shown in Table 1.

Table 1 Dice score on validation set for different capacities and augmentation strategies.

In terms of the average dice score, 16 filters (1.7 million parameters) with Aug 3 produces the best result. The difference between the best setup and all the other 32-filter setups except No Aug is not statistically significant. Nevertheless, since it presents the best average dice score, 16 filters and Aug 3 were used in later experiments.

Preprocessing method

Here, 2d U-Nets were trained with different preprocessing methods to determine the best one. The following preprocessing methods were considered: normalize across the entire training dataset (Batchnorm), normalize across each MRI scan (Layernorm), and clip and scale (to 1) each MRI scan based on the 90/99/99.9th percentile of pixel values in the entire training dataset (Clip 90/99/99.9). For the first and third preprocessing methods, the validation dataset is normalized or clipped and scaled according to values derived from the training dataset. The number of parameters of the 2d U-Net was set to 1.7 million (16 filters). Affine transformation, Gaussian noise, and elastic deformation (Aug 3) were used for data augmentation. Additionally, the components were fixed to ReLU, residual connections, and batch normalization. The results are shown in Table 2.
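As a minimal sketch of the clip-and-scale preprocessing described above, assuming the training scans are available as a list of NumPy arrays (the function names are illustrative and not part of the actual implementation), the 99th-percentile variant (Clip 99) could be realized as follows; the threshold is derived from the training set only and reused for the validation scans.

```python
import numpy as np

def fit_clip_threshold(train_scans, percentile=99.0):
    """Derive the clipping threshold from the pixel values of the entire training set."""
    all_pixels = np.concatenate([scan.ravel() for scan in train_scans])
    return np.percentile(all_pixels, percentile)

def clip_and_scale(scan, threshold):
    """Clip a scan at the training-derived threshold and scale intensities to [0, 1]."""
    return np.clip(scan, 0, threshold) / threshold

# Usage: the threshold is fitted once on the training fold and reused elsewhere.
# threshold = fit_clip_threshold(train_scans, percentile=99.0)
# preprocessed = [clip_and_scale(s, threshold) for s in train_scans + val_scans]
```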
Table 2 Dice score on validation set for different preprocessing methods.

Clip 99 presents the best average dice score. However, it is not statistically significantly better than the other preprocessing methods, except for Layernorm. Nevertheless, it was decided to proceed with Clip 99 as the preprocessing method.

Components

In the following experiments, the effect of using different types of activation (GELU/ReLU), normalization layers (layernorm/batchnorm/groupnorm with 16 groups), and connections (residual/skip) was considered. Results are shown in Table 3.

Table 3 Dice score on validation set for different components.

The GELU, residual, and groupnorm combination and the ReLU, skip, and groupnorm combination both achieve the equal best average dice score, but the difference is not statistically significant when compared with the majority of the other combinations. The effect of the different activations, connections, and normalizations was also compared by fixing the respective component and combining the results for the other components. For example, for GELU vs ReLU, the runs from every type of connection and normalization were combined, which results in 30 samples, and a paired t-test was performed. It was found that GELU is better than ReLU, residual connections are better than skip connections, groupnorm is better than both layernorm and batchnorm, and batchnorm is better than layernorm. Overall, since GELU is better than ReLU, residual connections are better than skip connections, and models with skip connections converge more slowly than those with residual connections, GELU, residual connections, and groupnorm were chosen as the default in later experiments.

Single stage methods

In this section, the difference between two results is considered statistically significant if the p-value of the corresponding two-sided t-test is less than or equal to 0.05.

2d U-Net

Here, the result from the best configuration obtained from the preceding experiments is presented. The average dice score from 5-fold cross validation was 0.9180 ± 0.0111. Additionally, a 2d U-Net was trained with this configuration on a separate train/validation/test split with an 80:10:10 ratio, resulting in 48/6/7 MRI scans in the respective sets. The validation set was used for early stopping. Validation/test dice scores of 0.9135/0.8968 were obtained. Looking at the cases individually, the dice score ranges from 0.8582 to 0.9428 for the MRI scans in the test set. A selection of the individual cases is shown in Figs. 3, 4, 5, and 6.
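As noted above, the reported dice scores are computed per MRI scan and then averaged across scans. A minimal sketch of this evaluation, assuming binary predicted and ground-truth masks stored as PyTorch tensors (the variable names are illustrative), is given below.

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> float:
    """Dice score between a binary predicted mask and a binary ground-truth mask."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return ((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)).item()

# Per-scan dice scores are averaged across scans to obtain the reported value.
# scores = [dice_score(p, t) for p, t in zip(predicted_masks, ground_truth_masks)]
# mean_dice = sum(scores) / len(scores)
```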
Fig. 3 Selected slices from a case where an overall dice score of 0.95 was achieved for the entire volume. The first column shows the MRI scan, the second column shows the ground truth mask, the third column shows the predicted mask, and the last column overlays the MRI scan with the ground truth mask and the predicted mask. In the last column, correct predictions are shown in yellow, under-segmentation in green, and over-segmentation in purple. This case can be considered successful, because the dice score between two manual segmentations is typically in this range3,6. It is seen that under-segmented parts in the present slice appear in succeeding slices, while over-segmented parts in the present slice appear in neighboring slices of the ground truth. This error is understandable considering that the spacing between slices is only 1.3 mm, which would be hard to label accurately even for humans.

Fig. 4 A less successful case, where a dice score of 0.85 was achieved. This figure follows the same format as Fig. 3. After looking through the training cases, it was found that the under-segmented portion of this MRI scan does not appear in the training cases, although it appears in another test case.

Fig. 5 A less successful case, where a dice score of 0.88 was achieved. This figure follows the same format as Fig. 3. The same kind of under-segmentation and over-segmentation as in Fig. 3 is seen. However, the errors accumulate into a relatively lower dice score.

Fig. 6 The 3d models were constructed from the first case, where the dice score was 0.95. The left volume is the 3d model generated by the predicted segmentation map of the 2d U-Net, while the right volume is generated by the ground truth segmentation map. The two volumes are shown from three different perspectives. Qualitatively, the left and right volumes have the same general outline, although some minute details were missed.

3d U-Nets

The results for the 3d U-Nets, along with the results of the other models, are presented in Table 4. The 3d U-Nets were implemented similarly to the 2d U-Nets, but with 3d 3×3×3 convolutions replacing the 2d 3×3 convolutions. The images were resized from 36×320×290 to 32×320×288 to be compatible with the 4 layers of 2×2×2 max pooling in the encoder of the 3d U-Nets. Two 3d U-Nets, with 8/16 filters and 1.3/5.1 million parameters respectively, were tested. The 3d U-Net with 16 filters had a higher average dice score than the one with 8 filters, but this difference was not statistically significant. Additionally, the t-test results indicate that the 2d U-Net is better than the 3d U-Net with 16 filters, but not better than the 3d U-Net with 8 filters.

Swinv2 UNETR

For the Swinv2 UNETR, the Swinv2 implementation by Hugging Face was used as the Swinv2 backbone of the model. Firstly, layernorm was replaced with groupnorm in the backbone, and the source code was edited to use GELU activation instead of ReLU activation. The Swinv2 implementation by Hugging Face has the option to specify the type of activation, but it was found that it still uses ReLU activation in spite of that. Secondly, in the original paper, the authors used volumetric images as input to the model, but 2d slices were used here instead. Consequently, the 3d convolutions in the original paper were replaced with 2d convolutions. Lastly, a smaller version of this model, with the number of embedding dimensions equal to 12 (1/4 of the original number of embedding dimensions) and subsequent layers adjusted accordingly, was also tested. This resulted in a smaller model with around 1.8 million parameters, comparable to the 2d U-Net, as opposed to the 27.5 million parameters of the larger model. The larger model had a higher average dice score than the smaller model, and this difference was statistically significant. In addition, the 2d U-Net was statistically significantly better than the smaller model but not the larger one. Nevertheless, the smaller model still produces good segmentation results with less training/inference time.
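The backbone modification described above (replacing layernorm with groupnorm in the Hugging Face Swinv2) can be sketched as a traversal of the module tree. The snippet below is only an illustration under the assumption that each LayerNorm in the backbone acts on channels-last tensors, so a small wrapper permutes dimensions before applying GroupNorm; the reduced configuration with embedding dimension 12 is shown, and the activation patch and the 2d-slice handling used in this work are not included here.

```python
import math
import torch
import torch.nn as nn
from transformers import Swinv2Config, Swinv2Model

class GroupNormChannelsLast(nn.Module):
    """GroupNorm for (batch, ..., channels) tensors as produced by the Swinv2 blocks."""
    def __init__(self, num_channels: int, num_groups: int = 16):
        super().__init__()
        # Fall back to a divisor of num_channels so GroupNorm remains valid.
        self.norm = nn.GroupNorm(math.gcd(num_groups, num_channels), num_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, -1)   # move channels to dim 1
        x = self.norm(x)
        return x.transpose(1, -1)  # restore the original layout

def replace_layernorm_with_groupnorm(module: nn.Module) -> None:
    """Recursively swap every nn.LayerNorm in the backbone for the wrapper above."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, GroupNormChannelsLast(child.normalized_shape[0]))
        else:
            replace_layernorm_with_groupnorm(child)

# Reduced backbone with embedding dimension 12 (illustrative configuration).
config = Swinv2Config(image_size=256, num_channels=1, embed_dim=12)
backbone = Swinv2Model(config)
replace_layernorm_with_groupnorm(backbone)
```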
SegFormer

The SegFormer is another Transformer-based model, like the Swinv2 UNETR. Similar to Swinv2, the implementation of SegFormer by Hugging Face was used, and layernorm was replaced with groupnorm. The smallest variant of SegFormer, denoted "MiT-b0", with 3.7 million parameters, was chosen because it has around the same capacity as the best performing 2d U-Net. Nevertheless, it should be noted that only the larger variants of SegFormer achieve state of the art performance on the ADE20k semantic segmentation task, with performance increasing from 37.4 to 42.2/46.5/49.4/50.3/51.0 as the number of parameters increases from 3.8 million to 13.7/27.5/47.3/64.1/84.7 million, respectively21. In SegFormer's last transformer block, the output resolution is reduced by a factor of 32 in each dimension. Therefore, the MRI scans were resized from 320×290 to 256×256 to keep the images as faithful to the originals as possible without increasing the computational cost. The output of the SegFormer model is a segmentation map whose resolution is reduced to 1/4 of the input in each dimension. The output was resized back to the original image resolution before calculating the dice score.

For SegFormer, much worse performance was observed compared to the other models above. What the other models have in common is an encoder-decoder structure that gradually combines higher resolution feature maps, which carry more local spatial information, with lower resolution feature maps, which carry more global context, until the original resolution is recovered. Conceptually, this could enable the model to recover lost local spatial information while maintaining global context.

Deeplabv3

The Deeplabv3 is another purely convolutional neural network. However, like the SegFormer, it has shown good performance in general semantic segmentation tasks, while its use in medical segmentation tasks has been limited. The original paper18 considered two backbones, ResNet-50/ResNet-101, for the model. The larger model (ResNet-101) showed better performance on the PASCAL VOC 2012 dataset. Nevertheless, the prior experiments indicate that this extra capacity is unnecessary for the current dataset. In torchvision, a Deeplabv3 implementation with a MobileNetV3-Large backbone is available. This implementation has 11.0 million parameters, as opposed to 42.0/61.0 million parameters for the ResNet-50/ResNet-101 backbones, respectively. Therefore, Deeplabv3 with the MobileNetV3-Large backbone was chosen instead.

Yolov8

The Yolov8 performs similarly to the other models from the general semantic segmentation literature. Again, the Yolov8 only contains connections from a maximum resolution of 1/8 of the full resolution, and its output segmentation map is at 1/8 of the original image resolution. This hinders the model from precisely segmenting the image at finer scales. Additionally, the Yolov8 does not optimize the dice loss directly. Instead, it optimizes a combination of intersection over union (for bounding boxes and segmentation), distribution focal loss, and binary cross entropy32. Potentially, this means that the comparison between Yolov8 and the other models is not entirely fair.

3D MRU-Net

The 3D MRU-Net was reimplemented using a similar setting on the same dataset. Both its performance and its computational cost in terms of the number of parameters are compared with the other methods in this paper. The details of the reimplementation are as follows.

Because some details were omitted in the original paper, some related settings had to be determined by us while reimplementing the 3D MRU-Net. For instance, in the original 3D MRU-Net paper, the residual blocks in Fig. 1 differ from those in Fig. 2. We chose to follow the design in Fig. 1, implementing the residual block as a sequence of convolution, GELU activation, convolution, addition, and GELU activation. This interpretation stays as close as possible to the original concept while filling in the omitted details. In addition, we chose to use GELU activation instead of ReLU activation for consistency with the other models used in our paper.
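A minimal sketch of the residual block as interpreted here (convolution, GELU, convolution, addition with the identity, GELU), shown for the 2d case with hypothetical channel counts; the actual reimplementation may differ in kernel sizes and in the use of normalization.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block as interpreted from Fig. 1 of the 3D MRU-Net paper:
    conv -> GELU -> conv -> add (identity) -> GELU."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.act(self.conv1(x)))
        return self.act(out + x)  # addition with the identity, then GELU

# Example with an illustrative shape:
# block = ResidualBlock(64); y = block(torch.randn(1, 64, 80, 72))
```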
Furthermore, the illustration in Fig. 2 of the original 3D MRU-Net paper shows feature maps that monotonically increase in size after each layer of the modified MobileNetv2 backbone. Thus, the image should instead start at 320×320 (or 640×640) and progressively decrease in size.

The ambiguity regarding whether the image size starts at 320×320 or 640×640 in Fig. 2 of the original 3D MRU-Net paper arises because the details of the last bottleneck residual block and of the feature map extracted from the MobileNetv2 backbone are omitted, so it was not possible to infer the exact details of the last bottleneck residual block. We assumed that the last bottleneck residual block has 320 output channels. Therefore, the number of channels of the concatenated feature maps extracted from the MobileNetv2 backbones is 960. Additionally, it appears that unconventional strides were used in some residual blocks in Fig. 2 of the original 3D MRU-Net paper. These would not be compatible with most image sizes, so we adjusted the total output stride to 16. This choice should not affect the total number of parameters of the network, but it could influence the computation time and the performance of the network. For the upsampling blocks, we set the number of output channels to 160/80/40/20 for stages 1/2/3/4, respectively. The prediction head was implemented as a 3d convolution layer with kernel size 1×1×1 and 1 output channel, followed by a sigmoid activation. In total, this resulted in a model with 5.0 million parameters. The model is therefore almost 3× larger than the 2d U-Net with 16 filters, around the same size as the 3d U-Net with 16 filters, and around 74% of the size of the Hybrid U-Net, which is the best performing model in our paper.

In terms of performance, 5-fold cross validation was performed for the 3D MRU-Net, as for the other models in our paper. The 3D MRU-Net achieved an average dice score of 0.8777 ± 0.0197. We speculate that the lower performance is due to the characteristics of the MRI dataset. Each scan consists of 36 slices with resolution 320×290, making the scans highly anisotropic. Therefore, after feeding the inputs from 3 different perspectives to the 3 MobileNetv2 backbones, the outputs have to be heavily resized before concatenating them and feeding the concatenated feature maps to the upsampling blocks. This assumption was tested, albeit only on the first fold, by reducing the number of MobileNetv2 backbones to one. A dice score of 0.8764 was achieved, compared to the original 0.8537.

Table 4 Dice score on validation set for various models.

Statistical test

Two-sided paired t-tests were performed at \(\alpha = 0.05\) to determine whether the differences between the models are statistically significant. For the best model, the 2d U-Net, it was found that the difference between this model and the other models is statistically significant, except for the original Swinv2 UNETR and the 3d U-Net with 8 filters. Aside from this, it was found that the differences between the models from the medical literature, except for the 3D MRU-Net, and the models from the general literature are statistically significant in favor of the former.
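A minimal sketch of the statistical comparison described above, assuming the matched dice scores of two models (per case or per fold) are stored in equal-length lists; SciPy's paired two-sided t-test is used, and the example values are illustrative only.

```python
from scipy.stats import ttest_rel

def compare_models(dice_scores_a, dice_scores_b, alpha=0.05):
    """Two-sided paired t-test between matched dice scores of two models."""
    result = ttest_rel(dice_scores_a, dice_scores_b)
    return result.pvalue, result.pvalue <= alpha  # significant if p <= alpha

# Example with illustrative (not actual) per-case scores:
# p_value, significant = compare_models([0.92, 0.91, 0.90], [0.90, 0.89, 0.91])
```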
Multiple stage methods

Below, the results from the multiple stage methods are reported. The methods were implemented using the model from the original paper as well as the 2d U-Net, which was the best model from the previous set of experiments. The results are summarized in Table 5. The difference between two methods was considered statistically significant if a two-sided paired t-test resulted in a p-value of less than 0.05.

Table 5 Dice score on validation set for multiple stage methods.

Crop

For the crop method, two identical models were used in succession. Both models were trained to segment the MRI scans, but the inputs to the second model are cropped based on the outputs of the first model. The smallest bounding box that covers the entire airway, as derived from the output of the first model rather than the ground truth, is found. This preliminary bounding box is then expanded by a certain percentage (the crop margin), keeping its center fixed. For the 3d U-Nets, a 3d bounding box was constructed for each scan, while for the 2d U-Nets, a 2d bounding box was constructed for each slice. A sketch of this cropping step is shown further below.

For the 2d U-Nets, the same set of augmentations used for the final 2d U-Net was used. The crop margin was set to 0.25, and the cropped input was resized to (160, 160) before passing through the second network. The final segmentation map is the output of the second network resized back to its original size and placed at its original position, with the rest of the pixels set to the background class. The loss/score is calculated on this final segmentation map. The results are shown in Table 5. Each fold of this two stage 2d U-Net shows worse performance than the plain 2d U-Net, and this difference was statistically significant.

However, for the 3d U-Nets, when the same set of augmentations was used, the network failed to converge, and its performance was dramatically worse than the other methods tested so far, as can be seen in Table 7. Initially, only rotations (but not flips) were removed from the set of augmentations, and different crop margins and sizes for the cropped inputs were tested. Crop margins of 0.25 and 0.5 and sizes of (32, 256, 256), (32, 160, 160), (16, 256, 256), and (16, 160, 160) were tested on the first fold. Additionally, in the cases where the input has depth 16, downsampling was not performed in the first stage of the U-Net. At the end of training, the training loss for all of these combinations remained at around 0.3. The results are shown in Table 6.

Table 6 Dice score on validation set of Fold 1 for 3d U-Nets crop with different crop margin and image size.

Afterwards, different sets of augmentations were tested on the first fold, with the crop margin and size fixed to 0.5 and (16, 256, 256) respectively, which were the hyperparameters that gave the best result in the previous experiment. The sets of augmentations tested were: full augmentation; no rotation (translate, elastic deformation, and Gaussian noise); no rotation and no elastic deformation (translate and Gaussian noise); no affine transformation (elastic deformation and Gaussian noise); and no augmentation. The results are shown in Table 7. The second and third combinations converged to a higher training loss, while the fourth and fifth combinations converged to a lower training loss but overfitted drastically. On the validation set, all the combinations achieved a score of approximately 0.89, except for full augmentation.
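The cropping step used by the two stage crop method (finding the smallest bounding box around the predicted airway, expanding it by the crop margin while keeping the center fixed, and cropping) can be sketched as follows for the 2d, per-slice case; the function and variable names are illustrative and the margin handling is one possible interpretation of the expansion described above.

```python
import numpy as np

def expanded_bbox(pred_mask: np.ndarray, margin: float = 0.25):
    """Smallest bounding box around the predicted foreground, expanded by `margin`
    while keeping the center fixed, clipped to the image bounds."""
    ys, xs = np.nonzero(pred_mask)
    if len(ys) == 0:  # nothing predicted: fall back to the full slice
        return 0, pred_mask.shape[0], 0, pred_mask.shape[1]
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    dy, dx = (y1 - y0) * margin / 2, (x1 - x0) * margin / 2
    y0, y1 = max(0, int(y0 - dy)), min(pred_mask.shape[0], int(y1 + dy))
    x0, x1 = max(0, int(x0 - dx)), min(pred_mask.shape[1], int(x1 + dx))
    return y0, y1, x0, x1

# The cropped slice image[y0:y1, x0:x1] is then resized (e.g. to (160, 160)) and passed
# to the second network; its output is resized back and placed at (y0:y1, x0:x1).
```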
It was decided to proceed with the fourth combination, and the crop margin/size were kept at 0.5/(16, 256, 256), respectively, for the rest of the folds, because this provided the best result in the previous experiments. Performance is reported in Table 5. It is seen that the performance of simply cropping for the 3d U-Nets is worse than that of a plain 3d U-Net, and this difference is statistically significant.

Table 7 Dice score on validation set of Fold 1 for 3d U-Nets crop with different sets of augmentations.

Attention

For the attention network, the attention module was implemented as described in9, except that the max pooling layer in their module was removed. In9, the max pooling layer in the attention module would have to be implemented with a stride of 1, which is unconventional, and the authors did not mention this issue. Therefore, it was decided to remove the max pooling layer. Nevertheless, the intended functionality should still be preserved to a good degree. The performance of the 2d/3d variants is reported in Table 5. Additionally, for the 3d variant, rotation was removed from the set of augmentations, as this was the minimum amount of augmentation that needed to be removed for the training loss to converge. Both variants were worse than their respective plain versions. The difference between the plain 2d U-Net and the 2d U-Net Attention was statistically significant. However, the difference between the plain 2d U-Net and the 3d U-Net Attention was not statistically significant, although the resulting p-value was very low (0.056).

Hybrid

For the hybrid method, a 2d U-Net and a 3d U-Net, along with a layer that combines the two models, are trained. First, a 2d U-Net is trained to segment the inputs. Afterwards, a 3d U-Net takes the images, together with the last feature map of the 2d U-Net (before its dense prediction head), as input. The dense prediction head of the 3d U-Net is omitted and replaced by a dense layer, the so-called "hybrid fusion layer", which takes the last feature maps of the 2d U-Net and the 3d U-Net as input and generates the final segmentation output. Hence, the 3d U-Net and the hybrid fusion layer are trained jointly in the second step, with the weights of the 2d U-Net frozen. In the last step, the whole model, with the weights of the 2d U-Net unfrozen, is fine-tuned. A sketch of the fusion layer is given below.

For each fold, this method presented an improvement over the standard 2d U-Net. However, this improvement is only marginal and is not statistically significant at p = 0.05. The results can be seen in Table 5.
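A minimal sketch of how the hybrid fusion layer could be realized, assuming the last feature maps of the 2d U-Net (stacked along the depth axis) and of the 3d U-Net have been brought to the same spatial size; the original description only specifies a dense layer, and a 1×1×1 convolution is one way to apply such a layer at every voxel. The channel counts are illustrative, not those of the actual models.

```python
import torch
import torch.nn as nn

class HybridFusionLayer(nn.Module):
    """Combines the last feature maps of the 2d and 3d U-Nets into one prediction."""
    def __init__(self, channels_2d: int, channels_3d: int):
        super().__init__()
        # A 1x1x1 convolution acts as a dense layer applied at every voxel.
        self.fuse = nn.Conv3d(channels_2d + channels_3d, 1, kernel_size=1)

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        # feat_2d: (B, C2d, D, H, W) -- per-slice 2d features stacked along depth.
        # feat_3d: (B, C3d, D, H, W)
        fused = torch.cat([feat_2d, feat_3d], dim=1)
        return torch.sigmoid(self.fuse(fused))

# Example with illustrative shapes:
# layer = HybridFusionLayer(16, 8)
# out = layer(torch.randn(1, 16, 32, 64, 64), torch.randn(1, 8, 32, 64, 64))
```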
Semi-supervised

Two semi-supervised methods were considered, based on image reconstruction11 and masked image modeling12, respectively.

In the original paper11 for image reconstruction, a 2d U-Net was used. Therefore, this method was also implemented here using only a 2d U-Net. For this method, the residual connections of the 2d U-Net were removed, and it was trained on a large set of unlabelled MRI scans, consisting of 670 scans, of which 536/134 were used for training/validation, respectively. The loss used was the mean squared error between the reconstructed image and the original image. The best weights from this step were then used to initialize the 2d U-Nets for each fold of the labelled dataset, which were then trained to reconstruct the image and segment the image at the same time. The loss used was a combination of the dice loss between the ground truth and the predicted segmentation map and the mean squared error between the reconstructed image and the original image. In this step, the reconstructed image and the segmentation map pass through the same pathway, except that the residual connections were omitted in the pathway of the reconstructed image, and at the end of the two pathways two separate dense prediction heads were utilized. The two losses were balanced by a parameter \(\alpha\). At \(\alpha = 10\), the dice loss and the mean squared error are both around 1 for randomly initialized weights. \(\alpha\) = 1, 10, 100 were tested here. Results are shown in Table 5, where the method is abbreviated as "2d U-Net IR". In general, there was no improvement in any of the folds at any level of \(\alpha\), and overall the average dice score decreases. The difference between the plain 2d U-Net and this pretrained 2d U-Net was statistically significant when \(\alpha\) was set to 1 and 100.

In the paper for masked image modeling12, the Swinv2 transformer model was used. The inputs to the Swinv2 were masked using patches of size 32×32. The ratio of unmasked to masked pixels was fixed to 0.6, and the number of patches was calculated accordingly. The locations of these patches were randomized. This procedure was implemented by replacing the embedded patches (input tokens) to the Swinv2 with a learnable mask token vector. A dense prediction head was added on top of the final feature map, whose resolution is 32 times smaller than that of the original image. This prediction was then upscaled before the loss was calculated between the reconstructed image and the original image. The original paper found that the l1/l2 losses had roughly the same performance. For this implementation, the Swinv2ForMaskedImageModeling model by Hugging Face was utilized. The best weights from this step were used to initialize the encoder of the Swinv2 UNETR in the second step. The second step follows the same training procedure as the single stage Swinv2 UNETR in the previous section. Results for the Swinv2 UNETR pretrained with masked image modeling are shown in Table 5. In general, there was an improvement in each fold except for fold 4. The average dice score also increases from 0.9156 to 0.9170. However, this improvement was not statistically significant.

To adapt masked image modeling for the 2d U-Net, the pixels of the feature map after the first convolutional layer of the 2d U-Net corresponding to masked regions were instead replaced by a learnable mask vector. For example, in the case where the number of output channels of the first convolutional layer was 16, the mask vector consisted of 16 learnable parameters. Unlike the first semi-supervised method, the residual connections were not removed, because the images were masked, so there was no concern about the U-Net simply learning the identity map. The loss used was the mean squared error between the original image and the reconstructed image. The best weights from this step were used to initialize the 2d U-Nets in the second step. The second step follows the same training procedure as the single stage 2d U-Net in the previous section. The results are reported in Table 5. Each fold shows a slight increase in performance except for Fold 3. On average, the dice score for the masked image modeling-pretrained 2d U-Nets was higher, but this difference was not statistically significant.
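A minimal sketch of the masking step adapted for the 2d U-Net, in which feature-map pixels in masked regions are replaced by a learnable mask vector after the first convolutional layer; the kernel size and channel count are illustrative, while the 32×32 patch masking follows the description above.

```python
import torch
import torch.nn as nn

class MaskedFirstConv(nn.Module):
    """First convolution of the 2d U-Net with masked-region features replaced
    by a learnable mask vector (one parameter per output channel)."""
    def __init__(self, in_channels: int = 1, out_channels: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.mask_token = nn.Parameter(torch.zeros(out_channels))

    def forward(self, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # image: (B, 1, H, W); mask: (B, 1, H, W) float tensor, 1 where masked.
        feat = self.conv(image)
        token = self.mask_token.view(1, -1, 1, 1)
        return feat * (1 - mask) + token * mask

# During pretraining, the mask is built from random 32x32 patches and the network
# is trained to reconstruct the original image with a mean squared error loss.
```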
