A novel approach for automatic classification of macular degeneration OCT images

Dataset

Our proposed method was evaluated on two publicly available databases. The first database was collected with the Heidelberg SD-OCT imaging system at Noor Eye Hospital (NEH) in Tehran, Iran19. This database comprises over 16,000 retinal OCT images from 441 cases, including 120 normal cases, 160 cases with DRUSEN, and 161 cases with CNV. To facilitate training and comparison, we algorithmically selected the most severely affected OCT images from each case in the database. For instance, if a patient was diagnosed with DRUSEN, only the images identified as DRUSEN among that patient’s OCT images were retained. Ultimately, 12,649 images were selected for training and testing, as outlined in Table 2. We divided the selected dataset into training and testing sets at a 9:1 ratio, resulting in 11,353 images in the training set and 1,288 images in the testing set. The second database, known as the University of California San Diego (UCSD) database20, consists of both a training set and a testing set, each categorized into four distinct classes: DRUSEN, CNV, DME, and NORMAL. The training set comprises a total of 108,312 retinal OCT images, as outlined in Table 2, while the testing set consists of 1,000 retinal OCT images, with 250 images allocated to each class.

Table 2 Distribution information of the NEH dataset and the UCSD dataset.

Experimental evaluation metrics

In this study’s experiments, we primarily calculate the accuracy, sensitivity, and specificity for each class from the confusion matrices obtained in image classification, and then aggregate the per-class results and compute their averages. The confusion matrix summarizes the classification performance of the model in terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Accuracy represents the overall ability of the model to classify samples correctly, with higher accuracy indicating better overall performance. Sensitivity indicates the model’s ability to correctly predict positive samples among the actual positive class, reducing the risk of false negatives. Specificity represents the model’s ability to correctly predict negative samples among the actual negative class, reducing the risk of false positives. The formulas for accuracy, sensitivity, and specificity are given in Eqs. (13)-(15) below; a minimal computational sketch is also provided after the discussion of Table 4.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(13)
$$Sensitivity=\frac{{TP}_{i}}{{TP}_{i}+{FN}_{i}}$$
(14)
$$Specificity=\frac{{TN}_{i}}{{TN}_{i}+{FP}_{i}}$$
(15)
Here, \(i\) denotes the corresponding value for class \(i\) when that class is treated as the positive class.

Experimental settings

In this study, we resized all images to 224 × 224 × 3. We employed five-fold cross-validation to ensure the soundness of model training, enhance the model’s generalization ability, and reduce overfitting. Furthermore, to increase the diversity of the data and improve the generality of the proposed model, we applied data augmentation techniques, namely random rotations, translations, brightness changes, scaling, and horizontal flipping, to the four subsets used for training in each training iteration. The validation set was not augmented. Table 3 provides detailed specifications of our data augmentation scheme.

Table 3 Data augmentation scheme.

All networks were trained using the stochastic gradient descent (SGD) optimizer with cosine annealing to adjust the learning rate. The momentum of the SGD optimizer was set to 0.9, weight decay to 1e−4, and the initial learning rate to 0.001; the minimum learning rate of the scheduler was set to 0.0005 to prevent the learning rate from decaying so far that the model falls into local minima or stagnates early. The number of epochs was set to 100 to ensure full convergence of the model, with a batch size of 32. The cross-entropy loss function was used to measure the training status and adjust the model’s parameters. The entire training process was end-to-end.

Results and analysis

We train the model on the NEH19 and UCSD20 datasets and evaluate its classification performance using 3-class and 4-class confusion matrices, respectively. The performance evaluation metrics include accuracy, sensitivity, specificity, and the model’s parameter count, as described above.

First, on the NEH dataset, we compared our model with several state-of-the-art networks known for their excellent classification performance, such as VGG14, ResNet16, DenseNet17, EfficientNet18, and EfficientNetV232, as well as recently proposed models for retinal OCT image classification. Among these models, two also adopt multi-scale architectures, and some are based on improvements to EfficientNet18. Table 4 shows the average performance of all models on the NEH dataset under five-fold cross-validation.

Table 4 Classification performance evaluation on the NEH dataset.

The results in Table 4 indicate that our model (MSA-NET) outperforms the other models in terms of classification accuracy, sensitivity, and specificity. MSA-NET achieves a classification accuracy of 98.1%, a sensitivity of 97.9%, and a specificity of 98.0%. Notably, MSA-NET outperforms well-known neural networks such as VGG1614, ResNet5016, DenseNet12117, and EfficientNetB018. In particular, compared to the EfficientNetB0 network, MSA-NET improves accuracy by 12.7%, sensitivity by 13.4%, and specificity by 5.9%. This comparison clearly highlights the substantial performance gain that MSA-Net achieves for retinal OCT image classification within the EfficientNetB0 framework. Furthermore, MSA-NET outperforms the advanced version of EfficientNet, EfficientNetV2, as well as the AlexNet model implemented by Kaymak et al.
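For reference, the metrics reported in Table 4 (and later in Table 5) follow Eqs. (13)-(15). The NumPy sketch below shows one way to derive the overall accuracy and the macro-averaged per-class sensitivity and specificity from a confusion matrix; it is an illustration rather than the authors’ code, and the example matrix is hypothetical.

import numpy as np

def per_class_metrics(cm):
    # cm: square confusion matrix, rows = true class, columns = predicted class
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly classified samples of each class
    fn = cm.sum(axis=1) - tp         # class-i samples predicted as another class
    fp = cm.sum(axis=0) - tp         # other-class samples predicted as class i
    tn = total - tp - fn - fp        # all remaining samples
    sensitivity = tp / (tp + fn)     # Eq. (14), one value per class
    specificity = tn / (tn + fp)     # Eq. (15), one value per class
    accuracy = tp.sum() / total      # overall accuracy, Eq. (13)
    return accuracy, sensitivity.mean(), specificity.mean()

# Hypothetical 3-class confusion matrix (CNV, DRUSEN, NORMAL), for illustration only.
acc, sen, spe = per_class_metrics([[480, 10, 5], [12, 470, 8], [3, 6, 294]])
print(f"accuracy={acc:.3f}, sensitivity={sen:.3f}, specificity={spe:.3f}")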
In addition, compared to the multi-scale structures proposed by Thomas et al.29 and Saman et al.30, the multi-scale structure in MSA-NET yields a clearly larger improvement in model performance. Furthermore, using depthwise separable convolutions as the foundation of the multi-scale structure ensures that MSA-NET incurs only a small increase in parameter count. While the ensemble network model constructed by Moradi et al.31, which integrates VGG16, EfficientNetB3, and DenseNet121, also improves model performance significantly, this approach makes the model more complex and greatly increases computational demands. Moreover, the comparison of parameter counts in Table 4 shows that our method maintains a relatively small parameter count while ensuring high accuracy. In particular, compared to the high-performing ensemble method proposed by Moradi et al.31, our method is markedly better in terms of parameter control while also showing a noticeable performance advantage. It is therefore evident that the proposed multi-scale structure and attention mechanism have a significant impact on model improvement; incorporating these two structures enables the model to classify OCT images more accurately.

A second comparative study, conducted on the UCSD20 dataset, not only reflects the classification performance of our model but also demonstrates its generalization ability. For MSA-Net, we directly fine-tuned the parameters trained on the NEH dataset using the UCSD dataset (a minimal sketch of this transfer setup is given after the discussion of Table 5 below). The performance of the model was again evaluated using five-fold cross-validation.

For comparison, we selected the FPN-VGG16 model proposed by Saman et al.30 and the ensemble network model proposed by Moradi et al.31, both of which performed well in the first experiment. Like our model, these two methods also used parameters trained on the NEH dataset as the starting point for training on the UCSD dataset, in order to assess the models’ generalization ability. Additionally, we compared against two studies by Fang et al.27,28. In one study, they introduced a feature fusion strategy that iteratively combines features within convolutional neural network layers; in the other, they used a Lesion Detection Network (LDN) to generate attention maps, which were then integrated into the classification framework. Both of these methods were trained directly on the UCSD dataset20. Table 5 presents the average performance of all models on the UCSD dataset under five-fold cross-validation.

Table 5 Classification performance evaluation on the UCSD dataset.

Table 5 shows that MSA-NET again achieves superior performance on the UCSD dataset, with an accuracy of 96.7%, a sensitivity of 96.7%, and a specificity of 98.9%. These metrics highlight the excellent classification performance of MSA-NET on the UCSD dataset and demonstrate its strong generalization ability. Compared to the methods proposed by Saman et al.30 and Moradi et al.31, MSA-NET also exhibits a performance advantage. It is noteworthy that the ensemble network model proposed by Moradi et al.31 performs comparably well; however, its ensemble design gives it a more complex structure and relatively more parameters, whereas MSA-NET, even with the added multi-scale structure and spatial attention mechanism, has fewer parameters and a more compact design.
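As noted above, MSA-Net was initialized with the weights learned on the three-class NEH task and then fine-tuned on the four-class UCSD data. The following is a minimal PyTorch-style sketch of such a transfer setup under our assumptions: an EfficientNetB0 stand-in is used in place of the actual MSA-Net, the checkpoint path is hypothetical, and the optimizer and scheduler simply reuse the values listed under Experimental settings, which may not match the exact fine-tuning configuration.

import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

# Stand-in backbone for illustration; in practice this would be the MSA-Net model itself.
model = efficientnet_b0(num_classes=3)
# Load the weights previously trained on the 3-class NEH task (hypothetical checkpoint path).
# model.load_state_dict(torch.load("msanet_neh.pth"))

# Replace the classification head for the 4-class UCSD task (CNV, DME, DRUSEN, NORMAL).
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, 4)

# Reuse the optimizer, scheduler, and loss settings given under Experimental settings.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.0005)
criterion = nn.CrossEntropyLoss()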
Compared to the two methods proposed by Fang et al.27,28, MSA-NET likewise demonstrates superior performance. The data from both experiments validate the effectiveness of the multi-scale structure and attention mechanism in MSA-NET, not only in improving model performance but also in ensuring strong generalization.

To further demonstrate the effectiveness of our proposed structure, we incorporated the proposed MSA (multi-scale structure and attention mechanism) into the DenseNet121 and ResNet50 models and conducted three-class classification experiments on the NEH dataset using five-fold cross-validation. The results are shown in Table 6.

Table 6 Experimental results of combining multi-scale structures and attention mechanisms with base network architectures.

The results in Table 6 show that incorporating our proposed MSA into other model architectures also improves their classification performance, which demonstrates the general effectiveness of the MSA. It performs best when combined with the EfficientNetB0 network, as EfficientNetB0 already has excellent classification performance. Inserting the MSA module into the inverted residual blocks of EfficientNetB0, which are repeated seven times in the network, maximizes the feature extraction capability of the MSA module, leading to better results than the other network combinations.

Ablation experiment

To investigate the specific impact of the two proposed structures on model performance, we conducted ablation experiments on the NEH dataset. These experiments combined the structures with the EfficientNetB0 model in various configurations. The evaluation metrics were accuracy, sensitivity, and specificity. It is important to note that none of the combinations use pre-trained parameters.

The first combination trains the EfficientNetB018 model alone. In the second combination, we add the spatial attention mechanism to EfficientNetB0, denoted EfficientNetB0 + SA. The third combination adds the multi-scale module to EfficientNetB0, placing the multi-scale module before the SE attention mechanism, denoted EfficientNetB0 + Multi-Scale. The final combination integrates both the spatial attention mechanism and the multi-scale module with the EfficientNetB0 network, denoted EfficientNetB0 + SA + Multi-Scale. These four experimental tasks were conducted to explore the effects of the different structural combinations. The results of the ablation experiments, carried out under five-fold cross-validation, are presented in Table 7.

Table 7 Results of ablation experiments on different structure combinations on the NEH dataset.

The results in Table 7 indicate that both the proposed multi-scale structure and the attention mechanism contribute to improving the model’s classification performance, with the multi-scale structure providing the largest gain. Incorporating both structures into the model achieves the highest performance, albeit at the cost of a longer training time; the performance gain, however, is considerably more pronounced. These findings suggest that placing the multi-scale structure before the attention mechanism optimizes the extraction of features from the multi-scale feature maps (a minimal sketch of such a module is given below).
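The sketch below illustrates a module of this kind: parallel depthwise separable convolutions with 3 × 3, 5 × 5, and 7 × 7 kernels, followed by a simple spatial attention gate. It is an illustrative PyTorch implementation under our assumptions, not the authors’ exact code; in particular, the fusion by summation and the specific attention formulation are assumptions.

import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    # Illustrative multi-scale + spatial-attention block (not the authors' exact code).
    def __init__(self, channels):
        super().__init__()
        # Parallel depthwise separable convolutions at three kernel sizes (3, 5, 7).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False),
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for k in (3, 5, 7)
        ])
        # Simple spatial attention: a 7x7 conv over channel-wise average- and max-pooled maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # Fuse the multi-scale features (here by summation; the fusion rule is an assumption).
        y = sum(branch(x) for branch in self.branches)
        avg_map = y.mean(dim=1, keepdim=True)
        max_map = y.amax(dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return y * attn

# Example: apply the block to a feature map with 40 channels.
block = MultiScaleSpatialAttention(channels=40)
out = block(torch.randn(2, 40, 28, 28))   # -> shape (2, 40, 28, 28)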
By enhancing feature representation in both the channel and spatial dimensions, such a design enables the model to prioritize crucial image features such as the macular region.

To investigate the contribution of each convolution kernel to the multi-scale structure, we grouped the three kernel sizes into three pairwise combinations: (3, 5), (3, 7), and (5, 7). We combined each of these with the EfficientNetB0 model and conducted three-class classification experiments on the NEH dataset using five-fold cross-validation. Each experiment incorporated our proposed attention mechanism. The results are shown in Table 8.

Table 8 Results of ablation experiments on different scale combinations on the NEH dataset.

The results in Table 8 show that each of the three pairwise combinations improves the model’s classification performance to some extent. When all three kernel sizes are combined, the model achieves the best accuracy and sensitivity among all comparisons, and the performance improvement is the greatest. This is because features from kernels of different sizes allow the model to extract more diverse and richer representations, leading to a more pronounced performance gain.

Visualization experiment

The pathological features identified by the model are of great significance to both doctors and patients, making them a crucial aspect of diagnosis. Therefore, in this section, we use the Grad-CAM33 technique to generate class activation maps (CAM) for visualizing the model and discuss its ability to identify category-specific image features. Figures 6 and 7 show examples of CAM generated by MSA-NET at the feature extraction layer for the four classes of images in the UCSD dataset and the three classes of images in the NEH dataset, respectively.

Figure 6 Examples of class activation maps (CAM) generated by MSA-NET on the UCSD dataset using Grad-CAM: (a) CNV case, (b) DME case, (c) DRUSEN case, and (d) NORMAL case.

Figure 7 Examples of CAM generated by MSA-NET on the NEH dataset using Grad-CAM. The left column shows the original images and the right column shows the CAM images: (a) CNV case, (b) DRUSEN case, and (c) NORMAL case.

The heatmap examples in Figs. 6 and 7 show that MSA-NET’s feature extraction layer predominantly focuses on the macular region. For the CNV, DME, and DRUSEN categories, MSA-NET accurately highlights the macular degeneration lesions. These CAM examples visually present the model’s focal points, demonstrating that the model has a certain level of interpretability and indicating that the proposed multi-scale structure and attention mechanism are highly effective in the feature extraction process. For both doctors and patients, CAM images can be used to illustrate the concrete behavior and practical value of the model. To further improve the model’s focus on fine image details, we will place greater emphasis on attention to detail in future research.
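For reference, heatmaps of the kind shown in Figs. 6 and 7 can be produced with a few lines of standard Grad-CAM code. The sketch below implements Grad-CAM with plain PyTorch hooks and uses an EfficientNetB0 stand-in for the backbone; the choice of target layer, the stand-in model, and the random input are assumptions for illustration only.

import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b0

# Stand-in backbone for illustration; in practice this would be the trained MSA-Net.
model = efficientnet_b0(num_classes=3).eval()
target_layer = model.features[-1]   # last feature-extraction block (an assumption)

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["a"] = output
    output.register_hook(lambda grad: gradients.update(g=grad))

target_layer.register_forward_hook(save_activation)

def grad_cam(image):
    # image: a preprocessed tensor of shape (1, 3, 224, 224)
    logits = model(image)
    score = logits[0, logits[0].argmax()]                       # score of the predicted class
    model.zero_grad()
    score.backward()
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)     # pooled gradients per channel
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # heatmap normalized to [0, 1]

heatmap = grad_cam(torch.randn(1, 3, 224, 224))    # replace with a real, preprocessed OCT image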
