Selection of pre-trained weights for transfer learning in automated cytomegalovirus retinitis classification

We collected the study’s dataset from the Department of Ophthalmology at Siriraj Hospital, Mahidol University, with Siriraj Institutional Review Board approval (SiIRB#552/2015). The research was conducted according to the 2002 version of the Declaration of Helsinki. Informed consent was obtained from all subjects for the unidentifiable use of their images.

Dataset and preprocessing

The dataset for this study was collected over a decade, from January 2005 to December 2015, at the Department of Ophthalmology, Siriraj Hospital. We used Kowa VX-10i and VX-20 fundus cameras to capture the retinal images. These RGB images, with a resolution of 1280 × 960 pixels, were saved in JPEG format. While most images focused on the central retinal areas, some also included lesions in the far peripheral retina.

For our analysis, the images were screened for clarity and readability and then classified into two categories: CMVR and Normal. The CMVR diagnosis was confirmed through clinical examination or molecular identification of CMV in the ocular fluid. The three main clinical presentations of CMVR are the fulminant (classic) form: sectoral, full-thickness, yellow-whitish retinal infiltrations with retinal haemorrhages; the indolent (granular) form: peripheral, granular, whitish retinal opacity; and the frosted branch angiitis (perivascular) form: perivascular infiltrations without retinal involvement. All CMVR images included one or a mixture of these characteristics. Normal photos were sourced from patients who underwent routine eye screenings, primarily for DR, and exhibited no retinal abnormalities.

Our dataset comprised a total of 955 images from 94 patients, with 524 categorised as CMVR and 431 as Normal. The patients’ demographics and characteristics are presented in Table 1. The dataset was pre-partitioned for training, validation, and testing, ensuring a comprehensive evaluation of the model’s performance. To enhance the robustness of our training dataset, we employed various image augmentation techniques, including flipping, mirroring, brightness adjustment, shifting, rotation, and zooming. The specific augmentation parameters were a width and height shift of 0.5, a rotation range of 90 degrees, horizontal and vertical flips, a brightness range from 0.1 to 2.0, and a zoom range of 0.25, as illustrated in the sketch below.
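The augmentation parameters listed above map directly onto Keras’ ImageDataGenerator; the following is a minimal sketch of one possible configuration. The use of ImageDataGenerator itself, the pixel rescaling, the directory layout, the input size, and the batch size are assumptions for illustration, not details reported in the text.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings taken from the parameters stated above.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # assumed pixel normalisation
    width_shift_range=0.5,
    height_shift_range=0.5,
    rotation_range=90,
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.1, 2.0),
    zoom_range=0.25,
)

train_flow = train_datagen.flow_from_directory(
    "data/train",                 # hypothetical layout: one sub-folder per class (CMVR, Normal)
    target_size=(224, 224),       # assumed DenseNet121 input size
    batch_size=32,
    class_mode="categorical",
)
```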
Table 1 Demographic data, characteristics, and diagnoses of patients.

CNN model

Our study employed DenseNet121 as the foundational model to assess various TL strategies. DenseNet, depicted in Fig. 1, was introduced by Huang et al. in 2017 [22]. This architecture is a derivative of ResNet and is distinguished by its use of shortcut connections, enabling each layer to be directly connected to every other layer. In DenseNet, the input to each convolutional layer comprises the aggregated feature maps from all preceding layers, and its output is fed into all subsequent layers. This feature-map concatenation enhances DenseNet’s computational and memory efficiency.

Figure 1 A schematic diagram of the DenseNet121 architecture.

The DenseNet architecture begins with an initial sequence of a 7 × 7 convolution, batch normalization, ReLU activation, and max pooling. This is followed by four dense blocks (Dense_blocks) and concludes with global average pooling, fully connected, and classification (softmax) layers. Between the dense blocks are transition layers consisting of batch normalization, ReLU activation, a 1 × 1 convolutional layer, and average pooling. Within each Dense_block is a series of Conv_blocks, which are combinations of 1 × 1 and 3 × 3 convolutional layers. The specific number of Conv_blocks varies with the DenseNet variant; DenseNet121, in particular, contains a total of 120 convolutional layers.

In our experiments, we modified the network by replacing the original top fully connected and classification layers with two new fully connected layers, a 50% dropout layer, and a 2-class classification layer, as illustrated in Fig. 2.

Figure 2 Proposed method to evaluate the transfer learning strategies.

Transfer learning

We compared different pre-trained weights used in TL. Since there is no established consensus on the selection of weights for retinal image datasets, we explored feature weights trained from three sources: ImageNet, sequentially trained from ImageNet to APTOS2019, and sequentially trained from ImageNet to CheXNet. All three weights were transferred to classify our retinal images via different fine-tuning levels (Fig. 2).
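As an illustration of the modified network and the fine-tuning depths compared here, the sketch below builds DenseNet121 without its original top and attaches the new head (two fully connected layers, a 50% dropout layer, and a 2-class softmax). The layer widths, the optimiser, and the number of frozen layers are illustrative assumptions, not values reported in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

# Backbone without the original top; pre-trained weights are loaded
# afterwards according to the TL strategy being evaluated.
base = DenseNet121(include_top=False, weights=None, input_shape=(224, 224, 3))

# New head: two fully connected layers, a 50% dropout layer, and a
# 2-class softmax, as described above (widths 256/128 are assumed).
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

# Different fine-tuning depths: freeze the first `n_frozen` backbone layers
# and train the rest; the actual cut points follow Fig. 2.
n_frozen = 100                                  # illustrative value
for layer in base.layers[:n_frozen]:
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),   # optimiser choice is an assumption
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```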

ImageNet weights come from a CNN trained on the ImageNet dataset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) to classify and localise 1,000 object classes [23]. The dataset contains over one million colour images of everyday scenes and objects, such as cats, dogs, and vehicles. Many state-of-the-art CNN models are pre-trained on this dataset, for instance VGG, ResNet, Inception, and DenseNet. The pre-trained ImageNet weights were obtained via the Keras applications module [24].
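A minimal sketch of obtaining the ImageNet backbone weights through the Keras applications module; the input size is an assumption.

```python
from tensorflow.keras.applications import DenseNet121

# DenseNet121 backbone initialised with the ILSVRC (ImageNet) weights
# distributed with Keras; the classification top is omitted.
imagenet_base = DenseNet121(include_top=False, weights="imagenet",
                            input_shape=(224, 224, 3))
```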

A sizeable retinal image dataset comes from the Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 Blindness Detection competition hosted on Kaggle [25]. The competition aims to classify DR images. The dataset contains 3,662 full-colour retinal fundus images in five classes: no DR, mild DR, moderate DR, severe DR, and proliferative DR. Briefly, a DenseNet121 model was trained on the APTOS2019 dataset using transfer learning from ImageNet, and the final weights were collected for our experiment from the Kaggle website [26].

CheXNet weights come from a deep learning algorithm that detects and localises 14 kinds of diseases in chest X-ray images [27]. Based on DenseNet121 and ImageNet transfer learning, the model was trained on the ChestX-ray14 dataset from the National Institutes of Health, containing 112,120 frontal-view X-ray (black-and-white) images from 30,805 unique patients [28]. The pre-trained weights were obtained through GitHub [29]. Chest X-ray images share some similarities with retinal images in that both are medical images with stereotypic, spatially preserved anatomy. They are also more abundant and publicly available than retinal image datasets, which makes them a potential intermediate source for sequential TL.
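For the two sequentially trained sources, the backbone weights can be loaded into the modified model by layer name so that the new classification head is simply skipped. The file names below are hypothetical placeholders for the weight files downloaded from Kaggle and GitHub, and `model` refers to the modified DenseNet121 sketched earlier.

```python
# Hypothetical file names for the downloaded weight files.
APTOS_WEIGHTS = "densenet121_aptos2019.h5"
CHEXNET_WEIGHTS = "chexnet_densenet121.h5"

# Load backbone weights by layer name; layers whose shapes do not match
# (the new two-FC-layer head) are skipped rather than raising an error.
model.load_weights(CHEXNET_WEIGHTS, by_name=True, skip_mismatch=True)
```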

First, we evaluated the depth of fine-tuning among the three pre-trained weights for the 2-class (CMVR vs Normal) identification. After determining the best depth, we further compared the diagnostic performance of the three best models. We assumed that sequential TL from a similar target domain would offer the best result. Statistical analyses were performed using one-way analysis of variance (ANOVA), with p-values < 0.05 considered statistically significant. All analyses were conducted using SPSS version 18.0 (SPSS, Chicago, IL, USA).

Performance evaluation

The experiments were performed on an Intel® Core™ i9-10940X CPU @ 3.30 GHz with 252 GB of RAM and an NVIDIA GeForce 3090 12 GB GPU for 100 epochs. The training lasted 8 h. Both the best and the 10-run average performances were assessed. We adopted several performance indices for model evaluation. Accuracy and loss are the two primary metrics monitored during model training. Accuracy is the proportion of correct predictions (where the predicted values equal the actual values) over the total number of predictions. Loss is a continuous variable expressing how much the prediction deviates from the true value. For the classification task, the default loss function is cross-entropy. During training, the optimiser adjusts the weights at each iteration to maximise accuracy and minimise loss. The formulae for accuracy and cross-entropy were defined as:

$$Accuracy= \frac{No. \, of \, correct\, predictions}{Total \, No.\, of\, predictions}$$
(1)
$$Cross\, entropy= -\sum_{i=1}^{N}\sum_{j=1}^{M}{y}_{i,j}\,\text{log}\left({p}_{i,j}\right)$$
(2)
where \({y}_{i,j}\) denotes the true value, i.e. 1 if sample i belongs to class j and 0 otherwise, \({p}_{i,j}\) denotes the probability predicted by the model that sample i belongs to class j, N is the number of samples, and M is the number of classes.

For model evaluation, we used a confusion matrix to visualise and summarise the performance of the classification algorithm. It represents counts of predicted and actual values as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The performance indicators derived from the confusion matrix are sensitivity (recall), specificity, positive predictive value (precision), accuracy, and F1 score.

$$Sensitivity\, \left(Recall\right)=\frac{TP}{TP+FN}$$
(3)
$$Specificity=\frac{TN}{FP+TN}$$
(4)
$$Positive\, predictive\, value \left(Precision\right)=\frac{TP}{TP+FP}$$
(5)
$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$$
(6)
$$F1 \, score=\frac{2TP}{2TP+FP+FN}$$
(7)
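The confusion-matrix indices and the cross-entropy above can be computed directly from predicted and true labels; a minimal NumPy sketch follows (the function names and label encoding are illustrative).

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Confusion-matrix indices for a binary CMVR (1) vs Normal (0) classifier."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),                   # Eq. (3)
        "specificity": tn / (fp + tn),                   # Eq. (4)
        "precision":   tp / (tp + fp),                   # Eq. (5)
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),  # Eq. (6)
        "f1":          2 * tp / (2 * tp + fp + fn),      # Eq. (7)
    }

def cross_entropy(y_onehot, p):
    """Cross-entropy of Eq. (2); y_onehot and p are (N, M) arrays of labels and probabilities."""
    return -np.sum(y_onehot * np.log(np.clip(p, 1e-12, 1.0)))
```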
Activation maps

To better understand the model’s behaviour, class activation heatmaps were produced to identify the predictive areas of the retinal image. We applied the Class Activation Mapping (CAM) architecture for this purpose [30]. In brief, CAM modifies the structure of the CNN model towards the end of the network: it replaces the fully connected layers with a global average pooling layer followed by a classification layer. This alteration allows maps to be generated from the weights of the classification layer. These maps are essentially heatmaps that highlight the areas of the input image that are influential for the classification task. The heatmaps were then upscaled and overlaid on the original image. We present class activation heatmaps from models pre-trained with the three feature weights to identify the hot spots triggering the classification.
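The sketch below shows one way a CAM heatmap can be produced, assuming a simplified head in which global average pooling feeds directly into the softmax classification layer, as in the original CAM formulation; the layer name, input size, and resizing step are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

# Simplified CAM-compatible model: backbone -> global average pooling -> softmax.
base = DenseNet121(include_top=False, weights=None, input_shape=(224, 224, 3))
pooled = layers.GlobalAveragePooling2D()(base.output)
scores = layers.Dense(2, activation="softmax", name="classifier")(pooled)
model = Model(base.input, scores)

def class_activation_map(image_batch):
    """Return an upscaled heatmap (H, W) for the predicted class of a (1, 224, 224, 3) batch."""
    cam_model = Model(model.input, [base.output, model.output])
    feature_maps, preds = cam_model.predict(image_batch)
    class_idx = int(np.argmax(preds[0]))                      # predicted class (CMVR or Normal)
    # Weights of the classification layer linking pooled channels to classes.
    w = model.get_layer("classifier").get_weights()[0]        # shape (channels, 2)
    # CAM: class-weighted sum of the final feature maps, rectified and normalised.
    cam = np.einsum("hwc,c->hw", feature_maps[0], w[:, class_idx])
    cam = np.maximum(cam, 0)
    cam /= cam.max() + 1e-8
    # Upscale to the input resolution so it can be overlaid on the original image.
    return tf.image.resize(cam[..., None], image_batch.shape[1:3]).numpy().squeeze()
```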
