An analysis of decipherable red blood cell abnormality detection under federated environment leveraging XAI incorporated deep learning

This section presents the final findings of our implementation, along with an evaluation of how well our models identify RBC abnormalities. For each model, we report precision, recall, F1-score, confusion matrix, AUC score, ROC curve, accuracy, and loss. Eqs. (4), (5), (6), (7) and (8) give the formulae used to compute accuracy, precision, recall, F1-score, and specificity. Our key objective was to achieve higher test accuracy while simultaneously reducing model loss. Every model was trained for a total of fifty epochs using the Adam optimizer with a learning rate of 0.00001. After training, the best-performing DL model was selected as the global model based on the results obtained, and was then trained in an FL environment. Finally, we summarize our efforts by comparing against other state-of-the-art methods published previously.$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(4)
$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(5)
$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(6)
$$\begin{aligned} F1\text{-}score = \frac{2\times TP}{2\times TP+FP+FN} \end{aligned}$$
(7)
$$\begin{aligned} Specificity = \frac{TN}{TN+FP} \end{aligned}$$
(8)
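A direct implementation of Eqs. (4)-(8) can be sketched as follows; the counts below are hypothetical, chosen only to illustrate the formulae:

```python
# Illustrative implementation of Eqs. (4)-(8); the counts passed in at the
# bottom are made up, not taken from our experiments.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (4)
    precision = tp / (tp + fp)                   # Eq. (5)
    recall = tp / (tp + fn)                      # Eq. (6), i.e. sensitivity
    f1 = 2 * tp / (2 * tp + fp + fn)             # Eq. (7)
    specificity = tn / (tn + fp)                 # Eq. (8)
    return accuracy, precision, recall, f1, specificity

acc, prec, rec, f1, spec = metrics(tp=90, tn=880, fp=20, fn=10)
```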
Performance evaluation of DL models

We trained VGG16, Inception v3, and ResNet50 models for 50 epochs each. The accuracy and loss curves for the models are given in Figs. 8, 9, and 10.

Figure 8: VGG16 accuracy (a) and loss (b) curves.
Figure 9: Inception v3 accuracy (a) and loss (b) curves.
Figure 10: ResNet50 accuracy (a) and loss (b) curves.

Following the training pattern of the three models, all of them reach near their peak training scores at around 20 epochs. The training accuracy and loss curves for Inception v3 and VGG16 are smooth, indicating proper learning without issues. Meanwhile, the validation loss curve for the ResNet50 architecture shows a large spike at the beginning, likely caused by overstepping local minima due to a larger-than-necessary learning rate; the model gradually recovers as training goes on. Except for ResNet50, the models' training and validation accuracy and loss curves run smoothly side by side, with no abnormality.

In Fig. 11, the classification results for the three models are given. A common failure to classify the hypochromic RBC images is present in the classification reports, as precision, recall, and F1-score are low for that class. This failure is also evident in the confusion matrices given in Fig. 12.

Figure 11: Classification reports of VGG16 (a), Inception v3 (b) and ResNet50 (c).
Figure 12: Confusion matrices of VGG16 (a), Inception v3 (b) and ResNet50 (c).
Figure 13: ROC curves and AUC scores of VGG16 (a), Inception v3 (b) and ResNet50 (c).

As visible in Fig. 12, the shortage of samples in the hypochromic class makes it difficult for the models to classify those images properly. However, the ResNet50 model still manages to classify them fairly accurately. The prediction accuracy for all the other classes is satisfactory, as visible in Fig. 11.
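The per-class scores in a classification report can be traced directly from a confusion matrix. As a small illustration (the matrix is made up, with a scarce third class analogous to hypochromic):

```python
# Per-class precision, recall, and F1 from a multi-class confusion matrix,
# where cm[i][j] counts samples of true class i predicted as class j.
# The 3x3 matrix below is illustrative, not from our experiments.
def per_class_report(cm):
    n = len(cm)
    report = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, actually other
        fn = sum(cm[c]) - tp                        # actually c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return report

cm = [[48, 2, 0],
      [3, 45, 2],
      [1, 4, 5]]   # third class has few samples, so its scores suffer
report = per_class_report(cm)
```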
Afterward, we analyzed the ROC curves and AUC scores of the models, shown in Fig. 13. The models achieved a perfect AUC score for the acanthocyte class and the lowest AUC score for the hypochromic class. Apart from the hypochromic class, a satisfactory AUC score was achieved for all the other classes across all the models.

Figure 14: Number of trainable parameters in the models.

Among the three models, VGG16 comes out on top with 96% overall accuracy across all the classes. In addition, VGG16 has the fewest parameters of the three, as visualized in Fig. 14. The smaller parameter count should make the model smaller in size and easier to transfer between client and server during federated learning communication. Consequently, VGG16 became our preferred model for the FL environment.

Performance evaluation under federated learning

To create the FL environment, we divided the data into five segments, each representing a single client. A separate test set was kept to analyze the test performance of the federated global model. The FL simulation was run for 50 communication rounds, where each communication round represents one epoch on each client's dataset.

Vanilla averaging

Figure 15: Federated learning global model accuracy (a) and loss (b) curves with vanilla averaging.

With vanilla averaging, the global accuracy and loss curves across the communication rounds show a healthy rate of change (Fig. 15). The accuracy increased quickly until the 15th communication round and then rose gradually, with occasional spikes, until the 50th round. Both the accuracy and loss curves reached a plateau after the 50th round, so we stopped recording at that point.

Figure 16: Federated learning classification report (a) and confusion matrix (b).

Surprisingly, as evident in Fig. 16, while the FL global model performs well on the non-hypochromic RBC image classes, like the centrally trained models, it also performs well in classifying the hypochromic RBC images, something the centrally trained models failed to achieve. In the centralized environment, the models failed to classify the hypochromic RBC images properly due to the low number of samples. In the FL environment, the VGG model had to work on a non-IID dataset due to the nature of the environment, which likely led the model to perform well on classes with low sample counts. The ROC curve for the global model under the FL environment is given in Fig. 17. As mentioned previously, FL seems to solve the issue of classifying the hypochromic class accurately, with a 0.91 AUC score for that class. Meanwhile, a lower AUC score was achieved for the codocyte class compared to the centrally trained models. Apart from these two classes, a balanced AUC score was maintained across the central and decentralized models.

Figure 17: ROC curve for the FL global model.

Despite decent results with trusted clients, the basic averaging method is mostly not sustainable in a more unpredictable environment where there might be problematic clients with poor data or ill intentions. Therefore, some basic safeguard against such threats is required, and a weighted averaging process can provide it.

Weighted averaging

In weighted averaging, the model weights are scaled by the score each client model achieves on a separate test set on the server end. This weighting mechanism should theoretically perform better than vanilla averaging because it accounts for model quality in the averaging mechanism.
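The two aggregation rules can be sketched as follows; plain Python lists stand in for the model's weight tensors, and the client scores are illustrative:

```python
# Vanilla averaging treats all clients equally; weighted averaging scales
# each client's weights by its normalised score on the server-side test set.
def vanilla_average(client_weights):
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

def weighted_average(client_weights, scores):
    total = sum(scores)
    coeffs = [s / total for s in scores]
    return [sum(c * w for c, w in zip(coeffs, ws))
            for ws in zip(*client_weights)]

clients = [[1.0, 2.0], [3.0, 4.0]]
vanilla = vanilla_average(clients)                 # midpoint of the two models
weighted = weighted_average(clients, [0.9, 0.1])   # leans toward the better client
```

In the weighted case, a client with a poor server-side score contributes proportionally less to the global model, which is the safeguard discussed above.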
This averaging procedure gives greater emphasis to the better models and lower priority to the worse ones, so that poorly performing models do not drag down the performance of the averaged server model.

Figure 18: Federated learning global model accuracy (a) and loss (b) curves with weighted averaging.
Figure 19: Federated learning classification report (a) and confusion matrix (b) for weighted averaging.
Figure 20: ROC curve for the FL global model with weighted averaging.

On the normal dataset and client distribution, we can see from Fig. 19 that the FL framework with weighted averaging achieves 95% accuracy, a slight improvement over the framework with vanilla averaging. Comparing Figs. 15 and 18, we notice a smoother loss curve per epoch, showcasing the more stable learning behavior of the framework with the weighted averaging mechanism. In addition, we observe a more uniformly distributed ROC score per class, as visualized in Fig. 20. Although the performance improvement is minimal under the general client and data distribution, the main benefit of the averaging procedure should become visible when there are clients with really bad data, or intentional data poisoning attacks by a few clients. The effects of such behaviour are empirically shown and discussed in the ablation study section.

The practicality of the averaging mechanism ultimately comes down to the properties of the clients in the FL environment. The obvious major tradeoff is privacy: individual client performance is observed with a test set, so clients with bad data can easily be exposed. Such exposure may discourage clients from participating in the federated loop. Thus, if a federated environment mostly consists of trusted clients with good data, the sacrifice of privacy is likely not worth it, as the performance improvement should be mostly minimal.
However, an open-source FL environment exposed to clients with bad data might benefit significantly from such averaging, even with some privacy trade-off.

Comparison against literature

Overall, under the federated setting, the VGG16 model scored an accuracy of 94%, quite close to the 96% achieved by the centrally trained VGG16 model. This demonstrates that even under the FL setting, it is possible to classify RBC deformations nearly as accurately as in the centralized setting. FL sacrificed only 2% of accuracy while achieving a better distribution of precision and recall scores across the classes. As the FL environment ensures data privacy and provides opportunities for open-source training integration, the added benefits are substantial. Therefore, a decrease of 2% in accuracy is a worthy trade-off from our point of view. This research thus demonstrates the effectiveness of FL for the classification of RBC image data. Finally, we compared our FL approach to the state-of-the-art (SOTA) result from another paper37 on the same dataset. Results are given in Table 2.

Table 2: Comparison against literature.

Observing Table 2, the proposed FL architecture conclusively achieved better sensitivity across most of the classes. Meanwhile, the specificity scores across the classes are generally lower compared to the literature. With a better sensitivity score, the proposed FL architecture should be able to detect a particular RBC deformation type fairly well. Additionally, the overall accuracy score for both the literature and the proposed FL architecture is 94%: even with a decentralized learning structure, FL performs quite competitively against models from the literature.

Despite its competitive performance against traditional learning mechanisms, one important drawback of federated learning is communication overhead.
VGG, Inception, and ResNet are large CNN architectures, and sending them back and forth between client and server causes some delay. It is difficult to quantify the communication overhead empirically because it varies with factors such as the configuration of the client devices and their internet connectivity. However, the communication overhead is a less significant issue in this case, as it is only present during federated training. At inference time, the class is predicted directly by the global model, so the communication overhead does not matter.

Ablation study

The ablation study provides insights into the performance of different variants of the VGG16 architecture on both IID and non-IID datasets in the federated learning environment. The details of the non-IID distribution are illustrated in Fig. 21.

Figure 21: Per-client class distribution in the non-IID setting.

In the non-IID setup, we made sure that each class is unevenly distributed among the clients, with some clients having no samples of particular classes. Performance against these unevenly distributed classes should provide better insight into the robustness of the model within the federated framework.

Table 3: Ablation study with different variations of the VGG16 architecture on IID and non-IID datasets with vanilla and weighted averaging.

From Table 3, we notice that the basic architecture of the VGG16 model (5 blocks) achieves the highest accuracy across both datasets with the minimum number of trainable parameters. On the other hand, an increase in the number of trainable parameters is observed when the number of blocks in the architecture is reduced. This increase occurs because the model feeds more filters into the classifier layers when the later convolution blocks are removed: the convolution layers in the 4th and 3rd blocks output far more filters than the fifth block.
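A back-of-the-envelope calculation makes the parameter behavior concrete. The filter counts and input size below are generic VGG-style values, not the exact configurations of our variants:

```python
# A Conv2D layer with k x k kernels, c_in input and c_out output channels
# holds k*k*c_in*c_out + c_out parameters, so widening both c_in and c_out
# grows the count roughly quadratically.
def conv2d_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out

# Removing blocks also removes 2x2 poolings, so the flattened feature map
# feeding the first dense layer stays much larger.
def flatten_size(input_side, n_pools, filters):
    side = input_side // (2 ** n_pools)
    return side * side * filters

p_narrow = conv2d_params(3, 256, 256)        # 3x3 conv, 256 -> 256 filters
p_wide = conv2d_params(3, 512, 512)          # doubling width ~ 4x the parameters
fc_in_5_blocks = flatten_size(224, 5, 512)   # 7 * 7 * 512   = 25088
fc_in_3_blocks = flatten_size(224, 3, 256)   # 28 * 28 * 256 = 200704
```

Even with fewer filters, the truncated variant hands the dense layers an input roughly eight times larger, which is where the extra parameters come from.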
As the number of parameters in a Conv2D layer is proportional to the square of the number of filters, an increase in width (number of filters) leads to a quadratic increase in parameters. Moreover, reducing the number of blocks also reduces the number of pooling layers; with fewer pooling layers, the spatial dimensions remain larger, so the input to the remaining fully connected layers grows and their parameter count rises accordingly. This study highlights the importance of balancing model depth and width to achieve optimal performance without unnecessary complexity. The basic architecture of the VGG16 model strikes this balance effectively, distributing parameters across multiple layers and leveraging pooling operations to control the parameter count. It is also important to use lightweight models in any distributed network. Therefore, the basic architecture of the VGG16 model is the best fit for red blood cell abnormality detection under a distributed learning framework.

Table 4: Ablation study with different model variations on IID and non-IID datasets.

One of the key challenges in decentralized learning is data poisoning. In federated learning, this adversarial attack occurs when malicious participants deliberately modify their local data to corrupt the global model being trained across distributed devices. To evaluate the impact of data poisoning on our baseline models in a federated learning setup, we manually created a poisoned training set: we flipped labels and added out-of-domain images (such as white blood cell images and lung X-ray images) to two clients' data.
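The label-flipping half of the poisoning setup can be sketched as follows; the helper name and the class count of 8 (matching the eight RBC classes) are our assumptions, not the original implementation:

```python
import random

# For a malicious client, replace every label with a randomly chosen
# *different* class, so each poisoned sample is guaranteed to be mislabeled.
def flip_labels(labels, num_classes=8, seed=0):
    rng = random.Random(seed)
    return [rng.choice([c for c in range(num_classes) if c != y])
            for y in labels]

clean = [0, 1, 2, 3, 4, 5, 6, 7]
poisoned = flip_labels(clean)   # every label differs from the original
```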
The performance comparison of baseline models trained on IID (independent and identically distributed), non-IID, and poisoned datasets in a decentralized learning environment is illustrated in Table 4, which shows results for both vanilla and weighted averaging. It is evident that models trained with weighted averaging in the federated learning setup consistently outperform those trained with vanilla averaging. Among the three baseline models, the VGG16 (5 blocks) model demonstrates superior performance across all scenarios, including the poisoned dataset. Therefore, weighted averaging improves model resilience against data poisoning in federated learning, with VGG16 (5 blocks) demonstrating the best overall performance among the tested models.

Interpretation of the global model

As VGG16 is our best-performing global model in the federated environment, it was chosen for generating Explainable AI (XAI) outputs. We used Grad-CAM on the final convolution layer of the global VGG16 model.

Figure 22: Interpretation of predictions by the global model using Grad-CAM. (a) Acanthocyte, (b) Codocyte, (c) Elliptocyte, (d) Hypochromic, (e) Normal, (f) Spherocyte, (g) Stomatocyte, (h) Dacrocyte.

In Fig. 22, a sample RBC image from each class and the corresponding Grad-CAM output are given. The Grad-CAM mapping highlights the region of interest for the global model on each image with a gradient ranging from red to blue; bright red indicates the most important region, while blue indicates the least. From the Grad-CAM output, for RBCs that are not round in structure (e.g. Acanthocyte, Elliptocyte, Stomatocyte, Dacrocyte), a large portion of the color mapping lies at the edge of the images. These samples have little texture in the middle of the cell. Since non-round RBCs have distinguishable shapes, the model can rely on the shape itself for classification; to retrieve shape information, it has to focus on the edges of the RBC images, which explains why the color mappings are often centered around the edges. For RBCs that are relatively rounder (e.g. Codocyte, Hypochromic, Normal, Spherocyte), shape becomes a hard-to-distinguish feature, so the model prioritizes the texture in the middle of the cells, resulting in a higher concentration of the color mapping there.
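The core of the Grad-CAM computation behind these heatmaps can be sketched without a deep-learning framework; the tiny toy arrays below stand in for the final convolution layer's activations and gradients:

```python
# Grad-CAM core: each channel's importance (alpha) is its spatially averaged
# gradient, and the heatmap is the ReLU of the importance-weighted sum of the
# feature maps.
def grad_cam(feature_maps, gradients):
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for fmap, grad in zip(feature_maps, gradients):
        # alpha_k: global-average-pooled gradient for channel k
        alpha = sum(sum(row) for row in grad) / (h * w)
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += alpha * fmap[i][j]
    return [[max(0.0, v) for v in row] for row in heatmap]   # ReLU

fmaps = [[[1.0, 0.0], [0.0, 1.0]],     # channel firing on one pattern
         [[0.0, 2.0], [2.0, 0.0]]]     # channel firing on the opposite pattern
grads = [[[1.0, 1.0], [1.0, 1.0]],     # positive gradient: supports the class
         [[-1.0, -1.0], [-1.0, -1.0]]] # negative gradient: zeroed by the ReLU
cam = grad_cam(fmaps, grads)
```

Only the channel aligned with the positive gradient survives the ReLU, which is how the heatmap ends up highlighting class-relevant regions (edges for non-round cells, central texture for round ones).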
