The potential of federated learning for self-configuring medical object detection in heterogeneous data distributions

Datasets

The LUng Nodule Analysis 2016 dataset (Luna16) is a subset of the Lung Image Database Consortium (LIDC) dataset26, originally curated for medical image segmentation. Ethics approval for this public dataset was provided by each participating institution's IRB. All participants gave informed consent. All research involving human participants adhered to the Declaration of Helsinki, relevant guidelines, and regulations. The dataset comprises pulmonary nodules with a minimum radius of 3mm (box volume median: 847 voxels), identified and verified by at least three experts. Non-nodular regions, nodules less than 3mm in radius, and nodules detected by only one or two radiologists are excluded from the dataset as unrelated findings. The training data covers the initial eight subsets of Luna16, while the last subset is reserved for testing. One case is excluded due to an abnormal Hounsfield unit range. More than three-quarters of the cases are malignant, predominantly originating from the manufacturer GE Medical Systems. The mean spacings for the x and y dimensions are 0.69mm, while that of the z dimension is 1.57mm. The intensity value distribution ranges from -1024 to 2468 Hounsfield Units (HU), with a median intensity value of -101; values below the 0.5th and above the 99.5th percentile are considered outliers.

The Duke Breast Cancer MRI dataset (Duke)17,18 comprises 922 high-resolution MRIs, including comprehensive metadata annotated by medical experts. Ethics approval for this public dataset was provided by the Duke University Health System IRB. All participants gave informed consent. All research involving human participants adhered to the Declaration of Helsinki, relevant guidelines, and regulations. Key characteristics of the dataset include mean spacings of 1.4 for the x and y dimensions and 2.0 for the z dimension. Intensity values span from -4671 to 10834 (0.5th to 99.5th percentile), with a median value of 96. Compared to Luna16, the Duke images exhibit larger mean spacings and larger boxes, with a median size of 5460 voxels. All cases are malignant and originate from devices of only two manufacturers.

nnDetection

nnDetection12 is a state-of-the-art self-configuring MOD method that adjusts itself to novel datasets, with successful applications27,28. It automatically selects suitable training parameters from two sources: rule-based parameters, which are derived from a data fingerprint, and fixed parameters. The data fingerprint characterizes the training data through predefined properties, such as the intensity value distribution or object sizes. The selection process for rule-based parameters leverages these data characteristics, for example setting anchor sizes based on the object sizes in the training dataset. Fixed parameters remain unchanged unless manually adjusted and rely on heuristics and empirical evaluations. Following the training procedure, nnDetection further refines object detection hyperparameters, such as the minimum object size, through an automatic empirical optimization. Starting from a set of predefined values, this empirical hyperparameter optimization refines the initial values sequentially because of their interdependencies.

Global data fingerprint

The creation of a global data fingerprint is an important aspect of the federated self-configuring MOD framework developed in this project. The data fingerprint characterizes the dataset with a set of specific properties.
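As a rough illustration, such a local fingerprint can be pictured as a small collection of per-image properties plus dataset-wide intensity statistics. The following minimal Python sketch is illustrative only: the function and property names are hypothetical and do not correspond to nnDetection's actual fingerprint keys.

    import numpy as np

    def compute_local_fingerprint(images, spacings, object_sizes_per_image):
        # Illustrative sketch only: property names are hypothetical and not
        # nnDetection's actual fingerprint keys.
        per_image = [
            {"spacing": sp, "object_sizes": sizes}  # spacing in mm, object sizes in voxels
            for sp, sizes in zip(spacings, object_sizes_per_image)
        ]
        # Dataset-wide intensity statistics; this is the part that is hard to
        # aggregate across clients, because percentiles and standard deviations
        # cannot simply be concatenated.
        voxels = np.concatenate([img.ravel()[::10] for img in images])  # subsampled voxels
        intensity = {
            "mean": float(voxels.mean()),
            "std": float(voxels.std()),
            "min": float(voxels.min()),
            "max": float(voxels.max()),
            "percentile_0_5": float(np.percentile(voxels, 0.5)),
            "percentile_99_5": float(np.percentile(voxels, 99.5)),
        }
        return {"per_image": per_image, "intensity": intensity}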
In a federated setting, multiple local data fingerprints are generated, one from the local dataset of each client. These local data fingerprints need to be aggregated into a single global data fingerprint in order to generate a plan that enables the training of a common global model.

Most properties of the data fingerprint are based on individual images of the local datasets and can easily be concatenated in the global data fingerprint. However, properties such as the intensity value distribution, which are computed over all images of a local dataset, pose a significant challenge. The intensity value distributions of the local datasets cannot simply be concatenated, because the intensity value distribution of the distributed global dataset must be represented by a single global property. Since these local properties include values such as percentiles and standard deviations, there is no straightforward way to reduce several of them to a single value. To address this issue, we evaluate two options:

nosyn involves sampling every 10th voxel of each image from the datasets of all clients, sending these samples to the server, and recomputing the corresponding intensity value distribution properties there. While this option offers accuracy similar to centralized nnDetection, it exposes patient data to the network and the server.

syn computes these properties from samples drawn from Gaussian distributions that are parameterized by the intensity properties of the individual images in the local datasets. For each image in the distributed global dataset, a truncated Gaussian distribution with the mean and standard deviation of the corresponding intensity value distribution, bounded by the minimum and maximum values, is created. The global intensity property in the global data fingerprint is then recomputed from samples drawn from each of these Gaussian distributions. This approach prioritizes data privacy, and empirical tests have shown no performance loss with this option in the label-based IID scenarios (Fig. 3, Fig. 5).
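A minimal sketch of the syn idea is shown below. It assumes that each client has already shared the per-image mean, standard deviation, minimum, and maximum as part of its local fingerprint; the function and key names are illustrative rather than the actual implementation.

    import numpy as np
    from scipy.stats import truncnorm

    def synthesize_global_intensity_properties(per_image_stats, samples_per_image=10000, seed=0):
        # per_image_stats: per-image dicts with "mean", "std", "min", "max",
        # gathered from all clients; no raw voxels leave the institutions.
        rng = np.random.default_rng(seed)
        samples = []
        for s in per_image_stats:
            std = max(s["std"], 1e-6)  # guard against degenerate images
            a = (s["min"] - s["mean"]) / std  # truncation bounds in standard units
            b = (s["max"] - s["mean"]) / std
            dist = truncnorm(a, b, loc=s["mean"], scale=std)
            samples.append(dist.rvs(size=samples_per_image, random_state=rng))
        voxels = np.concatenate(samples)
        # Recompute the global intensity properties from the synthetic samples.
        return {
            "mean": float(voxels.mean()),
            "std": float(voxels.std()),
            "min": float(voxels.min()),
            "max": float(voxels.max()),
            "percentile_0_5": float(np.percentile(voxels, 0.5)),
            "percentile_99_5": float(np.percentile(voxels, 99.5)),
        }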

However, to enable a better comparison with centralized nnDetection, the non-IID experiments utilize nosyn data fingerprints, since these are more similar to centralized nnDetection's data fingerprint.

Empirical hyperparameter optimization

The empirical hyperparameter optimization (sweep) refines parameters such as the minimum object size or the minimum prediction score threshold. Starting from a set of predefined default values, each sweep parameter is optimized sequentially, as interdependencies prevent a parallel optimization. For each predefined value of a particular sweep parameter, the corresponding predictions for each image are evaluated with a defined target metric such as the FROC score. The predefined value producing the best results is then selected and written to the hyperparameter configuration before the next sweep parameter is optimized.

Due to the interdependencies among the sweep parameters and the FROC score serving as the target metric, an independent local empirical optimization on the clients with subsequent aggregation of the results on the server side is not feasible in a federated setup. Instead, for each sweep parameter and its set of default values, the clients conduct an empirical optimization and send prediction matches (predictions matching ground-truth boxes) and prediction scores to the server. This step is necessary because FROC scores from different clients rely on different thresholds and cannot be aggregated directly. With the received prediction matches and prediction scores, the server computes the final FROC scores for each value of the corresponding sweep parameter and updates the hyperparameter configuration accordingly before initiating the optimization of the next sweep parameter. Given the high cost of this empirical hyperparameter optimization, it is performed only once, after completing the final global training round.

Federated Learning

FL allows the joint training of a global model that benefits from a large distributed training data pool across different clients (institutions). Each client trains a local model on its local dataset and submits only the resulting model weights to the server, thereby preserving patient data privacy.

The standard FL approach for aggregating model weights from different clients is Federated Averaging (FedAvg)2, which is especially effective when data is distributed homogeneously among clients (IID). As FedAvg employs weighted averaging, it is also well suited to handle quantity skews19. However, FedAvg struggles with data domain shifts, where the data distributions among clients are heterogeneous (non-IID). This heterogeneity arises commonly in real-world settings, where guidelines, imaging devices, and protocols vary among institutions. As a result, loss divergence and performance degradation become issues. To alleviate these challenges, alternative FL strategies have been developed, such as FedProx4, FedMOON13, and FedDC14. FedProx and FedMOON address the problem by introducing additional proximal terms into the local loss functions. These terms encourage the local models to remain close to the most recent global model, with the extent of this proximity controlled by the hyperparameter \(\mu\). Equations 1 and 2 give the respective adjustments to the local loss functions. FedProx's proximal term calculates the residual between the most recent global model w and the current local model.
The greater the divergence of the local model \(w^r\) from the global model, the larger the penalty imposed by the proximal term. The experiments of Qinbin Li et al.19 show that FedProx achieves particularly good results under label distribution skews such as the label-based non-IID distribution in this project. Similarly, FedMOON incorporates the more sophisticated, albeit computationally more expensive, model-contrastive loss into the local loss function, where \(\tau\) represents a temperature parameter. In contrast to only considering the global model \(z_{glob}\), the model-contrastive loss also takes into account the local model from the previous epoch, denoted as \(z_{prev}\). Using a similarity metric (e.g., cosine similarity), the objective is to increase the difference between the current local model z and \(z_{prev}\) while decreasing the difference to \(z_{glob}\).$$\begin{aligned}&Loss(w(x), y) + \frac{\mu }{2} \cdot \underbrace{||w - w^{r}||^2}_{\text {proximal term}} \end{aligned}$$
(1)
$$\begin{aligned}&Loss(w(x), y) -\mu \cdot \underbrace{\log \frac{\exp (\text {sim}(z, z_{glob})/\tau )}{\exp (\text {sim}(z, z_{glob})/\tau )+\exp (\text {sim}(z,z_{prev})/\tau )}}_{\text {model-contrastive loss}} \end{aligned}$$
(2)
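As a rough illustration of the FedProx adjustment in Eq. (1), the following PyTorch-style sketch adds the proximal term to an arbitrary local task loss. It is a simplified sketch for a generic PyTorch model; the name fedprox_loss and the default value of mu are illustrative assumptions, not the implementation used in this work.

    import torch

    def fedprox_loss(task_loss, local_model, global_params, mu=0.01):
        # task_loss: ordinary local loss (e.g., the detection loss on a batch).
        # global_params: parameters of the most recent global model w, kept fixed.
        # mu: hyperparameter controlling how strongly the local model w^r is
        #     pulled towards the global model (Eq. 1).
        prox = 0.0
        for p_local, p_global in zip(local_model.parameters(), global_params):
            prox = prox + torch.sum((p_local - p_global.detach()) ** 2)
        return task_loss + 0.5 * mu * prox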
FedDC, on the other hand, has been developed specifically for a large number of clients with small datasets that execute only a few local epochs. It applies FedAvg within specific intervals of global rounds and otherwise randomly swaps local model weights among the institutions without weight aggregation. Compared to FedProx and FedMOON, FedDC is much less computationally demanding, offering an opportunity to mitigate the negative impact of data domain shifts on model performance at low cost.

Flower

The implementation applies nnDetection in a simulated decentralized setup using the FL framework Flower. Alternative FL frameworks considered during the course of this project include NVFlare29, FedML30, and PySyft31. Flower is an open-source FL framework designed for heterogeneous edge devices and compatible with PyTorch and TensorFlow. It supports customizable strategies and client classes, simplifying the setup while allowing for tailored configurations. The selection of Flower for this project is driven by its lightweight nature, its ease of use, and its ability to accommodate the aforementioned customization options. Moreover, it comes with pre-implemented FL strategies and provides the infrastructure for a straightforward implementation of further FL strategies.

Model training

All models are trained for ten global rounds with five local epochs each. Training on the Luna16 dataset uses hard negative mining (HNM), as it showed better stability throughout training compared to the focal loss. Conversely, training on the Duke dataset relies on the focal loss, due to its superior performance in centralized training scenarios and its good stability.

Hard negative mining

Hard negative mining (HNM) addresses imbalanced datasets in object detection by identifying and prioritizing challenging negative samples during training. When the number of negatives outweighs the positives, models can become biased towards predicting negatives. HNM evaluates the model predictions on negatives, identifies misclassifications or high-confidence errors (hard negatives), and adjusts the training process to assign higher importance to these hard negatives, usually by re-weighting the loss function. This encourages the model to learn from its errors and improves the discrimination between true and false positives. Although not attributed to a single publication, HNM is employed by popular object detection algorithms32,33.

Focal loss

The focal loss34, akin to HNM, addresses the challenge of imbalanced datasets by prioritizing hard-to-classify samples. It modifies the standard cross-entropy loss function, decreasing the contribution of well-classified examples while increasing the contribution of poorly classified ones.$$\begin{aligned} \text {p}_\text {t}, \alpha _\text {t} = {\left\{ \begin{array}{ll} \text {p}, \alpha & y=1\\ 1 -\text {p}, 1 - \alpha & \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} \text {Focal Loss}&= -\alpha _\text {t}(1-\text {p}_\text {t})^\gamma \text {log}(\text {p}_\text {t}) \end{aligned}$$
(4)
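As a concrete illustration of Eqs. (3) and (4), the following minimal sketch computes the focal loss for binary labels and predicted probabilities; the function name and the default values for \(\alpha\) and \(\gamma\) are assumptions chosen for illustration.

    import torch

    def focal_loss(p, y, alpha=0.25, gamma=2.0):
        # p: predicted probability of the positive class, y: binary label (0 or 1).
        p_t = torch.where(y == 1, p, 1 - p)  # Eq. (3)
        alpha_t = torch.where(y == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
        # Well-classified samples (p_t close to 1) are down-weighted by (1 - p_t)^gamma.
        return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))  # Eq. (4)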
Based on the sample's class, the hyperparameter \(\alpha _t \in [0, 1]\) (Eq. 3) weights the loss in Eq. (4). The critical component of the focal loss is the term \((1-\text {p}_t)^\gamma\), which reduces the impact of well-classified positive or negative samples. For \(y=1\), the closer \(\text {p}_t\) (Eq. 3) is to 1 and the larger the hyperparameter \(\gamma\), the smaller the contribution of the current prediction to the loss. The same holds for \(y=0\) as p approaches 0.

Evaluation

Free-response receiver operating characteristic score

The Free-Response Receiver Operating Characteristic (FROC) score35 is a widely employed metric for evaluating model performance in MOD. It compares the true positive rate (TPR) \([0, 1]\) (Eq. 5) against the average number of false positives per image (FPPI) \([0, \infty )\) (Eq. 6) across multiple prediction score thresholds at a fixed IoU threshold.$$\begin{aligned} \text {TPR}&= \frac{\text {TP}}{\text {TP} + \text {FN}}\end{aligned}$$
(5)
$$\begin{aligned} \text {FPPI}&= \frac{\text {FP}}{\text {total images}} \end{aligned}$$
(6)
The FROC curve is often reduced to a single value by averaging the TPR at a set of fixed FPPI values. In general, the FROC score is useful for focusing on the detection of true positives and for finding a good trade-off between the TPR and FPPI across various thresholds.
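The following minimal sketch illustrates this reduction. The FPPI thresholds of 1/8 to 8 are an assumption chosen for illustration, and the matching of predictions to ground-truth boxes at the fixed IoU threshold is assumed to have been done beforehand.

    import numpy as np

    def froc_score(scores, is_tp, num_gt, num_images,
                   fppi_thresholds=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
        # scores: prediction scores of all predictions across the test set.
        # is_tp: per-prediction flag, True if the prediction matches a
        #        ground-truth box at the fixed IoU threshold.
        # num_gt: total number of ground-truth objects; num_images: number of images.
        scores = np.asarray(scores)
        is_tp = np.asarray(is_tp, dtype=bool)
        order = np.argsort(scores)[::-1]        # sort predictions by descending score
        tp = np.cumsum(is_tp[order])            # cumulative true positives
        fp = np.cumsum(~is_tp[order])           # cumulative false positives
        tpr = tp / num_gt                       # Eq. (5)
        fppi = fp / num_images                  # Eq. (6)
        # For each FPPI threshold, take the highest TPR achievable at or below it,
        # then average these sensitivities into a single score.
        sensitivities = [tpr[fppi <= t].max() if np.any(fppi <= t) else 0.0
                         for t in fppi_thresholds]
        return float(np.mean(sensitivities))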
