An international study presenting a federated learning AI platform for pediatric brain tumors

This multi-center, retrospective study was approved by the Stanford University institutional review board (IRB), and data use agreements were executed across the participating sites, with a waiver of consent/assent (IRB No. 51059: Deep Learning Analysis of Radiologic imaging). Nineteen institutions from North America, Europe, West Asia, North Africa, and Australia participated in the study (Fig. 1a). The waiver of consent was granted by the IRB for the following reasons: (1) as a retrospective study, the research involves no more than minimal risk to the participants, because the materials involved (data, documents, records) have already been collected and precautions will be taken to ensure confidentiality; (2) the waiver will not adversely affect the rights and welfare of the participants, as there are procedures in place that protect confidentiality; and (3) the information learned during the study will not affect the treatment or clinical outcome of the participants.

The inclusion criteria were: patients who presented with a new, treatment-naive PF tumor; had pathologic confirmation of any of the following benign or malignant tumors: medulloblastoma (MB), ependymoma (EP), or pilocytic astrocytoma (PA), or, in the case of diffuse intrinsic pontine glioma (DIPG), an MRI- and/or biopsy-based diagnosis; and had pre-treatment brain MRI that included axial T2-weighted imaging (T2-MRI). Subjects were excluded if the imaging was non-diagnostic due to severe motion degradation or other artifacts. Table 1 summarizes cohort demographics and site-specific tumor pathology.

Tumor segmentation was performed on axial T2-MRI by an expert board-certified pediatric neuroradiologist (KY, >15 years' experience), followed by consensus agreement among three pediatric neuroradiologists (AJ, JW, MK) and two pediatric neurosurgeons (SC, RL). Segmentation covered the whole tumor, inclusive of cystic, hemorrhagic, or necrotic components within the tumor niche.
T2-MRI was selected because it is the sequence most frequently acquired in routine MRI protocols; it is embedded within pre-surgical navigation; and it most reliably identifies tumor margins regardless of enhancement, and is hence recommended for pediatric glioma assessment29.

MRI acquisition
MRI of the brain was obtained using either 1.5 or 3 T MRI systems. The following vendors were employed across sites: GE Healthcare, Waukesha, WI; Siemens Healthineers, Erlangen, Germany; Philips Healthcare, Andover, MA; and Toshiba Canon Medical Systems USA Inc., Tustin, CA. The T2-weighted MRI (T2-MRI) sequence parameters were: T2 TSE clear/sense, T2 FSE, T2 propeller, T2 blade, T2 drive sense (TR/TE 2475.6–9622.24/80–146.048 ms); slice thickness 1–5 mm with 0.5 or 1 mm skip; matrix ranges of 224–1024 × 256–1024.

Study design
Dataset distribution
Of the 19 sites, 16 were selected to participate in model training and validation; the remaining three served as independent, external hold-out sites. A dataset from a database of normal pediatric brain MRI (N = 1667, from the ST site) was used for pretraining. Within each of the 16 sites that participated in model training and validation, 75% of the MRI data was used for the training set and the remaining 25% as the hold-out validation set. Sex and/or gender were not considered in sample selection.

Statistics and reproducibility
No statistical method was used to predetermine the sample sizes of the training, validation, and external, independent validation sites. All data collected from the 19 sites were used. Training runs showed minor variations in convergence across random seeds.

Data preprocessing
Each site must possess the small but important amount of knowledge needed to perform consistent data preprocessing, a task that, under CDS, would typically be centralized by a trusted party. To streamline preprocessing, we minimized complex preprocessing steps (e.g., brain registration to a common atlas or skull-stripping).
Preprocessing includes only: (1) normalization of each 3D image to a simple 0–255 intensity range and (2) extraction of a volume of 64 congruent axial slices of 256 × 256. These preprocessing steps are executed via an automated script applied to the DICOM data across all 19 sites. The number of 64 slices was chosen so that it can handle virtually all variations of the individual sites' T2 sequence parameters (e.g., TSE, FSE, Propeller) with a large range of slice thicknesses (e.g., 1–5 mm) arising from site-specific scanner technology and protocols. Our FL system can therefore accommodate a large range of sequence parameters and axial slice counts. While normal pediatric brain MRI data were not required for the FL experiments, we observed that they could help pretrain the model to identify the geometry and spatial locations of the pediatric brain across all ages, i.e., from infants to teenagers with adult head sizes. The normal dataset (N = 1667) was shared and distributed among the participating sites for both the CDS and FL approaches. However, the normal cohort was not used in the validation or hold-out test sets.

Federated model development and evaluation
We developed a 3D model that jointly performs tumor pathology prediction (MB, EP, PA, DIPG, normal) and tumor segmentation using FL (Fig. 1b). In the CDS approach, we combined the datasets from all 16 sites into a single pool, on which we trained the model. We also examined a Siloed model trained using the training and validation data from a single site only (site ST, which hosted the largest single-institution dataset), which was then evaluated on the 16 hold-out validation sets and the 3 external independent sites. In contrast, the FL strategy used a method known as Federated Averaging15. Within this framework, the 16 sites did not share data.
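The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the sites' actual script: min–max intensity scaling and nearest-neighbour resampling are assumptions (the text does not specify the interpolation or DICOM-reading details), and all function names are illustrative.

```python
import numpy as np

def normalize_intensity(volume: np.ndarray) -> np.ndarray:
    """Step (1): scale a 3D volume to the 0-255 intensity range
    (min-max scaling assumed; the text only names the target range)."""
    vol = volume.astype(np.float32)
    lo, hi = vol.min(), vol.max()
    if hi > lo:
        return (vol - lo) / (hi - lo) * 255.0
    return np.zeros_like(vol)

def extract_slices(volume: np.ndarray, n_slices: int = 64,
                   size: int = 256) -> np.ndarray:
    """Step (2): resample to n_slices congruent axial slices of size x size.
    Nearest-neighbour index sampling keeps the sketch dependency-free."""
    z, y, x = volume.shape
    zi = np.linspace(0, z - 1, n_slices).round().astype(int)
    yi = np.linspace(0, y - 1, size).round().astype(int)
    xi = np.linspace(0, x - 1, size).round().astype(int)
    return volume[np.ix_(zi, yi, xi)]

def preprocess(volume: np.ndarray) -> np.ndarray:
    """Full per-scan pipeline: normalize, then extract the 64 x 256 x 256 volume."""
    return extract_slices(normalize_intensity(volume))
```

Because the output volume is fixed at 64 × 256 × 256, any site's T2 acquisition, whatever its slice count or matrix size, maps to the same model input shape.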
Instead, they shared information only via the model parameters learned on each site's data. Each FL round began with local model training at the individual sites, after which each site transmitted its learned weights to a central server. There, the model weights from each of the 16 sites were averaged, creating a unified, global set of weights. These weights were then distributed back to each site to initiate the next FL round, where local training resumed. This iterative process, alternating between local training and centralized averaging, continued through many FL rounds. Eventually, the finalized global model was evaluated across the 16 validation sets and three hold-out test sets, its performance reflecting the collaborative yet segregated approach that characterizes the FL paradigm.

We modified the conventional FL strategy by adding a "warm-up" phase for the initial model, called Federated Warm-up, which enabled efficient FL training and hastened convergence given the large disparity in the data distributions of the 16 participating centers. The FL training consists of two stages: an initial 50 rounds of Federated Averaging on the ST and SE sites, followed by 150 additional rounds of Federated Averaging across all 16 sites. A convergence plot illustrating this Federated Averaging warm-up is shown in Fig. 5.

We employed a 3D-UNet architecture incorporating a Kinetics-pretrained encoder that was initially trained on large-scale video data30. The 3D architecture allowed for processing 64 high-resolution slices per datum, necessitating substantial GPU memory to manage large batch sizes. For the CDS training, 200 epochs were conducted with a combined loss function of Cross-Entropy and Dice Score Loss, using Adam optimization with a learning rate of 0.0001.
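The Federated Averaging loop with the two-stage warm-up described above can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `local_train` is a toy stand-in for each site's actual 3D-UNet training, and model weights are represented as plain NumPy arrays.

```python
import numpy as np

def local_train(weights, site):
    """Stand-in for one round of local training on a site's private data.
    In the real system this would run Adam updates of the 3D-UNet; here
    a toy update keeps the loop runnable."""
    return [w - 0.01 * np.sign(w) for w in weights]

def federated_average(weights_per_site):
    """FedAvg step: element-wise average of each weight tensor across sites."""
    return [np.mean(np.stack(ws), axis=0) for ws in zip(*weights_per_site)]

def run_rounds(global_weights, sites, rounds):
    """One FL stage: alternate local training and centralized averaging."""
    for _ in range(rounds):
        local = [local_train(global_weights, s) for s in sites]
        global_weights = federated_average(local)
    return global_weights

def federated_warm_up(init_weights, warm_sites, all_sites):
    """Stage 1: 50 FedAvg rounds on the two warm-up sites (ST, SE);
    stage 2: 150 FedAvg rounds across all 16 sites."""
    w = run_rounds(init_weights, warm_sites, rounds=50)
    return run_rounds(w, all_sites, rounds=150)
```

Note that only weight tensors cross site boundaries; each site's imaging data never leaves `local_train`.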
This combined loss function facilitated the learning of both classification and segmentation predictions. For classification performance, we calculated raw model accuracies and F1 scores. For segmentation, we used the same model as in the classification task to calculate DSCs. The DSC measures the overlap between the predicted and ground-truth segmentations and thus offers insight into segmentation quality. We also conducted a two-sided t-test on the DSCs to compare the performance of CDS and FL; the DSC distributions are approximately normal owing to the large test sample sizes.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
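For reference, the DSC used in the segmentation evaluation above can be computed for a pair of binary masks as below; the small epsilon is an assumption of this sketch (not mentioned in the text) to guard against empty masks.

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient: DSC = 2|P intersect T| / (|P| + |T|).
    `eps` (an assumption, not from the text) avoids 0/0 on empty masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps))

# The two-sided t-test comparing per-case DSCs of CDS vs FL could then be
# run with, e.g., SciPy (two-sided is the default alternative):
#   from scipy.stats import ttest_ind
#   t_stat, p_value = ttest_ind(dsc_cds, dsc_fl)
```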
