AI-Generated Annotations Dataset for Diverse Cancer Radiology Collections in NCI Image Data Commons

This section is structured according to the annotation types listed in Table 1 and follows the order displayed in Fig. 2. Each task has its own uniquely tailored methodology but follows the general workflow outlined in Fig. 2A.

Fig. 2 Overview of AI-generated annotations: (A) illustration of AI-generated annotation tasks and their associated modalities, (B) workflow for the annotation tasks, (C) distribution of the total number of DICOM series per modality of interest across the 11 collections.

All collections are publicly available in the Imaging Data Commons and were downloaded from the Google Cloud Platform using BigQuery queries (a sketch of such a query is shown below, after Table 2). Through data curation, a total of 3091 radiologic DICOM series (934 PET, 1704 CT, and 453 MRI) across all 11 collections (Fig. 2C) were identified for annotation. All datasets in the experiment are deidentified.

Supervised deep learning AI models, ensembled from five-fold cross-validation models using the nnU-Net20 framework, were trained on a combination of the imaging data available in the IDC and additional publicly available collections. Following our previous observation21 that multi-task nnU-Net20 models outperformed single-label models in detecting whole-body FDG PET/CT lesions, the nnU-Net20 models for the PET/CT annotations were trained as multi-task models that detect multiple organs in addition to lesions. The multi-task labels for the training data were assembled from publicly available expert annotations, as well as output predictions generated by a different model, TotalSegmentator22. The training sets differ between models, and a description of each training set is provided in the subsequent sections.

To evaluate the quality and accuracy of the AI predictions, approximately 10% of the data was reviewed and corrected for quantifiable quality control by both a board-certified radiologist (expert) and an annotation specialist (non-expert). The annotation specialist has medical knowledge and a passing familiarity with radiology scans but is not a certified expert. The reviewers were provided with the cases and AI segmentation files in NIfTI format and loaded them in their preferred viewer (ITK-SNAP1). Both the radiologist and the annotation specialist rated the AI predictions per case on a Likert scale to assess their quality, as described in Table 2. Likert score ratings were assigned to each case, evaluating the overall quality of annotations across all labels; higher Likert scores indicate superior annotation quality. For cases where the AI predictions were not rated as ‘strongly agree’, the reviewers corrected the AI annotations by editing the segmentations and saving the corrections to a NIfTI-formatted file. These corrections were then used to calculate the quantitative accuracy of the AI models. To control project costs, the radiologist’s review was limited to 10% of the entire dataset; the same 10% subset was also reviewed by the annotation specialist. For the remaining 90% of the data, the non-expert rated each AI prediction only on the Likert scale. This allowed the correlation between expert and non-expert ratings observed on the 10% subset to be extrapolated to the remaining predictions.

Table 2 Likert score description used by reviewers to assess the quality of the AI annotations per case.
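As referenced above, the following is a minimal sketch of an IDC curation query. It assumes the public `bigquery-public-data.idc_current.dicom_all` view and configured Google Cloud credentials; the collection name and filters are illustrative, not our exact curation criteria.

```python
# Minimal sketch: list CT series for one IDC collection via BigQuery.
# The collection_id and filters are placeholders, not the exact
# curation criteria used for this dataset.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  collection_id,
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  COUNT(SOPInstanceUID) AS num_instances
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = 'tcga_luad'   -- hypothetical example collection
  AND Modality = 'CT'
GROUP BY collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
"""

for row in client.query(query).result():
    print(row.SeriesInstanceUID, row.num_instances)
```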
The following sections describe the data curation, preprocessing, analysis, and post-processing steps used to develop the AI models for each specified annotation task. The code to reproduce our analysis is publicly available via zenodo.org (Table 10).

Volumetric segmentations produced by the models were saved as standard DICOM Segmentation objects (SEG), which include appropriate metadata to describe the contents and provide links to the input images. The DICOM data element SegmentAlgorithmType (0062,0008) is set to “AUTOMATIC” if the segmentation is the AI output; if the segmentation is a reviewer’s correction, SegmentAlgorithmType is set to “SEMIAUTOMATIC”. The SegmentAlgorithmName (0062,0009) data element is set to a short name specific to the model; the specific value is given in the model overview sections below. The ContentCreatorName (0070,0084) and SeriesDescription (0008,103E) data elements identify the segmentation creator, such as AI, Radiologist, or Non-expert.
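As a minimal, hedged sketch of how these data elements could be set on an existing SEG object with pydicom (the file name and tag values below are placeholders, not the exact values written by our pipeline):

```python
# Minimal sketch: stamp provenance metadata onto an existing DICOM SEG
# object with pydicom. File names and tag values are placeholders.
import pydicom

seg = pydicom.dcmread("lung_seg.dcm")  # hypothetical AI-generated SEG file

seg.SeriesDescription = "AI"           # creator description: AI / Radiologist / Non-expert
seg.ContentCreatorName = "AI"

for segment in seg.SegmentSequence:
    # "AUTOMATIC" for raw AI output; "SEMIAUTOMATIC" for reviewer corrections
    segment.SegmentAlgorithmType = "AUTOMATIC"
    segment.SegmentAlgorithmName = "BAMF-Lung-FDG-PET-CT"  # model-specific short name

seg.save_as("lung_seg_tagged.dcm")
```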
FDG PET/CT lung and lung tumor annotation
Imaging data
IDC collections
TCGA-LUAD3, TCGA-LUSC4, LUNG-PET-CT-Dx5, Anti-PD-1_Lung6, RIDER Lung PET-CT7, NSCLC-Radiogenomics8,9,10, and ACRIN-NSCLC-FDG-PET11,12.

Data curation
For this AI-generated annotation task, input images were attenuation-corrected, paired FDG-PET/CT scans of the lung/chest region. Out of the seven chosen collections, a total of 736 paired FDG-PET/CT studies matched the task criteria.
Model training methodology
The AutoPET Challenge 2023 dataset23,24 comprises whole-body FDG-PET/CT data from 900 patients, encompassing 1014 studies with tumor annotations. The highest-performing model in the AutoPET II Challenge25 used multi-task learning by including tasks for organs that typically have background activity in FDG PET scans. This same multi-task training strategy was employed by adding labels for the brain, bladder, kidneys, liver, stomach, spleen, lungs, and heart, generated by the TotalSegmentator22 model, to the training dataset. A multi-task AI model was trained using the augmented dataset. To evaluate algorithm robustness and generalizability, a held-out dataset of 150 studies, randomly selected without patient crossover, was employed. Among these, 100 studies were sourced from the same hospital as the training database, while 50 were selected from a different hospital that adhered to a similar acquisition protocol. This model achieved robust results on the final leaderboard of the AutoPET challenge26,27. The CT images were resampled to the resolution of the associated paired PET images.

Annotation data
AI generated annotations
The predictions of the AI-generated FDG-avid tumor annotation model28 for this task were overlaid with the lung annotations provided by the TotalSegmentator22 model. Tumor predictions were then limited to those within the pulmonary and pleural regions. An example output can be seen in Fig. 3; a sketch of the resampling and region-restriction steps follows the figure caption.

Fig. 3 Automatic segmentation of lung (green) and FDG-avid tumor (blue) from FDG-PET/CT scans of patient RIDER-2610856938.
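The following sketch uses SimpleITK (assumed tooling; the released pipeline may differ) to illustrate the two post-processing steps described above: resampling the CT onto the PET grid and restricting tumor predictions to the lung region.

```python
# Minimal sketch (SimpleITK assumed): resample CT to the PET grid and
# keep only tumor voxels inside the lung mask. File names, label
# handling, and the dilation radius are placeholders.
import SimpleITK as sitk

ct = sitk.ReadImage("ct.nii.gz")
pet = sitk.ReadImage("pet.nii.gz")
tumor = sitk.ReadImage("tumor_pred.nii.gz")     # binary AI tumor prediction
lung = sitk.ReadImage("lung_totalseg.nii.gz")   # binary lung mask (TotalSegmentator)

# Resample CT onto the PET voxel grid (identity transform, linear
# interpolation, air value for voxels outside the CT field of view).
ct_on_pet = sitk.Resample(ct, pet, sitk.Transform(), sitk.sitkLinear,
                          -1024.0, ct.GetPixelID())

# Slightly dilate the lung mask so pleural lesions are retained.
lung_dilated = sitk.BinaryDilate(sitk.Cast(lung, sitk.sitkUInt8), (3, 3, 3))

# Keep tumor voxels only inside the (dilated) lung region.
tumor_in_lung = sitk.Mask(tumor, lung_dilated)

sitk.WriteImage(tumor_in_lung, "tumor_in_lung.nii.gz")
```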
Validation
The non-expert qualitatively assessed the AI annotations on all 736 DICOM studies using a Likert scale. Approximately 10% of the data (N = 77) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set.
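To quantify how well non-expert ratings track expert ratings on a shared validation subset, an agreement statistic such as a weighted Cohen's kappa can be computed. The sketch below (scikit-learn assumed, scores hypothetical) shows one way to do this; it is not necessarily the exact statistic used in our analysis.

```python
# Minimal sketch: expert vs. non-expert Likert agreement on the shared
# validation subset. Scores are hypothetical; quadratic weights account
# for the ordinal nature of the Likert scale.
from sklearn.metrics import cohen_kappa_score

expert = [5, 4, 5, 3, 4, 5, 2, 4]      # expert Likert scores per case
non_expert = [5, 4, 4, 3, 4, 5, 3, 4]  # non-expert scores, same cases

kappa = cohen_kappa_score(expert, non_expert, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")
```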
DICOM-SEG SegmentAlgorithmName: BAMF-Lung-FDG-PET-CT.
CT Lung nodule annotation
Imaging data
IDC collections
TCGA-LUAD3, TCGA-LUSC4, LUNG-PET-CT-Dx5, Anti-PD-1_Lung6, RIDER Lung PET-CT7, and NSCLC-Radiogenomics8,9,10.

Data curation
For this AI-generated annotation task, input images were CT scans of the lung/chest region that were not part of the paired attenuation-corrected FDG-PET/CT scans used in the previous (FDG PET/CT lung) task. Out of the six chosen collections, a total of 433 CT scans met the task criteria.
Model training methodology
The DICOM-LIDC-IDRI-Nodules collection29,30,31 was used to train an AI model32,33 to annotate lung nodules. This collection includes 883 studies with annotated nodules from 875 patients. Within the dataset, only nodules identified by all four radiologists (size condition: 3 mm ≤ diameter ≤ 30 mm) were considered for AI model training for this task. The lung annotation AI model was trained on 411 and 111 lung CT scans from NSCLC Radiomics34,35 and NSCLC Radiogenomics36, respectively. No additional preprocessing was used.

Annotation data
AI generated annotations
The predictions of the AI-generated lung nodule annotation model for this task were limited by size and region: only annotations within the pulmonary and pleural regions with diameters between 3 mm and 30 mm, matching the training data, were kept. An example output can be seen in Fig. 4; a sketch of the size filter follows the figure caption.

Fig. 4 Automatic segmentation of lung (green) and nodule (blue) from a CT scan of patient TCGA-34-5239.
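A hedged numpy/scipy sketch of the diameter filter (assumed tooling; an equivalent-sphere diameter is one reasonable definition of nodule size, though the released pipeline may define it differently):

```python
# Minimal sketch: keep only connected components whose equivalent-sphere
# diameter is between 3 mm and 30 mm. Tooling and the diameter
# definition are assumptions, not the released implementation.
import numpy as np
from scipy import ndimage

def filter_nodules_by_size(mask, spacing_mm, d_min=3.0, d_max=30.0):
    """mask: binary 3D array; spacing_mm: voxel spacing (z, y, x) in mm."""
    labeled, n = ndimage.label(mask.astype(bool))
    voxel_vol = float(np.prod(spacing_mm))          # mm^3 per voxel
    out = np.zeros(mask.shape, dtype=bool)
    for i in range(1, n + 1):
        component = labeled == i
        vol = component.sum() * voxel_vol           # component volume in mm^3
        d = (6.0 * vol / np.pi) ** (1.0 / 3.0)      # equivalent-sphere diameter
        if d_min <= d <= d_max:
            out[component] = True
    return out
```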

Validation
The non-expert qualitatively assessed the AI annotations on all 430 DICOM studies using a Likert scale. Approximately 10% of the data (N = 47) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set.
DICOM-SEG SegmentAlgorithmName: BAMF-Lung-CT.
FDG PET/CT breast tumor annotation
Imaging data
IDC collections
QIN-Breast13,14.

Data curation
For this AI-generated annotation task, input images were attenuation-corrected, paired FDG-PET/CT scans. A total of 110 paired PET/CT scans met the task criteria.
Model training methodology
Model design and training
This task used the same nnU-Net20-based AI model28 as the previous FDG-PET/CT lung and FDG-avid tumor task, which was trained on the AutoPET Challenge 2023 dataset augmented for multi-task learning by incorporating labels generated by TotalSegmentator22. The CT images were resampled to the resolution of the paired PET images.
Annotation data
AI generated annotations
The predictions of the AI-generated FDG-avid tumor annotation model for this task were overlaid with the annotations provided by the TotalSegmentator22 model. Tumor predictions were then limited to those within the breast regions. An example output can be seen in Fig. 5.

Fig. 5 Automatic segmentation of FDG-avid breast tumor (blue) from FDG-PET/CT scans of patient QIN-BREAST-01-0033.

Validation
The non-expert qualitatively assessed the AI annotations on all 110 DICOM studies using a Likert scale. Approximately 10% of the data (N = 10) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set.
DICOM-SEG SegmentAlgorithmName: BAMF-Breast-FDG-PET-CT.
CT Kidneys, tumors, and cysts annotation
Imaging data
IDC collections
TCGA-KIRC15.

Data curation
For this AI-generated annotation task, input images were limited to contrast-enhanced CT scans that contained the kidneys. A total of 156 CT scans met the task criteria.
Model training methodology
The kidney tumor annotation AI model was trained to accurately delineate the kidney, tumors, and cysts. Model training was split into two stages. Stage one training used contrast-enhanced CTs from the KiTS 2021 collection37,38,39 (N = 489) to identify the kidney, tumors, and cysts. This trained model was then used to generate annotations for 64 cases of the TCGA-KIRC15 collection, which were further refined by non-experts. An additional 45 cases from the TCGA-KIRC15 dataset were included as part of the training set for stage two training. The final trained model40 was used to generate annotations for all 156 cases of the TCGA-KIRC15 collection that met the task criteria. No additional preprocessing was used.
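The two-stage workflow can be illustrated with a short orchestration sketch. It assumes the nnU-Net v2 command-line interface and uses placeholder dataset IDs and paths; the released training code (Table 10) is authoritative.

```python
# Minimal sketch of the two-stage training loop (nnU-Net v2 CLI assumed;
# dataset IDs and paths are placeholders). Stage 1 trains on the
# KiTS-based dataset, its predictions on TCGA-KIRC are human-refined,
# and stage 2 retrains on the combined data.
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# Stage 1: five-fold training (after nnUNetv2_plan_and_preprocess).
for fold in range(5):
    run(f"nnUNetv2_train 501 3d_fullres {fold}")

# Predict pseudo-labels for TCGA-KIRC cases with the five-fold ensemble.
run("nnUNetv2_predict -i tcga_kirc_imgs -o tcga_kirc_pred "
    "-d 501 -c 3d_fullres -f 0 1 2 3 4")

# ...pseudo-labels are manually refined, merged with the stage-1 data
# into a new dataset (id 502, placeholder), then stage 2 retrains:
for fold in range(5):
    run(f"nnUNetv2_train 502 3d_fullres {fold}")
```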
Annotation data
AI generated annotations
The AI-generated annotations were limited to the two largest connected components to remove false positives. The connected components were determined from the union of the kidney, cyst, and tumor labels. An example output can be seen in Fig. 6; a sketch of the component filter follows the figure caption.

Fig. 6 Automatic segmentation of kidney (green), tumor (blue), and cyst (yellow) from a CT scan of patient TCGA-CJ-4873.
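A hedged numpy/scipy sketch of this step (assumed tooling): components are found on the union of all foreground labels, and voxels outside the largest components are cleared across every label. The same routine with keep_top_k=1 mirrors the largest-component filtering used for the liver tasks below.

```python
# Minimal sketch (numpy/scipy assumed): keep the largest connected
# components of the union of all foreground labels, clearing the rest.
import numpy as np
from scipy import ndimage

def keep_largest_components(label_map, keep_top_k=2):
    """label_map: integer array (0 = background, >0 = kidney/tumor/cyst)."""
    union = label_map > 0
    components, n = ndimage.label(union)
    if n <= keep_top_k:
        return label_map
    # Voxel count per component (index 0 is background; ignore it).
    sizes = np.bincount(components.ravel())
    sizes[0] = 0
    keep_ids = np.argsort(sizes)[-keep_top_k:]
    keep_mask = np.isin(components, keep_ids)
    return np.where(keep_mask, label_map, 0)
```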

Validation
The non-expert qualitatively assessed the AI annotations on all 156 DICOM studies using a Likert scale. Approximately 20% of the data (N = 39) was randomly selected as a validation set; a larger percentage of this collection was selected because of its heterogeneity in characteristics such as contrast phase and scan field of view. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set. Additionally, the expert provided annotations for the 39 cases, enabling a comparison between the final model’s annotations and the expert’s annotations.
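Where corrected or expert annotations exist, the quantitative accuracy mentioned earlier can be summarized per label with a Dice coefficient. A minimal numpy sketch (an assumption for illustration, not the exact evaluation code):

```python
# Minimal sketch: per-label Dice between AI and expert label maps.
import numpy as np

def dice(ai, expert, label):
    """ai, expert: integer label maps on the same voxel grid."""
    a = ai == label
    b = expert == label
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

# Hypothetical label mapping: 1 = kidney, 2 = tumor, 3 = cyst.
# for lbl in (1, 2, 3):
#     print(lbl, dice(ai_map, expert_map, lbl))
```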
DICOM-SEG SegmentAlgorithmName: BAMF-Kidney-CT.
MRI prostate annotation
Imaging data
IDC collections
ProstateX16,17. For this collection, the IDC has prostate annotations for 98 MRI scans from PROSTATEx-Seg-HiRes41,42 (high resolution prostate annotations, N = 66) and PROSTATEx-Seg-Zones43,44 (zone segmentations of the prostate, N = 32).

Data curation
For this AI-generated annotation task, input images were limited to T2W MRI scans that did not have any missing slices. A total of 347 MRI scans met the task criteria.
Model training methodology
Model design and training
While an extensively trained prostate AI segmentation model already exists for the PI-CAI collections45 dataset (N = 1500), we were unable to use this model as a baseline due to its inclusion of the ProstateX16,17 dataset in its training and validation. To ensure no ProstateX16,17 data leakage occurred in the AI model training, this task was done in two stages. The training set for the first-stage model was composed of manual annotations from several datasets. It included 232 scans from ProstateX16,17, of which 98 labels were from the IDC collection and 134 were from the PROSTATEx_masks46,47 collection. An additional 207 scans came from Prostate15848 (N = 138) and ISBI-MR-Prostate-201349 (N = 69); each of these two datasets contained a single case that did not meet our inclusion criteria, and those cases were excluded. A total of 439 T2W MRI prostate annotations were used to train the first-stage prostate annotation AI model. A test/holdout validation split of 81/34 was created from the remaining 115 scans without annotations in the ProstateX16,17 collection. The first-stage model was then used to predict the unannotated 81 test-set scans of ProstateX16,17 and 1172 cases of the PI-CAI collections45. All PI-CAI scans from patients also present in ProstateX16,17 were removed from the PI-CAI dataset (N = 1500) to ensure no data leakage between the two collections. A portion of the PI-CAI dataset had a much larger field of view than that of the training collections. To reduce the risk of off-target regions in the predictions, the centremost segmentation (in all directions) was assumed to be the prostate, and all additional regions were removed for all 1253 prostate predictions. In the second training stage, a new AI model50 was trained using the same data as the first stage with the addition of the 1253 predicted annotations from the ProstateX16,17 test split (N = 81) and the PI-CAI collections (N = 1172). This second-stage model was used to generate the final prostate labels for the ProstateX16,17 collection. The 34 scans from the ProstateX16,17 holdout set were used for manual validation by the radiologist and the non-expert. No additional preprocessing was used.
Annotation data
AI generated annotations
The AI-generated prostate annotations were limited to the largest, centremost (in all directions) annotation. An example output can be seen in Fig. 7; a sketch of this selection step follows the figure caption.

Fig. 7 Automatic segmentation of the prostate gland from a T2 MRI scan of patient ProstateX-0336.
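A hedged numpy/scipy sketch of selecting the centremost component (assumed tooling; here "centremost" is taken as the component whose centroid lies closest to the image centre, which is one plausible reading of the step described above):

```python
# Minimal sketch (numpy/scipy assumed): keep the connected component
# whose centroid lies closest to the image centre.
import numpy as np
from scipy import ndimage

def keep_centremost_component(mask):
    """mask: binary 3D array; returns a mask with one component kept."""
    components, n = ndimage.label(mask)
    if n <= 1:
        return mask
    centre = (np.array(mask.shape) - 1) / 2.0
    centroids = ndimage.center_of_mass(mask, components, range(1, n + 1))
    dists = [np.linalg.norm(np.array(c) - centre) for c in centroids]
    best = int(np.argmin(dists)) + 1   # component labels start at 1
    return components == best
```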

Validation
The non-expert qualitatively assessed the AI annotations on all 347 DICOM studies using a Likert scale. Approximately 10% of the data (N = 34) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set. Additional validation was performed by generating AI segmentations for the QIN-Prostate-Repeatability51,52,53,54,55, PROMISE1256, and Medical Segmentation Decathlon57 T2W MRI collections.
DICOM-SEG SegmentAlgorithmName: BAMF-Prostate-MR.
MRI liver annotation
Imaging data
IDC collections
TCGA-LIHC18.

Data curation
For this AI-generated annotation task, input images were limited to T1W MRI scans. A total of 65 MRI scans met the task criteria.
Model training methodology
A total of 350 MRI liver annotations taken from the AMOS58 (N = 40) and DUKE Liver Dataset V259 (N = 310) collections were used to train an MRI liver annotation AI model60. No additional preprocessing was used.

Annotation data
AI generated annotations
The AI-generated liver annotations were limited to the single largest connected component. An example output can be seen in Fig. 8.

Fig. 8 Automatic segmentation of the liver from a T1 MRI scan of patient TCGA-G3-A7M7.

Validation
A non-expert qualitatively assessed the AI annotations on all 65 DICOM studies using a Likert scale. Approximately 10% of the data (N = 7) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set.
DICOM-SEG SegmentAlgorithmName: BAMF-Liver-MR.
CT liver annotation
Imaging data
IDC collections
TCGA-LIHC18.

Data curation
For this AI-generated annotation task, input images were limited to CT scans of the liver region. A total of 89 CT scans met the task criteria.
Model training methodology
A total of 1565 CT liver annotations taken from the TotalSegmentator22 (N = 1204) and FLARE2161,62 (N = 361) collections were used to train a CT liver annotation AI model63. No additional preprocessing was used.

Annotation data
AI generated annotations
The AI-generated liver annotations were limited to the single largest connected component. An example output can be seen in Fig. 9.

Fig. 9 Automatic segmentation of the liver from a CT scan of patient TCGA-DD-A1EH.

Validation
A non-expert qualitatively assessed all liver annotations using a Likert scale. Approximately 10% of the data (N = 9) was randomly selected as a validation set. Both the non-expert and the expert Likert-scored and manually corrected the AI predictions of the validation set.
DICOM-SEG SegmentAlgorithmName: BAMF-Liver-CT.
