Estimating infant age from skull X-ray images using deep learning

Data collection

Infants under 12 months of age who underwent plain skull X-rays for head trauma evaluation from January 2010 to December 2021 were included in this study. Patients with congenital cranial malformations were excluded. The study was approved by the Institutional Review Board of Hallym University Sacred Hospital (No. 2023-01-002). Informed consent was waived by the same board owing to the retrospective nature of the study, and all image data were anonymized.

All plain skull X-ray images (74 kVp, 200 mA, 100 ms) were taken using a digital radiographic device (GC85A, SAMSUNG, Korea), retrieved from the institution's Picture Archiving and Communication System (PACS, Infinite version 3.0.9) in DICOM format, and converted into .png format. Personal information and annotations were removed during the conversion process to ensure patient confidentiality, and images that were not properly focused were systematically excluded from the database. To maintain the integrity of the dataset, the acquired images were evaluated by a neurosurgical expert (H.S.L.), who excluded AP, Towne, and lateral view radiographs that deviated significantly from standard positioning.

Additionally, the study utilized another 864 images of 216 distinct patients from the same institution for external validation. These images were obtained from skull X-ray examinations performed from January 2017 through December 2021. The mean vertical resolution of the internal dataset was 2046 ± 80 (1382–2177) pixels, and the mean horizontal resolution was 1703 ± 74 (1382–2177) pixels. In the external validation dataset, the mean vertical resolution was 1991 ± 188 (1453–2177) pixels, and the mean horizontal resolution was 1641 ± 181 (1160–1814) pixels.

The skull X-ray images included four different types: anteroposterior (AP), Towne, right lateral, and left lateral views. Some patients had all four types of X-rays, while others had only some of them. Before the study, the X-ray views were categorized into two groups: the AP view dataset, which included both AP and Towne views, and the lateral view dataset, comprising right and left lateral views. Two separate deep learning models were independently developed for these datasets.

Dataset construction

Each image was labeled according to the patient's age group, categorized into 12 groups by month of age. As presented in Table 1, the entire dataset was divided into three mutually exclusive subsets: training, validation, and test datasets, using random sampling at an 8:1:1 ratio. The validation dataset was used to determine the optimal stopping point of the training process. Sampling was stratified by age group to maintain consistent class proportions in each subset. To enhance the reliability of the performance estimates, dataset splitting was repeated three times with different random seeds, and deep learning models were trained separately on each split (an illustrative splitting sketch is given below, after the ROE border definitions).

Table 1. Data composition of enrolled plain skull X-rays of anterior–posterior (AP) and lateral views in the internal datasets.

Data pre-processing

To eliminate potential age prediction biases unrelated to the skull, all images were pre-processed to mask the teeth and paranasal sinus areas. The region of exclusion (ROE), containing the orbital and mandibular regions in the skull X-ray, was identified. The border lines of the ROE were defined as follows:

1. On the AP or Towne view skull X-ray image, the borders encompassed the upper margin of the supraorbital rim and the lower margin of the mandible (Fig. 1A).

2. On the lateral skull X-ray image, the borders included the supraorbital rim, the foremost part of the mandible, and the posterior margin of the cervical spinous process (Fig. 1B).
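The stratified 8:1:1 split with three random seeds described under Dataset construction can be written roughly as follows. This is a minimal sketch assuming a simple label table (hypothetical file name and columns) and scikit-learn's train_test_split; it is illustrative rather than the authors' actual code.

```python
# Sketch of the stratified 8:1:1 train/validation/test split repeated with three seeds.
# The CSV name and column layout are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.read_csv("ap_view_labels.csv")  # assumed columns: image_path, age_month (0-11)

splits = {}
for seed in (0, 1, 2):  # three different seeds, yielding three independent splits
    # First carve out 80% for training, stratified by monthly age group.
    train_df, rest_df = train_test_split(
        labels, test_size=0.2, stratify=labels["age_month"], random_state=seed
    )
    # Split the remaining 20% evenly into validation and test (10% each overall).
    val_df, test_df = train_test_split(
        rest_df, test_size=0.5, stratify=rest_df["age_month"], random_state=seed
    )
    splits[seed] = {"train": train_df, "val": val_df, "test": test_df}
```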

Figure 1. The defined region of exclusion (ROE) in the skull X-ray for image tailoring. (A) Anteroposterior (AP) or Towne view skull X-ray showing the defined ROE. The borders of the ROE extend from the upper margin of the supraorbital rim to the lower margin of the mandible. (B) Lateral skull X-ray with the ROE including the supraorbital rim, the foremost part of the mandible, and the posterior margin of the cervical spinous process. (C) Post-processed AP or Towne view skull X-ray. The region below the upper margin of the ROE has been removed. (D) Post-processed lateral skull X-ray. A square box, defined by the upper and right margins of the ROE, has been removed.

The defined area on each of 293 skull X-ray images was labeled as the ROE by a neurosurgical expert (H.S.L.). The entire ROE dataset was divided into training, validation, and test datasets through random sampling at a ratio of 8:1:1. A MobileNetV3 model was trained for object detection of the labeled ROE. Regarding training parameters, the Adam optimizer was used with an initial learning rate of 1e-3 and a batch size of 16. Subsequently, post-processing was performed on all images to eliminate the detected ROEs based on the following criteria (see the tailoring sketch after this list):

1) On the AP or Towne view skull X-ray, the region below the upper margin of the ROE was removed (Fig. 1C).

2) On the lateral skull X-ray, the square box defined by the upper and right margins of the ROE was removed (Fig. 1D).
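A minimal sketch of the ROE-based tailoring referenced above is given below. It assumes the detector returns one bounding box per image with its upper and right margins as pixel indices, and that "removed" regions are masked to zero; the exact geometry of the lateral-view box (Fig. 1D) is an interpretation of criterion 2) rather than the authors' exact implementation.

```python
# Sketch of ROE removal with NumPy; the box convention and masking-to-zero are assumptions.
import numpy as np

def tailor_ap(image: np.ndarray, roe_top: int) -> np.ndarray:
    """AP/Towne view: mask everything below the upper margin of the detected ROE (Fig. 1C)."""
    out = image.copy()
    out[roe_top:, :] = 0
    return out

def tailor_lateral(image: np.ndarray, roe_top: int, roe_right: int) -> np.ndarray:
    """Lateral view: mask the box bounded by the upper and right margins of the ROE (Fig. 1D).

    The box is assumed to run from the ROE's upper margin to the bottom of the image
    and from the left image border to the ROE's right margin (the facial region).
    """
    out = image.copy()
    out[roe_top:, :roe_right] = 0
    return out
```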

All tailored images were then reviewed by a neurosurgeon (H.S.L.) and corrected for any misprocessing. After tailoring the region of interest (ROI) in this way, all images were zero-padded symmetrically about the center into a square whose side length matched the longer of the image's width and height. The resulting square images of varying sizes were then resized to a uniform 1024 × 1024 pixels using bilinear interpolation, and min–max normalization was applied to all images (see the pre-processing sketch following the statistical analysis below).

Training CNN models

To construct the deep-learning models, two different CNN architectures, DenseNet-121 and EfficientNet-V2-M, were adopted. DenseNet-121 improves feature representation and learning efficiency and has been effective for medical image classification10, whereas EfficientNet-V2-M is a more recently introduced architecture that has shown higher performance in general image classification tasks at low computational cost11,12. In brief, DenseNet consists of dense blocks in which each layer is linked to the feature maps of all preceding layers, while EfficientNet-V2-M, like EfficientNet, was obtained through neural architecture search for an effective CNN architecture. Both DenseNet-121 and EfficientNet-V2-M had previously been trained on the ImageNet dataset and were fine-tuned in this study11,12,13. All layers were unfrozen, allowing fine-tuning of every layer in the network.

The batch size was set to 8 for DenseNet-121 and 4 for EfficientNet-V2-M, the maximum that the GPU memory of our hardware could accommodate for each architecture. Categorical cross-entropy was used as the loss function, and the Adam optimizer was applied14. The initial learning rate was set to 0.0001 and was reduced by a factor of 0.1 every 10 epochs. Early stopping based on the validation loss was employed after the 20th epoch with a patience of 10 epochs, and training was limited to a total of 100 epochs. During training, if the validation loss at an epoch exceeded the minimum validation loss achieved so far, the model was not saved; thus, the model from the epoch with the minimum validation loss was chosen as the final saved model to prevent overfitting.

The deep-learning models used in this study were implemented on the PyTorch platform using a hardware system comprising an NVIDIA GeForce RTX 4090 graphics processing unit and an Intel Xeon Silver central processing unit with a customized water-cooling system.

Performance evaluation and statistical analysis

After training the deep-learning models, the performance of each model was evaluated on the test dataset three times using different seeds. For external validation, the trained deep learning models were tested on the external validation dataset described above.

The primary outcome measure for the established deep learning models was the classification accuracy in predicting the twelve age groups, delineated on a monthly basis. The secondary outcome was the one-month relaxation accuracy of the deep learning models. Continuous variables are presented as means with standard deviations. The Mann–Whitney U test was used to compare prediction performance between age groups. A P-value < 0.05 was considered statistically significant, and all tests were two-sided.
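The square zero-padding, bilinear resizing to 1024 × 1024 pixels, and min–max normalization described above could be written roughly as follows; this is an illustrative sketch using OpenCV and NumPy, not the authors' code.

```python
# Illustrative sketch of the described pre-processing; library choices are assumptions.
import cv2
import numpy as np

def preprocess(image: np.ndarray, target: int = 1024) -> np.ndarray:
    h, w = image.shape[:2]
    side = max(h, w)
    # Zero-pad symmetrically about the center so the image becomes square.
    top = (side - h) // 2
    left = (side - w) // 2
    padded = cv2.copyMakeBorder(
        image, top, side - h - top, left, side - w - left,
        borderType=cv2.BORDER_CONSTANT, value=0,
    )
    # Resize the square image to 1024 x 1024 pixels with bilinear interpolation.
    resized = cv2.resize(padded, (target, target), interpolation=cv2.INTER_LINEAR)
    # Min-max normalization to the [0, 1] range.
    resized = resized.astype(np.float32)
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-8)
```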
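The fine-tuning configuration described under Training CNN models (ImageNet-pretrained weights with all layers unfrozen, categorical cross-entropy, Adam at a 0.0001 learning rate, a tenfold learning-rate reduction every 10 epochs, early stopping with a patience of 10 after the 20th epoch, and saving the model at the minimum validation loss) corresponds roughly to the following PyTorch sketch for DenseNet-121; the data loaders, checkpoint file name, and 12-way head wiring are assumptions.

```python
# Sketch of the fine-tuning loop under the stated assumptions; not the authors' code.
import torch
import torch.nn as nn
from torchvision import models

def fine_tune(train_loader, val_loader, num_classes=12, max_epochs=100, patience=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # ImageNet-pretrained DenseNet-121 with a new classification head; every layer stays trainable.
    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    model.to(device)

    criterion = nn.CrossEntropyLoss()                                   # categorical cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    best_val_loss, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:                             # assumed DataLoader of (image, label) batches
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # The validation loss drives checkpointing and early stopping.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:                           # assumed DataLoader
                images, labels = images.to(device), labels.to(device)
                val_loss += criterion(model(images), labels).item() * images.size(0)
                n += images.size(0)
        val_loss /= n

        if val_loss < best_val_loss:
            # Save only when the validation loss reaches a new minimum.
            best_val_loss, wait = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            wait += 1
            if epoch >= 20 and wait >= patience:                        # early stopping active after the 20th epoch
                break
    return model
```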
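For the secondary outcome, the one-month relaxation accuracy is taken here to mean that a prediction counts as correct when it lies within one month of the true age group; under that assumption, together with the two-sided Mann–Whitney U comparison, it can be computed as in the following sketch.

```python
# One-month relaxation accuracy and Mann-Whitney U comparison, assuming integer month
# labels 0-11; the example arrays are illustrative only.
import numpy as np
from scipy.stats import mannwhitneyu

def relaxed_accuracy(y_true: np.ndarray, y_pred: np.ndarray, tolerance: int = 1) -> float:
    """Fraction of predictions within `tolerance` months of the true age group."""
    return float(np.mean(np.abs(y_true - y_pred) <= tolerance))

# Exact accuracy here is 0.5, but every prediction falls within one month, so relaxed accuracy is 1.0.
print(relaxed_accuracy(np.array([0, 5, 7, 11]), np.array([1, 5, 8, 11])))

# Two-sided Mann-Whitney U test comparing per-image absolute errors between two age groups.
errors_group_a = np.array([0, 1, 0, 2, 1])
errors_group_b = np.array([1, 2, 2, 3, 1])
stat, p_value = mannwhitneyu(errors_group_a, errors_group_b, alternative="two-sided")
```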
A gradient-weighted class activation map (Grad-CAM++) was applied to the neural network to localize the discriminative regions used by the deep-learning tool to determine the specific class of a given image15. To evaluate the proposed method against an established alternative, comparison experiments were conducted with the RSNA Bone Challenge winner model16. The RSNA Bone Challenge is a competition for estimating the bone age of pediatric patients from hand radiographs. The winner model used InceptionV3 as its deep learning network and additionally took sex as an input feature.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee (Institutional Review Board of Hallym University Sacred Hospital) and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

This study was carried out as a retrospective analysis, and all patient data were anonymized prior to use. Informed consent was waived by the Institutional Review Board of Hallym University Sacred Hospital (No. 2023-01-002) owing to the retrospective nature of the study.
