The two-stage detection-after-segmentation model improves the accuracy of identifying subdiaphragmatic lesions

Original image dataset

We selected the National Institutes of Health Chest X-ray Dataset (NIH CXR dataset) for this study. The dataset was made available for public use under the Creative Commons License “CC0: Public Domain”26,27. This is an open-access dataset, allowing scientists to use it freely. In accordance with the requirements, we have acknowledged the NIH Clinical Center as the provider of the dataset, included the appropriate citation28, and referenced the dataset’s website: https://nihcc.app.box.com/v/ChestXray-NIHCC29. The dataset consists of 112,120 X-ray images from 30,805 unique patients. It is organized into 12 separate folders, sequentially named from “images_001” to “images_012.” For this study, we reviewed 24,999 image files from three folders: “images_001,” “images_002,” and “images_003.” The remaining 87,121 images, stored in folders “images_004” to “images_012,” were not included and have been reserved for future analysis.

Architecture

The two-stage detection-after-segmentation architecture consists of dataset preparation, model training, and performance evaluation, organized in two stages (see Fig. 1).

Fig. 1 An overview of dataset preparation, model training, and evaluation in the two-stage detection-after-segmentation architecture. The figure on the right illustrates the first stage: (A) We used the images from the National Institutes of Health Chest X-ray Dataset (NIH CXR) folder “images_001” to create three masks: right upper quadrant (RUQ), left upper quadrant (LUQ), and upper abdomen (ABD). These masks were combined with their corresponding images to form a dataset of mask/image pairs, split into training and validation sets in a ratio of 0.8:0.2. (B) The U-Net architecture was employed to train the segmentation model. After each training epoch, the model’s performance was evaluated using the mean Intersection over Union (mIoU) to determine the optimal threshold and number of epochs. (C) The subplot on the right shows the mIoU-threshold curve derived from 200 test samples after a particular epoch. The y-axis represents mIoU values, and the x-axis represents thresholds, ranging from 0.1 to 0.9. The subplot on the left illustrates the mIoU-threshold curve over the training epochs, showing that the mIoU curves peaked at approximately 15–20 epochs, with no further improvement observed beyond that point. The curve generally forms a dome shape, peaking at thresholds between 0.5 and 0.8. The final segmentation model used 0.5 as the cutoff threshold for the prediction mask. In the workflows using “images_002” and “images_003,” we move to the detection stage: (D) The original images were extracted from NIH CXR folders “images_002” and “images_003,” focusing solely on the abdominal portion of the CXR. Based on anatomical locations, air patterns, and levels of stomach or bowel dilation, a subdiaphragmatic dataset was established, comprising 19 classes and a total of 5,996 images. (E) This labeled subdiaphragmatic dataset was partitioned into training/validation and testing datasets. It was then input into the previously trained segmentation models, generating three labeled subdiaphragmatic training sets (RUQ, ABD, LUQ), alongside the full CXR as a control group for comparison during training. (F) A neural network model with four CNN layers was used. Multiple models were trained depending on the region of interest. For RUQ lesion detection, RUQ, ABD, and whole-CXR models were trained. For LUQ lesion detection, LUQ, ABD, and CXR models were trained.
For bilateral subdiaphragmatic air detection, ABD and CXR models were used. (G) In the detection stage, we evaluated the models using the area under the receiver-operating characteristic curve (ROC AUC), along with TensorFlow’s built-in evaluation metrics such as ROC, PRC, accuracy, recall (sensitivity), specificity, and F1-scores.

Stage 1, segmentation

Mask/image dataset

We reviewed each CXR in the “images_001” folder from the NIH CXR dataset. A total of 1,190 original images were selected out of 4,237 CXRs from patients numbered 1 to 1153. The top of each mask is the diaphragm line; we used a program we developed to connect points along the diaphragm line, and vertical lines were then drawn from both ends of the diaphragm line down to the bottom of the image, enclosing the mask region. Some CXRs with obscured diaphragms or limited visibility due to conditions such as pleural effusion, large lung consolidation, or cardiomegaly were not annotated. CXRs that were misaligned or scanned with blank margins on all four sides were discarded. On each image, we manually marked two masks: the right upper quadrant (RUQ) mask and the left upper quadrant (LUQ) mask. The RUQ mask was anatomically marked along the visible right diaphragm, extending to the edge of the right spine. Similarly, the LUQ mask extended from the junction of the lower cardiac border and the edge of the left spine through the left diaphragm. If there was no clear boundary at the mediastinum, a straight line was visually drawn to the spine. Additionally, the upper abdomen (ABD) mask was created by combining the RUQ and LUQ masks, filling the gap along the spine with an image tool to form a complete subdiaphragmatic mask. A total of 1,190 masks were created for each segment, and the filenames of these mask files were appended with “_mask” to match the filenames of the original images (see Fig. S1 and Table S1 in the Supplementary Appendix).

Segmentation model training and evaluation

We used TensorFlow version 2.10 as our deep learning framework in Python 3.10. The model for training was a modified U-Net architecture based on a convolutional neural network (CNN), with two dropout layers incorporated at the bottleneck of the U-Net, inspired by the work of S. Rajaraman30. The input image size was set to (256, 256) with a single grayscale channel. The images first pass through a standardization layer to ensure proper normalization. The encoder part of the U-Net consists of four blocks of convolutional layers with ReLU activation, where the number of filters doubles as the network depth increases, ranging from 64 filters in the first block to 512 filters in the fourth block, with each block followed by a max-pooling layer. After the fourth block, a dropout layer with a rate of 0.5 is applied to prevent overfitting. The bottleneck contains two convolutional layers with 1,024 filters, followed by an additional dropout layer with a rate of 0.5. In the decoder part, transposed convolutional layers are used to upsample the feature maps, and skip connections concatenate the upsampled features with the corresponding feature maps from the encoder. The decoder mirrors the encoder, progressively reducing the number of filters from 512 to 64 as the feature maps are upsampled. Finally, a single convolutional layer with a sigmoid activation function outputs the segmentation mask, with pixel values ranging between 0 and 1 representing the predicted mask.
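For illustration, a minimal Keras sketch of a U-Net of this shape is given below. It is not the study's exact code: the rescaling constant, the number of convolutions per block, and the optimizer and loss are assumptions made for demonstration.

```python
# Minimal sketch of the modified U-Net described above (TensorFlow/Keras 2.x).
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_size=(256, 256, 1)):
    inputs = layers.Input(shape=input_size)
    x = layers.Rescaling(1.0 / 255)(inputs)  # standardization of grayscale intensities (assumed constant)

    # Encoder: four blocks, filters doubling from 64 to 512, each followed by max pooling
    skips = []
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.5)(x)  # dropout after the fourth encoder block

    # Bottleneck: two convolutional layers with 1,024 filters plus a second dropout layer
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.5)(x)

    # Decoder: transposed convolutions with skip connections, filters reducing from 512 to 64
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Single-channel sigmoid output: per-pixel mask probability in [0, 1]
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")  # assumed training configuration
```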
To avoid losing peripheral image data, we applied “same” padding to the Conv2D layers, preventing any reduction of image size during convolution (see Fig. S2-1). A total of 1,190 data pairs were split into training and validation sets in a ratio of 0.8:0.2. Each dataset element consisted of an original image paired with its corresponding binary mask image (see Fig. S2-2). The mean intersection over union (mIoU) metric was used to evaluate the accuracy of the segmentation algorithm and to determine the optimal settings. We used the first 200 image/mask pairs from the original dataset as a test set to assess the model’s performance. Based on the results, the optimal number of training epochs was found to be 20, with threshold values of 0.5, 0.6, and 0.7. Three models were trained: the RUQ segmentation model, which identifies the right subdiaphragmatic area; the LUQ segmentation model, targeting the left subdiaphragmatic area; and the ABD segmentation model, focusing on the upper abdomen area.

Stage 2, detection

Dataset of subdiaphragmatic lesions

The classification of the subdiaphragmatic dataset was based on the magnitude and patterns of gastric and intestinal dilatation in three locations: the bilateral quadrants below the diaphragm (BIL), the RUQ, and the LUQ. Gastric air-fluid levels, characterized by a semi-circular structure, and oval-shaped gastric bubbles were further subdivided according to their highly distinctive imaging features. The process for assessing a lesion begins with the RUQ, followed by the LUQ. If subdiaphragmatic air or superimposed air is present in the RUQ, these findings are prioritized. On the left side, the focus is on the level of dilation, with severe bowel or gastric dilation, as well as large gastric air-fluid levels or gastric bubbles, receiving priority. Lastly, the presence of any notable bilateral upper-quadrant air distribution is considered. Each image was interpreted by two thoracic radiologists. If a decision could not be reached within ten seconds, the image was skipped, and the next one was reviewed. After the initial categorization, the first radiologist conducted a second round of classification adjustments based on clinical judgment, followed by the same process by the second radiologist. We did not consider lung-field lesions, but in cases of significant disappearance of the diaphragm line or severe conditions such as pneumonia or pneumothorax, the images were generally skipped. The exposure levels of NIH CXR images vary, and no specific filtering was applied during the selection process; this approach was used to achieve human-level randomization. Out of 13,836 images, 5,996 were selected from the “images_002” and “images_003” folders in the NIH CXR dataset, from patients numbered 1336 to 5000. There were 2 categories in BIL, 4 in RUQ, and 13 in LUQ, totaling 19 categories. Illustrative chest X-rays for these categories are available in Table S2 in the Supplementary Appendix. The following sections provide an overview of the classification, which is summarized in Table 1.

Table 1 The subdiaphragmatic dataset.

In the BIL subset, the data were grouped by the presence or absence of gastrointestinal gas. “No air” signified the absence of gas, typically manifesting as a uniformly radiopaque area beneath both sides of the diaphragm.
“Bowel dilatation” was characterized by significant bowel distension or abundant bowel gas distributed in the upper abdomen.

In the RUQ subset, “subdiaphragmatic air” was the presence of clearly visible “free air” below the right diaphragm, providing the contrast that revealed the liver’s upper or lateral contour. The “superimposed air” pattern was identified when intestinal gas was located to the right of or anterior to the liver. This was defined by the presence of air above the hepatic lower edge, approximating a line from the junction of the right diaphragm and the spine to the anterior end of the 12th rib. “Subhepatic air” was the presence of air along or below the hepatic lower-edge line, so that the region between the diaphragm and the hepatic lower edge appears as a homogeneous radiopacity on imaging. The “no air” category was characterized by the absence of gastrointestinal gas patterns below the right diaphragm, extending to the ipsilateral bottom of the CXR, resulting in a uniformly radiopaque appearance. It is important to note that in this categorization, the air distribution at the left diaphragmatic area is not considered.

The LUQ subset was divided into “gastric only” and “both gastric and intestinal.” To facilitate imaging identification, the two groups were further classified based on the size of gastrointestinal dilatation and the absence or presence of egg-shaped dilatation or a gastric air-fluid level. The “gastric only” group was defined by the presence of either a single air-fluid level or a single air pocket, with no evidence of intestinal gas. This group was further refined into three categories based on the magnitude of gastric dilatation. In severe gastric dilatation, the entire left subdiaphragmatic space was occupied by the extremely dilated gastric lumen. In moderate gastric dilatation, the pattern identified was an oval-shaped gastric bubble alone or a large air-fluid level. Lastly, small gastric air pockets and air-fluid levels were classified as non-specific air patterns within the gastric-only group.

In the “both gastric and intestinal” group, the degree of bowel dilatation was classified as either severe or moderate. In severe cases, three categories were defined: extreme dilatation of the intestines only, extreme bowel dilatation with large gastric air-fluid levels, and bowel dilatation with severe gastric dilatation, all characterized by dilatation occupying the entire space below the left diaphragm, with or without diaphragmatic elevation. Moderate dilatation of the intestines and stomach was divided into moderate bowel dilatation only, and moderate bowel dilatation with large gastric air or air-fluid levels. Lastly, a small quantity of gastric and bowel gas, with or without air-fluid levels, was classified as non-specific bowel air. In some cases, intestinal gas was observed to occupy half or more of the subdiaphragmatic space. However, instances where the intestinal lumen was significantly filled with feces, where the distribution of intestinal gas was discontinuous, or where only one or two pockets of gas were present were classified as non-specific. For illustrative chest X-ray examples, see Table S2 in the Supplementary Appendix.

Subset combination for severity

Excessive dilatation of the gastrointestinal tract is often linked to life-threatening conditions, and the level of dilatation correlates with the degree of clinical emergency. We assumed that the degree of dilatation of the stomach or intestines was related to clinical severity.
Therefore, we added a severity level to the original classification. In the BIL subset, “bowel dilatation” was given moderate severity, while “no air” was given none. In the RUQ subset, “subdiaphragmatic air” was graded severe, “superimposed air” moderate, and “subhepatic air” and “no air” were graded as non-specific air patterns (see Table 1). In the LUQ subset, we combined the 13 categories for severity assessment based on two criteria: the presence of isolated gastric air versus the coexistence of gastric and intestinal air, and the extent of severity as determined by dilatation. This resulted in six distinct classes: “both severe bowel dilatation,” meaning severe gastric and intestinal dilatation; “both moderate bowel dilatation” for moderate dilatation; “both nonspecific” for a nonspecific air pattern; “gastric severe dilation” for extreme gastric dilatation only, without obvious bowel air; “gastric moderate dilatation” for moderate dilatation; and “gastric nonspecific” for a nonspecific gastric air pattern (see Table S3 in the Supplementary Appendix).

Detection model training

The detection model architecture consisted of four convolutional (CNN) layers with dropout to mitigate overfitting and a standardization layer to address exposure variation. The four convolutional layers had 16, 32, 64, and 128 filters, respectively, each with a 3 × 3 kernel and ReLU activation, followed by max pooling to progressively reduce spatial dimensions and extract features. The output from the final convolutional layer was flattened into a one-dimensional vector and processed by a Dense layer with 512 units and ReLU activation. A Dropout layer with a 30% (0.3) dropout rate was applied to prevent overfitting by randomly deactivating a portion of the neurons during training. The model was trained for multi-class classification on the 13-category LUQ subset (LUQ-13), the LUQ severity subset (LUQ-severity), and the RUQ subset; for these, a final Dense layer with multiple units corresponding to the number of labels and a softmax activation function was used, making it suitable for categorical classification. For the BIL subset, a final Dense layer with a single unit and a sigmoid activation function was used to output a probability between 0 and 1, suitable for binary classification. An early stopping mechanism was implemented to halt training at the optimal time. To prevent weight interference between training sessions, we terminated the Python program after each session and restarted it before subsequent sessions. The model architectures and TensorFlow code are provided in the Supplementary Appendix, Fig. S4.

The training and analysis were conducted on the subdiaphragmatic dataset, which included the LUQ-13, LUQ-severity, RUQ, and BIL categories. The dataset was split into training, validation, and test sets with a 0.7:0.2:0.1 ratio, and the shuffling seed was set to 51 to ensure consistent data arrangement across different model training processes. Detection was performed using two types of input for training: one used the entire image (CXR), and the other used regions extracted from segmentation-model predictions, specifically RUQ, LUQ, and ABD. The segmentation models produced prediction masks for RUQ, LUQ, and ABD, which were then used to draw bounding boxes around the predicted areas; the regional images within these bounding boxes were cropped and fed into the detection model (illustrative sketches of the classifier and this cropping step follow below). For the BIL region, only CXR and ABD inputs were compared. The RUQ subset was trained with CXR, ABD, and RUQ crops as inputs. A similar approach was applied to LUQ-13 and LUQ-severity, using CXR, ABD, and LUQ crops.
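As an illustration only, the following Keras sketch reflects a four-layer CNN classifier of the shape described above; the input size, optimizer, loss functions, and early-stopping settings are assumptions rather than the study's actual configuration.

```python
# Illustrative sketch of the four-layer CNN detection model (assumed 256x256x1 input).
import tensorflow as tf
from tensorflow.keras import layers

def build_detector(num_classes, input_size=(256, 256, 1)):
    """num_classes > 1 -> softmax multi-class head; num_classes == 1 -> sigmoid binary head."""
    inputs = layers.Input(shape=input_size)
    x = layers.Rescaling(1.0 / 255)(inputs)          # standardization layer for exposure variation
    for filters in (16, 32, 64, 128):                # four conv blocks, 3x3 kernels, ReLU, max pooling
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)                       # 30% dropout before the output layer
    if num_classes == 1:
        outputs = layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:
        outputs = layers.Dense(num_classes, activation="softmax")(x)
        loss = "sparse_categorical_crossentropy"
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model

# Early stopping callback (patience value is an assumption).
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
```

Likewise, a minimal sketch of how a predicted segmentation mask can be converted into a cropped regional input is shown below; the 0.5 binarization threshold and the function name are illustrative.

```python
import numpy as np

def crop_from_mask(image, seg_model, threshold=0.5):
    """Predict a mask for one (256, 256, 1) grayscale image, binarize it,
    and crop the image to the bounding box of the predicted region."""
    pred = seg_model.predict(image[np.newaxis, ...])[0, ..., 0]   # (256, 256) mask probabilities
    ys, xs = np.where(pred > threshold)
    if ys.size == 0:
        return image                       # no region predicted; keep the full image unchanged
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1, :]          # rectangular crop fed to the detection model
```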
In total, 11 detection models were trained. Due to computational power and memory limitations, we did not load the segmentation model and the detection model simultaneously during training. Instead, we saved the cropped rectangular images predicted by the ABD, RUQ, and LUQ segmentation models to folders on disk corresponding to their respective categories for later use in the detection process. For example, the complete LUQ-13 dataset was stored as whole CXR images in folders labeled with the 13 category names; after prediction by the ABD segmentation model, the corresponding cropped images were saved in folders with the same 13 category names. This approach allowed us to verify the accuracy of the cropped prediction areas by simply viewing the saved images on disk. For the very few images with inaccurate predictions, we retained them in their original state, without any post-processing, to preserve the models’ initial predictions for experimental integrity (see Fig. S6 in the Supplementary Appendix).

Statistical evaluation

The testing was conducted on the test set from the subdiaphragmatic dataset, which had been divided into training, validation, and test sets at a fixed ratio. The images in the test set were cropped to the appropriate sizes and then processed by the respective detection models. The primary performance metric was the area under the receiver operating characteristic curve (ROC AUC), computed with the scikit-learn library. The ROC AUC was further supplemented with 95% confidence intervals estimated by the bootstrap method. Additionally, we performed pairwise comparisons across three groups, CXR-ABD, CXR-LUQ/RUQ, and ABD-LUQ/RUQ, using confidence intervals of the AUC differences to infer the presence of statistically significant differences. If either the lower or the upper bound of the confidence interval approached zero, we inferred that the result was close to statistical significance.

To provide a comprehensive overview of the performance across different classes, both micro-average and macro-average approaches were applied to the analysis of the curves. Micro-averaging and macro-averaging are methods used to evaluate the performance of multi-class classification models. The micro-average aggregates contributions from all classes before calculating the overall metric, giving equal weight to every instance, which may skew results toward classes with more samples; it is particularly useful for assessing overall model performance when class sizes are unequal. The macro-average, by contrast, computes the metric for each class individually and then averages the per-class values, offering a more balanced view when the class distribution is imbalanced; it provides insight into how well the model performs on each class, assigning equal importance to each class regardless of size.

The third set of metrics was provided by TensorFlow’s evaluation functions. By configuring tf.keras.metrics to include accuracy, AUC for the ROC and precision-recall (PR) curves, precision, and recall, the model was evaluated from multiple perspectives. Since we used TensorFlow version 2.10 on a native Microsoft Windows platform with CUDA, which does not provide built-in functions for specificity and F1 score (these are available starting from version 2.16 (ref. 31), which supports CUDA on Windows only through Windows Subsystem for Linux 2 (ref. 32)), we implemented a custom function to calculate specificity and derived the F1 score from the computed precision and recall values using the formula: F1 score = 2 * (precision * recall) / (precision + recall).
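A minimal sketch of such helper functions is given below; the 0.5 decision threshold and the epsilon guard are assumptions for illustration, not the study's exact implementation.

```python
import tensorflow as tf

def specificity(y_true, y_pred, threshold=0.5):
    """Specificity = TN / (TN + FP), from binary labels and predicted probabilities."""
    y_true = tf.cast(y_true, tf.float32)
    y_hat = tf.cast(y_pred >= threshold, tf.float32)
    tn = tf.reduce_sum((1.0 - y_true) * (1.0 - y_hat))
    fp = tf.reduce_sum((1.0 - y_true) * y_hat)
    return tn / (tn + fp + tf.keras.backend.epsilon())

def f1_score(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall), from already-computed values."""
    denom = precision + recall
    return 0.0 if denom == 0 else 2.0 * precision * recall / denom
```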
These results were generated using TensorFlow’s metric functions, which do not provide statistical-significance information; however, they offered an overview of the models’ performance in these comparisons.
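For reference, a minimal sketch of a percentile-bootstrap confidence interval for ROC AUC is shown below; the number of resamples, the random seed, and the function name are illustrative assumptions rather than the study's exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=51):
    """Point estimate and percentile-bootstrap 95% CI for a binary (or one-vs-rest) ROC AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue                                      # skip resamples containing a single class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Micro- and macro-averaged AUCs for a multi-class model can be obtained by passing
# one-hot encoded labels and the probability matrix to roc_auc_score with
# average="micro" or average="macro".
```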
