High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma

Human ethical clearance

Tissue slides were collected with the approval of the ethical committees of the participating hospitals and research institutions: (1) Jamia Millia Islamia, New Delhi (Proposal No. 6(25/7/241/JMI/IEC/2021); (2) Maulana Azad Institute of Dental Sciences, New Delhi (Proposal No. F./18/81/MAIDS/Ethical Committee/2016/8099); (3) Rajendra Institute of Medical Sciences, Jharkhand (Proposal No. ECR/769/INST/JH/2015/RR-18/236); (4) Banaras Hindu University, Banaras (Proposal No. Dean/2021/EC/2662); and (5) All India Institute of Medical Sciences, New Delhi (Proposal No. IEC-828/03.12.2021, RP-33/2022), India. Buccal mucosa tissue samples were collected for three classes (normal, OSMF, and OSCC), with grade-wise annotation provided by the pathologists at each hospital. Data collection was conducted with the explicit consent of the patients involved, following a rigorous ethical review and approval process by the relevant committees. Informed consent was obtained from all participants, who were made fully aware of the study's purpose, procedures, potential risks, and benefits, and were given the opportunity to ask questions and seek clarification before consenting to participate. Participants willingly agreed to the open publication of their data, understanding that their identities would be protected and their information anonymized. The ethical approvals referenced above document the study's compliance with the applicable guidelines and regulations and serve as a means of tracking and verifying its adherence to ethical standards.

Haematoxylin and eosin (H&E) staining

Biopsy samples of normal, OSMF, and OSCC tissues underwent H&E staining, performed either in-house or by external laboratories. Slide preparation involved five histopathology labs, each using its own independently developed and optimized staining protocol; the resulting inter-laboratory staining variation was addressed computationally at a later stage (see Stain normalization). After staining, the samples were examined under a microscope by a skilled histopathologist to assess cellular morphology and tissue architecture and to identify any distinctive features or abnormalities specific to each sample type. This evaluation included grading the tissue slides for OSCC and OSMF and differentiating normal from diseased tissue sections. The annotated and validated images were then used for further analysis.

Image acquisition

Images of the H&E-stained slides were acquired at 1000X magnification (100X objective lens) on a Leedz Microimaging (LMI) bright-field microscope. To capture images consistently, we used ToupView imaging software configured for automatic white balance and camera settings, thereby standardizing image acquisition across slides. Automating these adjustments minimized human intervention and the variability it introduces, making the images both consistent and replicable in other laboratory settings, provided similar equipment and software settings are used.
We collected approximately 100–150 images per tissue slide, stored in PNG file format.

Expert annotation and validation

The data included in the ORCHID database underwent rigorous expert annotation and validation to ensure a high level of quality and accuracy. In our expert validation process, an image qualified as having 'sufficient detail' based on several key criteria. First, the image had to clearly depict the necessary histological structures, such as cellular details and tissue architecture. Second, it had to be free from artifacts that could interfere with accurate interpretation (e.g., folds, tears, excessive staining). Third, it had to be in focus, with appropriate contrast and resolution to discern pathological features. Our team of pathologists and histopathology experts independently assessed each image against these criteria so that only high-quality images were included; images that were blurry or lacked sufficient detail were dismissed, as they would not provide accurate or reliable information. Next, the experts evaluated the annotations accompanying the images, scrutinizing them for consistency and accuracy to ensure they correctly represented the disease conditions depicted. Slide labeling was performed manually by trained pathology experts, who carefully reviewed each slide to identify and label the specific disease conditions present; this procedure was crucial for correct categorization. Slides showing staining artifacts were also rejected: such artifacts, which can arise during slide preparation, alter the appearance of the tissue and can lead to misinterpretation or incorrect diagnosis. Only slides free from such errors, providing a clear and accurate representation of the oral pathology, were included in the database. These standardization processes ensure that AI models are trained and validated on data that consistently represent true pathological features; standardized and validated data enhance a model's ability to generalize across datasets and real-world scenarios.

Stain normalization

Handling of the tissue samples at each hospital led to staining variation that persisted even after following the established H&E staining protocol. To minimize differences in staining appearance across sites, a stain normalization method was applied to the H&E images, specifically the Reinhard stain normalization technique14, as shown in Fig. 1b. This approach standardizes the color properties of the images to a desired reference in a series of steps. The first step scales the input image to match the target image statistics: intensity values are adjusted so that the overall brightness and contrast of the input image align with the desired color distribution of the target image. The next step transforms the image from the RGB color space to the LAB color space proposed by Ruderman, which separates the image into three channels: L (lightness), A (green-red component), and B (blue-yellow component). This representation better captures the perceptual differences in human vision. Finally, Reinhard color normalization is applied to the LAB image, adjusting its color properties to the desired standard by equalizing the mean and standard deviation of the LAB channels across the image.

Fig. 1 Workflow, image analysis, and stain normalization. (a) The workflow for preparing oral histopathology slides, from collection of tissue samples to slide preparation and staining. (b) Stain normalization standardizes the stain appearance in the images using the Reinhard method, ensuring consistent and comparable staining across images. Scale bar, 10 μm. (c) Representative images of normal tissue, OSMF cases, and OSCC cases captured at 1000X magnification and digitized using bright-field microscopy. Scale bar, 10 μm. (d) Image patches of 512 × 512 pixels generated from the 1000X images of normal tissue, OSMF cases, and OSCC cases, offering focused views of specific regions within the larger images.

If the LAB statistics for the input image are not provided, they are derived from the input image itself, so the normalization process is tailored to each individual image. The normalization is given by:

$$I_n = I_o \left(1 + k_1 \left(L_o - \mu_L\right) + k_2 \left(S_o - \mu_S\right)\right)$$

where $I_n$ is the normalized image, $I_o$ is the original image, $k_1$ and $k_2$ are constants chosen to optimize the appearance of the normalized image, $L_o$ and $S_o$ are the average brightness and saturation of the original image, and $\mu_L$ and $\mu_S$ are the average brightness and saturation of a reference image.

By minimizing variations in staining and image quality, AI models can focus on learning relevant pathological patterns rather than adapting to artifacts or inconsistencies.
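As a concrete illustration, the sketch below implements Reinhard-style normalization by matching the per-channel mean and standard deviation in LAB space. It is a minimal sketch, not the exact pipeline used for ORCHID: the function name and the use of scikit-image for color conversion are our assumptions.

```python
import numpy as np
from skimage import color


def reinhard_normalize(src_rgb: np.ndarray, ref_rgb: np.ndarray) -> np.ndarray:
    """Match the LAB mean/std of src_rgb to those of a reference image."""
    src_lab = color.rgb2lab(src_rgb)  # RGB -> LAB (Ruderman-style opponent space)
    ref_lab = color.rgb2lab(ref_rgb)
    out = np.empty_like(src_lab)
    for c in range(3):  # L, A, B channels
        s_mu, s_sd = src_lab[..., c].mean(), src_lab[..., c].std()
        r_mu, r_sd = ref_lab[..., c].mean(), ref_lab[..., c].std()
        # Equalize mean and standard deviation against the reference image
        out[..., c] = (src_lab[..., c] - s_mu) / (s_sd + 1e-8) * r_sd + r_mu
    rgb = color.lab2rgb(out)  # back to RGB, float in [0, 1]
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)
```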
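The following sketch, assuming a NumPy array as input, implements this sequential overlapping crop; the variable names mirror the notation above, and the example image size in the comment is an assumption.

```python
import numpy as np


def generate_patches(image: np.ndarray, p: int = 512, overlap: int = 256):
    """Crop overlapping p x p patches left-to-right, top-to-bottom."""
    h, w = image.shape[:2]
    stride = p - overlap          # P_w - O = 256 pixels between patch origins
    nx = (w - p) // stride + 1    # N_x = floor((W - P_w)/(P_w - O)) + 1
    ny = (h - p) // stride + 1    # N_y = floor((H - P_h)/(P_h - O)) + 1
    patches = []
    for j in range(ny):           # j = 0 .. N_y - 1
        for i in range(nx):       # i = 0 .. N_x - 1
            x_tl, y_tl = i * stride, j * stride          # top-left corner
            patches.append(image[y_tl:y_tl + p, x_tl:x_tl + p])
    return patches

# Example (image size assumed for illustration): for a 2048 x 1536 image,
# nx = (2048 - 512)//256 + 1 = 7 and ny = (1536 - 512)//256 + 1 = 5,
# giving 35 patches of 512 x 512.
```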
Baseline model development and fine-tuning

We benchmarked ten deep convolutional neural networks (DCNNs) through pre-training and fine-tuning, with the aim of classifying oral cancer against non-cancerous samples (Fig. 3). The study focuses on a three-class classification task: normal vs. oral submucous fibrosis (OSMF) vs. oral squamous cell carcinoma (OSCC).

Fig. 2 Detailed statistics of the ORCHID database. This figure shows the distribution of ORCHID images and patient cases across five categories: normal samples, OSMF, and the three differentiation grades of OSCC (PDOSCC, MDOSCC, WDOSCC). The dataset was split into 70% training, 20% validation, and 10% test sets.

Fig. 3 Flowchart describing the process of benchmarking pre-trained models. The flowchart outlines the steps and criteria for evaluating candidate models, providing a systematic approach that accounts for model architecture, training data, performance metrics, and compatibility with the classification task at hand.

The OSCC class aggregates three distinct stages, well-differentiated (WD), moderately differentiated (MD), and poorly differentiated (PD) OSCC, treating them as a unified class due to their common pathological origin. The InceptionV3 model15 was pre-trained on the ImageNet dataset, providing a strong initial set of learned features. It offers a robust balance between accuracy and overfitting and excels in computational efficiency, and its architecture is well suited to the complexity and variability of the ORCHID dataset, making it a strong choice for both high performance and applicability in a clinical setting. The model's top layers were excluded to allow for customization, and all remaining layers were set as trainable so the model could adapt to the specific dataset used in the study. A flatten layer was added to convert the output of the InceptionV3 model into a one-dimensional tensor, followed by a global average pooling 2D layer and a dense layer with 128 units, ReLU activation, and L2 regularization (penalty = 0.01), facilitating feature extraction and non-linear transformation. Finally, a dense layer with 3 units and a softmax activation function produced the output probabilities for the three classes in the classification task.
The model was compiled using the RMSprop optimizer with a learning rate of 1 × 10−7 (0.0000001) and trained with the categorical cross-entropy loss function. Training ran for 50 epochs, with performance evaluated across three folds of the dataset. The same settings were used for both classification models: the first model (model-1) classifies the image patches into normal, OSMF, and OSCC, and the second model (model-2) classifies the OSCC image patches into WD, MD, and PD grades. This baseline architecture aimed to capture local and global patterns within the cellular structure indicative of cancerous transformation. To ensure reproducibility, a fixed random seed (42) was set during the dataset splitting and model initialization phases, guaranteeing consistent data shuffling and initialization across experiments. The fine-tuned InceptionV3 model demonstrated improved classification performance across all tasks, with notable gains in precision and recall, and effectively captured nuclear structures, distinguishing between normal, OSMF, and OSCC conditions with high accuracy. The development and fine-tuning of the InceptionV3 model for oral cancer classification exemplify the potential of DCNNs in biomedical applications.
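Below is a minimal sketch of this fine-tuning setup, assuming TensorFlow/Keras. For simplicity it collapses the flatten and pooling steps described above into a single global-average-pooling layer, and the dataset objects (train_ds, val_ds) are placeholders, not part of the original description.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

tf.keras.utils.set_random_seed(42)  # fixed seed for reproducible initialization

# ImageNet-pretrained InceptionV3 with top layers excluded; all layers trainable.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(512, 512, 3))
base.trainable = True

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),                 # simplification of flatten + pooling
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(3, activation="softmax"),           # normal / OSMF / OSCC
])

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-7),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_ds, validation_data=val_ds, epochs=50)  # train_ds/val_ds are placeholders
```

The same architecture serves model-2 (WD vs. MD vs. PD grading) by retraining on the OSCC patches alone, since both tasks have three output classes.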
