CystNet: An AI driven model for PCOS detection using multilevel thresholding of ultrasound images

This section comprises a comprehensive overview of the dataset utilized for training and testing the diagnosis model, followed by image preprocessing, which includes normalization, augmentation, and segmentation. It also describes the proposed model for diagnosing PCOS from ultrasound images and classifying ovaries as PCOS or non-PCOS. The overall framework of this study is presented in Fig. 1, outlining the phases involved in the research.

Figure 1: Systematic approach outlining each step of the research process.

Dataset description

A dataset obtained from Kaggle52 is utilized for training and testing our models, initially comprising 1,924 images for training and 1,932 for testing. However, owing to significant overlap between these sets, the original test set is discarded and only the training set is retained; this set is then split into new training and test sets. It includes ultrasound images labelled as 'INFECTED' (781 images with cystic ovaries) and 'NOT INFECTED' (1,143 images with healthy ovaries). These two classes identify individuals suffering from PCOS and those who are not, respectively, making the labelling directly relevant for real-time medical systems that must diagnose PCOS accurately. Figure 2 shows sample images from both categories.

This study also incorporates the PCOSGen dataset53, gathered from various online sources. It includes 3,200 healthy and 1,468 unhealthy samples, divided into training and test sets, which have been medically annotated by a gynaecologist in New Delhi, India. Additionally, the Multi-Modality Ovarian Tumor Ultrasound (MMOTU) image dataset54 is utilized, containing 1,639 ultrasound images from 294 patients.

Figure 2: Visual comparison of ultrasound images from patients with and without PCOS.

Dataset preprocessing

The data preprocessing step is crucial for preparing the ultrasound images for training and testing the diagnosis model. It involves several sub-processes: image resizing and normalization, image augmentation, and image segmentation, described below.

Image resizing and normalization

All ultrasound images are resized to a uniform dimension of 224×224 pixels. This standardization ensures that the input dimensions are consistent across all images, which is crucial for the convolutional neural network (CNN) to process them efficiently. Uniform resizing also reduces the computational load and memory requirements during model training. After resizing, the pixel values are normalized using min-max normalization55, scaling each pixel value I to the range 0 to 1 using the formula:$$\begin{aligned} I_{\text {norm}} = \frac{I - I_{\text {min}}}{I_{\text {max}} - I_{\text {min}}} \end{aligned}$$
(1)
where I \(\rightarrow\) original pixel value, \(I_{\text {min}}\) \(\rightarrow\) minimum intensity value, and \(I_{\text {max}}\) \(\rightarrow\) maximum intensity value (usually 255 for 8-bit images). This scaling reduces the variance, making model training faster and more stable, and keeps all pixel values on a similar scale so that no single feature dominates the learning process.

Image augmentation

Image augmentation56 is essential for enhancing the diversity of the training data, thereby improving the ability of the model to generalize to unseen samples. It applies various transformations to the original images, such as rotation, flipping, zooming, and shifting. These operations introduce variations in the dataset, simulating real-world conditions and making the predictions of the model more robust. Figure 3 depicts the augmentation operations visually, and each operation is described below (a brief implementation sketch follows the list):

Figure 3: Visual representation of the different augmentation operations.

Rotation: An image I is rotated by an angle \(\theta\) randomly selected from the range \([-90^\circ , 90^\circ ]\). This transformation is represented mathematically as: $$\begin{aligned} I_{\text {rotated}} = R(\theta ) \cdot I \end{aligned}$$
(2)
Where \(R(\theta )\)\(\rightarrow\) rotation matrix for angle \(\theta\).

Flipping: Horizontal and vertical flipping are applied, represented as: $$\begin{aligned} I_{\text {flipped\_horizontal}} = F_h \cdot I \end{aligned}$$
(3)
$$\begin{aligned} I_{\text {flipped\_vertical}} = F_v \cdot I \end{aligned}$$
(4)
where \(F_h\) and \(F_v\)\(\rightarrow\) horizontal and vertical flip operators, respectively. This effectively doubles the dataset size.

Zooming: Zoom transformations are applied by scaling the image by a factor z randomly chosen from a range \([z_{\text {min}}, z_{\text {max}}]\): $$\begin{aligned} I_{\text {zoomed}} = Z(z) \cdot I \end{aligned}$$
(5)
where Z(z) \(\rightarrow\) zoom transformation matrix.

Shifting: Random width \(\Delta _x\) and height \(\Delta _y\) shifts, up to 20% of the total dimensions, are applied: $$\begin{aligned} I_{\text {shifted}} = T(\Delta _x, \Delta _y) \cdot I \end{aligned}$$
(6)
Where \(T(\Delta _x, \Delta _y)\)\(\rightarrow\) translation matrix for shifts \(\Delta _x\) and \(\Delta _y\).
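The resizing, normalization, and augmentation steps above can be sketched as follows. This is a minimal, illustrative implementation assuming OpenCV and TensorFlow/Keras; the zoom range of 0.2 is an assumption, since only the rotation bounds and the 20% shift limit are stated above.

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def preprocess(image_path):
    """Resize to 224x224 and apply min-max normalization (Eq. 1)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (224, 224))
    i_min, i_max = img.min(), img.max()
    return (img.astype(np.float32) - i_min) / (i_max - i_min + 1e-8)

# Augmentation: rotation in [-90, 90] degrees, horizontal/vertical flips,
# zooming, and width/height shifts of up to 20% of the image dimensions.
# The zoom factor of 0.2 is illustrative; the exact bounds are not stated.
augmenter = ImageDataGenerator(
    rotation_range=90,
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.2,
    width_shift_range=0.2,
    height_shift_range=0.2,
)

# Example usage on one preprocessed image (the file path is hypothetical):
# x = preprocess("ultrasound.jpg")[np.newaxis, ..., np.newaxis]
# batch = next(augmenter.flow(x, batch_size=1))
```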

Image segmentation

Image segmentation is a crucial step in image analysis in which an image is partitioned into multiple segments or regions, simplifying its representation into a more interpretable form. It is particularly important for medical imaging, as it allows precise identification and isolation of relevant structures, such as cysts in ultrasound images for diagnosing PCOS. In this study, the segmentation process involves several steps: noise filtering, contrast enhancement, the watershed technique, binarization, multilevel thresholding, and morphological processing. The process is visualized step-by-step in Fig. 4.

Figure 4: Visual steps in image segmentation for PCOS diagnosis.

Noise Filtering: Noise filtering39 aims to smooth out irregularities in the image caused by noise. One common technique is Gaussian blur, which applies a convolution operation between the image and a Gaussian kernel. Mathematically, this is represented as: $$\begin{aligned} I_{\text {blurred}}(x,y) = \sum _{i=-k}^{k} \sum _{j=-k}^{k} I(x+i, y+j) \times G(i,j) \end{aligned}$$
(7)
Where i\(\rightarrow\) horizontal distance & j\(\rightarrow\) vertical distance from the center of the kernel, \(I_{\text {blurred}}(x,y)\)\(\rightarrow\) blurred pixel value at coordinates (x, y), \(I(x+i, y+j)\)\(\rightarrow\) neighboring pixel values in the original image, and G(i, j) \(\rightarrow\) Gaussian kernel values which is mathematically represented as: $$\begin{aligned} G(i,j) = \frac{1}{2\pi \sigma ^2} \exp \left( -\frac{i^2 + j^2}{2\sigma ^2}\right) \end{aligned}$$
(8)
The Gaussian filter used has a mean of \(\mu = 0\) and a standard deviation of \(\sigma = 0.85\), values fixed during preprocessing to control the amount of smoothing applied.
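A minimal sketch of this smoothing step, assuming OpenCV; the random array merely stands in for a preprocessed grayscale ultrasound frame.

```python
import cv2
import numpy as np

image = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # placeholder frame

# Gaussian smoothing (Eqs. 7-8) with sigma = 0.85 as stated above.
# A kernel size of (0, 0) lets OpenCV derive the window size from sigma.
blurred = cv2.GaussianBlur(image, (0, 0), sigmaX=0.85)
```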

Contrast Enhancement: Contrast enhancement39 techniques aim to improve the visual quality of an image by increasing the intensity differences between pixels. Histogram equalization is a common method that redistributes pixel intensities to achieve a more uniform histogram. Mathematically, the transformation function T is computed from the image histogram H and its cumulative distribution function CDF: $$\begin{aligned} T(i) = \frac{L-1}{N \times M} \sum _{j=0}^{i} H(j) \end{aligned}$$
(9)
Where T(i) \(\rightarrow\) transformed intensity value, H(j) \(\rightarrow\) histogram count of intensity level j, L \(\rightarrow\) number of intensity levels, and \(N \times M\) \(\rightarrow\) total number of pixels in the image.
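Histogram equalization as described by Eq. (9) can be sketched with OpenCV; again the placeholder array stands in for the smoothed frame.

```python
import cv2
import numpy as np

blurred = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # placeholder smoothed frame

# Histogram equalization (Eq. 9): redistributes intensities so that the
# cumulative histogram becomes approximately linear, increasing contrast.
equalized = cv2.equalizeHist(blurred)
```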

Watershed Technique: The watershed algorithm40 interprets the intensity gradient of the image as a topographic surface. It identifies regions of abrupt change in intensity, which indicate potential object boundaries. Mathematically, it computes the gradient magnitude \(\nabla I\) of the image, treats it as a topographic surface, and separates low-gradient catchment basins along the high-gradient ridge lines where they meet, yielding the segmented regions.
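A minimal sketch of marker-based watershed segmentation, assuming scikit-image; the seed thresholds (30 and 150) and the placeholder frame are illustrative choices, not values from the study.

```python
import numpy as np
from skimage.filters import sobel
from skimage.segmentation import watershed

equalized = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # placeholder frame

# Gradient magnitude serves as the topographic surface; seed markers are
# placed in clearly dark and clearly bright regions (illustrative thresholds).
gradient = sobel(equalized.astype(float))
markers = np.zeros_like(equalized, dtype=np.int32)
markers[equalized < 30] = 1    # likely dark (e.g., cyst interior) seed
markers[equalized > 150] = 2   # likely bright tissue seed
regions = watershed(gradient, markers)
```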

Binarization: This operation simplifies the image by converting grayscale intensity values to binary values, effectively separating the foreground from the background. Otsu's thresholding39 is a widely used technique that automatically computes the optimal threshold value T distinguishing the two classes by minimizing the intra-class variance or, equivalently, maximizing the between-class variance \(\sigma _B^2\) of pixel intensities. Mathematically, this can be expressed as: $$\begin{aligned} T = \arg \max _t (\sigma _B^2(t)) \end{aligned}$$
(10)
In this equation, \(\arg \max _t\) selects the value of t that maximizes the between-class variance \(\sigma _B^2(t)\).
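Otsu binarization (Eq. 10) is available directly in OpenCV; a minimal sketch on a placeholder frame:

```python
import cv2
import numpy as np

equalized = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # placeholder frame

# Otsu's method: the threshold maximizing the between-class variance is found
# automatically when THRESH_OTSU is set (the supplied value 0 is ignored).
t_otsu, binary = cv2.threshold(equalized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```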

Multilevel Thresholding: Multilevel thresholding36 partitions an image into multiple segments by applying thresholding operations with a sequence of threshold values \(T_1, T_2, \ldots , T_n\). This technique is useful for segmenting images with complex intensity distributions. For a grayscale image, the process can be represented mathematically as follows: $$\begin{aligned} \text {Segment}_i = \{p \in \text {Image} \mid T_{i-1} \le \text {Intensity}(p) < T_i \} \end{aligned}$$
(11)
Where \(\text {Segment}_i\)\(\rightarrow\) region segmented by the \(i^{th}\) threshold, \(T_{i-1}\) and \(T_i\) are consecutive threshold values, and \(\text {Intensity}(p)\) denotes the intensity of the pixel p in the image.
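A minimal multilevel-thresholding sketch using scikit-image's multi-Otsu implementation; the number of classes (3) is an assumption, since the exact number of thresholds used is not stated above.

```python
import numpy as np
from skimage.filters import threshold_multiotsu

equalized = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # placeholder frame

# Multilevel thresholding (Eq. 11): classes=3 yields two thresholds T1 < T2,
# and each pixel is assigned the index of the intensity band it falls into.
thresholds = threshold_multiotsu(equalized, classes=3)
segments = np.digitize(equalized, bins=thresholds)   # per-pixel segment label 0..2
```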

Morphological Processing: Morphological operations39 manipulate the shape and structure of objects in the image. Erosion, dilation, opening, and closing are common morphological operations. Mathematically, these operations are defined using structuring elements (kernels) and applied to binary images to modify object boundaries and remove noise. For instance, erosion E and dilation D can be expressed as: $$\begin{aligned} E(A) = \bigcap _{i=1}^{n} (A - B_i) \end{aligned}$$
(12)
$$\begin{aligned} D(A) = \bigcup _{i=1}^{n} (A \oplus B_i) \end{aligned}$$
(13)
Where A \(\rightarrow\) input binary image, \(B_i\) \(\rightarrow\) structuring element, and \(\oplus\) and − denote the dilation and erosion operations, respectively.
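A minimal sketch of these morphological operations with OpenCV; the 3×3 elliptical structuring element is an assumption made for illustration.

```python
import cv2
import numpy as np

binary = (np.random.rand(224, 224) > 0.5).astype(np.uint8) * 255  # placeholder binary mask

# Morphological processing (Eqs. 12-13): opening removes small bright speckle,
# closing fills small holes in the segmented cyst regions.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
eroded  = cv2.erode(binary, kernel, iterations=1)
dilated = cv2.dilate(binary, kernel, iterations=1)
opened  = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
closed  = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```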

Dataset division

The dataset from Kaggle is divided into two sets: 70% for training and 30% for testing. The training set is used to train the classifier, allowing it to learn patterns and features indicative of PCOS, while the testing set evaluates the performance of the model and its ability to generalize and accurately diagnose PCOS on unseen data.

Feature extraction

Feature extraction transforms raw data into a set of informative attributes, reducing the complexity of the data by focusing on its most relevant aspects and thereby enhancing the performance and efficiency of the models. InceptionNet V3 and a convolutional autoencoder are used for feature extraction in our approach. These methods are integrated into a single model named CystNet, which combines the strengths of both techniques, as represented in Fig. 5. Each method is described in detail below.

Figure 5: CystNet: integrating InceptionNet V3 and a convolutional autoencoder for feature extraction.

Feature extraction using InceptionNet V3

Unlike traditional CNN models that use a single receptive field size in each layer, the Inception module employs kernels of various sizes (1\(\times\)1, 3\(\times\)3, and 5\(\times\)5) in parallel41. The resulting parallel feature maps are depth-wise stacked to produce the output of the module, and a 3\(\times\)3 max-pooled version of the input is included in the feature stack. This combined output delivers rich, multi-perspective feature maps to the subsequent convolutional layer, contributing to the effectiveness of the Inception module in applications such as medical imaging, in our case PCOS ultrasound images.

InceptionNet V342, a variant of the Inception architecture, is employed in this study for feature extraction. The network retains three Inception modules and one grid-size-reduction block, followed by Max Pooling and Global Average Pooling layers to decrease the output dimensions. It is a 48-layer deep convolutional neural network, implemented in Keras and pre-trained on ImageNet, capable of recognizing intricate patterns and features in medical images. It employs convolution kernel splitting to break large convolutions into smaller ones, such as dividing a 3\(\times\)3 convolution into 3\(\times\)1 and 1\(\times\)3 convolutions. After convolution, the channels are aggregated and non-linear fusion is performed, enhancing the adaptability of the network to different scales, preventing overfitting, and improving spatial feature extraction. With the input matrix \(A_X\) given in equation 14, the convolutional operation followed by ReLU activation, max pooling, and global average pooling is represented as follows:$$\begin{aligned} A_X = \left[ \begin{array}{ccc} P_{11} & \cdots & P_{1y} \\ P_{21} & \cdots & P_{2y} \\ P_{x1} & \cdots & P_{xy} \end{array}\right] \times \left[ \begin{array}{ccc} Q_{11} & \cdots & Q_{1y} \\ Q_{21} & \cdots & Q_{2y} \\ Q_{x1} & \cdots & Q_{xy} \end{array}\right] \end{aligned}$$
(14)
$$\begin{aligned} = \sum _{i=0}^{x-1} \sum _{j=0}^{y-1} P_{(x-i),(y-j)} \, Q_{(i+1),(j+1)} \end{aligned}$$
(15)
$$\begin{aligned} \text {ReLU} = \max (0,x) \end{aligned}$$
(16)
$$\begin{aligned} X^{(i)} = \text {ReLU}(W^{(i)} * X^{(i-1)} + b^{(i)}) \end{aligned}$$
(17)
$$\begin{aligned} \text {MaxPooling}(X)_{ij} = \max _{(p,q) \in \text {window}(i,j)} X_{pq} \end{aligned}$$
(18)
$$\begin{aligned} X_{\text {pooled}}^{(i)} = \text {MaxPooling}(X^{(i)}) \end{aligned}$$
(19)
$$\begin{aligned} \text {GlobalAveragePooling}(X) = \frac{1}{M \times N} \sum _{i=1}^{M} \sum _{j=1}^{N} X_{ij} \end{aligned}$$
(20)
$$\begin{aligned} X_{\text {features}} = \text {GlobalAveragePooling}(X^{(n)}) \end{aligned}$$
(21)
In these equations, \(W^{(i)}\) \(\rightarrow\) weight tensor of the \(i^{th}\) convolutional layer, \(X^{(i)}\) \(\rightarrow\) output feature map after the \(i^{th}\) layer, \(b^{(i)}\) \(\rightarrow\) bias vector, and \(X_{\text {features}}\) \(\rightarrow\) final feature vector obtained after applying global average pooling, while M and N represent the height and width of the spatial dimensions of the feature map X, respectively.

Feature extraction using convolutional autoencoder

An autoencoder can utilize various types of neurons, but convolutional kernels are particularly effective for 2D data, making them ideal for handling images. Convolutional autoencoders share a structural similarity with CNNs, including convolution filters and pooling layers44, and they likewise retain and exploit the spatial information inherently present in images43. Moreover, the convolutional autoencoder converts 2D data representations into 1D arrays; for example, it transforms the positions of m points, given as pairs of coordinates (p, q) in \(\mathbb {R}^{m \times 2}\), into a 1D format. The input and output nodes have the same dimensionality, allowing a direct comparison between the input and output images43; this comparison is used as the objective function to learn the parameters of the autoencoder. Since the learning process is label-independent, convolutional autoencoders are considered unsupervised methods. Equation 22 expresses the latent encoding of the \(i^{th}\) feature map in the encoder for a single-channel input p:$$\begin{aligned} l_i = \sigma (p * w_i + b_i) \end{aligned}$$
(22)
where \(*\) represents 2D convolution, \(w_i\) \(\rightarrow\) convolutional filters in the encoder, and \(\sigma\) \(\rightarrow\) activation function. The decoder reconstructs the image using:$$\begin{aligned} q = \sigma \left( \sum _{k \in L} l_k * \bar{w}_k + b\right) \end{aligned}$$
(23)
where b \(\rightarrow\) bias per input channel, L \(\rightarrow\) set of latent feature maps, and \(\bar{w}\) \(\rightarrow\) learnable de-convolution filters. The objective function minimized during training is the mean squared error (MSE), defined as:$$\begin{aligned} O(\theta ) = \frac{1}{2m} \sum _{k=1}^m (p_k - q_k)^2 \end{aligned}$$
(24)
where \(\theta\) \(\rightarrow\) parameters of the autoencoder, \(p_k\) \(\rightarrow\) the \(k^{th}\) input image in the dataset, and \(q_k\) \(\rightarrow\) the corresponding reconstructed image produced by the autoencoder.

Image classification

After feature extraction, two approaches are applied for image classification: classification using a dense layer and classification using ML classifiers, as explained below.

Classification using dense layer

In the dense-layer approach, a fully connected neural network performs the final classification of the extracted features. The network architecture consists of multiple dense layers, each followed by an activation function, with dropout layers to prevent overfitting. The constituent operations are described below:

Input Layer: The input to the dense network is a feature vector x extracted from the previous stages. If the feature vector has n features, it is represented as:$$\begin{aligned} x = [x_1, x_2, \ldots , x_n] \end{aligned}$$
(25)
Dense Layers: Each dense layer performs a linear transformation followed by a non-linear activation function. For a given layer l, the transformation can be expressed as:$$\begin{aligned} Z^l = W^l a^{(l-1)} + b^l \end{aligned}$$
(26)
where \(a^{(l-1)}\)\(\rightarrow\) activation from the previous layer, \(W^l\)\(\rightarrow\) weight matrix of layer l, and \(b^l\)\(\rightarrow\) bias vector of layer l. The activation function applied to the linear transformation is typically a Rectified Linear Unit (ReLU), defined as:$$\begin{aligned} a^l = \text {ReLU}(Z^l) = \max (0, Z^l) \end{aligned}$$
(27)
Dropout Layers: Dropout layers are used during training to prevent overfitting by randomly setting a fraction p of the input units to zero. Mathematically, this can be expressed as:$$\begin{aligned} a_{\text {dropout}}^l = a^l \cdot d^l \end{aligned}$$
(28)
where \(d^l\) \(\rightarrow\) binary mask vector whose entries are sampled from a Bernoulli distribution with parameter \(1 - p\), the keep probability.

Output Layer: The final layer is a fully connected (dense) layer with a sigmoid activation function that outputs the probability of the positive class for binary classification:$$\begin{aligned} \hat{y} = \sigma (W^{\text {out}} a^L + b^{\text {out}}) \end{aligned}$$
(29)
where \(W^{\text {out}}\) and \(b^{\text {out}}\) are the weights and biases of the output layer, \(a^L\)\(\rightarrow\) activation from the last hidden layer, and \(\sigma\)\(\rightarrow\) sigmoid function, which is calculated as:$$\begin{aligned} \sigma (z) = \frac{1}{1 + e^{-z}} \end{aligned}$$
(30)
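Taken together, the two feature-extraction branches (InceptionNet V3 and the convolutional-autoencoder encoder) and the dense classification head described above can be sketched as follows. This is a minimal, illustrative Keras implementation, not the exact CystNet configuration: the layer widths, encoder depth, dropout rate, and use of three-channel 224×224 inputs are assumptions, and compilation follows the settings described in the next paragraph.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

def build_cystnet(input_shape=(224, 224, 3)):
    """Sketch of CystNet: InceptionV3 features + convolutional-encoder features
    feeding a dense classification head. Layer sizes are illustrative."""
    inputs = layers.Input(shape=input_shape)

    # Branch 1: InceptionV3 pre-trained on ImageNet, followed by global
    # average pooling (Eqs. 14-21). Grayscale frames are assumed to be
    # replicated to three channels before being fed in.
    backbone = InceptionV3(include_top=False, weights="imagenet",
                           input_shape=input_shape)
    inception_features = layers.GlobalAveragePooling2D()(backbone(inputs))

    # Branch 2: encoder of a convolutional autoencoder (Eq. 22); in the full
    # model the decoder (Eq. 23) is trained with the MSE objective (Eq. 24).
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    encoder_features = layers.GlobalAveragePooling2D()(x)

    # Fused feature vector -> dense head with ReLU, dropout, sigmoid output
    # (Eqs. 25-30). The widths 256 and the 0.5 dropout rate are assumptions.
    features = layers.Concatenate()([inception_features, encoder_features])
    h = layers.Dense(256, activation="relu")(features)
    h = layers.Dropout(0.5)(h)
    output = layers.Dense(1, activation="sigmoid")(h)
    return models.Model(inputs, output)

model = build_cystnet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# 5-fold cross-validation (Eq. 32) would wrap the training loop, e.g. with
# sklearn.model_selection.KFold(n_splits=5) over the training indices.
```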
Model Compilation: During the training process, the model parameters are optimized using the Adam optimizer, which dynamically adjusts the learning rate for efficient convergence. The optimization aims to minimize the loss function, specified as the binary cross-entropy, which quantifies the difference between predicted and actual labels. Additionally, the training incorporates rigorous 5-fold cross-validation to ensure robustness and generalization to unseen data. In 5-fold cross-validation, the dataset is divided into five subsets, and the model is trained on four subsets while the remaining one is held out for validation as depicted in Fig. 6. This process iterates five times, with each subset used once as the validation data. Mathematically, the loss function can be expressed as:$$\begin{aligned} \text {Loss} = -\frac{1}{n} \sum _{i=1}^n \left[ o_i \log (P_i) + (1 – o_i) \log (1 – P_i) \right] \end{aligned}$$
(31)
where n\(\rightarrow\) number of samples, \(o_i\)\(\rightarrow\) true label of the \(i^{th}\) sample, and \(P_i\)\(\rightarrow\) predicted probability of the positive class. Additionally, the k-fold cross-validation process is mathematically represented as:$$\begin{aligned} \text {CV} = \frac{1}{K} \sum _{k=1}^K E_k \end{aligned}$$
(32)
where K \(\rightarrow\) number of folds, and \(E_k\) \(\rightarrow\) performance metric obtained on the \(k^{th}\) fold during validation.

Figure 6: Visual representation of K-fold cross-validation.

Classification using ML algorithms

In addition to the deep learning approach, traditional machine learning classifiers are also employed for the classification of PCOS: Naïve Bayes (NB), k-Nearest Neighbors (K-NN), Random Forest (RF), and Adaptive Boosting (ADB). The features extracted by the CystNet model serve as the input for these classifiers. Each classifier is briefly discussed below, and a usage sketch follows the list:

1.

K-Nearest Neighbor: The K-NN57 classifier is a straightforward, instance-based learning algorithm used for classification tasks. It classifies a new instance by identifying the K closest training instances (neighbors) using a distance metric, commonly the Euclidean distance. The Euclidean distance between two instances \(x_i\) and \(x_j\) with m features is given by: $$\begin{aligned} d(x_i, x_j) = \sqrt{\sum _{l=1}^m (x_{il} – x_{jl})^2} \end{aligned}$$
(33)
The predicted class label \(\hat{y}\) for the new instance is then determined by a majority vote over its K nearest neighbors: $$\begin{aligned} \hat{y} = \text {mode}\{y_i | i = 1, 2, \ldots , K\} \end{aligned}$$
(34)

2.

Naïve Bayes: Naïve Bayes57 is a probabilistic classifier grounded on Bayes’ theorem, operating with an assumption of feature independence. Given a feature vector \(x = (x_1, x_2, \ldots , x_n)\) and a class \(M_k\), the classifier selects the class with the highest posterior probability, computed as: $$\begin{aligned} P(M_k | x) = \frac{P(x | M_k) \cdot P(M_k)}{P(x)} \end{aligned}$$
(35)
This calculation involves the likelihood \(P(x | M_k)\), the prior probability \(P(M_k)\), and the marginal likelihood P(x).

3.

Random Forest: An RF57 classifier is an ensemble learning method used for classification tasks, where it constructs multiple decision trees and outputs the class that is the mode of the classes predicted by the individual trees. Mathematically, for a new instance x, the prediction \(\hat{y}\) is the mode of the predictions from B trees: $$\begin{aligned} \hat{y} = \text {mode}\{h_b(x) | b = 1, 2, \ldots , B\} \end{aligned}$$
(36)
where \(h_b(x)\) is the prediction of the \(b^{th}\) tree. The best split \(s^*\) at each node is found by minimizing the impurity, such as the Gini impurity: $$\begin{aligned} s^* = \arg \min _s \sum _{k \in \{L, R\}} \frac{|D_k|}{|D|} \cdot \text {Gini}(D_k) \end{aligned}$$
(37)
where \(D_L\) and \(D_R\) are the left and right splits of the node, respectively, and D is the set of samples at the node before the split.

4.

Adaptive Boosting: The ADB58 classifier combines multiple weak models to create a strong classifier. It trains the weak models iteratively, giving more weight to misclassified samples at each round. The final strong classifier H(x) is a weighted sum of the individual weak classifiers \(h_t(x)\): $$\begin{aligned} H(x) = \text {sign}\left( \sum _{t=1}^T \alpha _t h_t(x) \right) \end{aligned}$$
(38)
where t indexes the individual weak classifiers, T is their total number, and \(\alpha _t\) is the weight assigned to the \(t^{th}\) weak classifier.
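A minimal sketch of how these four classifiers could be applied to the CystNet feature vectors, assuming scikit-learn; the feature matrix and labels shown are random placeholders, and the hyperparameters (K = 5 neighbours, 100 estimators) are illustrative rather than reported values.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# X: CystNet feature vectors, y: PCOS labels (placeholders here; in practice
# X comes from the feature-extraction stage described above).
X = np.random.rand(200, 256).astype(np.float32)
y = np.random.randint(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classifiers = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),                  # Euclidean distance by default (Eq. 33)
    "Naive Bayes": GaussianNB(),                                   # Gaussian likelihoods (Eq. 35)
    "Random Forest": RandomForestClassifier(n_estimators=100),     # Gini-based splits (Eq. 37)
    "AdaBoost": AdaBoostClassifier(n_estimators=100),              # weighted weak learners (Eq. 38)
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```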
