Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Technological advances are transforming medical care. Gene sequencing, medical imaging, and artificial intelligence have significantly enhanced medical diagnosis, enabling earlier detection and more precise treatment. However, in many developing countries, doctors face diagnostic challenges due to limited medical resources and heavy cancer-diagnosis workloads. Early detection and precise diagnosis are crucial for cancer-related diseases, yet these workloads make efficient diagnosis difficult. We therefore propose using computer vision techniques to screen for the most important information, reducing physicians' workload and improving diagnostic efficiency. This has significant implications for automated cancer diagnosis in developing countries. Figure 1 illustrates the architecture of our proposed image segmentation model.

Fig. 1 Overall program architecture diagram.

First, we optimize the cytopathology images using various image enhancement techniques. Then, we apply the RSAA model for semantic segmentation to generate a predicted class for each pixel; this improves segmentation accuracy and addresses the category imbalance problem. Next, following the reliability screening strategy for unlabeled data in the RU3S model, we give priority to reliable unlabeled data alongside labeled data during training, which reduces the interference of complex images early in training while saving image annotation resources. Finally, we use CRF post-processing to further improve segmentation accuracy and consistency.

Image enhancement module

Advanced image processing and machine learning techniques can extract useful features from pathology images, helping doctors make more accurate diagnoses and saving valuable medical resources. In developing countries where resources are scarce, automated image processing can effectively reduce doctors' workload, allowing them to focus on more complex diagnostic tasks. However, many developing countries lack the technical and financial support required to collect and process large amounts of medical image data37,38,39. Moreover, a shortage of medical facilities and specialized personnel may make large-scale pathology image acquisition and annotation infeasible in these countries.

In this study, we employed a series of preprocessing and data augmentation techniques to expand the dataset and enhance the model's robustness. Figure 2 shows the flowchart of our data augmentation process. First, by rotating, translating, cropping, and flipping the images, we simulated the angular, size, and orientation differences that pathology sections can exhibit under the microscope. These operations increase both the amount and the diversity of the data, improving the model's generalization ability. We also adjusted the brightness and saturation of the images to simulate variations in pathology sections caused by different lighting and staining conditions. This introduces controlled color variation and further expands the dataset, enabling the model to adapt to the color and contrast variations that may be encountered in real-world applications.
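As an illustration, the geometric and photometric augmentations described above can be composed with torchvision. This is a minimal sketch: the specific parameter ranges and crop size are assumptions for illustration, not the exact values used in this work.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline; all parameter ranges are assumptions.
augment = T.Compose([
    T.RandomRotation(degrees=90),                      # angular differences under the microscope
    T.RandomResizedCrop(size=256, scale=(0.8, 1.0)),   # size/translation differences via cropping
    T.RandomHorizontalFlip(p=0.5),                     # orientation changes
    T.RandomVerticalFlip(p=0.5),
    T.ColorJitter(brightness=0.3, saturation=0.3),     # lighting and staining variation
    T.ToTensor(),
])

# Usage: augmented = augment(pil_image), applied independently each epoch,
# so one slide yields many distinct training views.
```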
Furthermore, we applied a Generative Adversarial Network (GAN) to the augmented images to enhance their quality. In this process, the generator receives low-quality images and generates high-quality ones, while the discriminator must judge whether a generated image is a faithful high-quality counterpart of the original. Through this adversarial game with the discriminator, the generator learns how to enhance image quality. These preprocessing and data augmentation strategies address the challenge of acquiring accurately labeled pathology images and avoid overfitting due to insufficient data, improving the performance and robustness of our model on high-resolution pathology images.

Fig. 2 Data enhancement flowchart.
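The adversarial enhancement step described above can be sketched as follows. The `gen` and `disc` networks, the paired low/high-quality batch, and the L1 reconstruction term (a common choice for keeping the enhanced image faithful to its target) are assumptions of this sketch, not a definitive account of the training procedure.

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, opt_g, opt_d, low, high):
    """One schematic adversarial training step for image enhancement."""
    # Discriminator: distinguish real high-quality images from generated ones.
    fake = gen(low).detach()
    d_real, d_fake = disc(high), disc(fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the target image.
    fake = gen(low)
    g_adv = disc(fake)
    g_loss = (F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv))
              + F.l1_loss(fake, high))   # reconstruction term (assumed)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```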
Image enhancement allows numerous new images to be generated from a limited set of originals, thereby expanding the dataset. This not only improves the model's generalization ability by increasing its training data but also simulates the variations found in real-world environments, enabling the model to better handle diverse situations in practice. This approach effectively addresses the difficulty of obtaining large, high-quality annotated datasets in developing countries under resource and capacity constraints, and it makes deep learning techniques usable for the semantic segmentation of cancer cytopathology images.

Image segmentation module

A significant challenge in working with pathology images, particularly in image segmentation, is category imbalance. Because background pixels greatly outnumber target pixels (such as cancer cells), conventional models tend to be biased toward predicting the background category during training, causing the target categories to be ignored. This is especially pronounced when the target categories are rare compared to the background, and it presents a major obstacle to improving the accuracy and robustness of segmentation models.

To tackle this issue, we introduce a novel deep neural network architecture named RSAA. The aim is a model that can more precisely identify and distinguish the categories in an image, particularly when the target categories are less frequent than the background. The RSAA model combines ResUNet with SE, ASPP, and Attention modules to form a strong network structure. This design enables the model to extract and exploit multi-scale and global information from the image while adaptively focusing on its important parts, which effectively mitigates category imbalance in cytopathology images and improves segmentation accuracy.

Fig. 3

(1) Encoder

Figure 3 shows that in our approach, the input pathology image undergoes feature extraction through two \(3\times 3\) convolutional layers, and the current output is added to the original input to form a residual connection. Compared with using a single \(4\times 4\) convolution to obtain a quarter-sized output, this introduces a helpful inductive bias and improves segmentation performance. To improve segmentation performance while keeping the model efficient, we propose the SEBlock module.

The BN technique is commonly used to accelerate neural network training40, optimize weights, and provide a slight regularization effect. Equation (1) is the basic equation of the BN method.$$\begin{aligned} \hat{x}_{pi} = \frac{x_{pi} - \mu _{Bp}}{\sqrt{\sigma _{Bp}^2 + \epsilon }} \end{aligned}$$
(1)
where p represents the pathology image and i the corresponding pixel index within the image. \(\hat{x}_{pi}\) is the normalized output, \(x_{pi}\) is the input, \(\mu _{Bp}\) is the mean of the batch, \(\sigma _{Bp}^2\) is the variance of the batch, and \(\epsilon\) is a small constant that prevents division by zero.

The normalized output is then passed through the ReLU activation function, which introduces non-linearity into the network while remaining cheap to compute. The ReLU activation function is given in Eq. (2).$$\begin{aligned} f(x_{pi}) = {\left\{ \begin{array}{ll} x_{pi} & \text {if } x_{pi} \ge 0, \\ 0 & \text {if } x_{pi} < 0. \end{array}\right. } \end{aligned}$$
(2)
where \(x_{pi}\) is the output of the BN layer and \(f(x_{pi})\) is the result after the ReLU function.

In cytopathology images, osteosarcoma cells typically occupy a small portion of the image, while most of the area consists of normal cells and background. This class imbalance can bias the model toward predicting the majority class during training, thereby ignoring the minority class of osteosarcoma cells. To address this, SEBlock incorporates the SE (Squeeze-and-Excitation) module before the \(3\times 3\) convolutions described above. The SE module learns weights for the feature channels so that useful features are extracted more effectively. By focusing on the most informative feature channels, this strategy improves the model's ability to recognize specific categories such as osteosarcoma cells, and it outperforms the same architecture without the SE module. After three SEBlock downsampling stages, the model obtains feature maps of size \(\frac{h}{2} \times \frac{w}{2}\), \(\frac{h}{4} \times \frac{w}{4}\), and \(\frac{h}{8} \times \frac{w}{8}\), respectively.

The Squeeze-and-Excitation (SE) module adaptively recalibrates the channel weights of convolutional features41,42. Its purpose is to enhance the model's representation by explicitly modeling the inter-channel dependencies of the input features. The SE module consists of three steps: Squeeze, Excitation, and Scale. It first compresses the spatial dimensions of the input features using a Global Average Pooling (GAP) operation, as shown in Eq. (3).$$\begin{aligned} z_{cp} = \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} x_{cpij} \end{aligned}$$
(3)
where \(x_{cpij}\) is the value of the input feature map at channel c and spatial location \((i,j)\), and H and W are the height and width of the input feature map. The global average pooling operation captures global spatial information, and the compressed feature \(z_{cp}\) represents the global response of channel c. The SE module then nonlinearly transforms the compressed features through two fully connected layers to obtain the weight of each channel, a step called Excitation, as shown in Eq. (4).$$\begin{aligned} s_{cp} = \sigma (g(z_{cp},W)) = \sigma (W_2 \delta (W_1 z_{cp})) \end{aligned}$$
(4)
where \(s_{cp}\) is the computed channel weight, \(\sigma (\cdot )\) is the sigmoid activation function, \(\delta (\cdot )\) is the ReLU activation function, and \(W_1\) and \(W_2\) are the weights of the two fully connected layers. The sigmoid activation function restricts the weights to the range [0,1], which allows each channel to be tuned independently. Finally, the SE module recalibrates the features through a channel-wise multiplication, as shown in Eq. (5):$$\begin{aligned} y_{cpij} = s_{cp} \times x_{cpij} \end{aligned}$$
(5)
where \(y_{cpij}\) is the value of the output feature map at channel c and spatial position \((i,j)\). This step implements the adaptive recalibration of the input feature map.

The innovation of the SE module is that it enhances the model's representation by learning the dependencies between channels and adaptively adjusting the weight of each channel. This mechanism can be inserted into any convolutional network to improve its performance with little additional computational and parameter overhead.
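To make the encoder block concrete, the following sketch implements Eqs. (3)–(5) together with the residual SEBlock described above. The channel counts, the reduction ratio of 16, and the \(1\times 1\) skip projection are illustrative assumptions rather than the exact RSAA configuration.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation per Eqs. (3)-(5): GAP -> FC/ReLU -> FC/sigmoid -> scale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # Squeeze (GAP), Eq. (3)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # Excitation, Eq. (4)
        return x * s.view(b, c, 1, 1)                          # Scale, Eq. (5)

class SEBlock(nn.Module):
    """Residual encoder block with SE placed before the two 3x3 convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.se = SEModule(in_ch)
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # residual join across channel counts (assumed)

    def forward(self, x):
        x = self.se(x)
        return torch.relu(self.body(x) + self.skip(x))
```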
(2) Bridge

At the junction of the encoder and decoder, we introduce the Atrous Spatial Pyramid Pooling (ASPP) module as a connecting bridge. The ASPP module uses atrous (dilated) convolutions at different sampling rates to integrate multi-scale contextual information effectively while preserving the spatial resolution of the feature maps. This design lets the model understand the complex structures and patterns in the image more deeply, improving recognition of both fine details and global context. The mechanism is highly effective for targets of varying scales and shapes, which further enhances the network's semantic segmentation performance, especially for minority categories.

The ASPP module comprises three parallel atrous convolution blocks, each consisting of an atrous convolution layer, a ReLU activation function, and a batch normalization (BN) layer. Each block performs the atrous convolution of Eq. (6).$$\begin{aligned} Y_{ijk}^{(t, O)} = \text {BN}\left( \text {R}\left( C_{r_t}\left( X_{ijk}\right) \right) \right) \end{aligned}$$
(6)
where \(Y_{ijk}^{(t, O)}\) is an element of the output feature map of the t-th atrous convolution block, \(C_{r_t}\) denotes an atrous convolution with dilation rate \(r_t\), R(\(\cdot\)) is the ReLU activation function, BN(\(\cdot\)) is the batch normalization operation, and \(X_{ijk}\) is the input feature map.

By varying the dilation rates, contextual information can be captured at different scales while the feature map size stays constant. The output feature maps of all atrous convolution blocks are then concatenated along the channel dimension; Eq. (7) gives the merged result of the three blocks.$$\begin{aligned} Y_{ijk} = M\left( Y_{ijk}^{(1, O)}, Y_{ijk}^{(2, O)}, Y_{ijk}^{(3, O)}\right) \end{aligned}$$
(7)
where \(Y_{ijk}^{(1, O)}, Y_{ijk}^{(2, O)}, Y_{ijk}^{(3, O)}\) are the output feature maps of the three atrous convolution blocks, and M(\(\cdot\)) is the merge operation along the channel dimension. Finally, a \(1\times 1\) convolutional layer is applied to the concatenated feature maps to fuse the features obtained at different dilation rates, as in Eq. (8), mixing contextual information across scales and improving the model's recognition accuracy.$$\begin{aligned} Z_{ijk} = C_1\left( Y_{ijk}\right) \end{aligned}$$
(8)
where \(Z_{ijk}\) is an element of the output feature map, \(C_1\) denotes a \(1\times 1\) convolution operation, and \(Y_{ijk}\) is the concatenated feature map. The advantage of ASPP is that it efficiently incorporates multi-scale contextual information through atrous convolutions with varying sampling rates, thereby improving the model's accuracy on both fine-grained details and broad context.
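A minimal sketch of this bridge, following Eqs. (6)–(8), is given below. The dilation rates (1, 6, 12) and channel counts are assumptions for illustration; the block order (convolution, then ReLU, then BN) mirrors Eq. (6).

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Three parallel atrous convolutions fused by a 1x1 convolution, Eqs. (6)-(8)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):   # dilation rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r),  # C_{r_t} in Eq. (6)
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(out_ch),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(len(rates) * out_ch, out_ch, 1)        # C_1 in Eq. (8)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)  # M(.) in Eq. (7)
        return self.fuse(y)
```

Because padding equals the dilation rate for each \(3\times 3\) branch, every branch keeps the input resolution, so the concatenation in Eq. (7) is well defined.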
(3) Decoder

An Attention module is integrated into the decoder section of the ResUNet model. Its primary function is to focus on the regions of the image that contribute most to the classification task, thereby improving the model's accuracy in recognizing target objects, especially minority categories43,44. With the Attention module, our model can effectively concentrate on critical regions of the feature map, improving overall segmentation performance.

The Attention module comprises two parallel convolutional blocks, one for the encoder output and one for the decoder output. Their outputs are summed, and the attention weights are generated by a \(1\times 1\) convolutional layer. The two parallel convolutional blocks operate as in Eq. (9).$$\begin{aligned} Y_{ijk}^{O} = P\left( R\left( C_3\left( \text {BN}\left( X_{\text {enc}}^{O}\right) \right) \right) \right) + R\left( C_3\left( \text {BN}\left( X_{\text {dec}}^{O}\right) \right) \right) \end{aligned}$$
(9)
where \(Y_{ijk}^{O}\) is the sum of the output feature maps, \(X_{\text {enc}}^{O}\) and \(X_{\text {dec}}^{O}\) are the input feature maps from the encoder and decoder, \(P(\cdot )\), \(R(\cdot )\), and \(BN(\cdot )\) denote the pooling, ReLU activation, and batch normalization operations, and \(C_3\) denotes a \(3\times 3\) convolution. This step integrates information from the encoder and decoder to capture richer context. Next, we apply a \(1\times 1\) convolutional layer to the summed feature maps to generate the attention weights, as shown in Eq. (10).$$\begin{aligned} A_{ijk}^{O} = R\left( C_1\left( \text {BN}\left( Y_{ijk}^{O}\right) \right) \right) \end{aligned}$$
(10)
where \(A_{ijk}^{O}\) is the attention weight matrix. This step produces an attention weight for each location, indicating where the model should focus. Finally, we weight the decoder's input feature map with the attention weights; Eq. (11) yields the final output.$$\begin{aligned} Z_{ijk}^{O} = A_{ijk}^{O} \cdot X_{\text {dec}}^{O} \end{aligned}$$
(11)
where \(Z_{ijk}^{O}\) denotes the corresponding feature values of the output feature map. This operation adaptively reweights the decoder's input feature map, strengthening the model's attention to key regions.
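The decoder attention can be sketched as follows, mirroring Eqs. (9)–(11). The intermediate channel width, the use of max pooling for P, and the single-channel weight map are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Decoder attention per Eqs. (9)-(11): two parallel branches, summed,
    reduced to weights by a 1x1 convolution, then applied to decoder features."""
    def __init__(self, enc_ch, dec_ch, mid_ch):
        super().__init__()
        self.enc_branch = nn.Sequential(                       # P(R(C3(BN(x_enc)))) in Eq. (9)
            nn.BatchNorm2d(enc_ch), nn.Conv2d(enc_ch, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.dec_branch = nn.Sequential(                       # R(C3(BN(x_dec))) in Eq. (9)
            nn.BatchNorm2d(dec_ch), nn.Conv2d(dec_ch, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_weights = nn.Sequential(                       # R(C1(BN(Y))) in Eq. (10)
            nn.BatchNorm2d(mid_ch), nn.Conv2d(mid_ch, 1, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x_enc, x_dec):
        y = self.enc_branch(x_enc) + self.dec_branch(x_dec)    # Eq. (9); shapes assumed to align
        a = self.to_weights(y)                                 # attention weights A
        return a * x_dec                                       # Eq. (11)
```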
Although the RSAA segmentation model is derived from ResUNet, each added module brings a distinct advantage, and their integration further improves performance. The SE module emphasizes crucial features and suppresses irrelevant ones by reweighting the feature map with learned channel weights, enhancing the model's ability to capture critical information. The ASPP module captures information at multiple scales through multi-scale atrous convolution, which matters for segmentation tasks in which object and scene scales vary greatly. The Attention module helps the model focus on the important parts of the image. Together, these three modules give the model greater performance and robustness on complex image segmentation tasks.

Semi-supervised learning image segmentation module

Fully supervised semantic segmentation models assign semantic labels at the pixel level by learning from a large number of densely labeled images. Their main limitation is the difficulty and cost of acquiring high-quality labeled datasets: the labeling process is time- and labor-intensive and often requires specialized knowledge and skills, which may be infeasible in settings such as medical image segmentation. We therefore adopt a semi-supervised semantic segmentation model, which requires only a portion of the data to be labeled while the rest can remain unlabeled. This significantly reduces the workload and complexity of data preparation while maintaining performance and reducing the dependence on labeled data, making the model more practical and scalable.

However, the primary challenge in semi-supervised learning is how to use unlabeled images effectively. Classical self-training frameworks attempt to use all unlabeled images simultaneously, which is problematic: unlabeled images vary in difficulty, so the reliability of the generated pseudo-labels varies, leading to serious confirmation bias45,46. Incorrect pseudo-labels accumulate across iterations, causing the model to overfit to wrong supervisory signals and significantly degrading performance.

To address these issues, we propose a novel semi-supervised learning model called RU3S, which adds a confidence screening step for unlabeled images on top of the traditional semi-supervised framework. Figure 4 shows the framework. By screening the confidence of unlabeled images, we prioritize high-confidence unlabeled samples for training. This strategy makes more efficient use of unlabeled samples while avoiding the performance degradation caused by noise from complex samples. It has been demonstrated that the similarity between the pseudo-masks generated during training can be used to assess the stability of unlabeled samples, and samples with good stability help optimize training. Consequently, mIoU is employed as a metric for gauging the reliability of unlabeled samples and the stability of training. Using mIoU as a screening criterion has the further advantage of dynamically reflecting the model's performance changes throughout training: as the model improves, changes in the mIoU values can inform the sample selection strategy, steering the model toward increasingly complex data and further improving overall performance.

Fig. 4

Our RU3S training process follows three stages. In the initial stage, we train the model on all unlabeled samples; after training, we apply our confidence screening strategy to the unlabeled images and classify them as high or low confidence. In the second stage, we train the model further on the labeled samples together with the filtered high-confidence unlabeled samples, aiming to improve generalization using reliable pseudo-labels. In the final stage, we train on all labeled and unlabeled samples, including both high- and low-confidence ones, so that the model can handle a wider range of sample types.

Within this model, we design a confidence policy for unlabeled samples. Given n unlabeled samples, the model is evaluated at K different time points during training. For each unlabeled sample \(u_i\) (i = 1, 2, ..., n), the pseudo-label generated at time point \(t_j\) (j = 1, 2, ..., K) is saved and denoted \(P_{ij}\). At the end of training, the final pseudo-label of the i-th unlabeled sample is \(P_{iK}\). Subsequently, the agreement \(\text{mIoU}_{ij}\) between the pseudo-label of sample \(u_i\) at time point \(t_j\) and its final pseudo-label is computed as in Eq. (12), where m(\(\cdot\)) denotes the mIoU computed between two masks.$$\begin{aligned} \text {mIoU}_{ij} = m\left( P_{ij}, P_{iK}\right) \end{aligned}$$
(12)
These per-time-point scores are then aggregated over all time points to obtain the total score \(\text{mIoU}_i\) for the i-th unlabeled sample, as shown in Eq. (13).$$\begin{aligned} \text {mIoU}_i = \sum _{j=1}^{K} \text {mIoU}_{ij} \end{aligned}$$
(13)
Finally, the model ranks the unlabeled samples by their \(\text{mIoU}_i\) values. The top r samples with the highest \(\text{mIoU}_i\) are treated as high-confidence samples, and the rest as low-confidence samples. By prioritizing the high-confidence samples, the model avoids the accumulation of label errors caused by difficult samples and extracts useful information more efficiently. This strategy improves both the model's learning efficiency and its ability to generalize to unlabeled data. For the low-confidence samples, pseudo-labels are regenerated and used in the subsequent retraining process.

Low-confidence samples can still aid training. After improving the model on high-confidence samples, we relabel the low-confidence samples with the trained model. This increases the amount of training data and learning opportunities, and it lets the model handle difficult samples, improving robustness and generalization to more challenging real-world situations. Including low-confidence unlabeled samples also reduces labeling cost and the need for additional labeled samples. The training framework is shown in Algorithm 1.

Algorithm 1 RU3S semi-supervised modeling algorithm
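The screening step at the core of Algorithm 1 can be sketched as follows; the checkpoint schedule, the binary-class default, and the cutoff r are assumptions of this sketch.

```python
import numpy as np

def miou(a, b, num_classes):
    """Mean IoU between two label maps; m(.) in Eq. (12)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(a == c, b == c).sum()
        union = np.logical_or(a == c, b == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def split_by_confidence(pseudo_labels, r, num_classes=2):
    """Reliability screening per Eqs. (12)-(13).

    pseudo_labels[i][j] is the pseudo-mask P_ij of unlabeled sample i at
    checkpoint t_j; the last entry is the final mask P_iK.
    Returns the indices of high- and low-confidence samples."""
    scores = []
    for masks in pseudo_labels:
        final = masks[-1]                          # P_iK
        scores.append(sum(miou(m, final, num_classes) for m in masks[:-1]))  # Eq. (13)
    order = np.argsort(scores)[::-1]               # stable samples (high mIoU_i) first
    return order[:r], order[r:]
```

The high-confidence indices then feed the second training stage; the low-confidence samples are relabeled by the improved model before the final stage.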
Having described the core concepts and main processes of the RU3S model, we now compare its computational complexity and memory usage with traditional semi-supervised models. Assume the input feature maps have the form \(h\times w\times d\), where d is the number of channels and h and w are the height and width of the feature maps; let M be the number of labeled samples and N the number of unlabeled samples.

The time complexity of RU3S is dominated by training the model, generating pseudo-labels, computing the mIoU scores, and sorting; overall it is \(O((M+N)hwd+Nhw+N\log N)\). Although this may exceed the time complexity of some semi-supervised algorithms (e.g., some self-training algorithms may require only \(O((M+N)hwd)\)), the advantage of RU3S lies in its efficient strategy for selecting and exploiting unlabeled samples. By training on high-confidence samples first, RU3S may outperform some traditional semi-supervised algorithms in actual running time.

Regarding memory, RU3S must store the labeled and unlabeled sample sets, the parameters of the teacher and student models, and the generated pseudo-labels, giving a space complexity of \(O((M+N)hwd+P)\), where P is the number of model parameters. A typical self-training semi-supervised algorithm has the same space complexity, since it also stores all labeled and unlabeled samples and the model parameters. However, RU3S divides the unlabeled samples into high- and low-confidence sets, allowing memory resources to be used more efficiently, which may be more effective for large-scale datasets.

Post-processing module

In image semantic segmentation, existing models do not fully exploit inter-pixel relationships, which is particularly noticeable in medical image analysis. For instance, in MRI scans of osteosarcoma, the grayscale values of the tissue edema area, muscle area, and tumor area are similar at the boundaries. This similarity can make it difficult for physicians, especially less experienced ones, to judge the disease condition, and such indistinct boundaries also reduce the efficiency of image segmentation.

At the same time, the likelihood of a pixel belonging to a particular class is closely linked to the information in its surrounding pixels47,48, yet neural networks do not consider this directly during segmentation. We therefore use a Conditional Random Field (CRF) to post-process the output of our segmentation model. The CRF introduces inter-pixel dependencies, addressing the lack of spatial relationships between pixels; this yields more consistent labeling of neighboring pixels and fewer misclassified pixels, particularly at the boundaries between differently structured regions, improving overall segmentation accuracy while maintaining continuity at the boundaries.

In the final output stage of the model, a fully connected CRF module is introduced to enhance segmentation performance. It is particularly useful for details and edge regions, since the CRF considers the spatial relationships between pixels and thus produces more accurate pixel-level classification.

The CRF module comprises two primary stages: defining the energy function and minimizing it. First, we define an energy function, Eq. (14), describing the consistency between pixel labels.$$\begin{aligned} E(X) = \sum _i \theta _i (x_i) + \sum _{i,j} \theta _{ij} (x_i,x_j) \end{aligned}$$
(14)
where E(X) is the energy function, X is the labeling of the pixels, \(x_i\) is the label of pixel i, \(\theta _i (x_i)\) is the unary potential of pixel i, and \(\theta _{ij} (x_i,x_j)\) is the pairwise potential between pixels i and j. We then obtain the optimal pixel labeling by minimizing this energy function, as shown in Eq. (15).$$\begin{aligned} X^* = \arg \min _X E(X) \end{aligned}$$
(15)
where \(X^*\) is the optimal pixel labeling. The goal of this step is to find the labeling that minimizes the energy function, i.e., maximizes the consistency between pixel labels. By introducing the CRF module, the model takes the spatial relationships between pixels into account, yielding more accurate pixel-level classification.
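To illustrate Eqs. (14)–(15), the following simplified sketch minimizes the energy on a 4-connected grid with a Potts pairwise term using iterated conditional modes (ICM). This is a toy version under stated assumptions: the fully connected CRF used in our pipeline conditions its pairwise potentials on pixel appearance and is typically minimized with mean-field inference rather than ICM.

```python
import numpy as np

def crf_refine(probs, beta=1.0, iters=5):
    """Toy CRF refinement: minimize Eq. (14) by ICM on a 4-connected grid.

    probs: (H, W, C) softmax output of the segmentation network.
    The unary potential is -log p; the pairwise potential is a Potts
    penalty beta * [x_i != x_j] (both choices are assumptions)."""
    h, w, c = probs.shape
    unary = -np.log(probs + 1e-8)            # theta_i(x_i)
    labels = probs.argmax(axis=2)            # initial labeling
    for _ in range(iters):
        for i in range(h):
            for j in range(w):
                cost = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        cost += beta * (np.arange(c) != labels[ni, nj])  # theta_ij, Potts
                labels[i, j] = cost.argmin()  # greedy local step toward Eq. (15)
    return labels
```

Each local update can only lower the energy E(X), so the labeling converges to a configuration in which neighboring pixels are encouraged to agree, mimicking the boundary-smoothing effect described above.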
