Intelligent cell images segmentation system: based on SDN and moving transformer

The continuous advancement of medical and computational disciplines has given computer technology an increasingly significant role in addressing complex medical conditions. Nevertheless, the unequal allocation of medical resources remains a major concern, particularly in the area of automated cancer diagnosis in developing nations. Cancer diagnosis has traditionally relied heavily on the manual scrutiny of cytopathology images by qualified medical professionals. The vast quantity of cytopathology images generated for every patient, coupled with the substantial amount of information embedded in each image, makes manual recognition and diagnosis of these images highly demanding. We therefore aim to use computer vision techniques to screen important information for doctors, reducing their workload and enhancing diagnostic efficiency. This has significant implications for the automated diagnosis of cancer in developing countries. The comprehensive configuration of our suggested approach for image segmentation is shown in Fig. 1.

Fig. 1 Framework diagram of the scheme.

The SDN denoising algorithm is applied first to remove noise common in cellular pathology images, such as uneven staining, non-uniform illumination, and polarization noise; this improves both the segmentation result and the visual quality. In developing the SDN-Moving Transformer system, several significant challenges were encountered, particularly in integrating self-supervised learning with segmentation. First, the key to self-supervised learning lies in effectively extracting and utilizing features from unlabeled data, which is particularly challenging in pathological images because of noise and complex cellular structures. Second, designing a model architecture that effectively combines self-supervised learning with segmentation is a major challenge: the model must adequately capture cellular features while maintaining strong segmentation performance. Additionally, the computational complexity of the model is a crucial consideration; reducing time and space overhead while preserving accuracy is key to making the system applicable in real clinical scenarios. Addressing these challenges is therefore essential to the effectiveness and applicability of the SDN-Moving Transformer system. Diverse data augmentation methods are then employed to generate a large volume of training data from the resulting clean image collection. Finally, the UPerMVit image segmentation network processes these datasets.

Image denoising module

Medical imaging is an important technological tool for assisting doctors in diagnosis and treatment. However, certain modalities, such as computed tomography and single-photon emission computed tomography, expose patients to significant amounts of radiation, which can affect patient safety. Low-dose imaging technology can reduce the radiation dose to 0.61–1.50 mSv [51], well below the safety standard for radiological examinations set by the American Association of Physicists in Medicine, which requires a single examination dose below 50 mSv and a cumulative short-term dose below 100 mSv. However, low-dose imaging degrades image quality and often introduces significant noise during acquisition.
Such noise can be caused by inconsistent staining depth, background spatter interference, and uneven staining. These factors not only reduce image quality but also impair segmentation. Obtaining a large number of cancer cell pathology images is challenging, especially images that contain both noisy and clean regions. To address this, we propose a self-supervised denoising algorithm called SDN. Our experiments show that this algorithm significantly reduces the impact of noise on image segmentation and improves the clarity of the denoised images.

Existing methods such as Noise2Void [29], Noise2Self [52], and Noise2Noise [28] have established the relevant theory, and the SDN algorithm proposed in this paper builds on these works. Among them, Noise2Noise denoises images without requiring clean data, resolving the challenge of training on pairs of noisy and pristine images that are difficult to obtain. However, it still has a limitation: it requires multiple images with independent noise of the same scene, which is difficult to satisfy in our problem setting. We therefore optimize this algorithm to make it more versatile; our improved version can train the model and complete denoising with just one noisy medical image.

As shown in Fig. 2, SDN first uses a special subsampler S to sample a noisy image, obtaining \(s_1(a)\) and \(s_2(a)\). \(s_1(a)\) is passed through the denoising network to obtain \(f_\theta(s_1(a))\), from which the loss used to update the network is calculated. However, since the ground-truths of the sampled images \(s_1(a)\) and \(s_2(a)\) are not equal, we also compute a regularization loss from \(s_1(f_\theta(a))\), \(s_2(f_\theta(a))\), \(s_1(a)\), and \(s_2(a)\), and apply it to correct the loss.

Fig. 2 Framework diagram of SDN.

As shown in Fig. 3, the subsampler S first partitions an image into regions; here we consider the case k = 2. In selecting the subsampling parameter k, we tested k = 2, 3, 4 and chose k = 2 to balance detail preservation against noise sensitivity. This choice reflects the complexity of cytopathology images, including uneven illumination, lens defects, and other influences, and ensures that important cellular structures are retained.

Fig. 3 The operating principle of the subsampler.

The image is partitioned into non-overlapping regions; for instance, an 8 × 8 image is divided into sixteen 2 × 2 sub-images. This division is crucial for maintaining the granularity of the denoising process. In each sub-image, two adjacent pixels are randomly selected and marked distinctively; the marking dictates the sampling described next. The marked sub-images are then used to generate two separate subsampled images: one is derived from the pixels with the first marking (e.g., yellow), the other from the pixels with the second marking (e.g., purple). This step preserves the critical details within each small region and contributes to the overall robustness against noise.
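To make the sampling concrete, the sketch below implements the subsampler described above in NumPy. It is a minimal sketch, assuming a single-channel image whose sides are multiples of k; the function name and the random-choice details are illustrative rather than taken from our implementation.

```python
import numpy as np

def neighbor_subsample(a: np.ndarray, k: int = 2, rng=None):
    """Split a noisy image a into two subsampled images s1(a), s2(a).

    The image is tiled into non-overlapping k x k cells; in every cell two
    adjacent pixels are drawn at random, one routed to each output image.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = a.shape
    s1 = np.empty((h // k, w // k), dtype=a.dtype)
    s2 = np.empty_like(s1)
    for i in range(0, h, k):
        for j in range(0, w, k):
            cell = a[i:i + k, j:j + k]
            # pick one pixel, then one of its 4-neighbours inside the cell
            p = (rng.integers(k), rng.integers(k))
            neigh = [(p[0] + di, p[1] + dj)
                     for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= p[0] + di < k and 0 <= p[1] + dj < k]
            q = neigh[rng.integers(len(neigh))]
            s1[i // k, j // k] = cell[p]
            s2[i // k, j // k] = cell[q]
    return s1, s2
```

Pairs \(s_1(a)\), \(s_2(a)\) produced this way share content but carry different noise realizations, which is the property the training objective below relies on.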
The primary objective in this section is to remove the requirement for noise-free images in training the denoising algorithm. The SDN approach draws inspiration from the Noise2Noise methodology. When two noisy observations b and c of an image a are available, the algorithm minimizes the loss in (1) over the parameters θ of the denoising network. In (1), \(\arg\min_\theta\) denotes minimization over θ, and \(\|\cdot\|_2^2\) is the squared \(l_2\) distance. The formula adjusts θ to minimize the squared error between the prediction and the target, yielding solutions equivalent to those obtained by supervised training with the \(l_2\) loss.$$\mathop{\arg\min}\limits_{\theta}\; \mathbb{E}_{a,b,c}\left\| f_\theta(b) - c \right\|_2^2$$
(1)
Furthermore, a constraint on the image gap is introduced in (2). It measures whether the conditional expectations of the random variables b and c, given a, coincide.$$\varepsilon := \mathbb{E}_{c\mid a}(c) - \mathbb{E}_{b\mid a}(b) \ne 0$$
(2)
From the constraint we have introduced, we obtain the following equation, as shown in (3):$$\mathbb{E}_{a,b}\left\| f_\theta(b) - a \right\|_2^2 = \mathbb{E}_{a,b,c}\left\| f_\theta(b) - c \right\|_2^2 - \sigma_c^2 + 2\varepsilon\,\mathbb{E}_{a,b}\left( f_\theta(b) - a \right)$$
(3)
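For completeness, (3) can be verified by expanding the squared error. The sketch below is written under two assumptions: the noise in b is zero-mean, so \(\mathbb{E}_{b\mid a}(b)=a\) and hence \(\varepsilon=\mathbb{E}_{c\mid a}(c)-a\); and \(\sigma_c^2\) denotes the second moment \(\mathbb{E}_{c\mid a}\left\|c-a\right\|_2^2\).

$$\begin{aligned} \mathbb{E}_{a,b,c}\left\| f_\theta(b)-c \right\|_2^2 &= \mathbb{E}_{a,b,c}\left\| \left(f_\theta(b)-a\right)-\left(c-a\right) \right\|_2^2 \\ &= \mathbb{E}_{a,b}\left\| f_\theta(b)-a \right\|_2^2 - 2\varepsilon\,\mathbb{E}_{a,b}\left( f_\theta(b)-a \right) + \sigma_c^2 \end{aligned}$$

The cross term collapses because c is independent of b given a; rearranging yields (3).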
In this equation, \(\sigma_c^2\) represents the variance of c. As ε approaches zero, training on the two noisy images b and c alone achieves results similar to training on all three images a, b, and c, which removes the need for clean images. However, even the requirement for multiple independent noisy observations of the same image is difficult to satisfy: medical images are hard to acquire, and motion and lighting may vary between exposures. We therefore replace the independent noisy pair with two samples drawn from a single noisy image a by the sampler S = (\(s_1\), \(s_2\)), using \(s_1(a)\) and \(s_2(a)\) in place of a and b, respectively. Equation (1) then transforms into (4), which adjusts θ to minimize the squared error between the prediction \(f_\theta(s_1(a))\) and the target \(s_2(a)\).$$\mathop{\arg\min}\limits_{\theta}\; \mathbb{E}_{a}\left\| f_\theta(s_1(a)) - s_2(a) \right\|_2^2$$
(4)
However, since the ground-truths of \(s_1(a)\) and \(s_2(a)\) are not identical, we must consider the case where the image gap ε does not approach zero. To simplify the resulting equation, we require a to satisfy the system in (5), where \(f_\theta^{*}\) is the ideal denoising network trained on the three images a, b, and c with the \(l_2\) loss, and g denotes a transformation of the input.$$\left\{ \begin{array}{l} a = f_\theta^{*}(b) \\ g_1(a) = f_\theta^{*}(g_1(b)) \\ g_2(a) = f_\theta^{*}(g_2(b)) \end{array} \right.$$
(5)
We then obtain the identity shown in (6):$$\begin{aligned} & \mathbb{E}_{b\mid a}\left\{ f_\theta^{*}(s_1(b)) - s_2(b) - \left( s_1(f_\theta^{*}(b)) - s_2(f_\theta^{*}(b)) \right) \right\} \\ & \quad = s_1(a) - \mathbb{E}_{b\mid a}\left\{ s_2(b) \right\} - \left( s_1(a) - s_2(a) \right) \\ & \quad = s_2(a) - \mathbb{E}_{b\mid a}\left\{ s_2(b) \right\} = 0 \end{aligned}$$
(6)
Equation (6) covers two cases. If the ground-truths of \(s_1(a)\) and \(s_2(a)\) are identical, the constraint in (2) is satisfied, which leads to (3). If they are not identical, we use (6) to correct the ground-truths. We therefore consider the constrained optimization problem in (7) and (8): Eq. (7) merges the terms derived from the two samples into the regularized loss, while Eq. (8) decomposes the joint expectation into its conditional form.$$\mathbb{E}_{b\mid a}\left\{ f_\theta(s_1(b)) - s_2(b) - s_1(f_\theta(b)) + s_2(f_\theta(b)) \right\}$$
(7)
$${{\mathbb{E}}_{a,b}}={{\mathbb{E}}_a}{{\mathbb{E}}_{b\mid a}}$$
(8)
The optimization problem is given in (9):$$\min_{\theta}\; \mathbb{E}_{b\mid a}\left\| f_\theta(s_1(b)) - s_2(b) \right\|_2^2$$
(9)
Furthermore, we can transform this into the regularized optimization problem in (10). Equation (10) gives the overall objective obtained by merging the base loss with the regularized loss; it measures the disparity between the model's output and the two subsampled versions of the input during denoising, guiding the model to reduce noise by minimizing this quantity.$$L = loss + regularization\ loss$$
(10)
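The sketch below assembles (4), (7), and (10) into a single PyTorch training loss. It is a minimal sketch, assuming `f` is the denoising network \(f_\theta\), that `s1` and `s2` apply the subsampler with the same per-cell pixel choices in both calls, and that `lam` is a regularization weight not specified here.

```python
import torch

def sdn_loss(f, a, s1, s2, lam=1.0):
    """Total SDN loss L = loss + regularization loss, Eq. (10).

    f      : denoising network f_theta
    a      : batch of noisy images
    s1, s2 : the two branches of the subsampler S; within one call they
             must reuse the same random per-cell pixel choices
    lam    : regularization weight (assumed hyperparameter)
    """
    out = f(s1(a))
    base = torch.mean((out - s2(a)) ** 2)      # reconstruction term, Eq. (4)
    with torch.no_grad():
        denoised = f(a)                        # full-resolution denoised image
    # regularization term built from Eq. (7), correcting for the fact
    # that s1(a) and s2(a) do not share an identical ground-truth
    gap = out - s2(a) - (s1(denoised) - s2(denoised))
    reg = torch.mean(gap ** 2)
    return base + lam * reg
```

In the terms of Fig. 2, the first pass produces \(f_\theta(s_1(a))\) and the second pass produces \(s_1(f_\theta(a))\) and \(s_2(f_\theta(a))\) for the correction.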
Having described the model and procedure of our denoising algorithm in detail, we summarize the training framework in Algorithm 1.

The SDN algorithm exhibits significant advantages over traditional denoising methods when processing noisy cellular pathology images. Operating within a self-supervised learning framework, SDN eliminates the need for clean reference images by leveraging the inherent structure of the noisy images themselves for training. It employs an N2N-style training paradigm, utilizing different versions of the same image to preserve the fundamental signal and image features. Moreover, SDN is designed for high-resolution images, integrating multi-scale feature extraction to capture details across varying noise granularities, and its adaptive learning mechanism dynamically adjusts the denoising strategy to retain critical cellular structures. Compared with conventional methods, SDN demonstrates superior denoising performance and feature preservation, as evidenced by higher PSNR, thereby providing clearer cellular pathology images that support more accurate clinical diagnoses.

Algorithm 1. Training framework of SDN.

Image enhancement module

In this study we address the scarcity of large, well-annotated datasets of cancer cell pathology images, particularly those labeled by authoritative physicians, in developing countries. To mitigate this limitation, we implement a robust data augmentation module designed to increase dataset size and variation. This compensates for the lack of high-quality annotations and for the difficulties in exchanging and digitizing medical imaging data in these regions [53]. The preprocessing and augmentation stages enhance the quantity and diversity of the training set, reducing the risk of overfitting caused by insufficient data and ensuring the model's generalizability.

The operations used are image rotation, translation, cropping, flipping, brightness enhancement, and saturation enhancement. Each image in the dataset undergoes a sequence of these augmentations, significantly increasing the effective size and diversity of the data, which is especially valuable when large, high-quality annotated datasets are scarce. Together, these six operations partially mitigate the limitations imposed by the lack of well-annotated, large-scale cancer cell datasets: they improve the accuracy of cellular structure segmentation and make the model more robust to diverse and challenging inputs. A sketch of such a pipeline is given below.
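As a concrete illustration, the six operations can be composed with torchvision. The parameter values below are illustrative, not the ones used in our experiments, and for segmentation the geometric transforms would have to be applied jointly to image and mask.

```python
from torchvision import transforms

# One stochastic pipeline covering the six operations named above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=90),                     # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.RandomResizedCrop(size=512, scale=(0.8, 1.0)),  # cropping
    transforms.RandomHorizontalFlip(p=0.5),                    # flipping
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, saturation=0.3),    # brightness / saturation
])
```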
Cell pathology image segmentation model

We employ the UPerMVit model for cell pathology image segmentation, which integrates UperNet [45] with our Moving Transformer networks. The architecture draws on UperNet [45] and the Feature Pyramid Network (FPN) [54], and incorporates the ParC structure, enabling our model to surpass UperNet in performance while demanding less computation and memory. This is particularly beneficial for medical-assisted treatment in developing countries, where reduced computational cost can translate into better segmentation of high-definition pathology images. The configuration of the model is illustrated in the image segmentation section of Fig. 1. Below, we describe the structure and function of each component.

For the segmentation of cell pathology images, the model's modular architecture decomposes the task into several independent, interchangeable modules, each focusing on a specific function such as feature extraction, context modeling, or decision-layer integration. This design enables the model to handle different types of medical images more effectively and allows adjustment and optimization for specific application requirements. UPerMVit also incorporates multi-scale feature fusion and self-attention mechanisms to capture rich contextual information, enhancing segmentation accuracy and robustness.

Compared with other transformer-based models used in medical image analysis, UPerMVit performs better on complex pathological images. Traditional models often rely on fixed feature extraction, making it difficult to adapt to the diverse characteristics of medical images; UPerMVit's modular design permits dynamic adjustment of the processing strategy based on the characteristics of the input data. This yields higher segmentation precision and better adaptability in tasks such as cell segmentation and lesion detection, especially on challenging image data.

We propose an attention mechanism termed Moving Attention, which diverges from the conventional Self Attention mechanism and allocates attention more flexibly and efficiently. In Moving Attention, each pixel attends only to its k nearest pixels, whereas Self Attention spans all pixels. This localized mechanism narrows the attention scope, reducing computational overhead and improving processing speed while preserving segmentation accuracy. The model also employs a series of overlapping 3 × 3 convolutions for downsampling, which introduces an inductive bias and improves segmentation performance; this design is inspired by effective CNN architectures. The combination of these techniques allows UPerMVit to deliver superior segmentation results with lower computational and space requirements, making it particularly suitable for medical applications in resource-constrained environments. Figure 4 illustrates the difference between Moving Attention and traditional Self Attention; in this illustration we set k = 2.
This means each pixel attends to the 9 neighboring pixels within a distance of 2 grids, whereas in traditional attention each pixel attends to all 25 pixels. For instance, the attention of a single pixel (blue grid) under Self Attention covers every other pixel of a 5 × 5 grid, while under Moving Attention it covers only the nearby pixels (pink grid) within a 3 × 3 region.

Fig. 4 The structures of Self Attention and MOV Attention.

The formal description of Moving Attention is given in (11). Let \(X \in \mathbb{R}^{n \times m}\) be the input, where m is the dimension of the feature vector; (Q, K, V) are the linear projections of X; k denotes the size of the attention region of each pixel; P(i, j) represents the position bias; and \(\varvec{A}_x^k\) collects the dot products between the projection of the x-th input and the projections of its attention region:$$\varvec{A}_{x}^{k}=\left[ \begin{array}{c} Q_x K_{\rho_1(x)}^{T}+P_{(x,\rho_1(x))} \\ Q_x K_{\rho_2(x)}^{T}+P_{(x,\rho_2(x))} \\ \vdots \\ Q_x K_{\rho_{k-1}(x)}^{T}+P_{(x,\rho_{k-1}(x))} \\ Q_x K_{\rho_k(x)}^{T}+P_{(x,\rho_k(x))} \end{array} \right]$$
(11)
The symbol \(\rho_k(x)\) denotes the k nearest pixels to the input x. Furthermore, we define \(\varvec{V}_x^k\) as the matrix composed of the value projections of the k nearest pixels of the x-th input, as shown in (12):$$\varvec{V}_{x}^{k}=\left[ \begin{array}{cccc} V_{\rho_1(x)}^{T} & V_{\rho_2(x)}^{T} & \ldots & V_{\rho_k(x)}^{T} \end{array} \right]^{T}$$
(12)
Finally, we obtain the x-th Moving Attention output over the k nearby pixels in (13), where α is the scaling coefficient:$$\mathrm{MOV}_k(x)=\operatorname{softmax}\left( \frac{\varvec{A}_x^k}{\sqrt{\alpha}} \right)\varvec{V}_x^k$$
(13)
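A minimal sketch of (11)-(13) in PyTorch, assuming the feature map has been flattened to n = h·w pixels and that the nearest-pixel indices \(\rho(x)\) and the position bias P have been precomputed. The default scaling α = d is our assumption, by analogy with standard scaled dot-product attention.

```python
import torch
import torch.nn.functional as F

def moving_attention(Q, K, V, nbr_idx, P, alpha=None):
    """MOV_k per Eqs. (11)-(13) for a flattened feature map.

    Q, K, V : (n, d) linear projections of the input X
    nbr_idx : (n, k) long tensor; indices of the k nearest pixels rho(x)
    P       : (n, k) position bias P(x, rho(x))
    alpha   : scaling coefficient (defaults to d -- an assumption)
    """
    n, d = Q.shape
    alpha = d if alpha is None else alpha
    K_nbr = K[nbr_idx]                        # (n, k, d) keys of neighbours
    V_nbr = V[nbr_idx]                        # (n, k, d) values of neighbours
    # A_x^k: dot products of Q_x with its k neighbouring keys, plus bias
    A = torch.einsum('nd,nkd->nk', Q, K_nbr) + P
    w = F.softmax(A / alpha ** 0.5, dim=-1)   # attention over k neighbours only
    return torch.einsum('nk,nkd->nd', w, V_nbr)
```

Because the softmax runs over k entries per pixel rather than all n, both the attention map and the dot-product cost grow linearly in the number of pixels.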
The cross-entropy loss function [55] is commonly used for classification when the number of samples per class is roughly equal. In imbalanced problems, however, it may bias the model toward the more frequent classes and neglect the less frequent ones. In cell pathology image segmentation, cells occupy only a small portion of the image, so we use the Lovász-Softmax loss, which is better suited to imbalanced classification.

The Lovász-Softmax loss [56] is designed for imbalanced classification problems and helps the model handle the imbalance between classes. Unlike the traditional cross-entropy loss, it measures the difference between predictions and true labels by comparing their relative positions rather than directly comparing probability distributions. Specifically, it is a continuously differentiable convex function that imposes smaller penalties on correctly classified samples and larger penalties on misclassified ones, improving the accuracy of classifying minority samples.

In cell pathology image segmentation, the Lovász-Softmax loss therefore handles class imbalance better and outperforms traditional cross-entropy at distinguishing positive from negative samples, improving the model's classification performance. Because it accounts for the relative position of the predicted value as well as the class label, it captures cell boundary information more accurately, which improves segmentation. It is also robust to noisy or irregular segmentation outcomes: by ordering and aggregating prediction scores, it is less sensitive to noise and outliers, which suits the large volumes of intricate, high-noise cytopathology images in our task. Finally, compared with cross-entropy, its gradients are smoother, so training is more stable, overfitting is reduced, and generalization improves; this is especially beneficial for cellular image segmentation, which involves small sample sizes and datasets containing noise and uncertainty. The main steps are as follows.

Define the true labels y and the prediction f(x) for a binary classification problem in (14):$$y \in \{-1, 1\}^n, \quad f(x) \in \mathbb{R}^n$$
(14)
where n designates the number of samples. Next, we define the per-sample loss \(\ell_L\) in (15):$$\ell_L(y_i, f_i)=\max\left(0,\, 1-y_i f_i\right)$$
(15)
where \(y_i\) denotes the label of the i-th sample and \(f_i\) the predicted output for that sample. When \(y_i = 1\), \(\ell_L\) decreases as \(f_i\) increases, and vice versa; when \(y_i = -1\), \(\ell_L\) decreases as \(f_i\) decreases, and vice versa. Finally, we define the mean loss \(\ell_{Lovasz}\) over all samples in (16):$$\ell_{Lovasz}(y, f(x))=\frac{1}{n}\sum\limits_{i=1}^{n} \ell_L(y_i, f_i)$$
(16)
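As written, (15)-(16) reduce to a margin (hinge-style) loss averaged over samples; the sketch below implements exactly these two formulas. Note that the full Lovász-Softmax of [56] additionally sorts the per-pixel errors and weights them via the Lovász extension of the Jaccard index, which we omit here.

```python
import torch

def per_sample_loss(y: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
    """l_L(y_i, f_i) = max(0, 1 - y_i * f_i), Eq. (15); y in {-1, +1}^n."""
    return torch.clamp(1.0 - y * f, min=0.0)

def mean_loss(y: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
    """Average of the per-sample losses over all n samples, Eq. (16)."""
    return per_sample_loss(y, f).mean()
```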
Having expounded the MOV mechanism and the MT architecture, we now analyze the computational complexity and memory usage of MOV, comparing it against the SA and WSA attention mechanisms. Suppose the input feature map has shape \(h \times w \times d\), where d denotes the number of channels and h and w denote the height and width of the feature map, respectively.

First, for all three attention mechanisms, the linear projections require \(3hwd^2\) floating-point operations. SA computes an \(hw \times hw\) attention map, giving a total complexity of \(3hwd^2 + 2h^2w^2d\). WSA divides Q, K, V into \(\frac{h}{k} \times \frac{w}{k}\) windows, each of shape \(k \times k\), computed independently, so its complexity is \(3hwd^2 + 2hwdk^2\). The complexity of MOV follows from the attention-region size given previously and is likewise \(3hwd^2 + 2hwdk^2\).

Turning to memory usage, the attention weights of SA form an \(hw \times hw\) map, requiring \(3d^2 + h^2w^2\) memory. Likewise, the attention weights of WSA form \(\frac{h}{k} \times \frac{w}{k}\) maps of size \(k^2 \times k^2\), requiring \(3d^2 + hwk^2\). Lastly, the memory usage of MOV follows from the size of V given earlier and is \(3d^2 + hwk^2\).
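The counts above can be tabulated with a small helper. This is a sketch under the stated formulas, where we take the MOV memory to follow the same \(3d^2 + hwk^2\) pattern as WSA.

```python
def attention_costs(h: int, w: int, d: int, k: int):
    """FLOPs and memory of SA, WSA and MOV attention, per the formulas above."""
    proj = 3 * h * w * d ** 2                    # shared QKV projection FLOPs
    flops = {
        "SA":  proj + 2 * h ** 2 * w ** 2 * d,   # full hw x hw attention
        "WSA": proj + 2 * h * w * d * k ** 2,    # windowed attention
        "MOV": proj + 2 * h * w * d * k ** 2,    # k-neighbourhood attention
    }
    memory = {
        "SA":  3 * d ** 2 + h ** 2 * w ** 2,
        "WSA": 3 * d ** 2 + h * w * k ** 2,
        "MOV": 3 * d ** 2 + h * w * k ** 2,
    }
    return flops, memory

# Example: for a 128 x 128 x 64 feature map with k = 7, the SA attention map
# alone needs (128*128)^2 entries, while WSA/MOV need only 128*128*49.
```

The attention maps of WSA and MOV thus grow linearly in the number of pixels, whereas SA grows quadratically, which is what makes MOV practical for high-resolution pathology images.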
