A multibranch and multiscale neural network based on semantic perception for multimodal medical image fusion

This section describes the proposed method for fusing multi-modal medical images, named DUSMIF. First, an overview of the general framework is presented. The structure of the fusion network and the details of each module are then elaborated, and the segmentation network is briefly introduced. Finally, the loss functions are explained.

Overall framework

The proposed DUSMIF method consists of a multi-scale, multi-branch image fusion network and an image segmentation network. The image fusion network extracts and fuses features from the multi-modal images to reconstruct the fused image; it is the core of the method. The image segmentation network is used to obtain semantic information from the fused image. As described in “Task-driven and semantic-aware image processing methods” section, transferring semantic information to related image processing tasks through loss functions has been thoroughly validated. Similarly, our method employs segmentation losses derived from the image segmentation network to feed semantic information back to the fusion network. The overall framework and the flow of information are illustrated in Fig. 1.

Figure 1. The overall framework of the proposed method.

During training, CT and MRI source images are passed through the image fusion network to generate fused images, from which the fusion loss is computed. The fused images are then passed through the image segmentation network to generate pixel-level segmentation labels and obtain the segmentation loss. Because this loss is derived from the image segmentation task and carries semantic information about the images, it is also referred to as the semantic loss. The fusion loss and the segmentation loss together form the loss function of the image fusion network and guide the update of its parameters, whereas the loss function of the image segmentation network comprises only the aforementioned segmentation loss. It is important to note that the final output of the method is the fused image; the segmentation labels associated with the fused image are intermediate products generated during training to obtain the semantic loss.

Because the segmentation loss serves as part of the loss function of both networks, training the fusion and segmentation networks jointly can lead to ambiguous training objectives: a reduction in segmentation loss cannot be attributed unambiguously to improvements in the fusion network or in the segmentation network. As a result, training may over-specialise the image segmentation network and degrade the performance of the image fusion network. Employing a pre-trained image segmentation network can mitigate this problem, but a fixed segmentation network may gradually become less adapted to variations in its input images, introducing biases into the semantic results. To address this concern, an alternating training scheme is adopted: within a given epoch, only one of the two networks (either the image fusion network or the image segmentation network) is trained, and after a predetermined number of epochs the training focus switches to the other. This approach helps better balance the two networks and their respective training objectives.
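For concreteness, the alternating scheme can be sketched as follows. This is a minimal PyTorch-style sketch under our own naming assumptions; fusion_net, seg_net, fusion_loss_fn, semantic_loss_fn, loader and switch_every are hypothetical placeholders rather than the actual implementation.

```python
import torch

def alternating_train(fusion_net, seg_net, loader,
                      fusion_loss_fn, semantic_loss_fn,
                      epochs=100, switch_every=1, lr=1e-4):
    """Alternating scheme: within one epoch only one network is updated,
    and the training focus switches every `switch_every` epochs (sketch)."""
    opt_fuse = torch.optim.Adam(fusion_net.parameters(), lr=lr)
    opt_seg = torch.optim.Adam(seg_net.parameters(), lr=lr)
    for epoch in range(epochs):
        train_fusion = (epoch // switch_every) % 2 == 0
        for ct, mri in loader:
            fused = fusion_net(ct, mri)
            if train_fusion:
                # fusion loss + semantic (segmentation) loss update the fusion network
                loss = fusion_loss_fn(fused, ct, mri) + semantic_loss_fn(seg_net(fused))
                opt_fuse.zero_grad(); loss.backward(); opt_fuse.step()
            else:
                # only the segmentation network is updated; the fused image is detached
                loss = semantic_loss_fn(seg_net(fused.detach()))
                opt_seg.zero_grad(); loss.backward(); opt_seg.step()
```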
Multi-branch and multi-scale image fusion network

Figure 2 presents the structure of the multi-branch and multi-scale fusion network for CT and MRI medical images. The fusion network comprises a multi-branch, multi-scale feature extraction network and a corresponding multi-branch, multi-scale feature fusion and reconstruction network.

Figure 2. The structure of the fusion network.

As explained in “Multi-branch and multi-scale feature extraction” section, employing multiple branches for feature extraction yields a more comprehensive and enriched feature representation. However, more branches imply higher computational demands, while the gains diminish as the number of branches grows. Balancing resource consumption against feature extraction capability, the proposed method uses three feature extraction branches, designed with appropriate lightweight considerations. Each branch processes the input images and outputs features at five scales. Through the multi-branch and multi-scale feature extraction network, the input multi-modal medical images therefore yield six sets of image features across five scales, three sets per modality.

Corresponding to the extracted feature scales, the multi-branch and multi-scale fusion reconstruction network comprises five fusion reconstruction blocks. In each block, multiple attention mechanisms are employed for inter-modality feature fusion, inter-branch feature fusion, and fusion with features from the previous scale. These attention mechanisms allow the rich extracted features to be integrated selectively, emphasising their most important aspects. The convolutional block at the end of the fusion reconstruction network consists of three convolutional layers (each with a 3 \(\times \) 3 kernel), LeakyReLU activation functions and batch normalization; it constructs the mapping from the fused features to the final fused image.

Feature extraction branch

The feature extraction branch consists of two convolutional layers and four feature extraction pairs, each pair comprising a downsampling feature extraction block and a feature extraction block. The structure of the branch is illustrated in Fig. 3.

Figure 3. The structure of the feature extraction branch.

Conventional convolutional operations often downsample features, which causes information loss, and the shallower the layer, the more significant the loss caused by downsampling. To mitigate this, specific convolutional layers within the feature extraction branch are configured, through their convolutional parameters, to expand the feature dimensions without altering the feature size. The two convolutional layers following the image input do not perform downsampling, so as to retain information from shallow features, and the output of the second convolutional layer is exported as the feature of the first scale. The features of the remaining four scales are produced by the feature extraction pairs.
Moreover, the second block in each pair (the plain feature extraction block) does not downsample, which delays consecutive losses of feature information. The acquisition of scale-specific features through the feature extraction pairs can be expressed as:$$\begin{aligned} F_{i}=FEB_{i}(FEB\_down_{i}(F_{i-1})),\quad i=1,2,3,4, \end{aligned}$$
(1)
where \(F_{0}\) is the first-scale feature produced by the initial convolutional layers, \(F_{i}\) is the feature at the \((i+1)\)-th scale, and \(FEB_{i}\) and \(FEB\_down_{i}\) denote the \(i\)-th feature extraction block and the \(i\)-th downsampling feature extraction block, respectively. Sobel convolution introduces rich gradient information into the feature extraction process by computing the gradient magnitude of the features, and is therefore used in the feature extraction block. Sobel convolution can be written in discrete form as:$$\begin{aligned} \begin{aligned} S(x,y)=&\left| {{\Delta }_{x}}f \right| +\left| {{\Delta }_{y}}f \right| \\ =&|(f(x-1,y-1)+2f(x-1,y)+f(x-1,y+1)) \\&-(f(x+1,y-1)+2f(x+1,y)+f(x+1,y+1))| \\&+|(f(x-1,y-1)+2f(x,y-1)+f(x+1,y-1)) \\&-(f(x-1,y+1)+2f(x,y+1)+f(x+1,y+1))|, \\ \end{aligned} \end{aligned}$$
(2)
where S is the feature after Sobel convolution, x and y are the pixel coordinates, f is the feature before Sobel convolution, and \(\Delta _{x}\) and \(\Delta _{y}\) are the Sobel gradient operators in the horizontal and vertical directions, respectively.

Figure 4. The structure of the feature extraction block.

The structure of the feature extraction block is depicted in Fig. 4. The input features are divided into two processing branches: a standard convolution branch and a Sobel convolution branch. The Sobel branch contains two stages of Sobel convolution followed by regular convolutions, and a skip connection is included to stabilise gradient flow during training. The feature from the standard convolution branch and the two features from the Sobel convolution branch are summed element-wise, and the result is passed through a LeakyReLU activation to produce the output of the block.
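As an illustration, the Sobel convolution of Eq. (2) can be realised with fixed 3 \(\times \) 3 kernels applied channel-wise. The following is a minimal sketch assuming PyTorch, not the exact layer used in DUSMIF:

```python
import torch
import torch.nn.functional as F

def sobel_gradient(feat: torch.Tensor) -> torch.Tensor:
    """Return |Delta_x f| + |Delta_y f| of Eq. (2) for features of shape (N, C, H, W)."""
    c = feat.shape[1]
    kx = torch.tensor([[1., 0., -1.],
                       [2., 0., -2.],
                       [1., 0., -1.]], device=feat.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    # depthwise (grouped) convolution: one fixed Sobel kernel per channel
    gx = F.conv2d(feat, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(feat, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx.abs() + gy.abs()
```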
Attention-based feature fusion and reconstruction

The feature fusion and reconstruction network primarily consists of attention-based feature fusion and reconstruction blocks, whose structure is illustrated in Fig. 5. Each block takes as input the image features at the current scale and the features processed by the previous fusion and reconstruction block. For the current scale, the input features are divided into three processing branches, one per branch feature, so that the features of the two image modalities can be fused. Within each branch, the image features of the two modalities are fused by a cross-modal attention fusion block. Each branch then undergoes a 3 \(\times \) 3 convolution with normalization and activation, and the features of the different branches are fused by cross-branch attention fusion blocks. The fused branch features are again convolved, normalized and activated, and then processed in the cross-scale fusion reconstruction step. Throughout this process, the convolution-normalization-activation operations between modules introduce additional mappings and nonlinearity, enhancing feature extraction between modules. The features from the larger scale are first upsampled to match the size of the current-scale features and then passed through three consecutive 3 \(\times \) 3 convolutions with normalization and activation; reducing the feature dimensions gradually in this way mitigates the information loss that a single-step dimension reduction would cause. The processed larger-scale and current-scale features are fused in the cross-scale attention fusion module, and the fused result undergoes another convolution-normalization-activation step, yielding the output of the fusion and reconstruction block. Note that the first fusion and reconstruction block receives no features from a previous block, so it performs no such processing and no cross-scale feature fusion. The above processing can be expressed as:$$\begin{aligned} \begin{aligned} F_{so,i}=&C(CSAM_{i}(C(CBAM_{i}(C(CMAM_{i,1}(F_{si,i})),\\&C(CMAM_{i,2}(F_{si,i})),C(CMAM_{i,3}(F_{si,i})))),\\&C(Up(F_{so,i-1})))). \end{aligned} \end{aligned}$$
(3)
where \(F_{si,i}\) and \(F_{so,i}\) denote the input features at the \(i\)-th scale and the output of the \(i\)-th fusion reconstruction block, \(CMAM_{i,j}\) (j = 1, 2, 3), \(CBAM_{i}\) and \(CSAM_{i}\) denote the cross-modal, cross-branch and cross-scale attention fusion modules, \(C(\cdot )\) denotes a convolution-normalization-activation operation, and \(Up(\cdot )\) denotes upsampling.

Figure 5. The structure of the feature fusion reconstruction module.

Figure 6 shows the structure of the three attention fusion modules designed and used in the fusion reconstruction block. From left to right, they are the cross-modal attention fusion module, the cross-branch attention fusion module and the cross-scale attention fusion module.

Figure 6. The structure of the attention modules.

The purpose of the cross-modal attention fusion block is to enable mutual feature interaction between modalities through cross-modal attention, which is reflected in the crossed connections within the block. The left and right branches in the structure correspond to the processing of the features of the two modalities. Normalizing the input features in each branch mitigates the influence of extreme values, improving the convergence speed and generalization ability of the network. The attention construction exploits the sparsity of the ReLU activation function and the property of the Sigmoid activation function of mapping values to the range 0 to 1, which reflects the level of attention. The image features of both modalities are weighted by the attention derived from their own modality as well as the attention derived from the other modality; the symbol \(\otimes \) in the diagram denotes attention multiplication. The features modulated by the attention of both modalities are concatenated and then reduced in dimension by a convolution, refining the features and highlighting the attended portions from both modalities. Taking one branch as an example, the application of cross-modal attention can be expressed as:$$\begin{aligned} \begin{aligned}{}&F_{mc,a}=F_{m,a}\cdot (1+M_{m,a}+M_{cm,a}), \\&M_{m,a}=Sigmoid_{a}(Conv_{a}(Relu_{a}(Conv_{a}(F_{m,a})))), \\&M_{cm,a}=Sigmoid_{b}(Conv_{b}(Relu_{b}(Conv_{b}(F_{m,b})))), \end{aligned} \end{aligned}$$
(4)
where \(F_{m,a}\) represents the input features of the current modality, \(F_{mc,a}\) represents the current-modality features after applying cross-modal attention, \(M_{m,a}\) represents the attention mask of the current modality, \(M_{cm,a}\) represents the cross-modal attention mask, \(Conv_{a}\) and \(Conv_{b}\) represent convolutional layers in the current-modality branch and the other-modality branch respectively, \(Relu_{a}\) and \(Relu_{b}\) represent the ReLU activation functions of the two branches respectively, and \(Sigmoid_{a}\) and \(Sigmoid_{b}\) represent the Sigmoid activation functions of the two branches respectively.
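To make the mask construction concrete, the following is a minimal sketch of the cross-modal attention of Eq. (4) written in PyTorch under our own assumptions; the channel sizes, kernel sizes and the final concatenation-and-reduction step are illustrative choices rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Sketch of Eq. (4): each modality is re-weighted by its own attention mask
    and by the mask computed from the other modality, then the two results are
    concatenated and reduced by a 1x1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        def mask_branch():
            # Conv -> ReLU -> Conv -> Sigmoid yields a mask with values in (0, 1)
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.Sigmoid(),
            )
        self.mask_a = mask_branch()
        self.mask_b = mask_branch()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
        m_a, m_b = self.mask_a(f_a), self.mask_b(f_b)
        f_a_att = f_a * (1 + m_a + m_b)   # Eq. (4) for modality a
        f_b_att = f_b * (1 + m_b + m_a)   # the symmetric expression for modality b
        return self.reduce(torch.cat([f_a_att, f_b_att], dim=1))
```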
The cross-branch attention fusion block is based on the convolutional block attention module, with modifications to how the attention is used and to the subsequent processing. This fusion block employs both max pooling and average pooling to aggregate feature information across the channels of the branches, using a shared set of attention weights to reduce the parameter overhead. After the attention is applied to the features, a convolutional layer reduces the dimensionality to highlight the essential features, and the subsequent normalization stabilizes the gradients. The application of cross-branch attention can be expressed as:$$\begin{aligned} \begin{aligned}{}&F_{bc}=BN(Conv(F_{b}\cdot (1+M_{max}+M_{avg}))),\\ {}&M_{max}=Sigmoid(Conv(Relu(Conv(Maxpool(F_{b}))))),\\ {}&M_{avg}=Sigmoid(Conv(Relu(Conv(Avgpool(F_{b}))))),\end{aligned} \end{aligned}$$
(5)
where \(F_{b}\) represents the input features, \(F_{bc}\) represents the features after cross-branch attention processing, \(M_{max}\) and \(M_{avg}\) represent the attention masks derived from max pooling and average pooling respectively, \(Maxpool(\cdot )\) and \(Avgpool(\cdot )\) represent the max pooling and average pooling operations respectively, and \(BN(\cdot )\) represents batch normalization.

The cross-scale attention fusion block is built upon the coordinate attention module, with modifications to the changes in feature dimensions and to the post-processing of the attended features. Horizontal and vertical average pooling operations are introduced to incorporate spatial context, aiding the precise localization of the position-related changes caused by scale variations between the current-scale and larger-scale features, so that the relevant attention can be adjusted accordingly. The attended features are likewise reduced in dimension by convolution and normalized. The application of cross-scale attention can be expressed as:$$\begin{aligned} \begin{aligned}{}&F_{sc}=BN(Conv(F_{s}\cdot (1+M_{x}+M_{y}))),\\ {}&M_{x}=Sigmoid_{x}(Conv_{x}(hswish(BN(Conv(XAvg(F_{s})))))),\\ {}&M_{y}=Sigmoid_{y}(Conv_{y}(hswish(BN(Conv(YAvg(F_{s})))))),\end{aligned} \end{aligned}$$
(6)
where \(F_{s}\) represents the input features, \(F_{sc}\) represents the features after cross-scale attention processing, \(M_{x}\) and \(M_{y}\) represent the attention masks for the x and y directions respectively, \(XAvg(\cdot )\) and \(YAvg(\cdot )\) represent average pooling along the x and y directions respectively, \(hswish(\cdot )\) represents the hswish activation function, \(Conv_{x}\) and \(Conv_{y}\) represent the independent convolutional layers of the x-direction and y-direction branches respectively, and \(Sigmoid_{x}\) and \(Sigmoid_{y}\) represent the independent Sigmoid activation functions of the x-direction and y-direction branches respectively.
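For illustration, a compact sketch of the directional pooling and mask construction of Eq. (6), loosely following the coordinate attention design; PyTorch is assumed, and the reduction ratio, the shared stem and the final convolution-normalization step are our own illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    """Sketch of Eq. (6): directional average pooling, a shared Conv-BN-hardswish
    stem, per-direction masks, and re-weighting by (1 + Mx + My)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.stem = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid),
            nn.Hardswish(inplace=True),
        )
        self.conv_x = nn.Conv2d(mid, channels, 1)
        self.conv_y = nn.Conv2d(mid, channels, 1)
        self.post = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.BatchNorm2d(channels))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c, h, w = f.shape
        # average pooling along each spatial direction: (N, C, H, 1) and (N, C, 1, W)
        x_avg = F.adaptive_avg_pool2d(f, (h, 1))
        y_avg = F.adaptive_avg_pool2d(f, (1, w))
        m_x = torch.sigmoid(self.conv_x(self.stem(x_avg)))   # broadcasts over width
        m_y = torch.sigmoid(self.conv_y(self.stem(y_avg)))   # broadcasts over height
        return self.post(f * (1 + m_x + m_y))
```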
Unsupervised image segmentation networks

During training of the multimodal medical image fusion network, semantic information is obtained from a high-level vision task to enhance the fusion capability of the network and produce higher-quality fused images. Among high-level vision tasks such as image classification, object detection and image segmentation, the latter is best suited to medical imaging scenarios, so image segmentation is chosen as the source of semantic information. Most existing image segmentation methods are supervised and therefore place high demands on the training dataset, which must provide accurate segmentation labels. At the same time, image fusion datasets require registered images of two modalities. Currently, no medical image dataset fulfils both the image fusion and the image segmentation requirements. This issue can be addressed either by annotating medical image fusion datasets with segmentation labels or by employing an unsupervised image segmentation network. The former incurs substantial cost and the resulting method might lack generality, whereas unsupervised image processing methods are an active research trend, with some unsupervised approaches achieving results comparable to supervised ones.

Among unsupervised image segmentation methods, we adopt the approach known as pixel-wise feature clustering using invariance and equivariance55. This technique employs geometric consistency as an inductive bias to learn the photometric invariance and geometric equivariance of images, enabling image segmentation without hyperparameter tuning or task-specific preprocessing, and it demonstrates robust segmentation results. In this approach, the current feature representations are alternately used for unsupervised clustering, and the resulting cluster labels are used as pseudo-labels to train the feature representations iteratively until stable outcomes are reached. Moreover, the method offers a theoretically grounded interpretation of its use of photometric invariance and geometric equivariance. Photometric invariance means that pixels at the same position should receive identical labels under minor fluctuations in image light intensity, preserving their original partition; in the segmentation method this is expressed by requiring that the feature representations obtained after subjecting each pixel to two distinct photometric transformations remain consistent. Following this idea, pixels transformed under the two photometric alterations should be clustered close to their respective cluster centers and also close to the cluster centers of the other photometric transformation. Geometric equivariance implies that when an image undergoes a geometric transformation such as scaling, the resulting cluster segmentation labels should undergo the corresponding transformation; the method embodies this principle by applying photometric transformations to both branches and a geometric transformation to only one of them, thus creating two distinct geometric forms.

In conclusion, considering both the available data and the current landscape of relevant research, the proposed image fusion method leverages an unsupervised image segmentation network to acquire image semantic information and thereby assist the training of the image fusion network, using the pixel-wise feature clustering technique based on invariance and equivariance to segment the fused images.

Design of loss function

The loss function of the fusion network is defined from three perspectives, and accordingly consists of three components. In terms of image content, medical fused images should incorporate the high-intensity information and weak texture details present in the source images, such as calcifications or hemorrhagic lesions in brain CT images and soft-tissue details in MRI images. In terms of image accuracy, the generated fused image should closely resemble both source modalities without favouring one over the other. In terms of image semantics, medical fused images should encapsulate ample semantic information, reflecting the fusion network's understanding of image content. The overall loss function of the fusion network is:$$\begin{aligned} L={{L}_{content}}+{{L}_{similarity}}+{{L}_{semantic}}, \end{aligned}$$
(7)
where L represents the total loss function, and \(L_{content}\), \(L_{similarity}\) and \(L_{semantic}\) represent the content, similarity and semantic loss functions, respectively.

Content loss function

The content loss function measures the content information contained in the fused image. Guided by the content loss, the fusion network iteratively refines the salient features and textural details of the fused image, balancing the overall luminosity of the image against the preservation of fine detail. The content loss comprises two components, an intensity loss and a texture loss, and is formulated as:$$\begin{aligned} {{L}_{content}}={{L}_{int}}+\alpha {{L}_{texture}}, \end{aligned}$$
(8)
where \(L_{int}\) represents the intensity loss, \(L_{texture}\) represents the texture loss, and \(\alpha \) is a balance constant set to 5. The intensity loss measures the pixel-wise energy difference between the fused image and the source images and constrains the overall intensity of the fused image. It is defined as:$$\begin{aligned} {{L}_{int}}=\frac{1}{HW}\left\| {{I}_{f}}-\max ({{I}_{ct}},{{I}_{mri}}) \right\| , \end{aligned}$$
(9)
where H and W are the height and width of the image, \(I_{f}\), \(I_{ct}\) and \(I_{mri}\) denote the fused image and the source CT and MRI images respectively, \(||\cdot ||\) denotes the L1 norm, and \(\max (\cdot )\) denotes the element-wise maximum. The texture loss measures the pixel-wise gradient difference between the fused image and the source images, reflecting their difference in texture. It is defined as:$$\begin{aligned} {{L}_{texture}}=\frac{1}{HW}\left\| \left| \nabla {{I}_{f}} \right| -\max (\left| \nabla {{I}_{ct}} \right| ,\left| \nabla {{I}_{mri}} \right| ) \right\| , \end{aligned}$$
(10)
where \(\nabla \) is the Sobel gradient operator, which computes the gradient between pixels, and \(|\cdot |\) is the absolute value operation. By combining the intensity loss and the texture loss, the content loss balances global features against local details, so that the fused image tends to contain rich image content.
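As a concrete illustration, the content loss of Eqs. (8)–(10) can be sketched as follows, assuming PyTorch and single-channel image tensors; the Sobel helper mirrors Eq. (2) and \(\alpha =5\) as stated above. This is an illustrative sketch, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def sobel_abs_grad(img: torch.Tensor) -> torch.Tensor:
    """|gradient| via fixed Sobel kernels for (N, 1, H, W) images."""
    kx = torch.tensor([[[[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]]]],
                      device=img.device)
    ky = kx.transpose(2, 3)
    return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()

def content_loss(fused, ct, mri, alpha: float = 5.0) -> torch.Tensor:
    # Eq. (9): L1 distance to the element-wise maximum of the source intensities
    l_int = F.l1_loss(fused, torch.max(ct, mri))
    # Eq. (10): L1 distance to the element-wise maximum of the source gradients
    l_tex = F.l1_loss(sobel_abs_grad(fused),
                      torch.max(sobel_abs_grad(ct), sobel_abs_grad(mri)))
    return l_int + alpha * l_tex     # Eq. (8)
```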
Similarity loss function

The similarity loss function is based on the structural similarity index: the mean difference in structural similarity between the fused image and the two source modalities is taken as the loss. The structural similarity index measures the resemblance between two images by treating them as signals and computing statistics such as mean, variance and covariance; a comprehensive similarity value is obtained by comparing these statistics between the reference image and the evaluated image, which effectively characterizes the level of distortion present in an image. The structural similarity index is calculated as:$$\begin{aligned} \begin{aligned} SSIM(x,y)&=\frac{(2{{\mu }_{x}}{{\mu }_{y}}+{{c}_{1}})(2{{\sigma }_{xy}}+{{c}_{2}})}{(\mu _{x}^{2}+\mu _{y}^{2}+{{c}_{1}})(\sigma _{x}^{2}+\sigma _{y}^{2}+{{c}_{2}})}, \\ {{c}_{1}}&={{(0.01L)}^{2}}, \\ {{c}_{2}}&={{(0.03L)}^{2}}, \\ \end{aligned} \end{aligned}$$
(11)
where \({{\mu }_{x}}\) and \({{\mu }_{y}}\) represent the mean values of images x and y respectively, \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\) represent their variances, \({{\sigma }_{xy}}\) represents the covariance between images x and y, \({{c}_{1}}\) and \({{c}_{2}}\) are stability constants, and L is the dynamic range of the pixel values. Structural similarity ranges from \(-1\) to 1 and equals 1 when the two images are identical. The similarity loss is expressed as:$$\begin{aligned} {{L}_{similarity}}=1-\frac{SSIM({{I}_{f}},{{I}_{ct}})+SSIM({{I}_{f}},{{I}_{mri}})}{2}, \end{aligned}$$
(12)
where \({{I}_{f}}\) represents the fused image, \({{I}_{ct}}\) represents the original CT image, and \({{I}_{mri}}\) represents the original MRI image.
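For illustration, a sketch of the similarity loss of Eqs. (11) and (12) computed from global image statistics follows (a simplified SSIM without local windowing; the dynamic range L is assumed to be 1 for normalised images):

```python
import torch

def ssim_global(x: torch.Tensor, y: torch.Tensor, L: float = 1.0) -> torch.Tensor:
    """Global-statistics SSIM of Eq. (11) for tensors of shape (N, 1, H, W)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x = x.var(dim=(1, 2, 3), unbiased=False)
    var_y = y.var(dim=(1, 2, 3), unbiased=False)
    cov = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def similarity_loss(fused, ct, mri) -> torch.Tensor:
    # Eq. (12): one minus the mean SSIM against both source modalities
    return (1 - 0.5 * (ssim_global(fused, ct) + ssim_global(fused, mri))).mean()
```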
Semantic loss function

The semantic loss function of the fusion network is also the segmentation loss of the image segmentation network; it reflects the loss of semantic information in the image fusion network and the segmentation discrepancy in the unsupervised image segmentation network. Its definition is based on the photometric invariance and geometric equivariance exploited by the unsupervised segmentation network, and it consists of an intra-view loss and an inter-view loss for unsupervised clustering. In the segmentation network, two random image transformations \(P_{i}^{(1)}\) and \(P_{i}^{(2)}\) are applied to each fused image \({{x}_{i}}\), and two feature vectors \(z_{ip}^{(1)}\) and \(z_{ip}^{(2)}\) are generated for each pixel p of \({{x}_{i}}\) by the feature extraction network \({{f}_{\theta }}\) of the segmentation network:$$\begin{aligned} \begin{aligned} z_{ip}^{(1)}&={{f}_{\theta }}(P_{i}^{(1)}({{x}_{i}}))[p], \\ z_{ip}^{(2)}&={{f}_{\theta }}(P_{i}^{(2)}({{x}_{i}}))[p]. \\ \end{aligned} \end{aligned}$$
(13)
Subsequently, two sets of pseudo-labels and cluster centers are obtained by performing independent clustering on the two feature views:$$\begin{aligned} \begin{aligned} {{y}^{(1)}},{{\mu }^{(1)}}&=\arg \underset{y,\mu }{\mathop {\min }}\,{{\sum \limits _{i,p}{\left\| z_{ip}^{(1)}-{{\mu }_{{{y}_{ip}}}} \right\| }}^{2}}, \\ {{y}^{(2)}},{{\mu }^{(2)}}&=\arg \underset{y,\mu }{\mathop {\min }}\,{{\sum \limits _{i,p}{\left\| z_{ip}^{(2)}-{{\mu }_{{{y}_{ip}}}} \right\| }}^{2}}, \\ \end{aligned} \end{aligned}$$
(14)
where y denotes the corresponding pseudo-labels and \(\mu \) the corresponding cluster centers. The per-pixel clustering loss is defined as a cross-entropy over the distances to the cluster centers:$$\begin{aligned} {{L}_{clust}}({{f}_{\theta }}({{x}_{i}})[p],{{y}_{ip}},\mu )=-\log \frac{{{e}^{-d({{f}_{\theta }}({{x}_{i}})[p],{{\mu }_{{{y}_{ip}}}})}}}{\sum \nolimits _{l}{{{e}^{-d({{f}_{\theta }}({{x}_{i}})[p],{{\mu }_{l}})}}}}, \end{aligned}$$
(15)
where \(d(\cdot ,\cdot )\) denotes the cosine distance. The feature vectors of each view should be consistent with the cluster labels of the same view, which defines the intra-view loss:$$\begin{aligned} \begin{aligned} {{L}_{within}}=&\sum \limits _{i,p}{{{L}_{clust}}(z_{ip}^{(1)},y_{ip}^{(1)},{{\mu }^{(1)}})} \\&+\sum \limits _{i,p}{{{L}_{clust}}(z_{ip}^{(2)},y_{ip}^{(2)},{{\mu }^{(2)}})}. \end{aligned} \end{aligned}$$
(16)
Because the two views correspond to different photometric transformations of the same image, the feature vectors of one view should also be consistent with the cluster labels of the other view, which defines the inter-view loss:$$\begin{aligned} \begin{aligned} {{L}_{cross}}=&\sum \limits _{i,p}{{{L}_{clust}}(z_{ip}^{(1)},y_{ip}^{(2)},{{\mu }^{(2)}})} \\&+\sum \limits _{i,p}{{{L}_{clust}}(z_{ip}^{(2)},y_{ip}^{(1)},{{\mu }^{(1)}})}. \end{aligned} \end{aligned}$$
(17)
Finally, the semantic loss function is the sum of the intra-view loss and the inter-view loss, expressed as:$$\begin{aligned} {{L}_{semantic}}={{L}_{within}}+{{L}_{cross}}. \end{aligned}$$
(18)
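To make the composition of Eqs. (15)–(18) concrete, the following is a minimal PyTorch sketch of the clustering-based semantic loss, assuming the per-pixel features, pseudo-labels and cluster centers of the two views have already been obtained (e.g., by the clustering step of Eq. (14)); the names and tensor shapes are our own illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clustering_loss(z: torch.Tensor, labels: torch.Tensor, centres: torch.Tensor) -> torch.Tensor:
    """Eq. (15): cross-entropy over the softmax of negative cosine distances.
    z: (P, D) pixel features, labels: (P,) pseudo-labels, centres: (K, D)."""
    # cosine distance d(z, mu) = 1 - cosine similarity
    sim = F.normalize(z, dim=1) @ F.normalize(centres, dim=1).t()   # (P, K)
    logits = -(1 - sim)                                             # -d(z, mu)
    return F.cross_entropy(logits, labels)

def semantic_loss(z1, z2, y1, y2, mu1, mu2) -> torch.Tensor:
    """Eq. (18) = intra-view loss (Eq. 16) + inter-view loss (Eq. 17)."""
    l_within = clustering_loss(z1, y1, mu1) + clustering_loss(z2, y2, mu2)
    l_cross = clustering_loss(z1, y2, mu2) + clustering_loss(z2, y1, mu1)
    return l_within + l_cross
```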
