Learning co-plane attention across MRI sequences for diagnosing twelve types of knee abnormalities

This study received ethical approval from the Institutional Review Board of The Third Affiliated Hospital of Southern Medical University (No. 201501003) and was also approved by centers B, C, D, and E. Informed consent was waived owing to the retrospective nature of the study and the anonymity of the analyzed data. This study followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines33.

Previous works on automatic knee abnormality classification

MRI sequences are three-dimensional volumes consisting of a series of single-channel slices. Traditional image analysis methods that rely on hand-crafted features are less effective at handling complex 3D images. Deep learning can automatically model the intricate relations between high-dimensional data and labels, making it feasible for many challenging tasks in medical image analysis34,35, and it has sparked research interest in knee abnormality classification20,21,22. For example, Bien et al.24 proposed MRNet, which predicts abnormalities separately from three MRI planes and combines the results by max pooling. Methods of this kind, which assign each abnormality to one dedicated network, require large computational resources in a multi-abnormality setting. Similarly, ensemble models, such as that in ref. 36, are often criticized for being redundant and lacking immediate explanation.

Other methods are designed for specific abnormality types (e.g., cruciate ligament tears)37,38. Such single-task tailoring limits their transferability to other tasks and, in turn, their applicability in clinical practice. Multi-task learning is therefore more realistic and can benefit from shared features39. However, a common phenomenon in multi-task learning is that the model may be misled when optimizing for all tasks at once owing to task heterogeneity, leading to negative transfer of performance40.

In previous works, the number of MRI sequences is usually fewer than three, so complex strategies for integrating the inputs have received little attention. Belton et al.26 explored different strategies for fusing multi-planar MRI but used simple concatenation for late fusion. This equal-contribution structure ignores the implicit correlations in low-level features and is therefore vulnerable to noise as the number of sequences increases. Characterizing the inter-sequence representations with attention learning, as in our work, helps select the most representative features and achieves better classification results.

Model overview

An overview of our model is illustrated in Fig. 5. The original data comprise PDW MR scans in different planes (sagittal, coronal, and axial) and two sequences with different contrasts (T1W in the coronal plane and T2W in the sagittal plane). All images are pre-processed into cubic volumes to serve as the model input, resulting in five original volumes (on the left) and six synthetic volumes obtained by rotating the PDW volumes (on the right). Note that this set of images is a specific case used to demonstrate and verify the capacity of our model on the internal dataset; in other words, we propose a general solution for knee abnormality diagnosis using a set of multi-planar, multi-contrast MRIs.

Fig. 5: The pipeline of our approach. We perform rotation on the PDW volumes to other planes.
The volumes are fed into three network branches (bottom left) representing the three planes and are integrated by co-plane attention learning (details in Fig. 6). The predictions from the three branches are further integrated using a probability matrix, which can be analyzed to explore the correlation between planes and abnormalities. This integration ultimately yields the diagnostic prediction for the patient. This figure was created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license (https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en).

Drawing on the clinical knowledge that different MRI planes contribute differently to diagnosis, we design three branches to extract cross-plane information from the three MRI planes, denoted by blue, green, and orange. Detailed network structures are illustrated in Fig. 6. The volumes from different planes are fed into their respective branches, allowing information to be integrated by cross-plane and cross-sequence attention. The predictions from each plane are then integrated by a probability matrix that characterizes the joint distribution of the classified abnormalities. Finally, a correlation mining module is employed to obtain the final classification result for the 12 abnormalities.

Fig. 6: Detail of the three-branch network structure. The coronal, sagittal, and axial branches (represented by orange, blue, and green blocks) target different planes. For all branches, the cross-plane volumes are encoded by a weight-sharing encoder (ResNet3D), and cross-plane spatial information is decoupled from each volume and integrated using an attention mechanism. Notably, for the sagittal and coronal branches, the PDW features join the T2W and T1W features for co-plane cross-sequence integration. In this way, the three branches yield individual prediction results.

Pre-processing with image cropping and volume rotation

Firstly, to establish a reference point for the knee, we segment the meniscus with U-Net41 as an upstream task. A coarse segmentation is adequate because we only need a rough region of interest (ROI) to identify the center of the meniscus. We crop the original images according to the segmentation mask to remove irrelevant areas and reduce the overall image size.

In a clinical setting, multi-planar MRI offers a comprehensive perspective for diagnosing abnormalities, and each MRI plane contains specific information that is crucial for decision-making. We therefore tailor three individual branches in our model to extract features from different planes. However, the slice thickness (i.e., the distance between two scanned slices) is relatively large, resulting in anisotropic spatial resolution across the axes. Interpolation is conventionally used to reduce this imbalance, but interpolating an image with itself provides no information gain and may introduce additional errors.

To solve this problem, we incorporate information from other sequences by leveraging cross-attention learning. Specifically, we utilize the high-resolution information from other MRI volumes acquired in orthogonal planes. As shown in Fig. 5, the sagittal volume (224 × 224 × 24) has high resolution in the sagittal plane but only 24 slices. When the left side of this volume is rotated to the front, it can be regarded as a coronal-plane image of 24 × 224 × 224 (note that it is still the same volume). We consider this synthetic coronal-plane volume (denoted 'Sag → Cor' in Fig. 5) derived from the sagittal plane to contain complementary information in the coronal plane. In total, we therefore generate eleven volumes (nine PDW, one T1W, and one T2W) from five images. For brevity, we call the synthetic volumes 'cross-plane volumes', as they are converted from other MRI planes. Recall that, essentially, there are only three distinct PDW volumes.
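Because a cross-plane volume is simply the same data read out slice-by-slice along an orthogonal axis, the rotation can be implemented without any resampling as an axis permutation. The following minimal PyTorch sketch illustrates the idea; the axis convention, tensor names, and specific permutations are assumptions for illustration rather than the actual pre-processing code.

```python
import torch

# The sagittal PDW scan in the text is a 224 x 224 x 24 volume: 24 sagittal
# slices with high 224 x 224 in-plane resolution (axis order assumed here as
# height x width x depth; the real loader convention may differ).
sag_pdw = torch.randn(224, 224, 24)

# Re-slicing along an orthogonal axis is an axis permutation: no voxel values
# change, the same data is simply read out in a different slice direction.
sag_to_cor = sag_pdw.permute(2, 0, 1).contiguous()  # 'Sag -> Cor': 24 x 224 x 224
sag_to_ax = sag_pdw.permute(1, 2, 0).contiguous()   # 'Sag -> Ax' (illustrative)

print(sag_to_cor.shape)  # torch.Size([24, 224, 224])
```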
Cross-plane attention enhancing spatial information

The success of deep learning models is attributable to their capability of automatically extracting high-dimensional features with rich semantic information. Attention mechanisms42, inspired by human vision, have shown their power in eliminating the interference of irrelevant features. Similarly, we adopt cross-attention to enhance the spatial information from the cross-plane volumes. Figure 6 shows the detailed structures of the three branches of our model.

We take the coronal branch as an example. As shown in Fig. 6, the coronal image is regarded as the main image, while the volumes converted from the sagittal and axial images provide additional information. Our model uses ResNet3D with 18 layers43 as the basic encoder. The 3-dimensional convolution in ResNet3D enables learning of patterns along the third dimension44, which, in our case, corresponds to the spatial information along slices. To prioritize the main volume for plane-specific feature extraction, we adopt weight sharing in the encoder across all volumes, meaning that the weights are first updated based on the main volume, allowing the network to locate lesions guided by high-resolution slices.

Moreover, the final pooling layer operates on the last two dimensions to preserve depth-wise information, and the result is then transposed. That is, the feature map is pooled and reshaped from \(\mathbb{R}^{C\times D\times H\times W}\) to \(\mathbb{R}^{D\times C}\), where C, D, H, and W denote the channel, depth, height, and width of the feature. This transforms the volumes into embeddings while keeping the order of slices, i.e., it generates a representation of each slice. In this way, the model learns to decouple three plane-specific features, each dominated by one of the three MRI planes.

To alleviate the information loss caused by the large slice thickness, we further integrate the cross-plane information by characterizing the spatial correspondence with attention. As illustrated in Fig. 6, a linear transformation with weight \(\mathbf{W}_{\mathbf{Q}}\) is applied to the main feature to obtain the query \(\mathbf{Q}\). We then use separate transformations with weights \(\mathbf{W}_{\mathbf{K}_i}\) and \(\mathbf{W}_{\mathbf{V}_i}\) to obtain the keys \(\mathbf{K}_i\) and values \(\mathbf{V}_i\). The attention output \(\mathbf{F}\) is calculated as
$$\mathbf{F}=\sum_{n=1}^{2}\left(\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}_{n}^{T}}{\sqrt{d_{\mathbf{K}_{n}}}}\right)\mathbf{V}_{n}\right),$$
(1)

where \(d_{\mathbf{K}_{n}}=C\) denotes the dimension of the keys. The resulting features pass through normalization and linear layers, and the original main feature is then added so that the model learns the residual between the original and attention features. In this way, the model identifies and absorbs similar features from the cross-plane volumes, thereby enhancing the features with rich spatial information that captures finer-grained intensity changes along the slices.
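The sketch below illustrates how this attention step can be written in PyTorch, with the main-plane slice embeddings forming the query and the two cross-plane embeddings supplying keys and values as in Eq. (1). The channel size, pooling, and placement of the normalization and linear layers are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class CrossPlaneAttention(nn.Module):
    """Sketch of cross-plane attention following Eq. (1): the main-plane slice
    embeddings form the query; the cross-plane embeddings supply keys and values."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)
        self.w_k = nn.ModuleList([nn.Linear(channels, channels, bias=False) for _ in range(2)])
        self.w_v = nn.ModuleList([nn.Linear(channels, channels, bias=False) for _ in range(2)])
        self.norm = nn.LayerNorm(channels)
        self.proj = nn.Linear(channels, channels)

    def forward(self, main_feat, cross_feats):
        # main_feat: (B, D, C) slice embeddings of the main volume.
        # cross_feats: two (B, D_i, C) embeddings of the cross-plane volumes.
        q = self.w_q(main_feat)
        out = torch.zeros_like(main_feat)
        for i, feat in enumerate(cross_feats):
            k, v = self.w_k[i](feat), self.w_v[i](feat)
            attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            out = out + attn @ v                      # summation over n = 1, 2 in Eq. (1)
        return main_feat + self.proj(self.norm(out))  # residual with the original feature


# Slice-wise embeddings: pool the encoder output over H and W, then transpose,
# i.e., reshape from (B, C, D, H, W) to (B, D, C) as described in the text.
enc_out = torch.randn(2, 512, 24, 7, 7)               # assumed ResNet3D feature map
main_embed = enc_out.mean(dim=(-2, -1)).transpose(1, 2)
cross_embeds = [torch.randn(2, 224, 512), torch.randn(2, 224, 512)]
fused = CrossPlaneAttention(512)(main_embed, cross_embeds)   # (2, 24, 512)
```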
Learning anatomical information with co-plane cross-sequence attention

PDW images carry more informative details about abnormalities during the diagnostic process, whereas T1W and T2W images provide a clear view of anatomical structures. To leverage the advantages of each image type, we introduce T1W and T2W as contrast references to PDW, forming the basis of multi-sequence attention learning in our model. In conventional feature fusion strategies, auxiliary features are usually added to the main features by summation or concatenation. However, given that sequences of different contrasts contain both sequence-specific and shared information, we propose to leverage their feature attention to filter out irrelevant features and thereby improve the accuracy of the model.

As illustrated in Fig. 6, we utilize the contrast between the two sequences to calculate the corresponding channel-wise attention. Unlike the cross-plane attention, which can be regarded as an enhancement of the main image that provides details in the depth dimension, the T1W or T2W sequence acts more like a filter that re-weights different features. Because these feature extractors should have individual weights rather than those used for the PDW sequences, the encoder in this module is trained separately, which also reduces computational complexity. Specifically, a linear layer learns the sequence-specific information under the guidance of the co-plane information from the upstream PDW volumes and projects it as attention factors onto the mixture of the two features. Here, we employ the sum operation to combine the two features, as we believe the model can learn by adding the corresponding features and guiding the integration process.
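A minimal sketch of this filtering step is shown below, assuming the channel-wise attention factors are produced by a sigmoid-gated linear layer that sees both the co-plane PDW feature and the T1W/T2W feature; the layer choices and feature dimensions are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CoPlaneSequenceAttention(nn.Module):
    """Sketch of co-plane cross-sequence filtering: a separately encoded T1W/T2W
    feature re-weights the channels of the summed PDW + T1W/T2W feature."""

    def __init__(self, channels: int = 512):
        super().__init__()
        # Linear layer producing channel-wise attention factors, guided by both
        # the upstream co-plane PDW feature and the T1W/T2W feature (assumed form).
        self.to_weights = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, pdw_feat: torch.Tensor, seq_feat: torch.Tensor) -> torch.Tensor:
        # pdw_feat: (B, C) co-plane PDW feature; seq_feat: (B, C) T1W (coronal)
        # or T2W (sagittal) feature from the separately trained encoder.
        weights = self.to_weights(torch.cat([pdw_feat, seq_feat], dim=-1))
        mixed = pdw_feat + seq_feat     # the two sequences are combined by summation
        return weights * mixed          # channel-wise re-weighting (filtering)


out = CoPlaneSequenceAttention(512)(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512)
```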
Plane-aware feature integration with abnormality probability matrix

Because the attention mechanism naturally enhances similar (closer) features, the extracted features remain predominantly influenced by the main volume, so each branch in our model retains plane-specific features. As mentioned earlier, the radiological diagnosis of abnormalities is closely linked to the MRI planes; that is, certain abnormalities exhibit distinct characteristics in specific MRI planes. In our model, the prediction from each branch does not directly correspond to the final outcome. This situation is akin to multi-instance learning, where only the patient-level label is available for training. Since we want to classify multiple abnormalities at once, it is challenging for the model to learn these relations from a simple concatenation of features that lacks task-related guidance. Wang et al.45 proposed a method of fusing multi-view information by discovering correlations within class labels. Similarly, for MRI, we propose a fusion strategy for the three branches in our model to excavate the relations between MRI planes and abnormalities.

Figure 5 shows the predictions from the different branches representing the three planes, marked by orange, blue, and green. We form a plane-aware matrix from products of the elements of these predictions: the element \(x_{i,j,k}\) is obtained by multiplying the i-th, j-th, and k-th elements of the sagittal, coronal, and axial predictions, each representing the probability of the corresponding abnormality. The matrix thus contains both inter-class correlations, learned through multi-task learning, and abnormality-plane correlation information. We then use the correlation mining module to excavate the hidden pattern, i.e., the distribution of these probabilities. The diagonal of the plane-aware matrix represents the joint probability of each class across the three planes and is treated as the base prediction in this module. Meanwhile, a 12-channel convolution is used to learn position-related information in the matrix. Similar to SENet46, the matrix is squeezed into a vector, and a linear transformation is then applied to generate the weights for the base predictions. The final prediction is obtained by scaling the diagonal vector with these weights. In the discussion section, we analyze the model predictions with CAM by back-propagating from the final prediction.
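The following sketch illustrates one way to realize the plane-aware matrix and the correlation mining module in PyTorch; the kernel size, the squeeze operation, and the sigmoid gating are assumptions, with only the overall flow (outer product of branch predictions, diagonal base prediction, SE-style re-weighting) taken from the description above.

```python
import torch
import torch.nn as nn

class CorrelationMining(nn.Module):
    """Sketch of the plane-aware probability matrix and correlation mining."""

    def __init__(self, num_classes: int = 12):
        super().__init__()
        self.conv = nn.Conv2d(num_classes, num_classes, kernel_size=3, padding=1)
        self.squeeze = nn.AdaptiveAvgPool2d(1)                 # SENet-style squeeze
        self.excite = nn.Sequential(nn.Linear(num_classes, num_classes), nn.Sigmoid())

    def forward(self, p_sag, p_cor, p_ax):
        # p_sag, p_cor, p_ax: (B, 12) abnormality probabilities from the branches.
        # Plane-aware matrix: x[b, i, j, k] = p_sag[b, i] * p_cor[b, j] * p_ax[b, k].
        matrix = torch.einsum('bi,bj,bk->bijk', p_sag, p_cor, p_ax)
        idx = torch.arange(matrix.shape[1])
        base = matrix[:, idx, idx, idx]                        # diagonal: joint probability
        mined = self.squeeze(self.conv(matrix)).flatten(1)     # 12-channel conv + squeeze
        weight = self.excite(mined)                            # per-class weights
        return weight * base                                   # final prediction


probs = [torch.rand(4, 12) for _ in range(3)]
final = CorrelationMining(12)(*probs)                          # (4, 12)
```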
As the outputs of the three branches are expected to be abnormality probabilities, we apply supervision to both the final result and the branch outputs. Specifically, we use Focal Loss47 to measure the distance between the final result \(\mathbf{y}\) and the label \(\hat{\mathbf{y}}\), which helps the model focus on difficult samples. For the predictions of the three branches \(\mathbf{y}_{branch}\), the binary cross-entropy (BCE) loss is applied. We use a hyper-parameter α to balance the losses, and the total loss is formulated as
$$L_{total}=\alpha\sum_{branch}\left(\sum_{i=1}^{12}\mathrm{BCE}\left(\mathbf{y}_{branch}^{i},\hat{\mathbf{y}}^{i}\right)\right)+\sum_{i=1}^{12}\mathrm{FL}\left(\mathbf{y}^{i},\hat{\mathbf{y}}^{i}\right).$$
(2)
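A minimal sketch of this training objective is given below, using a common binary focal loss formulation on predicted probabilities; the focusing parameter γ and the value of α are placeholders, as their values are not specified here.

```python
import torch
import torch.nn.functional as F

def focal_loss(pred, target, gamma: float = 2.0):
    """Binary focal loss on predicted probabilities (common formulation;
    the gamma used in the paper is an assumption here)."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pt = torch.where(target > 0.5, pred, 1 - pred)
    return (-(1 - pt) ** gamma * pt.log()).mean()

def total_loss(y_final, y_branches, y_true, alpha: float = 0.5):
    """Eq. (2): alpha-weighted BCE on each branch prediction plus focal loss
    on the fused prediction. alpha = 0.5 is a placeholder value."""
    branch_term = sum(F.binary_cross_entropy(y_b, y_true) for y_b in y_branches)
    return alpha * branch_term + focal_loss(y_final, y_true)

# Usage with 12 abnormality labels per patient (illustrative tensors):
y_true = torch.randint(0, 2, (4, 12)).float()
y_final = torch.rand(4, 12)
y_branches = [torch.rand(4, 12) for _ in range(3)]
loss = total_loss(y_final, y_branches, y_true)
```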
Observer study

The clinical applicability of the model was evaluated in an observer study with six radiologists on external test set II. Four junior musculoskeletal radiologists (C.X.Q., S.Y.Y., Z.Y.X., and Y.W.D., with 2, 2, 5, and 7 years of experience, respectively) and two senior musculoskeletal radiologists (C.L.L. and Y.H.Z., with 10 and 30 years of experience, respectively), all blinded to the reference standard, independently interpreted the MRI examinations. They viewed the MR images using RadiAnt DICOM Viewer software (version 2020.1, Medixant, Poland) and performed binary classifications for each type of knee abnormality. After a washout period of 30 days, they repeated the interpretations with model assistance. The comparison was performed between radiologists with and without model assistance.

Statistical analysis

The Kruskal-Wallis test was used to compare patient ages across centers, and the chi-square test was used to compare sex. Patient demographics were analyzed using SPSS software (version 23.0, IBM, Armonk, NY). The experiments were evaluated with AUC-ROC and ACC. The threshold for deciding whether a patient has an abnormality was determined by maximizing the F1 score. Unless otherwise stated, the significance of differences in AUC-ROC and ACC was assessed by the DeLong test48 and the t-test, respectively, with P ≤ 0.05 considered significant. In experiments involving radiologists, 95% confidence intervals were estimated by bootstrapping with 1000 redraws (see the illustrative sketch below). Fleiss' kappa49 was calculated to assess inter-reader agreement in dataset labeling.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
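The following sketch illustrates the threshold selection by F1 maximization and the bootstrap confidence intervals described in the statistical analysis, assuming NumPy arrays of binary labels and predicted probabilities; the threshold grid and random seed are illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def best_f1_threshold(y_true, y_prob):
    """Operating threshold that maximizes the F1 score (grid search over
    candidate thresholds; the step size is an assumption)."""
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [f1_score(y_true, y_prob >= t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

def bootstrap_auc_ci(y_true, y_prob, n_boot: int = 1000, seed: int = 0):
    """95% CI for AUC-ROC from 1000 bootstrap redraws, as in the text."""
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:   # a redraw needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return np.percentile(aucs, [2.5, 97.5])
```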
