Effective descriptor extraction strategies for correspondence matching in coronary angiography images

Figure 3. The points marked in both images correspond to the same anatomical landmark; however, they have significantly different local features in each image.

This section outlines the primary methods and key findings of our study. We first provide an overview of the main contributions and activities, followed by detailed descriptions of data preparation and network structure. Figure 4 illustrates the architecture of our proposed model.

Figure 4. The architecture of our proposed model. Two angiograms are processed through a multi-head descriptor feature extractor, generating features for each image. These features are then fed into an Attentional Graph Neural Network (AGNN), producing a score matrix that enables correspondence prediction.

The key findings and activities of this study are summarized as follows:

We annotated a set of CAG images for training point detectors and descriptor extractors using supervised learning. This approach was necessary because the self-supervised learning method used in SuperPoint is ineffective for detecting candidate points in CAG images.

We employed the HigherHRNet architecture to train our point detector and integrated a Descriptor Head with a structure similar to SuperPoint for descriptor extraction. HigherHRNet, known for its high performance in keypoint detection tasks like pose estimation, was particularly suitable for detecting sparse and indistinct local features in CAG images.

We redesigned the loss function originally used in SuperPoint to better align with the characteristics of CAG images. The model was trained to bring descriptors of the same point pairs closer and push those of different pairs further apart, classifying pairs as positive, sub-positive, negative, or hard-negative.

We implemented a multi-head architecture for the descriptor head to better capture the unique characteristics of CAG images. Descriptors from smaller patches contained more detailed information, while those from larger patches provided broader context. The multi-head architecture utilized these varying scales effectively.

Data preparation

Feature points in CAG images are typically located at the center of blood vessels. However, these points often display indistinct local features, which complicates the generation of pseudo feature points with a feature-based detector, an approach commonly used in domains other than CAG images. Additionally, pseudo image pairs generated through homographic transformation cannot realistically represent actual image pairs. Furthermore, feature points often exhibit highly different local features across CAG images. For these reasons, the self-supervised learning approach employed by the conventional SuperPoint does not work as intended on CAG images. Consequently, in this study, we created a supervised learning dataset to apply deep learning-based feature matching methods effectively while accounting for the characteristics of CAG images. We annotated all images with the names and locations of anatomical landmarks; examples of our annotations can be seen in Fig. 5. Moreover, we grouped these images by patient and target vessel type. Specifically, we categorized the LAD (Left Anterior Descending artery), LCx (Left Circumflex artery), and LM (Left Main artery) as the LCA (Left Coronary Artery) group and the remaining vessels as the RCA (Right Coronary Artery) group to create pairs for 3D reconstruction.

Figure 5. An example showing landmark annotations in a CAG image.

The network structure

Distinct from SuperPoint, our approach involved annotating a dataset specifically designed for the supervised learning of extremely sparse anatomical landmarks. For the detection of these anatomical landmark points, we employed HigherHRNet, well known for its high performance in keypoint detection tasks such as pose estimation30. HigherHRNet is an enhanced model based on HRNet, which previously demonstrated high performance in human pose estimation by applying Gaussian maps to points to generate heatmaps31. It reportedly maintains the performance of large-scale features while incorporating smaller-scale features, leading to significant performance improvements. In this study, when training our keypoint detector with a multi-head descriptor (Figs. 4, 6) for anatomical landmarks, a single CAG image was used as the input to the model. After sufficient training, the feature map fed into the point-detection head can be expected to represent the image effectively. Therefore, for efficient descriptor extractor training, all parameters of the trained keypoint detector were frozen and a descriptor extractor head was added to the network, as sketched below.
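As a concrete illustration of this training setup, the following PyTorch sketch freezes a trained keypoint detector and attaches a descriptor head. The module structure, channel sizes, and names (DescriptorHead, attach_descriptor_head) are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class DescriptorHead(nn.Module):
    """Illustrative descriptor head: maps backbone features to L2-normalized
    descriptor vectors, one per patch (channel sizes are hypothetical)."""
    def __init__(self, in_channels=256, descriptor_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, descriptor_dim, kernel_size=1),
        )

    def forward(self, feature_map):
        desc = self.conv(feature_map)                     # (B, D, H_p, W_p)
        return nn.functional.normalize(desc, p=2, dim=1)  # unit-length descriptors

def attach_descriptor_head(trained_keypoint_detector: nn.Module,
                           descriptor_head: nn.Module) -> nn.ModuleDict:
    """Freeze the trained keypoint detector so only the descriptor head learns."""
    for p in trained_keypoint_detector.parameters():
        p.requires_grad = False
    trained_keypoint_detector.eval()
    return nn.ModuleDict({
        "backbone": trained_keypoint_detector,
        "descriptor_head": descriptor_head,
    })
```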
Figure 6. Visual examples of three patch sizes. The blue patch is less useful for training due to vague features. The orange patch shows detail but may miss the vascular shape. The red patch captures the vascular shape but could include interfering information.

Redesigned loss function

In the original method for training the descriptor extractor, the input image was segmented into patches of \(8\times 8\) pixels, from which a single descriptor vector was extracted per patch. After obtaining descriptor vectors from all patches, interpolation was performed to ensure that every pixel within these patches possessed an individual descriptor vector. Let D and \(D'\) be the sets of descriptor vectors (\(d, d'\)) extracted from all patches of each image, with S defined as a function returning 1 for positive pairs and 0 for negative pairs of these descriptor vectors. The original loss function used to train the descriptor extractor is as follows28:$$\begin{aligned} Loss(D, D', S) = \sum _{d \in D} \sum _{d' \in D'} \{ \lambda \cdot S(d, d') \cdot (- d^T d') + (1 - S(d, d')) \cdot (d^T d') \} \end{aligned}$$
(1)
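For reference, the sketch below is a direct tensorized transcription of Eq. (1) as written above, assuming D and D' are stacked into matrices of L2-normalized row vectors and S is given as a binary matrix; the function name and batching scheme are assumptions.

```python
import torch

def original_descriptor_loss(D, D_prime, S, lam=1.0):
    """Eq. (1): lambda * S * (-d^T d') + (1 - S) * (d^T d'), summed over all pairs.
    D: (N, dim), D_prime: (M, dim) with L2-normalized rows; S: (N, M) binary."""
    sim = D @ D_prime.T                        # cosine similarity d^T d' for every pair
    loss = lam * S * (-sim) + (1.0 - S) * sim
    return loss.sum()
```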
Both d and \(d'\) are normalized in advance, so \(d^T d'\) represents the cosine similarity between the two vectors. Essentially, (1) is devised to increase the cosine similarity for positive descriptor pairs while decreasing it for negative descriptor pairs. However, through our empirical studies, we found that directly applying the original loss function was inefficient for training on CAG images, which are characterized by limited labeled points per image and an insufficient number of images. Accordingly, modifications were made to this loss function to better leverage the constructed data and to adapt more effectively to the training process for CAG images. Each alteration is explained in detail below.

Exponential function and constant terms

The conventional loss function uses the cosine similarity represented as \(d^T d'\). However, in our experimental trials, applying this function directly to CAG images caused the model to consistently converge towards a dysfunctional state, in which it generated only two kinds of descriptor vectors: one when a candidate point existed and one when none was present. To tackle this problem, we applied a convex exponential function to the cosine similarity. Additionally, we added a constant term to the loss function to ensure that its minimum value is always zero; under these adjustments, the loss converges progressively towards zero as the dot product increases, instead of growing indefinitely. In other words, instead of \(d^T d'\) in the conventional SuperPoint loss function, we propose an improved version denoted as \(similarity(d^T d')\), formulated as follows:$$\begin{aligned} similarity(d^T d') = exp(-d^T d' - 1) - exp(-2) \end{aligned}$$
(2)
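A one-line sketch of Eq. (2); the function name is arbitrary, and dot is assumed to be a tensor of pairwise dot products.

```python
import math
import torch

def similarity(dot):
    """Eq. (2): exp(-d^T d' - 1) - exp(-2); reaches its minimum of 0 at d^T d' = 1."""
    return torch.exp(-dot - 1.0) - math.exp(-2.0)
```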
Focal loss for class imbalance

For CAG images in particular, extreme class imbalance can significantly undermine training efficiency. To tackle this challenge, we incorporated the concept of FocalLoss into our methodology32. FocalLoss is well known for its effectiveness in mitigating the performance degradation typically associated with training on imbalanced datasets. It achieves this by assigning less weight to easy samples, i.e., samples with smaller losses during training, so as to focus more on samples that do not achieve a sufficiently low loss. Essentially, FocalLoss is a modification of the cross-entropy loss in which the weight is scaled by an exponential term \((1-p)^{\gamma }\). The specific formulation is as follows:$$\begin{aligned} FocalLoss = - \lambda (1-p)^{\gamma } ln(p) \end{aligned}$$
(3)
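A minimal transcription of Eq. (3); lam and gamma defaults are placeholders.

```python
import torch

def focal_loss(p, lam=1.0, gamma=2.0):
    """Eq. (3): -lambda * (1 - p)^gamma * ln(p); p is a probability tensor in (0, 1]."""
    return -lam * (1.0 - p) ** gamma * torch.log(p)
```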
The objective of FocalLoss is to ensure a sharper decline in the loss function as the probability value p nears 1. When integrating FocalLoss into (2), we add a weight that encourages a steeper reduction in loss as the cosine similarity approaches 1. In this study, instead of the previously proposed term \(similarity(d^T d')\), we adopt \(FocalSimilarity(d^T d')\), defined as follows:$$\begin{aligned} FocalSimilarity(d^T d') = (1-d^T d')^{\gamma } \, similarity(d^T d') \end{aligned}$$
(4)
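Combining Eqs. (2) and (4), the focal-weighted similarity could be written as in the sketch below; the default gamma is an assumed placeholder, as its value is not stated here.

```python
import math
import torch

def focal_similarity(dot, gamma=2.0):
    """Eq. (4): (1 - d^T d')^gamma * similarity(d^T d').
    gamma=2.0 is an assumption, not necessarily the value used in the paper."""
    sim = torch.exp(-dot - 1.0) - math.exp(-2.0)   # similarity(d^T d'), Eq. (2)
    return (1.0 - dot) ** gamma * sim
```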
Sub-positive pairs

Feature points may be located on the boundaries of a patch or at the center of a vessel that is wider than the patch size. In these situations, it can be challenging for descriptors to fully represent the features of the patch. However, in CAG images, where feature point labels are extremely sparse, efficient learning from these instances is essential. Despite this need, the conventional SuperPoint loss, which designates pairs of patches containing the same points in two images as positive pairs so that their descriptors become closer during training, struggles to learn effectively from these scenarios. This is primarily because the conventional SuperPoint loss was designed with general domains in mind and thus cannot effectively address the unique challenges inherent in CAG images. To address these issues, we propose the concept of sub-positive pairs, under the assumption that neighboring patches may share some characteristics; one possible construction of these pair indicators is sketched after Eq. (5). These sub-positive pairs are incorporated into our loss function as follows:$$\begin{aligned} loss_{pos} (d, d', s_{pos} ) & = \lambda _{pos} \cdot s_{pos} \cdot FocalSimilarity(d^T d') \\ loss_{sub} (d, d', s_{sub} ) & = \lambda _{sub} \cdot s_{sub} \cdot FocalSimilarity(d^T d') \end{aligned}$$
(5)
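The sketch below shows one plausible way to build the positive and sub-positive indicator matrices from annotated landmark patch coordinates, assuming that the 8-neighborhood of the true patch in the second image counts as sub-positive; the neighborhood choice and function name are assumptions, not the paper's exact rule.

```python
import torch

def pair_masks(coords_a, coords_b, grid_hw):
    """Hypothetical construction of positive and sub-positive masks.
    coords_a, coords_b: (K, 2) patch (row, col) indices of the same K landmarks
    in image A and image B; grid_hw: (H_patches, W_patches).
    A pair is positive when both patches contain the same landmark; patches
    adjacent (8-neighborhood) to the true patch in image B are sub-positive."""
    H, W = grid_hw
    n = H * W
    s_pos = torch.zeros(n, n)
    s_sub = torch.zeros(n, n)
    for (ra, ca), (rb, cb) in zip(coords_a.tolist(), coords_b.tolist()):
        ia = ra * W + ca
        s_pos[ia, rb * W + cb] = 1.0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                r, c = rb + dr, cb + dc
                if 0 <= r < H and 0 <= c < W:
                    s_sub[ia, r * W + c] = 1.0
    return s_pos, s_sub
```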
Hard-negative pairs

In the conventional SuperPoint loss, all pairs that are not positive are deemed negative, and their descriptors are trained to diverge from each other. However, even among non-positive pairs, patches in which feature points are detected carry more importance than those without. If the point detector operates properly, descriptors of patches without feature points will rarely serve as input to the feature matcher. Thus, we introduce hard-negative pairs to assign greater weight in the loss calculation to descriptors of patches in which feature points are detected. In contrast to positive pairs, where an increase in similarity is sought, for hard-negative pairs we aim for a decrease in similarity. Hence, we invert the sign of \(d^T d'\) when computing the similarity function, as follows:$$\begin{aligned} loss_{neg} (d, d', s_{neg} ) = \lambda _{neg} \cdot s_{neg} \cdot FocalSimilarity(-d^T d') \end{aligned}$$
(6)
$$\begin{aligned} loss_{hard} (d, d', s_{hard} ) = \lambda _{hard} \cdot s_{hard} \cdot FocalSimilarity(-d^T d') \end{aligned}$$
(7)
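A self-contained sketch of the non-positive terms in Eqs. (6) and (7), with the sign inversion described above; the lambda weights and gamma are placeholders.

```python
import math
import torch

def negative_pair_loss(dot, s_neg, s_hard, lam_neg=1.0, lam_hard=1.0, gamma=2.0):
    """Eqs. (6)-(7): the sign of d^T d' is inverted so the loss vanishes as
    negative and hard-negative descriptors become dissimilar (d^T d' -> -1).
    lam_neg, lam_hard and gamma are assumed placeholders."""
    inv = -dot                                                   # sign inversion
    fs = (1.0 - inv) ** gamma * (torch.exp(-inv - 1.0) - math.exp(-2.0))
    return lam_neg * s_neg * fs + lam_hard * s_hard * fs
```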
Combined

Finally, we trained the descriptor extractor using a modified loss function that incorporates four types of image patch pairs, illustrated in Fig. 7. Here, \(S_{pos}\) denotes a function that returns 1 for positive pairs and 0 otherwise; \(S_{sub}\), \(S_{neg}\), and \(S_{hard}\) are defined analogously.$$\begin{aligned} Loss(D, D', S_{pos}, S_{sub}, S_{neg}, S_{hard}) &= \sum _{d \in D} \sum _{d' \in D'} \{loss_{pos} (d, d', S_{pos} (d, d') ) + loss_{sub} (d, d', S_{sub} (d, d') ) \\ & \quad + loss_{neg} (d, d', S_{neg} (d, d') ) + loss_{hard} (d, d', S_{hard} (d, d') ) \} \end{aligned}$$
(8)
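Putting the four terms together, Eq. (8) could be assembled as in the sketch below, where the s_* matrices are binary pair indicators over all patch pairs; the lambda weights and gamma remain illustrative placeholders.

```python
import math
import torch

def focal_similarity(dot, gamma=2.0):
    # Eq. (4) built on Eq. (2)
    return (1.0 - dot) ** gamma * (torch.exp(-dot - 1.0) - math.exp(-2.0))

def combined_descriptor_loss(D, D_prime, s_pos, s_sub, s_neg, s_hard,
                             lambdas=(1.0, 0.5, 1.0, 1.0), gamma=2.0):
    """Eq. (8): sum of positive, sub-positive, negative and hard-negative terms.
    D: (N, dim), D_prime: (M, dim) with L2-normalized rows; s_*: (N, M) binary.
    The lambda weights and gamma are assumptions, not the paper's values."""
    l_pos, l_sub, l_neg, l_hard = lambdas
    dot = D @ D_prime.T                          # pairwise cosine similarities
    pos_term = focal_similarity(dot, gamma)      # pulls positive pairs together
    neg_term = focal_similarity(-dot, gamma)     # pushes non-positive pairs apart
    loss = (l_pos * s_pos * pos_term + l_sub * s_sub * pos_term
            + l_neg * s_neg * neg_term + l_hard * s_hard * neg_term)
    return loss.sum()
```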
Figure 7. Examples illustrating the four types of pairs for the proposed loss. When the red patch in the left image is taken as the reference, the red patch in the right image is a positive pair, the orange patch a sub-positive pair, the blue patch a hard-negative pair, and all unmarked patches are negative pairs.

Multi-head descriptor structure

The conventional SuperPoint employs a fixed patch size of \(8\times 8\) for descriptor computation. However, in the CAG image dataset, as illustrated in Fig. 6, the features that an \(8\times 8\) patch can represent are substantially limited. This necessitates adjusting the patch size used by the descriptor head for better feature representation. As indicated by the red patch in Fig. 6, a \(32\times 32\) patch adequately captures the blood vessels surrounding a point, but it may also include unnecessary parts of the vascular shape or lie too close to patches associated with other points. Conversely, as shown by the orange patch in Fig. 6, a \(16\times 16\) patch enables more detailed feature representation within a single image, but may not completely capture the features specific to an individual point. Moreover, the appropriate patch size can fluctuate with factors such as image capture magnification or variations in vascular shape among patients. To mitigate this trade-off, we designed our model with a multi-head structure that accommodates both \(8\times 8\) and \(16\times 16\) patch scales. This structure yields two descriptors for each point as its final output, and the similarity is then computed by summing the similarities obtained from each descriptor.

Figure 8. Cases of correspondence matching failure using a conventional method.

Figure 9. Examples of inference results from the test set, where red dotted lines indicate the highest-confidence matching point pairs for corresponding anatomical landmarks, and lime-colored dotted lines represent the output of our proposed model. The top four figures show results from the in-house test set, and the bottom six figures show results from the public dataset.
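The multi-head similarity described above, summing the similarities obtained from the \(8\times 8\)- and \(16\times 16\)-scale descriptors of each point, might be computed as in this sketch; tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def multihead_similarity(desc_8_a, desc_16_a, desc_8_b, desc_16_b):
    """Pairwise point similarity as the sum of the similarities from the
    8x8-scale and 16x16-scale descriptor heads (each input: (N, dim))."""
    sim_8 = F.normalize(desc_8_a, dim=1) @ F.normalize(desc_8_b, dim=1).T
    sim_16 = F.normalize(desc_16_a, dim=1) @ F.normalize(desc_16_b, dim=1).T
    return sim_8 + sim_16
```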
