Validation of neuron activation patterns for artificial intelligence models in oculomics

Experiment overview

Database

The UKBB is an open-access research resource containing health information for over half a million participants from the UK who were initially recruited from 2006 to 2010, with follow-up visits occurring until 2022. Ethics approval was obtained from the Northwest Multi-center Research Ethics Committee. All methods were performed in accordance with the relevant guidelines and regulations as per the ethics approval and the material transfer agreement signed between our research group and UKBB upon initial data acquisition. During the initial and subsequent assessments, non-mydriatic, 45° primary field of view, macula-centered fundus images from both the left and right eyes were captured using the TOPCON 3D OCT 1000 Mk 2.

Quality control and data curation

175,788 fundus images from 85,707 participants were obtained from the UKBB. As the raw fundus images had black borders, an internally developed computer vision algorithm was used to crop the images. A preexisting AI-based automated image screening algorithm developed by our group was then used to remove poor-quality images [35]. This AI model was designed to remove images where a significant portion of the retina was missing or key retinal landmarks were obscured because of poor or uneven illumination, artifacts, or excessive blurring. Samples of good- and poor-quality images are shown in Fig. 1.

Fig. 1 Experiment overview. Flow diagram illustrating the experimental process.

For those participants who had good-quality images, key biometric parameters, such as age, sex, and blood pressure, were retrieved and matched to the fundus image taken during the same UKBB assessment visit. Participants with invalid systolic blood pressure (SBP; UKBB field 4080) measurements were removed.
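The black-border cropping mentioned above can be illustrated with a simple intensity-threshold sketch. The study used an internally developed algorithm; the stand-in below merely assumes the border is near-black and keeps the bounding box of bright pixels:

```python
import numpy as np

def crop_black_border(image: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Crop the black border around a fundus image.

    Illustrative stand-in for the internally developed algorithm:
    keep only rows/columns whose maximum intensity exceeds a small
    threshold, i.e. the bounding box of the visible retina.
    """
    gray = image.mean(axis=2) if image.ndim == 3 else image
    rows = np.where(gray.max(axis=1) > threshold)[0]
    cols = np.where(gray.max(axis=0) > threshold)[0]
    if rows.size == 0 or cols.size == 0:  # fully black image: nothing to keep
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

A real pipeline would also guard against specular artifacts in the border region, but the bounding-box idea is the core of the step.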
After this process, 95,669 good-quality images from 58,606 participants were available for use in this study. These good-quality images were then divided in a 70:15:15 ratio into train, validation, and test splits to develop the deep learning (DL) model. Subsequently, 10,000 images (the background set) were sampled from the train and validation splits, and 5000 images (the analysis set) were sampled from the test split to develop and validate neuron activation patterns (NAPs) (Fig. 1). The demographic details of these subsets are shown in Table 1.

Table 1 Table of demographic details.

Development of systolic blood pressure prediction model

Using the train, validation, and test splits, a custom CNN based on the EfficientNetV2S [36] backbone was trained to predict SBP from fundus images. SBP was chosen as the training target because it is a proven and consistent oculomics task [6,14]. As the UKBB provides two readings for SBP, the average of the two was used as the final training target. Details of the model architecture and training configuration can be found in Supplementary Information 1. The mean absolute error (MAE) of the model was 11.65 mmHg, in line with similar studies conducted on the same dataset (MAE of 11.23 mmHg [6]).

Implementation of neuron activation patterns

Our proposed NAP framework has three stages (Fig. 2):

1. Identification of key stages of the CNN architecture to serve as representative points for monitoring.

2. Using “similarity averages” to reduce high-dimensional feature maps into NAPs.

3. Further simplification of NAPs into “activation pattern scores” to facilitate hypothesis testing.
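The first step, capturing activations only at selected architectural landmarks, can be sketched with framework-level hooks. The example below assumes a PyTorch implementation for illustration, with a toy two-stage CNN standing in for EfficientNetV2S (all module indices and stage names are illustrative, not from the study's code):

```python
import torch
import torch.nn as nn

# Toy stand-in for a staged backbone; in the study the monitored points
# were the outputs of stages 3-6 of EfficientNetV2S.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),   # "stage A"
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # "stage B"
)

captured = {}

def make_hook(name):
    # Forward hooks receive (module, inputs, output); we store the output.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Register hooks only at the selected architectural landmarks.
backbone[1].register_forward_hook(make_hook("stageA"))
backbone[3].register_forward_hook(make_hook("stageB"))

with torch.no_grad():
    backbone(torch.randn(1, 3, 64, 64))

print({k: tuple(v.shape) for k, v in captured.items()})
```

Monitoring a handful of stage outputs in this way keeps the cost linear in the number of landmarks rather than in the total number of neurons.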

Fig. 2 Illustration of the high-level process used to implement NAPs. The top element depicts the feature maps at different stages of the CNN, starting from thinner stacks of larger feature maps to deeper stacks of smaller feature maps. The middle element illustrates the concept of converting feature maps to simplified representations (similarity averages). The bottom element illustrates the approach to further simplifying similarity averages into activation pattern scores.

Selection of key points for monitoring

As modern neural networks have multiple layers, monitoring all possible neuron activations is computationally prohibitive. Like previous studies, we defined key points of interest in the neural network based on architectural landmarks, such as pooling layers or the final outputs of a convolutional block, and only examined the activations at those key points. As EfficientNetV2S [36] was inherently designed to be multi-staged, we leveraged this design intent and monitored activations at the end of stages 3, 4, 5, and 6.

“Similarity averages” for simplifying feature maps into NAPs

The reduction strategy is illustrated in Fig. 3. For a given feature map \({M}_{x,l}\in {R}^{C\times N\times M}\), generated from an input image \(x\) at stage \(l\) of the CNN, we first identified the 1% of feature maps most similar to \({M}_{x,l}\) in the 10,000-image background set. The background set, \({B}_{l}:=\left\{\left\{{M}_{1,l}, {\widehat{y}}_{1}\right\},\left\{{M}_{2,l}, {\widehat{y}}_{2}\right\},\cdots ,\left\{{M}_{b,l}, {\widehat{y}}_{b}\right\}\right\}\), can be considered a set of feature maps at a specific stage paired with their corresponding predicted SBPs. To facilitate the application of image similarity measures, channel-wise min–max scaling was used to scale the N × M 2D matrix of every feature map channel to the range 0–1.

Fig. 3 Illustration of the process used to calculate similarity averages. This illustration follows from Fig. 2, with the top element being identical. The bottom left element illustrates the expanded procedures for the “Reduction method” box presented in the middle element of Fig. 2.

Image similarity measures, specifically MS-SSIM [37], SSIM [38], and the Frobenius norm, were then applied as appropriate to quantify the degree of similarity between the feature map and the feature map entries in the background set, \(\Vert {M}_{x,l},{M}_{b,l}\Vert \). Three different image similarity measures were chosen because different levels of information were captured in the feature maps at the different stages. The feature maps from stage 3 were larger, more detailed, and more similar in appearance to the input fundus image; to deal with this increased complexity, the more performant MS-SSIM was used to derive similarities. In contrast, the feature maps from stages 4 and 5 were smaller and less detailed, with activations corresponding to higher-level features. Consequently, the less performant but more computationally efficient SSIM was used to derive similarities. Finally, as the feature maps from stage 6 were the smallest and least detailed, the simpler Frobenius norm was deemed appropriate to derive similarities at this stage.

From the set of most similar feature maps, \({S}_{x,l}\subset {B}_{l}\), we retrieved the predicted SBP values and calculated a “similarity average”, \({\mu }_{x,l}\) (Eq. 1). This similarity average can be understood as the simplified representation of a feature map \({M}_{x,l}\). We repeated this process for the \(l=\left\{stage 3, stage 4, stage 5, stage 6\right\}\) stages of EfficientNetV2S. The resulting vector of similarity averages, \(\{{\mu }_{x,stage 3},{\mu }_{x,stage 4},{\mu }_{x,stage 5},{\mu }_{x,stage 6}\}\), is the NAP synthesized by our framework.$$\begin{array}{c}{\mu }_{x,l}=\frac{1}{\left|{S}_{x,l}\right|}\sum_{\left\{M,\widehat{y}\right\}\in {S}_{x,l}}\widehat{y}\end{array}$$
(1)
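A minimal sketch of Eq. 1, assuming the background set is held as (feature map, predicted SBP) pairs. A negated Frobenius distance serves as the stand-in similarity measure here; at the earlier stages MS-SSIM or SSIM (available, e.g., via scikit-image's `structural_similarity`) would take its place. All function names are illustrative:

```python
import numpy as np

def min_max_scale(fmap):
    """Channel-wise min-max scaling of a C x N x M feature map to [0, 1]."""
    lo = fmap.min(axis=(1, 2), keepdims=True)
    hi = fmap.max(axis=(1, 2), keepdims=True)
    return (fmap - lo) / np.maximum(hi - lo, 1e-8)

def frobenius_similarity(a, b):
    """Higher is more similar: negated Frobenius distance between maps."""
    return -np.linalg.norm(a - b)

def similarity_average(fmap, background, similarity_fn=frobenius_similarity,
                       top_frac=0.01):
    """Eq. 1: mean predicted SBP over the top 1% most similar background maps."""
    scaled = min_max_scale(fmap)
    scores = [similarity_fn(scaled, min_max_scale(m)) for m, _ in background]
    k = max(1, int(round(top_frac * len(background))))
    top_idx = np.argsort(scores)[-k:]  # indices of the most similar maps
    return float(np.mean([background[i][1] for i in top_idx]))
```

Repeating `similarity_average` for each monitored stage, with the stage-appropriate similarity measure, yields the NAP vector described above.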
Summarizing NAPs through an “activation pattern score”

We then derived an “activation pattern score” (Eq. 2) from the vector of similarity averages to facilitate further statistical analysis. Internal experiments indicated that the similarity averages from stage 6 had high levels of convergence with the predicted SBP. In contrast, the similarity averages from stage 3 were constrained to a narrow band centered around the population mean of the measured SBP. As such, we defined the activation pattern score as the mean of the similarity averages from stages 4 and 5.$$\begin{array}{c}{A}_{x}=\frac{{\mu }_{x, stage 5}+{\mu }_{x,stage 4}}{2}\end{array}$$
(2)
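Given the NAP vector, Eq. 2 reduces to a two-term mean. A sketch, assuming the NAP is held as a dict keyed by stage name (keys and values are illustrative):

```python
def activation_pattern_score(nap):
    """Eq. 2: mean of the stage-4 and stage-5 similarity averages (in mmHg)."""
    return (nap["stage5"] + nap["stage4"]) / 2

# Illustrative NAP: stage 6 tracks the prediction, stage 3 sits near the
# population mean; only stages 4 and 5 contribute to the score.
nap = {"stage3": 138.0, "stage4": 131.0, "stage5": 135.0, "stage6": 129.0}
print(activation_pattern_score(nap))  # → 133.0
```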
Statistical validation

The activation pattern score was validated across two outcomes on the 5000-image analysis set (Fig. 1). As the analysis set was sampled from the hold-out test set, it was guaranteed to be independent of the 10,000-image background set used to define the NAPs.

We first investigated the relationship between the activation pattern score and real-world outcomes known to be correlated with signs of elevated SBP identified from the fundus (Fig. 4). We chose the pooled cohort equation (PCE) 10-year atherosclerotic cardiovascular disease (ASCVD) risk score [39] due to its widespread clinical use and the well-known association between SBP and increased ASCVD risk [40]. To account for the relationship between age, sex, and cardiovascular risk, the experiment was performed on age- and sex-matched groups. As a second outcome (Fig. 5), we then examined whether participants with the same predicted SBP but different activation pattern scores had differences in biomarkers known to be correlated with SBP.

Fig. 4 Data quality control for outcome 1 analysis. 763 images from participants with invalid cholesterol measurements were removed, as they could not be used for the calculation of PCE scores. 95 images from participants older than 70 were removed due to small sample size. PCE scores were calculated for the remaining participants. Sex and age groups were then constructed for hypothesis testing.

Fig. 5 Data quality control for outcome 2 analysis. 101 images from participants with predicted SBP greater than or equal to 160 mmHg were removed due to small sample size. The remaining images were divided into predicted SBP bands of 20 mmHg for hypothesis testing.

In both experiments, the statistical significance of the outcomes was validated by comparing the first-quartile (0–25% of points) and fourth-quartile (75–100% of points) groups. For continuous variables, such as biomarkers, the t-test was used to determine the significance of the difference between the two groups.
For categorical variables, such as sex, the chi-squared contingency test was used. As the trained AI model only uses a single fundus image as an input, fundus images from different eyes can result in different activation pattern scores. Accordingly, the tests were performed on a per-image rather than per-participant basis, with the biomarkers from the participant being matched to the corresponding fundus images.
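The quartile comparison can be sketched with SciPy, assuming per-image arrays of activation pattern scores alongside a continuous biomarker and a categorical variable (all variable names are illustrative):

```python
import numpy as np
from scipy import stats

def compare_quartiles(scores, biomarker):
    """Welch t-test on a continuous biomarker between the first (0-25%)
    and fourth (75-100%) quartile groups of the activation pattern score."""
    scores, biomarker = np.asarray(scores), np.asarray(biomarker)
    q1, q3 = np.quantile(scores, [0.25, 0.75])
    group_q1 = biomarker[scores <= q1]  # first-quartile group
    group_q4 = biomarker[scores >= q3]  # fourth-quartile group
    return stats.ttest_ind(group_q1, group_q4, equal_var=False)

def compare_categorical(scores, labels):
    """Chi-squared contingency test on a categorical variable (e.g. sex)
    between the first- and fourth-quartile groups."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    q1, q3 = np.quantile(scores, [0.25, 0.75])
    table = [[np.sum((scores <= q1) & (labels == c)) for c in np.unique(labels)],
             [np.sum((scores >= q3) & (labels == c)) for c in np.unique(labels)]]
    return stats.chi2_contingency(table)
```

Because the tests were run per image, each row of the input arrays corresponds to one fundus image with the participant's biomarkers attached.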
