Comparison of AI-integrated pathways with human-AI interaction in population mammographic screening for breast cancer

Inclusion and ethics

The conceptualisation, design and implementation of this study were conducted in close collaboration between clinical staff working in the organised population breast screening service in Victoria, Australia and local academic researchers. This study used the ADMANI datasets created from the state screening programme, BreastScreen Victoria's retrospective image and non-image database, which was accessed and governed under the executed licence agreement with BreastScreen Victoria and the BRAIx Project Multi-Institutional Agreement. The study's conduct was approved by the St Vincent's Hospital, Melbourne Human Research Ethics Committee (SVHM HREC), approval numbers LNR/18/SVHM/162 and LNR/19/SVHM/123. All BreastScreen participants sign a consent form at screening registration that provides for the use of de-identified data for research purposes. A unique identifier is used for the purposes of the ADMANI datasets, with all image and non-image data de-identified.

Screening programme

The BreastScreen Victoria screening programme is a population screening programme targeted at women aged 40+, with those aged 50–74 actively recruited. A typical BreastScreen Victoria client has a mammogram taken with a minimum of four standard mammographic views (left and right mediolateral oblique, MLO, and craniocaudal, CC) every 2 years. Annual screening is offered to a small proportion of high-risk clients (<2%).

Every client undergoing screening through BreastScreen Victoria experiences a standardised screening pathway and data generation process (Supplementary Fig. 11). Each mammogram is read independently by two breast imaging radiologists who indicate suspicion of cancer, all-clear, or technical rescreen. If there is disagreement, a third reader, with visibility of the original two readers' decisions, determines the final reading outcome. Clients with a suspicion of cancer are recalled for assessment. At assessment, further clinical workup and imaging are performed. Any client who has a biopsy-confirmed cancer at assessment (within six months of screening) is classified as a screen-detected cancer (true positive). Any client who is recalled but confirmed with no cancer after follow-up assessment is classified as either benign or no significant abnormality (false positive). Clients who are not recalled at reading and do not develop breast cancer within the next screening interval are classified as normal (true negative). Clients who develop breast cancer between 6 months after a screen and the date of their next screen (12 or 24 months) are classified as interval cancers (false negative). The datasets we use are structured around individual screening episodes of clients attending BreastScreen Victoria. A screening episode is defined as a single screening round that includes mammography, reading, assessment, and the subsequent screening interval.

Study datasets

The datasets used in this study were derived from the ADMANI datasets28. The ADMANI datasets comprise 2D screening mammograms with associated clinical data collected from 46 permanent screening clinics and two mobile services across the state of Victoria, Australia. The entire datasets span 2013–2019: 2013–2015 were cancer-enriched samples and not used for testing, while 2016–2019 were complete screening years containing all episodes. Screening episodes that were missing any of the standard mammographic views (left and right MLO and CC) or had incomplete image or clinical data were excluded (Fig. 4).
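For illustration, the episode outcome definitions in the screening pathway above can be restated as a minimal Python sketch; the argument names are hypothetical placeholders, not fields of the ADMANI schema.

```python
def classify_episode(recalled: bool, screen_detected_cancer: bool,
                     interval_cancer: bool) -> str:
    """Return the ground-truth class of a screening episode.

    Argument names are illustrative placeholders, not ADMANI fields.
    `screen_detected_cancer` means a biopsy-confirmed cancer found at
    assessment within six months of screening; `interval_cancer` means a
    cancer diagnosed later, before the next scheduled screen.
    """
    if recalled:
        if screen_detected_cancer:
            return "screen-detected cancer (true positive)"
        return "benign or no significant abnormality (false positive)"
    if interval_cancer:
        return "interval cancer (false negative)"
    return "normal (true negative)"
```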
If a screening episode had multiple screening attempts, only the final attempt was used. Clients with breast implants or other medical devices were included. After exclusions, a random number generator was used to allocate 20% of all screening clients to the study cohort; only the complete screening years (2016–2019) were included, to ensure a representative sample. All screening episodes associated with these clients were then included in the study dataset. The remaining 80% of clients and their associated screening episodes were used in model training and development. The study dataset was further split into a testing dataset (75% of clients in the study) and a development dataset (25% of clients in the study) on which operating points were set. The testing dataset comprised 149,105 screening episodes from 92,839 clients, and the development dataset 49,796 screening episodes from 30,946 clients (Table 2). The mammograms were processed using Python version 3.6 with packages gdcm version 2.8.9 and pydicom version 2.1.2. The non-image datasets were processed using Python version 3.11 with packages numpy version 1.25.1 and pandas version 2.0.3.

Fig. 4: Screening episode exclusion criteria. Flow diagram of study exclusion criteria for screening episodes from the standardised screening pathway at BreastScreen Victoria. Missing data could be clinical data without mammograms or mammograms without clinical data; clinical data could also be incomplete, missing assessment, reader or screening records. Earlier screening attempt refers to a client returning for imaging as part of the same screening round; only the last attempt was used. Failed outcome determination and failed outcome reduction refer to being unable to confirm the final screening outcome for the episode. Missing reader records refers to missing reader data. Inconsistent recall status refers to conflicting data sources on whether an episode was recalled. Incomplete screening years refers to years in which we did not have the full year of data to sample from (2013–2015); these years were excluded from the testing and development datasets as they are not representative.

Table 2 Summary and characteristics of data used in the study

The study dataset has strong ground truths for all cancers (screen-detected and interval) and non-cancers (normal, benign, or no significant abnormality (NSA) with no interval cancer). Cancer was confirmed by histopathology for screen-detected cancers or obtained from cancer registries for interval cancers. The histopathological proof was predominantly from an assessment biopsy confirmed with subsequent surgery. The ground truth for clients without cancer was a non-cancer outcome after reading and no interval cancer (normal), or a non-cancer outcome after assessment and no interval cancer (benign or NSA). Information on country of birth, whether or not the client identifies as Aboriginal and/or Torres Strait Islander, and age was collected at the time of screening. Responses for country of birth and Aboriginal and/or Torres Strait Islander identification were aggregated into categories of First Nations Australians and regions. No analysis on sex or gender was performed as it was not available in the dataset.
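A minimal sketch of the client-level allocation described above (20% of clients to the study cohort, then a 75/25 split into testing and development). It assumes a pandas DataFrame with a `client_id` column; the column name and seed are hypothetical, and this only illustrates the logic that all of a client's episodes follow the client into one split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # any fixed seed; for reproducibility only


def split_clients(episodes: pd.DataFrame) -> dict:
    """Allocate clients (not episodes) to training, testing and development.

    Assumes a `client_id` column (hypothetical name). All screening
    episodes of a client follow that client into its split, so no client
    appears in more than one dataset.
    """
    clients = episodes["client_id"].unique()
    study = rng.choice(clients, size=round(0.2 * len(clients)), replace=False)
    testing = rng.choice(study, size=round(0.75 * len(study)), replace=False)
    development = np.setdiff1d(study, testing)
    return {
        "training": episodes[~episodes["client_id"].isin(study)],
        "testing": episodes[episodes["client_id"].isin(testing)],
        "development": episodes[episodes["client_id"].isin(development)],
    }
```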
Separately from the retrospective analysis, a prospective dataset was collected in real time (daily) from December 2021 to May 2022 at a single reading and assessment unit (St Vincent's BreastScreen, Melbourne, Australia) using mammography machines from two manufacturers. The prospective dataset contains the same ground truth and demographic information, with the exception of interval cancer data, which was not yet available at the time of publication. The prospective dataset consisted of a total of 25,848 episodes and 108,654 images from 25,848 clients, with a total of 195 screen-detected cancers (Supplementary Table 5).

AI reader system

For this study, we used the BRAIx AI Reader (v3.0.7), a mammography classification model developed by the BRAIx research programme. The model is based on an ensemble of modern deep-learning neural networks and trained on millions of screening mammograms. We studied and created an ensemble from ResNet30, DenseNet31, ECA-Net32, EfficientNet33, Inception34, Xception35, ConvNext36 and four model architectures developed specifically for our problem, including two multi-view models that use two mammographic views of the same breast concurrently37, and two single-image interpretable models that provide improved prediction localisation and interpretability38. Each model in the ensemble was implemented in PyTorch39 and trained on data splits from the training set. The models were trained for 10–20 epochs using the Adam optimiser40 with an initial learning rate of \(10^{-5}\), a weight decay of \(10^{-6}\) and the AMSGrad variant enabled41. The training set was selected to have about a 10:1 ratio of non-cancers (benign, no significant abnormality and normal) to screen-detected cancers. To enforce this ratio, not all of the available non-cancer images in the dataset were necessarily used during training. Images were pre-processed to remove text and background annotations outside the breast region, then cropped and padded to a fixed height-to-width ratio of 2:1. Data augmentation consisted of random affine transformations42.

The AI reader is image-based and produces a score associated with the probability of malignancy for each image. Image scores are combined to produce a score for each breast, and the maximum breast score is the episode score. Decision thresholds convert each episode (or breast) score to a recall or no-recall decision. There is no minimum number of images required. Elliptical region-of-interest annotations are produced from the pixels that contribute most to the classification score, and multiple regions are ranked by importance (Supplementary Fig. 12). The reader has been evaluated on publicly available international datasets and achieved state-of-the-art performance (Supplementary Table 4). The distribution of episode scores from the study dataset, useful for inter-study comparisons, is also available (Supplementary Table 3).
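For illustration only, a minimal sketch of the score aggregation just described. The image-to-breast combination rule (a mean, here) is an assumption, since the text specifies only that the episode score is the maximum of the breast scores.

```python
def episode_score(breast_image_scores: dict) -> float:
    """Aggregate per-image malignancy scores into an episode score.

    `breast_image_scores` maps a breast label ("left"/"right") to the
    scores of that breast's images. Combining image scores by their mean
    is an assumption; taking the maximum breast score as the episode
    score is as described in the text.
    """
    breast_scores = [sum(s) / len(s) for s in breast_image_scores.values()]
    return max(breast_scores)


def recall_decision(score: float, threshold: float) -> bool:
    """Convert an episode (or breast) score into a recall decision."""
    return score >= threshold


# Example: two views per breast; the scores themselves are made up.
s = episode_score({"left": [0.02, 0.04], "right": [0.61, 0.48]})
print(s, recall_decision(s, threshold=0.5))
```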
Simulation design, operating points and evaluation metrics

To provide insights into the AI reader and its potential in clinical application, we performed retrospective simulation studies in which we evaluated the AI reader as a standalone reader and in three AI-integrated screening scenarios. The simulation studies were conducted using the R programming language version 4.2.2, including the packages dplyr version 1.1.0, tidyr version 1.3.0 and renv version 0.15.543.

AI-integrated screening scenarios

Five scenarios were considered to evaluate the AI reader integrated into the screening pathway: AI standalone, AI single-reader, AI reader-replacement, AI band-pass, and AI triage. In the one-reader pathway, a single reader makes a decision on all episodes, and that decision is final. This pathway covers the AI standalone and AI single-reader scenarios. In the two-reader-and-consensus pathway, the first two readers individually decide whether or not to recall the client for further assessment. If the two readers agree, that is the final reading outcome. If they disagree, a third reader, who has access to the first two readers' decisions and image annotations, arbitrates the decision. This pathway covers the AI reader-replacement, AI band-pass and AI triage scenarios.

In the AI standalone scenario, the AI reader replaces the (only) human reader in the one-reader pathway and provides the same binary recall or no-recall outcome as the human readers on all episodes. In the AI single-reader scenario, the AI reader acts as an assistive tool to human readers. It provides a binary recall or no-recall outcome to the human reader, but it does not make any decision on its own. The human reader with access to the AI output first makes a decision and then considers whether to revise it should there be disagreement with the AI output.

In the AI reader-replacement scenario, the AI reader replaces one of the first two readers in the screening pathway and provides the same binary recall or no-recall outcome as the human readers. The first and second readers are replaced at random (with equal probability) for each episode. As the AI reader could trigger a third read that did not exist in the original dataset, the third reader was simulated for all episodes, even when an original third read was present. This approach prevents unduly tying the result to the dataset and yields better variability estimates. Sensitivity analyses, in which the third reader uses the real data when possible and in which the replaced second reader is used as the third reader, were also performed (Supplementary Table 6). The third reader in our retrospective cohort operated with a sensitivity of 97.5% and a specificity of 56.3%, and we simulated the third reader to match this performance. Concretely, whenever an episode reaches the simulated third reader, the reader makes a recall decision by first inspecting the actual episode outcome: if the outcome is cancer, it predicts recall 97.5% of the time (and the opposite 2.5% of the time); if the outcome is non-cancer, it predicts no-recall 56.3% of the time. This is achieved by sampling from the uniform(0,1) distribution with the corresponding probability, and it ensures that the simulated performance matches the real-world performance. Confidence intervals were generated through 1000 repetitions of each simulation.
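A minimal sketch of the simulated third reader described above, assuming the ground-truth episode outcome is available (as it is in the retrospective dataset); the study's simulations were run in R, so this Python version is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

THIRD_READER_SENSITIVITY = 0.975  # observed in the retrospective cohort
THIRD_READER_SPECIFICITY = 0.563


def simulated_third_reader(episode_is_cancer: bool) -> bool:
    """Sample a third-reader recall decision at the observed performance.

    Cancer episodes are recalled with probability equal to the observed
    sensitivity; non-cancer episodes are recalled with probability equal
    to one minus the observed specificity.
    """
    u = rng.uniform(0.0, 1.0)
    if episode_is_cancer:
        return u < THIRD_READER_SENSITIVITY
    return u < (1.0 - THIRD_READER_SPECIFICITY)
```

Repeating the full simulation (1000 times in the study) then yields the distribution from which confidence intervals are taken.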
As a remark, we emphasise that the third reader should be simulated with reference to the real-world Reader 3, rather than by reusing data from the replaced readers (e.g. Reader 1 or Reader 2)20,22. Reusing data is convenient, as the replaced readers have seen all the episodes, and it avoids simulating the third reader. However, it overlooks the fact that while Readers 1 and 2 make independent judgements, they are conditionally dependent: in simple terms, a difficult cancer case is difficult for any reader. In such cases, the two readers would frequently miss together even though they judge independently, and the overall sensitivity would drop if either of them were used as an arbiter. In general, Reader 3 (the arbiter) makes decisions differently from Readers 1 and 2 because Reader 3 has access to their decisions and analyses. If Readers 1 and 2 were used in place of Reader 3 in the simulation, the results would be distorted, as we see in Supplementary Table 6.

In the AI band-pass scenario, the AI reader was used analogously to a band-pass filter. The AI reader provided one of three outcomes: recall, pass, or no-recall. All episodes with the recall outcome were automatically recalled, and all episodes with the no-recall outcome were not recalled; the AI reader made these final decisions with no human reader involvement. All episodes with the pass outcome were sent to the usual human screening pathway, where the original reader decisions were used.

In the AI triage scenario, the AI reader triages the episodes before the human readers. Episodes with high scores continue to the standard pathway, and episodes with low scores go through a pathway with only one reader. For episodes sent to the standard pathway, the original reader decisions were used; for episodes sent to the single-reader pathway, the reader decision was sampled randomly (with equal probability) from the first and second readers. The AI reader made no final decision on any of the episodes.

AI operating points

Three sets of operating points were used in the study: the AI reader-replacement operating point, the AI band-pass operating point and the AI triage operating point. There are three sets for five scenarios because the AI standalone and AI single-reader scenarios use the same operating point as the AI reader-replacement scenario. All operating points were set on the development set (Supplementary Fig. 13).

The AI reader-replacement operating point used a set of manufacturer-specific thresholds to convert the prediction scores into a binary outcome: recall or no-recall. The operating point was chosen to improve on the weighted mean individual reader's sensitivity and specificity. The weighted mean individual reader was the weighted (by number of reads) mean of the sensitivity and specificity of the individual (first and second) radiologists when operating as a first or second reader. Of all operating points that improved on the weighted mean individual reader's sensitivity and specificity, the point with the maximum Youden's index44 was chosen.

The AI band-pass reader used two sets of manufacturer-specific thresholds to convert the prediction scores into three outcomes: recall, pass and no-recall. The AI band-pass simulation was evaluated at different AI reader thresholds via a grid search. At each evaluation, the two thresholds for each manufacturer were set to a target high-specificity point and a target high-sensitivity point. All episodes with a score above the high-specificity point were given the recall outcome, all episodes with a score below the high-sensitivity point the no-recall outcome, and all episodes in between were given the pass outcome. The final AI band-pass thresholds were chosen from the simulation results as the point with the maximum Youden's index among those with non-inferior sensitivity and specificity relative to the two-reader-with-arbitration system.
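A sketch of the band-pass decision rule under these two thresholds; the parameter names are illustrative.

```python
def band_pass_outcome(score: float, high_sensitivity_point: float,
                      high_specificity_point: float) -> str:
    """Three-way band-pass outcome for one episode score.

    The high-specificity point is the upper threshold (the AI recalls
    above it with no human involvement) and the high-sensitivity point
    is the lower threshold (the AI clears below it); episodes in between
    pass to the usual human reading pathway.
    """
    if score > high_specificity_point:
        return "recall"
    if score < high_sensitivity_point:
        return "no-recall"
    return "pass"
```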
The AI triage reader used the 90% quantile of the prediction scores as the threshold to convert the prediction scores into the triage outcome: the standard pathway or the one-reader pathway. Episodes with prediction scores below the threshold are assigned to the one-reader pathway; otherwise, they are assigned to the standard pathway.

When there is interaction between the AI reader and the human reader, i.e. the human may revise their decision based on the AI output, the AI reader always uses the reader-replacement operating point (which is also the standalone operating point). To clarify, taking the AI triage scenario as an example, the AI triage reader uses the triage operating point to decide whether an episode should go to the standard pathway or the one-reader pathway. Once that is decided and the episode reaches a reader, the reader has access to an AI-assist reading tool that operates at the reader-replacement operating point. Overall, therefore, two operating points function at the same time.

Human–AI interaction

We simulate three interaction effects: positive, neutral and negative. All three interactions involve an AI reader and a human reader. The AI reader first makes a recall decision (using the assistive operating point), and then the human reader makes their decision with access to the AI output. The human may adjust their decision if it differs from the AI's, and this happens (100 × p)% of the time, where p is a parameter varied between 0 and 1 across simulations. For example, when p = 0.1, human readers adjust their decision 10% of the time when it differs from the AI's, and when p = 1, human readers change all their decisions to align with the AI. This models the automation effect, which we refer to as the neutral interaction.

For positive interactions, the human readers adjust their decision only if the AI is correct. This models the situation where the AI enhances human reading by reducing occasional misses and assisting in complex cases. For negative interactions, the human readers change their decision only if the AI is incorrect. This models the situation where the human is confused by the AI reader's output and mistakenly changes a correct decision into an incorrect one.
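A sketch of the three interaction effects for a single episode; the function and its argument names are hypothetical, and it models one human decision made after seeing the AI output.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def interacted_decision(human_recall: bool, ai_recall: bool,
                        correct_recall: bool, p: float,
                        effect: str = "neutral") -> bool:
    """Final decision of a human reader who has seen the AI output.

    `correct_recall` is the ground-truth correct decision (recall if and
    only if the episode is a cancer). On disagreement the human adopts
    the AI decision with probability `p`; under the "positive" effect
    only when the AI is correct, under the "negative" effect only when
    the AI is incorrect, and under the "neutral" (automation) effect
    regardless of correctness.
    """
    if human_recall == ai_recall:
        return human_recall
    ai_is_correct = ai_recall == correct_recall
    if effect == "positive" and not ai_is_correct:
        return human_recall
    if effect == "negative" and ai_is_correct:
        return human_recall
    return ai_recall if rng.uniform() < p else human_recall
```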
Evaluation metrics

The AUC, based on the receiver operating characteristic (ROC) curve, is used to summarise the AI reader's standalone performance. Sensitivity and specificity are used to compare the AI reader with the radiologists, and the AI-integrated screening scenarios with the current screening pathway. Sensitivity, or the true positive rate (TPR), is computed by dividing the number of correctly identified cancers (true positives) by the total number of observed cancers (all positives, i.e. both screen-detected and interval cancers). It measures the success rate of the classifier in detecting cancer. This is a key performance metric because early detection of cancer leads to more effective treatment45, through timely intervention, less aggressive treatment, and improved survival rates. Specificity, or the true negative rate (TNR), is computed by dividing the number of correctly identified non-cancer episodes (true negatives) by the total number of observed non-cancer episodes (all negatives). It measures the success rate of the classifier in correctly not recalling a client when cancer is absent. This is a key performance metric because unnecessarily recalling clients to the assessment centre is costly and induces significant stress on clients and their families46.

To compare the sensitivity and specificity of the AI-integrated screening scenarios with the current screening pathway, McNemar's test (with continuity correction to improve the approximation of the binomial distribution by the chi-squared distribution) is used to test for differences47, and the one-sided binomial exact test is used to test for superiority. Both tests adhere to the correct design for McNemar's test, in which a 2-by-2 contingency table is constructed from the paired samples of the two comparison scenarios. The samples are paired by episode ID, and the tests are conducted once for each of sensitivity and specificity48,49 at a significance level of 5%. For the effect size, Cramér's V (also known as Φ) is used for McNemar's test; it is calculated as \(\sqrt{\frac{\chi^{2}}{N}}\), where \(\chi^{2}\) is the test statistic and N is the number of samples. For the binomial test, Cohen's h is used, calculated as \(\left\vert 2\arcsin \sqrt{p_{1}}-2\arcsin \sqrt{p_{2}}\right\vert\), where p1 is the observed proportion and p2 is the expected proportion under the null hypothesis.
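Both effect sizes follow directly from the definitions above; a minimal sketch:

```python
import math


def cramers_v(chi_squared: float, n: int) -> float:
    """Cramér's V (Φ) for McNemar's test: sqrt(chi² / N)."""
    return math.sqrt(chi_squared / n)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h between an observed proportion p1 and an expected p2."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))


# Example: an observed sensitivity of 0.85 against an expected 0.80.
print(round(cohens_h(0.85, 0.80), 3))
```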
Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.