AI-based automation of enrollment criteria and endpoint assessment in clinical trials in liver diseases

Compliance
AI-based computational pathology models and platforms to support model functionality were developed using Good Clinical Practice/Good Clinical Laboratory Practice principles, including controlled process and testing documentation.

Ethics
This study was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. Anonymized liver tissue samples and digitized WSIs of H&E- and trichrome-stained liver biopsies were obtained from adult patients with MASH who had participated in any of the following completed randomized controlled trials of MASH therapeutics: NCT03053050 (ref. 15), NCT03053063 (ref. 15), NCT01672866 (ref. 16), NCT01672879 (ref. 17), NCT02466516 (ref. 18), NCT03551522 (ref. 21), NCT00117676 (ref. 19), NCT00116805 (ref. 19), NCT01672853 (ref. 20), NCT02784444 (ref. 24), NCT03449446 (ref. 25). Approval by central institutional review boards was previously described15,16,17,18,19,20,21,24,25. All patients had provided informed consent for future research and tissue histology as previously described15,16,17,18,19,20,21,24,25.

Data collection
Datasets
ML model development and external, held-out test sets are summarized in Supplementary Table 1. ML models for segmenting and grading/staging MASH histologic features were trained using 8,747 H&E and 7,660 MT WSIs from six completed phase 2b and phase 3 MASH clinical trials, covering a range of drug classes, trial enrollment criteria and patient statuses (screen fail versus enrolled) (Supplementary Table 1)15,16,17,18,19,20,21. Samples were collected and processed according to the protocols of their respective trials and were scanned on Leica Aperio AT2 or ScanScope V1 scanners at ×20 or ×40 magnification. H&E and MT liver biopsy WSIs from patients with primary sclerosing cholangitis and chronic hepatitis B infection were also included in model training. The latter dataset enabled the models to learn to distinguish histologic features that can appear visually similar but are less frequently present in MASH (for example, interface hepatitis)42, in addition to covering a wider range of disease severity than is typically enrolled in MASH clinical trials.

Model performance repeatability assessments and accuracy verification were conducted in an external, held-out validation dataset (analytic performance test set) comprising WSIs of baseline and end-of-treatment (EOT) biopsies from a completed phase 2b MASH clinical trial (Supplementary Table 1)24,25. The clinical trial methodology and results have been described previously24. Digitized WSIs were reviewed for CRN grading and staging by the clinical trial’s three CPs, who have extensive experience evaluating MASH histology in pivotal phase 2 clinical trials and in the MASH CRN and European MASH pathology communities6. Images for which CP scores were not available were excluded from the model performance accuracy analysis. Median scores of the three pathologists were computed for all WSIs and used as a reference for AI model performance.
Importantly, this dataset was not used for model development and thus served as a robust external validation dataset against which model performance could be fairly tested.

The clinical utility of model-derived features was assessed by generating ordinal and continuous ML features in WSIs from four completed MASH clinical trials: 1,882 baseline and EOT WSIs from 395 patients enrolled in the ATLAS phase 2b clinical trial25; 1,519 baseline WSIs from patients enrolled in the STELLAR-3 (n = 725 patients) and STELLAR-4 (n = 794 patients) clinical trials15; and 640 H&E and 634 trichrome WSIs (combined baseline and EOT) from the EMINENCE trial24. Dataset characteristics for these trials have been published previously15,24,25.

Pathologists
Board-certified pathologists with experience in evaluating MASH histology assisted in the development of the present MASH AI algorithms by providing (1) hand-drawn annotations of key histologic features for training image segmentation models (see the section ‘Annotations’ and Supplementary Table 5); (2) slide-level MASH CRN steatosis grades, ballooning grades, lobular inflammation grades and fibrosis stages for training the AI scoring models (see the section ‘Model development’); or (3) both. Pathologists who provided slide-level MASH CRN grades/stages for model development were required to pass a proficiency examination, in which they were asked to provide MASH CRN grades/stages for 20 MASH cases; their scores were compared with a consensus median provided by three MASH CRN pathologists. Agreement statistics were reviewed by a PathAI pathologist with expertise in MASH and used to select pathologists to assist in model development. In total, 59 pathologists provided feature annotations for model training; five pathologists provided slide-level MASH CRN grades/stages (see the section ‘Annotations’).

Annotations
Tissue feature annotations
Pathologists provided pixel-level annotations on WSIs using a proprietary digital WSI viewer interface. Pathologists were specifically instructed to draw, or ‘annotate’, over the H&E and MT WSIs to collect many examples of substances relevant to MASH, in addition to examples of artifact and background. Instructions provided to pathologists for select histologic substances are included in Supplementary Table 4 (refs. 33,34,35,36). In total, 103,579 feature annotations were collected to train the ML models to detect and quantify features relevant to image/tissue artifact, foreground versus background separation and MASH histology.

Slide-level MASH CRN grading and staging
All pathologists who provided slide-level MASH CRN grades/stages were asked to evaluate histologic features according to the MAS and CRN fibrosis staging rubrics developed by Kleiner et al.9. All cases were reviewed and scored using the aforementioned WSI viewer.
Model development
Dataset splitting
The model development dataset described above was split into training (~70%), validation (~15%) and held-out test (~15%) sets. The dataset was split at the patient level, with all WSIs from the same patient allocated to the same development set. Sets were also balanced for key MASH disease severity metrics, such as MASH CRN steatosis grade, ballooning grade, lobular inflammation grade and fibrosis stage, to the greatest extent possible. The balancing step was occasionally challenging because MASH clinical trial enrollment criteria restrict the patient population to specific ranges of the disease severity spectrum. The held-out test set also contained data from an independent clinical trial, ensuring that algorithm performance met acceptance criteria in a completely held-out patient cohort and avoiding any test data leakage43.

CNNs
The present AI MASH algorithms were trained using the three categories of tissue compartment segmentation models described below. Summaries of each model and their respective objectives are included in Supplementary Table 6, and detailed descriptions of each model’s purpose, input and output, as well as training parameters, can be found in Supplementary Tables 7–9. For all CNNs, cloud-computing infrastructure allowed massively parallel patch-wise inference to be performed efficiently and exhaustively on every tissue-containing region of a WSI, with a spatial precision of 4–8 pixels.
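As an illustration of the patch-wise inference scheme described above, the following is a minimal sketch of sliding-window segmentation over a WSI using NumPy. The tile size, stride and model interface are hypothetical, and the sketch does not reproduce the production system's cloud parallelism or its 4–8 pixel output precision.

```python
import numpy as np

def segment_wsi(wsi: np.ndarray, predict_patch, tile: int = 512, stride: int = 448) -> np.ndarray:
    """Sliding-window (patch-wise) segmentation of a whole-slide image.

    wsi: RGB array of shape (H, W, 3).
    predict_patch: callable mapping an RGB patch to an integer label mask
        of the same height and width.
    Overlapping tiles (stride < tile) reduce seams at tile borders; later
    tiles simply overwrite earlier predictions in the overlap region.
    """
    h, w, _ = wsi.shape
    overlay = np.zeros((h, w), dtype=np.int32)
    ys = sorted(set(list(range(0, max(h - tile, 1), stride)) + [max(h - tile, 0)]))
    xs = sorted(set(list(range(0, max(w - tile, 1), stride)) + [max(w - tile, 0)]))
    for y in ys:
        for x in xs:
            patch = wsi[y:y + tile, x:x + tile]
            overlay[y:y + tile, x:x + tile] = predict_patch(patch)
    return overlay

# Toy usage with a stand-in "model" that thresholds the red channel:
dummy_wsi = np.random.randint(0, 256, size=(2048, 2048, 3), dtype=np.uint8)
mask = segment_wsi(dummy_wsi, lambda p: (p[..., 0] > 128).astype(np.int32))
print(mask.shape, np.unique(mask))
```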
Artifact segmentation model
A CNN was trained to differentiate (1) evaluable liver tissue from WSI background and (2) evaluable tissue from artifacts introduced via tissue preparation (for example, tissue folds) or slide scanning (for example, out-of-focus regions). A single CNN for artifact/background detection and segmentation was developed for both H&E and MT stains (Fig. 1).

H&E segmentation model
For H&E WSIs, a CNN was trained to segment both the cardinal MASH H&E histologic features (macrovesicular steatosis, hepatocellular ballooning, lobular inflammation) and other relevant features, including portal inflammation, microvesicular steatosis, interface hepatitis and normal hepatocytes (that is, hepatocytes not exhibiting steatosis or ballooning; Fig. 1).

MT segmentation models
For MT WSIs, CNNs were trained to segment large intrahepatic septal and subcapsular regions (comprising nonpathologic fibrosis), pathologic fibrosis, bile ducts and blood vessels (Fig. 1).

All three segmentation models were trained using an iterative model development process, schematized in Extended Data Fig. 2. First, the training set of WSIs was shared with a select team of pathologists with expertise in assessment of MASH histology, who were instructed to annotate over the H&E and MT WSIs as described above. This first set of annotations is referred to as ‘primary annotations’. Once collected, primary annotations were reviewed by internal pathologists, who removed annotations from pathologists who had misunderstood instructions or otherwise provided inappropriate annotations. The final subset of primary annotations was used to train the first iteration of all three segmentation models described above, and segmentation overlays (Fig. 2) were generated. Internal pathologists then reviewed the model-derived segmentation overlays, identifying areas of model failure and requesting correction annotations for substances for which the model was performing poorly. At this stage, the trained CNN models were also deployed on the validation set of images to quantitatively evaluate each model’s performance on collected annotations. After identifying areas for performance improvement, correction annotations were collected from expert pathologists to provide further improved examples of MASH histologic features to the model. Model training was monitored, and hyperparameters were adjusted based on the model’s performance on pathologist annotations from the held-out validation set, until convergence was achieved and pathologists confirmed qualitatively that model performance was strong.
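The quantitative metric used when evaluating the models against held-out validation annotations is not specified above; one common choice is per-class Dice overlap between predicted and annotated masks. The sketch below illustrates that assumed metric and is not the study's stated evaluation procedure.

```python
import numpy as np

def per_class_dice(pred: np.ndarray, truth: np.ndarray, num_classes: int) -> dict:
    """Dice overlap between predicted and annotated label masks, per class.

    pred, truth: integer label masks of identical shape.
    Classes absent from both masks are reported as NaN rather than a
    spuriously perfect score.
    """
    scores = {}
    for c in range(num_classes):
        p, t = (pred == c), (truth == c)
        denom = p.sum() + t.sum()
        scores[c] = 2.0 * np.logical_and(p, t).sum() / denom if denom else float("nan")
    return scores

# Example on toy 2x3 masks with three classes:
pred = np.array([[0, 1, 1], [2, 2, 0]])
truth = np.array([[0, 1, 2], [2, 2, 0]])
print(per_class_dice(pred, truth, num_classes=3))  # class 2 -> 0.8
```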
The artifact, H&E tissue and MT tissue CNNs were trained on pathologist annotations. Each CNN comprised 8–12 blocks of compound layers, with a topology inspired by residual and inception networks, and was trained with a softmax loss44,45,46. A pipeline of image augmentations was used during training for all CNN segmentation models, and learning was further augmented using distributionally robust optimization47,48 to achieve model generalization across multiple clinical and research contexts and augmentations. For each training patch, augmentations were uniformly sampled from the following options and applied to the input patch to form training examples: random crops (within padding of 5 pixels), random rotation (≤360°), color perturbations (hue, saturation and brightness) and random noise addition (Gaussian, binary-uniform). Input- and feature-level mix-up49,50 was also employed as a regularization technique to further increase model robustness. After augmentation, images were zero-mean normalized: the input RGB image with range [0–255] was converted to BGR with range [−128–127] by a fixed reordering of the channels and subtraction of a constant (128), requiring no parameters to be estimated. This normalization was applied identically to training and test images.
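A minimal sketch of this augmentation and normalization pipeline, using NumPy and Pillow, is shown below. The specific parameter ranges, noise scale and omission of hue jitter and mix-up are illustrative assumptions rather than the production configuration.

```python
import numpy as np
from PIL import Image, ImageEnhance

rng = np.random.default_rng(0)

def augment(patch: Image.Image) -> Image.Image:
    """Randomly crop, rotate and color-perturb an RGB training patch."""
    dx, dy = (int(v) for v in rng.integers(0, 6, size=2))
    w, h = patch.size
    patch = patch.crop((dx, dy, w - (5 - dx), h - (5 - dy)))      # crop within 5-pixel padding
    patch = patch.rotate(float(rng.uniform(0, 360)))               # random rotation up to 360 degrees
    patch = ImageEnhance.Brightness(patch).enhance(float(rng.uniform(0.8, 1.2)))
    patch = ImageEnhance.Color(patch).enhance(float(rng.uniform(0.8, 1.2)))
    return patch

def normalize(patch: Image.Image) -> np.ndarray:
    """Add noise, then zero-mean normalize: RGB in [0, 255] -> BGR in [-128, 127]."""
    arr = np.asarray(patch, dtype=np.float32)
    arr = arr + rng.normal(0.0, 2.0, size=arr.shape)   # additive Gaussian noise
    return arr[..., ::-1] - 128.0                       # fixed channel reorder and constant shift

x = Image.fromarray(np.uint8(rng.integers(0, 256, size=(256, 256, 3))))
example = normalize(augment(x))
print(example.shape, example.min(), example.max())
```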
GNNs
CNN model predictions were used in combination with MASH CRN scores from eight pathologists to train GNNs to predict ordinal MASH CRN grades for steatosis, lobular inflammation, ballooning and fibrosis. GNN methodology was leveraged for the present development effort because it is well suited to data types that can be modeled by a graph structure, such as human tissues that are organized into structural topologies, including fibrosis architecture51. Here, the CNN predictions (WSI overlays) of relevant histologic features were clustered into ‘superpixels’ to construct the nodes in the graph, reducing hundreds of thousands of pixel-level predictions into thousands of superpixel clusters. WSI regions predicted as background or artifact were excluded during clustering. Directed edges were placed between each node and its five nearest neighboring nodes (via the k-nearest neighbor algorithm). Each graph node was represented by three classes of features derived from the CNN predictions, which were predefined as biological classes of known clinical relevance. Spatial features included the mean and standard deviation of (x, y) coordinates. Topological features included the area, perimeter and convexity of the cluster. Logit-related features included the mean and standard deviation of logits for each of the classes of CNN-generated overlays. Scores from multiple pathologists were used independently during training without taking consensus, and consensus (n = 3) scores were used for evaluating model performance on validation data. Leveraging scores from multiple pathologists reduced the potential impact of scoring variability and bias associated with a single reader.

To further account for systematic bias, whereby some pathologists may consistently overestimate patient disease severity while others underestimate it, we specified the GNN model as a ‘mixed effects’ model. Each pathologist’s scoring policy was specified in this model by a set of bias parameters learned during training and discarded at test time. Briefly, to learn these biases, we trained the model on all unique label–graph pairs, where the label was represented by a score and a variable indicating which pathologist in the training set generated the score. The model then selected the specified pathologist bias parameter and added it to the unbiased estimate of the patient’s disease state. During training, these biases were updated via backpropagation only on WSIs scored by the corresponding pathologists. When the GNNs were deployed, labels were produced using only the unbiased estimate.

In contrast to our previous work, in which models were trained on scores from a single pathologist5, GNNs in this study were trained on a subset of the data used for image segmentation model training, using MASH CRN scores from eight pathologists with experience in evaluating MASH histology (Supplementary Table 1). The GNN nodes and edges were built from CNN predictions of relevant histologic features in the first model training stage. This tiered approach improved upon our previous work, in which separate models were trained for slide-level scoring and histologic feature quantification; here, ordinal scores were constructed directly from the CNN-labeled WSIs.
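A minimal sketch of this superpixel-graph construction is given below using NumPy and scikit-learn. The clustering of CNN overlays into superpixels is abstracted away, and the crude topological proxies (pixel count as area, bounding-box perimeter, no convexity) are assumptions for illustration rather than the study's exact feature definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_graph(clusters):
    """Build GNN node features and directed kNN edges from superpixel clusters.

    clusters: list of dicts with keys
      'coords' -> (n_pixels, 2) array of (x, y) positions
      'logits' -> (n_pixels, n_classes) array of CNN class logits
    Returns (node_features, edges), with edges linking each node to its
    five nearest neighboring nodes.
    """
    feats, centroids = [], []
    for c in clusters:
        xy, lg = c["coords"], c["logits"]
        spatial = np.concatenate([xy.mean(0), xy.std(0)])                 # mean/std of (x, y)
        topo = np.array([len(xy), 2 * (np.ptp(xy[:, 0]) + np.ptp(xy[:, 1]))])
        logit = np.concatenate([lg.mean(0), lg.std(0)])                   # per-class logit stats
        feats.append(np.concatenate([spatial, topo, logit]))
        centroids.append(xy.mean(0))
    feats, centroids = np.stack(feats), np.stack(centroids)

    # Directed edges to the five nearest neighboring nodes (excluding self).
    nn = NearestNeighbors(n_neighbors=6).fit(centroids)
    _, idx = nn.kneighbors(centroids)
    edges = [(i, j) for i, row in enumerate(idx) for j in row[1:]]
    return feats, edges

# Toy usage with random clusters:
rng = np.random.default_rng(0)
clusters = [{"coords": rng.uniform(0, 1000, (50, 2)),
             "logits": rng.normal(size=(50, 4))} for _ in range(20)]
x, e = build_graph(clusters)
print(x.shape, len(e))  # (20, 14), 100 edges
```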
GNN-derived continuous score generation
Continuous MAS and CRN fibrosis scores were produced by mapping GNN-derived ordinal grades/stages to bins, such that ordinal scores were spread over a continuous range spanning a unit distance of 1 (Extended Data Fig. 2). Activation layer output logits were extracted from the GNN ordinal scoring model pipeline and averaged. The GNN learned inter-bin cutoffs during training, and piecewise linear mapping was performed within each ordinal bin from the logits to binned continuous scores, using the logit-valued cutoffs to separate bins. For each histologic feature, bins at either end of the disease severity continuum have long-tailed distributions that are not penalized during training. To ensure balanced linear mapping of these outer bins, logit values in the first and last bins were restricted to minimum and maximum values, respectively, during a post-processing step. These values were defined by outer-edge cutoffs chosen to maximize the uniformity of logit value distributions across the training data. GNN continuous feature training and ordinal mapping were performed separately for CRN fibrosis and for each MAS component.
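A minimal sketch of this logit-to-continuous-score mapping is given below; the cutoff values, clamping bounds, anchor placement and use of np.interp are illustrative assumptions rather than the exact production implementation.

```python
import numpy as np

def continuous_score(avg_logit: float, cutoffs: np.ndarray, lo: float, hi: float) -> float:
    """Map an averaged GNN logit to a continuous grade/stage.

    cutoffs: learned logit values separating ordinal bins (for example, 4
        cutoffs for fibrosis stages F0-F4); within each bin the score is
        obtained by piecewise linear interpolation between neighboring cutoffs.
    lo, hi: outer-edge logit bounds used to clamp the long-tailed first and
        last bins so that they also map linearly onto the score range.
    """
    x = np.clip(avg_logit, lo, hi)
    # Anchor points: outer bounds plus inter-bin cutoffs on the logit axis,
    # mapped onto ordinal positions so interior bins each span a unit distance.
    xs = np.concatenate([[lo], cutoffs, [hi]])
    ys = np.concatenate([[0.0], np.arange(len(cutoffs)) + 0.5, [float(len(cutoffs))]])
    return float(np.interp(x, xs, ys))

# Toy usage with hypothetical fibrosis-stage cutoffs:
cutoffs = np.array([-2.0, -0.5, 1.0, 2.5])
for logit in (-4.0, -1.0, 3.5):
    print(round(continuous_score(logit, cutoffs, lo=-3.5, hi=4.0), 2))
```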
Quality control measures
Several quality control measures were implemented to ensure that the models learned from high-quality data: (1) PathAI liver pathologists evaluated all annotators for annotation/scoring performance at project initiation; (2) PathAI pathologists performed quality control review of all annotations collected throughout model training; following review, annotations deemed to be of high quality by PathAI pathologists were used for model training, while all other annotations were excluded from model development; (3) PathAI pathologists performed slide-level review of the model’s performance after every iteration of model training, providing specific qualitative feedback on areas of strength/weakness after each iteration; (4) model performance was characterized at the patch and slide levels in an internal (held-out) test set; and (5) model performance was compared against pathologist consensus scoring in an entirely held-out test set, which contained images that were out of distribution relative to the images from which the model had learned during development.

Statistical analysis
Model performance repeatability
Repeatability of AI-based scoring (intra-method variability) was assessed by deploying the present AI algorithms on the same held-out analytic performance test set ten times and computing percentage positive agreement across the ten reads by the model.

Model performance accuracy
To verify model performance accuracy, model-derived predictions for ordinal MASH CRN steatosis grade, ballooning grade, lobular inflammation grade and fibrosis stage were compared with median consensus grades/stages provided by a panel of three expert pathologists who had evaluated MASH biopsies in a recently completed phase 2b MASH clinical trial (Supplementary Table 1). Importantly, images from this clinical trial were not included in model training and served as an external, held-out test set for model performance evaluation. Alignment between model predictions and pathologist consensus was measured via agreement rates, reflecting the proportion of positive agreements between the model and consensus. We also evaluated the performance of each expert reader against a consensus to provide a benchmark for algorithm performance. For this MLOO analysis, the model was considered a fourth ‘reader’, and a consensus, determined from the model-derived score and that of two pathologists, was used to evaluate the performance of the third pathologist left out of the consensus. The average individual pathologist versus consensus agreement rate was computed per histologic feature as a reference for model versus consensus per feature. Confidence intervals were computed using bootstrapping. Concordance was assessed for scoring of steatosis, lobular inflammation, hepatocellular ballooning and fibrosis using the MASH CRN system.

AI-based assessment of clinical trial enrollment criteria and endpoints
The analytic performance test set (Supplementary Table 1) was leveraged to assess the AI’s ability to recapitulate MASH clinical trial enrollment criteria and efficacy endpoints. Baseline and EOT biopsies across treatment arms were grouped, and efficacy endpoints were computed using each study patient’s paired baseline and EOT biopsies. For all endpoints, the statistical method used to compare treatment with placebo was a Cochran–Mantel–Haenszel test, and P values were based on response stratified by diabetes status and cirrhosis at baseline (by manual assessment). Concordance was assessed with κ statistics, and accuracy was evaluated by computing F1 scores. A consensus determination (n = 3 expert pathologists) of enrollment criteria and efficacy served as a reference for evaluating AI concordance and accuracy. To evaluate the concordance and accuracy of each of the three pathologists, the AI was treated as an independent, fourth ‘reader’, and consensus determinations were composed of the AI and two pathologists for evaluating the third pathologist not included in the consensus. This MLOO approach was followed to evaluate the performance of each pathologist against a consensus determination.

Continuous score interpretability
To demonstrate the interpretability of the continuous scoring system, we first generated MASH CRN continuous scores in WSIs from a completed phase 2b MASH clinical trial (Supplementary Table 1, analytic performance test set). The continuous scores across all four histologic features were then compared with the mean pathologist scores from the three study central readers, using Kendall rank correlation. The goal in measuring the mean pathologist score was to capture the directional bias of this panel per feature and verify whether the AI-derived continuous score reflected the same directional bias.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
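For reference, the sketch below illustrates the agreement and correlation statistics named in the analyses above (κ, F1, Kendall rank correlation, bootstrapped confidence intervals) using scikit-learn and SciPy on toy data; it is illustrative only and does not reproduce the study's stratified endpoint analysis or its exact agreement definitions.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score, f1_score

rng = np.random.default_rng(0)

# Toy ordinal fibrosis stages (0-4): AI reads versus a consensus reference.
ai = rng.integers(0, 5, size=200)
consensus = np.clip(ai + rng.integers(-1, 2, size=200), 0, 4)

# Concordance (kappa) and accuracy (macro F1) for ordinal categories.
print("kappa:", round(cohen_kappa_score(ai, consensus), 3))
print("F1   :", round(f1_score(ai, consensus, average="macro"), 3))

# Kendall rank correlation between continuous AI scores and mean reader scores.
ai_cont = ai + rng.uniform(0, 1, size=200)
mean_reader = consensus + rng.uniform(0, 1, size=200)
tau, p = kendalltau(ai_cont, mean_reader)
print("tau  :", round(tau, 3))

# Percentile bootstrap 95% CI for the exact-match agreement rate.
boots = []
for _ in range(2000):
    idx = rng.integers(0, 200, size=200)
    boots.append(np.mean(ai[idx] == consensus[idx]))
print("agreement 95% CI:", np.percentile(boots, [2.5, 97.5]).round(3))
```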
