Automated multi-scale computational pathotyping (AMSCP) of inflamed synovial tissue

Dataset descriptionAll mouse work was approved by the University Committee on Animal Resources at the University of Rochester Medical Center and the Institutional Animal Care and Use Committee at the Hospital for Special Surgery. Whole slide images (WSIs) of sagittal mouse knee sections were taken from two different mouse models of inflammatory arthritis and the accompanying controls for segmentation experiments (Supplemental Fig. 1A). Batch A consisted of male and female TNF Transgenic mice (TNF-Tg, n = 47) and wild-type littermates (WT, n = 15) used in previous publications19,20. Batch B consisted of previously unpublished male and female knees that received intra-articular injections of 180 ug of Zymosan to induce Zymosan Induced Arthritis (ZIA, n = 24) and control contralateral limbs (Control, n = 8)48 that were euthanized on Day 7 after injection. Different batches were used to test model generalizability across different biological mechanisms of arthritis development, differences in H&E staining protocols and slide scanners used to digitally capture slides at 40 x magnification (Batch A: VS120 Olympus, 0.173 μm per pixel; Batch B: CS2 Aperio Leica, 0.253 μm per pixel).To further test model generalizability, 2 different independent holdout datasets were used to validate the model: (1) the remaining H&E-stained sagittal knee WSIs from Bell et al.19,20 that were not annotated or used in model training and (2) Orange G-H&E stained sagittal knee WSIs from Kenney et al.21. Slides from Bell et al. were ensured to not have been used in the initial model training, internal validation, or testing. These included slides from 6 month-old male TNF-Tg (n = 33 slides from 14 knees) and WT littermates (n = 43 slides from 17 knees), and slides from 9.5 month old male TNF-Tg mice (anti-TNF: n = 24 slides from 8 knees; Placebo: n = 29 slides from 10 knees) and WT littermates (Placebo: n = 42 slides from 15 knees) either treated with a 6 week course of anti-TNF antibodies or placebo control (irrelevant IgG). To generate downstream measurements of tissues of interest, a region of interest (ROI) was drawn from the tibial growth plate to femoral growth plate including the anterior and posterior extra articular tissue.WSIs of H&E-stained human synovial biopsies were collected from the Accelerating Medicines Partnership Rheumatoid Arthritis (AMP-RA) Phase II consortium47. In short, synovial tissue biopsies were acquired from RA patients at 13 different clinical sites in the United States and 2 in the United Kingdom from October 2016 to February 2020. The study was performed in accordance with protocols approved by the institutional review board at each site. The tissue was paraffin embedded, stained with H&E and imaged on a VS120 Olympus. Three pathologists independently determined Krenn lining and inflammatory infiltrates scores (0–3 each) for each tissue sample27, and the mode of the three scores was used for further analysis. To classify the cases into H&E based pathotypes, the UK Birmingham group developed consensus semiquantitative four point scores for infiltrate density and aggregate radial size on a per fragment basis with a custom atlas using a test set of tissues from the Birmingham Early Arthritis Cohort49, scored by three pathologists. Aggregate grade was derived as follows: Grade 3; high ≥ 20 radial count. Grade 2: medium 10–19 radial count. Grade 1: low 6–9 radial cell count. Grade 0: No aggregates. This approach was validated by scoring tissues from the first AMP RA cohort46 and original data is presented from the second AMP RA cohort47. These semiquantitative scores were then used to classify the cases into three pathotypes, either lymphoid (n = 27), diffuse (n = 26) or pauci-immune (n = 5) according to the following rules: Lymphoid: The presence of ≥1 grade 1 aggregate in at least two fragments, or any grade 2 aggregate, or any grade 3 aggregate. Diffuse: Does not meet lymphoid criteria but with a mean fragment density score ≥ 1. Pauci-immune: Does not meet lymphoid criteria, mean fragment infiltrate density score < 1.Semantic segmentation annotation and preprocessingManual annotations were performed within QuPath50 to assign labels for WSIs. To test model performance across tissue types at different granularity, multiple different class structures were tested (Supplemental Fig. 2). Eleven different classes were manually annotated, including synovium, muscle and tendon, growth plate, bone marrow, cortical bone, trabecular bone, articular cartilage, meniscus, fat, bone marrow fat, and histology artifact (i.e., out of focus). A seven-class, nine-class, and ten-class segmentation task was generated by merging histologically similar tissues, such as merging cartilage and meniscus into the same class (Supplemental Fig. 2). Overall, we estimate about 250 hours were spent annotating the 11 tissue classes on 94 WSIs.Due to the gigapixel nature of WSIs, the entire slide cannot be fed directly into a deep learning model. Previous work has shown that WSIs can be broken into patches, in this case 512 pixel x 512 pixel, to perform downstream learning tasks31. For semantic segmentation, a custom QuPath script was used to export patches at a 4x downsample while filtering out regions of the scanned slide that lacked annotations or without tissue. Additionally, images were normalized to mean 0 and standard deviation 1 by sampling a subset of patches to get mean and standard deviation RGB statistics.Semantic segmentation models and training strategyInitially, a stratified random sampling method was used to randomize the 94 annotated WSIs into a Training, Validation, and Test split, using 70%, 15%, and 15% for each split, respectively (3 splits total, Supplemental Fig. 1B). We stratified by batch (e.g. staining and site differences) and disease type (e.g. healthy and disease) to ensure even allocation of data variation into each set. Randomization occurred at the slide level, as opposed to the patch level, to ensure no data leakage across splits. During our initial qualitative explorations of models and hyperparameters (detailed below), we only utilized Training and Validation sets; and calculated the Dice score on the validation set to measure performance. Dice score was calculated as$${Dice}=\frac{2 * {intersection}}{{union}+{intersetion}}\,$$These initial experiments included variations in SLIC feature extraction, model selection (UNET vs UNET++), efficient-net backbone size (B0, B2, B5), loss metrics (weight loss vs unweighted), and learning rate parameters as described below. Once these items were tuned, quantitative experiments (details below) were performed varying the tissue segmentation number (7, 9, 10, or 11), image augmentation (None, Low, Medium, or High), or patch overlap percentage (0%, 50%, 66%). These experiments were performed by training the models with the Training and Validation sets, freezing the models’ weights, and then inferring segmentation on the Test set. These inferences were then compared to the ground truth labels to calculate the mean Intersection over Union (mIOU) or the frequency weighted mIOU (fwIOU). The fwIOU is the class frequency weighted sum of the mIOU.In addition, to assess how site-specific differences in histology slides may impact model performance51, we performed a single batch training method while varying the image augmentation style. In this training strategy, Batch A (n = 62) was used for the Training (45%) and Validation (20%) sets, and Batch B (n = 32, 35%) was used as the held-out Test set. These experiments were performed to assess if image augmentation could overcome batch related staining differences that are seen in the real world.Data augmentation strategies were also tested to assess their impact across the different training strategies (Supplemental Fig. 3). We had three different levels of augmentation tested, (1) None, (2) Low, (3) Medium, and (4) High. The python package imgaug52 as used to implement augmentation. Augmentation was applied in the following way: (1) None had zero augmentation, (2) Low had 2–4 augmentation process applied 25% of the time, (3) Medium had 3–7 augmentations applied 50% of the time and (4) High augmentation had 5–11 augmentation process applied 50% of the time during training. Augmentation was randomly selected from 11 different types of augmentations including, horizontal flip (p = 0.5), coarse dropout or pixel dropout from 0.2x the original image resolution (p = 0.1), one of three different rotation types at 90°, 180°, and 270°, additive gaussian noise sampled from a normal distribution with mean 0 and variance 0.2*255, blur using gaussian kernel with sigma of 1.5, hue modification using addition (−30,10) and saturation modification using multiplication (0.5,1.5) and linear contrast (0.5,2), brightness adjustment both add(−30,30) and multiply (0.5,1.5), and color change adjustment (3000, 8000). Data augmentation was performed only during the training of the models, not during testing or inference.To compare segmentation performance using both conventional machine learning and deep learning methods, we tested two different model architectures. A Random Forest (RF) model implemented in QuPath with OpenCV53 and an U-Net++54 deep learning model implemented in PyTorch (1.8.0)55. We qualitatively assessed both pixel level segmentation and super-pixel level segmentation in QuPath and determined super-pixel segmentation performed better. To generate super-pixels, we applied a Simple Linear Iterative Clustering (SLIC) algorithm (σ = 5, spacing = 20 μm, Max Iterations = 1, Regularization = 0.01) in which each over segmented area was considered a super-pixel. We qualitatively assessed SLIC feature extraction variations and chose to extract RGB, estimated Hematoxylin, Eosin and residual stain means, standard deviations, min, max, and median values from each SLIC super pixel. We also calculated the Haralick value using a distance of 1 and bin of 32. We next calculated the average features of super-pixels within 40 and 80 μms. These features were then used in the RF, which was implemented fully in QuPath using their “Train Object Classifier” GUI with max depth = 20, min sample count = 10, Active variable count = 0, maximum trees = 50 and termination epsilon = 0. Four models were built, one for each of the 7 class, 9 class, 10 class and 11 class segmentation tasks. After the models were trained, they were used to infer on the Test set and performance metrics calculated.For our deep learning pipeline, we utilized the segmentation_models_pytorch56 package for the UNET++ implementation with an ImageNet pre-trained EfficientNet-B5 backbone for the encoder to improve training time and computational efficiency57. The decoder was left unchanged from the native UNET++ architecture except for the final layer which was changed to match the number of tissue types being segmented. We also initially explored using the UNET architecture and other efficient net backbones, however UNET++ with a B5 backbone was found to be most performant. We initially explored both class frequency weighted loss and un-weighted loss and the un-weighted loss demonstrated improved performance. We utilized a combo loss calculation, which was the arithmetic average of the dice loss and binary cross entropy (BCE) loss during model training58. We explored a few hyperparameter variations for the learning rate scheduler including step size = [1,2], gamma = [0.25, 0.5] and learning rate start = [0.01, 0.02, 0.025]. From initial explorations we found that a step size of 2, a gamma of 0.5 and a learning rate start of 0.025 resulted in the best performance. We used a stochastic gradient descent (SGD) optimizer with momentum (0.9) and regularization of 1 × 10−4 and the models were trained for 10 epochs at a batch size of 40.When jointly training across 10 different classes, this model provided no prediction for the meniscus class resulting in a mean Intersection Over Union (mIOU) of 0 ± 0 (Supplemental Fig. 6), likely due to insufficient training examples (Supplemental Fig. 2; 0.9% frequency overall). As the meniscus was important for downstream biological analyses, we developed a strategy to improve predictions for this class. A second UNET++ model was fine-tuned from the nine-class model (i.e. the model that has cartilage and meniscus merged into one class) by changing the prediction (final) layer specifically to predict between cartilage and meniscus. We then re-trained the full model only on images and Ground Truth annotations (GTs) that contained either cartilage or meniscus within the mixed training set.Previous work had suggested that UNET based semantic segmentation can have image boundary level artifacts17. Therefore, we assessed how including patch level overlap for prediction can improve model performance. We included experiments using no overlap, 50% overlap and 66% overlap between them with images from both batches (Supplemental Fig. 4). To analyze the results, we looped through all the predictions (N) for the entire WSI and calculated a majority vote for each pixel after thresholding to remove low confidence predictions (pixel value of 75). N can be variable depending upon the region, for example if it is on border, but typically is between 4 and 9.Semantic segmentation model evaluation, inference on external validation set and statistical evaluationTo evaluate the semantic segmentation model performance for the Validation and Test set slides with ground-truth labels, we calculated the mean Intersection over Union (mIOU) and frequency-weighted mIOU (fwIOU) to prevent very rare classes from drastically impacting overall model performance59. All model hyperparameter optimization was performed on the Validation set, and once the above parameters were chosen the models were used to infer segmentation on the Test set and the mIOUs were calculated.We used the optimal settings from the training/validation process to evaluate the model performance on the held-out datasets. Pearson’s correlation was used to compare the hand drawn synovial tissue area reported in Bell et al.19 with model classified synovial tissue area. The fine-tuned 10-class model was used to infer tissue segmentation on the held-out data21,60. Specifically, the UNET++ 9 class model was used to predict tissues classes on each patch, and if the nine-class model had a prediction output for the combined Cartilage-Meniscus class on a patch, then the patch was passed through the fine-tuned 2 class model to assign to either the meniscus or cartilage class. Predictions were merged by only allowing the fined tuned predictions to be within the predictions from the combined Cartilage-Meniscus class. Once the inference was complete, a ROI was drawn on each slide from femoral growth plate to tibial growth plate around the joint to restrict the downstream analysis to the joint space, subchondral bone, and synovial adjacent tissue. Tissue area was calculated for each slide and averaged for each knee then a One-Way ANOVA with Tukey’s post-hoc adjustment was used to detect significant differences.Cell type classification framework and preprocessingFor cell type classification, a combination of transfer learning and active learning was used to identify several different cell types that exist within the joint tissue. Cell type classification can be broken into a two-step process, (1) segmentation and (2) classification. For cell segmentation, transfer learning was used by leveraging a deep learning model, HoVer-Net24, pretrained on the PanNuke dataset61, to extract nuclei regions. Image patches (1024 × 1024 pixels) at 40 x magnification were given as inputs, and ROI contours of nuclei were obtained to perform feature extraction upon. These nuclei with their features and labels (detailed below) were then leveraged in a gradient boosted decision tree (GBDT, XGBoost (https://xgboost.readthedocs.io/en/stable/) implemented within the ScikitLearn package (https://scikit-learn.org/stable/)) model to classify cells.The input for the classifier was derived from features extracted from each ROI generated by HoVer-Net. Specifically, every nucleus from the json file output of HoVer-Net was converted into a ROI (.roi) file to be read into FIJI/ImageJ62 for feature extraction using a custom script. Detailed workflow for the ImageJ/FIJI analysis is described as follows. Each image was split using the built-in color deconvolution63 algorithm in FIJI into hematoxylin and eosin color channels. For each channel the nuclei were measured for several different parameters, including morphological quantities (area, perimeter, circularity, feret’s diameter, feret angle, aspect ratio, roundness, and solidity) and staining color quantities (mean, mode, min, max, standard deviation, skewness, median, and kurtosis). The nuclei ROI was then enlarged by 20 px, the original nuclei masked out, and the data from this surround 20 px was used to calculate cell specific cytoplasmic H&E color information for each cell. These features are called cell intrinsic features (Supplemental Table 1). Neighborhood characteristics of cells at several different distance ranges (150 px and 300 px) were used to include local cell and tissue level context into cell type classification (Supplemental Table 2)25. Simply, a neighborhood is determined as all the cells within a certain distance to the parent cell. These neighborhood characteristics included the average, standard deviation, skew, kurtosis, Z-score, interquartile range, standard error of the mean and entropy the cell intrinsic features of cells in the neighborhood. We also calculated shape characteristics of the neighborhood including the average distance of the neighboring cells, the linear correlation coefficient of the cells within the neighborhood, a string of up to 30 cells linear correlation coefficient, the strait line distance of a string of up to 30 cells, a scored measure of density of the cells and the number of cells within the distance measure. A final total of 854 features were extracted for the downstream analysis.Known healthy and pathologic cell types that contribute to inflammatory arthritis were then annotated on both mouse and human tissues (details below, Supplemental Figs. 8 and 10). The mouse proof-of-concept classification task consisted of bone-embedded cells, blood vessel cells, adipo-stromal cells (both adipose and stromal cells within fatty tissue), synovial fibroblasts (healthy and pathologic), chondrocytes, lymphocytes, and other synovial lining cells (healthy and pathologic) as detailed in Supplemental Fig. 8; annotated by subject-matter experts familiar with histologic analysis of these cell types. These cells were annotated on six healthy, eight mild disease and five severely diseased TNF-Tg knee sections. For human samples, a clinically meaningful set of cell types were labeled by a senior pathologist, following a standard cell type hierarchy (Supplemental Fig. 10). These included stromal/connective tissue cells, synovial lining cells, synovial fibroblasts, vascular endothelial cells, tissue macrophages/histocytes, lymphocytes, and plasma cells. These cells were labeled on five lymphoid, five diffuse and three pauci-immune cases. Nuclei were then mapped to manually annotated nuclei by checking if a nuclei’s centroid (as determined by Hover-Net) was within an annotation mask.Mouse cell type classification modelA total of 4,712 cells were annotated for mouse cell type classification from seven different classes (Supplemental Fig. 8). Cells were labeled from a total of 19 different slides. A GBDT model was trained for cell type classification (XGBoost (https://xgboost.readthedocs.io/en/stable/) implemented within the Scikit-learn package (https://scikit-learn.org/stable/)), using stratified nested 5-fold cross validation with grid search to select the best models. In order to minimize the influence of annotations from any one slide, we enforced an even sampling method to ensure approximately equal numbers of cells from each slide appeared in all folds (sklearn.model_selection.StratifiedKFold). To tune the parameters of the GBDT, we performed a grid search (sklearn.model_selection.GridSearchCV) on the inner CV for learning rate = [0.05, 0.1, 0.2], colsample_bytree = [0.6, 0.8, 1.0], subsample = [0.25, 0.5], max_depth  [6,12], n_estimators = [10, 100, 200, 400], gamma = [0, 0.1, 0.3], and min_child_weight = [1,5,10]. To evaluate model performance, F1 statistics were calculated as the average of the 5 external folds and then the best performing models was used to infer cell type in two different biological settings, (1) to identify cell composition changes across different disease severities on the remaining cells (>300,000) on the 19 slides, and (2) identify differences between male and female mice in the context of disease progression in a held-out dataset19. Finally, average synovial inflammatory infiltrate scores and average pannus invasion scores were correlated with lymphocyte and synovial lining cell counts respectively (Spearman’s Correlation).Feature ablation studiesTo demonstrate the performance improvements of the distance features, we performed a feature ablation study within mouse cohort in which no distance features were utilized, features at distance 150 px and all features (cell intrinsic features, 150 px features and 300 px features) in out modeling framework (detailed above). To evaluate model performance, F1 statistics were calculated as the average of the 5 external folds.Active learning implementationHuman annotation was the time-consuming step for the cell type classification pipeline. Therefore, we applied the active learning strategy to improve the annotation efficiency for cell annotation of human samples. To develop this strategy, we tested a proof-of-concept active learning strategy using labeled data from the mouse H&E slides (Supplemental Fig. 9). Active learning is an iterative process that consists of three main steps, (1) annotation, (2) model training, and (3) sample selection for further annotations. Its goal is to select the samples that can lead to the largest model performance improvement when adding to the training data after annotation. To validate the strategy, 100 different rounds of 5-fold cross-validation were performed. Average F1-scores were reported for each class and a macro-F1 score was additionally reported. 25 runs of 5-fold cross validation were removed due to cells from a single class not being present in both the training and testing sets. For the training dataset for each split, 5% of cells were first randomly selected as the first set of cells selected as being labeled annotated. Subsequently, the GBDT classifier was trained using this randomly selected data. Several different metrics for determining cells for annotation and subsequent model finetuning, including smallest margin uncertainty64, least confidence uncertainty65 and entropy-based uncertainty64 were assessed. The top 5% of cells were added to the training dataset and the cycle of model training and evaluation and new cell annotations continued until the entire training dataset was used. A random selection of cells after shuffling was also tested to compare model performance to the various active learning strategies. The package modal66 was leveraged in our implementation. Mean and 95% confidence intervals are reports for each subset across the 75 different runs of 5-fold cross validation.Cell classification model evaluationsConfusion matrices were generated for model prediction along with F1-scores calculated as \(2\cdot \frac{{precision}\cdot {recall}}{{precision}+{recall}}\), where \({precision}=\frac{{TP}}{{TP}+{FP}}\) and \({recall}=\frac{{TP}}{{TP}+{FN}}\), where TP, FP, FN stand for true positives, false positives, false negatives. Models were tested using known cell types within specific tissue types to evaluate the model qualitatively.Human synovial biopsy cell type modelingActive learning was then leveraged for human cell type classification using H&E-stained slides of human synovial biopsy tissue of RA patients from the AMP consortium as described above. A subset of slides was selected to be annotated (Lymphoid, n = 5; Diffuse, n = 5; Pauci-Immune, n = 3) that represent the diversity of specimens within this cohort. Multiple rounds of cell type labeling were performed with the assistance of active learning, to obtain a total of 2,639 cells grouped in seven different cell types, detailed in Supplemental Fig. 10 (stromal/connective tissue cells n = 597, synovial lining cells n = 309, synovial fibroblasts n = 189, vascular endothelial cells n = 486, Tissue Macrophages/Histocytes n = 201, lymphocytes n = 826, and plasma cells n = 310). A cell type classification model using GBDT was trained using a stratified nested 5-fold cross validation with grid search strategy (as described above) to select the best models. F1 statistics were calculated as the average of the 5 external folds. The best performing model was used to infer cell types all cells on the slides within this patient cohort (n = 58 subjects; 2,976,535 total cells). Summary cell type quantification (total cell counts and percent of total) was then assessed for each patient. Two analyses were performed using the derived cell types from the cell classification model. First, cell type counts and proportions were correlated with either immunofluorescent stained adjacent sections (described below) or with a pathologist-derived, and clinically relevant Krenn inflammation scores. As these data were non-normal, we utilized a Spearman’s correlation. Second, we assessed the frequency of cell types across pathotypes. Specifically, statistical significance testing using lymphocyte, plasma cell, and fibroblast slide proportions were evaluated across pathotypes. Additionally, we performed a receiver-operator curve analysis of plasma cell frequency of total to predict if a biopsy was a lymphoid or diffuse case (n = 53).Immunofluorescence (IF) and histomorphometryAdjacent sections from 15 of the RA synovial biopsies were stained in batches with either CD3 (T-Cells), CD20 (B-Cells) and CD138 (Plasma Cells) or CLIC5 (Synovial Lining), CD3 (T-Cells), CD68 (Macrophages), and CD34 (Vascular Endothelial Cells) antibodies; and counter stained with DAPI. In depth staining procedures are described in the original work47. All IF images were imported into QuPath to perform histomorphometry. All biopsies were evaluated for tissue morphology similarity to the adjacent H&E to ensure as little physical distance between the sections as possible. To count IF+ cell, DAPI+ cells were first segmented with a watershed algorithm in QuPath (cell detection) and then mean CD138, CD3, and CD20 IF intensity for each cell was calculated. Staining batch specific thresholds for each channel were used to count positive cells.Visualization of dataUniform Manifold Approximation and Projection67 visualization was used for feature representations between batches and cell type framework features. Both tissue segmentation masks and cell type masks for each class were reimported into QuPath50 for visualization purposes.Statistical approach and implementationAll graphing and hypothesis testing statistics were performed in Prism (10.0, Graph Pad, Boston, MA). For all continuous variables, a Shapiro-Wilks Normality test was performed to assess normality. If the test determined the specific distribution to be non-normal, the equivalent non-parametric test was utilized to test for significance or correlation. If an ordinal variable was being associated with a continuous variable a non-parametric Spearman’s correlation was chosen. Otherwise, One-Way, Two-Way and Two-Way Repeated Measures ANOVAs with Tukey’s Post-Hoc tests were used to test for significant main effects, interaction effects and post-hoc pairwise differences. All pairwise tests are two-tailed. Specific test information including sample size for each figure is provided in the “Supplemental Statistical Information Pertaining to Data Presented in Figures” document.Software and hardwareQupath (0.3.2 or later) was used to visualize WSIs and annotate tissues or cells as well as perform some image processing (detailed above). All other machine learning or deep learning techniques were performed in Python (3.8.1) as described above. Primary machine learning libraries include PyTorch (1.8.0), segmentation_models_pytorch (0.1.3)56 Sklearn (scikit-learn, 1.3.2), and xgboost (2.0.2). Deep learning segmentation training was performed on four Nvidia V100’s GPUs with 16 gb of RAM in parallel with a CUDA implementation (11.6.2). Segmentation inference was performed on either a Nvidia 3070 or 3090. Cell type classification was performed on an Nvidia 3070.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles