Community assessment of methods to deconvolve cellular composition from bulk gene expression

Ethical regulationsFresh human whole blood was obtained from the Stanford Blood Center, in accordance with procedures approved by the Stanford Institutional Review Board (IRB, protocol #13942).Cell isolationCells were obtained from four sources, StemExpress (Folsom, CA), AllCells (Alameda, CA), ATCC (Manassas, VA), and the Human Immune Monitoring Centering (HIMC) at Stanford University. Immune and stromal cells provided by StemExpress and AllCells were isolated according to vendor protocols (Supplementary Data 1). Cells provided by StemExpress had the following vendor IDs: CD8 T cells (PB08010C), Monocytes (PB14010C), Memory B cells (PB192701C), Naive B cells (PB1927N01C), Tregs (PB425002C), Naive CD4 T cells (PB445RA005C), Memory CD4 T cells (PB445RO05C), NK cells (PB56005C), Naive CD8 T cells (PB845RA005C), Eosinophils (PBEOS01C), and Macrophages (PBMAC001.5C).Neutrophils and CD8+ memory T cells were isolated by HIMC as follows. Fresh human whole blood was obtained from the Stanford Blood Center, in accordance with procedures approved by the Stanford IRB. Human whole blood samples were collected under informed consent in EDTA-coated tubes. After 2 h resting, whole blood samples were split for neutrophils and CD8 memory T cell isolation.Neutrophil isolation was performed with MACSxpress® Whole Blood Neutrophil Isolation Kit (Miltenyi Biotec; #130-104-434) according to manufacturer instructions. Briefly, the whole blood samples were mixed with the appropriate amount of isolation mix buffer consisting of magnetically coated beads conjugated to antibodies targeting all the immune populations in the peripheral blood except for the neutrophils. The cell suspension containing the isolation mix was incubated for 5 min at room temperature on a low-speed rotator. Then magnetic separation was performed for 15 min prior to collecting the untouched neutrophils in a clean tube.For CD8+ memory T cell isolation, Peripheral Blood Mononuclear Cells (PBMCs) were first isolated by density gradient centrifugation using Ficoll-Paque™ Plus (Cytiva; #17144003). After washes, cell counts were obtained using a Vi-Cell XR cell viability analyzer (Beckman Coulter). Actual isolation was performed using a CD8+ Memory T Cell Isolation Kit (Miltenyi Biotec; #130-094-412) per the manufacturer’s instructions. Briefly, PBMCs were incubated at 4 °C for 10 min with a cocktail of biotin-conjugated monoclonal antibodies against CD4, CD11c, CD14, CD15, CD16, CD19, CD34, CD36, CD45RA, CD56, CD57, CD61, CD123, CD141, TCRgd, and CD235a. After washing, cells were resuspended in a solution of anti-biotin magnetic microbeads and incubated for 15 min at 4 °C. After another wash, magnetic separation was performed using LS columns (Miltenyi Biotec; #130-042-401), and we collected the cell fraction corresponding to CD45RO+CD45RA-CD56-CD57‑CD8+ T cells.Finally, isolated neutrophils and CD8+ memory T cells were resuspended in RNAprotect Cell Reagent (Qiagen; #76526) for RNA extraction.Cell linesCell lines were obtained from ATCC for colon cancer cells (HT-29 colon adenocarcinoma; HTB-38), breast cancer cells (BT-483 breast ductal carcinoma; HTB-121), fibroblasts (normal dermal fibroblasts; PCS-201-012), and endothelial cells (primary aortic endothelial cells; PCS-100-11). We did not perform cell line authentication nor did we test for mycoplasma contamination. None of the cell lines are on the list of known misidentified cell lines maintained by the International Cell Line Authentication Committee (https://iclac.org/databases/cross-contaminations/).Library preparation, RNA sequencing, and data processingLibraries were prepared using the Clontech SMARTer Stranded Total RNA-Seq v2 kit (Takara Bio) according to manufacturer instructions. Paired-end RNA sequencing of all in vitro admixtures and purified samples was performed by MedGenome Inc, by pooling the indexed libraries across four lanes of an Illumina NovaSeq S4 flowcell.Estimated transcript read counts and transcripts per million (tpm) were generated via pseudo-alignment with Kallisto v0.46.0 to hg38 using Homo_sapiens.GRCh38.cdna.all.idx (ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz). A translation table of Ensembl transcript ID to Ensembl gene ID and gene symbol was derived using biomaRt and stored on the Synapse platform at syn21574276. Estimated gene read counts and tpm were calculated as the sum of transcript counts and tpm, respectively, associated with the gene via the translation table.Training data curationParticipants were provided with a curated list of purified expression profiles in GEO74. GEO annotations were queried using regular expressions corresponding to cell populations of interest (e.g., with patterns “T.reg”, “regulatory”, and “FOXP3” for regulatory T-cells). Specifically, GEO annotations for fields “source_name” or involving “characteristic” (e.g., “characteristics_ch1”) were accessed via GEOmetadb in R75. Cell type patterns are available at: https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge/blob/master/scripts/training-data-curation/phenotypes-to-query.tsv. Matches identified via grepl were manually curated, resulting in tables associating cell populations with GEO samples. These were further summarized according to dataset, listing the cell populations and the number of cell populations represented by each dataset. This was intended to help participants prioritize datasets representing many or multiple cell types of interest. Per-sample and per-dataset tables were created separately for microarray (Supplementary Data 9, https://www.synapse.org/#!Synapse:syn18728081, and https://www.synapse.org/#!Synapse:syn18728088) and RNA-seq (Supplementary Data 10, https://www.synapse.org/#!Synapse:syn18751454, and https://www.synapse.org/#!Synapse:syn18751460) platforms. Microarray expression datasets were identified as those having “Expression profiling by array” in the “type” field of the “gse” SQLite table available in GEOmetadb and as being assayed in human [i.e., as having the pattern “sapiens” in the “organism” field of the “gpl” (platform) SQLite table]. RNA-seq expression datasets were similarly identified as having “Expression profiling by high throughput sequencing” in the “gse” table and as being assayed in human. Additionally, RNA-seq datasets were limited to those generated on Illumina platforms (i.e., as having a pattern of “illumina” in the “title” field of the “gpl” table), specifically the HiSeq (with pattern “hiseq” in the “title” field) or NextSeq (with pattern “nextseq” in the “title” field) platforms. Participants were provided only identifiers of GEO datasets (GSMxxx) and samples (GSExxx). In particular, cross dataset normalization to account for batch effects was not performed, but rather was left to participants.Unconstrained admixture generationUnconstrained admixtures were defined in two stages: (1) a broken-stick approach partitioned the entire admixture across n cell types and (2) the proportion of each cell type c was restricted to be within minc and maxc. In particular, for n cell types contributing a proportion p < = 1 (i.e., 100%) of the admixture total, the range 0 to p was randomly broken into n segments by choosing n-1 boundaries of those segments. The n-1 boundaries were uniformly sampled between a minimum cell population size of 0.01 (i.e., 1%) and p in fixed-sized steps (of 0.01 unless otherwise specified), thus ensuring that each of the n populations was represented at a frequency of at least 1%. The resulting candidate proportions were excluded if the proportion pc for any of the n cell types c was outside the bounds [minc, maxc]. minc was set to 0 and maxc was set to 1 (i.e., 100%), unless otherwise specified. Setting p < 1 allows the remaining 1-p proportion to be allocated to an (n + 1)st cell type, e.g., fixing a spike in population at proportion 1-p.Biological admixture generationBiologically constrained admixtures were defined such that cell population proportions were within biologically reasonable limits, in particular those detected by CyTOF in PBMCs and aggregated in the 10,000 Immunomes (10KIP) database76 (downloaded on July 9, 2018), observed in the Azizi single-cell (sc)RNA-seq breast cancer study43, the Tirosh scRNA-seq melanoma study63, or inferred by CIBERSORT in the Thorsson TCGA pan-cancer study77.As none of these resources included all (coarse- or fine-grained) cell types to be deconvolved, several were combined in a hierarchical fashion. Two such hierarchical models, one based on the Thorsson study and the second based on the Azizi study, were created. At each level of the hierarchy, the models defined a minimum and maximum proportion for each population relative to its parental population. The minimum and maximum proportions for a particular cell type in a particular dataset were defined as two standard deviations above (~98th percentile) or below (~2nd percentile), respectively, the mean of proportions observed for that cell type in that dataset. The root of the model corresponds to the admixture of n cell populations. In both hierarchical models, the entire admixture was partitioned into cancer cell, leukocyte, and non-leukocyte stromal compartments, with minimum and maximum proportions for each compartment defined using the Thorsson study. Specifically, from the stromal fraction (SF), or total non-tumor cellular fraction, and the leukocyte fraction (LF) defined by Thorsson, we define the cancer cell proportion as 1 – SF, the leukocyte proportion as LF, and the non-leukocyte stromal proportion as SF – LF. Both hierarchical models next subdivided the non-leukocyte stromal compartment into (cancer-associated) fibroblasts and endothelial cells using proportions of single cells observed in the Tirosh study. The original publication noted that four samples (CY58, CY67, CY72, and CY74) were experimentally enriched for immune infiltrates (CD45+). As such, the proportions inferred from them would not have represented cellular proportions relative to the entire cellular population. Hence, we excluded from analysis these samples, as well as CY75, which also did not have any tumor cells.The Thorsson-based hierarchical model subdivided the leukocyte component into those inferred using CIBERSORT in the original study. Specifically, the leukocyte fraction was subdivided into the following sub-compartments: memory CD4+ T (i.e., T.Cells.CD4.Memory.Activated + T.Cells.CD4.Memory.Resting in the original publication), naïve CD4+ T (i.e., T.Cells.CD4.Naive), regulatory CD4+ T (i.e., T.Cells.Regulatory.Tregs), CD8+ T (i.e., T.Cells.CD8), memory B (i.e., B.Cells.Memory), naïve B (i.e., B.Cells.Naive), natural killer (i.e., NK_cells), neutrophils (i.e., Neutrophils), dendritic cells (i.e., Dendritic.Cells.Activated + Dendritic.Cells.Resting), monocytes (i.e., Monocytes), and macrophages (i.e., Macrophages.M0 + Macrophages.M1 + Macrophages.M2). Finally, the CD8+ T cell proportion was subdivided into memory and naïve CD8+ T cells using the 10KIP database.The Azizi-based hierarchical model subdivided the leukocyte component into those reported in the Azizi study, specifically: T (i.e., T.cell in the original study), B (i.e., B.cell), natural killer (i.e., NK.cell), neutrophils (i.e., Neutrophil), dendritic cells (i.e., DC), monocytes (i.e., Monocyte), macrophages (i.e., Macrophage). Using the 10KIP database, the T cell compartment was further subdivided into memory, naïve, and regulatory CD4 T cells and memory and naïve CD8+ T cells, while the B cell compartment was further subdivided into memory and naïve B cells.A single, final model was created from the Thorsson and Azizi models. This final model had a maximum proportion for cell type c set to the maximum proportions for c within the Thorsson and Azizi models. The final model’s minimum proportion for c was set to the minimum proportions for c within the Thorsson and Azizi models, unless this was below 0.01, in which case it was set to 0.01.The biologically constrained admixtures were generated using the Hit and Run Markov Chain Monte Carlo (MCMC) method for sampling uniformly from convex samples defined by linear (equality and inequality) constraints, as implemented in the hitandrun library in R. The system of linear constraints included a variable for each of the n populations. As in the unconstrained admixtures, the corresponding n proportions sum to p <= 1, with p < 1 allowing the remaining 1-p proportion to be allocated to an (n + 1)st cell type. The resulting equality constraint was passed to the solution.basis function, whose output was in turn passed to the createTransform function. 2n linear inequality constraints were defined from the minimum and maximum proportions of each of the n populations. These were passed along with the output of createTransform to the transformConstraints function. An initial guess was created by passing these transformed constraints to createSeedPoint along with arguments homogeneous=TRUE and randomize=TRUE. Admixtures were sampled by passing the resulting seed and the transformed constraints to the har function along with parameters N, the number of iterations to run, set to 1000n3 and N.thin, the thinning factor indicating how many iterations to skip between samples, set to n3.Selection of extremal candidate admixturesUnless otherwise indicated, we ordered candidate admixtures so as to prioritize those most different from another. In particular, we select as the first two candidate admixtures those having maximum sum of squared (proportion) differences. Then, we greedily selected admixtures that maximized the minimal sum of squared differences to those admixtures already selected.In vitro validation admixture generation60 biological admixtures and 36 unconstrained admixtures were defined using the procedures described in the subsections Biological admixture generation and Unconstrained admixture generation, respectively, with the exceptions noted below. Admixtures were defined over the cell populations having samples with sufficient mass and high RNA integrity number upon first assessment (Supplementary Data 2): breast or colorectal cancer, endothelial cells, neutrophils, dendritic cells, monocytes, macrophages, NK, regulatory T, naïve CD4+ T, memory CD4+ T, naïve CD8+ T, memory CD8+ T, naïve B, and memory B cells. Admixtures were designed so as to minimize batch effects across vendors, with half of the biological and half of the unconstrained admixtures assigned immune cells from Stem Express wherever availability allowed (Supplementary Data 3 and 4, respectively) and the rest assigned immune cells from AllCells wherever availability allowed (Supplementary Data 5 and 6). However, following subsequent experimental quantification, several cell populations (neutrophils, naïve CD8+ T cells, and memory B cells) did not have sufficient material for inclusion in the admixtures. As such, the final in vitro admixtures used during the Challenge validation phase included: breast or colorectal cancer cells, endothelial cells, fibroblasts, dendritic cells, monocytes, macrophages, NK, regulatory T, naïveCD4+ T, memory CD4+ T, memory CD8+ T, and naïve B cells. The final relative concentrations were rescaled relative to those designed computationally after excluding neutrophils, naïve CD8+ T cells, and memory B cells. The final in vitro admixtures used during the Challenge validation phase are provided in Supplementary Data 7 and 8.Biological admixtures were generated with a fixed tumor proportion 1-p in the range 0.2 to 0.8 in steps of 0.01 (i.e., such that the n populations excluding the tumor cells have proportions summing to 1-p). This fixed tumor proportion overrode the tumor proportion bounds defined in the Thorsson-based and Azizi-based biological models.To assess the ability of methods to differentiate between closely related signal/decoy pairs of cell types (e.g., memory vs naïve CD4+ T cells) and to improve our sensitivity in measuring this ability, within each unconstrained in vitro admixture we included a signal cell type with a high proportion (minc of 0.2 and maxc of 0.35) and we excluded the decoy cell type (minc and maxc of 0). For all other non-cancer cell types c, minc was set to 0.01 and maxc to 0.5. We considered three ranges of cancer cell proportions: mincancer = 0.2 to maxcancer = 0.3, mincancer = 0.4 to maxcancer = 0.5 and mincancer = 0.6 to maxcancer = 0.7. For each combination of these three cancer ranges and the following signal/decoy pairs, we generated 1000 candidate admixtures:SignalDecoyMonocytesDendritic cellsMacrophagesMonocytesDendritic cellsMacrophagesNaïve CD4+ T cellsMemory CD4+ T cellsMemory CD4+ T cellsNaïve CD4+ T cellsNaïve CD8+ T cellsMemory CD8+ T cellsMemory CD8+ T cellsNaïve CD8+ T cellsNaïve B cellsMemory B cellsMemory B cellsNaïve B cellsTregsNaïve CD4+ T cellsNaïve CD4+ T cellsTregsFinally, we applied the strategy described in the subsection Selection of extremal candidate admixtures with a minor modification: in each selection round, we only considered candidate admixtures generated for a particular signal/decoy pair and we iterated through the list of pairs with each round (recycling pairs as necessary).Code to generate the (unconstrained and biological) in vitro validation admixtures is in https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge/blob/master/analysis/admixtures/new-admixtures/gen-admixtures-061819.R. The admixture expression data (i.e., represented as TPM and as read counts) are in Synapse folder syn21821096 and the (ground truth) admixtures are in Synapse folder syn21820011. They are the datasets designated “DS1”, “DS2”, “DS3”, and “DS4.” Participants were told the cancer cell type included in each dataset (BRCA or CRC), which was BRCA for DS1 and DS3 and CRC for DS2 and DS4.In silico validation admixture generationInsufficient RNA was available to include naïve CD8+ T cells and neutrophils in the in vitro admixtures. However, material was available to sequence the purified samples. This allowed us to generate in silico admixtures using the above biological and unconstrained procedures, such that the final in silico admixtures used during the Challenge validation phase included: breast or colorectal cancer cells, endothelial cells, fibroblasts, dendritic cells, monocytes, macrophages, neutrophils, NK, regulatory T, naïve CD4+ T, memory CD4+ T, memory CD8+ T, naïve CD8+ T, and naïve B cells. Memory B cells were unavailable to be included in either the in vitro or in silico admixtures.For each of the two cancers (breast or colorectal) and each of the two vendor batches (i.e., Stem Express-enriched or AllCells-enriched, as described in the subsection In vitro validation admixture generation), we generated 15 fine-grained unconstrained admixtures and 20 fine-grained biological admixtures. Unconstrained admixtures were generated as described in the subsection Unconstrained admixture generation, except with a step size of 0.001. Further, we did not diversify admixtures by attempting to maximize the distance between them (as described in the subsection Selection of extremal candidate admixtures). We did diversify the biological admixtures, by generating each of the 20 admixtures in five batches and by applying the distance maximization procedure to select the four most distant admixtures from those in each batch of MCMC samples. We used these 140 fine-grained admixtures to generate 140 coarse-grained proportions by summing fine-grained proportions across sub-populations.The transcripts per million (TPM)-based expression of in silico admixtures were generated as the weighted sum of the purified TPM expression profiles. For counts-based expression of the admixtures, we first normalized the gene counts for each purified sample by the total counts for that sample, multiplied by the median across samples of sample total counts to obtain pseudo-counts on the same scale for each sample, and finally derived the admixtures as the weighted sum of the pseudo-counts.Code to generate the (unconstrained and biological) in silico validation admixtures is in https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge/blob/master/analysis/validation_data/qc/generate-validation-in-silico-admixtures.R. The admixture expression data (i.e., represented as TPM and as read counts) and (ground truth) admixtures are in the same Synapse folders as the corresponding in vitro data—i.e., syn21821096 and syn21820011, respectively. They are the datasets designated “AA”, “AB”, “AE”, and “AF.” Participants were told the cancer cell type included in each dataset (BRCA or CRC), which was BRCA for AA and AE and CRC for AB and AF.Comparator method descriptionCIBERSORT13 computes the linear combination of cell type expression profiles that optimally approximates the observed admixture expression over a set of markers, using v-support vector regression (v-SVR)78,79, as implemented in svm of the R library e1071. CIBERSORT predicts 22 leukocyte populations, whose reference expression profiles across pre-defined markers are represented in the LM22 signature matrix. The optimization problem is solved in linear expression space, after the input admixture and the (vectorized) signature matrix are scaled to have zero mean and unit variance. All data for LM22 were obtained on the HGU133A microarray platform from healthy peripheral blood. Markers were defined following quantile normalization of the microarray expression data. Markers for a given cell type were defined so as to be: (1) differentially expressed between that cell type and all others; (2) amongst a subset of candidate markers, ordered by fold change, that optimized the condition number (i.e., quantified co-linearity) of the resulting signature matrix; and (3) not expressed in non-hematopoietic healthy and malignant cells.CIBERSORTx29 uses the same, v-SVR computational core as CIBERSORT. CIBERSORTx extends CIBERSORT, however, by correcting for differences between reference and input admixture data. In this Challenge, we applied bulk-mode (B-mode) batch correction, which re-optimizes the predicted fractions, β, relative to a ComBat-corrected80 admixture a*. CIBERSORTx first solves for β0 relative to input admixture a and signature matrix S (e.g., LM22), and then defines a* by ComBat-correctinga so as to minimize its differences relative to the estimated admixture S β0. The original CIBERSORTx publication defined several new signature matrices. As described below, we used one of these to first define the fraction of immune cells (relative to endothelial, stromal, and cancer cells) and then further subdivided this immune fraction across the 22 leukocytes of LM22.EPIC14 computes the linear combination of cell type expression profiles that optimally approximates the observed admixture expression over a set of markers, using constrained, weighted, least-squares optimization, as implemented in constrOptim of the R library stats. constrOptim minimizes a function subject to linear inequality constraints using an adaptive barrier algorithm. The inequality constraint on the sum allows for an uncharacterized cell type. EPIC operates in linear TPM expression space. Batch correction between reference and input admixture data consists of subsetting both datasets to a common set of genes and then normalizing the TPM expression values to sum to unity. EPIC collects cell type marker expression profiles in a signature matrix, which was derived using sorted immune cells from peripheral blood within three datasets of healthy patients following influenza vaccination and from healthy and non-malignant diseased patients. The authors performed no batch correction across the three studies, but noted that profiles segregated primarily by cell type, rather than study, using principal component analysis (Fig. 1 of the original study14). Markers were defined so as to be differentially expressed across reference cells and to be expressed by reference cells, but not within non-reference (or uncharacterized) cells (e.g., tumor). More specifically, differential expression was performed using DESeq281 without specifying any batch correction and using pairwise comparisons of one reference cell type versus another. Markers were confirmed not to be expressed in non-hematopoietic tissues in the Illumina Human Body Map 2.0 Project (ArrayExpress ID: E-MTAB-513) or GTEx82. Finally, to avoid selecting for markers of exhaustion phenotype, for example, markers were required to have expression in reference samples and in tumor-infiltrating immune cells in non-lymphoid tissues63.MCP-counter15 computes a cell type enrichment score as the arithmetic mean of the expression of that cell type’s markers in linear expression space. The authors applied frozen robust multiarray average (fRMA) separately to each GEO series generated from three Affymetrix microarray platforms: Human Genome 133A, Human Genome 133 Plus 2.0, and HuGene 1.0 ST. The enrichment-based approach obviates the need for batch correction between reference and input admixture datasets. The authors organized cell populations hierarchically, and then for each population applied a stringent test requiring markers be differentially expressed between a population and populations not overlapping it in the hierarchy and further requiring that the marker not be expressed in the non-overlapping populations.quanTIseq17 computes the linear combination of cell type expression profiles that optimally approximates the observed admixture expression over a set of markers, using constrained least-squares optimization, as implemented in lsei of the R library limSolve. Unlike in the original publication, as applied in the Challenge (see below), quanTIseq does not scale inferred coefficients by mRNA content. quanTIseq operates in linear TPM expression space. quanTIseq collects cell type marker expression profiles in a signature matrix. Each immune marker was defined by a strategy requiring that it: (1) be expressed in at least two immune libraries; (2) be specific to its corresponding cell type (i.e., not a marker for a different cell type according to the xCell method); (3) have high (binned) expression in all libraries of that cell type and low or medium quantized expression in all other libraries; (4) not be highly expressed in non-hematological cell lines in the the Cancer Cell Line Encyclopedia (CCLE)83; (5) not be expressed in TCGA bulk tumors; (6) not be highly expressed; and (7) be highly correlated with mixing fractions simulated in 1700 datasets. The resulting signature matrix is composed of the median expression for the marker genes over all samples corresponding to a specific cell type.xCell18 computes a cell type score using ssGSEA84 of the cell type’s markers that it then calibrates to a linear scale. It deconvolves 64 cell types and uses spillover compensation to resolve those that are highly related. Markers were defined across 1725 non-malignant samples in six datasets generated using Cap Analysis Gene Expression (CAGE), RNA-seq, and microarrays. RNA-seq and CAGE data were FPKM normalized, while microarray data were normalized using robust multi-array average (RMA). Markers were defined independently in each dataset, which obviates the need for inter-technology batch correction. Markers for a specific cell type were selected such that a quantile of low expression within that cell type exceeded quantiles of high expression for all other cell types. Further, markers were required not to be expressed in CCLE carcinomas using a previously developed technique85. Enrichment scores were calculated using ssGSEA for each marker set within each dataset. The best three such signatures were assessed via validation in a held-out dataset and the mean of these was computed and fit to an analytical model that mapped it to cell type abundances.Comparator method evaluationAll comparator methods were executed by Sage Bionetworks (A.L., V.C., or B.S.W.) and without modification. All methods were passed expression in linear form.CIBERSORT13 was executed with arguments abs_method = “sig.score”, absmean = TRUE, QN = FALSE, and all other arguments default (including absolute = FALSE) via the script CIBERSORT.R. Outputs from CIBERSORT were translated into Challenge populations as described in Table S9.CIBERSORTx29 was run in two phases: (1) The first phase separates immune cells (expressing CD45), endothelial cells (CD31), fibroblasts (CD10), and epithelial/tumor cells (EPCAM) using a signature matrix (Supplementary Table 2L of ref. 29) derived from FACS purification of these four cell types within 26 surgically resected primary non-small cell lung cancer biopsies. (2) The second phase further divides the immune compartment into the 22 immune sub-populations represented by the same LM22 signature matrix originally published with CIBERSORT13. In both cases, CIBERSORTx was executed using the cibersortx/fractions docker container obtained from https://cibersortx.stanford.edu/, with arguments –rmbatchBmode TRUE –perm 1 –verbose TRUE –QN FALSE. The –sigmatrix parameter was used to specify the appropriate signature matrix. Outputs from the two phases of CIBERSORTx were translated into Challenge populations by scaling the output of LM22 phase by the CD45 output from the first phase as described in Table S10.EPIC14 was executed using the EPIC function from the EPIC R library, with the arguments reference = “BRef” and mRNA_cell = FALSE. Outputs from EPIC were translated into Challenge populations as described in Table S11.MCP-counter15 was executed using the MCPcounter.estimate function from the MCPcounter R library, with the argument featuresType=”HUGO_symbols”. Outputs from MCP-counter were translated into Challenge populations as described in Table S12.quanTIseq17 was executed using the deconvolute_quantiseq function implemented in the immundeconv R library24. deconvolute_quantiseq was passed the arguments tumor = TRUE, arrays = FALSE, and scale_mrna = FALSE. If parameterization of deconvolute_quantiseq returned any invalid (i.e., not-a-number) results, it was re-run with the additional argument method = “huber”. Outputs from quanTIseq were translated into Challenge populations as described in Table S13.xCell18 was executed using the xCellAnalysis function of the xCell R library, with the argument rnaseq = TRUE and the argument cell.types.use set to the corresponding cell types within each challenge [i.e., to c(“B-cells”, “CD4+ T-cells”, “CD8+ T-cells”, “NK cells”, “Neutrophils”, “Monocytes”, “Fibroblasts”, and “Endothelial cells”) in the coarse-grained sub-Challenge and to c(“Memory B-cells”, “naive B-cells”, “CD4+ memory T-cells”, “CD4+ naive T-cells”, “Treg”, “CD8+ Tem”, “CD8+ naive T-cells”, “NK cells”, “Neutrophils”, “Monocytes”, “DC”, “Macrophages”, “Fibroblasts”, “Endothelial cells”) in the fine-grained sub-Challenge]. Outputs from xCell were translated into Challenge populations as described in Table S14.Deconvolution method scoring and comparisonPearson and Spearman correlation-based scores were calculated hierarchically for a given method $a$: For each cell type $c$ and validation dataset $d$ (i.e., DS1, DS2, DS3, DS4, AA, AB, AE, and AF), the correlation between the values predicted by $a$ and the ground truth was calculated. These correlations were then averaged over all cell types $c$ to define the score of method $a$ for dataset $d$. These dataset-level scores were finally averaged over all datasets $d$ to define the aggregate score for method $a$.To assess scoring differences in the primary metric between a top-performing method $a$ and another method $b$, we computed a Bayes factor ${K}_{a,b}$ over 1000 bootstrap samples and considered ${K}_{a,b} > 3$ as indicating a significant difference. More specifically, we bootstrap sampled (i.e., sampled with replacement) prediction scores separately within each dataset (i.e., DS1, DS2, DS3, DS4, AA, AB, AE, and AF), calculated a Pearson correlation-based score ${S}_{i}^{a}$ between the predictions in bootstrap sample $i$ for method $a$ and the corresponding ground truth values (and similarly for ${S}_{i}^{b}$ and method $b$), and calculated ${K}_{a,b}$ as$${K}_{a,b} =\frac{{{\#}}\,{{\rm{of}}}\; {{\rm{bootstrap}}} \; {{\rm{samples}}}\; {{\rm{for}}}\; {{\rm{which}}}\; {{\rm{method\;}}}a\,{{\rm{outperformed}}}\; {{\rm{method\;}}}b}{{{{\#}}}\;{{\rm{ of}}}\; {{\rm{bootstrap}}}\; {{\rm{samples}}}\; {{\rm{for}}}\; {{\rm{which}}}\; {{\rm{method\;}}}b\;{{\rm{outperformed}}}\; {{\rm{method\;}}}a} \\ =\frac{{\sum }_{i=1}^{1000}1({S}_{i}^{a} > {S}_{i}^{b})}{{\sum }_{i=1}^{1000}1({S}_{i}^{b} > {S}_{i}^{a})}$$where $1(x)$ is the indicator function that equals $1$ if and only if $x$ is true and is $0$ otherwise. Any ties between methods $a$ and $b$ (i.e., ${K}_{a,b}\le 3$) were resolved using the secondary Spearman correlation-based metric. However, this did not occur in the first submission results. Distributions, medians, and means over the ${S}_{i}^{a}$ are reported for the Pearson correlation-based scores in the figures (e.g., Fig. 2A) and main text in lieu of a single score on the original validation data. Similar bootstrapped distributions, medians, and means were calculated for the Spearman correlation-based scores and are likewise reported.Within-sample deconvolution method assessmentWe assessed prediction performance across cell types within samples for those methods outputting normalized scores (CCB, D3Team, NYIT_glomerular), proportions (Patrick), and fractions (Aginome-XMU, Biogem, CIBERSORTx, DA_505, HM159, IZI, jbkcose, jdroz, LeiliLab, NPU,REGGEN_LAB, Rubbernecks, Tonys Boys, and xCell). We computed the Pearson correlation, Spearman correlation, and root-mean-square-error (RMSE) across cell types within a sample. To assess ties across teams, we fit a linear model whose response was the correlation or RMSE and whose dependent variable was the team. The top-scoring team (based on ordering of the median value across samples) was used as the reference in the linear model, which was fit using lm in R. Teams were considered tied with the top performer if their corresponding t-statistic p-value was >0.05, as computed from the model fit using summary.Several outliers were excluded from the RMSE sub-plots of Fig. 4 (Patrick from Fig. 4A, B and NYIT_glomerular from Fig. 4A).Deconvolution method specificity assessmentTo assess deconvolution method specificity, we calculated the (min-max) normalized prediction for a cell type X in a sample S purified for some cell type other Y ≠ X. These normalized predictions are displayed in the heatmap of Fig. 5A, with cell types as columns and samples as rows. Predictions were normalized so as to be comparable across methods independent of the scale of the prediction (e.g., both unnormalized scores comparable across samples and proportions comparable across samples and cell types). The min-max normalization of a prediction pred(X, S, m) for cell type X, method m, and purified sample S was defined as$$ [{pred}(X,\, S,m)\, -\,{\min }_{S^{\prime} }\, {pred}(X,\,S^{\prime},\,m)]\,/\,[\,{\max }_{S^{\prime} \, }{pred}(X,\,S^{\prime},\,m) \\ -\,{\min }_{S^{\prime} }\, {pred}(X,\,S^{\prime},\,m)].$$Spillover into (predicted) cell type X for method m was calculated as the above normalized prediction for cell type X and method m averaged over samples S purified for some cell type Y ≠ X (i.e., the mean of the column corresponding to cell type X in Fig. 5A that excludes elements in which X is in the sample corresponding to the row). These spillovers were then averaged over sub-Challenges and the resulting distributions were plotted in Fig. 5B. Distributions of spillovers over cell types are plotted for each method in the coarse- (Fig. 5C) and fine-grained (Fig. 5D) sub-Challenges.Code to format the purified expression profiles is in https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge/blob/master/analysis/specificity-analysis/create-spillover-dataset.R. The processed expression data (i.e., represented as TPM and as read counts) and the (ground truth) admixtures are in Synapse folder syn22392130.Deconvolution method sensitivity assessmentTo assess deconvolution method sensitivity in detecting each cell type X, we generated in silico admixtures in which we computationally spiked in X at regular proportions. We considered 49 spike-in levels from 0% to 0.1% in increments of 0.01%, from 0.1% to 1% in increments of 0.1%, from 1% to 20% in increments of 1%, and from 20% to 40% in increments of 2%. Cancer cells were neither used as spike ins nor included within the admixtures. Otherwise, all cell types with available purified expression profiles were included, namely endothelial cells, fibroblasts, dendritic cells, monocytes, macrophages, neutrophils, NK, regulatory T, naïve CD4+ T, memory CD4+ T, memory CD8+ T, naïve CD8+ T, and naïve B cells. Expression profiles of in silico admixtures were generated as described in the subsection In silico validation admixture generation.We defined the limit of detection (LoD) for cell type X and method m as the least frequency at and above which m’s prediction for X is statistically distinct from the baseline admixture (0% spike in). We assessed statistical significance using the two-sided Wilcoxon test as implemented in the compare_means function of the ggpubr library and using a raw (uncorrected) p-value cutoff of 0.01.We generated both unconstrained and biological admixtures, using both fine- and coarse-grained populations. For unconstrained admixtures, we used the broken stick procedure described in the subsection Unconstrained admixture generation, except with a step size of 0.001 and without diversifying admixtures as described in the subsection Selection of extremal candidate admixtures. For each of the two vendor batches (i.e., Stem Express-enriched or AllCells-enriched, as described in the subsection In vitro validation admixture generation) and each spike in level s, we generated five coarse- and five fine-grained unconstrained admixtures such that the proportions of the n populations summed to 1-s. We used these same five admixtures for each of the spike-in experiments by simply assigning the population with fixed proportion s the name of the population to be spiked in.For unconstrained coarse-grained populations, we wanted to fix the level of the parental population (e.g., CD8+ T cells) rather than the sub-populations comprising it (i.e., memory and naïve CD8+ T cells). We defined coarse-grained admixtures at the level of the coarse-grained populations, but to concretely instantiate them we distributed the proportion of each parental population into its corresponding sub-populations. We did so by randomly dividing the proportion into msub-populations using a flat Dirichlet distribution (using the rdirichlet function in the dirichlet library) whose m parameters were set to 1/m.Code to generate the in silico spike in admixtures is in https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge/blob/master/analysis/in-silico-admixtures/generate-in-silico-admixtures.R. The processed expression data (i.e., represented as TPM and as read counts) and the (ground truth) admixtures are in Synapse folder syn22361008.Statistical analysesAll analyses were performed using R statistical software86. Pearson and Spearman correlations were calculated with cor.test and two-sided Wilcoxon tests were performed using compare_means.Consensus rank methodWe sought to define an ensemble method to aggregate predictions across all participant and comparator methods. Since the scales of predicted values vary according to the type of method output (scores, normalized scores, or fractions), we decided to aggregate the ranks of the predicted values across methods rather than the predicted values themselves. This is an instance of the consensus ranking, or social choice, problem in which we seek a ranking that summarizes the individual rankings of n judges (or, in our case, methods) for m objects (here, samples). We could define a consensus rank-based ensemble method using ConsRank87, for example, which uses heuristic algorithms to define one or more consensus rankings. However, as the (approximate) solutions are not guaranteed to be unique, we decided instead to take the more straightforward and more computationally efficient approach of simply defining the ensemble ranking as the mean of the individual rankings.Deconvolution method performance across healthy and cancer-associated immune cell typesWe compared deconvolution performance across datasets independently for each cell type and after controlling for the mean performance of each method. Specifically, within each dataset, cell type, and method, we adjusted the Pearson correlation between the ground truth and predicted values by subtracting off the mean Pearson correlation across datasets for the respective cell type and method. We then performed an ANOVA by passing the formula pearson.r ~ 0 + dataset to lm in R. To test whether the Pearson correlation within any individual dataset was less than that of the mean across datasets, we computed one-sidedp-values using the Student t distribution function pt with lower=TRUE. Finally, we adjusted all p-values calculated across all datasets and cell types for multiple hypothesis testing using the Holm-Bonferroni method implemented in p.adjust. We used the Holm-Bonferroni method as it makes no assumptions of the p-values, in particular, that they are independent, a condition that would not be met for our testing scenario. 95% confidence intervals were computed with confint.Pseudo-bulk Wu et al. BRCA dataset generationWe downloaded the raw count data and metadata generated by the Wu et al. BRCA single-cell study34 from the Broad Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). Specifically, we downloaded Allcells_raw_count_out.zip and Whole_miniatlas_meta.csv from https://singlecell.broadinstitute.org/single_cell/study/SCP1039/a-single-cell-and-spatially-resolved-atlas-of-human-breast-cancers. We extracted read counts from the raw data using Read10X from Seurat88. We mapped single cells as annotated in the metadata of the published study using the fields celltype_major, celltype_minor, and celltype_subset to those in the coarse-grained sub-Challenge as described in Table S15. The fine-grained cell types were subsequently mapped to coarse-grained cell types.Separately for the coarse- and fine-grained translations, we defined the frequency of Challenge cell types within each patient. We further summed the raw gene counts for cells having the same annotation within a patient, resulting in raw counts for each cell type within a patient rather than for each cell. We excluded genes from the original study not present in the Challenge validation data. We excluded patients whose respective cells don’t cluster with cells of the corresponding type from other patients. More specifically, we computed a reduced dimensionality UMAP projection from CPM-normalized deconvolution-related genes. If the most frequent annotation of a cell’s ten nearest neighbors differed from its own annotation, that cell (type) was considered problematic and the corresponding patient was excluded from further processing. The following patients were excluded: CID4513 and CID4465 (because of mis-clustering of B cells), CID4461 (monocytic lineage), and CID4523 (endothelial cells). Deconvolution-related genes were those used in four published methods: MCP-Counter genes, CIBERSORT’s LM22 genes, EPIC’s TRef and BRef signature genes, and quanTIseq’s TIL10 genes.As some deconvolution methods expect Hugo gene symbols, whereas others expect Ensembl gene identifiers, we translated the former to the latter using the same translation table applied during the Challenge (available on Synapse via syn22394938). Where multiple Ensembl identifiers mapped to the same gene symbol, the associated raw counts were summed. Counts were normalized such that genes within each cell type/patient pair summed to the same value – namely, the maximum sum across all cell type/patient pairs. This facilitated generation of a pseudo-bulk sample for each patient by taking a sum of raw count, cell type expression profiles for the patient weighted by their corresponding proportional frequency within that patient. These were then translated to CPM values and both the raw count and CPMs in both Ensembl and Hugo namespaces were supplied as input to Challenge participant and comparator methods. These were then executed against the data using the Dockerized methods submitted to the Challenge, with the exception of CIBERSORTx, which was run manually (using the same version and procedure as in the Challenge).Pseudo-bulk Pelka et al. CRC dataset generationWe generated pseudo-bulk CRC data from the Pelka et al. single-cell study35, using the same procedure as described above for the Wu et al. study, except as noted below. We downloaded the raw data file GSE178341_crc10x_full_c295v4_submit.h5 from GEO and the metadata file crc10x_tSNE_cl_global.tsv from the Broad Single Cell Portal (https://singlecell.broadinstitute.org/single_cell/study/SCP1162/human-colon-cancer-atlas-c295). We extracted the raw count data, with genes represented with Ensembl identifiers, using Read10X_h5 with arguments use.names=FALSE and unique.features=FALSE. Using the hdf5r library, we further extracted the translation between Ensembl identifiers (matrix/features/id slot) and Hugo symbols (matrix/features/name slot). We removed genes from pseudoautosomal regions (i.e., with PAR_Y in their Ensembl identifier).We mapped single cells as annotated in the metadata of the published study using the field ClusterFull to those in the fine-grained sub-Challenge as described in Supplementary Data 17. The fine-grained cell types were subsequently mapped to coarse-grained cell types.Statistics and reproducibility1000 bootstraps were used to compare methods. A sample size of 1000 was chosen so that intra-method variance across bootstraps was small (i.e., relative to inter-method variance). No data were excluded. Cell populations were generally derived from two biological replicates (often from different vendors). Exceptions were memory B cells, which were not replicated owing to poor RNA integrity of the second sample. Further, fibroblasts, endothelial cells, and colon and breast cancer cells were not replicated, as they were derived from cell lines. The experiments were not randomized, since there were no experimental groups in our study and randomization was not relevant. The investigators were not blinded during the study, as there were no group allocations.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Community assessment of methods to deconvolve cellular composition from bulk gene expression

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis