Identifying cell lines across pan-cancer to be used in preclinical research as a proxy for patient tumor samples

All the experiments were conducted in accordance with relevant guidelines and regulations.DatasetsData preprocessing for running CTDPathSim2.0In CTDPathSim2.0, we integrated DNA methylation, gene expression, and CNA datasets from 22 different cancer types. We downloaded the genomic data of cancer patient samples from the TCGA database. Using TCGAbiolinks41, we downloaded RNA sequencing (RNA-Seq) data in HTSeq-FPKM format. We downloaded DNA methylation data from Infinium HumanMethylation27 Bead-Chip (27 K) and Infinium HumanMethylation450 Bead-Chip (450 K) platforms. We retrieved masked copy number variation (Affymetrix SNP Array 6.0) and computed the gene-centric copy number value compatible with hg38 genome assembly using R Bioconductor package CNTools42. From the COSMIC (Catalogue Of Somatic Mutation In Cancer)43 cancer gene census project, we downloaded a list of 702 cancer-driving genes that are found to be frequently mutated in different cancer types.We downloaded gene-centric copy number data for 1040 cell lines from the CCLE database. We downloaded the CCLE_RNAseq_gene_rpkm _10180929.gct.gz file for gene expression for 1019 cell lines and CCLE_RRBS_cgi_CpG_clusters_20181119.txt.gz file for DNA methylation data of 831 cell lines. Using the R package BioMethyl44, we removed CpG sites that had missing values in more than half of the samples and imputed the rest of the missing values by employing the Enmix45 R package with default parameters.To prepare the reference DNA methylation profiles for the deconvolution step, we processed raw methylation probes provided in FlowSorted.Blood.450k46 R package that consists of DNA methylation probes from peripheral blood samples with five different cell types, namely, B cell (B), natural killer (NK), CD4+ T, CD8+ T, and monocytes (MC), generated from adult men with replicates. We also processed raw DNA methylation data available at NCBI Gene Expression Omnibus (GEO) database repository with the dataset identifier GSE122126 for three more cell types, namely, adipocytes (AC) (for 450 K DNA methylation only), cortical neurons (CN), and vascular endothelial (VE) cells with replicates. We processed these raw methylation files using the R package minfi47 to prepare our reference DNA methylation profiles of different cell types. Two different reference files, one for 450 K probes and the other for 27 K probes, with eight different cell types were prepared. Please see Table 1 to find the sample sizes which were used after preprocessing in this work.Data preprocessing for evaluating CTDPathSim2.0’s resultsTo compare the drug responses of patient tumor sample-cell line pairs, we compiled our ground truth drug response data for the patients from the TCGA database using Overall Survival (OS) and Progression-Free Survival (PFI) as clinical endpoints. These survival endpoints were considered representative of the drug response status of the patients. In OS, the patients who were dead from any cause were considered dead, otherwise censored. In PFI, the patients having a new tumor event whether it was a progression of the disease, local recurrence, distant metastasis, new primary tumor event, or died with cancer without a new tumor event, including cases with a new tumor event whose type is N/A were considered as the event occurred and all other patients were censored. We considered survival status 0 (i.e., censored), representative of sensitive drug response whereas survival status 1 (i.e., dead/event occurred), representative of resistant drug response as we mapped the patients’ survival status to drug response. In order to determine if PFI or OS status can be used as a drug response metric, we looked at the time intervals between drug applications. The following five rules were used to map the patients’ survival status to drug response (the details of these rules can be accessed from Supplementary Note S3):Drug response = OS status; if PFI status = 0 and PFI.time = OS.timeDrug response = PFI status; if PFI status = 1 and PFI.time = OS.timeDrug response = PFI status; if drug application end days ≤ PFI days and PFI status = 1 with PFI. time < OS.timeDrug response = OS status; if drug application end days ≤ PFI days and PFI status = 0 with PFI.time < OS.timeDrug response = OS status; if drug application start days ≥ PFI daysFor cell lines’ drug response data, we considered IC50 (the concentration of a drug that is required for 50% inhibition in vitro) values as drug response quantification. Furthermore, \({{\mathrm{ln}}}\,({{IC}}_{50})\, < \,-2.0\) was used to define positive (i.e., sensitive) anti-cancer drug response, and other \({{\mathrm{ln}}}\,({{IC}}_{50})\) values were considered as negative (i.e., resistant) drug responses. Since common drugs between TCGA and CCLE were few (only twelve), we also added cancer cell lines drug response data with IC50 values from the Genomics of Drug Sensitivity in Cancer (GDSC)48 database in our ground truth drug response data. We found 989 overlapping cell lines between the CCLE and the GDSC databases. We processed IC50 values of the GDSC drug similarly as we did for the cell lines in the CCLE. We retrieved 24 common drugs between TCGA patients of 19 different cancer types and cell lines from CCLE and GDSC after using the five filtering criteria mentioned below. A list of these drugs with the number of patient-cell line pairs, number of resistant and sensitive patients, and cell lines can be accessed from Supplementary Table S1.To compare the drug responses between sample-cell line pairs according to the computed similarity scores, we used z-normalized scores and considered only those drugs and similarity scores that passed the following filtering criteria to be eligible for the drug response concordance test between sample-cell line pairs (shown in Table 9 and Table 10 as an example):Table 9 Selection of the drugs D1 and D2 based on match-mismatch ratioTable 10 Selection of the similarity scores for drug D based on hypergeometric test and match-mismatch difference1. Filtering the drugs based on match-mismatch ratio: To check if a drug has a consistent response between sample-cell line pairs in the ground truth data, we computed a match-mismatch ratio, i.e., the ratio between the number of matched (i.e., same drug response) sample-cell line pairs to the number of mismatches (i.e., different drug response) sample-cell line pairs. The drugs that had a match-mismatch ratio < 0.2 were eliminated as even in the ground truth data, the majority of sample-cell line pairs have discordant drug responses. Table 9 shows an example of this filtering step.2. Filtering the drugs based on similarity thresholds: We considered the drugs for which the sample-cell line pairs had similarity thresholds at least above or below 1 and −1, respectively to allow enough data points in high and low-similarity sample-cell line groups.3. Filtering the computed similarity scores based on hypergeometric p-value: Since we computed millions of sample-cell line similarity scores, we had discrete similarity scores, each with multiple sample-cell line pairs after rounding the similarity scores to three digits. This rounding precision allowed us to generate a substantial number of data points for our evaluation. Decreasing the rounding threshold by two digits or increasing the rounding threshold by more than five digits would have led to the elimination of several “discrete scores.” Therefore, after careful consideration, we chose to use three digits for rounding. Furthermore, to assess the significance of each discrete similarity score, we implemented a stringent filter as described below:For a given cancer type, we calculated the number of sample-cell line pairs for all similarity scores that had drug response data for drug D, forming Set 1. We also computed the number of sample-cell line pairs for a specific discrete similarity score, denoted as K, forming Set 2. To determine if K is a significant discrete similarity score, we conducted a hypergeometric test to assess the overlap between Set 1 and Set 2 (hypergeometric p-value < 0.05). The universe for this hypergeometric test encompassed all sample-cell line pairs of the cancer type under study.This approach ensured that we considered only those discrete similarity scores that exhibited a significant overlap between Set 1 and Set 2, thereby enabling us to evaluate drug response match and mismatch between different sample-cell line pairs for drug D. Table 10 shows an example of this filtering step.4. Filtering the computed similarity scores based on match-mismatch difference: We filtered the similarity scores that showed at least a ten percent difference between match and mismatch (i.e., the difference between the last two columns in Table 10) drug response within sample-cell line pairs for considering these as confident scores for the drug response concordance test (Table 10).Running CTDPathSim2.0CTDPathSim2.0 has five main computational steps. In the following paragraphs, we describe the CTDPathSim2.0 R functions to run these steps. We performed these steps for each of 22 different cancer types separately. The entire pipeline of CTDPathSim2.0 is illustrated in Fig. 1.Step 1: Computing sample-specific deconvoluted DNA methylation profileTo capture the accurate DNA methylation signal of tumor samples, in the first step, we utilized a deconvolution-based algorithm49 to infer the sample’s deconvoluted DNA methylation profile with proportions of different cell types in the sample’s bulk tumor tissue. The deconvolution algorithm requires a reference DNA methylation profile of different cell types. Hence, we compiled a list of reference DNA methylation profiles with eight different cell types, namely, B, NK, CD4+ T, CD8+ T, MC, AC, CN, and VE with replicates. We considered these cell types since a blood-based DNA methylation profile can depict the distribution of immune cell types50,51, and these cell types play roles in targeted therapies in cancer11,12,13. This algorithm computes a matrix with DNA methylation profiles of each of the constituent cell types (i.e., W = (wij) where W ∈ \({{\mathbb{R}}}^{l{{\times }}c}\) ; l is the number of loci, c is the number of cell types, i = 1, 2,…., l and j = 1, 2,…., c.) and a matrix with the proportions of constituent cell types in each input tumor sample (i.e., H = \(({h}_{{ij}})\) where H ∈ \({{\mathbb{R}}}^{c{{\times }}s}\) ; c is the number of cell types, s is the number of tumor samples, i = 1, 2,…., c and j = 1, 2,…., s.)) minimizing the Euclidean distance between their linear combination (i.e., WH) and the original matrix of samples’ DNA methylation profiles (i.e., V = (\({v}_{{ij}}\)) where V ∈ \({{\mathbb{R}}}^{l {\times } s}\) ; l is the number of loci, s is the number of tumor samples, i = 1, 2,…., l and j = 1, 2,…., s.) (Fig. 1, Step 1). We designed a function RunDeconvMethylFun() that ran a minimization algorithm in an iterative procedure that, in each round, alternates between estimating constituent cell-type proportions and DNA methylation profiles by solving constrained least-squares problems of \({{||}{{{\boldsymbol{V}}}}\,-{{{\boldsymbol{WH}}}}{||}}_{F}^{2}\) using quadratic programming method49 with the constraints that 0 ≤ vij ≤ 1, 0 ≤ wij ≤ 1 and \({\sum}_{i}{h}_{{ij}}=1\). These three constraints for this problem were from the fact that DNA methylation measurements and cell type proportions were numbers in the range of 0 to 1 (hence, 0 ≤ vij ≤ 1 and 0 ≤ wij ≤ 1), and cell type proportions within a sample added up to 1 (hence, \({\sum}_{i}{h}_{{ij}}=1\)) (Fig. 1, Step 1). These constraints restrict the space of possible solutions, thus making it possible for the local iterative search to find a global minimum and an accurate solution reproducibly. To initiate the algorithm, we utilized the reference profiles as W with a set of selected 500 marker loci (probes) based on comparisons of each class of reference against all other samples using t-test. These loci were likely to exhibit variation in DNA methylation levels across different constituent cell types.We designed a function, SampleMethFun() that utilized the estimated proportion of constituent cell types (i.e., HE) and the estimated probe-centric DNA methylation profile of constituent cell types (i.e., WE) from the output of RunDeconvMethylFun() to compute the sample-specific deconvoluted DNA methylation profile for each tumor sample (see Fig. 1, Step 1). We ran this step for 450 K DNA methylation probes (i.e., loci) and 27 K DNA methylation probes separately. For each sample-specific deconvoluted DNA methylation profile, we computed gene-centric DNA methylation beta values from the probe-centric DNA methylation values with ProbeToGeneFun() function in our R package. For the 450 K platform, the average beta value for promoter-specific probes was being considered due to their role in transcriptional silencing52. Given lower coverage in the 27 K platform, the average beta value of all the probes of a gene was considered as the gene’s DNA methylation level.Step 2: Computing sample-specific deconvoluted expression profileIn this step, we computed the deconvoluted expression profile for each tumor sample. We computed cell-type-specific expression profiles for the reference cell types utilized in Step 1 minimizing the objective function \({{||}{{{\boldsymbol{Y}}}}-{{{\boldsymbol{ZH}}}}{||}}_{F}^{2}\) where Y is the original matrix of tumor samples’ expression profiles (i.e., Y = \(({y}_{{ij}})\) where Y ∈ \({{\mathbb{R}}}^{g{{\times }}s}\), g is the number of genes, s is the number of samples, i = 1, 2,…., g and j = 1, 2,…., s.)), and Z is the cell-type-specific expression profiles (i.e., Z = \(({z}_{{ij}})\) where Z ∈ \({{\mathbb{R}}}^{g{{\times }}c}\), \({g}\) is the number of genes, c is the number of cell types, i = 1, 2,…., g and j = 1, 2,…., c.) (Fig. 1, Step 2) to be estimated. We provided RunDeconvExpr() function in the package that utilized the estimated cell proportions (HE) from Step 1 as a fixed input (i.e., H = HE) and computed average gene expression profiles of constituent cell types through a constrained least-squares of \({{||}{{{\boldsymbol{Y}}}}{{{\boldsymbol{-}}}}{{{\boldsymbol{ZH}}}}{||}}_{F}^{2}\) using quadratic programming with solutions constrained to [0,∞) for gene expression measurements49. Then, using estimated cell-type-specific proportions in each sample (HE) and estimated cell-type-specific expression (ZE) in SampleExprFun() function, we computed the deconvoluted gene-centric expression profile of each tumor sample (Fig. 1, Step 2).Step 3: Computing sample- and cell line-specific DE genes and biological pathwaysIn this step, we computed enriched biological pathways listed in REACTOME53 database for each sample and cell line using sample-specific DE genes. We designed GetSampleDE() function to compute the median expression of a gene across all samples and further computed the fold-change of that gene to the computed median expression. A gene was considered up if the fold-change ≥4 and down if the fold-change ≤−4. Likewise, using the GetCellLineDEFun() function and similar thresholds, we computed cell line-specific DE genes. Considering the critical role of the biological pathway activities in cancer16 and the previous studies that suggested that pathway activation status in cancer cell lines could connect tumor samples17,18, we identified enriched biological pathways for each sample and cell line using GetPathFun() function in our R package utilizing these DE genes. GetPathFun() found enriched biological pathways listed in the REACTOME database via a pathway enrichment tool Pathfinder54. To prioritize cancer-related biological pathways, specifically, we used the known frequently mutated genes that were considered cancer-driving genes from the DE gene list for this analysis. We obtained the frequently mutated cancer-driving genes from the COSMIC43 database. GetPath() considered the pathways that were significantly enriched with FDR-corrected hypergeometric p-values < 0.05 in this DE cancer gene list.Step 4: Computing sample- and cell line-specific differentially methylated (DM) and differentially aberrated (DA) genesIn this step, we computed sample-specific and cell line-specific DM and DA genes. For sample-specific DM genes, we computed M-values55 from gene-centric beta values and utilized GetSampleDMFun() function in the R package to compute the median of gene-centric M-values across all the samples in each cancer type. This function considered a gene hypermethylated if the M-value fold-change ≥4 and hypomethylated if fold-change ≤−4. Likewise, using GetCellLineDMFun() with the same thresholds, we computed cell line-specific DM genes among all available cell lines in the study.To compute sample-specific DA genes, we computed highly amplified and deleted genes utilizing chromosomal copy number profiles of each cancer type cohort as inputs in the GISTIC 2.056 tool in the GenePattern57 web server using a confidence interval of 0.90. We selected the genes that exceeded the high-level GISTIC thresholds for amplification and deletions as 2 and −2, respectively as DA genes.For cell lines, we selected the highly amplified and deleted genes for each cell line compared with all other cell lines from various cancer types to capture the cancer-specific variation in copy number profiles in computing the similarity score between a patient sample and cell line. For this purpose, we utilized the GetCellLineDA() function, which selects cell line-specific DA genes based on Tukey’s mean-difference58 curve based on gene-centric copy number data of cell lines. To determine whether a gene is DA for a specific cell line, we calculated the mean of that gene’s copy number values across all cell lines and the difference between the copy number value of that gene in the cell line under consideration and the mean of the copy number values of that gene in the other cell lines. For each cell line, the highly amplified genes were chosen based on mean ≥0.5 and difference ≥1, and highly deleted genes were chosen based on mean ≥0.5 and difference ≤−1.Step 5: Computing sample-cell line pathway activity-based similarity scoreIn this step, we computed Spearman rank correlation between each sample-cell line pair using the sample-specific deconvoluted expression, sample-specific deconvoluted DNA methylation, and sample’s gene-centric copy number values. All DM genes that occurred in an enriched pathway of a sample or cell line were used to compute the Spearman rank correlation between each sample-cell line pair via FindSim() function in the R package (Fig. 1, Step 5). Similarly, using DE and DA genes that were present in the gene list of the union of enriched pathways, we compute the Spearman rank correlation between each sample-cell line pair. Using DM and DE genes, we used the sample’s deconvoluted profiles to compute Spearman rank correlation, whereas for DA genes, we used the sample’s gene-centric copy number profile to compute Spearman rank correlation (Fig. 1, Step 5). We scaled the Spearman similarities to the range of 0 to 1 using min-max normalization and computed an average similarity score taking a mean of expression-based correlation, DNA methylation-based correlation, and copy number-based correlation for each sample-cell line pair. Since we did not have DNA methylation data and copy number data for all the cell lines, there were sample-cell line pairs that had only expression-based scores.We presented the CTDPathSim2.0 pipeline in this section, but our R package can be used to compute similarity scores without using CNA data i.e., running CTDPathSim1.0. In that case, the FindSim() function of the package can be used to run the pipeline without DA genes (please check the vignette of the CTDPathSim2.0 in Supplementary Software 1 for these details).Running state-of-the-art-methodsWe compared CTDPathSim2.0 with our previous method CTDPathSim1.0 and three other state-of-the-art methods, namely, TSI, TC analysis, and Celligner by running them on each of 22 different cancer type cohorts from TCGA and cancer cell line datasets from CCLE. For CTDPathSim1.0, we computed only DNA methylation-based and gene expression-based similarity scores and averaged them to get a unified similarity score for each sample-cell line pair in a cohort.Running TSI methodFor the TSI method, we performed z-normalization of RNA-Seq data for each gene in the samples and cell lines and performed all the required steps to compute tumor sample-cell line similarity scores based on the TSI paper7. We applied SVD on the normalized gene expression matrix of samples. Then each sample was projected into SVD space by measuring its correlation to the first 16 Eigenarrays (i.e., left singular vectors) that explain the most variance of the data. Similarly, we projected each cell line into SVD space. We computed a Pearson correlation score between each patient and cell line as a similarity score, namely, the TSI score in the SVD space. We applied z-normalization on TSI scores computed between each cancer sample and cell line.Running TC analysisFor TC analysis, we used 30,681 common genes between CCLE and TCGA RNA-Seq gene expression datasets with log2 transformation. Then, we rank-transformed gene RPKM (reads per kilobase of transcript per million reads mapped) values for each CCLE cell line and ranked all the genes according to their expression variation across all CCLE cell lines. The 1000 most variable genes were kept as “marker genes” to compute Spearman rank correlation, namely, TC scores, between each cell line and sample pair using the respective expression profiles.Running CellignerConsidering Celligner’s global aligning approach of gene expression data between cell lines and tumor samples that depends on various cancer types and the composition of the dataset, we ran this pipeline as described in their paper utilizing the RNA-Seq data in TPM (transcript per million reads) of 12,236 tumor samples and 1249 cell lines as log2(TPM + 1) transformation from Cancer Dependency Map [https://depmap.org] data portal for 37 different cancer types. We used the gene expression data without ‘non-coding RNA’ and ‘pseudogene’ as input in the Celligner pipeline. Celligner first performed unsupervised clustering of each dataset and removed the expression signatures with excess intra-cluster variance in the sample compared to cell line data. Then, using DE genes between clusters in the data it ran contrastive Principal Component Analysis (cPCA) followed by a batch correction method, MNN (mutual nearest neighbors) that aligned similar sample-cell line pairs to produce corrected gene expression data. We further filtered this gene expression matrix for 22 different TCGA cancer types utilized in the study and 1015 CCLE cell lines that were common to CTDPathSim2.0. Then, we performed PCA on the Celligner-aligned data and took the Euclidean distance between each cell line and patient sample in PCA space (using the first 70 principal components following59) as a similarity measure.Performing k-means clustering and Random Forest analysisWe filtered annotation labels with disease type (i.e., cancer type) and subtype for which at least ten tumor samples were available in TCGA. To display the clusters of patient tumor samples based on their similarity scores with 1018 cell lines, we used 7228 tumor samples with 19 distinct cancer type labels and 28 distinct cancer subtype labels. For clustering analysis, we utilized the patient-cell line similarity matrix. This matrix was constructed with similarity scores as entries, where each row represents a patient, and each column represents a cell line.Using this matrix, we performed k-means clustering on patient tumor samples based on their cell line-specific similarity scores. We then compared these clusters with annotated cancer type groups i.e., 19 disease labels and 28 subtype labels. We used R base package Stats with k = 19 and k = 28 for getting similarity matrix-based clusters of two groups while setting all other parameters as default. We also compared the resulting clusters with those derived from the original omics data such as gene expression, DNA methylation, and CNA. Using the RI, a measure ranging from 0 (no agreement) to 1 (complete agreement), we quantified the similarity between two clustering groups.For the Random Forest analysis, we treated each cell line as a feature and used the patient-cell line similarity scores as feature value. We partitioned our data, using 75% of the tumor samples (selected randomly) for training and the remaining 25% for testing. To train the model and tune the hyperparameters, we performed 10-fold cross-validation using the training data. Then the trained modeled was tested on the held-out test data. This approach allowed us to evaluate the model’s performance and adjust parameters as needed to optimize accuracy. The Random Forest classifier was implemented using the ‘randomForest’ package in R.Statistics and reproducibilityThe statistical analyses were conducted using publicly available datasets, as detailed in the datasets section. To ensure reproducibility of our results, we have released CTDpathsim2.0 as open-source R software and made all processed datasets60, and scripts available.

Identifying cell lines across pan-cancer to be used in preclinical research as a proxy for patient tumor samples

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

Hot Topics

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Popular Articles

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle