Facilitating integrative and personalized oncology omics analysis with UCSCXenaShiny

Omics datasets curation from UCSC Xena

We have curated multi-omics pan-cancer datasets from UCSC Xena data hubs (https://toil.xenahubs.net, https://gdc.xenahubs.net, https://tcga.xenahubs.net, https://pancanatlas.xenahubs.net) for each of the TPC databases. For TCGA, the 14 selected datasets cover 7 types of molecular profiling (Gene Expression, Transcript Expression, DNA Methylation, Protein Expression, miRNA Expression, Gene Mutation, Copy Number Variation). Gene expression and transcript expression each have 3 datasets that differ in normalization method (TPM (transcripts per million), FPKM (fragments per kilobase of transcript per million mapped reads), Count). In addition, copy number variation and methylation profiling have alternative datasets available owing to different identification algorithms or sequencing platforms. For the other two projects, we selected a total of 8 datasets covering 5 types of molecules (Gene Expression, Promoter Activity, Gene Fusion, miRNA Expression, APOBEC Mutagenesis) for PCAWG and 5 datasets covering 4 types of molecules (Gene Expression, Protein Expression, Gene Mutation, Copy Number Variation) for CCLE.

Non-omics feature collection and calculation

Four categories of non-omics data, mainly for the TCGA database, were collected from the UCSC Xena platform or other resources for extensive analysis. First, basic clinical phenotypes of patients (e.g., Age, Gender, Tissue Code, Stage) were incorporated. Then, diverse features for five types of tumor indexes were compiled, including Tumor Purity, Tumor Stemness, Tumor Mutation Burden, Microsatellite Instability, and Genome Instability. Next, we estimated the immune infiltration conditions and pathway expression scores among TCGA samples.
In detail, immune cell compositions estimated by 7 deconvolution algorithms (CIBERSORT, CIBERSORT-ABS, EPIC, MCPCOUNTER, QUANTISEQ, TIMER, XCELL) were obtained from the TIMER2.0 website [9], where they had been calculated with the immunedeconv package [32]. The expression scores of hundreds of gene sets from three signature resources (HALLMARK, KEGG, IOBR) [12, 23, 33] were calculated through the ssGSEA method of the GSVA package [34] based on the “TcgaTargetGtex_rsem_gene_tpm” dataset of UCSC Xena. We then sought to collect the same non-omics identifiers for the PCAWG and CCLE databases. Specifically, the “tophat_star_fpkm_uq.v2_aliquot_gl.sp.log” dataset was used to evaluate immune infiltration and pathway activity of PCAWG samples. Data sourced outside of the UCSC Xena platform have been archived on Zenodo (https://zenodo.org/doi/10.5281/zenodo.4625639) [35] for accessibility and preservation.

Custom molecular signature design

A user-designed molecular signature can comprise \(n\) molecules from any one of the curated molecular types of the TPC databases. For each constituent molecule \(m\), a corresponding coefficient \(w\) can be set; its default value is 1. The signature score is then calculated as the sum of the products of the molecular values and their coefficients.

$$\mathrm{Signature\ score} = \sum_{i=1}^{n} w_{i} \times m_{i}$$

Two preprocessing modules of the TPC pipelines

The filtering module, located upstream in the pipelines, enables precise selection of tumor subpopulations with specific characteristics. Any identifier from the integrated TPC data can be used as the condition. Multiple data operators were designed for versatile filtering. In detail, “+” or “-” is used to retain or discard samples for categorical (character) conditions. For continuous (numeric) conditions, there are two types of operators, which set absolute (“>”, “<”) or percentile (“%>”, “%<”) thresholds.
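The signature score and the filtering operators described above can be illustrated with a minimal Python sketch. It is not the tool's own R implementation: the function names (`signature_score`, `percentile`, `apply_filter`), the sample records, the strict inequalities, and the linear-interpolation percentile are all assumptions made for illustration.

```python
def signature_score(values, weights=None):
    """Weighted sum of molecule values; a molecule without a weight uses 1."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in values.items())


def percentile(sorted_values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of a sorted list."""
    position = (len(sorted_values) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def apply_filter(samples, field, operator, operand):
    """Return the samples that satisfy one filtering condition.

    "+" / "-" retain / discard samples whose categorical value is in `operand`.
    ">" / "<" use `operand` as an absolute threshold.
    "%>" / "%<" use `operand` as a percentile of the current samples.
    """
    if operator == "+":
        return [s for s in samples if s[field] in operand]
    if operator == "-":
        return [s for s in samples if s[field] not in operand]
    if operator in ("%>", "%<"):
        cutoff = percentile(sorted(s[field] for s in samples), operand)
    elif operator in (">", "<"):
        cutoff = operand
    else:
        raise ValueError(f"unknown operator: {operator!r}")
    if operator.endswith(">"):
        return [s for s in samples if s[field] > cutoff]
    return [s for s in samples if s[field] < cutoff]


# Hypothetical samples with one categorical and two numeric identifiers.
samples = [
    {"id": "S1", "stage": "I", "TP53": 2.0, "KRAS": 1.0},
    {"id": "S2", "stage": "II", "TP53": 4.0, "KRAS": 3.0},
    {"id": "S3", "stage": "II", "TP53": 6.0, "KRAS": 1.0},
    {"id": "S4", "stage": "IV", "TP53": 8.0, "KRAS": 5.0},
]

# Signature of two molecules; KRAS gets coefficient 2, TP53 keeps the default 1.
for sample in samples:
    molecules = {name: sample[name] for name in ("TP53", "KRAS")}
    sample["score"] = signature_score(molecules, {"KRAS": 2.0})

# Discard stage IV samples, then keep those above the median signature score.
kept = apply_filter(samples, "stage", "-", ["IV"])
kept = apply_filter(kept, "score", "%>", 50)
kept_ids = [s["id"] for s in kept]
```

In this sketch the percentile threshold is evaluated on the samples that survive the earlier condition, so the order of the conditions changes the result.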
Ordered combinations of multiple conditions are also supported for intricate filtering operations. The other preprocessing module groups samples according to user-defined conditions. Two non-overlapping subgroup ranges can be flexibly set depending on the type of the selected condition.

Three analysis methods of the TPC pipelines

Fundamental tumor data analyses, including correlation, comparison and survival, are incorporated with various analysis and visualization parameters and implemented by the corresponding R packages. The ggscatterstats and ggbetweenstats functions of the ggstatsplot package [36] are applied for correlation and comparison analysis, respectively, as well as for their visualization. Regarding the analysis methods, two correlation coefficients (Pearson, Spearman) and two comparison options (Student’s t-test, Wilcoxon test) are supported. Two survival analyses between two groups of samples (log-rank test and univariate Cox regression) are implemented by the survdiff and coxph functions of the survival package [37]. Notably, if the grouping condition is continuous, alternative analyses are supported instead of pre-set groups: for the log-rank test, the optimal cutoff can be decided automatically, while for Cox regression, the continuous variable can be included directly in the model. Survival analysis is mainly available for the TCGA and PCAWG databases, with OS (overall survival), DSS (disease-specific survival), DFI (disease-free interval) and PFI (progression-free interval) endpoints for TCGA samples and the OS endpoint for PCAWG samples.

Three analysis modes of the TPC pipelines

Depending on the purpose, we have designed three modes for each analysis method, termed individual mode, pan-cancer mode and batch screen mode. In the basic individual mode, one identifier-specific analysis can be performed in the context of one cancer. In the pan-cancer mode, the same individual analysis can be performed consecutively across multiple cancers.
Various visualization plots are available to display the analytical results in the above two modes. The batch screen mode is used for identifying statistically significant candidate identifiers for one cancer. In detail, three ways are supported to choose batch identifiers. Besides one-by-one selection, users can upload a text file with eligible identifiers or directly select all identifiers of one data type. Users can also choose all identifiers in any one pathway gene set curated from the Molecular Signatures Database (MSigDB) [38]. Finally, three types of results can be downloaded, including the raw data, detailed statistical results and the visualization plot.

Well-organized pan-cancer HTML report

The quick generation of a pan-cancer analysis report enables the exploration of multi-faceted features of one molecule from seven omics types based on the integrated TCGA data. Given the prepared R markdown script, one well-organized report in HTML format can be rendered via the knitr [39] package. It comprises five sections of pan-cancer analysis, covering the relationships of one molecule with clinical phenotypes, survival events, tumor indexes, immune infiltration and pathway activity. The interactive tables and figures embedded in the report are implemented via the DT [40] and plotly [41] packages, respectively.

Custom download modules

Two download modules are additionally provided to support the custom acquisition of original datasets. The first module is used to directly fetch matrix data for samples and identifiers of interest from the TPC omics tumor data. Other non-omics data, such as survival information, can be obtained in full through the corresponding buttons.
The second module can be applied to download a subset of most matrix datasets in the UCSC Xena repository, where multiple molecules can be selected through the original identifiers or additional probe map annotation.

Pharmacogenomics data collection

We have collected drug screening datasets from six publicly accessible pharmacogenomics studies (Supplementary Table 7), including two datasets (GDSC1, GDSC2) from the Genomics of Drug Sensitivity in Cancer (GDSC) project [14], two datasets (CTRP1, CTRP2) from the Cancer Therapeutics Response Portal [15, 16], one dataset (PRISM) from the Cancer Dependency Map Consortium’s DepMap portal [17], and one dataset (gCSI) from the Genentech Cell Screening Initiative [18]. Six types of omics profiling (Gene Expression, Protein Expression, Copy Number Variations, DNA Methylation, Gene Fusion, and Gene Mutation) are collected from the Cancer Cell Line Encyclopedia (CCLE) and the ORCESTRA portal [42]. Given that cell lines overlap between different drug and omics datasets, we have used the shared cell lines to assess correlations, thereby maximizing the use of existing information. For instance, the designation “gdsc_ctrp1” indicates that the omics data are sourced from the GDSC project, while the drug sensitivity data are derived from the CTRP1 project. For the projects based on DepMap, GDSC, and CTRP, we employed the reported area under the dose-response curve (AUC) values as the measure of drug sensitivity. For the gCSI project, by contrast, the area above the dose-response curve (AAC) served as the indicator of drug sensitivity. Generally, lower AUC values signify enhanced sensitivity to drug treatment.
To ensure consistency across all datasets, such that a lower metric reflects higher drug sensitivity, we transformed the AAC values from the gCSI dataset using the formula max(AAC) – AAC.

Implementation of pharmacogenomic modules

The features for pharmacogenomic analysis include drug sensitivity and multiple types of molecular information, where drug sensitivity, mRNA expression, DNA methylation, protein expression, and copy number variation are continuous, and gene fusion, gene mutation, and gene site mutation are categorical. The first module investigates the relationship between molecular characteristics and drug sensitivity across diverse cell line types. Boxplots with Kruskal-Wallis tests or bar plots with Chi-squared tests are used for continuous or categorical features, respectively. To conduct t-SNE analysis, drugs with over 80% missing records and cells with over 50% missing records are excluded from each dataset. The R package impute [43] is then applied to impute the remaining missing data using the nearest neighbor averaging function. In the analysis of scaling feature associations, the determination of effect size and statistical significance varies depending on the types of features being compared:

1. For continuous features compared against continuous datasets (e.g., levels of drug A vs. all CNV features), the Spearman correlation coefficient (R) ranging from −1 to 1 is employed.

2. When assessing categorical features against categorical datasets (e.g., TP53 mutation events vs. all recorded gene fusions), the effect size is measured using the log2 odds ratio, with P values computed using the Chi-squared test.

3. In cases where continuous features are compared against categorical datasets or categorical features are compared against continuous datasets, the log2 fold change (events/wildtype) is used as the effect size metric, with P values derived from the Wilcoxon test.
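The three effect sizes listed above can be sketched in plain Python. This is an illustration under stated assumptions rather than the tool's own R code: the function names are hypothetical, ties in the Spearman coefficient receive average ranks, the log2 fold change is taken as the ratio of group means (the text does not say whether means or medians are used), and a 0.5 continuity correction is applied to the 2x2 table only when a cell is zero. The P values (Chi-squared and Wilcoxon tests) are omitted.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def log2_odds_ratio(both, first_only, second_only, neither):
    """log2 odds ratio of a 2x2 table of two categorical events."""
    cells = [both, first_only, second_only, neither]
    if 0 in cells:
        # Assumption: continuity correction to avoid division by zero.
        cells = [c + 0.5 for c in cells]
    return math.log2((cells[0] * cells[3]) / (cells[1] * cells[2]))


def log2_fold_change(event_values, wildtype_values):
    """log2 of mean(event) / mean(wildtype); both means must be positive."""
    return math.log2(mean(event_values) / mean(wildtype_values))


# Hypothetical inputs for each of the three scenarios.
rho = spearman([0.1, 0.4, 0.5, 0.9], [1.0, 2.0, 3.0, 4.0])
lor = log2_odds_ratio(both=8, first_only=2, second_only=2, neither=8)
lfc = log2_fold_change([4.0, 4.0], [1.0, 1.0])
```

Working on a log2 scale makes the categorical effect sizes symmetric around zero, so enrichment and depletion of the same magnitude differ only in sign.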

Users can set a threshold on the absolute value of the effect size, with default values of 0.2, 2, and 4 for the aforementioned scenarios. After reviewing the table, users can download the tabular result containing all significant pairs, with statistical significance determined at P < 0.05 and effect size above the user-defined threshold.

Statistics and reproducibility

Three main statistical analyses are supported in the Shiny web application. Correlation analysis can be performed using the Spearman or Pearson method. Comparison analysis between two groups can be performed using the Wilcoxon test or Student’s t-test. In general, the robust non-parametric tests (Spearman correlation analysis and Wilcoxon comparison analysis) are recommended. The log-rank test and univariate Cox regression can be used for survival analysis. The 95% confidence intervals are added to the Kaplan–Meier survival curves. All reported P values are two-tailed, and P value ≤ 0.05 is considered statistically significant for all analyses (n ≥ 3). All statistical analyses were conducted using R version 4.2.2.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
