Multiregional transcriptomic profiling provides improved prognostic insight in localized non-small cell lung cancer

Multiregional RNA-seq reveals transcriptomic intratumor heterogeneityTo investigate the transcriptomic intratumor heterogeneity (RNA-ITH) and its effect on patient prognosis in NSCLC, we analyzed the multiregional RNA-seq data from the TRACERx (TRAcking non-small cell lung Cancer Evolution through therapy [Rx]) project32,35,39,40. In the pursuit of robust and meaningful insights from the dataset and considering the data was collected and published over distinct timeframes, we partitioned the data into three cohorts, each serving a unique purpose (Fig. 1a, Supplementary Table 1). The first cohort of TRACERx (TRACERxC1) was released in 2019 (64 patients, 164 samples)35. We studied the RNA-seq data from this cohort, aiming to explore the intricacies of RNA-ITH and identify how to integrate RNA-ITH to improve the development and application of transcriptome-based signatures. Also, our results from this cohort are potentially comparable to previous studies, since it is widely investigated. The second cohort (TRACERxC2) we analyzed was published recently in 202340. To ensure it is an independent validation of our findings, we excluded the TRACERxC1 cohort from the latest dataset (261 patients, 652 samples). To control for histological subtypes that may confound our analysis, our third group exclusively focused on lung adenocarcinoma (LUAD) samples, the predominant histologic subtype in NSCLC and TRACERx project (TRACERxLUAD: 187 patients, 472 samples). To further confirm our findings, we also generated multiregional RNA sequencing data of 64 tumor regions from 25 NSCLC patients (MDAMPLC, MD Anderson Cancer Center Multiregional Profiling in Lung Cancer) (Fig. 1a, Supplementary Table 1).Fig. 1: Transcriptomic intratumor heterogeneity (RNA-ITH) in NSCLC.a Overview of the study. Multiregional RNA-seq data from two cohorts were used in this study. The main findings were identified in the TRACERxC1 cohort and confirmed in the TRACERxC2, TRACERxLUAD, and MDAMPLC cohorts. b, c Violin plot showing that RNA-ITH is lower than intertumor heterogeneity in both b TRACERxC1 and c MDAMPLC cohorts. d, e Patient-specific RNA-ITH in the d TRACERxC1 and e MDAMPLC cohorts. Each dot represents the paired tumor region from the patient. The curve indicates the patient-specific RNA-ITH, calculated by averaging the RNA-ITH of all region pairs. Significance is determined by the Wilcoxon Rank-Sum one-sided test.For each pair of samples, we calculated the transcriptomic heterogeneity by comparing their expression profiles. The divergencies for within-patient region pairs reflect the RNA-ITH of individuals, which are compared with the intertumor heterogeneity, namely, divergencies for between-patient sample pairs (see Methods for details). As expected, the RNA-ITH is significantly lower than the intertumor heterogeneity in both the TRACERxC1 cohort (Fig. 1b) and the MDAMPLC cohort (Fig. 1c), which is consistent with the previous report35.In Fig. D and E, we displayed the normalized expression divergencies of all within-patient region pairs for each patient and used the average value (curve) to quantify the RNA-ITH at the patient level. In both cohorts, we observed substantial variations in the RNA-ITH among patients, which may offer critical information on patient prognosis that was rarely utilized by previous RNA-seq-based prognostic signatures or models.Integrating gene-level RNA-ITH may improve biomarker designBiomarker selection is one of the determinants of the performance of gene expression-based models. To further investigate the RNA-ITH in individual gene expression, we decomposed the total expression variance of each gene into between- and within-patient variances, which reflected its intertumor and intratumor expression diversity, respectively. As shown in Fig. 2a, the intratumor variation is highly correlated with the intertumor variation (R = 0.732, p < 0.001), indicating that genes informative for distinguishing patients (i.e., potential biomarkers) also tend to have high diversity among tumor regions within the same tumors, and vice versa indicating that leveraging RNA-ITH may improve biomarker design.Fig. 2: Association of regional gene expression with patient survival in the TRACERxC1 cohort.a Scatter plot showing a strong correlation between transcriptomic intra and intertumor variance at the gene expression level. The Pearson correlation coefficient (R) and the p-value (p) are shown in the figure with red text. b Volcano plots demonstrating the survival association of genes when their average, maximal, or minimal expression across all regions was used to represent its patient-specific expression level. The hazard ratio and p-value were calculated using the univariate Cox regression that fitted the expression of each gene as a continuous variable. c Venn diagram showing the numbers and overlap of survival-associated genes based on their average, maximal, and minimal expression, respectively. d Scatter plots comparing the C-index of selected prognostic genes using different representative expressions. Hazardous (HR > 1, p-value < 0.01) and protective genes (HR < 1, p-value < 0.01) were selected based on average expression and compared with results using maximal and minimal expression separately. The text labels the percentage of genes in that area. e, f Kaplan–Meier curve showing the recurrence-free survival of patients with high(red) or low(blue) expression of MDS2 (e) or KAT6B (f) using the median as a cutoff. The maximal expression of the hazardous gene MDS2 and the minimal expression of the protective gene KAT6B is more prognostic. The shadow represents the 95% confidence interval. The survival analysis measures disease-free survival.To investigate the impact of RNA-ITH on the prognostic association at the gene level, we utilized the average, maximal, and minimal expression of each gene across all tumor regions within the same tumors to represent a patient-level expression. The prognostic association of resultant average, maximal, and minimal expression values was examined using univariate Cox regression models (Fig. 2b). Based on average expression, we identified a total of 631 prognostic genes (p < 0.01), including 340 hazardous (Hazard Ratio, HR > 1 for disease-free survival, DFS) and 291 protective (HR < 1) genes. The use of the maximal expression was more sensitive to identifying hazardous genes, resulting in 983 hazardous but only 46 protective genes (Fig. 2b). It captured the largest number of hazardous genes, including the majority identified using the average expression approach (Fig. 2c). In contrast, the use of the minimal expression was more sensitive to identifying protective genes, resulting in 53 hazardous and 583 protective genes (Fig. 2b). This approach had the advantage of protective gene identification over than average expression approach (Fig. 2c). Therefore, taking advantage of multiregional RNA-seq in the context of RNA-ITH may offer more prognostic biomarker candidates and improve the development of prognostic models.Furthermore, the hazardous genes selected based on average expression demonstrated higher prognostic association when their maximal expression was used for survival analysis, as indicated by higher concordance indices and more significant p-values (Fig. 2d, Supplementary Fig. 1). In contrast, protective genes identified by average expression were more prognostic when their minimal expression was used in the survival analysis (Fig. 2d, Supplementary Fig. 1). As an example shown in Fig. 2e, f, the maximal expression of proto-oncogene MDS241,42 (hazardous gene, HR > 1) and the minimal expression of the tumor suppressor KAT6B41,42 (protective gene, HR < 1) achieved the best prognostic stratification of patients. Likely, the tumor region with the extreme expression of the prognostic gene represents the most aggressive or resistant (to treatments) region within the tumor and thus poses the strongest prognostic effects on patient survival. Taken together, these findings suggest that the performance of single region-derived signatures may be greatly improved in the application when integrated with RNA-ITH through multiregional profiling assays.Improve performance of prognostic signatures with RNA-ITHSince gene level- RNA-ITH has demonstrated the potential to improve prognostic biomarker selection and application, we next examined whether consideration of RNA-ITH can improve existing prognostic signatures.To this end, we adopted two methods to calculate patient-level risk scores from multiregional expression data (Fig. 3a).Fig. 3: Calculation of patient-specific risk scores using multiregional RNA-seq data.a Two methods for calculating patient risk scores based on gene signatures. Method 1: Transform the expression of signature genes into patient-specific values and then compute the risk score of patients based on the signature. The transformation function calculated the average(Avg.)/maximal (Max.) (Adj.)/minimal (Min.) gene expression across all regions of each individual or summarized the maximal expression of hazardous genes and the minimal expression of protective genes (Adjusted, Adj.) or the reverse calculation (Rev.). Method 2: Apply the gene signature to all regions of a patient to obtain region-specific risk scores and then transform them into individual-level risk scores. The transformation function calculated the average(Avg.)/maximal (Max.) (Adj.)/minimal (Min.) region-specific scores of each individual. b–d Performance of three different prognostic signatures: b ORACLE, c WTGS, and d PACEG applied with eight functions from two methods of quantifying patient-specific risk score in the TRACERxC1 cohort. In the Hazard Ratio column, a 95% confidence interval was shown as a dotted line. The survival analysis measures disease-free survival.Method 1 (M1)Transformed gene expression. For each signature gene, the expression values across all regions from a patient were combined using transformation functions to obtain patient-level expression, which was then used to calculate the risk score of that patient. We tested five different transformation functions: (1) Average (Avg.), (2) Maximal (Max.), (3) Minimal (Min.) Function calculated average/maximal/minimal gene expression across all regions within each tumor, respectively; (4) Adjusted Function (Adj.) selected the maximal expression of hazardous genes (positive coefficients) and the minimal expression of protective genes (negative coefficients) across multiple regions within the same tumor, which was supposed to achieve the best performance based on the gene-level results; and (5) Reverse Function (Rev.) calculated maximal expression of protective genes and minimal expression of hazardous genes, which was anticipated to get the worst accuracy.Method 2 (M2)Region-specific risk score. The signature scores were calculated for each tumor region, and then the transformation function was applied to gain the patient-level scores based on the region-specific scores. The transformed functions could be Average (Avg.)/Maximal (Max.)/Minimal (Min.) Function that calculated average/maximal/minimal region scores across all regions of each individual, respectively.ORACLE (outcome risk associated clonal lung expression) signature was developed in NSCLC from genes with low intra-tumor but high intertumor diversity using TRACERxC1 multiregional RNA-seq and TCGA LUAD RNA-seq data35. It was a hazardous signature for which higher scores denoted higher risks. We first applied the above-mentioned two methods to this signature to assess whether its prediction accuracy could be improved when RNA-ITH was taken into consideration. As shown in Fig. 3b, the signature score is based on the Adj. and Max. Functions of Method 1(M1-Adj and M1-Max) achieve the best performance in the TRACERxC1 cohort, as indicated by greater C-Indices and more significant p-values. The improved performance of M1-Adj (p-value = 0.012, C-index = 0.67) is in line with the prognostic results of individual genes, supporting our hypothesis that integrating RNA-ITH improves the existing signature application. In addition, the M1-Max achieves a comparable accuracy as the M1-Adj (p-value = 0.011, C-index = 0.671), presumably because most of the ORACLE signature genes (19 out of 23) are hazardous, defined by the positive coefficients. As for Method 2, the results of M2-Max achieve a more significant prognostic association (p-value = 0.03, C-index = 0.639), while the scores based on M2-Avg and M2-Min are not significant (Fig. 3b).To validate our findings, we extended our analysis to the TRACERxC2 cohort, exclusively consisting of independent patients to the TRACERxC1 cohort to ensure the independence of the analysis (Supplementary Fig. 2a). With a larger cohort size and enhanced statistical power, our investigation consistently highlighted the superior performance of the M1-Adj (p-value = 0.0001, C-index = 0.61) and the M2-Max (p-value = 0.0007, C-index = 0.603).To control the potential impact of histology on our analysis, we sought to perform the analysis within the same histology. As the ORACLE signature was originally developed for lung adenocarcinoma, we conducted survival analysis within the TRACERxLUAD cohort (Supplementary Fig. 2b). In this context, our results revealed that both the M1-Adj (p-value = 3 × 10−5, C-index = 0.627) and the M2-Max (p-value = 0.0003, C-index = 0.616) outperformed the M2-Avg, demonstrating the significant enhancement in prognostic capabilities achieved by incorporating RNA-ITH into the ORACLE signature.We also tested the ORACLE signature integrated with the two methods above in our MDAMPLC cohort. Although the p-value was not significant due to the small sample size, we observed the same trends that the Max. Function of Method 1 and 2 enable prediction improvement compared to Avg. Function (Supplementary Fig. 2c). In summary, even though the ORACLE signature is selected from genes with low ITH, it can be improved when considering RNA-ITH in its application. In addition to ORACLE, the finding that integrating RNA-ITH will improve the performance of expression-based signatures was also supported by nine public hazardous signatures evaluated in the same way (Table 1, Supplementary Table 2, Supplementary Table 3).Table 1 Nine public signatures improved by multiregional RNA-seq dataAs ORACLE and the other nine existing signatures only harbored a small subset of genes, the impact of RNA-ITH in prognostic signature might not be fully captured. Therefore, we next tested another signature called whole-transcriptomic gene signature (WTGS), which was a protective signature (higher score indicating longer survival) using all genes for the prognostication43. In this signature, each gene was assigned a weight based on its prognostic significance43. As shown in Fig. 3c, the survival analysis with the five transformation functions of Method 1 was performed in the TRACERxC1 cohort. Again, the best performance is achieved by the M1-Adj (p-value = 0.019, C-index = 0.691). Signature scores based on M1-Avg, M1-Max, and M1-Min result in weak prognostic associations, while the score based on M1-Rev is not prognostic (Fig. 3c). Using Method 2, the best performance is obtained by M2-Min (p-value = 0.036, C-index = 0.674) in line with the fact that WTGS is a protective signature (Fig. 3c).After validation using the TRACERxC2 cohort, M1-Adj (p-value = 5 × 10−7., C-index = 0.65) and M2-Min (p-value = 5 × 10−5, C-index = 0.623) demonstrated great improvements compared to the corresponding Avg. Function (Supplementary Fig. 3a). When narrowing down to the adenocarcinoma subtype, both M1-Adj (p-value = 3 × 10−7, C-index = 0.687) and M2-Min (p-value = 5 × 10−5, C-index = 0.65) continued to exhibit superior prognostic value in the TRACERxLUAD cohort (Supplementary Fig. 3b).Consistently, the M1-Adj and the M2-Min achieve the highest C-index in our MDAMPLC cohort (Supplementary Fig. 3c). Taken together, these results indicate that the RNA-ITH has a remarkable impact on the performance of prognostic signatures, and when multiregional expression data is available, M1-Adj achieves the best prognostic performance.The PACEG gene signatureBiswas et al. utilized stably expressed genes with high intertumor but low intratumor diversity, as revealed by the TRACERxC1 multiregional expression data (defined as Q4 genes), and developed the ORACLE signature35. Although these genes are, in principle, less impacted by RNA-ITH, we observed improved performance using the Max. and Adj. Functions of Method 1 (Fig. 3B). Considering the high correlation between intertumor and intratumor variations at the gene level (Fig. 2a), the informative genes for patient stratification had probably been missed by the filter used in ORACLE. Indeed, only ~6.6% of genes were identified as the Q4 genes for the development of ORACLE, with many prognostic genes likely excluded35.To improve ORACLE and further demonstrate the potential of integrating RNA-ITH in the signature application, we developed a new gene signature called PACEG (Prognosis-Associated Clonally Expressed Genes) using a similar procedure as ORACLE (see “Methods” for details). A critical finding of ORACLE is that Q4 genes were more likely to be clonal genes whose expression was driven by their copy numbers in the dominant clone ORACLE35. Instead of focusing on Q4 genes, PACEG selected signature genes from clonal genes identified from paired RNA-seq and copy number variation (CNV) data of LUAD (see Methods for details). As such, PACEG was developed fully based on LUAD data, which ensures its independent and unbiased application in the multiregional cohorts. In total, the signature contains 26 genes non-overlapped with ORACLE, including sixteen hazardous and ten protective genes (Supplementary Table 4).Consistent with the previous 11 signatures, integrating PACEG with M1-Adj achieves the best performance (Fig. 3d, p-value = 0.006, C-index = 0.693) in the TRACERxC1 cohort. The M1-Max results in a slightly more significant p-value, but a smaller C-index. Similarly, the M2-Max demonstrates the highest performance (p-value = 0.012, C-index = 0.661) compared to M2-Avg and M2-Min (Fig. 3d). Within the TRACERxC2 cohort, it is noteworthy that the M1-Adj (p-value = 5 × 10−5, C-index = 0.614) and the M2-Max (p-value = 0.0002, C-index = 0.603) consistently exhibit the best performance (Supplementary Fig. 4a). Also confirmed in the MDAMPLC cohort, M1-Adj and M2-Max show the highest C-index among all transformation functions in the same methods (Supplementary Fig. 4b). Of note, when applied to the multiregional data, PACEG has a much better prediction accuracy than the ORACLE signature, especially with M1-Adj (Supplementary Table 5), possibly from the inclusion of more prognostically informative genes (non-Q4 genes).Considering the signature was originally developed using TCGA LUAD data, we also assessed its performance within the adenocarcinoma subtype (TRACERxLUAD). Consistently, M1-Adj (p-value = 3 × 10−6, C-index = 0.642) and M2-Max (p-value = 8 × 10−5, C-index = 0.621) demonstrated the best results (Supplementary Fig. 4c). To examine whether PACEG provides independent prognostic values after considering established clinical factors, we performed multivariable Cox regression analysis, which includes variables such as the PACEG score calculated from the M1-Adj (PACEG-M1-Adj), smoking status, stage, age, and gender (Supplementary Table 6). As a result, PACEG-M1-Adj turned out to be the strongest predictor in the model (HR = 1.69, p-value = 0.0008), with a significant p-value after adjusting for those clinical factors. This result indicates that PACEG-M1-Adj offers valuable insights into survival predictions while complementing those important clinical factors. Integrating the expression-based signature with multiregional information not only improves the clinical utility of the signature itself but may surpass the predictive capability of several crucial clinical prognostic variables.The prognostic impact of RNA-ITH in the tumor microenvironmentThe infiltration level of immune cells in the tumor microenvironment has been reported to be associated with the prognosis of NSCLC patients. Our previous study has also demonstrated that a higher degree of T cell receptor (TCR) ITH was associated with an increased risk of postsurgical recurrence of NSCLC44. Next, we investigated the prognostic effect of tumor immune microenvironment ITH by computationally inferring the infiltration of major immune cell types from the multiregional RNA-seq45,46,47 (see “Methods” for details). After obtaining the immune infiltration scores in all tumor regions, we converted them into patient-specific infiltration scores by calculating the average, maximal, and minimal values and investigated their association with prognosis using Cox regression.As shown in Fig. 4a, based on the average infiltration level, we found that Naïve B, Memory B, CD8+ T, and CD4+ T cells are protective immune cells with higher infiltration associated with longer survival, while Monocytes predominance is hazardous in NSCLC, consistent with previous studies47. Importantly, for protective immune cells, the more accurate prediction is achieved by using the minimal infiltration scores as indicated by both the p-values and C-indices, while for the hazardous immune cell, Monocytes, the best prognostic association is observed when the maximal infiltration level is used. Consistently in the MDAMPLC cohort, the highest C-indices of Naïve B, Memory B, CD8+ T, and CD4+ T cells are from using the minimal infiltration regions, while the highest C-index of Monocyte is obtained with the maximal infiltration regions (Fig. 4b). These results suggest that the tumor regions with the least favorable immune microenvironment contribute the most to the overall prognosis of patients.Fig. 4: Association of immune cell infiltration with patient survival.a Forest plot showing the survival analysis results of six immune cells in TRACERxC1 cohort. The average (Avg.), maximal (Max.), and minimal (Min.) infiltration values of each immune cell were used to obtain the patient-level immune infiltration and then were analyzed with univariate Cox regression. In the Hazard Ratio column, a 95% confidence interval was shown as a dotted line. b The bar graph of the C-index evaluating the same immune cells in the MDAMPLC cohort. The infiltration level of six immune cell types was calculated in all regions of each patient. The minimal infiltration level of Naïve B, Memory B, CD8+ T, and CD4+ T cells (protective) but the maximal infiltration level of Monocyte (hazardous) achieves the highest prognostic association. The survival analysis measures disease-free survival.Prognostic risk variation across different regions and signaturesHerein, we revealed the impact of RNA-ITH at the gene expression level, gene signature level and tumor microenvironment level (Supplementary Fig. 5). For both tumor-intrinsic gene signatures (ORACLE, WTGS, and PACEG) and tumor immune microenvironment signatures (immune cell infiltration), we have shown that tumor regions with the least favorable signatures scores (i.e., maximal for hazardous and minimal for protective signatures) are more informative for prognostication. We next sought to investigate whether the same regions within each tumor would be defined as the least favorable by different signatures. We constructed univariate Cox regression models for the three tumor-intrinsic prognostic signatures and five immune cell signatures. The models were trained based on the maximal scores from the three hazardous signatures (ORACLE, PACEG, and Monocyte) and the minimum scores for the five protective signatures (WTGS, Naïve B, Memory B, CD8+ T, and CD4+ T cells). Then all models were applied to the expression data of all tumor regions to calculate the region-specific risk scores. After normalization to make the risk scores comparable, patients with more than three tumor regions were included for further analysis (Supplementary Fig. 6).As shown in Fig. 5a, overall, the risk scores from the three tumor-intrinsic gene signatures (ORACLE, PACEG, and WTGS) are highly correlated with each other. However, the immune cell signatures fall into two groups: CD8+ T, CD4+ T, and Monocyte signatures are correlated, while Memory and Naïve B cell signatures make similar predictions with tumor-intrinsic signatures. For each gene signature, the predicted prognostic risk scores from different regions within the same tumors overall are similar, with some exceptions whereby substantial variations among different tumor regions are observed (e.g., CRUK0035 by the CD8+ T cell signature) (Supplementary Fig. 6). Of note, the WTGS signature, which uses the whole transcriptome for risk prediction, is the least influenced by RNA-ITH.Fig. 5: Region-specific risk scores predicted by different gene signatures.a Heatmap showing the correlation between risk scores predicted by the eight signatures. b, c Heatmap displaying the risk scores of different regions from selected patients in b TRACERxC1 and c MDAMPLC cohorts predicted by eight gene signatures. The risk score was normalized by subtracting the median value and scaled to the (−1, 1) interval“*” represents the highest risk scores across different regions of the same patients. “+” shows the second highest risk scores.More interestingly, we found that the eight signatures identified the same tumor regions to carry the highest risks for recurrence in some patients (Fig. 5b, c). When we divided the signatures into intrinsic gene signatures (ORACLE, PACEG, and WTGS) and the tumor immune microenvironment signatures (Monocyte, Naïve B, Memory B, CD8+ T, and CD4+ T cells), there was more consistency within the two groups of signatures (Supplementary Fig. 6). For the three tumor-intrinsic gene signatures, the same regions in 75% (21 out of 28) TRACERxC1 cohort and 71% (10 out of 14) MDAMPLC cohort are predicted as the top two high-risk regions within the same tumors (e.g., CRUK0050), which suggests that these regions may be sensitive to chemotherapy. The five tumor immune microenvironment signatures identify the same regions as the top two high-risk regions in 43% (12 out of 28) TRACERxC1 cohort and 86% (12 out of 14) MDAMPLC cohort (e.g., CRUK0050), which indicates that these regions may be sensitive to immunotherapy. These findings suggest that the most aggressive and/or the least immune-infiltrated tumor regions may have the most prognostic impact.

Hot Topics

Related Articles