The spatial and cellular portrait of transposable element expression during gastric cancer

TE expression during the progression of gastritis to early GCThe single-cell raw sequencing data generated at the NAG (3 samples), CAG (3 samples), IM (3 samples, one wild and 2 severe) and EGC (1 sample) stages was aligned to the human genome. Afterwards, the resulting BAM alignment files were processed using SoloTE29 to get matrices containing gene and TE expression per each cell. In order to get a preliminary overview on TE expression across the early GC cascade, a Principal Component Analysis (PCA) was carried out using a pseudobulk approach in which total expression was summarized at the sample level. This procedure was done 3 times, by using a matrix subsetted only to genes, another subsetted only to TEs, and one containing gene and TE expression (Fig. 2a). This analysis illustrates that gene expression alone recapitulates the differences between each time point, with the exception of the IM-Wild sample, which seems closer to the CAG samples. This suggests that only subtle changes occur during the IM-Wild timepoint. Using only TE expression, a similar distinction between timepoints can be seen. Although less variance is explained, the samples are still reasonably grouped in terms of their stage. As expected, the analysis using the complete Gene + TE matrix shows a result consistent with the ones done independently.Fig. 2Dimensional reduction analyses of the progression from gastritis to early gastric cancer. (a) Principal Component Analysis of the pseudobulked scRNA-Seq data, using gene expression only (first panel), TE expression only (second panel), and gene and TE expression (third panel). Each sample is color-coded according to the condition to which they belong. (b) Dimensional reduction plots using the t-distributed stochastic neighbor embedding (tSNE). Similarly, the first panel shows the tSNE plot using gene expression only, the second using TE expression only, and the third using gene and TE expression. The plots are color-coded according to the cell types identified.Next, to understand the influence of TEs at the cellular level, t-distributed stochastic neighbor embedding (tSNE) dimensional reduction analysis was carried out similarly, using the gene expression matrix, the TE expression matrix and the combined gene and TE expression matrix (Fig. 2b). The full Gene + TE matrix was processed first to generate a cell type annotation, resulting in 10 epithelial types, and 7 non-epithelial types. The epithelial population is comprised by Pit Mucous Cells (PMC), Neck-like, Gland Mucous Cells (GMC), Proliferative Cells (PC), Enteroendocrine, Chief, Enterocytes, Goblet, Metaplastic Stem-like Cells (MSCs) and Cancer cells. On the other hand, the non-epithelial population is comprised by T cells, B cells, Endothelial, Macrophages, Mast cells, Smooth Muscle Cells (SMC) and Fibroblasts. This cell type annotation was later transferred to the single-cell analyses done on genes and on TEs independently. In contrast to result obtained at the Gene level, the tSNE dimensional reduction of TE expression depicts more heterogeneity between cell types, in which PMCs, GMCs and MSCs seem to be more spread-out. It was previously shown that PMCs decrease and MSCs increase along the early GC cascade, and that there are transcriptional similarities between PMCs with both GMCs and MSCs, and between MSCs and Cancer cells6. Indeed, comparison of the cell type annotations with the unbiased clustering shows that some TE clusters are comprised by different populations (Supplementary Fig. 1). For example, cluster 0 is comprised by PMCs and GMCs, while cluster 1 and 2 span several types from PMCs, GMCs up to MSCs and Cancer cells. It has been proposed that mixed cells in the single-cell projection could be predictive of intermediate cell states31, and thus, it can be hypothesized that TE expression could represent such states. Interestingly, the opposite can be seen in some non-epithelial subtypes, such as T cells and B cells. These cells exhibit a well-defined clustering pattern, hinting at a differential activation of a subset of TEs that might be modulating gene expression32, with potential implications in T cell exhaustion33.Overall, these results show that although there is some variability in TE expression at the single-cell level, they are globally expressed throughout the progression from gastritis to EGC and recapitulate changes in each timepoint similar to those observed at the gene level.The single-cell expression of TEs in early GCGiven the global TE expression revealed by the previous analyses, I investigated what TEs could exhibit consistent or increasing expression from the premalignant gastritis status to EGC, and in what cell populations they might be enriched. To this end, the pseudobulked matrix was used, and a series of differential expression (DE) analyses were carried out with DESeq234, using NAG as the baseline condition. Then, TEs significantly up-regulated (having log2(Fold Change) > 0 and adjusted P-value ≤ 0.05) in CAG, IMW, IMS and EGC timepoints were selected. This resulted in a total of 2,581 TEs (Supplementary data 1, Supplementary Fig. 2). Although the EGC DE analysis might be statistically underpowered due to only having one sample, it still recapitulates changes reported in previous works (Supplementary Fig. 3). To also consider TEs that might already be activated in NAG, and whose expression remains constant throughout the EGC cascade (and thus, not appearing in the DE analysis), the TEs in the top 5% of highest expression across all stages were also selected (Supplementary data 1). This added 14,566 TEs to the set, resulting in a total of 17,147 TEs potentially associated with EGC progression (Fig. 3a). Furthermore, there is a consistent percentage of TE expression across the single-cell transcriptomes: NAG—37.822%, CAG—41.247%, IMW—50.282%, IMS—41.989%, EGC—40.931% (Fig. 3b).Fig. 3The single-cell profile of TEs across EGC progression. (a) Pseudobulked log2-normalized expression of TEs that were either differentially up-regulated at any time point with respect to NAG, or that were in the upper 5% of normalized expression at all time points. (b) tSNE plots showing the per-cell expression percentage of genes (first row) or TEs (second row) across all timepoints. The horizontal bars show the percentage of the transcriptome at each timepoint corresponding to genes (blue) or TEs (red). (c) Distribution of marker TEs across stages. (d) Class distribution of marker TEs of each cell type. (e) Dot plot depicting the top 5 marker TEs of each cell type. In bold, TEs selected to show as example in the following tSNE plots. (f) tSNE plots of selected TEs.Marker analysis revealed that out of the TEs selected in the previous step, 2,430 have increased expression in the different cell populations (Table 1). Some of the TE markers appear as enriched in more than one cell population, as evidenced in the numbers after de-duplication (Table 1, “Unique marker TEs”). Interestingly, a high proportion of these markers have locus resolution, which is essential to analyze their potential influence in gene regulation. Additionally, marker TE distribution per timepoint indicates that for most part they are equally distributed (Fig. 3c), suggesting that once their expression is increased in a cell type, it remains stable throughout the early GC progression. In Chief cells, a predominance at the CAG stage is observed, while Proliferative cells, Enterocytes, and Metaplastic Stem-like Cells seem to be predominantly expressed at the IMS stage, consistent with the emergence of these cell types during IM6. Analysis of the major TE types revealed that they are all present to varying degrees in the set of markers (Fig. 3d), with a clear predominance of SINEs. Although previous works have pointed out a role of LINE L1 and LTR HERV TEs in cancer30, there is evidence showing that SINE and DNA TEs are also involved by driving oncogene activation35. This is also consistent with a recent work in colorectal cancer in which TEs were found to modulate gene expression and that SINEs were predominantly expressed36.Table 1 Marker TEs per cell cluster.A caveat here is that the 10X single-cell data has a 3’ bias, and thus usually the terminal region of transcripts is captured and sequenced. Amongst TEs, the Alu family, part of the SINE group, has been reported to lead to truncated isoforms by acting as alternative transcription end sites37,38. In turn, out of the 927 marker SINE TEs, 875 (94.4%) were part of the Alu family, suggesting that in EGC these TEs could be effectively acting as premature transcription end sites. It has been proposed that the impact of these events could be associated with disease progression37,38, and indeed there is an example of such events in liver cancer, where Alu TEs were identified as the major TE becoming a terminal exon39. Alternatively, the predominance of Alu TEs could be explained by their genetic structure: these elements harbor both a linker and a terminal A-stretch40, which in turn could be causing internal poly-A priming during 10X 3’ single-cell sequencing.Marker TEs, which are defined as TEs with high expression, and expressed in a high proportion of the cells of a specific type (labeled as “pct.1”), could be classified in 2 groups based on their percentage expressed in the remaining cell types (labeled as “pct.2”): group 1—high pct.2; group 2—low pct.2. For example, some of the top PMC, MSCs, Cancer, Chief and Neck-like markers belong to group 1, while the top T cells and B cells markers belong to group 2 (Fig. 3e, Supplementary Fig. 4, Supplementary data 2). In turn, group 1 would define markers that have a gradient of expression, and in some cases, might be statistically enriched in more than one cell type (Table 1, Supplementary Data 2). As mentioned earlier, in the original work, it was observed that MSCs have a high transcriptional similarity to Cancer cells6. Here, I also noted that in terms of enriched TEs, there is also transcriptional similarity between these cell types, as indicated by the number of shared markers between them (29 common marker TEs, Supplementary Data 2). Similarly, it has been previously indicated that neck cells differentiate into chief cells6, and marker expression seem to coincide with this: the top neck cell markers are broadly expressed, while the top chief cells markers seem a bit more specific. To further illustrate these results, tSNE dimensional reduction expression is depicted for 3 examples (Fig. 3f): AluSx1, located in chr8:118,090,496-118,090,808; L1MEc, located in chr18:9,678,049-9,678,172; and AluSc8 located in chr9:75,994,376-75,994,721. AluSx1 can be classified to group 1 considering that it is expressed in a high number of cells, but enriched in MSCs. The same argument applies to L1MEc with high expression in Cancer cells, but expressed at lower levels in other cells. Finally, AluSc8, also appearing as a Cancer marker, shows expression restricted to the region comprised by Cancer, MSCs and Enterocytes. It is worth noting that the expression of these TEs was measured with locus resolution (i.e., having only uniquely mapped reads), making unlikely that the observed expression heterogeneity could be attributed to ambiguity in read assignment. As suggested earlier, such gradient of TE expression spanning from PMC, GMC to MSC and then Cancer could suggest intermediate status between these cell types that could be mediated by TEs. It has been reported that as EGC progresses, PMCs decrease and MSCs increase, and that they have some transcriptional similarity6, though it is unclear whether some PMC cells might be undergoing a malignant transformation to MSCs. If this is the case, these results would suggest that TEs are playing a role in that event.Altogether, the evidence presented shows that TEs are expressed throughout the progression from gastritis to GC, and that their expression characterizes the different cell types, potentially highlighting intermediate status. In addition, the high proportion of TEs identified with locus resolution suggest that TEs becoming transcriptionally active in EGC have accumulated discriminative mutations, allowing the unambiguous assignment of sequencing reads. In turn, this would support the idea that TEs are playing a role in EGC progression via epigenetic polymorphisms, where changes in the transcriptional activity of fixed TE copies characterize cellular differences30.TE expression is associated with the origin of cancer cellsAfter getting a global overview of TE expression, and assessing their cellular profile, I then asked if their expression is associated with the origin of cancer cells. To this end, I applied Trajectory Inference (TI) to model the pseudotemporal progression of cells from the normal to the malignant status using the dynverse R package15. Briefly, this package evaluates more than 50 TI methods and identifies the one most suitable to the dataset, which in this case was PAGA-Tree41. Then, dynverse represents the trajectory topology in a network of milestones where the cells are placed. The resulting trajectory was rooted at the milestone containing the highest number of NAG cells, and further analyzed (Fig. 4a).Fig. 4The single-cell trajectory of EGC. (a) Inferred pseudotemporal trajectory of the EGC dataset colored by condition: NAG in orange, CAG in yellow, IMW in cyan, IMS in light blue and EGC in purple. (b) Pseudotemporal trajectory colored by inferCNV score: minimum values in light yellow, and maximum values in dark purple. (c) Heatmap showing the expression of TEs associated with the increase in cell malignancy as measured by inferCNV scores. Colored dots above the heatmap correspond to a one-dimensional representation of the trajectory. Highlighted in red are the cells that go from the beginning of the trajectory to the Cancer milestone (d) Example of TEs that are expressed in the malignant (i.e., high inferCNV score) branches of the trajectory.The trajectory broadly recapitulates the progression from NAG to EGC: the majority of cancer cells appear in a lineage that originates from IMS cells, which in turn, originates from CAG cells. Notably, there are 2 milestones with a high number of IMS cells, with one of them continuing directly to the Cancer milestone. To add support to the trajectory in the context of cancer evolution, I also applied inferCNV42 to generate per-cell scores of copy number variations (CNVs), which has been used as proxy of malignancy development13. The inferCNV scores projected in the trajectory are also in concordance with the progression of CAG to IM, and subsequently to EGC (Fig. 4b). Cell type and marker gene analysis of the trajectory shows that the two branches with the highest inferCNV scores depict the transitions between different cell types (Supplementary Figs. 5 and 6). The first branch reveals a transition from Enterocytes, to MSCs and Cancer, while the second branch reveals a transition from a milestone comprised by PMCs and some Enterocytes to one comprised by MSCs and Goblet cells. Although these cell types share a link on the basis of their transcriptional similarity6, it is unclear whether they represent the actual cell evolution in EGC. In cancer, plasticity allows tumor cells to change between cell status43. Thus, these observations might be indicative of cellular plasticity occurring during EGC, which would explain the cell type diversity in some branches with a malignant profile as revealed by their high CNV scores.To better understand the impact of TE expression in the EGC cascade, the enrichment of TEs in the malignant milestones was assessed. TE expression seems to occur throughout the entire trajectory, and 111 were TI-selected TEs (Fig. 4c, Supplementary Fig. 7), with 50 of them corresponding to markers of different cell populations and the remaining 61 without significant cell type specific expression (Supplementary data 3). Broadly speaking, three types of expression patterns throughout the trajectory were revealed by this analysis: 1. TE showing high and consistent expression (Fig. 4d, “LTR37-int”), 2. TE showing moderate levels of expression (Fig. 4d, “MSTA” located in chr14:98,972,560-98,972,796) and 3. TE expression mostly restricted to the malignant milestone (Fig. 4d, “L1MEc”, located in chr18:9,678,049-9,678,172). For example, “L1MEc” also appeared in the marker analysis as a Cancer cells-enriched TE (depicted in Fig. 3d), indicating some agreement between the 2 analyses. Notably, the 50 TEs that are also cell population markers all have locus resolution, while none of the other 61 have. The lack of locus resolution is usually indicative of expression of TE copies with a negligible number of discriminative mutations (i.e., evolutionary younger copies), hence the difficulty on accurately assigning reads to specific instances in the genome26. In this line, a possible scenario is that many of these copies become transcriptionally activated concurrently, pointing out to a pattern of global activation. For example, the widespread expression of “LTR37-in”, a TE belonging to the group of retroviruses, would be in agreement with evidence indicating that HERVs undergo global activation in cancer44.Collectively, this analysis showed that by leveraging TI methods, TEs are potentially contributing to the evolution of cancer cells and to the transition and interplay between cell status during EGC progression. In turn, the detection of TEs in the premalignant milestones might be informative to their role as biomarkers for the early detection of GC.The spatial portrait of TE expression in GCThe single-cell findings indicate that TEs are expressed in early pre-malignant GC stages, and that their expression occurs in the different cell types, hinting at a role of TEs in the tumor microenvironment. However, two outstanding questions from these results are (1) is TE expression also occurring in the malignant GC? and (2) are TEs expressed in the tumor microenvironment? To address them, I studied 4 spatially-resolved transcriptomes from GC sections14. In the original study, the tumor and surrounding regions in these sections were annotated by pathologists. By leveraging the annotations, I assessed if TE expression is associated with the tumor. To calculate TE expression, I processed the dataset with SoloTE, and the resulting gene + TE expression matrices were further analyzed with STutility45. With STutility, I characterized the in situ expression of TEs and contrasted it with the pathologist-annotated tumor and normal epithelium regions. First, I studied global TE expression across all tissue sections to assess the extent of TE activation in GC.All 4 samples exhibit TE expression that is higher in the tumor regions and lower in the normal epithelium, indicating that enhanced TE expression is a hallmark of GC (Fig. 5a, “All TEs”). Furthermore, activation of TEs in the tumor region suggests that these elements could be involved in tumorigenesis. When observing the expression at the major TE type level, subtle patterns can be seen. For example, a clear enrichment of LTRs in tumors is revealed (Fig. 5, “LTR”), with LINEs and SINEs having a moderate increase in the tumor area, and higher expression in regions not annotated as normal nor as tumor (Fig. 5, “LINE” and “SINE”, respectively). Interestingly, DNA TEs are also enriched in tumors, despite their lower presence in the human genome compared to retroelements. Nonetheless, at the statistical level their expression is still relatively increased when comparing tumor versus normal regions (Fig. 5b). Collectively, these results bridge the findings obtained in the single-cell section of this work, by showing that, in addition to being activated in early GC stages, TEs are also expressed in the malignant tumor regions.Fig. 5TEs are enriched in the tumor regions of GC tissue sections. (a) Spatially-resolved expression of TEs when analyzed collectively (“All TEs), or at the class level (LTR, LINE, SINE, DNA). “H&E” column shows the pathologist-annotated normal and tumor regions in green and red, respectively. (b) Violin plots showing TE expression in the normal and tumor regions. Asterisks denote statistical significance at the following levels: ****p ≤ 0.0001, ***p > 0.0001 and p ≤ 0.001, **p > 0.001 and p ≤ 0.01, *p > 0.01 and p ≤ 0.05. Annotation of H&E images was done with Inkscape 1.3.2 (https://inkscape.org).The spatial transcriptomics analysis also revealed that TE expression extends beyond the tumor regions. Activation of TEs in the tumor microenvironment could also play a role in GC initiation and progression. As mentioned earlier, in the single-cell results it was also observed that TEs were enriched in non-tumor cell populations, hinting at this hypothesis. Furthermore, as observed in the hematoxylin and eosin-stained (H&E) tissue sections, the profiled sections display variability in their morphology (Fig. 5). In the original study, the pathologists selected the most heterogeneous gastric cancer sections in order to characterize intratumor heterogeneity14. Therefore, in addition to Normal epithelium (NE) and Tumor Tissue (TT), other regions identified in the tissue histological images were: Tumor tissue and glands (TTG), Intestinal metaplasia (IM), Serrated glandular structure (SGS), Lymphoid follicle (LF), Muscularis mucosa (MM), Peritumoral muscularis (PM), Muscle tissue (MT), Heterotopic cystic malformation (HCM), Lamina propria (LP), Blood − containing tissue (BCT), Connective tissue (CNT) (Fig. 6). Some of these regions were identified in one patient only: for example, Intestinal metaplasia in patient JJ, Serrated glandular structure in patient JJ62, and Heterotopic cystic malformation in patient ZL716, further highlighting intertumor diversity.Fig. 6The spatial portrait of Transposable Element expression in Gastric Cancer. (a) Annotated H&E image of GC tissue section of sample JJ, followed by the heatmap depicting the expression of spatially enriched TEs, the class distribution bar plots, and the spatial expression plots of representative TEs. (b) Annotated H&E image of GC tissue section of sample JJ62, followed by the heatmap depicting the expression of spatially enriched TEs, the class distribution bar plots, and the spatial expression plots of representative TEs. (c) Annotated H&E image of GC tissue section of sample ZL69, followed by the heatmap depicting the expression of spatially enriched TEs, the class distribution bar plots, and the spatial expression plots of representative TEs. (d) Annotated H&E image of GC tissue section of sample ZL716, followed by the heatmap depicting the expression of spatially enriched TEs, the class distribution bar plots, and the spatial expression plots of representative TEs. Annotation of H&E images was done with Inkscape 1.3.2 (https://inkscape.org).By leveraging the GC tissue annotations, I then investigated the spatial enrichment of specific TEs, in order to address whether TEs are also expressed in the tumor microenvironment. To this end, I applied the FindMarkers function to compare expression in the different tissue regions using the normal epithelium as control. This analysis showed substantial variability in the number of spatially enriched TEs detected on each sample: JJ62 – 148, JJ – 75, ZL716 – 57, ZL69 – 38 (Fig. 6, Supplementary Fig. 8, Supplementary data 4), which makes sense given the heterogeneity observed between the GC tissue sections. In agreement with the global TE analysis, specific TEs were enriched in the tumor regions of all samples. Sample JJ exhibits an overall increase in TE activity in the tumor and regions surrounding it, such as Lamina propria, Muscularis mucosa, Peritumoral muscularis, Lymphoid follicle, Connective tissue and Blood-containing tissue (Fig. 6a). Interestingly, this sample also has a region of Intestinal metaplasia, where enriched TEs were also identified. This finding provides further evidence of TE activation during intestinal metaplasia, as observed in the single-cell analyses. Additionally, Sample JJ62 shows a Serrated glandular structure next to the tumor region, and it has been reported that the region represents an intermediate step in the alteration of the normal epithelium during dysplasia, resulting in GC. TE activation between Serrated glandular structure and the tumor tissue was observed, implicating that TEs might be contributing to GC development (Fig. 6b). Interestingly, the Lymphoid follicles and muscular tissue surrounding the tumor also seems to show activation of TEs. The enhanced TE expression in the muscle tissue would match the single-cell result for smooth muscle cells. Sample ZL79 corresponds mostly to tumor tissue and Muscularis mucosa, with both regions having a signature of TE expression (Fig. 6c). Sample ZL716 display a large Heterotopic cystic malformation region with several Lymphoid follicles. Heterotopic cystic malformation might be associated with early GC46, and for most part, TEs detected in this sample seem to be expressed across it and the tumor regions (Fig. 6d). In terms of TE types, both intra- and inter-tumor, there seems to be as similar distribution, in which LINEs and SINEs are the most represented, followed by LTRs, and then DNA TEs. Interestingly, Muscularis mucosa seem to be characterized by expression of LTRs and SINEs, and in the case of sample ZL716, only LTRs. LTRs have long been associated with cancer47, and alterations in the muscularis mucosa might play a role in early GC48. Thus, this finding could suggest that aberrant activation of LTRs in this region could be associated with GC. In sum, these results indicate that TEs are expressed in both the tumor tissue and in the tumor microenvironment, which suggests that TEs might also contribute to GC progression through changes to the regions surrounding the tumors.Next, I asked if there are TEs conserved across the studied tissue sections. To this end, the number of TEs common between different combinations of samples was visualized in an upset plot (Fig. 7a). By using this result as guide, it was observed that 33 spatially enriched TEs appear in at least 3 out of the 4 samples, and these were selected to build the set of “top” spatial TEs. Visualization of the TE type distribution indicates that despite the difference in number of enriched TEs detected on each sample, LINEs seem to be predominant, followed by SINEs, and then LTR and DNA (Fig. 7b). In the top set, DNA TEs do not appear, indicating that their activation follows the inter-patient GC heterogeneity captured in these samples. Conversely, there are 9 leading TEs in the top set because they were detected in all the samples (Fig. 7c). These 9 TEs correspond to 1 LTR, 4 LINEs and 4 SINEs, with 6 of them having locus resolution. In contrast with the single-cell analysis, these results show more sparsity in terms of the number of TEs whose location is detected unambiguously in the genome. For instance, in the total 33 TEs of the top set, only 15 (45.5%) have locus resolution, with 6 of them being amongst the leading 9 TEs present in all datasets. Nonetheless, when assessing the overlap between the top spatial TEs and the single-cell results, 29 (87.9%) out of the 33 top TEs were also detected in the single-cell analysis, with 22 (66.7%) of these TEs also associated with the malignant milestones detected in the trajectory analysis (Supplementary Fig. 9).Fig. 7Spatially enriched TEs common between the GC tissue sections. (a) Upset plot of the spatially enriched TEs. The number of TEs at each given overlap, represented by a set of connected dots in the lower half, is indicated as a bar plot. (b) Class distribution of spatially enriched TEs on each sample, and for the top TEs (“Top”) which corresponds to those identified in at least 3 out of the 4 samples. (c) Feature plots of spatially enriched TEs identified in all samples, depicting their expression across the tissue sections. For TEs identified with locus resolution, their genomic location is indicated under their name.The TE-mediated gene regulatory network during GCTo ask about the potential role of TEs as regulators of gene expression, a methodology similar to previous works13,44 was applied: (1) using all the locus-specific TEs detected in both the single-cell and the spatial analysis, a gene-TE dictionary was built, considering all genes within 500 kbp of each TE; (2) Afterwards, to characterize the regulatory potential of TEs, the overlap with regulatory elements in GeneHancer and the ENCODE SCREEN candidate Cis-Regulatory Elements (cCREs) was assessed; (3) the gene-TE dictionary was filtered by considering either pre-defined interactions in GeneHancer or SCREEN, or if the gene was within 50 kbp of the TE; (4) finally, the Spearman correlation values were calculated for each gene-TE pair, and all interactions with correlation ≥ 0.3 were kept.To get an overview of the genes potentially regulated by TEs, two additional analyses were carried out. First, an automated literature search using the NCBI E-Utilities49 was carried out, and genes associated with publications in cancer were labeled as “Cancer gene”. Also, gene set enrichment analysis using the fgsea R package50 was performed, using the Kyoto Encyclopedia of Genes and Genomes (KEGG) terms, and the Gene Ontology (GO) terms, and filtering results to those having adjusted p-value ≤ 0.05.Following the aforementioned approach, a total of 3151 interactions were predicted. Based on the genomic overlap with known regulatory elements from the GeneHancer and ENCODE SCREEN database, the TEs potentially interacting with genes are distributed as follows: 1142 “TE GeneHancer”, 572 “TE ENCODE cCREs”, and 1437 as “TE coexpression” (i.e., they only met the correlation threshold) (Supplementary data 5). A total of 1992 unique genes are regulated by TEs, with 573 of them labeled as “Cancer gene”. 210 unique genes (59 being Cancer genes) have 3 or more interactions with TEs, amounting to a total of 997 interactions distributed in 87 modules (Fig. 8a). TEs in these modules are distributed into 445 “TE GeneHancer”, 96 “TE ENCODE cCREs” and 454 “TE coexpression” Out of the 210 genes, 153 interact with either a TE GeneHancer or TE ENCODE cCREs, providing further support to the hypothesis of TEs playing a regulatory role in GC. The remaining interactions met the correlation threshold, and the gene is in close genomic vicinity of the TEs, consistent with previous findings13,36 suggesting such a regulatory role for TEs. In addition, to gain an overview of the regulatory impact of TEs, the network modules were characterized according to the presence of a “TE hub”. For a module to be classified as “TE hub”, a TE must account for the 50% of more of its interactions. Following, this approach, 11 “TE hub” modules were identified (Fig. 8a, modules encircled in dashed lines). Out of the 87 modules, 44 (50.6%) have cancer genes, with the 11 “TE hub” modules amongst them (Fig. 8a, modules detailed in Supplementary Fig. 10). Interestingly, some of the genes in these modules have been previously associated with GC, such as CTSK51, CLDN1852 , HORMAD153,54, MS4A8 and MS4A1555, and SOCS356.Fig. 8Network analysis of Gastric Cancer TEs. (a) Regulatory network of genes associated with TEs. Genes previously associated with cancer are shown in red, and the remaining genes are shown in orange. TEs are colored according to their predicted regulatory link: TE GeneHancer in blue, TE ENCODE cCREs in green, and TE coexpression in purple. TEs that are module hubs are highlighted with black border, and the respective module is encircled in dashed lines. (b) Top 10 enriched gene set terms for the Gene Ontology (GO) Biological Process category (upper half) and KEGG pathways category (lower half). (c) Genome coverage plot of the intronic TE MamRTE1:RTE-BovB:LINE in locus chr10:78,040,182-78,040,254 (highlighted in red), having a predicted regulatory link with the RPS24 gene. On the right, the region enclosed with the dashed-lines rectangle is shown zoomed. (d) Genome coverage plot of the region chr1:150,610,108-150,809,260, depicting upstream TEs (highlighted in red) having a predicted enhancer regulatory link with the CTSK gene. GeneHancer and ENCODE regulatory elements overlapping the selected TEs are shown in blue and green, respectively.At a wider scale, gene set enrichment analysis revealed many terms of relevance to cancer (Fig. 8b). For example, intracellular signal transduction, programmed cell death and phosphorylation are all processes long associated with cancer57,58,59. Pathway-level analysis also depicts changes previously associated with cancer. The MAPK signaling pathway plays roles in cell proliferation, differentiation, migration and apoptosis, and is the top pathway likely regulated by TEs. Because of its role in many cell processes, malfunction on the pathway has been linked to cancer60. “Pathways in cancer” is the second term, and it encompasses different pathways with evidence associating them with cancer. Thus, this result is in close agreement with the previous literature search analysis carried out independently. Another interesting example is “Endocytosis”, that has many functions in nutrient uptake. Disruptions in this pathway also play a significant role in cancer, and there is evidence pointing to a role in GC61,62. Furthermore, when performing these analyses on each stage, it is revealed that these terms appear enriched from NAG, suggesting that TEs could potentially contribute to GC development by early alterations to gene expression (Supplementary Figs. 11–16).Finally, two examples of genes regulated by TEs are depicted: RPS24 (Fig. 8c) and CTSK (Fig. 8d). RPS24 was chosen because the interacting TE appears in all 3 analyses (single-cell, TI and spatial), has a partial overlap with one of its exons (Fig. 8c, TE highlighted in red), and the regulatory link is based on GeneHancer. In turn, the result would suggest that the TE is acting as an enhancer of the same gene. This would be similar to a documented event occurring in the EGFR gene, causing its up-regulation, which in turn, contributes to breast cancer63. Also, over-expression of RPS24 has been reported to be a biomarker of colorectal cancer64. In these lines, the TE-mediated up-regulation proposed in this study might also highlight a role of RPS24 in GC progression. On the other hand, CTSK has been previously implicated in GC, and the associated TEs are located about ~ 200,000 kbp away from its locus, and have overlaps with ENCODE cCREs and GeneHancer elements, making this a good example of a long-distance regulatory link (Fig. 8d, Supplementary Fig. 17). CTSK has been reported as over-expressed in GC and has been proposed as GC biomarker51,65. Additionally, it is unclear how it becomes over-expressed51. This result would help bridge that gap, by indicating that up-regulation of CTSK is caused by TE-derived enhancers.In sum, these results show that TEs are potentially regulating genes and pathways associated with cancer in several ways, strongly suggesting that TEs are implicated in the molecular aberrations that occur in GC.

Hot Topics

Related Articles