MATES: a deep learning-based model for locus-specific quantification of transposable elements in single cell

Methods overview

MATES is a specialized tool designed for locus-level quantification of TEs in single-cell datasets of different modalities. The method involves several key steps. First, raw reads are mapped to the reference genome, identifying reads that map uniquely to a TE locus (unique reads) and reads that map to multiple TE loci (multi-mapping reads) (Fig. 1a). Second, we compute the coverage vector for each TE locus, representing the distribution of unique reads surrounding the locus (context). Each TE region (locus) is then subdivided into smaller bins of length W (e.g., 10 base pairs). Each bin is classified as Unique-dominant (U) or Multi-dominant (M) based on the percentage of unique and multi-mapping reads within the bin (Fig. 1b). Please refer to the Methods section for details on selecting these hyper-parameters. Third, we employ an AutoEncoder (AE) model to learn the latent embeddings (Vu) that represent the high-dimensional unique reads coverage vector of a TE locus, which indicates the mapping context flanking the specific TE locus. The one-hot encoded TE family information (Ti) is also taken as input for the model. Fourth, the learned latent embedding (Vu) and the TE family embedding (Ti) are used to predict the multi-mapping ratio (α) for the specific TE locus via a multi-layer perceptron regressor. The total loss used to train the model has two components (L1 and L2). The former is the reconstruction loss of the AutoEncoder, while the latter reflects the continuity of the actual read coverage across neighboring small bins on the TE. Essentially, the final read coverage on the Multi-dominant (M) bins should be close to that of the neighboring Unique-dominant (U) bins because of their genomic proximity.
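The bin labelling step described above can be sketched in a few lines of Python. This is a minimal illustration and not the MATES implementation: the function name `classify_bins`, the 50% cutoff, and the `-` label for bins without reads are assumptions made here for the example; the actual hyper-parameters are discussed in the Methods.

```python
from typing import List


def classify_bins(unique_cov: List[float], multi_cov: List[float],
                  bin_size: int = 10,
                  unique_frac_cutoff: float = 0.5) -> List[str]:
    """Label each bin of a TE locus as 'U' (unique-dominant) or 'M' (multi-dominant).

    unique_cov / multi_cov are per-base coverage vectors over the same locus.
    bin_size plays the role of W; the cutoff is a hypothetical choice.
    """
    if len(unique_cov) != len(multi_cov):
        raise ValueError("coverage vectors must have the same length")
    labels = []
    for start in range(0, len(unique_cov), bin_size):
        u = sum(unique_cov[start:start + bin_size])
        m = sum(multi_cov[start:start + bin_size])
        total = u + m
        if total == 0:
            labels.append("-")  # no reads in this bin (assumed convention)
        elif u / total >= unique_frac_cutoff:
            labels.append("U")
        else:
            labels.append("M")
    return labels


# Toy locus of 30 bp split into three 10-bp bins.
unique = [5] * 10 + [0] * 10 + [1] * 10
multi = [1] * 10 + [4] * 10 + [3] * 10
labels = classify_bins(unique, multi)
```

In this toy locus the first bin is dominated by unique reads and the other two by multi-mapping reads, so the M bins are the ones whose final coverage the continuity term would pull toward the neighboring U bin.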
Finally, once the model that predicts the multi-mapping ratio for each TE locus is trained, we can use it to count the total number of reads that fall into each TE locus, yielding a probabilistic quantification of TEs at the locus level (Fig. 1c). By combining TE quantification with conventional gene quantification (e.g., gene expression or gene accessibility) from the single-cell data, which we refer to as “Gene+TE expression” in the sections below, we can cluster the cells more accurately and identify comprehensive biomarkers (genes and TEs) that characterize the obtained cell clusters (cell sub-populations). MATES handles various single-cell data modalities. Its application offers insights into the roles of TEs in diverse datasets, clustering cells, and identifying potential biomarker TEs (Fig. 1d). Beyond its analytical capabilities, MATES also supports locus-specific TE visualization and interpretation. The tool generates comprehensive bigwig files and Integrative Genomics Viewer (IGV) plots, enabling researchers to visually explore and interpret read assignments for TE loci across the genome (Fig. 1e). This capability enables the investigation of potential interactions between TEs and the genes situated near TE loci, which can improve our understanding of TE dynamics and their impact on gene regulation and cellular functions. Please note that, except where specifically mentioned, the term ‘TE’ used throughout this manuscript represents repetitive elements identified from RepeatMasker. This allows us to provide a comprehensive overview of genomic repeats in our study. When discussing the ‘stricter’ definition of TEs, we specifically mention which TE subfamilies are included.

Fig. 1: MATES methodology for TE quantification and analysis. a Raw reads are aligned to the reference genome, accounting for multi-mapping reads at TEs’ loci.
b TE coverage vectors, including unique reads coverage vector Vu and multi reads coverage vector Vm, are constructed, capturing reads’ distribution information. c An AutoEncoder model extracts latent embeddings from unique reads coverage vectors. These embeddings, combined with TE family data Ti, predict the likelihood, α, of multi-mapping reads aligning to each TE locus. d The multi-mapping probability α, computed by MATES, is used to create the TE count matrix. This matrix can be used for cell analysis either independently or in conjunction with a conventional gene count matrix. Such combined use enhances cell clustering and biomarker (gene and TE) discovery, providing a more comprehensive understanding of cellular characteristics. e Genome-wide reads coverage visualization by MATES in the Genome Browser. This method quantifies TEs at specific loci in individual cells, producing bigwig files with coverage from probabilistically assigned multi-mapping reads. These files, containing both unique and multi-mapping reads, are merged to generate comprehensive bigwig files for genome-wide TE read visualization using tools such as the Integrative Genomics Viewer (IGV).

MATES identifies signature TEs and their specific loci in 2C-like cells (2CLCs) within 10x single-cell RNA-seq data of chemical reprogramming

To demonstrate the precision of MATES in TE quantification from single-cell RNA-seq data, we applied it to a 10x single-cell chemical reprogramming dataset of mice. This analysis identified signature TEs of 2-cell-like cells (2CLCs)25. By employing MATES to quantify TE expression, we integrated the quantified TE count matrix with gene expression profiles, facilitating comprehensive clustering and visualization analyses, as shown in Fig. 2a, b and Supplementary Fig. S1a. Our study revealed a distinct subpopulation of 2CLCs (cluster 17), positioned between stage II and stage III of reprogramming.
Notably, MATES detected the 2CLCs population and distinguished their signature gene markers, especially Zscan4d and Zscan4c26, within the transition-stage clusters. Moreover, MATES identified specific TE markers, MERVL-int and MT2_Mm, that are enriched within the 2CLC cluster, corroborating previous studies recognizing these TEs as defining markers for 2CLCs27,28,29,30. These findings highlight MATES’s ability to capture cell populations and their significant biological markers (genes and TEs), providing insights into the cellular dynamics of reprogramming.

Fig. 2: MATES enhances cell clustering and biomarker discovery in mouse chemical reprogramming. a, b UMAP plots illustrating MATES’s efficacy in cell clustering by integrating TEs and genes. (a) is colored by Leiden clustering results, while (b) is colored according to the reprogramming stages, highlighting identified gene (purple) and TE (red) biomarkers. c, d Additional UMAP plots emphasizing MATES’s capability for clustering using only TEs, with (c) colored by Leiden clusters and (d) by reprogramming stages. Notably, MT2_Mm and MERVL-int TEs are prominent biomarkers in Zscan4c/Zscan4d-positive cells, consistent with known 2CLCs markers. e Dot plot identifying stage-specific marker genes (purple) and TEs (black) as detected by MATES. f Illustration of MATES’s probabilistic approach to allocating multi-mapping reads to specific TE loci, predominantly involving MT2_Mm and MERVL-int at the Zscan4c/Zscan4d loci in 2CLCs. g Bar plots displaying the reads enrichment for MT2_Mm and MERVL-int at the Zscan4c/Zscan4d loci. The enrichment p-value is calculated with a one-sided binomial test. h Box plot comparison of cell clustering efficacy by Adjusted Rand Index (ARI) between MATES’s locus-level and subfamily-level TE quantification. The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the medians. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles.
The experiments were run with N = 10 different seeds and the p-value was calculated using a one-sided Student’s t-test. Source data are provided as a Source Data file.

We next conducted a TE-centric analysis to further validate the distinct role of MATES’s TE expression quantification in cell clustering and biomarker discovery (Fig. 2c, d, Supplementary Fig. S1d). When quantifying TE expression, we took care to exclude overlapping regions between TEs and their adjacent genes to prevent potential information leakage from gene expression data. This TE-centric analysis, which used TE expression only, also identified the 2CLC cell population and reaffirmed the relevance of its associated TE biomarkers, namely MERVL-int and MT2_Mm, as illustrated in Fig. 2c, d. This indicates that our cell clustering and biomarker discovery are not solely dependent on traditional gene expression analysis. Instead, TE quantification independently conducted by MATES provides consistent cell clustering results and accurately identifies signature TEs for the identified cell populations. To provide a clearer, quantitative view of the clustering accuracy based solely on TEs, we included confusion matrices and calculated Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) scores to quantify the similarity between the TE-based and conventional gene-based analysis results. The clustering results based on TE expression alone were compared to those based on gene expression. Major clusters, such as cluster 1 and cluster 12, representing SIII_D12 and 2CLCs, respectively, were effectively captured by TE-only clusters (Supplementary Fig. S2a).
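As a reference for how such a clustering comparison can be scored, the following sketch computes the ARI between two label vectors and a permutation-based p-value. It is a plain-Python illustration under stated assumptions, not the authors' pipeline: the label-shuffling null, the number of permutations, and the function names are choices made here, and the exact permutation procedure used in the paper is described in its Methods.

```python
import random
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Tuple


def adjusted_rand_index(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between two clusterings of the same cells."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pair counts within contingency cells, rows, and columns.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


def permutation_p_value(a: Sequence[Hashable], b: Sequence[Hashable],
                        n_perm: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Observed ARI and a one-sided p-value from shuffling the labels in b."""
    rng = random.Random(seed)
    observed = adjusted_rand_index(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if adjusted_rand_index(a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: gene-based vs TE-based cluster labels for 20 cells.
gene_labels = [0] * 10 + [1] * 10
te_labels = [0] * 10 + [1] * 10
ari, p_value = permutation_p_value(gene_labels, te_labels, n_perm=200)
```

With the add-one correction, the smallest reportable p-value is 1/(n_perm + 1), so resolving values as small as 10^−6 requires on the order of a million permutations.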
These TE clusters corresponded well with the gene expression clusters, with high ARI (median 0.397, P < 1 × 10^−6) and NMI (median 0.496, P < 1 × 10^−6) scores indicating a strong alignment (Supplementary Fig. S2b). The confusion matrix in Supplementary Fig. S2c shows that TE cluster 1 is predominantly identified by gene clusters 1 and 9, while TE cluster 12 is mainly identified by gene cluster 18, highlighting a notable correspondence between TE and gene expression clusters. Moreover, by focusing on clustering driven by TE expression, quantified exclusively from multi-mapping reads, MATES demonstrates its ability to manage these challenging reads and identify biomarkers precisely aligned with their specific developmental stages (Supplementary Fig. S1g).

Not only does MATES identify the signature genes and TE markers for 2CLCs and cell populations in different reprogramming stages (Fig. 2e, Supplementary Fig. S1h), but it also shows effectiveness in aligning the multi-mapping reads to specific loci, a challenge that has stymied current methods. For instance, scTE is limited to assigning multi-mapping reads to the metagene (TEs of the same subfamily), lacking a distinct assignment to specific genomic loci. While SoloTE quantifies reads uniquely mapped to TEs at the locus level, it retains only the best alignment for multi-mapping reads and then quantifies them at the subfamily level. In contrast, by leveraging the learned multi-mapping rate for each TE locus (α), MATES probabilistically assigns multi-mapping reads to TE genomic loci across the genome. Through this strategy, we can quantify TE expression at the locus level with accuracy, evident when assessing the multi-mapping reads for 2CLC cells (Fig. 2f, g).
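One simple reading of this probabilistic assignment is that each locus keeps all of its unique reads and a fraction α of the multi-mapping reads that align to it. The sketch below encodes that reading for a single cell; the formula, the function name `locus_counts`, and the placeholder locus names are assumptions for illustration, and the exact counting rule should be taken from the Methods.

```python
from typing import Dict


def locus_counts(unique_reads: Dict[str, int],
                 multi_reads: Dict[str, int],
                 alpha: Dict[str, float]) -> Dict[str, float]:
    """Expected read count per TE locus: unique + alpha * multi (assumed rule)."""
    counts = {}
    for locus in sorted(set(unique_reads) | set(multi_reads)):
        a = alpha.get(locus, 0.0)  # loci without a prediction get no multi reads
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"alpha for {locus} must lie in [0, 1]")
        counts[locus] = unique_reads.get(locus, 0) + a * multi_reads.get(locus, 0)
    return counts


# Hypothetical loci in one cell; names and numbers are made up for the example.
unique = {"locus_A": 3}
multi = {"locus_A": 10, "locus_B": 4}
alpha = {"locus_A": 0.2, "locus_B": 0.5}
counts = locus_counts(unique, multi, alpha)
```

Under this rule a locus with no unique reads, such as `locus_B` here, can still receive a non-zero count, which is the situation a unique-reads-only locus quantification would miss.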
Multi-mapping reads linked to MT2_Mm and MERVL-int aligned close to the genes Zscan4c and Zscan4d, and the total number of such reads near the Zscan4c and Zscan4d loci was significantly higher than at control loci (Fig. 2g, Supplementary Fig. S1i). This alignment is consistent with findings by Zhu et al.31, where the activation of Zscan4c was correlated with the activation of the endogenous retrovirus MT2_Mm/MERVL-int. Please note that the unique reads-based locus quantification, highlighted in orange in panel g, represents the SoloTE strategy. This strategy processes unique reads at the locus level and multi-mapping reads at the subfamily level. Thus, at the locus level, SoloTE uses only unique reads, which may miss reads mapped to critical loci such as those near Zscan4c and Zscan4d. Additionally, locus-specific TE quantification improves clustering accuracy compared to the subfamily-level TE quantification typically used in existing methods, as shown in Fig. 2h. This emphasizes the benefits of precise, locus-level TE quantification. For additional results demonstrating the effectiveness of MATES with this 10x single-cell RNA-seq data, please see Supplementary Fig. S1.

MATES quantifies disease-related TE expression in full-length Smart-Seq2 single-cell RNA-seq data of human glioblastoma

To demonstrate the cross-platform applicability of MATES, we applied the tool to another single-cell RNA-seq dataset from the Smart-Seq2 full-length sequencing platform32, focusing on a human glioblastoma dataset33. The combined use of MATES’s TE expression quantification and conventional gene expression analysis allowed us to pinpoint distinct cell populations within the glioblastoma micro-environment, as shown in the UMAP plots (Fig. 3a, b).
We observed certain TEs with expression patterns linked to crucial glioma gene markers like EGFR33,34 and TE markers including HUERS-P1-int35 and HERVK-int36, along with immune cell gene markers such as CD7437,38 and TE marker LTR2B39,40 (Fig. 3b). These correlations suggest that TEs might be associated with processes related to tumor heterogeneity and the immune response in glioblastoma. Further research is necessary to explore any causal relationships and the underlying mechanisms. Combining TE-based cell typing with gene expression data revealed a detailed interplay between genes and TEs. This integration showcased how TE-based clustering could complement gene expression analysis, thus enhancing the resolution of cellular heterogeneity studies.

Fig. 3: MATES quantifies disease-related TE expression in Smart-Seq2 single-cell RNA-seq data. a, b UMAP plots showcase cell clustering informed by gene and TE markers. “MATES” or “Gene+TE” signifies combined gene expression with TE data quantified by MATES. Initially, MATES UMAPs are colored by Leiden clusters (a), then by cell type, neoplastic (EGFR, HUERS-P1-int, and HERVK-int), and immune cell markers like CD74, LTR2B, and LTR40A1 (b). c, d UMAPs based solely on MATES-quantified TE expression are colored by Leiden clusters (c) and cell type with specific markers (HERVK-int, etc.) (d). e Dot plots elucidate the correlation between marker genes, TEs, and cell types, as identified by MATES. f, g Illustrate the enhanced clustering accuracy using MATES’s locus-level TE quantification, with (f) detailing Leiden clusters and (g) showing cell types. h The plot lists a highly expressed TE marker (LTR2B) for immune cells at the locus level and the same TE’s non-expressed locus, demonstrating MATES’s capability in locus-level TE quantification. i A bar plot visualizes the average locus-specific TE expression levels in immune and neoplastic cells.
j A box plot comparison of cell clustering efficacy, measured by Adjusted Rand Index (ARI), between MATES’s locus-level and subfamily-level TE quantification reveals the method’s enhanced resolution, demonstrating its performance in biomarker identification and cellular classification. The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the medians. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles. The experiments were run with N = 10 different seeds and the p-value was calculated using a one-sided Student’s t-test. Source data are provided as a Source Data file.

Further demonstrating MATES’s precision, we also performed cell clustering solely based on the TE count matrix quantified by MATES. While the TE-only analysis may not achieve better clustering accuracy than the combined analysis, TE quantification carries biological information capable of producing results coherent with conventional gene-based analysis. Specifically, we systematically compared the TE-only results with gene expression clustering results and found notable similarity. Leiden clusters 0 and 1 correspond to immune cells, while clusters 2, 3, and 4 correspond to neoplastic cells (Supplementary Fig. S2d, f). The ARI (median 0.105, P = 1.03 × 10^−2) and NMI (median 0.161, P = 7.60 × 10^−4) scores indicate a weak yet significant agreement between TE expression clustering and gene expression clustering (Supplementary Fig. S2e). The confusion matrix further compares TE clusters to gene clusters and cell types, showing that TE cluster 0 overlaps significantly with gene clusters 0 and 1, which are primarily composed of immune cells, while TE cluster 2 aligns with gene clusters 4 and 5, containing mostly neoplastic cells (Supplementary Fig. S2f). This suggests that TE-based clustering can recapture the major cell populations and identify their associated TE markers (Fig. 3c, d). The dot plots (Fig. 
3e) not only displayed the associations between specific marker genes, TEs, and cell types but also quantified their relative expression levels.

In addition to analyzing TE expression at the subfamily level shown above, MATES’ locus-level TE quantification offered a more comprehensive view of the cellular landscape (Fig. 3f–h, Supplementary Fig. S3). This approach facilitated the identification of highly expressed TE loci corresponding to marker TEs previously identified at the subfamily level. Notably, even for the same TE subfamily, such as LTR2B, distinct loci could exhibit different expression patterns (Fig. 3h, i), emphasizing the need for precise, locus-specific TE quantification. The LTR2B locus at chr3∣104522003∣104522491∣LTR2B (chrom∣start∣end∣TE), a highly expressed locus-specific TE marker for immune cells, is close to the CD166 gene, suggesting potential regulatory interactions. CD166, crucial for immune cell adhesion and function41, may be influenced by LTR2B through its regulatory elements. TEs can impact nearby gene expression by providing promoters, enhancers, and transcription factor binding sites, facilitating rapid and dynamic gene expression changes vital for immune responses42. Additionally, TEs are targets for epigenetic modifications, further regulating nearby genes and enhancing immune cell adaptability43. Further experimental analysis is needed to fully understand their interactions. In addition, locus-specific TE quantification significantly improved cell clustering accuracy compared to its subfamily-level counterpart, as demonstrated in Fig. 3j (P = 5.48 × 10^−7), highlighting its value for analyzing cellular heterogeneity and understanding TE functions, and its advantage over conventional subfamily-level analysis.
Please refer to Supplementary Data 1 for the identified top TE locus markers and their nearby interacting genes for neoplastic and immune cells.

Our results affirm the robustness of MATES when applied to full-length scRNA-seq data, emphasizing its effectiveness for in-depth cellular analysis within these single-cell RNA-seq datasets across varying sequencing platforms. While some existing methods (e.g., scTE) can be adapted to handle full-length single-cell RNA-seq data, their performance is often suboptimal, highlighting the value of MATES’ ability to process and interpret these datasets effectively (Supplementary Fig. S4).

Applicability of MATES across species

MATES can quantify TE expression not only in mammals, such as humans and mice as demonstrated above, but also in non-mammalian species, showcasing its cross-species applicability. To comprehensively evaluate MATES’ generalizability, we applied it to single-cell RNA-seq datasets from the non-mammalian species Arabidopsis thaliana (Arabidopsis)44 and Drosophila melanogaster (Drosophila)45. Here we used conventional gene expression-based cell embedding and clustering as the baseline to evaluate the quality of TE quantification of different tools on these non-mammalian species.

Supplementary Fig. S5 illustrates the cell clustering based on TE and gene expression in Arabidopsis using MATES, scTE, and SoloTE. Supplementary Fig. S5a shows the UMAP visualization of TE expression quantified by scTE compared with gene expression clusters, with an ARI of 0.3315 (P < 1 × 10^−6). Panel (b) presents the UMAP visualization of TE expression quantified by SoloTE, with an ARI of 0.3200 (P < 1 × 10^−6). Panel (c) shows the UMAP visualization of TE expression quantified by MATES compared with gene expression clusters, achieving an ARI of 0.3514 (P < 1 × 10^−6). When combining the gene expression and TE expression quantified by computational methods (Supplementary Figs. 
S5a–c), MATES achieved a 0.6668 ARI score (P < 1 × 10^−6), which is higher than SoloTE (ARI = 0.5916, P < 1 × 10^−6) and scTE (ARI = 0.5775, P < 1 × 10^−6). This is categorized as moderate similarity, as demonstrated in a comprehensive single-cell RNA-seq clustering comparison evaluation study, where varying levels of differential expression (DE) were employed to assess clustering similarity46. The moderate similarity between the UMAP visualizations in panels (a), (b), and (c) indicates that TE-based clustering mirrors gene expression-based clustering, with MATES showing the highest agreement. Here, we employed permutation tests to calculate p-values associated with the similarity ARIs calculated above (see “Methods” for details). Additionally, the identification of marker TEs using MATES further supports its effectiveness. Panel (d) highlights that specific marker TEs were identified for various cell populations. For instance, marker gene KCS6 identified the outer cell layer, while HIK identified partially dividing and inner cell layers. Correspondingly, marker TEs ATCopia66LTR and ATCopia41LTR were identified for the outer cell layer and the partially dividing and inner cell layers, respectively.

For Drosophila, Supplementary Fig. S6 shows clustering results based on TE and gene expression using MATES and scTE. UMAP visualizations show TE expression quantified by MATES (ARI = 0.3188, P = 4.60 × 10^−5, Supplementary Fig. S6a) and by scTE (ARI = 0.3078, P = 4.60 × 10^−5, Supplementary Fig. S6b). This is categorized as moderate similarity, as supported by studies assessing clustering similarity through varying levels of DE46. The moderate similarity between the UMAP visualizations in Supplementary Fig. S6a, b indicates that TE-based clustering mirrors gene expression-based clustering, with MATES showing slightly higher agreement. Additionally, the identification of marker TEs using MATES further supports its effectiveness. Supplementary Fig. 
S6c highlights that specific marker TEs were identified for various cell populations. For example, the marker TE GYPSY12_LTR was identified for cluster type 8, as defined by conventional gene clustering.

These findings, based on both the moderate ARI values indicating significant clustering similarity, as evidenced by the permutation test p-values, and the effective identification of marker TEs for specific cell types, demonstrate MATES’s robustness and applicability across both mammalian and non-mammalian species.

Applicability of MATES across modalities

Beyond its applicability to various species as shown above, MATES can quantify TEs not only in transcriptome data but also in epigenome data. To validate this adaptability, we applied MATES to a 10x single-cell ATAC-seq dataset of adult mouse brain47. This compatibility with single-cell data across different modalities is crucial, as many existing methods are tailored exclusively for transcriptomic data and may not perform as effectively with other single-cell modalities. MATES enables the quantification of TE locus-specific attributes, such as chromatin accessibility, across diverse single-cell data modalities, extending its utility beyond transcriptomics.

By quantifying chromatin accessibility at the TE subfamily level and incorporating standard single-cell ATAC-seq peaks, MATES facilitates refined cell clustering and the identification of TE markers that are characteristic of distinct cell populations (Fig. 4a, b, Supplementary Fig. S7a). TEs such as RMER16_Mm and RLTR44B from the ERVK family show exclusive accessibility in macrophages48,49, while MamRep434 and MER124 are preferentially accessible in astrocytes50, underscoring TEs’ significant roles in neurogenesis and astrogliogenesis51.
For example, MamRep434’s significant contribution to the motif of Lhx2, a key transcription factor in neurogenesis, is emblematic of the functional implications of TE accessibility in cell identity and function50,52. Using solely the TE quantification from MATES for clustering, we identified TE biomarkers and maintained clear delineation among cell groups (Fig. 4c, d, Supplementary Fig. S7b), supporting that the insights gleaned from TE quantification reflect genuine biological signal. Although integrating TE quantification with conventional quantification like gene or peak counts generally produces the best cell embedding quality and clustering, TE-only analysis is essential for highlighting the specific contributions of TEs beyond conventional gene or peak quantification. By comparing TE-based results with conventional gene or peak-based results, we demonstrated that TE quantification is highly informative and can produce cell clustering results consistent with conventional analyses (Supplementary Figs. S2g–i). The median ARI between TE-based clusters and gene-based clusters is 0.309 (P = 5.60 × 10^−5), and the median NMI is 0.438 (P = 4.60 × 10^−6). The identified signature TEs for all cell populations are shown in Fig. 4e.

Fig. 4: MATES versatility application on adult mouse brain scATAC-seq data. a–d UMAP plots demonstrate the effectiveness of MATES quantification in cell clustering and identifying signature TE markers, using both TEs and Peaks for clustering, with (a) displaying Leiden clusters, and (b) illustrating cell types and TE markers. Key TE biomarkers such as RMER16_Mm, RLTR44B in macrophages, MamRep434, MER124 in Astrocytes, and MURVY-LTR, MamRep1527 in oligodendrocytes are identified. c–d Showcase the specificity of MATES in TE-centric clustering (using only TE quantification by MATES), with (c) focused on Leiden clusters, (d) on cell types and the distinctive TE markers previously noted.
e Dot plot concisely presents cell type-specific TE markers uncovered by MATES. f–h These panels illustrate MATES’s improved clustering accuracy using the locus-level TE quantification. Panel (f) features UMAP visualizations based on locus-level TE quantification, with colors representing Leiden clusters. Panel (g) displays the same UMAP, but with color coding to differentiate various cell types. Panel (h) provides a specific example of the TE marker RLTR44B in Macrophages at the locus level, in contrast to a non-open locus of the same TE, demonstrating MATES’s capability in detailed locus-level TE quantification. i The box plot contrasts the cell clustering efficacy, measured by Adjusted Rand Index (ARI), of MATES when comparing locus-level versus subfamily-level TE quantification. This comparison highlights the advantages of employing locus-specific TE quantification. The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the medians. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles. The experiments were run with N = 10 different seeds and the associated p-value was determined using a one-sided Student’s t-test. j The bar plot visualizes the average locus-specific TE expression in Macrophage, Oligo, and Astro cells. k The dot plot shows the locus-specific TE markers, identified by MATES, for individual cell types. Source data are provided as a Source Data file.

Beyond subfamily-level TE quantification and analysis, MATES also delivers locus-specific TE quantification, identifying signature TEs for each cell population with precise TE locus positions (Fig. 4f–h, Supplementary Figs. S7c, d). Locus-level TE quantification demonstrates significantly higher (P = 2.55 × 10^−34) cell clustering accuracy compared to subfamily-level analysis (Fig. 4i). This underscores the effectiveness of MATES in locus-level TE quantification and its potential benefits for understanding cellular states within the data.
Marker TEs identified at the locus level align with those detected at the subfamily level, validating the method’s accuracy and yielding locus-specific insights into TE’s impact on chromatin accessibility53. Supplementary Fig. S7 provides additional supporting evidence of MATES’ effectiveness in the 10x scATAC-seq dataset, further supporting its use as a versatile tool for TE quantification and analysis at the single-cell level across different modalities.

These locus-level TE biomarkers for each cell population can potentially reveal interactions with nearby genes and their impact on regulating cellular states for specific cell types (Fig. 4j, k). For example, in this mouse brain scATAC dataset, the locus chr13∣89384655∣89385083∣RLTR44B exhibits substantial chromatin accessibility in cells that should be annotated as microglia, despite their previous macrophage annotation (Fig. 4h). One of the flanking genes is Edil3. Quantitative proteomics indicate increased Edil3 expression in isolated microglia of APP-KI mice compared to WT mice54. This locus is also located upstream of the gene Hapln1, which has been recently identified as a macrophage-related regulator and is correlated with cancer immunotherapy55. These findings support the potential roles of this RLTR44B locus in the macrophage/microglia cell population. Similarly, the locus chr19∣40434224∣40434373∣MamRep434 (Supplementary Fig. S7c) is located near the Sorbs1 gene, which is associated with astrogliosis and shows elevated expression in individuals with schizophrenia who express high levels of inflammatory markers56. Additionally, genes situated within a 200 kb flanking region of this transposable element include Aldh18a157 and Entpd1958, which play pivotal roles in astrocyte functions and their interactions with non-astrocytic cells. These findings collectively underscore the regulatory potential of these identified marker TE loci in influencing astrocyte-related gene expression.
Another notable case is the locus chr18∣10613085∣10613530∣RLTR28, which shows high accessibility in scATAC-seq and could be a potential enhancer for the nearby stress-responsive gene Abhd3 in astrocytes, highlighting the role of TEs in stress response through genetic mutations and epigenomic variations59. Additionally, MATES identified the locus chr17∣43557623∣43557829∣LTR33 near the phospholipase A2 group VII (Pla2g7) gene, suggesting it might serve as a tissue-specific expression enhancer of this gene in cortical GM astrocytes. This finding is supported by the reported predominant expression of Pla2g7 in astrocytes60. Please refer to Supplementary Data 2 for the top identified TE locus markers and their nearby interacting genes for Macrophages, Astro and Oligo.

Overall, these examples illustrate how MATES enables detailed interrogation of interactions between TE loci and nearby genes, potentially unlocking deeper understanding of biological mechanisms and providing valuable insights into the regulation of cellular states.

Consistency between single-cell TE quantification by MATES and corresponding bulk quantification via conventional methods

To evaluate the robustness of MATES, we compared both scRNA-seq and scATAC-seq TE quantification with matching bulk datasets. MATES, optimized for single-cell data, demonstrated consistent results across various experimental configurations. For this comparison, bulk data were generated from 10x scRNA-seq and scATAC-seq (pseudo-bulk), simulating bulk RNA-seq and bulk ATAC-seq datasets. Utilizing pseudo-bulk data derived from the same set of cells minimized potential batch effects, ensuring a rigorous comparison. This approach facilitated a direct comparison between single-cell TE expression quantified by MATES and bulk-level quantification from pseudo-bulk data using TEtranscripts11 and Telescope12, which are specifically designed for bulk TE quantification.
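The comparison just described can be illustrated with a small sketch: per-cell TE counts are summed into a pseudo-bulk profile, which is then correlated with a bulk tool's quantification of the same TEs. This is an illustration under stated assumptions rather than the evaluation code: treating R^2 as the squared Pearson correlation, using raw (untransformed) counts, and the function names are all choices made here.

```python
from typing import List, Sequence


def pseudo_bulk(cell_by_te: Sequence[Sequence[float]]) -> List[float]:
    """Sum a cells x TEs count matrix over cells to get one value per TE."""
    return [sum(column) for column in zip(*cell_by_te)]


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy * sxy / (sxx * syy)


# Toy data: 2 cells x 3 TEs from the single-cell method, and a made-up
# bulk-tool quantification of the same 3 TEs.
single_cell_counts = [[1.0, 0.0, 2.0],
                      [3.0, 1.0, 0.0]]
bulk_tool_counts = [8.0, 2.0, 4.0]
profile = pseudo_bulk(single_cell_counts)
score = r_squared(profile, bulk_tool_counts)
```

In practice count data of this kind are often log-transformed before correlation; whether that applies here should be checked against the Methods.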
Please refer to ‘MATES comparison with existing TE quantification methods on bulk data’ in the Methods section for our strategy to compare the single-cell TE quantification from MATES with the corresponding bulk TE quantification from existing approaches.

The results for RNA data, illustrated in Supplementary Fig. S8, demonstrate a high degree of correlation between pseudo-bulk TE expression quantified by MATES and the ground truth. Specifically, MATES showed a strong correlation with TEtranscripts (R² = 0.9429; Supplementary Fig. S8a) and with Telescope (R² = 0.9664; Supplementary Fig. S8b), indicating robust performance. To provide a comprehensive comparison regarding TEs and repetitive elements, we included only well-defined TEs (DNA, LTR, RC, Retroposon, SINE, and LINE) in the analysis of pseudo-bulk RNA data.

The results for comparisons made on ATAC data, depicted in correlation scatter plots (Supplementary Fig. S9), also indicate a high degree of correlation between MATES and TEtranscripts (R² = 0.9901; Supplementary Fig. S9a) as well as between MATES and Telescope (R² = 0.9890; Supplementary Fig. S9b). These high correlation coefficients affirm that MATES reliably captures TE accessibility profiles comparable to those obtained from bulk ATAC-seq data. Additionally, selected genome views (Supplementary Fig. S9c) highlight specific TE regions, further demonstrating the concordance between single-cell (MATES) and bulk (TEtranscripts, Telescope) TE quantification. This analysis incorporated repetitive elements to ensure a comprehensive evaluation.

These results underscore the efficacy of MATES in providing consistent and reliable TE quantification at the single-cell level, on par with conventional bulk methods.
This consistency not only validates the accuracy of MATES but also enhances our understanding of TE dynamics at the single-cell level, potentially unlocking deeper insights into underlying biological mechanisms.

MATES enables multi-omic TE quantification and analysis

To further attest to the broad applicability of MATES, we applied it to a single-cell multi-omics dataset (10x Multiome)61. By combining the TE quantification provided by MATES with conventional gene expression data (RNA transcripts from scRNA-seq) and matched accessibility quantification (chromatin accessibility from scATAC-seq), we distinguished various cell populations and their associated markers, as depicted in Fig. 5a, b.

Fig. 5: Multi-modal TE analysis in human PBMCs using MATES. a, b UMAP plots of joint MATES clustering using both scRNA and scATAC modalities, with (a) illustrating Leiden clusters and (b) detailing cell type clusters. c–f TE quantification across modalities highlights the complementarity of multi-modal TE quantification. Here, c, d feature UMAP clustering by Gene and TE in the scRNA modality, while e, f display UMAP clustering by Peaks and TE in the scATAC modality. Both (c) and (e) are colored by Leiden clusters, and (d) and (f) by cell types, showcasing the differential expression of TEs like AluYa5 across both modalities, whereas MER48, LTR71A, and MER54A appear specific to scATAC. g–l This series of UMAP plots and box plots illustrates the multi-modal TE analysis. g and j feature UMAP plots of TE expression, colored by Leiden clusters to highlight the clustering patterns. h and k are UMAP plots that focus on different cell types and TE markers, providing insights into the distinct identities of cells and their associated TEs. i and l are box plots contrasting the cell clustering effectiveness, measured by the Adjusted Rand Index (ARI). The box plots underscore the improved resolution provided by locus-level quantification over subfamily-level quantification.
The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the medians. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles. The experiments in these two plots were run with N = 10 different seeds and the p-values were calculated using a one-sided Student’s t-test. m Illustrates TE biomarkers discerned via scRNA and scATAC modalities, noting that high expression TEs often correlate with increased chromatin accessibility, while the reverse is not uniformly observed, highlighting the unique contributions of each modality. n A dot plot captures signature TEs per cell type, validating the complementary nature of scATAC and scRNA data for a holistic view of transposon dynamics. Source data are provided as a Source Data file.

MATES’s ability to cluster diverse cells and to identify distinct cell populations and TE biomarkers across different sequencing techniques demonstrates its potential for harnessing multi-omics data for in-depth cell analysis (Fig. 5c–f). The results underscore the synergistic interplay of different modalities within MATES. Notably, when solely relying on MATES’ TE quantification (TE_scRNA and TE_scATAC), the method effectively captured the primary cell populations and their signature TE markers. Moreover, quantification at the locus level further improved clustering performance (Fig. 5g–l, Supplementary Fig. S10a–d). Analyzing the multi-omics data, MATES unexpectedly reveals that certain TEs are uniquely discernible in specific cell sub-populations and exclusive to a particular modality (Fig. 5m, n, Supplementary Fig. S10e). TEs with elevated expression often exhibit increased chromatin accessibility (represented by red dots; examples such as AluYa5 have been reported62). Conversely, transposons with enhanced chromatin accessibility do not always indicate high expression levels (marked as blue dots). For instance, LTR71A is predominantly detected via scATAC-seq but is not detected in the scRNA-seq dataset.
A similar trend is observed with TE biomarkers such as MER48 and MER54A, suggesting these TEs are accessible in uninfected monocytes but transcriptionally dormant. This leads us to propose that such TEs may be primed to a “poised” state that might correlate to monocyte function63. Notably, several of these TEs, previously identified as “poised”, featured prominently among the upregulated TEs. This enrichment, assessed with a hypergeometric test (P = 5.91 × 10⁻²⁴), was determined by comparing the marker TEs identified through MATES with the upregulated TEs reported in the study (Supplementary Fig. S10f).

This observation highlights the capability of MATES to uncover biological features of TEs. Here, our result emphasizes that, when analyzed through the prism of TEs, chromatin accessibility and RNA abundance offer complementary insights into the single-cell state (see Fig. 5m). The strength of MATES is demonstrated in Supplementary Fig. S10g. Within this framework, TE quantification by MATES boosts cell clustering accuracy and helps to build new hypotheses.

Method benchmarking in single-cell TE quantification

In our study, we conduct a detailed benchmarking of the MATES approach for TE quantification and its effect on cell clustering in single-cell datasets with various modalities. MATES is compared with two established methods, scTE and SoloTE. Our benchmarking primarily concentrates on TE quantification at the subfamily level, due to the limitations of existing methodologies. Both scTE and SoloTE are incapable of providing locus-specific TE quantification that accounts for multi-mapping reads. Specifically, scTE is confined to subfamily-level quantification, while SoloTE does offer locus-specific quantification but is constrained to uniquely mapping reads. The prevalence of multi-mapping reads among TEs highlights the critical need for accurate locus-specific TE quantification to enhance our understanding of cellular states.
To ensure a fair comparison in our benchmarking, we have restricted our analysis in this section to the subfamily level, a feature shared by both scTE and SoloTE. However, the advantages of MATES in providing more accurate locus-specific TE quantification over the subfamily level have been discussed in earlier sections of our study for each individual dataset.

Here, we perform comprehensive benchmarking of TE quantification methods, evaluating them based on their downstream cell clustering performance. We analyze TE information from three distinct perspectives to illuminate the various aspects of TE expression’s relevance and utility. Firstly, we analyzed gene/peak and TE expression concurrently to demonstrate how TE expression complements conventional gene/peak quantifications from single-cell data. This analysis underscores the added value that TE data can provide alongside traditional metrics. Secondly, we focused exclusively on TE expression to showcase the inherent power and information encapsulated within TE data alone. This approach aimed to illustrate that TE quantification, independent of conventional gene/peak data, holds sufficient information for effective cell clustering. Lastly, we assessed TE expression based on multi-mapping reads to emphasize the ability of the MATES method in handling these reads for TE quantification. This aspect was particularly crucial for understanding the robustness of MATES in dealing with the complexities of TE data. These comparisons were referred to as Gene/Peak + TE expression (Gene+TE/Peak+TE), TE expression (TE), and Multi-mapping TE expression (Multi TE), respectively.

The benchmarking in this work included single-cell RNA datasets obtained using both the 10x and Smart-Seq2 platforms, as well as single-cell ATAC sequencing (scATAC) datasets. It is important to note that SoloTE is not compatible with Smart-Seq2 and scATAC single-cell data.
Consequently, SoloTE was excluded from the comparative analysis for these two datasets. To assess the clustering results derived from the three perspectives outlined earlier, we used ARI and NMI as our key benchmarking metrics. Our analysis compared MATES with two existing single-cell TE quantification methods, scTE and SoloTE, which primarily differ in their approach to handling multi-mapping reads: scTE uses a basic correction method, SoloTE selects the top-scoring alignment, and MATES utilizes a probabilistic approach informed by local read context. We observed significant differences in their performance. The statistical significance of our results, indicated by p-values, reinforces the improved performance of MATES in TE quantification, as evidenced by improved cell clustering performance compared to existing methods.

In the analysis that integrated both Gene and TE expression, MATES demonstrated better performance than scTE and SoloTE, especially evident in the 10x scRNA dataset for chemical reprogramming. In this context, scTE’s effectiveness was markedly inferior compared to a gene expression-only approach, highlighting the importance of accurate multi-mapping read assignment. This was evidenced by MATES achieving higher ARI and NMI scores across all datasets tested, as shown in the left panels of Fig. 6a, b. To further showcase and compare the accuracy of TE quantification by different methods, we conducted a cell clustering analysis based on only the TE quantification, excluding gene expression data. Here again, MATES outperformed scTE and SoloTE in terms of clustering efficiency, which was reflected in improved ARI and NMI scores (see the middle panels of Fig. 6a, b). This result underscores the capability of MATES in handling multi-mapping reads and demonstrates its significance in TE quantification. In scenarios focusing exclusively on multi-mapping TE reads, the ability of MATES to manage these challenging read assignments became apparent.
MATES consistently outperformed scTE and SoloTE in these tests, as depicted in the middle and right panels of Fig. 6a, b. This consistent improvement across different testing conditions demonstrates the robustness and effectiveness of MATES in TE quantification. Its strength in the assignment of multi-mapping reads contributes to the accuracy in subsequent cell clustering tasks.

Fig. 6: Benchmarking MATES performance in diverse single-cell datasets. This assessment highlights MATES’s efficiency in cell clustering, evaluated through the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), using different TE quantification strategies. a, b The impact of various TE quantification methods on cell clustering is compared within the 10x scRNA Dataset of Chemical Reprogramming and Smart-Seq2 Dataset of Glioblastoma. These methods include scTE (gene expression combined with scTE-quantified TE), SoloTE (gene expression combined with SoloTE-quantified TE), and MATES (gene expression integrated with MATES-quantified TE). MATES outperforms both scTE and SoloTE by enhancing gene expression with TE data (left). The middle panel compares clustering based solely on TE quantification methods (unique TEs, scTE, SoloTE, and MATES), with ‘unique TEs’ representing TE expression from unique-mapping reads, highlighting MATES’s consistently improved performance. The right panel confirms MATES’s advantage over scTE and SoloTE when considering only multi-mapping TE reads. Note: SoloTE’s incompatibility with Smart-Seq2 data results in a blank section. Panel (a) uses ARI for evaluation, while panel (b) utilizes NMI. c The 10x scATAC Dataset of the Adult Mouse Brain is analyzed to contrast peak and TE quantification using scTE and MATES against peak-only datasets (left). TE mapping reads from scTE and MATES are also compared against unique TE mapping reads (right). SoloTE’s incompatibility with scATAC data leads to its exclusion from this part of the analysis.
The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the medians. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles. The experiments were run with N = 10 different seeds. The p-values were calculated using a one-sided Student’s t-test. Source data are provided as a Source Data file.

In our benchmarking analysis of the 10x scATAC mouse brain dataset, we assessed the MATES approach against scTE and peak-only quantification methods. This evaluation aimed to determine the cell clustering accuracy when combining peak and TE data. Our results indicated a clear advantage of MATES over both scTE and peak-only methods in terms of clustering accuracy, showcasing its capacity to integrate additional chromatin accessibility insights effectively. A key aspect of MATES’s performance was its comprehensive quantification of TE expression, which encompassed both unique mapping reads and overall TE data. This approach significantly surpassed scTE, contributing to an enhanced clustering process. The inclusion of multi-mapping reads by MATES, although less critical than in scRNA datasets due to their lower frequency in this scATAC-seq data, provided valuable insights to the clustering analysis, as illustrated in Fig. 6c. These findings not only highlight the effectiveness of MATES but also demonstrate its adaptability and potential for broad applications in various single-cell data modalities.

Furthermore, we benchmarked MATES against existing methods for locus-level TE quantification using the 2CLCs 10x scRNA dataset. Since scTE does not support locus-level TE quantification, we only compared the results from MATES with those from SoloTE. Supplementary Fig. S11 shows the ARI scores for cell clustering based on locus-level TEs, where MATES achieved a 10.52% higher ARI score than SoloTE (P = 2.60 × 10⁻¹²).
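The evaluation protocol used throughout this benchmarking (ARI and NMI computed over N = 10 seeds, followed by a one-sided Student’s t-test) can be sketched as follows. This is a hedged illustration rather than the authors’ pipeline: the clustering step is replaced by a hypothetical `noisy_labels` helper that randomly reassigns a fraction of cells, and “method A” and “method B” are placeholders, not MATES, scTE, or SoloTE. Only the scoring and testing calls (`adjusted_rand_score`, `normalized_mutual_info_score`, `ttest_ind`) are real library functions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def noisy_labels(true_labels, error_rate, seed):
    """Stand-in for one clustering run: reassign a fraction of cells at random."""
    rng = np.random.default_rng(seed)
    pred = true_labels.copy()
    flip = rng.random(pred.size) < error_rate
    pred[flip] = rng.integers(0, true_labels.max() + 1, size=flip.sum())
    return pred

def score_runs(true_labels, error_rate, seeds):
    """Return per-seed ARI and NMI arrays for the simulated clustering runs."""
    runs = [noisy_labels(true_labels, error_rate, s) for s in seeds]
    ari = np.array([adjusted_rand_score(true_labels, p) for p in runs])
    nmi = np.array([normalized_mutual_info_score(true_labels, p) for p in runs])
    return ari, nmi

true_labels = np.repeat(np.arange(3), 100)  # 300 toy cells, 3 cell types
seeds = range(10)                           # N = 10 seeds, as in the figure legends
ari_a, nmi_a = score_runs(true_labels, 0.10, seeds)  # hypothetical method A
ari_b, nmi_b = score_runs(true_labels, 0.40, seeds)  # hypothetical method B

# One-sided Student's t-test (equal variances): is A's mean ARI greater than B's?
t_stat, p_value = ttest_ind(ari_a, ari_b, alternative="greater")
print(f"mean ARI A={ari_a.mean():.3f}, B={ari_b.mean():.3f}, one-sided p={p_value:.2e}")
```

Both ARI and NMI are invariant to how cluster labels are numbered, which is why they can be applied directly to cluster assignments without first matching clusters to cell types.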
The improved performance of MATES over SoloTE may be attributed to MATES’s ability to leverage locus-level TE expression from multi-mapping reads, whereas SoloTE can only quantify locus-level TE expression from uniquely mapped reads.

Validation of MATES quantification accuracy with single-cell long-read sequencing data and simulation experiments

Validation through long-read sequencing data

To validate the accuracy of our TE quantification method, MATES, we used long-read sequencing data from both PacBio and Nanopore platforms. PacBio’s Sequel II system generates high-fidelity (HiFi) reads with superior accuracy, often exceeding 99%, making it ideal for applications requiring precise base-level resolution, such as identifying specific TE insertion sites64. Nanopore sequencing, although typically known for its ultra-long reads, provided reads around 900 bp in this study. With an accuracy rate ranging from 85% to 95%, it still offers valuable insights into repetitive regions and TE structures65. Both long-read sequencing technologies enhance our ability to distinguish similar TE instances by spanning long repetitive regions and capturing more mutations. Including datasets from both platforms robustly validates MATES by leveraging the complementary strengths of each technology.

To ascertain the precision of TE quantification by MATES, we first utilized a melanoma brain metastasis dataset from the nanopore sequencing platform, as detailed in Shiau et al.66. This dataset comprised single-cell nanopore RNA sequencing (scNanoRNAseq)67, producing long-read data with an average length of 937 base pairs (bp). This is significantly longer than the corresponding short-read data of 222 bp obtained from 10x scRNA-seq (real sequencing reads), which came from the same set of cells in the study.
The longer read length of nanopore sequencing is advantageous for TE quantification, as it allows for more accurate read alignment68 (i.e., a reduced multi-mapping rate of ~1% in nanopore sequencing compared to over 12% in 10x short-read sequencing from the same study), thereby enhancing the reliability of TE quantification. This level of precision from long-read data serves as the ground truth for our validation of short-read TE quantification by MATES. In our validation of MATES through nanopore long-read sequencing data, we compared TE expression quantified by each method, including MATES, scTE, and SoloTE, within individual cells. To enhance the accuracy of our correlation calculations and accurately reflect the relationship between short-read and long-read expressions, we included only well-defined TE families (DNA, LTR, RC, Retroposon, SINE, and LINE). This ensured a focused analysis of transposable elements in which the R² values truly represented the biological data, free from distortions caused by non-expressive or anomalous readings. MATES’s quantification of TEs demonstrated a strong correlation with the ground-truth long-read data (R² = 0.7531), surpassing both scTE (R² = 0.6499) and SoloTE (R² = 0.6841) in quantifying TE expression at the subfamily level from the real 10x short-read data, as seen in Fig. 7a–c. Additionally, the pseudo-bulk TE expression in the data, determined by averaging the TE expression across all cells in the real short-read 10x sequencing dataset, was also quantified by each respective method. MATES demonstrated a stronger correlation with the established ground-truth long-read data, achieving an R² value of 0.9276. This performance exceeded that of scTE and SoloTE, which achieved R² values of 0.8591 and 0.8654, respectively (Supplementary Fig. S12).

Fig. 
7: Validation of MATES TE quantification using nanopore long-read single-cell sequencing and simulation. a–c Analyze the correlation between TE expression quantified by (a) MATES, (b) scTE, and (c) SoloTE from real 10x short-read data and the nanopore long-read sequencing for the same set of cells. This comparison is at the subfamily level. d The distribution of lengths of Alu family TE regions. e Correlation between locus-level TE expression quantified by MATES and the simulated ground truth for the Alu repeat simulated data. f Comparison of quantified results by MATES and scTE to the simulated ground truth for the Alu repeat simulated data. The blue bar and orange bar represent the percentage of simulated reads captured by MATES and scTE, respectively. Among the 6 simulated Alu families, MATES on average captured 96.14% of the simulated reads while scTE captured 92.57%. The p-values of R² (coefficient of determination) were calculated using the one-sided F-test. Source data are provided as a Source Data file.

We then benchmarked MATES against scTE and SoloTE using PacBio data from the postnatal mouse brain, with the processed long-read data of 7896 cell barcodes and an average length of 1,164 bp69. The data also contain short-read data from the same number of cells with a length of 91 bp. In this analysis, MATES achieved the highest correlation between the long-read and short-read expression data, with an R² value of 0.5096, compared to 0.3537 for scTE and 0.4199 for SoloTE (Supplementary Fig. S13). The PacBio long-read validation therefore also supports the higher accuracy of MATES in TE quantification.

Validation through controlled simulation

Despite the validation provided by long-read sequencing from Nanopore and PacBio platforms, these technologies have limitations in capturing short TEs like Alu repeats. To address these limitations, we conducted benchmarking using simulated Alu repeats data, with lengths of around 300 bp (Fig. 
7d). Using the simulation data as the ground truth allows for more robust comparisons of different methods’ performances, as the benchmarking is not affected by sequencing errors or other technical limitations. Please see ‘Validation with controlled simulation’ in Methods for the details of constructing the simulation dataset. The MATES quantification was then compared against the simulated ground truth, yielding an R² value of 0.7420 (Fig. 7e). The simulated data consist of full-length RNA-seq reads without UMIs, which are required by SoloTE for processing and mapping. Consequently, SoloTE is incompatible with this data, prompting a focus on comparing subfamily-level performance with scTE. As illustrated in Fig. 7f, the MATES quantification was closer to the ground truth than that of scTE.

Additionally, simple and low-complexity repeats, which are pervasive in genomic data, are also short, yet simple repeats have important impacts on evolution and human disease70,71,72. Therefore, we also validated the effectiveness of MATES in quantifying the expression of simple and low-complexity repeats. Cluster-specific simple repeats also emerged as top marker TEs in the above results (e.g., Fig. 3e, Fig. 4e, and Fig. 5n), highlighting their potential roles. To show the accuracy of read assignment at the locus level for simple repeats, we conducted a detailed simulation study similar to the simulation of the Alu family. The correlation between the MATES-quantified read numbers and the simulated reads at each locus was evaluated, achieving a high correlation (R² = 0.7479, P = 2.02 × 10⁻⁴³), as illustrated in Supplementary Fig. S14a. Supplementary Fig. S14b compares the difference in read numbers between the ground truth and the quantified results from MATES and scTE, respectively. Shorter error bars indicate closer alignment to the ground truth.
Although scTE showed better performance in a few TE instances, MATES generally demonstrated improved performance across most subfamilies.

Another concern is that repeats can form different isoforms through alternative splicing, further challenging model performance. For instance, HERV-K has many isoforms with varying lengths (Supplementary Fig. S15a, b). To address this, we employed a controlled simulation to demonstrate MATES’ capacity for quantifying TE isoforms. Specifically, we simulated TE1 and TE2 (Supplementary Fig. S15c) and obtained 4126 ground-truth reads for TE1 and 1282 reads for TE2. Using MATES, pre-trained on human scRNA-seq data, we predicted 4140 reads for TE1 and 1295 reads for TE2 (Supplementary Fig. S15d). These results closely matched the ground truth, demonstrating MATES’ ability to accurately quantify TE isoforms and differentiate between regions with and without deletions, even with the complexity of alternative splicing.

To directly evaluate the accuracy of locus-level TE quantification, we simulated full-length short reads from scNanoRNA-seq long-read data66 to serve as a proxy ground truth. The advantage of nanopore long-read sequencing lies in its ability to generate reads that are sufficiently long to capture variations between different TE instances of the same or similar TE subfamilies. This capability provides a more precise representation of locus-level TE quantification, enabling a systematic and objective assessment of the performance of methodologies in quantifying TE loci. MATES and the SoloTE strategy were run on the simulation data and evaluated against the proxy ground truth from the long-read data. As a result, MATES demonstrated a higher correlation (R² = 0.4923, P < 1 × 10⁻³²⁴) in profiling locus-specific TE expression compared to the imitation of the SoloTE strategy (R² = 0.1178, P < 1 × 10⁻³²⁴) (Supplementary Fig. S16).
This result highlights MATES’s better performance in profiling locus-specific TE expression.

In conclusion, these long-read sequencing and controlled simulation based validations effectively demonstrated MATES’s robustness in quantifying TEs, even in complex scenarios involving alternative splicing and short TEs. This underscores the capability of MATES to handle diverse and challenging contexts in TE quantification, providing accurate and meaningful insights.
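The validations above report R² values together with p-values from a one-sided F-test. For readers who want to reproduce that kind of statistic on their own quantification tables, the sketch below shows one common way to obtain both from a simple linear regression of estimated counts on ground-truth counts. It is an assumption that the reported values were computed this way; the function name `r2_with_f_test` and the simulated toy data are hypothetical and are not taken from the MATES code.

```python
import numpy as np
from scipy import stats

def r2_with_f_test(truth, estimate):
    """Regress estimate on truth; return R^2, the F statistic, and its upper-tail p-value."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    n = truth.size
    fit = stats.linregress(truth, estimate)
    r2 = fit.rvalue ** 2
    # With a single predictor, F = R^2 / (1 - R^2) * (n - 2) on (1, n - 2) degrees of freedom.
    f_stat = r2 / (1.0 - r2) * (n - 2)
    p_value = stats.f.sf(f_stat, 1, n - 2)  # one-sided (upper-tail) F-test
    return r2, f_stat, p_value

# Toy data: simulated ground-truth read counts for 50 loci plus noisy estimates.
rng = np.random.default_rng(0)
truth = rng.poisson(20, size=50).astype(float)
estimate = truth + rng.normal(0.0, 3.0, size=50)

r2, f_stat, p_value = r2_with_f_test(truth, estimate)
print(f"R^2={r2:.4f}, F={f_stat:.1f}, p={p_value:.2e}")
```

For a single predictor the F statistic equals the square of the slope’s t statistic, so the upper-tail F-test p-value coincides with the two-sided p-value that `linregress` reports for the slope.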
