Real-time and programmable transcriptome sequencing with PROFIT-seq

Programmable total transcriptome sequencing with PROFIT-seqTo effectively enrich various transcript types, we first employed a combinatorial RT strategy using double-stranded oligo(dT), double-stranded random primer and single-stranded random primer to capture full-length polyadenylated, non-polyadenylated and circular transcripts (Fig. 1a and Methods). Then, we employed splint-based circularization12 and the RCA assay following the Rolling Circle Amplification to Concatemeric Consensus (R2C2) protocol13 to increase the length of cDNAs and improve the power of discriminating finished and rejected reads. Here, the increased length of RCA products (>10 kb) compared with linear RT products (~1 kb) provided the basis for effective target enrichment. Finally, the library was constructed and sequenced on the MinION platform. Raw sequencing reads were divided into subreads using the circularization splint sequence, and consensus sequences were calculated to accurately represent the high-fidelity transcripts (Methods). To improve the applicability of current adaptive sampling tools, PROFIT-seq provides a user-friendly web interface for programmable real-time sequencing manipulation, allowing users to specify sequencing mode, run duration, channel number, barcodes and genes of interest, or upload a configuration file for multiple target submissions (Extended Data Fig. 1a–c).Fig. 1: Real-time and programmable transcriptome sequencing with PROFIT-seq.a, A schematic overview of the PROFIT-seq method. First, the ribosomal RNA-depleted total RNA was reverse transcribed using a combinatorial RT strategy. Double-stranded oligo(dT) (dsdT), dsN and ssN were successively added to capture polyadenylated, non-polyadenylated and circular RNAs. The reverse-transcribed cDNA was then circularized and amplified using the RCA assay. The nanopore sequencing library was constructed, and PROFIT-seq was used for real-time control of the sequencing process. The acquired chunk data were basecalled and demultiplexed according to the sequencing time, channel number and detected barcodes. The basecalled sequences were subsequently aligned to the reference genome. Finally, PROFIT-seq determined whether the sequencing process should be continued or rejected according to the user-provided sequencing configuration. b, The length of rejected (purple) and finished (green) reads for canonical DNA, cDNA and PROFIT-seq runs. All bulk fast5 runs were simulated for sequencing all reads or rejecting all reads for 1 h. c, The elapsed time for raw signal basecalling, sequence alignment and pore manipulation for each acquired chunk of data. d, PROFIT-seq provides three manipulation modes, including enrichment or depletion of reads aligned to target regions and the balancing mode for dynamic determination of enriching targets based on the accomplished coverage. e, The performance of enriching chromosomes 1, 2, 5, 11 and 12 and depleting other chromosomes (middle) or balancing coverage of all chromosomes (right). Source numerical data are available in Source data.Source dataTo demonstrate the benefit of the PROFIT-seq protocol, we simulated different DNA, cDNA and PROFIT-seq sequencing runs for 1 h using the playback function of the MinKNOW software (Methods). As shown in Fig. 1b, the rejected reads in both runs had a median length of ~600 bp, with rejected reads in the cDNA run inevitably accounting for ~67.5% (554 of 820 bp) of full-length sequences, impeding the enrichment efficiency. In contrast, the finished reads in the PROFIT-seq run were 10-fold longer than rejected reads, indicating a more discriminative length difference. The real-time analysis efficiency of PROFIT-seq was verified by assessing the elapsed time for data processing steps. Based on the mapping rate of basecalled sequences, an optimized data acquisition interval of 0.4 s was determined (Extended Data Fig. 1d), and all data processing steps were able to be finished within every chunk acquisition period (Fig. 1c). Then, the enrichment effect was evaluated using three pore manipulation modes: (1) enrichment of target region, (2) depletion of target region and (3) balancing mode, where target regions are dynamically determined according to the sequenced coverage (Fig. 1d). PROFIT-seq successfully enriched target regions, with 72.10% of consensus reads aligning to target chromosomes in enrichment mode compared with 35.00% in control. In addition, the standard variance of reads aligned to different chromosomes decreased from 2.20 in control samples to 1.48 in balancing mode (Fig. 1e), demonstrating the successful pore manipulation using PROFIT-seq.Combinatorial whole-transcriptome RTAs current nanopore cDNA sequencing protocols often employ oligo(dT)-based RT to capture polyadenylated RNAs, the diversity of non-polyadenylated transcripts is largely missing. In contrast, PROFIT-seq employs a combinatorial RT strategy with double-stranded oligo(dT), double-stranded random primer and single-stranded random primer to capture full-length polyadenylated, non-polyadenylated and circular RNAs14. To assess the performance of our combinatorial RT protocol, total RNA from HeLa cells was reverse transcribed using indexed combinatorial RT primers (Methods), and reads were demultiplexed on the basis of the aligned index primers in the first and last 150 bp of the sequences. In two replicates, 82.35% of the sequenced molecules originated from double-stranded oligo(dT) (dsdT) primers, while double-stranded random hexamers (dsN) and single-stranded random hexamers (ssN) accounted for 12.55% and 5.05%, respectively (Fig. 2a). The majority of combinatorial RT reads aligned to the exonic regions of GENCODE-annotated genes (Extended Data Fig. 2a). Compared with the oligo(dT) primer, random hexamers generated reads of similar length (Fig. 2b) but displayed a more even distribution across the 5′ end of transcripts (Fig. 2c) and captured more non-polyadenylated transcripts (for example, pseudogene, long non-coding RNA (lncRNA), miscellaneous RNA (miscRNA) and small nuclear RNA (snRNA)) (Fig. 2d). In particular, ssN primers captured more circular RNAs (circRNAs), aligning with our original intent of the combinatorial RT strategy. Compared with oligo(dT) libraries, the combinatorial RT sample also demonstrated an increase in the proportion of reads from circRNAs and other non-polyadenylated transcripts (Extended Data Fig. 2b,c), indicating that PROFIT-seq was able to efficiently capture non-polyadenylated RNAs.Fig. 2: Simultaneous profiling of polyadenylated and non-polyadenylated transcriptomes using the combinatorial RT strategy.a, The percentage of reads derived from different primers. Rep, biological replicates. The colours indicate dsdT, dsN and ssN, respectively. b, The length of different primed reads in combinatorial RT libraries. c, The coverage across the top 1,000 highly expressed transcripts in different primed reads. The x axis represents the relative position along the transcript, and the y axis is the per cent coverage of combined reads. d, The proportion of reads aligned to transcripts of different biotypes according to the GENCODE annotation. The colours represent different primers. The proportion of circRNA reads is calculated using CIRI-long. e, The identification of poly(A)+ and poly(A)− genes from Yang et al.16. Poly(A)+, poly(A)− and bimorphic genes were classified according to the relative abundance between the poly(A)+ and poly(A)− samples. f,g, The log-scaled gene expression levels of poly(A)+ and poly(A)− genes in combinatorial RT, oligo(dT) primed (f) and ONT direct RNA-seq data (g). Poly(A)+ and poly(A)− genes were classified according to the relative abundance between the poly(A)+ and poly(A)− samples from Yang et al. The colours represent the density of transcripts. h, Tracks of sequencing depth and reconstructed isoforms in the RPL34 and RPS2 loci. i, Tracks of sequencing depth and reconstructed isoforms in the MYC and SMARCE1 loci. The GENCODE v37 annotation and genomic coordinates are indicated above the tracks. Source numerical data are available in Source data.Source dataTo validate the quantification ability of the combinatorial RT method, two random and two oligo(dT) primed cDNA libraries were constructed and sequenced as described above. Compared with the HeLa Illumina total RNA-seq15,16, oligo(dT), combinatorial RT and random primed libraries exhibited high correlation of gene-level expression values (R = 0.65; Extended Data Fig. 2d), which demonstrated that combinatorial RT does not bias quantification. For the poly(A)+ transcriptome, the combinatorial RT library (R = 0.65) achieved similar transcript-level quantification accuracy to oligo(dT) libraries (R = 0.67; Extended Data Fig. 2e), while the random primed library was substantially biased (R = 0.49). Thus, these results demonstrated that combinatorial RT could combine the poly(A)+ and poly(A)− transcriptomes without introducing notable bias that skews the quantification analyses. To further investigate the performance of combinatorial RT, a total of 22,649 poly(A)+ and 302 poly(A)− genes were identified using the public poly(A)-enriched and poly(A)-depleted dataset16 (Fig. 2e). Among them, a total of 16,453 poly(A)+ and 111 poly(A)− genes were successfully recovered in the combinatorial RT dataset. The abundance of poly(A)+ genes correlated highly with oligo(dT) and ONT direct RNA-seq17, while poly(A)− genes were enriched by the combinatorial RT protocol (Fig. 2f,g), indicating that combinatorial RT can better characterize the non-polyadenylated transcriptome compared with canonical poly(A)+ RNA-seq approaches.For instance, canonical oligo(dT) cDNA-seq only sequenced two protein-coding isoforms from the RPL34 gene, while combinatorial RT effectively captured the retained intron supported by the GENCODE v37 annotation18 (Fig. 2h). Similarly, oligo(dT) data only captured the major protein-coding isoform in the RPS2 gene, but combinatorial RT effectively reconstructed three alternative-spliced isoforms and one pseudogene from the same locus. Moreover, the combinatorial RT also effectively captured more transcript isoforms in lower-expressed transcription factors such as MYC and SMARCE1 (Fig. 2i), demonstrating enhanced isoform discovery sensitivity.To examine the quantitative performance of the complete PROFIT-seq protocol, the HeLa transcriptome was sequenced using the combinatorial RT and R2C2 amplification. As shown in Extended Data Fig. 3a, the transcript expression levels from the PROFIT-seq library and combinatorial RT library without R2C2 amplification were significantly correlated (R = 0.87, P < 10−31, Pearson correlation test), indicating reliable quantitative capability. Consistent with previous studies13,19, the consensus reads with a high circular consensus sequence number (>5) exhibited a high accuracy rate (99.22%; Extended Data Fig. 3b). In addition, PROFIT-seq provides a lightweighted strategy for consensus calling and RCA chimeric filtration (Methods), showing comparable performance to the C3POa13 algorithm with similar read accuracy (Extended Data Fig. 3c–e), faster processing speeds (Wilcoxon rank sum test, P = 0.02; Extended Data Fig. 3f) and fewer RCA artefacts (Wilcoxon rank sum test, P = 0.02; Extended Data Fig. 3g).Overall, these results demonstrated that PROFIT-seq can effectively sequence polyadenylated, non-polyadenylated and circular transcripts with high accuracy, which can better delineate transcriptome diversity compared with the canonical cDNA sequencing protocol.Enrichment effect of PROFIT-seq on the target gene panelNext, we evaluated the ability of PROFIT-seq to enrich the target gene panel in two colorectal polyp samples (JL01 and JL19). Here, a panel of 717 protein-coding genes, 1,055 lncRNAs and 606 circRNAs that are related to colorectal cancer were selected as enrichment targets on the basis of public databases20,21,22,23 (Fig. 3a). Each sequencing flow cell was divided into two sections, where half of the channels were sequenced with PROFIT-seq for target enrichment and the other half using the default program as a control. To validate the enrichment efficiency, the number of on-target and unwanted raw reads and consensus reads was calculated in a sliding window of 30 min. As shown in Fig. 3b,c, PROFIT-seq sequenced more molecules and produced more on-target raw and consensus reads in the first 12 h, and the number decreased gradually with the accelerated loss of pores during adaptive sequencing8. Finally, PROFIT-seq achieved an increased number of 1.9× target molecules and 1.68× target consensus reads compared to control runs, indicating the successful enrichment of cancer-related transcripts (Fig. 3d). Then, we investigated the number of cancer-related genes covered using the generated consensus reads. PROFIT-seq rapidly detected 363 target genes (26 mRNA, 286 lncRNA and 39 circRNA loci) within 6 h, while the control runs only sequenced 219 target genes (18 mRNA, 174 lncRNA and 21 circRNA loci) within that same amount of time (Fig. 3e,f). Finally, PROFIT-seq also generated a better number of recalled genes in the complete run. Here, 248 genes stably enriched with a >2-fold increase (Fig. 3g), and the overall enrichment efficiency was significantly correlated between the two samples (R = 0.27, P = 4.61 × 10−23, Spearman correlation test; Fig. 3h), suggesting the effective enrichment of the cancer-related gene panel.Fig. 3: Enrichment and isoform reconstruction of the target gene panel.a, The genomic location of the target gene panel (1,509 loci). The number of transcripts from different databases is indicated in the table. b,c, The absolute number of sequenced molecules (left) and consensus reads (right) aligned to target regions (b) or out-of-target regions (c). The colours represent PROFIT-seq and control runs, and the styles of lines indicate two biological replicates (JL01 and JL19). d, The cumulative number of sequenced molecules (left) and consensus reads (right) aligned to target regions. The colours represent PROFIT-seq and control runs, and the styles of lines indicate two biological replicates (JL01 and JL19). e, The cumulative number of genes covered by consensus reads during the sequencing run. The colours correspond to the PROFIT-seq (red) and control (grey) runs. The styles indicate two biological replicates. f, Bar plot indicating the number of recalled genes at 6 h in two biological replicates. g, The enrichment efficiency of each target gene at 6 h. The colours represent whether genes are 2-fold (red) or <2-fold (grey) enriched. h, The log-scale fold change (enrichment efficiency) of target genes between PROFIT-seq and control data in two polyp samples. The colours represent robustly enriched (red, ≥2-fold enrichment in both samples) and minorly enriched (grey) genes, respectively. R = 0.27, P = 4.61 × 10−23, Spearman correlation test. FC, fold change. i, The number of genes with more (orange) or fewer (blue) reconstructed isoforms in PROFIT-seq compared with control data. j,k, Tracks of sequencing reads and reconstructed isoforms in the GAS5 (j) and BID (k) loci. The numbers represent the discovery sensitivity of GENCODE-annotated transcripts in each sample. Source numerical data are available in Source data.Source dataWith the effective enrichment of cancer-related genes, we next examined the ability of PROFIT-seq to reveal RNA isoform diversity. Here, StringTie2 (ref. 24) was utilized for genome-wide transcript assembly using consensus PROFIT-seq reads. Compared with the control runs, PROFIT-seq dramatically improved the number of reconstructed transcripts in the target locus. As shown in Fig. 3i, 78.11% and 69.70% of cancer-related genes exhibited increased isoform diversity, while no similar effect was observed for non-target genes. For instance, only three isoforms from the GAS5 loci were identified in the control data, while seven and five isoforms were reconstructed in PROFIT-seq runs, respectively (Fig. 3j). Similarly, a notable increase in the aligned reads and reconstructed isoforms was also observed for the BID locus (Fig. 3k), indicating an increased discovery sensitivity of annotated genes. Taken together, these results suggested that PROFIT-seq was able to effectively enrich cancer-related genes and their splicing isoforms in clinical samples without prior experimental enrichment.Unbiased quantification of targeted and non-targeted RNAsAs targeted reads are selectively enriched during the sequencing run, the quantification of targeted and non-targeted RNAs could be biased. To address this issue, we proposed an expectation–maximization (EM)-based algorithm that combines full-length consensus reads and rejected partial reads for effective transcript reconstruction and unbiased quantification (Fig. 4a). Briefly, basecalled reads were divided into full-length reads and partial fragments on the basis of the existence of RT primers and template-switching oligo sequences. Enriched full-length reads were aligned to the reference genome, and transcripts were assembled using StringTie2 (ref. 24). Next, partial fragments of rejected reads were extracted and realigned to the reconstructed transcripts. The expression value of each assembled transcript was measured by Salmon25 using full-length and partial reads, respectively, and the final expression values were determined on the basis of a modified EM algorithm (Methods).Fig. 4: Unbiased quantification of transcripts using the EM-based strategy.a, A schematic overview of the isoform assembly and quantification pipeline. First, full-length reads were aligned to the reference genome for isoform reconstruction. Then, the recovered partial fragments were aligned to the reconstructed isoforms, and Salmon was used for transcript quantification. The final expression levels were measured using an EM-based strategy by combining the full-length reads and recovered partial fragments. b, The percentage of full-length reads for the target (right) and non-target (left) transcripts in the JL01 sample. c, The expression levels of the target (red) and non-target (grey) genes in PROFIT-seq using recovered partial fragments. d, The expression levels of the target (red) and non-target (grey) genes before correction (left) or after correction using the EM algorithm (right) in the JL01 sample. e, Correlation of gene expression levels for Illumina total RNA-seq, control runs without pore manipulation and PROFIT-seq data using either the EM quantification strategy or only full-length reads. The colours represent the correlation coefficient calculated using the Spearman correlation test. f, The log-scale fold change of target gene expression values in PROFIT-seq and Illumina total RNA-seq data as controls. The colours indicate the expression level before correction (grey) or after correction using both full-length PROFIT-seq reads and recovered partial fragments (red). FC, fold change. Source numerical data are available in Source data.Source dataIn PROFIT-seq data, the percentage of full-length consensus reads in target transcripts was higher than that in non-target transcripts, indicating effective enrichment but also quantification bias (Fig. 4b and Extended Data Fig. 4a). Intriguingly, a high Spearman correlation (R = 0.687) between gene quantification results calculated from partial fragments and control runs was observed (Fig. 4c), which demonstrated the viability of expression level correction using these partial fragments. When using full-length reads only, the gene expression values of target genes were substantially overestimated (Fig. 4d and Extended Data Fig. 4b), and the correlation was notably improved after the adjustment by the EM quantification algorithm (R = 0.717), indicating the effective correction of quantification bias.We further benchmarked performance of the EM-based quantification results against Illumina total RNA-seq data. As shown in Fig. 4e, the raw full-length PROFIT-seq data without quantification adjustment exhibited a weak correlation to the Illumina data, while adjusted gene quantification using both full-length and partial fragments was similar to that without sequencing manipulation. Notably, target gene expression levels in PROFIT-seq data exhibited a much lower dispersion level with Illumina and control samples (Fig. 4f and Extended Data Fig. 4c). Taken together, these results indicated that PROFIT-seq was able to effectively enrich target transcripts while maintaining unbiased transcript expression levels.Rapid pathogen detection and variant identificationTo demonstrate the performance of PROFIT-seq in detecting low-abundance pathogens, a total of 16 sputum samples were collected from 8 patients with pneumonia and 8 patients infected with SARS-CoV-2. All samples were then sequenced using a modified PROFIT-seq protocol that uses only dsN and ssN to capture pathogenic RNAs without poly(A) tails. Each flow cell was divided into two sections for PROFIT-seq and control sequencing for 24 h as described above. To explore the pathogenic composition of each sample, the genomic sequences of pneumonia-related pathogens26,27 (for example, Streptococcus pneumoniae, Staphylococcus aureus and Klebsiella pneumoniae) and SARS-CoV-2 were downloaded from the RefSeq28 database as enrichment targets (Fig. 5a). Then, both non-full-length reads and cleaned consensus reads were aligned to the National Center for Biotechnology Information (NCBI) non-redundant protein sequences (NR) database using the long-read mode of diamond2 (ref. 29).Fig. 5: Rapid pathogen detection for sputum samples of patients with pneumonia or COVID-19.a, A schema of sample collection and target enrichment. Each flow cell was divided into two sections, where half of the pores were used to enrich target pathogens and the other half were sequenced without manipulation as a control. Adapted from Servier Medical Art by Servier under a Creative Commons license CC BY 4.0. b, The yield of total and on-target sequenced molecules in control and adaptive run. The y axis represents the number of raw reads or bases of all reads (top row) or target transcripts (bottom row). The bar colours represent control and adaptive runs, respectively. c, The coverage of SARS-CoV-2 genome during the sequencing process. The colours represent PROFIT-seq (red) or control runs (blue). The dashed lines indicate the time taken for adaptive sampling runs to achieve equivalent coverage as 24 h of control runs. d, The performance of PROFIT-seq in reducing elapsed time for SARS-CoV-2 detection. The points represent individual samples. The error bars indicate 95% confidence intervals (P = 0.0007, two-sided Wilcoxon rank sum test, n = 8 biological replicates). e, The performance of PROFIT-seq in increasing data yield for SARS-CoV-2 detection. The points represent individual samples. The error bars indicate 95% confidence intervals (P = 0.0007, two-sided Wilcoxon rank sum test, n = 8 biological replicates). f, The sensitivity of S-protein variant calling using PROFIT-seq (red) or control (blue) data. g, The composition of pathogenic bacteria in the S01 and S06 samples. h, The composition of detected pathogens in the sputum samples of patients with pneumonia using PROFIT-seq (top) and control data (bottom). The colours in g and h correspond to different target pathogens. Pneumonia and COVID-19 samples were named as P and S, respectively. Source numerical data are available in Source data.Source dataIn agreement with previous studies, the total yield of raw reads in PROFIT-seq libraries was lower than unmanipulated control8, but a substantial enrichment in the number and yield of both raw reads and consensus reads from target pathogens was observed (Fig. 5b and Extended Data Fig. 5a). Notably, the amount of on-target data generated by the PROFIT-seq runs within 6 h was equivalent to that of control runs within 24 h (Extended Data Fig. 5b–d). Moreover, the total number of pathogen-derived consensus reads generated was also increased, with an average of 3.21- and 3.57-fold enrichment in the COVID-19 and pneumonia samples, respectively (Extended Data Fig. 5e). For the patients with COVID-19, we first evaluated the diagnostic efficiency measured by the coverage of the SARS-CoV-2 genome (minimum ten supporting consensus reads). Compared with the control runs, the coverage of the SARS-CoV-2 genome detected by PROFIT-seq exhibited a more rapid and efficient increase (Fig. 5c). Specifically, PROFIT-seq required much less time (24.40%) to achieve the same coverage (Fig. 5d) and was able to generate a higher yield (3.3-fold) of target pathogenic consensus reads within the same amount of time (Fig. 5e).In particular, PROFIT-seq runs generated higher sequencing depth for the spike (S) protein, providing the basis for identifying high-confidence mutations (Extended Data Fig. 5f). Notably, all 22 representative Omicron mutations recorded in the 2019nCoVR database30 and the key mutations of Omicron BF.7 subvariants31 (for example, R346T, L452R and F486V) were successfully identified, which is consistent with the pandemic of BF.7 subvariants in the sampling region (Extended Data Fig. 5g). Of note, compared with the control runs, the PROFIT-seq method enabled the rapid and effective detection of S-protein mutations (Fig. 5f). The sensitivity of mutation identification was saturated within 6–12 h using PROFIT-seq, which surpassed the final sensitivity of 24 h sequencing in the control runs. Interestingly, although the S01 and S06 samples contained a low abundance of SARS-CoV-2, the majority of key mutations (~80%) could still be detected, indicating the significant advantage of PROFIT-seq in detecting low-abundance pathogen transcripts with highly accurate consensus sequences.Next, we examined the composition of target bacterial pathogens in all samples. As shown in Extended Data Fig. 5f, although a high proportion (>60%) of coronavirus was observed in most COVID-19 samples, S01 and S06 showed a distinct pathogen composition, in which >70% of assigned reads were identified as Haemophilus influenzae (Fig. 5g). The concurrent infection of H. influenzae and SARS-CoV-2 in these two samples also explained the low efficiency in detecting SARS-CoV-2 variants. For pneumonia samples, PROFIT-seq also exhibited better performance in detecting pathogenic reads (Fig. 5h), revealing the diverse source of dominant pathogens in different patients with pneumonia (for example, K. pneumoniae in P13, S. aureus in P15, S. pneumoniae in P26 and H. influenzae in others). Overall, these results suggested the widespread application of PROFIT-seq in rapid characterization of target pathogens in clinical samples.Revealing host–microbiome association in polyp developmentTo further demonstrate the applicability of PROFIT-seq in characterizing complex associations in disease, we performed PROFIT-seq on 18 colorectal polyp samples including 6 inflammatory polyps, 5 low-grade intraepithelial neoplasia (LIN) and 7 high-grade intraepithelial neoplasia (HIN) samples (Fig. 6a), targeting an immune- and tumour-related gene panel covering 106.69 Mb of genomic regions. In addition, reads that were unmapped or chimerically mapped to the reference genome were also enriched to detect unannotated transcripts, including pathogen RNAs or fusion transcripts. In addition, each sample was sequenced using Illumina total RNA-seq and PROFIT-seq without enrichment targets as the control for comparative analysis.Fig. 6: Interaction of immune response and gut microbiome dysbiosis during polyp-to-intraepithelial neoplasia transformation.a, A summary of the sample collection sites and the sequencing strategy. A total of 18 colorectal polyp samples from three clinical stages were collected. All samples were sequenced using Illumina total RNA-seq and PROFIT-seq with and without enriching target panels, respectively. Adapted from Servier Medical Art by Servier under a Creative Commons license CC BY 4.0. b, The number of target genes that are significantly enriched (>2-fold) (orange), minorly enriched (1- to 2-fold) (yellow) or not enriched (grey). Only genes with more than five supporting reads in both control and adaptive sampling runs were included. c, The percentage of reads in filtered regions or unannotated transcripts, which includes reads that are discordantly mapped or unmapped to the reference genome. d, Principal component analysis representation of samples with different sequencing strategies. The colours represent clinical stages, and the shapes indicate sequencing strategies. The dashed circles correspond to the 3 s.d. ellipses (P = 0.001, ANOSIM test). The P value of ANOSIM test is overlaid. e, The fraction of BCR reads identified by TRUST4. The bar colours represent Illumina total RNA-seq (grey), PROFIT-seq without enrichment (yellow) and PROFIT-seq with enrichment of the target panel (orange), and the background colours indicate different clinical stages. Significant difference between PROFIT-seq and control (P = 1.53 × 10−5, Wilcoxon signed rank test). f, The diversity of BCR hypervariable CDR3 sequences measured by CPK individual CDR3 sequences in six inflammatory, five low-grade and seven high-grade polyp samples. In the control groups, significant differences among stages were observed (P = 0.036, Kruskal–Wallis H test) with Dunn’s post hoc test revealing significant changes between inflammatory and high-grade intraepithelial neoplasia (IN) (P = 0.027), low-grade IN and high-grade IN (P = 0.030). In the PROFIT-seq groups, significant differences among stages (P = 0.017, Kruskal–Wallis H test) were also observed, with Dunn’s post hoc test showing significant changes between inflammatory and high-grade CIN (P = 0.011) and low-grade CIN and high-grade CIN (P = 0.021). g, The diversity of BCR hypervariable CDR3 sequences measured by clonality in six inflammatory, five low-grade and seven high-grade polyp samples. In the control groups, significant differences among stages (P = 0.024, Kruskal–Wallis H test) with significant changes between inflammatory and low-grade CIN (P = 0.006, Dunn’s post hoc test) were observed. In the PROFIT-seq groups, significant differences among stages (P = 0.022, Kruskal–Wallis H test) with significant changes between inflammatory and low-grade CIN (P = 0.006, Dunn’s post hoc test) were also observed. h, The usage of immunoglobulin (IG) heavy-chain isotypes in the PROFIT-seq data of 18 polyp samples. The colours represent different isotypes of the IG heavy chain. i, A heat tree representation of detected microbiotas. The colours represent the normalized change level of bacterial taxa at different ranks in 18 samples measured using the test statistic of Kruskal‒Wallis H test. j, A dot plot representation of the abundance of pathogens related to colorectal cancer. The colours within each dot represent normalized abundance across all samples. For box plots, the middle lines represent the median, and the lower and upper bounds represent the first and third quartiles. The upper and lower whiskers represent the limits of 1.5 inner quantile ranges, and points outside this range are plotted as outliers. Source numerical data are available in Source data.Source dataFor highly supported genes with more than five aligned reads in both runs, 80.79% of targets were successfully enriched by PROFIT-seq, with 26.43% of the genes over 2-fold enriched (Fig. 6b). Overall, the target gene panels were significantly enriched (Wilcoxon rank sum test, P < 0.001) in all 18 PROFIT-seq runs (Extended Data Fig. 6a). Moreover, the percentage of consensus reads that mapped to unwanted ‘filtered regions’ was decreased (Fig. 6c), indicating the effective enrichment of the target gene panel. Moreover, PROFIT-seq generated more discordantly mapped or unmapped reads and fewer unwanted reads than Illumina total RNA-seq data, which validated the advantages of our combinatorial RT strategy in capturing and enriching the full spectrum of the transcriptome. Notably, the principal component analysis of gene expression levels demonstrated that samples from the same biological groups were similar regardless of the sequencing strategy (P = 0.001, ANalysis Of SIMilarity (ANOSIM) test; Fig. 6d), indicating that the PROFIT-seq strategy was able to preserve the real biological diversity between different samples.For immune receptor repertoire analysis, the T cell receptor (TCR) and B cell receptor (BCR) sequences were de novo assembled using TRUST4 (ref. 32) and annotated with the international ImmunoGeneTics (IMGT) database33. As expected, the effective enrichment of BCR and TCR reads (P = 1.53 × 10−5 to 0.014, Wilcoxon signed rank test) was clearly observed in the PROFIT-seq data (Fig. 6e and Extended Data Fig. 6b), whereas the percentage of both BCR and TCR reads was also significantly higher than that of canonical Illumina total RNA-seq data (P = 7.63 × 10−6, Wilcoxon signed rank test). As a result, a gradual decrease in diversity (measured by clonotypes per kilo-reads, CPK) (Fig. 6f) and increase in clonality of hypervariable complementarity-determining region 3 (CDR3) of BCRs (Fig. 6g) were observed in both PROFIT-seq and control data (P = 7.63 × 10−6, Kruskal‒Wallis H test), which was consistent with the reported depletion of B cell diversity in colon cancer against adjacent non-cancerous samples34. To determine the specific BCR isotype that is related to the malignant transformation of polyps, the usage of different immunoglobulin heavy-chain (IGH) isotypes was assessed. As shown in Fig. 6h, PROFIT-seq revealed a significant disturbance in IGHA2 (P = 0.028, Kruskal‒Wallis H test) and IGHG2 (P = 0.045, Kruskal‒Wallis H test), suggesting the switch of pathogenic microbiota during the polyp malignant transformation process35,36. Notably, the IGH isotypes and IGHV and IGHJ family usage observed in PROFIT-seq were successfully confirmed by Illumina RNA-seq data (Extended Data Fig. 6c,d). Finally, the recombination pattern of the V and J segments was investigated. Although the combination varied in different samples, the relative usage of the top-utilized combination was more centralized in the high-grade intraepithelial neoplasia groups (Extended Data Fig. 6e), consistent with the decreasing diversity of BCR in low- and high-grade intraepithelial neoplasia samples.Taking advantage of the ability to capture the full spectrum of the transcriptome, we further investigated the microbiota composition during the transition of polyps to neoplasia (Fig. 6i). Overall, the occurrence of common intestinal microbiota was detected, and a reduced alpha diversity of the polyp microbiome, although not significant (P = 0.17, Kruskal‒Wallis H test), was observed in the PROFIT-seq data (Extended Data Fig. 6f). In particular, the abundance of Fusobacterium, whose infection in colorectal cancer has been widely reported37, was significantly increased in high-grade intraepithelial neoplasia (P = 0.004, Dunn–Bonferroni post hoc test; Fig. 6k). Other colorectal cancer-related gut bacteria, including Bacteroides, Firmicutes, Enterococcus, Escherichia, Proteobacteria and Streptococcus38,39,40, were also effectively captured in the PROFIT-seq data. Together, the reduced microbial diversity and changes in gut microbiota composition indicated dysbiosis of the healthy microbiome during neoplasia progression. It should be noted that several fusion transcripts were successfully detected in the PROFIT-seq data. For instance, the fusion of two protein-coding genes, PCNP and RPS18 (Extended Data Fig. 6f), was frequently observed in two low-grade and two high-grade IN samples, and this PCNP–RPS18 fusion event was also reported in the ChiTaRS 5.0 database41. In summary, these results demonstrated the promising ability of PROFIT-seq to characterize the full spectrum of the transcriptome with polyadenylated and non-polyadenylated RNAs and to enrich target and unannotated transcripts in a programmable manner.

Hot Topics

Related Articles