Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection

Design and synthesize DNA and RNA oligosWe designed oligo sequences following procedures reported in ref. 2 using the CURLCAKE software (http://cb.csail.mit.edu/cb/curlcake/). Specifically, we covered all possible DNA 6mers (4096 in total, median occurrence as 5) and RNA 5mers (1024 in total, median occurrence as 10). We also avoided secondary structures, the formation of which during nanopore sequencing will increase the translocation speed further biasing ionic current signals45,46. We generated two independent sequence sets, for both DNA and RNA, with different CURLCAKE random seeds. Such sets represent different sequence contexts: as larger sequence scopes (k > 6 and 5 for DNA and RNA, respectively) were surveyed, the fraction of overlapping kmers against total kmers among sets drastically decreased. Lengths for yielded DNA and RNA sequences are 25 kb and 12.5 kb, respectively. We split CURLCAKE sequences into “curlcakes” of ~2.5 kb for synthesis purposes. For each DNA curlcake, we added HindIII sites to both 3′ and 5′ ends and removed all the internal HindIII sites. For each RNA curlcake, we first added a strong T7 promoter to the 5′ ends. We then added EcoRV sites to both 3′ and 5′ ends and removed all the internal EcoRV and BamHI sites. Final DNA and RNA sequence backbones were provided in Tables S1 and S2, respectively. We synthesized and cloned all the DNA and RNA curlcakes into the pUC57 vector using blunt EcoRV and HindIII, through the service of GenScript Biotech Corporation.For DNA curlcakes, we incorporated modified nucleotides using PCR. dNTP mixtures, including unmodified (dATP, dTTP, dCTP, dGTP), 5 mC (dATP, dTTP, 5m-dCTP, dGTP), 5hmC (dATP, dTTP, 5-hme-dCTP, dGTP) and U (dATP, dUTP, dCTP, dGTP) were mixed with engineered pUC57 plasmids, primers, the reaction buffer and polymerase for PCR. Yielded products were analyzed by agarose gel electrophoresis and extracted using the gel extraction kit. The concentration of resulting DNA was determined by the NanoDrop 2000 Spectrophotometer.For RNA curlcakes, we incorporated modified nucleotides using in vitro transcription (IVT). Engineered pUC57 plasmids were digested with EcoRV and BamHI restriction enzymes for at least two hours at 37 °C, and analyzed via agarose gel electrophoresis. Purification of the digested DNA was conducted using a PCR purification kit. Nanodrop was used to measure the concentration of extracted DNA prior to IVT. AmpliscribeTM T7-FlashTM Transcription Kit was used to generate IVT RNAs as per manufacturer’s instructions. During IVT, modified ribonucleoside triphosphates including N1-Methyl-ATP (m1A), N6-Methyl-ATP (m6A), and 5-Methyl-CTP (m5C) were supplemented in place of their unmodified counterparts. DNAse I was added to the IVT reaction system after incubation for 4 h at 42 °C to eliminate the residual template DNA. Yielded IVT RNAs were purified using the RNeasy Mini Kit following manufacturer’s instructions. NEB vaccinia capping enzyme was used for the 5′ capping of purified IVT RNAs, with an incubation for 30 min at 37 °C. Following purification with RNAClean XP Beads, the capped IVT RNAs were subjected to polyadenylation tailing. Concentration of capped and polyA-tailed IVT RNAs was determined by Qubit Fluorometric Quantitation.Nanopore sequencingDNA nanopore sequencing libraries were prepared using the ONT Ligation Sequencing Kit (SQK-LSK110) following protocol version ACDE_9110_v110_revN_10Nov2020 as per manufacturer’s instructions. Briefly, for each group (unmodified, 5mC, 5hmC, and U), 100 fmol of PCR DNA was subjected to repair and end-prep with NEBNext PPFE DNA Repair Mix and NEBNext Ultra II End repair/dA-tailing Module kits, respectively. After purification using AMPure XP Beads, the product was subjected to adapter ligation with NEBNext Quick Ligation Module, as the DNA sequencing library. The Qubit fluorometer was used to determine the concentration of the DNA library. The DNA library was mixed with the sequencing buffer and loading beads prior to sequencing on a primed MinION flow cell. The flow cell version is R9.4.1, and the sequencer is MinION.RNA nanopore sequencing libraries were built using the ONT Direct RNA Sequencing Kit (SQK-RNA002) following protocol version DRS_9080_v2_revQ_14Aug2019 as per manufacturer’s instructions. Briefly, for each group (unmodified, m1A, m6A, and m5C), 2 μg of capped and polyA-tailed IVT RNA was subjected to adapter ligation using the NEB T4 DNA Ligase, following reverse transcription using the SuperScript III Reverse Transcriptase. After purification using RNAClean XP Beads, yielded RNA:DNA hybrids were ligated to RNA adapters using the NEB T4 DNA Ligase. The concentration of the yielded RNA library was determined by the Qubit fluorometer. The RNA library was mixed with RNA Running Buffer prior to sequencing on a primed Flongle flow cell. The flow cell version is R9.4.1, and the sequencer is MinION with a Flongle adapter.Iterative label modification-rich sequencesBased on the observation that guppy is able to basecall a fraction of fully-modified DNA and RNA oligos, we reasoned that chemical moiety changes on nucleotides are, in most cases, less likely to drastically shift nanopore sequencing signal distributions. Therefore, modification signals with smaller shifts can still be correctly basecalled. Such basecalled signals will further be used to train guppy, and the yielded model will by this means gain more information for basecalling modification signals. Such a process will be iterated till convergence to label modification signals, as shown in Fig. 1.For the DNA oligo labeling, guppy version 6.0.6 + 8a98bbc was used for basecalling and alignment. We used the template_r9.4.1_450bps_hac.jsn model for the first iteration, and the model trained from the previous iteration for the subsequent iteration. Through the guppy basecalling-alignment process, the—disable_qscore_filtering flag was used to keep “low-quality reads” that are usually artifacts caused by modifications, and all the other flags were set default. Samtools version 1.16 was used to merge, sort, and index alignment results outputted by Guppy. The alignment was performed against reference sequences provided in Table S1. Taiyaki version 5.3.0 was used for guppy training. For preparing training data, get_refs_from_sam.py with default flags, generate_per_read_params.py with default flags, prepare_mapped_reads.py with the mLstm_flipflop_model_r941_DNA.checkpoint, merge_mappedsignalfiles.py with default flags were used. For training guppy models, train_flipflop.py with default flags and the template mLstm_cat_mod_flipflop.py, as well as dump_json.py with default flags on the final model checkpoint were used. We performed in total 3 iterations.The RNA labeling followed the process except 1) template_rna_r9.4.1_70bps_hac.jsn was used for the initial basecalling, 2) reference sequences provided in Table S2 were used for alignment, 3) the –reverse flag of get_refs_from_sam.py was used, 4) the r941_rna_minion.checkpoint was used during prepare_mapped_reads.py, 5) flags—size 256—stride 10—winlen 31 were used for train_flipflop.py.The iterative labeling of yeast native tRNAs followed the same process as RNA oligos, with exceptions that 1) -ax map-ont -k5 -w5 flags were used for the minimap247 (version 2.24-r1122) alignment, which was recommended in ref. 27, 2) tRNA reference sequences in ref. 27 were used for alignment.The iterative labeling of E.coli CpG and GpC-methylated genomic DNAs and the human NA12878 cell line native KRT19 DNAs followed the same process as DNA oligos, with the exception that the U00096.3 E.coli genome and the T2T-CHM13v2.0 human genome was used for alignment respectively.Quantify nanopore sequencing signal shiftsWe used ONT kmer models as references to measure signal shifts. During sequencing, consecutive nucleotide kmers translocate through nanopores, further producing ionic current signals in the unit of pico-amp (pA). Signal events aligned to the same kmer are in general summarized using a Gaussian distribution. The mean and standard deviation of Gaussian, together with other trivial parameters for all possible kmers are recorded in kmer models. Specifically, k equals 6 and 5 for DNA and RNA, respectively42.For modified oligos, we first used iterative labeling to generate accurate basecalling and alignment profiles. Such profiles were subsequently used for nanopolish eventalign11,12,13 (version 0.13.3) with the flag –scale-events to make event tables. Event tables record event parameters, including signal levels and aligned kmers for all sequencing events. For events aligned to the same kmer, we calculated p-values against the corresponding Gaussian distribution in the kmer model. We summarized p-values with the empirical cumulative distribution function to measure the per-kmer signal shifts. We randomly selected 80,000 sequencing reads for each DNA oligo group (unmodified, 5mC, 5hmC), and 35,000 sequencing reads for each RNA oligo group (unmodified, m6A) for signal shift quantification analyses.Incremental learningWe adopted knowledge distillation48 for the IL-adaptation of DNA and RNA basecallers. Specifically, we freezed original basecallers throughout the entire IL process as teacher models and initialized tunable student models by duplicating teacher models. During IL, training data for new tasks (basecall modification-induced signals) flowed through both teacher and student models. Our goals are 1) making sure student models are capable of accomplishing new tasks precisely, and 2) controlling for the catastrophic forgetting of old tasks (general basecalling) by forcing student models to produce similar outputs with teacher models. We therefore introduced Connectionist Temporal Classification (CTC)49 and Response-Based Knowledge Distillation (RBKD)50 loss terms to reach such goals, respectively. We further balanced such contradicting terms as the final optimization goal for the basecaller IL-adaptation.We denoted \(X\) as a signal chunk and \(Y\) as the corresponding nucleotide sequence. The student model transforms \(X\) as CTC matrix \(U,\) where \({u}_{m}^{k}\) indicates the \(U\) value at position \(k\in \left[1,{K}\right]\) and alphabet \(m\in \left[1,{M}\right]\). We summarized all paths traversing \(U\) that can be decoded as \(Y\) into the valid CTC path set \(C\). We further wrote the CTC loss as:$${L}_{{CTC}} \,=\, – {log} \left({\sum }_{c\in C} \quad {\prod }_{\left\{m,k\right\}\in c} \quad {u}_{m}^{k}\right)$$
(1)
We further denoted the counterpart of \(U\) in the teacher model as \(V\), and wrote the RBKD loss as:$${L}_{{RBKD}}=- {\sum }_{k=1}^{K} \quad {\sum }_{m=1}^{M} \quad {p}_{m}^{k} {log} \left({q}_{m}^{k}\right)$$
(2)
where \({p}_{m}^{k}={\left({v}_{m}^{k}\right)}^{\frac{1}{T}}/{\sum }_{m=1}^{M} \, {\left({v}_{m}^{k}\right)}^{\frac{1}{T}}\) and \({q}_{m}^{k}={\left({u}_{m}^{k}\right)}^{\frac{1}{T}}/{\sum }_{m=1}^{M} \, {\left({u}_{m}^{k}\right)}^{\frac{1}{T}}\), and \(T\) is the temperature parameter used to scale probability ratios.We balanced \({L}_{{CTC}}\) and \({L}_{{RBKD}}\) using hyperparameter \(\lambda\), and wrote the final loss as:$$L={L}_{{CTC}}+\lambda {L}_{{RBKD}}$$
(3)
With such an optimization goal, we fine-tuned ONT taiyaki DNA and RNA basecallers, with training data prepared through iterative labeling, and using the AdamW optimizer51. For all IL analyses, we set \(\lambda\), learning rate and epoch as 10, 1e-5, and 1e3, respectively.Anomaly detectionWe further leveraged AD to distinguish between canonical and modification-induced signals. Specifically, we trained one CTC network (template mLstm_flipflop.py for DNA, template mGru_flipflop.py for RNA) per nucleotide isoform, e.g., C, 5mC, 5hmC for DNA cytosine, as its AD model. During AD model training, we first used the “ref_to_signal” entry in the hdf5 file (in which taiyaki stores prepared training data) to retrieve the first signal point that corresponds to the candidate nucleotide for AD. We then took the upstream \(n\) and downstream \(n+m\) signal points, based on which further retrieved the underlying sequence. We minimized the signal-sequence CTC loss using the AdamW optimizer51. We set learning rate and epoch as 5e-5 and 5e3 for all AD analyses, respectively, except for the HEK293 m1A AD model, whose epoch was set as 2e3 for a better detection performance. We set \(n\) and \(m\) as 10 and 20 for DNA, respectively, and 45 and 60 for RNA, respectively.When executing AD models to detect the modification status of a candidate nucleotide, we first retrieved the \(2n+m\) signal chunk as above-mentioned as the model input, then calculated the modification likelihood as:$$l={L}_{{canonical}}/{L}_{{modified}}$$
(4)
where \({L}_{{canonical}}\) and \({L}_{{modification}}\) denote CTC loss of the canonical and modification AD models, respectively. We further converted the \(l\) value as an Ml-tag with following rules:$$l \, \le \, 1,\, {Ml} \,=\, 0{{{\rm{;}}}} \, 1 \, < \, l \, < \, 3\,,\, {Ml} \,=\, {round}((l-1) \,*\, 128){{{\rm{;}}}}\, l \, \ge \, 3,\, {Ml} \,=\, 255$$
(5)
and wrote the result into the sam/bam file with pysam52.Analyze synthesized oligos with IL-ADFor the DNA scenario, we first trained a general IL-basecaller by combining 5mC, 5hmC, and U oligos. We randomly sampled 80,000 sequencing reads for each oligo group as training data, and made ground-truth labels via iterative labeling. With the same training data, we further trained C, 5mC, and 5hmC AD models to analyze C-modification status in unmodified, 5mC, and 5hmC oligos, and T and U AD models to analyze T-modification status in unmodified and U oligos. We then assessed the trained IL-AD workflow with an independent ensemble of unmodified, 5mC, 5hmC, and U oligos. We randomly sampled 50,000 sequencing reads and performed iterative labeling as test data. During the test phase, we first performed IL-basecalling and alignment to determine mappabilities and “error signatures”. Specifically, default parameters for the guppy-embedded minimap2 aligner47 (minimap2 −ax map−ont −k5 -w 10) were used, against DNA oligo reference sequences (Table S1). For mappable reads, we further performed AD to determine Ml-tags.For the analysis of RNA unmodified, m1A, m6A, and 5mC oligos, we followed the same process as in DNA, except 100,000 sequencing reads per oligo group were used for IL-AD training, and RNA oligo reference sequences (Table S2) were used for alignment.Basecall yeast native tRNAs, artificially-methylated E.coli genomic DNAs, and human native KRT19 DNAsWe trained the RNA IL-basecaller with native yeast tRNA nanopore sequencing dataset 7_NanotRNAseq_WTyeast_rep1. Specifically, we determined sequence backbones with iterative labeling, based on which then performed IL. To ensure a balanced training data representation, we randomly sampled 500 sequencing reads per tRNA type. For tRNA species with fewer than 500 reads, we used all available reads. With the trained model, we basecalled an independent biological replicate 11_NanotRNAseq_WTyeast_rep2. We then performed the minimap247 (version 2.24-r1122) alignment with -ax map-ont -k5 -w5 flags against yeast tRNA reference sequences following27.For the CpG and GpC-methylated E.coli genomic DNA analysis, we randomly sampled 80,000 sequencing reads for iterative labeling, based on which then performed IL. We randomly sampled an independent set of 80,000 sequencing reads as test data. The basecalling-alignment process was performed using guppy with the trained IL-model. Specifically, default parameters for the guppy-embedded minimap2 aligner47 (minimap2 −ax map−ont −k5 -w 10) were used, against the U00096.3 E.coli genome.For the human NA12878 cell line KRT19 native DNA analysis, we used the 407 reads yielded by the MinION sequencing for iterative labeling, based on which then performed IL. We used the 30 reads yielded by the Flongle sequencing as test data. The basecalling-alignment process was performed using guppy with the trained IL-model. Specifically, default parameters for the guppy-embedded minimap2 aligner47 (minimap2 −ax map−ont −k5 -w 10) were used, against the T2T-CHM13v2.0 human genome.Detect m6A in yeast, human, and mouse mRNA transcriptsFor the m6A detection in yeast, we first summarized a total of 1308 ime4-dependent ground-truth m6A sites from2. Focused on these sites, we subsequently trained A and m6A AD models using native mRNA nanopore sequencing profiles of wild type and ime4∆ samples2, respectively. During model training, we first used guppy version 6.0.6+8a98bbc and the model template_rna_r9.4.1_70bps_hac.jsn for basecalling. We next performed the minimap247 (version 2.24-r1122) alignment using -ax splice -uf k14 flags against the yeast reference genome provided in ref. 2. We then extended the taiyaki script get_refs_from_sam.py to pinpoint known m6A sites among spliced mRNA reads, further preparing training data hdf5 files for both wild type (WT1_RNAAA023484) and knock-out (KO1_RNAAA024588) samples using prepare_mapped_reads.py paired with the r941_rna_minion.checkpoint. With wild type and knock-out hdf5 files, we trained m6A and A AD models, respectively. We evaluated such models on a biologically independent sample pair (wild type sample WT2_RNAAB056712 and knock-out sample KO2_RNAAB058843).For the m6A detection in the human HEK293 cell line, we first surveyed m6ACE-Seq profiles and identified a total of 15,073 METTL3-dependent ground-truth m6A sites38. Focused on these sites, we subsequently trained A and m6A AD models using native mRNA nanopore sequencing profiles of wild type and METTL3 knock-out samples37, respectively. During model training, we first used guppy version 6.0.6 + 8a98bbc and the model template_rna_r9.4.1_70bps_hac.jsn for basecalling. We next performed the minimap247 (version 2.24-r1122) alignment using -ax splice -uf k14 flags against the GRCh38 reference genome. We then extended the taiyaki script get_refs_from_sam.py to pinpoint known m6A sites among spliced mRNA reads, further preparing training data hdf5 files for both wild type (HEK293T-WT-rep1) and knock-out (HEK293T-Mettl3-KO..1) samples using prepare_mapped_reads.py paired with the r941_rna_minion.checkpoint. With wild type and knock-out hdf5 files, we trained m6A and A AD models, respectively. We evaluated such models on a biologically independent sample pair (wild type sample HEK293T-WT-rep2 and knock-out sample HEK293T-Mettl3-KO..2).We used the above human HEK293 cell line AD models for the m6A detection in mouse embryonic stem cells (mESCs). We executed these models on wild type (mESC_Mettl3_WT) and knock-out (mESC_Mettl3_KO-ref) native mRNA nanopore sequencing profiles from7. We evaluated the detection performance based on a total of 30,519 METTL3-dependent ground-truth m6A sites which were summarized in ref. 40.Simultaneously detect m1A and m6A in human mRNA transcriptsWe followed the above-described AD model training pipeline in this section. Specifically, we used the same HEK293 nanopore sequencing collections and m6A annotations. We considered sites revealed in ref. 41 as m1A ground-truth. Without losing generality, we took ATAD3B transcript isoform NM_031921.4 and GTF3C2 transcript isoform NM_001521.3 as examples to demonstrate the recapitulation of m1A-m6A co-occurrence. We further performed a transcriptome-wide screening for putative m1A and m6A sites.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles