RNA architecture of porcine deltacoronavirus genome inside virions detected by vRIC-seq

Cell culture and virions sample preparationLilly Laboratories cell porcine kidney 1 (LLC-PK1, ATCC CL-101) cells were cultured in Minimun Essential Medium (MEM; Invitrogen) supplemented with 10% fetal bovine serum and 1% penicillin streptomycin (Invitrogen). Cells were cultured at 37°C in an atmosphere of 5% CO2 and 95–99% humidity.LLC-PK1 cells were infected at MOI of 1.0 with PDCoV CHN-HN-2014 (GenBank accession: KT336560). Infections were performed in MEM supplemented with 7.5 μg/mL trypsinization. In the late stage of replication (10 hpi), infected cells were harvested by repeated freezing and thawing. Cell suspension removed cell fragments by 0.88 μm filter and centrifugation at 10,000 × g at 4°C. The virions were enriched by direct ultracentrifugation at 30,000 × g for 3 h at 4°C, and resuspended with ConA Binding buffer (20 mM HEPES-KOH pH 7.9, 10 mM KCl, 1 mM CaCl2, 1 mM MnCl2, 0.01% Tween 20). The integrity of virions were visualized by transmission electron microscopy.vRIC-seq library preparationThe library preparation was carried out in the same way as described in a previous study26. In brief, BioMag®Plus Concanavalin A beads (Polysciences, Inc. cat. 86057) were used to capture virions, and 500 μL purified virions were added to the 150 μL washed beads with ConA Binding buffer, and incubated for 10 min at room temperature. The beads were washed with ConA Binding buffer for three times and PBST buffer for once (1 × PBS, 0.1% Tween 20), and resuspended in 1 mL PBST buffer. The final concentration of 1% formaldehyde was used to fix the RNA-RNA interaction, and the glycine with final concentration 0.125 M was added to quench the formaldehyde.The virions captured by beads were permeated by 1 mL of permeabilization buffer (10 mM Tris-HCl pH 7.5, 10 mM NaCl, 0.5% NP-40, 0.3% Triton X-100, 0.1% Tween 20, 1 × protease inhibitors (Sigma Aldrich), 2 U/ml SUPERase•In™ RNase inhibitor (Thermo Fisher Scientific)). After washing three times with 1 × PNK buffer (50 mM Tris-HCl pH 7.4, 10 mM MgCl2, 0.1% Tween 20), the virial RNA was fragmented by MN mixture (50 mM TrisHCl pH 8.0, 5 mM CaCl2, 0.03 U/μL micrococcal nuclease (Thermo Fisher Scientific)). After 10 min, the reaction was stopped by washing twice with 1 × PNK + EGTA buffer (50 mM Tris-HCl pH 7.4, 20 mM EGTA, 0.1% Tween 20) and twice with 1 × PNK buffer.The fragmented RNA was treated with FastAP alkaline phosphatase (Thermo Fisher Scientific). The reaction was stopped by washing with 1 × PNK + EGTA buffer twice, and 1 × high-salt wash buffer (5 × PBS, 0.1% Tween 20) twice, and three times in 1 × PNK buffer. The pCp-biotin was ligated to the 5’-end fragmented RNA by T4 RNA ligase (Thermo Fisher Scientific). After washing three times with 1 × PNK buffer, the RNA was treated with T4 polynucleotide kinase (Thermo Fisher Scientific), and the reaction was stopped by washing with 1 × PNK + EGTA buffer twice and 1 × PNK buffer twice. Subsequently, T4 RNA ligase was used proximity ligation in siut. After completing the proximity-ligated, the RNA was extracted by proteinase K, Trizol and chloroform according to the manufacturer’s instruction.For strand-specific library construction, the RNA was fragmented used 5 × First-strand buffer (Thermo Fisher Scientific) and incubated at 94°C for 5 min. Subsequently, the preblocked MyOne Streptavidin C1 beads (Thermo Fisher Scientific) by yeast tRNA was used to enrich pCp-biotin-marked RNA fragments according to the manufacturer’s instruction. To synthesize the first-strand cDNA, firstly, RNA was incubated at 65°C for 5 min, put it on the ice for 2 min immediately. Reverse transcription was performed by Superscript II reverse transcriptase (Thermo Fisher Scientific) and incubated at 25 °C for 10 min, 42 °C for 40 min and then 70 °C for 15 min. Then, the dUTP second-strand cDNA was synthesized with Escherichia coli DNA polymerase I (Enzymatics) and the reaction mixture (10 μL of 5 × second-strand buffer (Thermo Fisher), 0.8 μL of 25 mM dNTPs with 80% of the dTTP replaced by dUTP, 20.5 μL of RNase-free water, 0.2 μl of 5 U/μL RNase H (Thermo Fisher)) and incubation at 16 °C for 2 h. The double-stranded (ds) DNA products were purified using 1.8 × AMPure beads (Beckman Coulter) following the manufacturer’s protocol. This was followed by conventional end-repair and purified again by 1.8 × AMPure beads. Of 2 pmol Illumina Y-shaped adaptors were ligated to the dsDNA using Quick T4 DNA Ligase (Enzymatics) after dA tailing, and purified twice with AMPure beads. An emulsion PCR followed as a second amplification step.The adaptor-ligated cDNA was mixed with 1 μL of 10 μM Illumina PE1.0, 1 μL of 10 μM index primer, 1 μL of 50 mM MgSO4, 2.5 μL of 10 × Pfx buffer, 0.4 μL of 25 mM dNTPs, 0.4 μL of 2.5 U/μL Platinum Pfx DNA polymerase (Thermo Fisher) and 3 μL of USER enzyme (NEB). The mixture was incubated at 37 °C for 15 min to allow the USER enzyme to digest the dUTP strand. Then at 94 °C for 2 min, the following PCR program run: 11 to 13 cycles of 94 °C for 15 s, 62 °C for 30 s and 72 °C for 30 s. The 200–450 bp PCR products were purified from agarose gel by MinElute Gel Extraction Kit (Qiagen). The DNA concentration was measured by Qubit (Thermo Fisher) and sequenced using Illumina HiSeq X Ten.Data cleaningThe data cleaning process for vRIC-seq sequencing data included removing adapter sequences from reads, eliminating PCR duplicates, and filtering out low-complexity reads, which could be executed using fastp (v0.23.2)29. By default, fastp removed adapter sequences and set a criterion that allowed a read to have no more than 30% of its bases with quality scores below 20. This ensured the integrity and quality of the sequencing data. To eliminate PCR duplicates, we utilized the script provided by the developers of the vRIC-seq protocol19,26 (https://github.com/caochch/RICpipe/tree/master/step0.remove_PCR_duplicates/scripts). During each iteration, the script merged the sequences of read 1 and read 2 to form a complete read sequence. It then checked whether this combined sequence was already presented in the hash table. If the sequence was found in the table, the read pair was skipped. If not, the script wrote the sequence to the output file and added it to the hash table, marking it as encountered. This process ultimately yielded a cleaned FASTQ file.Sequence alignmentPreparation of reference genomes was required before sequence alignment, including the PDCoV genome (ID: KT336560.1), the complete porcine genome (RefSeq: GCF000003025.6), as well as the ribosomal RNA sequences derived from the pig genome (18s rRNA, 5.8s rRNA, 12s rRNA, and 16s rRNA sequences). Next, a new directory was created to hold the reference genome indexes for STAR aligner (v2.7.1a)30. The–genomeDir parameter was specified to reference the previously constructed STAR index, which was essential for the accurate mapping of the sequencing reads to the corresponding reference genome. The main output files from STAR were ‘Unmapped.out.mate’, ‘Aligned.sortedByCoord.out.bam’, ‘Chimeric.out.junction’, and ‘Log.final.out’. These files represented the reads that unmapped, the reads that aligned directly, the reads that aligned in a chimeric fashion, and the alignment statistics, respectively. We used samtools (v1.9)31 to extract and calculate the number of aligned reads and coverage from the BAM file for different reference gene IDs.To identify chimeric reads as effectively as possible, the following filtering steps were applied to the FASTQ files: Firstly, reads were removed if they were presented in both the star.chimeric.sam and star.align.sam files. Additionally, reads from star.align.sam had to satisfy the criteria that the FLAG value had to be less than 255, indicating a properly aligned read, and the difference between the read length and the aligned length should be fewer than 15 base pairs. Subsequently, the filtered FASTQ files were aligned to the viral genome using BWA (version 0.7.17)32, and chimeric reads were then extracted for the next analysis.Identification and visualization of valid chimeric readsThe aligned reads and chimeric reads extracted from the sequence alignment were both employed to identify valid chimeric reads. These valid chimeric reads correspond to RNA fragments with proximate spatial locations and chimeric junction sites. The process was carried out in the following steps: Initially, paired reads from the sets of reads derived from both direct and chimeric alignments were selected based on their IDs. Subsequently, the length of the insertion fragments and the number of fusion sites were considered to categorize different types of paired reads and determine their classification as valid chimeric reads. Additionally, the unique discontinuous transcription of the coronavirus genome results in the production of numerous subgenomic RNAs (sgRNAs)33,34,35. The distinction between these sgRNAs and the insertion fragments was whether the fusion site contains a cytosine. If these sgRNAs were not removed, they could significantly bias the signal in the downstream interaction matrix. After several filtering steps, the interaction sites from the valid chimeric reads were ultimately extracted and used to generate a interaction matrix with the assistance of Juicerbox (v2.20.00)36.To illustrate the algorithmic logic from paired reads to the interaction matrix, we would take an example involving two RNA fragments that were crosslinked together with a single fusion site:

1.

When the length of the inserted fragment was 150 nucleotides or shorter, it was assumed to be fully captured in the paired-end sequencing, with both ends of the fragment being read through. In such cases, chimeric reads that spanned the entire inserted fragment were deemed valid. Moreover, if the distance between the two ends of the fragment was sufficiently large (greater than 600 base pairs), the corresponding paired read 1 and read 2 were also preserved for further analysis. In this situation, a single inserted fragment might be recorded up to three times. To avoid redundancy, deduplication methods could be applied to remove these duplicate records.

2.

When the length of the inserted fragment was greater than 150 nt but less than 290 nt, the fragment was read through, and neither read 1 nor read 2 could individually cover the entire inserted fragment. If the paired reads partially overlap, concatenating read 1 and read 2 could achieve full coverage of the inserted fragment. In such case, the paired reads were consisted of one completely aligned read and one chimeric read. The chimeric read was recorded independently, and if the distance between the two ends of the inserted fragment was sufficiently long, the paired read 1 and read 2 would also be retained.

3.

When the length of the inserted fragment exceeds 290 bp, it was classified as a non-read-through segment. In such instances, the overlapping region between the paired reads was very short or entirely absent. If one of the reads included the chimeric site, it was similar to the second condition, where the chimeric read was recorded separately. If the distance between the two ends of the inserted fragment was adequately long, the paired reads were recorded once. If no reads spanned the chimeric site, the determination was made based exclusively on the positional relationships of the reads, and the paired reads that covered distant regions of the inserted fragment were the ones recorded.

4.

When there were multiple chimeric sites within a fragment, the results became more complex, but the algorithm logic remained the same. It still required a case-by-case analysis based on the length of the inserted fragment. Further elaboration on this matter was beyond the scope of this discussion.

After identifying the paired reads that represented the inserted fragments, the same process needed to be applied to analyze the RNA-seq data of PDCoV. RNA-seq data sets of PDCoV are available at NCBI database under Bioproject accession PRJNA690955. The chimeric sites of all sgRNAs were recorded, and the paired reads representing the sgRNAs of the coronavirus were removed. The remaining paired reads could be classified as valid chimeric reads, which would be used to construct an interaction matrix and facilitate data visualization.Secondary structure predictionTo predict the RNA secondary structure of PDCoV genome, the initial step involved partition of the most suitable structural domains based on the information provided by the interaction matrix. This process entailed an iterative examination and refinement of the domain boundaries by calculating ratios as well as densities from the interaction matrix until no additional segmentation was necessary. The process was as follows: Initially, the full genome RNA was treated as a single structural domain, and a suitable site needed to be identified to split this domain into two distinct subdomains. To identify the optimal division point, various sites were systematically evaluated, and the interaction frequency along with the interaction score (connection score) for the two resulting structural domains were computed. The domains were grouped together if the interaction score exceed 0.03, provided that the two ends were adjacent or overlapping. Ultimately, several optimal sites for domain division were identified across the full genome. It was important to note that to maintain long-distance interactions, domains shorter than a specified threshold length will not be subdivided any further.The formula for the connection score was as follows:$${\rm{Connection\; score}}=\frac{{\rm{pairwise}}({\rm{win\_a}})({\rm{win\_b}})}{\sqrt{{\rm{coverage}}({\rm{win\_a}})({\rm{win\_b}})}}$$Subsequently, within each structural domain, the most characteristic interaction sites were identified and selected. Following clustering, the base sequences that adhered to rules of complementary base pairing were selected as rigid constraints. These constraints were subsequently fed into RNAstructure (version v6.3)37 for secondary structure prediction. In the final step, the predicted secondary structure from all the individual structural domains were integrated to yield the secondary structure of the PDCoV whole genome. The complete vRIC-seq sequencing data analysis process was showed in Fig. 1.Fig. 1The workflow of vRIC-seq sequencing data analysis.

Hot Topics

Related Articles