High-quality Chromosomal-Level Genome Assembly of the Wasabi (Eutrema japonicum) ‘Magic’

Plant materialsPlant materials for this study were obtained from Heung Agricultural Corporation in Pyeongchang, Republic of Korea (37.456272 N, 128.433811E). The cultivar selected for genome assembly was E. japonicum ‘Magic’. Fresh leaf samples were used for DNA and RNA extraction.Whole-genome sequencingWhole-genome sequencing was performed using PacBio, Nanopore, and Illumina platforms (Table 1). High molecular weight genomic DNA (HMW DNA) for long-read sequencing was extracted using a modified EDTA method8,9. The extracted DNA, with an approximate fragment size of 47 kb, was used to prepare a library following the manufacturer’s instructions with the SMRTbell Prep Kit 3.0 (PacBio, Menlo Park, CA, USA). The library was then sequenced on the PacBio Revio System (PacBio, Menlo Park, CA, USA), generating 44.64 Gbp of HiFi reads. Additional long-read and short-read sequencing was performed to further polish the genome assembly. Extracted HMW DNA was used for Nanopore MinION sequencing (Nanopore, Oxford, United Kingdom). Libraries were prepared using the ONT Ligation Sequencing Kit (SQK-LSK110) according to the recommended protocol, generating 16.78 Gbp of reads. Genomic DNA for the Illumina sequencing was extracted using the SmartGene Plant DNA Extraction Kit (SMART GENE, Daejeon, Republic of Korea). Following the manufacturer’s instructions, a paired-end library with a 550 bp insert size was constructed using the TruSeq DNA Nano kit. This library was then sequenced on the Novaseq6000 system (Illumina, San Diego, CA, USA), ultimately yielding 72.48 Gbp of paired-end reads from Illumina WGS.Table 1 Summary of the raw sequencing data used for the assembly of E. japonicum, detailing the library types, platforms used, total read length, sequencing coverage, and N50 length.Pore-C sequencingThe RE-Pore-C library was carried out as described in the Oxford Nanopore Technologies RE-Pore-C protocol for plant samples (March 2022 version 3). Formaldehyde crosslinked plant nuclei were isolated prior to in situ restriction digestion with NlaIII (New England Biolabs, Ipswich, MA, USA). After overnight incubation, the restriction enzyme was heat-denatured; crosslinked DNA clusters were ligated in proximity, followed by protein degradation and de-crosslinking, releasing the chimeric Pore-C dsDNA polymers. Libraries were constructed using the ONT Ligation Sequencing Kit (SQK-LSK110), following the library preparation recommendations. Sequencing was carried out on a R.9.4.1 PromethION flowcell. Finally, 52.08 Gbp of Pore-C reads were generated. Pore-C Fast5 raw data generated by PromethION were processed with the Guppy software (v6.4.6) to produce fastq files with a Phred quality score of at least 7. These filtered fastq files were then analyzed using NanoPlot (v1.40.0) for statistical data assessment.RNA sequencingTotal RNA was extracted from leaf tissues with Trizol reagent (Invitrogen, Waltham, MA, USA). Libraries for sequencing were generated from the total RNA with the NEBNext Ultra II Directional RNA-Seq Kit (New England Biolabs, Inc., Hitchin, UK). Ribosomal RNA removal was achieved with the RIBO COP rRNA Depletion Kit (LEXOGEN, Inc., Vienna, Austria). The RNA was then purified, converted to cDNA, and sheared. Indexing utilized Illumina indices 1–12, followed by an enrichment process using PCR. The library quality was then verified using an Agilent 2100 Bioanalyzer with a DNA High Sensitivity Kit to check the average fragment size. Library concentrations were quantified using a StepOne Real-Time PCR System (Life Technologies, Inc., Carlsbad, CA, USA) with a specific library quantification kit. Finally, high-throughput sequencing was conducted as 100 bp paired-end reads on the HiSeq X10 and Novaseq6000 systems (Illumina, Inc., San Diego, CA, USA). 5.23 Gbp of reads were generated in the final output.Genome size estimationThe canonical k-mers were calculated using 31-kmers from the Illumina paired-end reads with the Jellyfish software version 2.3.110. Subsequently, GenomeScope 2.0 was utilized with the options (-k 31 -p 4) to assess the haploid genome size and ploidy11. The results revealed a haploid genome size of 356,142,501 bp. The frequency of ‘aabb’ at 4.33% was higher than ‘aaab’ at 0.001%, indicating an allotetraploidy (Fig. 1).Fig. 1The genome profile generated by GenomeScope 2.0 provides a detailed analysis of the E. japonicum genome. The inferred length of the haploid genome (‘len’) is 356,142,501 bp. The proportion of the genome sequence that is non-repetitive (‘uniq’) is 49.3%. Different forms of nucleotide heterozygosity are represented by ‘aaaa’ (95.4%), ‘aaab’ (0.001%), ‘aabb’ (4.33%), ‘aabc’ (0.001%), and ‘abcd’ (0.229%). The mean k-mer coverage for heterozygous bases (‘kcov’) is 28.8, with a read error rate (‘err’) of 0.394% and an average read duplication rate (‘dup’) of 1.28. The k-mer length used (‘k’) is 31, and the inferred ploidy (‘p’) is 4. The graph shows the observed coverage frequency and the modeled distribution of unique sequences, errors, and k-mer peaks.Genome assemblyWith the PacBio HiFi long-reads, a de novo assembly of E. japonicum ‘Magic’ was performed with Hifiasm (v. 0.19.5-r592) using the default parameters12. The initial assembly consisted of 2,626 contigs, with 61 contigs exceeding 5 Mb in length, totaling approximately 1.3 Gbp. The initial assembly data with filtered Pore-C fastq files were processed using Pore-C Snakemake (v0.4.0) to generate the mnd file. The generated mnd file and initial assembly were then inputted into the 3d-dna (v190716) pipeline with specific parameters (run-asm-pipeline.sh -i 1000000–rounds 0–editor-coarse-resolution 2500000–editor-coarse-region 12500000–editor-coarse-stringency 1–polisher-coarse-resolution 2500000–polisher-coarse-region 15000000–polisher-coarse-stringency 1–splitter-coarse-resolution 2500000–splitter-coarse-region 15000000–splitter-coarse-stringency 1)13. The resulting file was uploaded to JuiceBox (v1.11.08) to generate a pairwise contact heatmap, and manual curation was performed three times to produce a curated assembly file14 (Fig. 2). This curated assembly was then polished using Nextpolish2, utilizing both the short and long read data (Figs. 3, 4). After manual curation and polishing, 388.8 Mbp were removed, resulting in 914 Mbp remaining in the final assembly.Fig. 2The Pore-C heatmap shows the interaction between the 14 pseudo-chromosomes of E. japonicum. Darker red cells indicate stronger and more frequent interactions, signifying that the sequences are spatially closer within the nucleus. Blue rectangles outline the pseudo-chromosomes, while green rectangles represent scaffolds. This visualization helps in understanding the three-dimensional organization of the genome by highlighting regions that interact more frequently.Fig. 3The circular diagram provides a detailed visualization of the genomic features of E. japonicum, using a window size of 100,000 bp, with the rings arranged from outer to inner as follows: (A) Karyotypes, (B) Enriched subgenome, (C) Normalized proportion of each subgenome, (D) Density of SG1-specific kmers, (E) Density of SG2-specific kmers, (F) Density of LTR-RTs, (G) Homologous blocks. The yellow areas represent Sub-genome A, and the blue areas represent Sub-genome B.Fig. 4A dot plot representing the chromosomes of E. japonicum. In this dot plot, both the vertical and horizontal axes denote the chromosomes of E. japonicum. Each axis includes chromosome identifiers from two sets, ranging from A01 to A07 and B01 to B07. Each dot corresponds to regions of homologous sequences between the chromosomes on the X-axis and Y-axis. Red dots highlight the best number of homologous genes between the chromosome pairs, indicating regions where notable homology exists. The diagonal line running from the top left to the bottom right indicates self-alignments, where a chromosome is compared to itself, naturally showing a perfect match.Subgenome assignmentSubPhaser (v1.2) was used to classify the subgenomes of E. japonicum ‘Magic’ based on repetitive k-mers, homologous chromosomes (Fig. 5), and long terminal repeat retrotransposons (LTR-RTs) insertion times (Fig. 6)15. Using the final assembly, k-mers were initially calculated with Jellyfish, and differential k-mers between sets of homologous chromosomes (k = 13) were identified. Subgenomes were clustered with the K-Means algorithm, and confidence levels were estimated through bootstrapping. The success of the subgenome classification was assessed with hierarchical clustering and Principal Component Analysis (PCA). Additionally, LTR-RTs were identified with LTRharvest and subsequently classified by TEsorter16. Enrichment of k-mers unique to each subgenome was tested to identify subgenome-specific LTR-RTs. The insertion ages of these subgenome-specific LTR-RTs were estimated to discern differences among the subgenomes. The nucleotide divergence of two LTR sequences from each subgenome-specific LTR-RTs is estimated using the Jukes-Cantor model. The insertion time (T) is determined by the formula T = K/2r, where r is 1.3 × 10−8 substitutions per site per year, and K represents the divergence between the two LTR sequences. The insertion times for subgenome-specific LTR-RTs are summarized and graphically represented (Figs. 6, 7).Fig. 5Distinguishment of Subgenomes Based on Subgenome-Specific k-mers (k = 15). (a) Unsupervised hierarchical clustering is depicted where a horizontal color bar above the heatmap indicates the specific subgenome association of each k-mer. A vertical color bar to the left shows the subgenome assignment of chromosomes. This heatmap displays the Z-scaled relative abundance of the k-mers, with higher Z scores representing greater abundance. (b) PCA of differential 15-mers confirms that the genome has been effectively separated into two subgenomes, demonstrating distinct clusters based on the differential k-mers and homoeologous chromosomes.Fig. 6The time estimates for the insertion of LTR-RTs specific to the different subgenomes of E. japonicum. The figure shows the density of LTR insertion ages in million years, with two distinct curves representing different subgenomes, labeled as SG1 (orange) and SG2 (blue). The X-axis represents the LTR insertion age in million years, providing a timeline for the LTR-RT insertion events. The Y-axis represents the density of these insertion events, indicating how frequently LTR insertions occurred at different times. The summary in the top right corner provides the median insertion ages along with the 95% confidence intervals (CI) for both subgenomes. For SG1, the median insertion age is 1.299 million years with a 95% CI ranging from 0.000 to 4.901 million years. The two peak values of SG1 are 0.094 and 1.504, respectively. For SG2, the median insertion age is 2.089 million years with a 95% CI ranging from 0.197 to 6.594 million years.Fig. 7These circular phylogenetic trees show the analysis of up to 1,000 Gypsy (a) and Copia (b) LTR-RTs from different subgenomes, with branches colored to distinguish SG1 (blue) and SG2 (yellow) and terminal nodes color-coded by clades identified using TEsorter: (a) Athila (red), CRM (orange), Reina (green), Retand (purple), Tekay (pink); (b) Ale (red), Alesia (orange), Bianca (green), Ikeros (cyan), Ivana (blue), SIRE (pink), TAR (purple), Tork (brown).Repeat analysisThe repeat library for the wasabi genome was constructed de novo using RepeatModeler v2.0.2, which incorporates the repeat-finding programs RECON and RepeatScout17. In addition to the default settings, the “-LTRStruct” parameter was used to enhance the detection of LTRs. The resulting repeat library was subsequently imported into RepeatMasker v4.1.2 for the identification and classification of repetitive elements (Table 2)18.Table 2 Statistics of repeat elements in the E. japonicum genome are presented.Genome annotationGene prediction was performed with the Braker3 (v3.0.8) pipeline, which integrates ab initio prediction, homology-based prediction, and transcriptome-based prediction methods19. To conduct homology-based prediction, protein sequences from the OrthoDB viridiplantae database were collected and used to guide GeneMark-ETP. For transcriptome-based prediction, RNA-Seq data from E. japonicum were mapped to the completed assembly with HISAT220. The resulting BAM files were used for transcriptome-based prediction. Finally, Ab initio predictions were performed with AUGUSTUS. The gene prediction results from the three methods were then integrated with TSEBRA to produce the final gene set. The final predicted gene set consists of 47,617 genes.

Hot Topics

Related Articles