T2T genome assemblies of Fallopia multiflora (Heshouwu) and F. multiflora var. angulata

Sample collection, library construction and genome sequencing

Samples of F. multiflora (AYY) and F. multiflora var. angulata (CYY) were collected from the South China Botanical Garden, Chinese Academy of Sciences, Guangzhou City, Guangdong Province, P. R. China. Tissues, including leaves, stems, flowers, and roots, were immediately frozen in liquid nitrogen and stored at −80 °C. Genomic DNA for genome sequencing was extracted from young leaves using a modified CTAB method [11].

A standard SMRTbell library was constructed following the manufacturer’s instructions for the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA) and sequenced on the PacBio Sequel II platform. Sequencing generated 3,967,375/2,667,947 HiFi reads totaling 68.35/46.82 gigabases (Gb), corresponding to ~48.82/42.57× coverage of the estimated genome size for AYY/CYY (Table 1). The N50 length of the HiFi reads was 17.23/17.55 kb for AYY/CYY. Hi-C libraries were constructed from cross-linked genomic DNA [12] and sequenced on the Illumina NovaSeq platform (Illumina, San Diego, CA, USA). A total of 155.01/162.37 Gb of Hi-C data with ~110.72/147.61× coverage was obtained for AYY/CYY. Additionally, a next-generation sequencing (NGS) library was constructed and sequenced on the Illumina NovaSeq platform, generating 65.62/55.34 Gb of NGS data with ~46.87/39.53× coverage for AYY/CYY (Table 1).

Table 1. Summary of the high-throughput sequencing data used in this study.

For transcriptome sequencing, total RNA was extracted from root, leaf, stem, and flower tissues. Transcriptome libraries were constructed according to the instructions of the NEBNext® Ultra™ II Directional RNA Library Prep Kit (New England Biolabs, MA, USA) and sequenced on the Illumina NovaSeq platform, yielding a total of 66.01/48.07 Gb of high-quality paired-end reads for AYY/CYY (Table S1). All genomic and transcriptomic sequencing was performed by Novogene Co., Ltd. (Beijing, China).

Estimation of genome size and heterozygosity

Basic genome features, including genome size, heterozygosity, and the proportion of repetitive sequences, were estimated by k-mer counting. Illumina paired-end reads were used to count 21-bp k-mers with Jellyfish v1.1.11 [13]. GenomeScope2 was then used to infer the genome features from the k-mer frequency distribution [14]. The estimated genome sizes of AYY and CYY were 1,192.38 Mb and 1,065.60 Mb, with heterozygosity of 1.14% and 0.79%, respectively (Fig. 1). Notably, the heterozygosity of both AYY and CYY was higher than the value (0.64%) reported in a previous study of AYY [8]. This difference may be attributable to the distinct genotypes of the samples used for genome sequencing.

Fig. 1. Genome survey of Fallopia multiflora (AYY) and Fallopia multiflora var. angulata (CYY) based on k-mer analysis.
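
As an illustration of the k-mer approach described above, the sketch below estimates genome size from a k-mer depth histogram by dividing the total number of k-mers by the depth of the main peak. This is a deliberately simplified stand-in for the mixture model fitted by GenomeScope2, not the procedure used in this study. The function names, the `min_depth` error cutoff, and the toy histogram are assumptions made for the example, and the input is assumed to be a two-column text histogram (depth, number of distinct k-mers) such as the one written by `jellyfish histo`.

```python
def parse_histogram(text):
    """Parse 'depth count' lines into a dict {depth: number of distinct k-mers}."""
    histogram = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated size in bp, peak depth).

    min_depth is a hypothetical cutoff: k-mers seen fewer times than this
    are treated as sequencing errors and ignored.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers, taken here as the main peak.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (made-up numbers): an error tail at depths 1-2 and a peak at 20.
TOY_HISTOGRAM = """\
1 1000000
2 50000
18 100
19 300
20 500
21 300
22 100
"""

size, peak = estimate_genome_size(parse_histogram(TOY_HISTOGRAM))
print(f"peak depth: {peak}, estimated size: {size:.0f} bp")
```

In a heterozygous genome the tallest peak can be the heterozygous one, at roughly half the homozygous depth, which would inflate this simple estimate; model-based tools are designed to handle that case.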
De novo genome assembly

NextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo) and hifiasm v0.16.1-r375 [15] were each used to assemble the AYY and CYY genomes independently from the HiFi reads. The hifiasm assembly was retained for scaffolding, and the contigs generated by NextDenovo were used primarily for gap closing (Table 2). As shown in Table 2, assembly yielded a total of 367/197 contigs for AYY/CYY with N50 values of 112.58 Mb/94.83 Mb, and in the hifiasm assembly only 10 contigs accounted for 75% of the genome size. Notably, the contig N50 obtained in this study was far higher than the recently reported contig N50 of 1.68 Mb [8].

Table 2. Quality assessments of preliminary assemblies.

Subsequently, adapters and low-quality sequences were removed from the Hi-C paired-end reads using fastp v0.12.4 [16] with default parameters. Invalid reads were filtered out using HiCUP v0.8.3 [17], and only valid reads were retained for further analysis. The high-quality Hi-C reads were then used as input for the 3D-DNA pipeline [18] to scaffold the contigs. The Hi-C interaction matrix was visualized and validated using HiCExplorer v3.7.2 [19] (Supplementary Figures S1 and S2). The contigs were scaffolded and anchored onto 11 chromosomes (Chr) with anchoring ratios of 98.26% and 98.94% for AYY and CYY, respectively (Fig. 2 and Table 3). SeqKit [20] was used to detect the telomeric tandem repeats (3’-TTTAGGG/5’-CCCTAAA) in the chromosome-level assemblies of AYY and CYY. A total of 21 and 19 telomeres were identified at the chromosome ends of the AYY and CYY assemblies, respectively (Fig. 3 and Table 3). These results demonstrate that the T2T assemblies obtained for both AYY and CYY are more complete than the previous chromosome-level assembly of AYY [8]. However, 13 and 7 gaps remained in the AYY and CYY assemblies, respectively.

Fig. 2. Circos plot of Fallopia multiflora (AYY) and Fallopia multiflora var. angulata (CYY). T1, chromosomes with contig components; red arcs represent telomeres and triangles show the closed gaps. T2, density of genes with a window size of 1.0 Mb. T3, density of repeat sequences with a window size of 1.0 Mb. T4, density of Gypsy-LTRs with a window size of 1.0 Mb. T5, density of Copia-LTRs with a window size of 1.0 Mb. T6, sequence synteny between the Fallopia multiflora and Fallopia multiflora var. angulata genomes.

Table 3. Assessment of AYY and CYY genome assemblies.

Fig. 3. High-throughput reads mapping to the pseudo-chromosomes of the AYY and CYY assemblies. Coverage of HiFi reads (the lowest layer), coverage of Illumina reads (the upper layer), locations of telomeres (red vertical bars at the ends of chromosomes), and contig components (long horizontal boxes in the middle layer).

To fill the gaps, all contigs and HiFi reads were mapped onto the assembly using Winnowmap2 v2.03 [21]. Sequences and reads spanning the gaps were identified visually in IGV [22] and used to fill the gaps manually. Four and two gaps were successfully closed in the AYY and CYY assemblies, respectively (Fig. 2 and Table 3). The improved genome assemblies were further polished with Illumina short reads using Pilon [23], achieving a final accuracy of 99.99999% for both the AYY and CYY assemblies. Finally, high-quality T2T genomes were assembled for AYY and CYY with genome sizes of 1,458.37 Mb and 1,174.38 Mb, interrupted by only 9 and 5 gaps, respectively.

Genome annotation

RepeatMasker v4.1.2-p1 [24] was used to mask the genome and identify transposable elements (TEs). Repbase v20181026 [25] and a de novo repeat database constructed with RepeatModeler2 v2.0.4 [26] were used as references. Repeat sequences accounted for 1,009.99 Mb (70.48%) and 782.69 Mb (67.36%) of the AYY and CYY genomes, respectively, consistent with the previously reported value of 67.69% in AYY [8]. Transposable elements were identified using EDTA v2.1.0 [27].
Retrotransposons, particularly LTR-RT elements, made up the majority of the TEs, accounting for 61.59%/58.74% of the AYY/CYY genome assemblies (Table 4). Among them, Ty1/Copia and DIRS1/Gypsy were the main LTR elements, present in unequal proportions in the AYY and CYY assemblies (Table 4). These differences may contribute to the distinct genomic features of the two assemblies and thereby drive structural variation and phytochemical differences in the biosynthesis of specialized metabolites. MISA (https://webblast.ipk-gatersleben.de/misa) was applied to identify simple sequence repeats (SSRs) across the whole genome. A total of 285,266 and 259,813 SSRs were identified in the AYY and CYY genomes, respectively. Subsequently, Infernal v1.1.4 [28] was used to search for non-coding RNAs (ncRNAs), including transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), and microRNAs (miRNAs), based on the Rfam database v14.8 [29]. In total, 23,826 and 35,337 ncRNAs were annotated in the AYY and CYY genomes, respectively, including 331 and 198 miRNAs (Table S2).

Table 4. Annotation of the transposable elements in AYY and CYY genome assemblies.

Protein-coding genes were predicted using de novo, homology-based, and transcriptome-based methods. Transcriptomes from various plant tissues (roots, stems, leaves, and flowers) were assembled using StringTie [30]. PASA v2.5.3 [31] was then used to annotate gene models based on these transcripts. In total, 63,740 and 58,615 transcripts were confirmed in the AYY and CYY genomes, respectively, using transcriptome data (Table S2). Protein sequences from the SwissProt database and the tartary buckwheat genome [32] were downloaded for homology-based prediction using MAKER3 v3.01.03 [33]. Additionally, AUGUSTUS [34] was used for de novo prediction. The gene models generated by these methods were integrated using EVidenceModeler [35]. Transcripts lacking a start or stop codon, or shorter than 150 bp, were filtered out.
In the end, a total of 84,768 and 69,100 high-confidence protein-coding genes were predicted in the AYY and CYY assemblies, respectively (Table S3), substantially more than the 35,575 genes reported in a recent study of AYY [8]. The differences in gene prediction may be attributed to variations in genome annotation parameters, the use of different plant samples for genome sequencing, and discrepancies in genome quality.

Subsequently, functional annotation of the protein-coding genes was performed using BLAST+ v2.12 [36] (E-value 1e−6) to compare the predicted proteins against the NCBI non-redundant protein database (nr) and the SwissProt database. Protein structural features were predicted using InterProScan v5.60-92.0 [37], which queries multiple databases including, but not limited to, Pfam, Phobius, SMART, TMHMM, PANTHER and FunFam. GO terms and KEGG pathways were assigned using eggNOG-mapper v2.1.5 [38]. In total, 54,270 and 44,371 genes were functionally annotated for AYY and CYY, respectively, accounting for approximately 64% of the predicted genes in each assembly.
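
The E-value filter applied in the BLAST-based annotation can be sketched as follows. The example assumes tabular BLAST+ output with the 12 default columns of `-outfmt 6` (query, subject, identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The source does not state which output format was used, and keeping only the highest-scoring hit per query is an illustrative choice rather than the authors' documented procedure. The function name, gene IDs, and sample rows are hypothetical.

```python
E_VALUE_CUTOFF = 1e-6  # threshold quoted in the text


def best_hits(lines, max_evalue=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for hits passing the cutoff.

    Assumes tab-separated rows in the default 12-column tabular layout,
    where column 11 is the E-value and column 12 the bit score.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a full tabular row
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit too weak to support an annotation
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up rows: geneA has two significant hits, geneB only a weak one.
SAMPLE_ROWS = [
    "geneA\tprot1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "geneA\tprot2\t80.0\t150\t30\t0\t1\t150\t5\t154\t1e-40\t160",
    "geneB\tprot3\t35.0\t60\t39\t0\t1\t60\t10\t69\t0.002\t32.1",
]

hits = best_hits(SAMPLE_ROWS)
for query, (subject, evalue, bitscore) in hits.items():
    print(query, subject, evalue, bitscore)
```

Genes with no hit passing the cutoff, such as the hypothetical geneB here, would remain unannotated by this step and could still receive annotations from the domain- or orthology-based tools.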
