Chromosome genome assembly and annotation of Adzuki Bean (Vigna angularis)

Sample collection and sequencing

Longxiaodou 4, an adzuki bean cultivar, is extensively cultivated in Heilongjiang Province, China. The plant is compact, upright, and resistant to lodging. Its seeds are oval with a red seed coat, and the 100-seed weight is about 20 g. In this study, Longxiaodou 4 (Fig. 1) was used for genome sequencing and assembly. Plants were grown under long-day conditions (16 h light, 8 h dark) at 24 °C in a controlled growth cabinet (Xunneng Instrument, Beijing, China). Genomic DNA was extracted from leaf tissue using the Qiagen DNA Purification Kit (Qiagen, Valencia, CA, USA).

Fig. 1 The appearance of Vigna angularis (cultivar Longxiaodou 4). (a) Plant morphology of Vigna angularis. (b) Seeds of Vigna angularis.

To obtain sufficient read data for genome assembly, both the PacBio Sequel II (Pacific Biosciences, California, USA) and the Illumina HiSeq 4000 (Illumina, California, USA) platforms were employed. Long reads from the PacBio platform were used for genome assembly, while the short, accurate reads from the Illumina platform were used for the genome survey and for base-level correction after assembly. For the PacBio platform, 20-kb genomic sequencing libraries were constructed according to the PacBio-suggested protocol, yielding 65.71 Gb of long sequencing reads. After adaptor removal, 65.58 Gb of subreads (149.34x coverage) were obtained, with a subread N50 of 22.16 kb and an average subread length of 13.90 kb (Table 2). In addition, the same DNA was used to construct a sequencing library with the Illumina TruSeq DNA Sample Prep Kit (Illumina, San Diego, CA, USA). Paired-end sequencing with a read length of 150 base pairs (bp) was conducted on an Illumina HiSeq 4000 system (Illumina, San Diego, CA, USA), generating a total of 98.10 Gb of short sequencing reads (222.95x coverage). Reads containing adaptors and those with quality scores below 20 were excluded.
Consequently, 96.03 Gb of high-quality reads were obtained for de novo genome assembly (Table 2).

Table 2 Sequencing data used for the Vigna angularis (cultivar Longxiaodou 4) genome.

The same individual used for genomic sequencing was also used for transcriptome sequencing to provide gene expression data for genome annotation. Because gene expression is tissue specific, RNA was extracted from root, stem, and leaf tissues. RNA extraction was performed using the RNAiso Pure RNA Isolation Kit (Takara, Japan), followed by DNase I treatment to remove DNA contamination. RNA quality was assessed with a NanoVue Plus spectrophotometer (GE Healthcare, NJ, USA). RNA-seq libraries were prepared and sequenced on the Illumina HiSeq 4000 platform, yielding a total of 7.89 Gb of transcriptome data. A summary of the genome and transcriptome sequencing data is presented in Table 2.
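The coverage figures quoted above follow from dividing the sequenced bases by a reference genome size. The text does not state which genome size was used as the denominator, so the minimal sketch below back-calculates it from the reported yields and coverages; the function names are illustrative and not part of the original pipeline.

```python
# Back-calculate the genome size implied by the reported yields and coverages.
# Yields (Gb) and coverages are taken from the text; the function names are
# hypothetical helpers introduced only for this illustration.

def implied_genome_size_mb(total_gb: float, coverage: float) -> float:
    """Genome size in Mb implied by a yield in Gb and a fold coverage."""
    return total_gb * 1000.0 / coverage


def fold_coverage(total_gb: float, genome_size_mb: float) -> float:
    """Fold coverage for a yield in Gb over a genome of the given size in Mb."""
    return total_gb * 1000.0 / genome_size_mb


pacbio_size = implied_genome_size_mb(65.58, 149.34)    # PacBio subreads
illumina_size = implied_genome_size_mb(98.10, 222.95)  # Illumina raw reads

print(f"PacBio-implied genome size:   {pacbio_size:.2f} Mb")
print(f"Illumina-implied genome size: {illumina_size:.2f} Mb")
```

Both data sets imply a denominator of roughly 439 to 440 Mb, which suggests that a single genome-size value was used for the two coverage figures.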
De novo assembly of the adzuki bean genome

Short next-generation sequencing (NGS) reads were used to estimate the genome size, heterozygosity, and repeat content of V. angularis before de novo genome assembly. Jellyfish (v2.1.3) [13] was used to count 21-mers, and the resulting counts were used to calculate the basic characteristics of the genome (Table 3). The genome size of V. angularis was estimated at 464.9 Mb, with a heterozygosity of 0.54% and a repeat content of 43.878%.

Table 3 Statistics of the 21-mer analysis of the Vigna angularis (cultivar Longxiaodou 4) genome.

The long reads from the PacBio Sequel II platform were used for contig assembly with Canu (v2.0) [14], with the correctedErrorRate parameter set to 0.045 and corOutCoverage set to 40. Approximately 150-fold coverage of the estimated genome size was generated after self-correction. The primary assembly was 495 Mb in size, with a contig N50 of 16.14 Mb. To correct the random errors introduced by PacBio sequencing, the assembly was polished with the long reads using Racon (v1.3.3) [15] and then further polished with the short reads using Pilon (v1.23) [16]. Purge_haplotigs (v1.0.4) [17] was used to remove heterozygous and redundant regions from the polished sequences. Ultimately, a high-quality genome of Vigna angularis was obtained, with a total size of 447.80 Mb, a contig N50 of 16.53 Mb, and a total of 47 contigs.

The completeness and accuracy of the assembled genome were then evaluated with multiple methods. BUSCO (Benchmarking Universal Single-Copy Orthologs, v3.0.0) [18] was used to assess completeness against the embryophyta_odb10 database: of a total of 1,614 single-copy orthologs, 95.42% were identified as complete and 0.68% as partial.
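The contig N50 values quoted for the primary and final assemblies follow the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly. The sketch below computes it from a list of contig lengths; the lengths used are toy values chosen for illustration, not the lengths of the 47 contigs in this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half
        # of the total assembly length or beyond.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for a non-empty list of lengths")


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

The same function applies unchanged to scaffold lengths, which is how a scaffold N50 would be obtained after chromosome anchoring.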
LTR_FINDER (v1.0.7) [19] and LTR_retriever (v2.7) [20] were used to search for LTR elements and to calculate the LTR Assembly Index (LAI) of the genome, which was 15.23. The short NGS reads and long PacBio reads were aligned to the genome. Of the short reads, 97.95% mapped to the genome, covering 99.98% of it; for the long reads, the mapping rate and coverage were 96.24% and 99.97%, respectively. Together, the BUSCO, LAI, and read-mapping results support the completeness and accuracy of the assembled genome.

Telomere sequences were identified based on the characteristic repeat sequences of telomere regions (signature sequences: CCCTAAA/TTTAGGG). The details are presented in Table 4.

Table 4 Statistical results of telomere identification.

Chromosome construction using the interaction information from Hi-C data

The Hi-C technique has proven effective for chromosome assembly and has been successfully used in numerous genome projects [21]. In this work, leaves from the same individual used for genome sequencing were used for Hi-C library construction and sequencing. Approximately 69.8 Gb of raw reads were generated on the Illumina platform and filtered before further analysis. The paired-end reads were mapped to the polished adzuki bean genome with BWA (v0.7.17) [22], and only uniquely mapped read pairs were retained. Juicer (v1.5.6) [23] was used to process the Hi-C reads and to quantify and normalize the interaction frequencies. 3D-DNA [24] was then applied to identify and correct errors in the initial assembly and to cluster and orient the contigs according to the Hi-C contact matrix. Consequently, 11 groups were successfully clustered and further ordered and oriented into chromosomes. Finally, 447.5 Mb of contigs were reliably anchored on chromosomes, accounting for 99.9% of the total genome.
The contig and scaffold N50 reached 16.5 and 40.0 Mb, respectively (Table 5), providing a high-quality chromosome-level genome assembly for adzuki bean.

Table 5 Assembly statistics of the Vigna angularis (cultivar Longxiaodou 4) genome.

Repetitive element annotation

Repetitive sequences in the adzuki bean genome were identified through a combination of ab initio and homology-based prediction approaches. For the ab initio repeat annotation, LTR_FINDER [15], RepeatScout [25], and RepeatModeler (http://repeatmasker.org/RepeatModeler/) were used to construct a de novo repetitive element database, and RepeatMasker [26] (http://repeatmasker.org/RMDownload.html) was used to annotate the repeat elements with this database. For the homology-based annotation, RepeatMasker and RepeatProteinMask were used to identify repeats at the DNA and protein levels by mapping to the Repbase database [27]. Tandem repeats were also annotated ab initio with Tandem Repeat Finder [28]. A total of 243.62 Mb of repeat sequences were identified, accounting for 54.50% of the genome (Table 6).

Table 6 Summary statistics of the repeat annotation of the Vigna angularis (cultivar Longxiaodou 4) genome.

Protein-coding gene prediction and functional annotation

A combined approach involving de novo, homology-based, and transcriptome-based prediction was used to predict protein-coding genes. The RNA-seq reads from root, stem, and leaf tissues were cleaned and mapped to the Vigna angularis genome using HISAT2 [29]. Subsequently, StringTie [30,31] was used to identify potential exon regions, and TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) was used to predict open reading frames (ORFs) from the transcript sequences. The homologous protein sequences of Abrus precatorius, Vigna radiata var. radiata, Vigna angularis (old version), Vigna unguiculata, and Arachis hypogaea were downloaded from NCBI and mapped to the adzuki bean genome using TBLASTN [32].
The BLAST hits were joined, and the coding sequences in the genomic region corresponding to each hit were predicted using Exonerate (https://github.com/nathanweeks/exonerate). The de novo gene structures were predicted using AUGUSTUS [33] and Genscan [34] on the repeat-masked genome sequence. A gene set of 25,939 high-quality protein-coding genes was obtained after integrating the gene structures from the ab initio, homology-based, and transcriptome-based predictions with MAKER [35] (Table 7). Gene annotation completeness was assessed using embryophyta BUSCOs [18], which indicated 98.2% completeness (Table 8). The distribution of gene element lengths was compared with that of the homologous species listed above (Fig. 2).

Table 7 Statistics of the gene models of the protein-coding genes annotated in the Vigna angularis (cultivar Longxiaodou 4) genome.

Table 8 BUSCO results for the gene models with the embryophyta_odb10 database.

Fig. 2 Comparisons of the predicted gene models in the Vigna angularis (cultivar Longxiaodou 4) genome with those of other species. (a) Comparison of gene length between Vigna angularis and other species. (b) Comparison of CDS length between Vigna angularis and other species. (c) Comparison of exon length between Vigna angularis and other species. (d) Comparison of intron length between Vigna angularis and other species.

Using the publicly available databases TrEMBL, Swiss-Prot [36], InterPro [37], NCBI non-redundant protein (NR), euKaryotic Orthologous Groups (KOG) [38], Gene Ontology (GO) [39], and Kyoto Encyclopedia of Genes and Genomes (KEGG) [40], 25,479 predicted genes (approximately 98.23% of all predicted genes) were functionally annotated with at least one of these databases.
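The phrase "annotated with at least one of these databases" corresponds to taking the union of the per-database hit sets. The sketch below illustrates that bookkeeping under stated assumptions: the gene identifiers and per-database hits are invented toy data, and the function name is hypothetical. Only the final check uses the counts reported in the text.

```python
from typing import Dict, Iterable, Set, Tuple


def annotated_summary(all_genes: Iterable[str],
                      hits_by_db: Dict[str, Set[str]]) -> Tuple[int, float]:
    """Count genes with a hit in at least one database and return the fraction."""
    gene_set = set(all_genes)
    if not gene_set:
        raise ValueError("the gene set must not be empty")
    # Union over all databases, restricted to genes in the predicted gene set.
    annotated = set().union(*hits_by_db.values()) & gene_set
    return len(annotated), len(annotated) / len(gene_set)


# Toy example with hypothetical gene IDs and hits.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = {
    "Swiss-Prot": {"g1", "g2"},
    "InterPro": {"g2", "g3"},
    "KEGG": {"g1"},
    "GO": {"g3", "g4"},
}
count, fraction = annotated_summary(genes, hits)
print(f"{count} of {len(genes)} genes annotated ({fraction:.1%})")

# Consistency check on the counts reported in the text.
reported_pct = round(100 * 25479 / 25939, 2)
print(f"Reported annotation rate: {reported_pct}%")
```

With the reported counts, 25,479 of 25,939 genes gives 98.23%, matching the percentage stated above.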
