Chromosome-level genome assembly and annotation of the Rhabdophis nuchalis (Hubei keelback)

Ethics statementThis study adhered to all pertinent ethical and legal guidelines and regulations. The collection of animals and extraction of tissues underwent thorough review and received approval from the Animal Ethics and Welfare Committee of Sichuan Agricultural University (Approval No. 20230121).Sample collectionAn adult female R. nuchalis (body length of 755 mm) was collected from the Shennongjia forest area (latitude: 31.683625, longitude: 110.418075) in Hubei Province, China, for genome sequencing and assembly. Six different tissues (heart, liver, spleen, lung, kidney, and muscle) were sequentially collected and rapidly frozen in liquid nitrogen upon collection, then stored at −80 °C. Liver tissue was utilized for MGI short-read sequencing, PacBio Revio HiFi long-read sequencing, and Hi-C sequencing, while the remaining five tissues were designated for RNA sequencing.Library construction and sequencingThe collected tissues were sent to GrandOmics Biosciences Co., Ltd. (Wuhan, China), for DNA extraction, library construction, and sequencing. Genomic DNA (gDNA) was extracted from the liver following the manufacturer’s instructions and used for the construction of gDNA libraries. The integrity and purity of the gDNA samples were assessed using agarose gel electrophoresis.For short-read sequencing, 1.5 μg gDNA was randomly fragmented by Covaris, following the guidelines specified in the device’s operating manual, and 300–400 bp fragments were selected with the Agencourt AMPure XP-Medium kit. The library was then constructed from the selected fragments using the AxyPrep Mag PCR clean-up Kit according to the manufacturer’s instructions. Finally, the qualified libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This yielded 108.82 Gb of raw reads, and 101.02 Gb of clean reads (with an average depth of coverage of 38.95×) were obtained after quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for genome size estimation and to evaluate the accuracy of genome assembly.Table 1 Statistics of sequencing data used for R. nuchalis genome assembly in this study.For PacBio HiFi long-read sequencing, 5 µg of gDNA was used to construct SMRTbell libraries following PacBio’s standard protocol (Pacific Biosciences, CA, USA). The process included shearing of gDNA using g-TUBEs (Covaris, USA) according to the expected size of the fragments for the library, DNA damage repair, end repair, and A-tailing, followed by ligating hairpin adapters at both ends of the fragments using the SMRTbell Express Template Prep Kit 3.0 (Pacific Biosciences). After nuclease treatment of the SMRTbell library using the SMRTbell Enzyme Cleanup Kit, target fragments were screened using PippinHT (Sage Science, USA), and the prepared SMRTbell library was sequenced on the PacBio Revio platform instrument with Revio Kit in Grandomics. This resulted in 79.31 Gb of HiFi long reads (with an average coverage depth of 39.65×) for genome assembly (Table 1).Hi-C libraries were constructed following the protocol22 to obtain the genome at the chromosome level. Key steps included fixation of liver samples using 2% formaldehyde, cleavage of sequences with the DpnII enzyme, end repair, biotin-14-dCTP labeling, ligation with T4 DNA ligase, and uncross-linking and interrupting the sequences. Subsequently, the libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This generated 209.72 Gb of raw reads, and 209.72 Gb of clean reads (with an average depth of coverage of 107.70×) were obtained after quality control using fastp v0.21.021 (Table 1).To improve the precision of genome annotation, RNA sequencing was conducted across five distinct tissues: heart, spleen, lungs, kidneys, and muscles. Each tissue underwent RNA extraction utilizing TRIzol reagent (Invitrogen, USA), followed by assessment of RNA purity and concentration using Nanodrop and Qubit, construction of RNA-seq libraries employing the MGIEasy RNA Sample Prep Kit (UW Genetics), and sequencing on the BGISEQ DNBSEQ-T7 platform. A minimum of 6 Gb of sequencing data was guaranteed for each tissue. In total, 40.26 Gb of raw reads were generated, with 40.14 Gb of clean reads obtained post quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for transcriptome annotation of the genome.Predicting genome size and heterozygosityThe genome size and heterozygosity of R. nuchalis were predicted using KMC v3.2.123 and GenomeScope v124 software before assembly. Initially, the short reads, post-quality control, underwent analysis with KMC v3.2.123 (parameter k = 17) to generate the k-mer frequency distribution table. Subsequently, the obtained k-mer frequency distribution table was analyzed using GenomeScope v124 software to derive genome prediction information. Finally, the prediction results indicated a genome size of 1.57 Gb and a heterozygosity of 1.20% (Table 2).Table 2 Statistical analysis of the size and heterozygosity prediction for the R. nuchalis genome (K-mer = 17).
De novo assembly of the R. nuchalis genomeDe novo assembly of the R. nuchalis genome was conducted using the obtained HiFi long reads through hifiasm v0.16.025. We acquired the preliminary assembled genome, which underwent comparison with the NT (Nucleotide Sequence Database) library. Sequences longer than 1 Mb were subjected to 50 kb cuts, and contaminating reads (non-target macroclasses, mitochondria) were subsequently removed from the genome to yield the final assembly. The resulting genome size of R. nuchalis, post-contamination removal, was 1.93 Gb, with a contig N50 of 104.79 Mb (Table 3).Table 3 Assembly statistics for R. nuchalis are presented in two parts: the first part comprises the assembly results prior to Hi-C integration, while the second part showcases the outcomes of the Hi-C-assisted assembly process.To assess the quality of the genome assembly, we first employed BUSCO v4.0.526 (Benchmarking Universal Single-Copy Orthologs) to evaluate completeness. This involved analyzing single-copy homologous genes in the OrthoDB database vertebrata_odb10. The analysis revealed that 3,270 (97.50%) out of 3,354 BUSCO groups were identified as complete, including 3,232 complete and single-copy BUSCOs (96.36%), and 38 complete and duplicated BUSCOs (1.13%), indicating high completeness of the assembled genome (Table 4).Table 4 Statistics from the BUSCO assessment of the R. nuchalis genome assembly and annotation.Furthermore, to evaluate the accuracy of the assembly, clean short reads and HiFi long reads were mapped to the R. nuchalis genome using BWA v0.7.1527 and minimap228, respectively. The results indicated that at a coverage depth of 1×, the clean short reads and HiFi long reads achieved 98.24% and 99.97% coverage across the entire genome, respectively (Table 5). This demonstrates the high accuracy of the genome assembly.Table 5 Results from this study involve the alignment of quality-controlled short-read and long-read sequences to the assembled R. nuchalis genome.Hi-C assisted assemblyWe employed a multi-step process to assemble the genome of R. nuchalis to the chromosome level using quality-filtered Hi-C reads. Firstly, clean Hi-C reads were aligned to genomes assembled with HiFi long reads using bowtie2 v2.3.229 to obtain uniquely mapped paired-end reads. Subsequently, HiC-Pro v2.8.130 was utilized to identify and retain valid interacting paired-end reads from these uniquely mapped pairs while filtering out invalid sequences such as dangling-end, self-cycle, re-ligation, and dumped products.Subsequently, the scaffolds underwent further clustering, sorting, and chromosomal localization using LACHESIS v131. Subsequent manual adjustments were made to the genome using Juicebox v1.11.0832 to derive the final pseudochromosomes. The chromosomes, GC content, gene density, abundance of repetitive sequences, and ncRNA distribution of the genome were visualized using the advanced circos33 in TBtools II34 (Fig. 1B). The analysis unveiled that R. nuchalis features 20 chromosomes, consisting of 9 macrochromosomes and 11 microchromosomes (with a 50 Mb threshold in squamates35). Chromosome sizes varied from 14.96 Mb to 411.07 Mb, contributing to a total genome size of 1.92 Gb (Tables 3, 6, and Fig. 1). Notably, the contig N50 stood at 104.79 Mb, while the scaffold N50 reached 204.96 Mb (Table 3). This comprehensive approach facilitated the structuring of the genome into chromosomal configurations, offering profound insights into the genomic architecture of R. nuchalis.Fig. 1(A) Hi-C interaction heatmap for R. nuchalis. Numbers on the x-axis and y-axis, 1 to 9 represent 9 macrochromosomes and “10~20” correspond to 11 microchromosomes. (B) Circos diagram of the R. nuchalis genome. “Chr 01” to “Chr 20” represent chromosomes; (a) chromosomes; (b) GC ratio; (c) gene density; (d) abundance of repetitive sequences; (e) abundance of ncRNA; (b–e) per 100-kb window. At the center is an adult R. nuchalis photographed by Xuemei Tang.Table 6 Statistics of 20 chromosomes of R. nuchalis genome (1–9 are macrochromosomes and 10–20 are microchromosomes).Repeat sequence annotationRepeat sequences, comprising tandem repeats (TRs) and transposable elements (TEs), were annotated in the genome of R. nuchalis using a combination of software tools and databases. For TRs, we employed GMATA v2.236 and Tandem Repeats Finder (TRF v4.07b37) software pairs. GMATA v2.236 identified simple repeat sequences (SSRs), while TRF v4.07b37 identified all tandem repeats in the genome. Regarding TEs, a dual approach of de novo and homologous annotation was adopted. Firstly, transposable elements were de novo annotated using MITE-hunter38 and RepeatModeler v1.0.1139 software, in which also uses LTR_FINDER40, LTR_harvest41 and LTR_retriver42 for synchronization detection of repeat sequences. Subsequently, the obtained libraries were compared with the TEclass Repbase database to categorize each repeat family using TEclass v2.1.343. Furthermore, RepeatMasker v1.33144 was utilized to search for both known and novel TEs by localizing sequences from de novo repeat libraries and Repbase repeat libraries. Overlapping transposon factors belonging to the same repeat class were sorted and combined.The results indicated that a total of 1.09 Gb of repetitive sequences were annotated in the genome of R. nuchalis, constituting 56.51% of the entire genome. Among these, TRs and TEs accounted for 13.78 Mb and 885.68 Mb in size, representing 0.72% and 46.02% of the whole genome, respectively. Class I and Class II TRs comprised 628.50 Mb and 257.18 Mb, contributing to 32.66% and 13.36% of the entire genome, respectively (Table 7). This comprehensive annotation provides insights into the repetitive landscape of the R. nuchalis genome.Table 7 Statistical outcomes regarding repetitive sequences in the R. nuchalis genome (categorized by the type of repetitive sequences).Gene structure annotationIn the structural annotation of the R. nuchalis genome, we initially applied RepeatMasker v1.33144 to soft-mask the annotated repetitive sequences. Subsequently, gene structure prediction was conducted through three methods: homology prediction, transcriptome prediction, and de novo prediction, with integration of the results to derive the final gene structure annotation. For homology prediction, comparisons were made with the genomes of five closely related species (Ahaetulla prasina7, Calamaria septentrionalis1, Pantherophis guttatus1, Thamnophis elegans NCBI accession GCA_009769535.1, and Thermophis baileyi8) using GeMoMa v1.6.145 software. Transcriptome prediction involved mapping quality-controlled RNA-seq reads to the R. nuchalis genome using STAR v2.7.3a46, followed by transcript assembly with Stringtie v1.3.4d47 and prediction of open reading frames (ORFs) using PASA v2.3.348. De novo prediction entailed reassembly of RNA-seq reads using Stringtie v1.3.4d and analysis with PASA v2.3.348 to generate a training set, followed by de novo gene prediction using Augustus v3.3.149. Finally, the predictions were integrated using EVM v1.1.148 (EVidenceModeler).The results indicated that homology prediction, transcriptome prediction, and de novo prediction annotated 48,439, 18,203, and 20,575 genes, respectively, with a final count of 22,057 protein-coding genes successfully annotated after EVM v1.1.148 integration. Among them, the average gene length and CDS length were 34,853.45 bp and 1,617.01 bp, respectively. Each exon contained an average of 9.12 genes, while the average lengths of exons and introns were 177.32 bp and 4,093.52 bp, respectively (Table 8).Table 8 Statistical outcomes of gene structure annotation in the R. nuchalis genome, obtained through three different methods and subsequently integrated.Gene function annotationWe have successfully completed the functional gene annotation of the R. nuchalis genome by utilizing five key public databases: GO (Gene Ontology)50, SwissProt51, NR (Non-Redundant protein Database), KEGG (Kyoto Encyclopedia of Genes and Genomes)52, and KOG (Eukaryotic Orthologous Groups of proteins)53. In the case of the GO database, we employed the default parameters of the InterProScan v5.3254 program for gene function annotation. For the remaining four databases, we utilized Blastp v2.7.1 to annotate gene functions. The results revealed that 13,451, 18,567, 19,655, 14,474, and 13,362 genes were annotated in GO50, SwissProt51, NR, KEGG52, and KOG53, respectively, accounting for 60.98%, 84.18%, 89.11%, 65.62%, and 60.58% of the total number of genes in R. nuchalis (Table 9). Notably, 9,343 genes were annotated across all five databases (Fig. 2). By integrating the annotation outcomes from these databases, we completed the functional annotation of 19,918 genes, representing 90.30% of the total gene count (Table 9, Fig. 2).Table 9 Statistical findings from the functional annotation of genes within the R. nuchalis genome, sourced from five distinct databases and subsequently consolidated.Fig. 2Venn diagram illustrating the outcomes of functional annotation for R. nuchalis across five databases.Subsequently, we conducted an evaluation of the genome annotation results. Initially, the annotated genes were assessed using BUSCO v4.0.526 based on the OrthoDB database vertebrata_odb10. The evaluation revealed that 3,237 complete genes were identified within 3,354 BUSCO groups, accounting for 96.51% of the database, underscoring the high completeness of the annotated genome of R. nuchalis (Table 4). Furthermore, we compared the genome of R. nuchalis with the published genomes of five closely related species, which exhibited a total gene count ranging from 18,213 to 22,959 genes (Table 10). Remarkably, R. nuchalis possessed 22,057 genes, aligning well with the published species (Table 10). Additionally, in terms of gene length, average CDS length, exon length, the average number of exons per gene, intron length, and the distribution of intron number, R. nuchalis exhibited consistency with the five closely related species (Table 10, Fig. 3).Table 10 Comparison of results of R. nuchalis genome annotation with closely related species.Fig. 3Comparing the genomes of R. nuchalis assembled and annotated in this study with the published genomes of five closely related species. Each species is represented by a different colored line; A-F show the gene length, CDS length, exon number, exon length, intron number, and intron length distribution in the genomes of each species sequentially.Non-coding RNA (ncRNA) annotationThe annotation of ncRNAs in the R. nuchalis genome was accomplished through a combination of database searching and model prediction methods. Specifically, tRNAs were annotated using tRNAscan-SE v2.055, while MicroRNAs, rRNAs, small nucleolar RNAs, and small nucleolar kernel RNAs were identified by searching the Rfam database56 using Infernal v1.1.2 cmscan57. Additionally, RNAmmer v1.258 prediction was employed for the annotation of rRNAs and their subunits. The results showed that a total of 3,599 ncRNA were annotated in the R. nuchalis genome, including 397 rRNA, 981 snRNA, and 2,063 tRNA (Table 11).Table 11 Statistical results of Non-coding RNA annotation of the R. nuchalis genome (categorized by the type of Non-coding RNA).

Chromosome-level genome assembly and annotation of the Rhabdophis nuchalis (Hubei keelback)

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models