First genome assembly of the order Strepsiptera using PacBio HiFi reads reveals a miniature genome

Specimen collection and sample preparationDuring June of 2021, we used entomological nets to collect infected social wasps, Polistes fuscatus, at the Robert H. Treman State Park located in Ithaca, NY, USA. We used taxonomic studies with detailed morphological characteristics to confirm that the parasite was X. peckii23,33,34. Several studies also included information about voucher specimens, confirming that P. fuscatus is infected by X. peckii in Ithaca, NY23,33,34. Infection and stage of development of X. peckii was confirmed by the extrusions in the host abdomen, which are morphologically distinct for developing female (N = 10) and male parasites (N = 10)5. We transported the infected wasps to the University of Rochester in disposable and transparent deli cups that are pre-punched to increase air flow. Each infected wasp was housed individually in a deli cup with a water vial and a sugar cube and kept under controlled conditions that included: temperature of 25 °C, 40% humidity, and full spectrum light for 12 hours. We inspected infected wasps daily during the extrusion period of 3–7 days and 9–17 days for male and female parasites, respectively. After emerging from a host, male parasites were frozen with dry ice and stored at −80 °C. A single female was dissected out of its host and preserved the same way. A single male X. peckii was preserved for an ultra-low DNA input sample preparation and subsequent long-read sequencing, while an additional male and the female were preserved for short-read sequencing. Finally, we froze one more single male in liquid nitrogen for subsequent extraction of RNA, sequencing, and genome annotation.DNA and RNA extraction, library preparation, sequencing and pre-assembly read processingLibrary preparation and sequencing for PacBio HiFi data was performed at the DNA sequencing and Genotyping Center of the Delaware Biotechnology Institute at University of Delaware. Total genomic DNA was isolated using the Qiagen MagAttract kit (QIAGEN, shanghai, China) following the manufacturer’s instructions. One SMRTbell library of circular consensus sequencing (CCS) was constructed according to the Low DNA input PacBio protocol with additional bead cleaning. The library was then sequenced on one PacBio Sequel II SMRT cell platform in CCS mode, which calls consensus reads from subreads that are generated after multiple passes of the enzyme around the circularized template (Pacific Biosciences). Sequencing yielded 8.6 Gb of data and ~0.9 million HiFi (high-fidelity) reads from the male X. peckii sample (Table 1).Table 1 General statistics of raw sequencing reads used for X. peckii genome assembly and genome size estimation.We extracted Genomic DNA from the additional male and the female X. peckii using a DNeasy Blood and Tissue kit (QIAGEN, Valencia, CA, USA) for library preparation and short read whole genome Illumina sequencing. RNA from a whole male was extracted with the RNeasy Mini Kit, implementing a DNase digestion step with the RNase-Free DNase Set (QIAGEN, Valencia, CA, USA). DNA and RNA samples were received and validated via standard quality control procedures at Novogene Co., Sacramento, CA. To construct the genomic DNA library, DNA was sheared into short fragments which were end repaired, A-tailed and ligated with Illumina adapter. Fragments with adapters were PCR amplified, size selected and purified. Each library was checked with Qubit and real time PCR for quantification, and then pooled for sequencing on a NovaSeq PE 150 Illumina platform (Novogene Co. Ltd. Sacramento, CA). Sequencing results generated 7.5 Gb and 6.0 Gb of data for the male and female samples, respectively (Table 1).After sequencing, we assessed read quality using FastQC35. Then, we filtered reads for quality and length, and trimmed adapter sequences and PolyG tails using fastp36 with the options -l 40. Then we re-assessed the quality of the processed reads, which we used to validate the genome size, heterozygosity and repetitive content estimations and to identify sex chromosomes based on coverage patterns. The transcriptome data (total RNAseq) was also sequenced with the NovaSeq PE 150 platform, using an mRNA library preparation with standard poly A enrichment, which generated 6 GB of raw data to implement in the genome annotation.Genome assembly and size estimationPrior to assembly, estimating features like genome size, rate of heterozygosity, and repetitive element content proves useful to inform the parameter values that should be used in subsequent steps. We used Jellyfish (v2.3.0)37 to count and compute a histogram of k-mer frequencies from the raw PacBio HiFi reads using the count (-C -m 21) and histo (–h 1,000,000) modules (Table 2).Table 2 Software and version used for analyses in this study with corresponding parameters, if different from the default options.Then we used the Jellyfish histogram output to run the online web tool of GenomeScope2 (http://qb.cshl.edu/genomescope/genomescope2.0) with the following parameters: K-mer length = 21, ploidy = 2 and max kmer coverage = 1,000,00038. The model suggests that X. peckii has an estimated genome size of 63,567,278 bp with 1.19% heterozygosity levels, and 36.4% of the genome composed of repetitive elements (Fig. 2). The X. peckii genome we report here is one of the smallest insect genomes documented up to date7 (Table 3; See technical validation below).Fig. 2Genome size and content estimation using GenomeScope 2.0. Genomescope2 k-mer (21) distribution from the adapter trimmed PacBio HiFi reads. K-mer depth is plotted against k-mer frequency for a given depth. The plot displays estimation of genome size (len), percentage of the genome that is not in repetitive elements (uniq), homozygous rate (aa), heterozygous rate (ab) mean k-mer coverage for heterozygous bases (kcov), read error rate (err), average rate of read duplications (dup), k-mer size used on the run (k) and ploidy (p). (a) Expected distribution according to the model. (b) Obtained distribution according to the model. The four peaks correspond to the mean coverage levels of the unique heterozygous, unique homozygous, repetitive heterozygous and repetitive homozygous sequences, respectively.Table 3 GenomeScope estimation of genome structure using reads from different sequencing technologies.After a preliminary survey of the genome structure, we ran four different assembly software optimized for HiFi reads (i.e., Hifiasm39, Improved Phased Assembler IPA40, Hicanu41, Flye42) and computed BUSCO scores to gauge differences in assembly metrics. Hifiasm produced the highest quality genome (see technical validation below; Table 4; Fig. 3).Table 4 Assembly statistics using four different assembly software.Fig. 3Assembly statistics with different software. (a) QUAST comparison of assembly contig length produced by different software. (b) BUSCO scores produced by each software. Assembly with Hifiasm shows the highest completeness score with lowest duplication rates. Each assembler indicates that approximately 10% of the genes in the BUSCO database are missing from the genome.We assembled the HiFi reads without additional data into a draft haplotype-resolved genome assembly using Hifiasm v0.16.1-R34139. We then used the default parameters of Hifiasm to generate a primary and alternate assembly graph after aggressive purging of haplotypic duplications (-primary, -l 3) and three rounds of error correction39. The primary assembly produced by Hifiasm was 73,541,611 bp long, assembled in 229 contigs with an N50 of 7.3 Mb and a GC content of 23.6%. Both the primary (73 Mb) and alternate (64 Mb) assemblies produced by Hifiasm were similar to the estimated genome size produced by GenomeScope2.However, the primary assembly was slightly larger than the GenomeScope2 estimations (Table 3; Fig. 2). High heterozygosity can result in spurious duplications that increase the genome size because two alleles from the same loci are included in the primary assembly. We used Jupiter Plot (−n = 50000, ng = 75)43 to produce a synteny plot and compare completeness between the primary and alternate assemblies produced by Hifiasm. The primary assembly produced by Hifiasm is more contiguous and the alternate assembly doesn’t include some of the largest contigs in the primary assembly (potential sex-chromosomes based on coverage patterns; Supplementary Figure 1a). Thus, the small discrepancies (<10 Mb) between the GenomeScope genome size estimations and the final assembly size may be due to assembler limitations when sorting or deduplicating repetitive regions.Genome quality assessmentWe used Blobtools2 from the BlobToolKit suite44 to screen for contamination. First, we performed a BLASTn45 search of our assembly against the general RefSeq blast -nt database using the following parameters: -outfmt ‘6 qseqid staxids bitscore std’ -max_target_seqs 1 -num_threads 12 -evalue 1e-6. Then we used the function blobtools –add to create a BlobDir database that included the blast output (hits file), the read coverage (bam file) from mapping the raw reads back to the assembly, and the BUSCO scores (Busco summary file; see below). We implemented the BlobToolKit v.1.1.144 online Viewer (blobtools host ‘pwd’) to create a Blobplot with the hits of our bacterial contamination scan, and their respective coverage (Supplementary Figure 2). We removed two contigs that matched the phylum Proteobacteria. Specifically, one whole contig was the genome of Wolbachia pipitensis (see technical validation below; Supplementary Figure 2). Then, we used QUAST v346 to assess the metrics of the curated assembly. The final assembly size was 72,105,243 Mb, assembled in 227 contigs with a GC content of 23.4%. We used the BlobtoolKit v.1.1.144 to create a SnailPlot to visualize our assembly statistics (Table 5; Fig. 4). This Whole Genome PacBio HiFi assembly project has been deposited at DDBJ/ENA/GenBank under the accession JAWUEG00000000063.Table 5 Statistics produced by QUAST for final assembly using Hifiasm after removing bacterial contamination.Fig. 4Snail plot assembly visualization using BlobTools 2.0. The contiguity and completeness of the X. peckii genome assembly after contamination screening is plotted as a circle that represents the full length of the assembly (~72.1 Mb), assembled in 227 contigs. The N50 (7.38 Mb) is highlighted in dark orange and the N90 (470k) in light orange. The longest contig was 8.76 Mb (highlighted in red). The assembly has a uniform GC content of 23.4% and few contigs <100 Kb in length. The BUSCO scores are shown in the top right corner in green.We evaluated assembly quality and completeness using Benchmarking Universal Single-copy Orthologs (BUSCO) v.5.2.247 with the insecta_odb10 database. From a total of 1367 BUSCO groups searched, our assembly has 87.4% of genes complete and present in single copy, 0.7% duplicated genes, 1.9% fragmented genes, and 10.4% of the genes missing (Fig. 3). Additionally, we evaluated our assembly against the 2124 BUSCO groups from the endopterygota_odb10 database. This assessment shows 74.8% of complete and single copy genes, 1.1% duplicated genes, 6.5% fragmented genes, and 17.6% missing genes. The underlying cause behind the high proportion of missing genes from both datasets will be clarified once more genomic data for the order becomes available.Mitochondrial genome assemblyWe assembled the mitochondrial (mtDNA) genome of X. peckii from the raw PacBio HiFi reads with the MitoHiFi pipeline48,49, which uses a reference-guided method to perform the assembly. First, the pipeline implemented the software MitoFinder49 to scan all the mtDNA assemblies available on NCBI and downloads the highest quality reference mitogenome for the most closely related taxa. PacBio HiFi reads were mapped to the reference mitogenome using Minimap2 v.2.1750. Then, all the mapping reads were assembled de novo with Hifiasm39 into a mitogenome. A mitochondrial genome is available for X. vesparum (partial genome: 14,519 bp NCBI Accession number DQ364229). In addition, two independent mitogenomes were found for X. yangi, previously X. cf. moutoni (partial genome: 16,717 bp NCBI Accession number MW222190 and partial genome: 15,324 bp NCBI Accession number OK329871)13,14,15. Following the MitoFinder output, we used the mitochondrial genome provided under X. yangi, as the starting reference sequence, because it is the most complete sequence available15. Then, we used MitoHiFi (-p 90 -o 5) to assemble the mitogenome. A total of 30,623 HiFi reads mapped to the reference and were used for de novo assembly with Hifiasm. The final mitogenome size assembled for X. peckii was 16,111 bp (NCBI Accesion JAWUEG000000000)63. To confirm the accuracy of the mitogenome, we used NOVOPlasty51 to assemble an independent mitogenome de novo. We used the Illumina 150 bp short reads from the male X. peckii (Table 1) and the COX1 X. peckii gene sequence available on NCBI (Accesion JN082808)8 for the starting reference, as input for NOVOPlasty. This de novo NOVOPlasty assembly resulted in a mitogenome of 15,946 bp, a similar size and syntenic to the one assembled with MitoHiFi (Supplementary Figure 3).Subsequently, we used the software MITOS252 to annotate the mitochondrial genome. The sequence spans the full region between rrnL and ND2, contains 37 genes and is circularized (Fig. 5a). After manual inspection and curation of the mitogenome, we found that MITOS2 was unable to annotate the complete region of the 16S rRNA gene (i.e., rrnS). In Strepsiptera, a recent study highlights that the rrnS gene appears to have highly divergent regions in the 5’ section of the gene that is flanked by trnV, which hinders the annotation process17. Additionally, to assess the coverage along the sequence, we mapped back to the assembly all the reads that aligned to the original reference X. yangi mitogenome, as well as all the HiFi reads available. We found relatively homogeneous coverage along the sequence, except in the first 2 kb which is likely composed of repetitive elements (Fig. 5b,c). Finally, we used BLAST + v2.145 to search for matches of our mitochondrial assembly in the nuclear genome and filtered out contigs from the nuclear genome with a percentage of sequence identity >99% and of smaller size than the mtDNA sequence53.Fig. 5Mitogenome annotation and coverage. (a) Annotation and protein prediction plot produced by MITOS2. The figure legend highlights the position and quality of the sequence by color-coding individual genes. (b) Coverage along the sequence, using only reads that mapped to the reference X. yangi mitogenome. (c) Coverage along the sequence implemented all HiFi reads available.Mapping rate and coverage for X-linked contigs discoveryCytogenetic data for the order Strepsiptera is only available for two species, and it indicates that they have heteromorphic X and Y chromosomes12,54. Specifically, the number of chromosomes of X. peckii was identified as 2n = 1623,25. To evaluate mapping rate and depth of coverage along the assembly we used Minimap2 v.2.1750 to map the raw long PacBio HiFi reads, as well as the short Illumina (150 bp) PE reads, back to the final assembly. The male X. peckii HiFi reads had an average mapping rate of 99.43% and 97.8% coverage, with a mean depth of coverage of 117X. For the Illumina PE reads the mapping rate was 94.95% for the male and 77.61% for the female, with a mean depth of coverage of 97.27X and 63.73X respectively. We used the CIRCOS55 tool to visualize the depth of coverage through the first 11 contigs that contain >90% of the genome. We identified three X-linked contigs (ctg02, ctg23 and ctg15), which presented a male to female coverage ratio of 0.5. Interestingly, ctg02 is also the largest contig in our assembly (Fig. 6).Fig. 6X-linked contig discovery based on depth of coverage. Average coverage over 50 Kb windows with 20Kb step size in the 11 largest contigs, that contain >90% of the genome. Three individual samples were used: the first external coverage plot corresponds to the HiFi reads of the X. peckii male used for the genome assembly. The middle coverage plot corresponds to short 150 bp Illumina reads from another male, and the internal coverage plot corresponds to short 150 bp Illumina reads from a female. Potential X-linked contigs (red) have half the coverage in both male samples than the rest of the autosomes (blue). In contrast, the female sample has a uniform depth of coverage throughout the plot.Repeat modelerAs this is the first high quality genome assembly of the order Strepsiptera and there are no repeat libraries available for the order, we identified and classified repetitive elements de novo using RepeatModeler v.2.0.13356. Then, we used RepeatMasker v.4.0.73457 to mask the genome assembly using the de novo library produced by RepeatModeler2. RepeatMasker masked 38.41% of the genome, a similar value to the repeat content suggested by GenomeScope2 before removing the bacterial contigs. Most of the repeats in the genome are unclassified (27.8%). LTR elements are the most abundant elements among the classified repeats (5.74%; Fig. 7; Table 6).Fig. 7Repeat content classification using RepeatModeler2 and RepeatMasker. The proportion per category of repetitive element found across the genome. Categories include long interspersed elements LINEs (brown), LTR elements (yellow), DNA elements transposons (light green), simple repeats (purple), low complexity (orange) and unclassified repeats (pink).Table 6 Repetitive elements modeled de novo with RepeatModeler2 (class and content).AnnotationWe annotated the X. peckii genome assembly using the BRAKER358 pipeline. First we used RepeatMasker v.4.0.73457 to soft mask the genome (–xsmall) using the repeat library we generated de novo with RepeatModeler256. Next we used the software STAR (Spliced Transcripts Alignment to a Reference) v2.7.159 to map RNA-seq reads (male; whole body) to the genome. Then we used the mapped reads (–bam) as evidence input for BRAKER358 to generate ab initio gene predictions which resulted in 8759 predicted genes. Additionally, we ran a homology-based gene prediction with the tool GeMoMa v1.960 which allows to incorporate RNA-seq evidence for splice site prediction. We used the annotated proteins from the published genome of Drosophila melanogaster61 and the beetle Tribolium castaneum62 as reference input, and incorporated the previously mapped RNA-seq reads with the following command:java -Xms5G -Xmx170G -jar GeMoMa-1.9.jar CLI GeMoMaPipeline AnnotationFinalizer.r=NO o=true t=UROC_Xpeckii_1.1.fasta s=own i=Tcast a=./GCF_031307605.1/genomic.gff g=GCF_031307605.1_icTriCast1.1_genomic.fnas=own i=Dmel a=dmel-all-r6.45.problem.free.subset.gff g=dmel-all-chromosome-r6.56.fasta r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=STAR_RNA_mapped.bam threads=16 outdir=TriboliumDrosophila. The combination of homology-based and ab-initio gene predictions by GeMoMa resulted in 12,860 genes (GFF file available in Figshare64).

Hot Topics

Related Articles