Genome assembly, Full-length transcriptome, and isoform diversity of Red Snapper, Lutjanus argentimaculatus

Specimen of Lutjanus argentimaculatus
A specimen of the L. argentimaculatus being maintained at Muthukadu Experimental Station of ICAR – CIBA (Chennai, India) was used for generating the necessary sequence data for building genome assembly. The species identity of the mangrove red snapper was confirmed based on the partial sequence of the barcode gene, Cytochrome C Oxidase I (COI) using primers, F2 – 5′TCGACTAATCACAAAGACATCGGCAC3′ and R1 – 5′TAGACTTCTGGGTGGCCAAAGAATCA3′8. The partial sequence of COI gene was obtained by Sanger sequencing method and submitted to GenBank under the accession number, OQ560579. The phylogenetic analysis to assess the lineage of L. argentimaculatus sample used in the current study was performed using the COI sequence. Multiple accessions of COI sequence were obtained from the Barcode of life data systems, (bold database) for different species of Lutjanus, and Lethrinus lentjan was used as an out group. Supplementary Table 1 summarizes the number of accessions used for generating the phylogenetic tree. MEGA X software9 was used to generate the Maximum Likelihood tree with a 1000 bootstrap iterations implementing the Hasegawa-Kishino-Yano model with Gamma and Invariant site distribution rates (HKY + G + I). The final tree was visualized using FigTree v1.4.3210. The phylogenetic analysis for the COI sequence (OQ560579) of L. argentimaculatus sample used for genome sequencing revealed clustering with other accessions of L. argentimaculatus (Fig. 2).Fig. 2Maximum likelihood phylogenetic tree based on COI gene sequence of Lutjanus argentimaculatus and other Lutjanus species and Lethrinus lentjan as an outgroup. Fish images were obtained from FAO species catalogue and modified65.Genomic data generationThe genomic DNA was extracted from muscle tissue of L. argentimaculatus using Genomic-tip 100/G kit (Qiagen, Hilden, Germany). DNA quantification carried out using Qubit 3.0 fluorometer (Thermofisher Scientific, Massachusetts, USA) using DNA HS assay kit (Thermofisher Scientific, Massachusetts, USA). DNA purity was checked using NanoDrop 2000 (Thermofisher Scientific, Massachusetts, USA). The integrity of DNA was evaluated on 1% agarose gel. Approximately 48 µg of total DNA with a concentration of 322 ng/µl was obtained.PacBio long read generation was performed using the high molecular weight genomic DNA. The genomic DNA was sheared using Megaruptor (Diagenode Inc, NJ, USA) for library preparation. The gel electropherogram of Femto pulse run (Agilent Technologies, California, USA) showed average size of 42,046 bp and 31,506 bp of sheared DNA and size selected DNA respectively. The library preparation was performed with SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, California, USA) and size selection was carried out using BluePippinTM (Sage Science, Massachusetts, USA). The library was sequenced on PacBio Sequel II platform (Pacific Biosciences, California, USA) and the raw data processing was carried out using SMRTlink v10.1.0.119588 software. (Pacific Biosciences, California, USA). A total of 134.6 Gb PacBio Sequel II Continuous Long Read (CLR) sequence data was generated with 8,515,294 number of sequences, 117.2 Gb unique molecular yield, maximum length of 180,279 bp and a mean subread length per ZMW of 15,808 bp (Supplementary Table S2).The Illumina genomic DNA sequencing library was constructed using KAPA HyperPlus kit (Basel, Switzerland) as per manufacturers’ protocol. The library size was assessed using TapeStation (Agilent Technologies, California, USA). The libraries with average insert size 316 bp analysed through tapestation were later sequenced in Novaseq 6000 (Illumina) platform. About 461 million reads/ 69.70 Gb data was generated. The Illumina paired-end reads were used to assess the genome length.For Hi-C data generation, tissue crosslinking and proximity ligation was followed by library preparation using the Proximo Hi-C Kit (Animal) (Phase genomics, Washington, USA). The library was sequenced on Illumina NovaSeq 6000 platform in 150 bp paired-end mode to generate 449 million paired reads (67.44 Gb) of which 93.19% bases have Q30 quality score and 16.88% inter-contig high quality read pairs Supplementary Table S3. The Hi-C reads were used in scaffolding of assembly contigs.Transcriptome data generationSix tissues (muscle, gills, liver, kidney, stomach, and gonad) collected from adult L. argentimaculatus fish were flash frozen in liquid nitrogen and stored at −80 °C. The total RNA was extracted using TRIzol (DSS Takara, CA, USA) method and purified using Nucleospin RNA cleanup kit (Machery Nagel, Germany). RNA quantification was carried out using Qubit 3.0 fluorometer using RNA HS kit (Thermofisher Scientific, Massachusetts, USA). RNA integrity was checked using Bioanalyzer 2100 (Agilent Technologies, California, USA) and the quantity was measured using Nanodrop 2000 (Thermofisher Scientific, Massachusetts, USA). The total RNA was enriched for mRNA using NEBNext® Poly(A) mRNA Magnetic isolation module (NEB Inc., France) and cDNA was prepared using NEBNext® Single Cell/Low Input cDNA Synthesis and Amplification Module (New England Biolabs, Hitchin, UK). The cDNA was purified using pronex beads (Promega, Madison, USA).To generate Illumina compatible RNAseq libraries, the cDNA was used according to the manufacturer’s instruction as per NEBNext II RNA Library Prep Kit for Illumina® (NEB Inc., France). The libraries were quantified with qubit 3.0 fluorometer (Thermo Fisher Scientific, USA) using DNA HS assay kit (Thermo Fisher Scientific, USA). The libraries were subjected to fragment analysis using Femto Pulse pulsed field capillary electrophoresis system using Ultra Sensitivity NGS kit, and High Sensitivity D1000 ScreenTape (Agilent Technologies, California, USA) following manufacturer’s protocol to evaluate the insert sizes. The insert sizes obtained for the libraries were ranging from 295 bp to 362 bp. Sequencing data was obtained from the qualified libraries using Illumina NovaSeq 6000, S4 Flow Cell (2x150bp read lengths). Sequencing data (total reads and total bases) was obtained from gills (18.47 Gb), kidney (18.23 Gb), liver (17.41 Gb), muscles (17.07 Gb), stomach (23.07 Gb), and from gonad (14.54 Gb) is shown in (Supplementary Table 4). The RNAseq data were quality trimmed using trimmomatic v0.3911To generate the Long-read Iso-Seq data, the cDNA from individual tissues were pooled in equimolar ratio and then subjected to Library preparation. The library preparation from the pooled cDNA was performed with SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, California, USA). Library size was checked using bioanalyzer 2100 (Agilent Technologies, California, USA) and the observed average insert size was 4,564 bp. Sequence data was generated using Pacific Biosciences (PacBio) long-read RNA sequencing (Iso-Seq). A total of 42.64 million subreads were generated by Iso-Seq with subreads summing upto 138.56 Gb. (Supplementary Table 5).Genome size estimationKidney tissue sample of L. argentimaculatus was homogenized with 750 µl of LB01 lysis buffer (15 mM Tris, 2 mM disodium-EDTA, 0.5 mM spermine tetrahydrochloride, 80 mM KCl, 20 mM NaCl and 0.1% v/v TritonX-100). The homogenized sample was filtered through cell strainer (40 µm) (Corning Inc., New York, USA). The filtrate was treated with 1 µl of RNase A (Qiagen, Hilden, Germany) and propidium iodide (12 µl) (ThermoFisher Scientific, Massachusetts, USA). The control sample, chicken erythrocytes (20 µl) (BDTM DNA QC Particles kit, BD Biosciences, California, USA) was mixed with LB01 buffer (480 µl), RNase A (1 µl) and propidium iodide (12 µl). The test and control samples were incubated in dark at 4 °C for 30 min before flow cytometry analysis. The flow cytometry readings were acquired on BD AccuriTM C6 flow cytometer for 10,000 events, flow rate (14 µl/min) and 10 µm core size setting. Gating of the density plots was carried out and histogram data was acquired for genome size estimation using FlowJoâ„¢ Software (BD Life Sciences California, USA). The density plot and histogram obtained from flow cytometer for genome size estimation from kidney tissue is shown in Fig. 3a,b, and genome size of L. argentimaculatus was found to be 1.06 pg (1.037 Gb). The genome size estimation for Lutjanidae family ranges from 0.78 Gb to 1.37 Gb using different methods (https://www.genomesize.com/search.php). In our previous study we have reported the genome size for L. argentimaculatus as 0.93 Gb, using blood cells by flow cytometry analysis12. In the present study we have used the kidney cells of the fish which revealed 1.037 Gb genome size estimate. This slight variation in the estimate may have been impacted by the use of different tissue type for the target species. However, our estimates of L. argentimaculatus genome size is typically close and within the range values for other lutjanids available in the animal genome size database.Fig. 3Genome estimation profiles of L. argentimaculatus. (a) Density plot and histogram from kidney tissue of L. argentimaculatus and chicken erythrocytes (control). (b) Histogram depicting the count of events for Kidney tissue of L. argentimaculatus and chicken erythrocytes. (c) Genome length assessment of L. argentimaculatus by K-mer frequency generated using Jellyfish and Genomescope.Apart from the genome size estimates from flow cytometry, insights about the genome haploid length and repeat length is revealed by counting the K-mers from Illumina paired end reads. The K-mer frequency for L. argentimaculatus genome was estimated from the Illumina paired end reads using Jellyfish 2.2.313 to count the canonical 21 K-mers with the hash size as 20 G. The counts histogram was later given as input to the online tool Genomescope14 to get the K-mer frequency peaks. The K-mer frequency for L. argentimaculatus revealed the genome haploid length to be 713 Mb and genome repeat length as 91 Mb with genome unique length of 621 Mb (Fig. 3c). The genome size estimated by K-mer analysis for other Lutjanus species is around 1.117 Gb (Lutjanus campechanus)5.Genome assembly of Lutjanus argentimaculatus
The contig level assembly of mangrove red snapper was generated using the long-read sequencing data of about 134 Gb using wtdbg215, with minimum read length of 25,000 bp, and a minimum unitig length of 4096 bp. It generated an assembly of length 1.04 Gb in 699 contigs with N50 of 12.24 Mb. The mitochondrial DNA sequence was extracted from the contig file by performing blastn16 against known mitochondrial sequence of L. argentimaculatus (NCBI accession number: JN182927). The isolated mitochondrial genome sequence was 16,634 bp long and consisted of 13 CDS, 22 tRNA and 2 rRNA coding genes and a D-loop (Fig. 4). The sequence was uploaded to NCBI database along with the genome and was assigned the accession number CM068474. The contig assembly was further processed for removal of haplotigs and contig overlap using Purge_Dups v1.2.617. This resulted in 382 contigs with an N50 value of 12.24 Mb and total length of 1.03 Gb.Fig. 4The circular mitochondrial genome map of L. argentimaculatus depicting the 13 encoded protein coding genes, 22 tRNA genes, 2 rRNA genes and a D-loop.The purged contig assembly was further scaffolded with long range Hi-C contact map data using the tool SALSA218. Briefly the paired end reads were trimmed using fastp v0.12.419 with the option, –detect_adapter_for_pe. Trimmed reads were mapped to the purged contig assembly using BWA-MEM v 0.7.17-r118820 software and the resultant sam file was converted to bam file format. Further this bam file is converted to bed format using bamToBed program of bedtools v2.27.121. Finally, the sorted bed file, restriction enzyme cut sites (GATC, CTNAG, TTAA and GANTC) and the purged contig assembly was given as inputs to the SALSA2 pipeline to build the scaffold level assembly. The assembly file and Hi-C file were loaded on Juicebox with Assembly Tools v.2.17.00 for visualization and manual curation (Fig. 5). The scaffold level assembly comprised of 400 scaffolds with the top 24 scaffolds covering up to 828.87 Mb (Fig. 6, Table 1). The genome size estimated for L. argentimaculatus by flow cytometry indicated a closer approximation of the actual size (1.03 Gb) of the mangrove red snapper genome assembled in this study.Fig. 5Hi-C map representing the scaffolds of Redsnapper genome assembly.Fig. 6Circos plot of the Mangrove red snapper genome. From the outermost: Track1: Top 24 largest scaffolds of the red snapper genome. Track2: Contigs corresponding to the scaffolds represented as tiles. Track3: Genes of red snapper shown as highlights with incremental gene lengths of 5,000 bp (viz. <5 kb, 5 kb – 10 kb, 10 kb – 15 kb, 15 kb – 20 kb, 20 kb – 25 kb and >25 kb) shown in ascending order of the gene lengths. Track4: Full length isoform sequences supporting the genes of red snapper shown as highlights with incremental isoform lengths of 5,000 bp (viz. <5 kb, 5 kb – 10 kb, 10 kb – 15 kb, 15 kb – 20 kb, 20 kb – 25 kb and >25 kb) shown in ascending order of thethe isoform lengths. Track5: Transcripts generated from RNAseq data supporting the genes of red snapper shown as highlights with incremental transcript lengths of 20,000 bp (viz. <5 kb, 5 kb – 10 kb, 10 kb – 15 kb, 15 kb – 20 kb, 20 kb – 25 kb and >25 kb) shown in the ascending order of the transcript lengths. Track6: GC content of Red snapper genome shown as line diagram plotted with 100 kb sliding window. The GC values below 35 and above 45 are shown in black color, and remaining in tan color.Table 1 Assembly statistics of L. argentimaculatus genome at contig and scaffold level.Synteny analysis was performed for the assembly of mangrove red snapper with crimson snapper, Lutjanus erythropterus [Assembly accession no. ASM2009168v1]22 and zebrafish, Danio rerio [Assembly accession no. GRCz11]23 using SyMAP v5.3.524 to understand the genome similarity between them. The largest 24 scaffolds of red snapper and crimson snapper were taken, whereas for the zebrafish, 25 chromosomes were taken to perform genomic synteny. The analysis revealed that there are 31 syntenic blocks greater than 10 mb between mangrove red snapper and crimson snapper (Fig. 7). Whereas when the red snapper was compared to the zebrafish there are 8 and 58 blocks greater than 10 mb shared between them respectively (Fig. S1). It was also observed that 99% of red snapper is covered by 84% of crimson snapper in syntenic blocks. Similarly, 96% of red snapper genome is covered by 92% of zebrafish genome in syntenic blocks.Fig. 7Synteny between Mangrove red snapper and Crimson snapper.Repeat profileThe repeat profile of the red snapper assembly was obtained using RepeatMasker v4.1.425 (http://repeatmasker.org/RMDownload.html). First the database of repeat families were generated using RepeatModeler v2.0.4 (http://repeatmasker.org/RepeatModeler/) and then de novo repeat prediction was done using RepeatMasker v 4.1.4, LTR_FINDER (LTRPipeline v2.0.4), and TRF tool v4.0926 to identify all kinds of repetitive elements.A total of 446.78 Mb genome sequences were identified as repeats, accounting for 43.37% of the whole assembly. DNA transposons constituted the major classified repeat class (13.51%, 139,201,547 bp), followed by Retroelements (5.06%, 52,120,576 bp). Simple repeats (1.81%, 18,645,993 bp) and rolling circles (0.45%, 4,654,588 bp), The major repeat element classes of retroelements was found to be LINEs (3.28%) followed by LTR elements (1.44%), SINEs (0.34%) and Penelope (0.05%). The Low complexity sequences were found be around 0.26% and small RNAs were estimated to be around 0.23%. (Table 2).Table 2 Repeat profile of L. argentimaculatus genome.Iso-Seq transcriptome data processingThe raw data was processed using iso-seq3 pipeline SMRTLink v10.1 to obtain high quality isoforms. The methodology described in27,28. was followed with minor modifications. Briefly, ccs step was called to obtain Circular Consensus Sequence (CCS) with options as ‘minimum passes = 3 and minimum quality = 0.99’. The data was further demultiplexed using lima by providing the options ‘peek-guess, dump-clips, and dump-removed’. The demultiplexed sequences were further processed through isoseq3 refine step with options ‘require-polya and minrq = 1’ to obtain Full Length Non-Chimeric sequences (FLNC). Finally, isoseq3 cluster was run to obtain 94,258 high quality transcripts with N50 length of 3,973 bp. To screen for any possible contaminants Mash v2.229 was used with the refseq genome sketch. To reduce the redundancy in transcripts, the high quality sequences were collapsed based on the exonic structures using the isoseq3 collapse program with options ‘min-aln-coverage = 0.85 and min-aln-identity = 0.95’ to obtain 56,515 non-redundant transcripts.Gene prediction and functional annotationGenes were predicted for the mangrove red snapper genome by incorporating evidences from Iso-seq, RNAseq datasets and proteins from related species. The protocol described in30,31 was adapted with minor changes. While the evidence from Iso-seq data was directly used for building the gene models, the other two evidences, RNA-seq and proteins were used to refine the gene models. Briefly, the full-length non-chimeric (FLNC) sequences of Iso-Seq data generated in this study were aligned to the genome using GMAP v2020-06-3032 to generate hints. Tentative genes were predicted using Augustus v3.3.333 by providing the repeat masked genome and the Iso-seq hints generated previously as inputs and zebrafish gene models as reference. The evidence from RNA-seq data were obtained by aligning paired-end reads of various tissues (gills, muscle, kidney, stomach, gonad, and liver) to the genome using Hisat2 v2.2.034 and the resultant sam file was converted to bam file format and sorted using samtools v1.1435. The reference based transcriptome was assembled using StringTie v2.1.436 by providing the sorted bam file as input. These assembled transcripts were subjected to further refinement, filtering and subsequently transformed into genome-based coordinates using TransDecoder v5.5.0637. Additionally, protein sequences from closely related species within the order Eupercaria (Supplementary Table 6) were downloaded from NCBI Genome database, concatenated and subjected to clustering at 90% identity threshold using CD-HIT V4.8.138. Hints were then generated by aligning these clustered protein sequences to the genome using GenomeThreader v1.7.339.The gene model, formed by integrating the outputs of Augustus with hints from Iso-Seq, RNA-Seq, and homologous proteins, was unified using a weighted approach within Evidence Modeler37. This approach culminated in a consensus gene model encompassing a total of 27,172 genes comprising of 21,629 complete genes (having start and stop codons) and 905 genes with neither start nor stop codons (Supplementary Table S7).A homology based annotation against the Actinopterygii (txid7898) dataset of non-redundant database was performed using blastx tool16 and hits were obtained for 24,981 (91.93%) transcripts, The protein domains based annotation was executed by the Interproscan module of OmicsBox v3.2.440 and the EggNOG mapper module41 was used for orthology based annotation resulting in 22,870 (84.17%) and 16,147 (59.42%) hits respectively. The gene ontology and functional annotation obtained by combining the gene ontology terms from blastx, Interpro and EggNOG summed upto 22,491 (82.77%) transcripts. The annotated genes participating in different pathways were obtained by mapping against KEGG database42 and enzyme codes were obtained for 7,467 (27.48%) of them (Table 3) Figs. S2 & S3. Transferases and hydrolases were the most dominant enzyme classes expressed (Fig. S4). Gene ontology revealed that the most expressed GO categories were macromolecule biosynthetic process (Biological processes), anion binding (Molecular function) and intracellular membrane-bounded organelle (Cellular components) (Fig. S5). The annotated genes were involved in 330 pathways as per the results obtained from KEGG pathway analysis (refer Gene_pathways.xlsx43 in figshare repository).Table 3 Annotation statistics of L. argentimaculatus through various databases.Isoform diversity of L. argentimaculatus transcriptomeSQANTI3 v. 5.044 was used to classify 56,515 transcripts into isoform categories and it has led to identification of 18,108 unique gene models in the mangrove red snapper transcriptome (Table 4, Supplementary file 1). Out of 56,515 unique isoforms, 10.6% isoforms were found to be Full Splice Match (FSM), since they are found to have the same splice junctions as the reference transcripts. The splice junction matching the reference from the beginning but not at the end and called as Incomplete Splice Match (ISM) were found to be 3.75% of the total isoforms. The Novel in Catalog (NIC) having a combination of known acceptor or donor splice sites and Novel not in Catalog (NNC) which are known to use at least one novel acceptor or donor sites were found to be at 2.6% and 43.94% respectively. Another major category of isoforms which typically span across two genes and called as Fusion isoforms comprised of 32.46%. The genic genomic isoforms which overlap with introns and exons comprised of 1.69%. The intergenic isoforms which are found in intergenic locations were about 4.49%. The antisense isoform which have an overlap in the antisense strand were about 0.4% in the mangrove red snapper transcriptome. The Genic intron isoform which are known to be completely contained within an intron category of isoforms were absent in the mangrove red snapper transcriptome. About 43.69% of genes had only one isoform and about 27.42% genes had two to three isoforms. About 13.53% of genes consisted of four to five isoforms and rest of 15.35% genes had more than or equal to 6 isoforms (Fig. S6).Table 4 Statistics of Iso-Seq based transcriptome analysis.We have obtained high number of 56,515 unique isoforms which was comparable with equally high number of isoforms (66538) obtained from grey mullet in our earlier study27. A higher proportion of genes with more isoforms indicate presence of a high degree of transcriptome complexity in brackishwater fish transcriptomes. These transcriptomes will be an important resource in exploring the overall contribution of alternative isoforms to gene expression related to various functional pathways.Alternate splicing events and rarefaction curve analysisFor identifying the alternate splicing events in the mangrove red snapper transcriptome, SUPPA245 with the subcommand generateEvents was used with default options. In total 23,064 alternate splicing events were identified in 5,905 genes, of which 5,228 were alternative 3′ splice site and alternative 5′ splice site (A3/A5 events), 8,925 were alternative first/last exons (AF/AL events), 5,964 were retained intron (RI events), 2,780 were Skipping Exon SE events and 167 were mutually exclusive exons (MX events) (Fig. 8).Fig. 8Identification of alternate splicing events in L. argentimaculatus transcriptome.Alternate splicing plays key roles in response to different stress factors in the fish such as aggression in Betta splendens46 cold stress in Tilapia47, heat stress in catfish48. The other functional associations of alternative splicing include in sex determination and gonadal differentiation reported in zebra fish49, antiviral immunity in grass carp50 etc. It has been reported that alternate splicing events show universally similar trends across organisms with relatively high occurrences of retained introns (RIs), alternative 3′ and 5′; splice sites (A3 and A5), alternative first exons (AF), and less degree of skipping exons (SE), alternative last exons (AL) and mutually exclusive exons (MX). Our study also revealed similar pattern except for AF which showed highest events (37%), followed by RIs (26%). The SE (12%), AL and MX (1%) were observed at lower levels. However, the comparative pattern of alternate splicing events in case of fishes alone remains to be studied before any conclusion.Rarefaction curve analysis was performed to understand the read depth required for discovery of genes and isoforms. From the cDNA_Cupcake scripts (https://github.com/Magdoll/cDNA_Cupcake), subsample.py was used with the options ‘min_fl_count = 2 a– –step = 1,000’ by providing the collapsed transcripts and the sqanti3 classification file as inputs. To understand further about the isoform discovery rate at the various categories level, another script subsample_with_category.py was used with the options and inputs same as subsample.py script. The rarefaction curve analysis represented the relation between genes and isoforms to the read depth (Fig. 9(a)) and the isoform discovery rate at the various categorical levels (Fig. 9(b)). At the gene/isoform level, rarefaction curves showed that the discovery of genes reached saturation much earlier in comparison to the isoforms. Under isoform categories, the novel not in catalog (NNC) and fusion isoforms are still being discovered, whereas the other categories of isoforms reached saturation and showed sequencing depth sufficiency.Fig. 9The rarefaction curve analysis. (a) to obtain genes and isoforms relative to the read depth, (b) the isoform discovery rate at the various categories level.In conclusion, this highly contiguous genome assembly and the isoform characterized transcriptome of L. argentimaculatus will be a useful genetic resource and will provide valuable insights and information pertaining to structural, functional, and evolutionary aspects of fish.

Hot Topics

Related Articles