The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations

Cruise and sample collectionDiel1Diel1 eukaryotic metatranscriptome samples were collected during R/V Kilo Moana cruise KM1513 (SCOPE HOE-Legacy 2, July 2015) near Station ALOHA in the North Pacific Subtropical Gyre as previously described12,13,14. Briefly, replicate samples were collected from ~15 m depth every 4 hours over a 4-day period. The cruise track followed the same water mass by tracking a semi-Lagrangian drifter20, and seawater samples were collected with Niskin bottles attached to a CTD rosette. Approximately 7 L of seawater was pre-filtered using 100 µm Nitex mesh, collected onto a 0.2 µm polycarbonate filter using a peristaltic pump, and immediately flash-frozen in liquid nitrogen.Gradients 1Gradients 1 samples were collected during the R/V Ka’imikai-O-Kanaloa, cruise KOK1606, from 19 April to 3 May 2016 as previously described10. Seawater samples were collected with Niskin bottles attached to a CTD rosette. Approximately 6–10 L of seawater was pre-filtered using 200 μm Nitex mesh and collected by sequential filtration through 3 μm and 0.2 μm polycarbonate filters using a peristaltic pump. Samples were immediately flash frozen in liquid nitrogen.Gradients 2Gradients 2 samples were collected during R/V Marcus G. Langseth, cruise MGL1704, from 27 May to 13 June 2017 as previously described11. Seawater samples were collected from Niskin bottles attached to a CTD rosette. Approximately 7–10 L of seawater was pre-filtered using a 100 μm Nitex mesh and collected by sequential filtration through 3 μm and 0.2 μm polycarbonate filters using a peristaltic pump. Samples were immediately flash frozen in liquid nitrogen.Gradients 3Gradients 3 samples were collected during R/V Kilo Moana, cruise KM1906, from 10 April to 29 April 2019. Seawater samples for metatranscriptomes were collected from the ship’s seawater intake at ~7 m depth. Approximately 5–10 L of seawater was pre-filtered through a 100 μm Nitex mesh before sequential filtration through 3 μm and 0.2 μm polycarbonate filters using a peristaltic pump. Samples were immediately flash frozen in liquid nitrogen.To minimize diel transcript variability, surface seawater samples were collected at near dawn (Table 1) for transect metatranscriptomes during the three Gradients cruises.Gradients 3 diel stationThe diel-resolved study of in situ conditions was conducted on the 2019 Gradients 3 cruise at the northern most station at ~41.5°N, to capture diel biological variability over a 72-hour diel observation period. This station followed the same water mass with a semi-Lagrangian float, similar to the Diel1 study. Replicate metatranscriptome samples were collected using seawater from Niskin bottles atached to a CTD rosette from a depth of 15 m depth approximately every 4 hours over ~72 hours. Approximately 5 to 10 L of seawater was pre-filtered through a 100 μm Nitex mesh before sequential filtration through 3 μm and 0.2 μm polycarbonate filters using a peristaltic pump. Samples were immediately flash frozen in liquid nitrogen.Extraction and sequencingMetatranscriptome samples were extracted and sequenced as previously described12,13,14. For quantification purposes, a set of 14 synthetic spike-in mRNA standards were added to samples during the RNA extraction process as previously described21 and included 8 standards synthesized with poly(A) tails to mimic eukaryotic messenger RNAs (mRNAs). Addition of synthetic mRNA standards at known concentrations, along with the number of sequencing reads recovered for the internal standards, allows estimation of natural mRNA abundances13. For the poly(A) tailed synthetic mRNA standards, 5,230,198,630 molecules were added to each sample, and for the prokaryotic RNA standards 13,580,968,026 molecules were added to each sample. A FASTA-format file of mRNA standard sequences (CustomStandardSequences.txt) is available on the raw assembly Zenodo repository22. Standard counts were recovered using the Bowtie 2 aligner version 2.5.2 using the standard pipeline for paired-end reads23.Total RNA extraction for the Diel1 and Gradients 1 cruises was done using the ToTALLY RNA kit (Invitrogen AM1910). Gradients 2, Gradients 3, and G3 diel metatranscriptomes were extracted using the same methods, except total RNA was extracted using the Direct-zol RNA MiniPrep Plus kit (Zymo Research R2702). The extracted RNA from all samples was quantified using a Qubit fluorometer 1.0 (ThermoFisher) and quality controlled using a Bioanalyzer 2100 (Agilent) prior to sequencing. Samples were randomized across sequencing runs to reduce potential biases. Poly(A)-selected eukaryotic mRNAs for the Diel 1 and Gradients 1 studies were sequenced on an Illumina NextSeq500 platform using four high output flow cells with 300 cycles, and the Gradients 2 and 3 mRNAs were sequenced on the Illumina NovaSeq6000 platform using the S2 flow cell with 300 cycles. Library prep with poly(A)-selection and sequencing was performed at the Northwest Genomic Center (University of Washington, Seattle). Both platforms resulted in 150 bp paired end reads. Raw sequencing data is available in the NCBI Short Read Archive (see Data Records section).Quality control and trimmingRaw Illumina sequence reads were quality controlled with Trimmomatic v0.3624 (parameters: MAXINFO:135:0.5, LEADING:3, TRAILING:3, MINLEN:60, and AVGQUAL:20), as previously described14. These commands are called from the Illumina_QC_AWS.sh script as called in the ‘process_short_reads.sh’ files in the codebase: https://github.com/armbrustlab/NPac_euk_gene_catalog.Metatranscriptome assemblyQuality controlled read pairs from the metatranscriptomes were assembled using the Trinity de novo assembler25 (parameters:–normalize_reads–min_kmer_cov 2–min_contig_length 300). Diel1, Gradients 1, and Gradients 2 metatranscriptomes were assembled on the Pittsburgh Supercomputing Center’s Bridges Large Memory system the using Trinity version v2.3.2, as previously described for Diel114. Diel1 and Gradients 1 assemblies were conducted on the short reads from combined replicates; Gradients 2 transect assemblies were conducted on combined replicates (14 of 32 assemblies) and 18 individually assembled replicate samples (18 of 32 assemblies), due to memory limitations on the assembly platform. The Gradients 3 and G3 diel metatranscriptomes were assembled individually for each replicate using Trinity v2.12.0 on a local high-memory cluster. In total, these assemblies generated 235 million nucleotide transcript contigs, which amounted to 182 million transcript contigs when clustered at 99% amino acid identity for each cruise (Table 2). Redundancies may exist between the different cruises. The Diel1 raw poly-A+ selected assemblies were previously made available in this Zenodo repository (https://zenodo.org/records/5009803), and the Gradients assemblies are deposited in this repository: (https://zenodo.org/records/10699458). The assembly code is referenced in these repositories and publicly available on the NPEGC codebase in the ‘trinity_assemblies.sh’ scripts here: https://github.com/armbrustlab/NPac_euk_gene_catalog.Table 2 Summary statistics of clustered nucleotide and protein sequences and annotation efficiencies.Six-frame translation and frame selection of nucleotide sequencesNucleotide sequences were translated in six frames with transeq26. vEMBOSS:6.6.0.059 using Standard Genetic Code. The longest coding frame(s) (longest uninterrupted stretch of amino acid residues) were retained for downstream analysis. Full code description is available on the codebase here: NPEGC.6tr_frame_selection_clustering.sh.Clustering on amino acid identity of translated protein sequencesTranslated protein sequences were clustered together at the 99% amino acid sequence identity threshold with MMseqs227 per dataset, to reduce sequence redundancy in identical or nearly identical transcript contigs across metatranscriptome assemblies. A total of 199,480,491 protein sequences were retained from the cluster centroid representatives, and the corresponding set of nucleotide transcripts (182,252,494) that these protein sequences were generated from were retained as clustered nucleotide transcripts (Table 2). The difference in totals between the clustered protein and nucleotide transcripts is from retention of multiple longest translated coding frames of equal lengths during frame selection. Clustering at the 99% amino acid identity level resulted in retention of 78% of the 234,228,950 original assembled nucleotide transcripts. This code is available on the codebase here: NPEGC.6tr_frame_selection_clustering.sh. The clustered proteins are available on the NPEGC protein data repository here (https://zenodo.org/records/12630398) and the clustered nucleotides sequences have been deposited to the NPEGC nucleotide data repository (https://zenodo.org/records/13826820). A full description of these files is given in the Data Records section.Nucleotide transcript countsTo quantify assembled transcript abundances, the metatranscriptome quality controlled short reads were aligned back to the clustered nucleotide transcripts using the kallisto aligner28 v0.46.1. The raw counts have been added to the NPEGC nucleotide data repository (https://zenodo.org/records/13826820); files are described in the Data Records section. Code description is referenced in the data repository and the codebase: https://github.com/armbrustlab/NPac_euk_gene_catalog/tree/main/scripts/nt_data.Taxonomic annotationClustered protein sequences were assigned a taxonomic identity via DIAMOND29 protein alignment to a combined multi-kingdom protein sequence reference library15 consisting of the marine microbial eukaryote focused MarFERReT v1.1 reference protein library16 supplemented with marine prokaryote genomes from the MARMICRODB v1.0 reference protein database17. We used DIAMOND to estimate the lowest common ancestor (LCA) of translated sequences from matches to the MarFERReT + MARMICRODB database15,30, and to assign a taxonomic identifier (NCBI taxID) from the NCBI Taxonomy database31 (https://www.ncbi.nlm.nih.gov/taxonomy, Table 2). DIAMOND LCA predictions for NPEGC protein sequences are available on the protein data repository32 (https://zenodo.org/records/12630398) and listed in the Data Records section; code is available on the codebase here: NPEGC.diamond_taxonomy.log.sh.Functional annotation of protein sequencesClustered protein sequences were annotated using HMMER 3.333 against the Pfam18 35.0 collection of 19,632 protein family Hidden Markov Models (HMMs). Pfam 35.0 HMMs were downloaded from the Interpro34 FTP server: (http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/). The highest-stringency cutoff score (‘trusted cutoff’) assigned by Pfam to each hmm profile was used as a minimum score threshold. Raw HMMER Pfam family predictions are available on the protein data repository32 (https://zenodo.org/records/12630398) and listed in the Data Records section; code is available on the codebase here: NPEGC.hmmer_function.sh. Clustered protein sequences were also annotated against the KOfam HMM database using KofamScan19. Best KOfam predictions with a threshold score >30 are available on the protein data repository32 (https://zenodo.org/records/12630398) and listed in the Data Records section; the full set of KOfam predictions with a threshold score >30 are available on the KOfam data repository35 (https://zenodo.org/records/13743267); code is available on the codebase here: NPEGC.kofamscan_function.sh.

Hot Topics

Related Articles