Coassembly and binning of a twenty-year metagenomic time-series from Lake Mendota

Sample collection and DNA extraction

Samples collected from Lake Mendota were obtained through the NTL-LTER program (https://lter.limnology.wisc.edu/). Sample collection and DNA extraction, but not shotgun metagenome sequencing (described below), were completed as previously described by Rohwer and McMahon². Briefly, surface-layer (integrated 12 m epilimnion) water samples collected from the deepest location of Lake Mendota were filtered onto 0.2-μm pore-size polyethersulfone Supor filters (Pall Corp., Port Washington, NY, USA) prior to storage at −80 °C, allowing the collection of DNA from prokaryotic, eukaryotic, and viral species present in the sample. DNA was purified from these filters using FastDNA Spin Kits (MP Biomedicals, Burlingame, CA, USA). Detailed metadata are available through JGI’s Genomes OnLine Database (GOLD)¹⁴ system under GOLD Study ID Gs0136121.

Sequencing, read QC, and filtering

For this study, standard TruSeq Illumina libraries were generated at the DOE Joint Genome Institute (JGI) and sequenced using the NovaSeq 6000 with the S4 flow cell. Data generation spanned a period of ~2.5 years, and thus software tool versions and protocols for read quality control and filtering differ slightly among the individual metagenomes. Further details can be found in Supplementary Dataset 1, which is organized by JGI sequencing project identifier. In general, BBDuk¹³ was used to remove contaminants, trim reads that contained adapter sequence, and quality-trim the right end of reads where quality drops to 0. BBDuk was also used to remove reads that contained 4 or more ‘N’ bases, had an average quality score across the read of less than 3, or had a length ≤ 51 bp or ≤ 33% of the full read length. Reads mapped with BBMap¹⁵ to masked human, cat, dog, mouse, and common microbial contaminant references at 93% identity were separated into chaff files and discarded.
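The read-level removal criteria described above can be illustrated with a minimal Python sketch. It is not the BBDuk implementation: the function names are hypothetical, quality strings are assumed to be Phred+33 encoded, the average quality is taken as a plain arithmetic mean of Phred scores (a simplification), and the length rule is interpreted as "discard if the trimmed read is ≤ 51 bp or ≤ 33% of its original length".

```python
def mean_phred(qual: str) -> float:
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding (an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - 33 for c in qual) / len(qual)


def keep_read(seq: str, qual: str, original_length: int) -> bool:
    """Return True if a trimmed read survives the filters described in the text.

    Hypothetical helper: thresholds come from the text, the logic is a sketch.
    """
    if seq.upper().count("N") >= 4:          # 4 or more 'N' bases
        return False
    if mean_phred(qual) < 3:                 # average quality below 3
        return False
    if len(seq) <= 51:                       # at or below the absolute length cutoff
        return False
    if len(seq) <= 0.33 * original_length:   # at or below 33% of the full read length
        return False
    return True
```

For example, `keep_read("A" * 100, "I" * 100, 150)` keeps the read, whereas a read trimmed to 51 bp or one carrying four `N` bases is discarded.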
The final filtered FASTQ was subsequently used for metagenome coassembly and mapping.

Filtered reads were coassembled with MetaHipMer⁵ v2.1.0.1.256-g6a25b79-dirty RevertAggrShuffleReads [mhm2.py -v --pin=none --checkpoint=true] on 1,500 nodes of the Summit system at the Oak Ridge Leadership Computing Facility. Contigs smaller than 500 bp were removed. Each sample’s reads were mapped to the coassembly with BBTools¹⁵ (v38.95) [bbmap.sh Xmx450g nodisk=true interleaved=true ambiguous=random mappedonly=t trimreaddescriptions=t usemodulo=t fast=t], yielding one alignment per sample. Overall coverage was determined by running BBTools (v38.95) [pileup.sh] on the concatenation of all alignment files. A total of 65,176,533,394 reads were input into the aligner, of which 61,542,936,624 (94%) aligned.

MAG generation, refinement, quality check and taxonomic annotation

Assembled contigs were annotated using the DOE-JGI metagenome workflow (v5.1.11)⁶ and grouped into metagenome-assembled genomes (MAGs) using MetaBAT2⁷ (v2.15), an automated metagenome binning tool that uses an adaptive binning algorithm to eliminate manual parameter tuning. Next, genome completeness and contamination were estimated from the recovery of a set of core single-copy marker genes using CheckM (v1.1.3)⁸ (Table S1). The bins are reported as high, medium, or low quality according to the Minimum Information about a Metagenome-Assembled Genome (MIMAG¹⁶) standard. For each of the high- and medium-quality bins, the taxonomic lineage was computed using GTDB-Tk (v1.3.0, GTDB database release 95)⁹, a software toolkit that assigns objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database.
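The MIMAG quality tiers mentioned above can be sketched as a small classifier. The thresholds below are those of the published MIMAG standard (high quality: >90% completeness, <5% contamination, plus the 23S, 16S, and 5S rRNA genes and at least 18 tRNAs; medium quality: ≥50% completeness, <10% contamination); they are not stated in this text. The function name, the boolean flag summarizing the rRNA/tRNA requirement, and the choice to label every remaining bin as low quality are assumptions made for illustration, not a description of the JGI workflow's code.

```python
def mimag_tier(completeness: float,
               contamination: float,
               has_rrnas_and_trnas: bool = False) -> str:
    """Assign a MIMAG-style quality tier from CheckM-like estimates (percentages).

    Hypothetical helper. `has_rrnas_and_trnas` stands in for the MIMAG
    requirement of 23S, 16S, and 5S rRNA genes plus at least 18 tRNAs.
    """
    if completeness > 90 and contamination < 5 and has_rrnas_and_trnas:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    # Assumption: anything else is reported as low quality in this sketch.
    return "low"
```

Under these rules a bin that is 95% complete and 2% contaminated but lacks the full rRNA/tRNA complement is reported as medium rather than high quality.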
The bins identified as low quality were explored for eukaryotic potential: their eukaryotic genome quality (completeness and contamination) and lineage were estimated from single-copy marker gene sets using EukCC (v2.1.2, eukcc2_db_ver_1.2)¹⁷, and those with more than 50% completion and less than 10% contamination were chosen for further analysis (Table S2). Four of the eukaryotic MAGs were further annotated using JGI’s PhycoCosm annotation pipeline¹⁰.

Viral contig identification, de-replication and taxonomic classification

geNomad (v1.7.4)¹¹ was used to identify viral contigs in the unbinned metagenomic data and to assign taxonomy. CheckV (v1.5)¹² was used to determine the completeness and quality of the identified viral sequences (Table S3). Contigs with no completeness estimate, only an hmm-based estimate, only an aai-based low-confidence estimate, and/or a completeness <50% were discarded. Contigs longer than 150% of the aai_expected_length were also removed, resulting in a total of 6,350 unique viral sequences.

Phylogenomic analysis

NSGTree (v0.4.3; https://github.com/NeLLi-team/nsgtree) was used for phylogenetic tree construction (Fig. 2). The .faa files generated for each MAG and the UNI56.hmm reference set of phylogenetic marker HMMs were used as input files. The Interactive Tree of Life (v6)¹⁸ was used to visualize and annotate the phylogenetic tree.
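The viral contig filtering rules described in the viral contig section can be summarized in a short Python sketch. The function name and argument names are hypothetical, and the method labels it inspects (strings beginning with "HMM" or "AAI", with "low" marking low confidence) are modeled on CheckV-style labels but should be treated as assumptions rather than as a specification of CheckV's output.

```python
from typing import Optional


def keep_viral_contig(completeness: Optional[float],
                      method: Optional[str],
                      contig_length: int,
                      aai_expected_length: Optional[float]) -> bool:
    """Return True if a viral contig passes the filters described in the text.

    Hypothetical helper: argument names and label parsing are assumptions.
    """
    if completeness is None or method is None:   # no completeness estimate
        return False
    label = method.lower()
    if label.startswith("hmm"):                  # only an hmm-based estimate
        return False
    if label.startswith("aai") and "low" in label:  # aai-based, low confidence
        return False
    if completeness < 50:                        # completeness below 50%
        return False
    if aai_expected_length is not None and contig_length > 1.5 * aai_expected_length:
        return False                             # longer than 150% of expected length
    return True
```

For instance, a 70,000 bp contig with an expected length of 40,000 bp is removed even when its completeness estimate is high, because it exceeds 150% of the expected length.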
