MAGinator enables accurate profiling of de novo MAGs with strain-level phylogenies

Implementation
Input
The input to the MAGinator workflow comprises a set of samples with (1) shotgun metagenomic sequenced reads, (2) their sample-wise assembled contigs, and (3) sample-wise MAGs (groups of contigs from the same genome), clustered across samples, as defined by a metagenomic binning tool (see below).Reads should be provided in a comma-separated file giving the location of the fastq files and formatted as: SampleName,PathToForwardReads,PathToReverseReads. The contigs should be nucleotide sequences in FASTA format. The MAGs should be given as a tab-separated file including the MAG identifier and contig identifier. The sample-wise MAGs should be grouped into MAG clusters representing a taxonomic entity found across the samples, which will usually be species but can also be at the subspecies level, depending on the characteristics of the input data. MAGinator is flexible regarding which tool is being used for creating the MAGs, however we recommend using VAMB18. If other binners are used, MAG clustering across samples would have to be implemented before running VAMB. As MAGinator relies on the input MAGs a larger sample size is recommended. The specific number of samples relies both on the sequencing depth and the diversity of the community being analysed. We advise the user to look at the number of MAG clusters created and assess them according to the environment being analysed.DependenciesThe dependencies to run MAGinator are mamba21 and Snakemake22—all other dependencies are installed automatically by Snakemake through MAGinator. Additionally, MAGinator needs the GTDB-tk database downloaded for taxonomic annotation of MAGs and as a reference for the phylogenetic SNV-level analysis of the signature genes.Output generatedMAGinator generates multiple outputs and intermediate files useful for additional downstream analysis (Supplementary Table 1, Supplementary Figure 1). Importantly, MAGinator outputs the taxonomy of the MAGs, the signature genes of the MAG clusters, the sample-wise relative abundances of the MAG clusters, a non-redundant gene matrix with sample-wise mapping counts, synteny clusters and inferred phylogenies for each MAG cluster along with a table presenting samples showing evidence of strain mixtures within each MAG cluster. Additionally, a folder is created containing the log information of all the jobs run by Snakemake.ApplicationMAGinator is written in Python 3. It is based on a set of Snakemake22 workflows and is easily scalable to work for both single servers and compute clusters. MAGinator is implemented as a python package and is available on GitHub at https://github.com/Russel88/MAGinator. The user can adjust the individual steps of the pipeline using various parameters (Suppl. Table 2). The results in this paper are based on MAGinator v.0.1.10.The MAGs are filtered based on a minimum size for inclusion, with a default size of 200,000 bp. The included MAGs are taxonomically annotated using GTDB-tk (v.2.1.1)23, by calling genes using Prodigal (v.2.6.3)24, identifying GTDB marker genes and placing them in a reference tree. As the taxonomic annotation of the MAG clusters is found to be redundant, clusters with the same taxonomic assignment can be combined into one cluster, with the flag ‘–mgs_collections’ which we identify as a Metagenomic Species (MGS). Redundant genes are identified by clustering with MMseqs2 (v.13.45111)25 easy-linclust using a default clustering-coverage and sequence identity threshold of 0.8, creating a list of the representative genes along with their cluster-members. The redundant genes are filtered away, leaving a nonredundant gene catalogue. The raw reads are mapped to the gene catalogue using BWA mem2 (v.2.2.1)26 and counted using Samtools (v.1.10)27, leaving a gene count matrix, which is used as input for the signature gene refinement and following phylogenetic clade separation and abundance estimates.Signature gene identificationWe previously described the method for identifying the signature genes for the data set17. In brief, signature genes are selected to ensure that they 1) are unique for the MAG cluster, 2) are present in all members of the cluster, and 3) are single-copy.To accomplish this, the following steps are taken: Initially, the non-redundant gene count matrix is curated to discard any genes if they have (redundant) cluster members originating from more than one MAG cluster, as they are thus not specific for that biological entity. Subsequently, the remaining genes within each MAG cluster are sorted based on their co-abundance correlation across the samples. As the genes are unique for the species, if they are consistently detected in similar abundance across samples, it suggests that they are single-copy. This step also mitigates differences in reading mappings caused by biological or technical variations. The initial set of signature genes for each biological entity is selected from the most correlated genes. Subsequently, these signature genes are further refined and optimised by fitting them to a rank-based negative binomial model that captures the characteristics of the specific microbial composition in the input data. The signature gene set is evaluated across the samples, by calculating the probability of the detected number of signature genes given the number of reads mapping to the MAG cluster. Finally the abundance of each MAG cluster is derived from the read counts to the identified signature genes normalised according to the gene lengths.SNV-level resolution phylogenetic treesTo elucidate the smaller biological differences within the MAG clusters, MAGinator will infer a phylogeny based on the sequences of the signature genes. Based on the read mappings to the signature genes the sample-specific SNVs are called using output from Samtools mpileup. An alignment for each signature gene is made for all samples containing the signature genes using MAFFT (v.7)28 run with the offset value of 0.123 as no long indels are expected. MAGinator allows phylogenetic inference to be calculated with either the fast method Fast-Tree (v.2)29 (default) or the more accurate but resource-intensive method IQ-TREE (v.2)30. In samples where no MAG was found, the phylogenies can be used to detect rare subspecies-level entities based on just a few reads mapping to the signature genes and to infer functions and genes from closely related MAGs from other samples. The criteria for inclusion in the tree can be adjusted by the user. For a sample to be included in the phylogeny the following three criteria have to be met 1) minimum fraction of non-N characters in the alignment, 2) minimum number of GTDB marker genes to be detected, 3) minimum number of signature genes to be detected. The default values for a sample to be included in the phylogenetic tree have been set relatively low in order to enable the placement of samples in the tree, even in cases of very low abundance. The trees can be associated with metadata to obtain clade-level differences associated with study design variables such as disease phenotype, sampling location, or environmental factors.Gene syntenyBased on the gene clustering with MMSeqs2 a weighted graph is created, which reflects the adjacency of the genes on contigs. If genes are close enough in the graph, they will be categorised as part of the same synteny cluster, and it is assumed that they have related functionality and/or are part of the same functional module. Clustering is determined using mcl (v.14)31, where the user has the options to influence the adjacency count and stringency of the clusters. Only immediate adjacency is considered. By default, genes found adjacent just once are included in the graph, but this can be tuned to make more strict clusters. The inflation parameter for MCL-clustering of the synteny graph is important for the size of the gene clusters and is, by default, set high in order to yield small and consistent clusters.Taxonomic scope of gene clustersThe taxonomic assignment of the sample-specific MAG is done using GTDB-tk. In some cases it will not be possible to assign a taxonomy to the MAG, which could be due to contamination, the MAG originating from a currently undescribed organism or due to too little information found in the MAG. In these cases an alternative is to assign the gene clusters, found in the MAG, a taxonomy. The taxonomic scope of the genes is described for the category in which they are predominantly found in, given by a fraction defined by the user (default value 0.9). E.g. if run with default options and a gene cluster has the assignment “Bacteria Firmicutes_A Clostridia Lachnospirales Lachnospiraceae Anaerostipes NA”, then at least 90% of the genes should be found in Anaerostipes. The algorithm will find the most specific taxonomic rank which has at least 90% agreement across the genes in the cluster assigned by GTDB-tk.Workflow designThe MAGinator workflow has been constructed to make the information flow between the different modules automatically (Suppl. Figure 1).The data goes through a series of filtering and processing steps (Fig. 1 A–F), including:A: MAG clusters, which are composed of one or more MAGs, are inputted.B: The genes are clustered and redundant genes are removed.C: Reads are mapped to the genes, creating a gene count matrix.D: Signature genes are identified for each MAG cluster, and used for abundance estimationsE: Based on the signature genes, SNV-level resolution phylogenetic trees are created, and the taxonomic scope of gene clusters is identified.F: Synteny-clusters of genes are identified, reflecting the adjacency of the genes on the contigs.Benchmarking on CAMI’s simulated strain-madness data setThe construction of the strain-madness benchmarking dataset was part of the second round of CAMI challenges5. The data consists of 100 simulated metagenomics samples consisting of paired-end short reads of 150 bp. The samples were run through a preprocessing workflow prior to the analysis. This involved the removal of adaptors with BBDuk (v. 38.96 http://jgi.doe.gov/data-and-tools/bb-tools/) run with the following settings ‘ktrim=r k = 23 mink=11 hdist=1 hdist2 = 0 ptpe tbo’, removal of low-quality and short reads (<75 base pairs) with Sickle (v. 1.33)32 and removal of human contamination (reference version: UCSC hg19, GRCh37.p13) using BBmap (http://jgi.doe.gov/data-and-tools/bb-tools/) leaving an average of 6.6 million reads (SD: ±2802 reads) per sample.To generate de novo assemblies, Spades (v. 3.15.5)33 was utilised with the -meta option, with kmer sizes of 21, 33, 55 and 77, and contigs shorter than 1500 bp being discarded. Read-to-assembly mapping was carried out using BWA-mem2 (v.2.2.1)26 and SAMTOOLS (v.1.10)27. Contig depths were assessed using Metabat2’s jgi_summarize_bam_contig_depths (v.2.12)34, while contigs were binned into MAGs using VAMB (v.3.0.8)18 with default settings.The reads, contigs and MAGs were run through the MAGinator workflow (v.0.1.16). For comparison purposes, the VAMB clusters were annotated with an NCBI Taxonomy ID using CAMITAX10. The profile was created with an R custom script and the lineage was found using NCBI’s taxonomy toolkit (https://bioinf.shenwei.me/taxonkit). As the strain identifiers from the gold standard do not exist in the NCBI database (e.g. 1313.1), we have assigned an extra number to the Taxonomy ID for the clusters which had the same species-level annotation, starting at 1 to the number of redundantly annotated clusters.The data for the benchmarking was obtained from CAMI second challenge evaluation of profiles. The profiles used for the benchmarking in this study were selected based on the best-performing tools found in the CAMI II paper. The top 10 profiles comprise DUDes35 (v.0.08), LSHVec12, MetaPhlAn214 (v.2.9.22), MetaPhyler2 (v.1.25), mOTUs3 (v.2.0.1 and v.2.5.1) and TIPP36(v.4.3.10). The profiles were compared using Open-community Profiling Assessment tooL (OPAL) (v.1.0.11), which was run with default settings.Franzosa et al. reanalysisProcessed taxa and metadata tables were obtained from the Franzosa et al.13. supplementary materials. Raw data were downloaded from ENA using the provided accessions, and run through the preprocessing, assembly and binning before running the entire MAGinator pipeline. Four samples failed the assembly (PRISM | 7238, PRISM | 7445, PRISM | 7947, PRISM | 8550) and were excluded from all downstream analyses, both in the original and the MAGinator processed tables, leaving 216 samples.Statistical methods for abundance matricesAbundance matrices were analysed in R (v.4.1.2). Sample management and beta diversity calculations were done in {phyloseq}37, along with PCoA analysis. Differential abundance testing was done with the {DAtest} R package, which uses the Wilcoxon test function (Wilcox.test) from the {stats} package, with p-values adjusted by Benjamini-Hochberg false discovery rate correction. Corrected p-values less than 0.05 were considered significant.Subspecies resolution of Bifidobacterium longum
COPSAC dataset – data characteristics and preparation
The COPSAC2010 cohort consists of 700 unselected children recruited during pregnancy week 24 and followed closely throughout childhood with extensive sample collection, exposure assessments and longitudinal clinical phenotyping38,39,40. From the cohort, we used 662 deeply sequenced metagenomics samples taken at 1 year of age. The details of the study and sequencing protocol have previously been published40. The samples consist of 150-bp paired-end reads per with mean ± SD: 48 ± 15.5 million reads.The data was analysed using the same approach as for the strain-madness data set, with the exception of filtering away reads shorter than 50 bp in the preprocessing step. This workflow yielded 880 MAG clusters for the samples.MAGinator was run using the reads, contigs and MAGs from VAMB as input. Thus creating a set of signature genes for each MAG cluster which has been found de novo for this particular dataset.CHILD dataset – data characteristics and preparationThe Canadian Healthy Infant Longitudinal Development (CHILD) study comprises a large longitudinal birth cohort with stool collection in infancy for microbiome analysis41. Stool samples used in this analysis were sequenced to an average depth of 4.85 million reads (SD: 1.79 million), and samples which included >1 million reads after preprocessing were kept for the current analysis7.We analysed a subset of the CHILD cohort, consisting of 2846 metagenomic sequenced faecal samples from infants. To overcome the shallow sequencing, the signature genes of the COPSAC2010 cohort were used to profile the samples instead of running MAGinator. To ensure that the process of the read mappings was identical to COPSAC, the read mapping was carried out using the full gene catalogue. Next, the read counts for the signature genes were extracted and used to derive sample-wise abundances for each MAG cluster.Examining bifidobacterium MAG clustersThe detection of signature genes for B. infantis for the COPSAC2010 (n = 662) and CHILD (n = 2846) cohorts was carried out by creating a binary detection matrix and using the standard function (heatmap) with default values in R. Furthermore, we compared the abundances of all the Bifidobacterium MAG clusters derived from MAGinator with abundance estimates from Metaphlan 3 (v.3.0.7) and subspecies phylogenies from Strainphlan 3 (v.3.0.7) for the species Bifidobacterium longum. The phylogenetic tree output by Strainphlan was converted into a distance matrix and clustered using partitioning around medoids into two clusters. The two clusters were annotated as B. longum subsp. longum (B. longum) and B. infantis based on the placement of Bifidobacterium longum reference genomes in the phylogenetic tree.SNV-level phylogenetic trees for COPSAC datasetFor each MAG cluster, the sequences of the signature genes were used as a reference to create an SNV-level phylogenetic tree. The trees for COPSAC2010 were constructed with the default values of MAGinator, producing both a tree in Newick file format for each MAG cluster and files containing the statistics for the alignments. The tree for Faecalibacterium sp900758465 was visualised in R using {ggtree}42. The heatmaps in Suppl. Figure 7 was constructed from B) stats.tab and C) stats_genes.tab. The median frequency of bases in the signature gene alignment with mixed alleles was calculated based on positions with a depth of minimum 2 and normalised according to the gene length. A major allele frequency of at least 0.8 was required for the sample to be considered homogenous. These are also the default cutoffs, and users can adjust them to trade off sensitivity for specificity.Strain mixtures within de novo MAGs across environmentsTo assess the degree of within-MAG cluster strain diversity for non-human associated environments, two public datasets were included in the analysis. One study done by Engel and Ellegaard examined the honey-bee gut43 and the Tara Oceans study44. The raw data was run through the same workflow as the strain-madness data and run through MAGinator. Due to computational limitations and the size of the Tara Oceans samples only 148 of 243 samples were successfully assembled.Gene syntenies and functional annotation for COPSAC datasetThe non-redundant genes were annotated using eggNOG mapper (v.2.0.2)19,45,46. Of the 14.7 million non-redundant genes 9.2 million were annotated. The visualisation of the synteny clusters was done with {igraph}47.Statistics and reproducibilityThe statistical methods included in this study has been conducted with R (v.4.1.2). In this study we have analysed 5 public datasets, COPSAC201038 (n = 662), CHILD41 (n = 2846), Franzosa et al. IBD-study13 (n = 220), Tara Oceans44 (n = 243) and honey-bee43 (n = 54). For Franzosa et al. and Tara Oceans not all samples succeeded in assembly and was thus not included in the analysis included in this study, leaving 216 and 148 samples respectively.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles