Reconstructing SARS-CoV-2 lineages from mixed wastewater sequencing data

Finding lineages in simulated reads

Synthetically combined datasets with known frequencies are a valuable control and self-test when developing wastewater data analysis techniques. One such dataset, created to test frequency prediction tools [28], is available on GitHub (https://github.com/sgsutcliffe/ww_benchmark), and we used it to test our method’s ability to detect known lineages.

The dataset contained simulated reads from 35 genomes, representing four major SARS-CoV-2 lineages (BA.1, BA.2, Delta, and a “deltacron” recombinant lineage) as well as a synthetic SARS-CoV-2 lineage which contained random mutations. The dataset was composed of 100 simulated samples with different combinations of the five lineages, some with simulated amplicon dropout. The proportion of each lineage in a sample ranged from as little as 1% to 100%.

We ran the method on the 100 simulated samples and identified 5 NMF components (corresponding to the five lineage definitions included in the dataset). Note that this tool requires the user to input the number of lineages to identify in the sample. The five predicted lineages (NMF components) were run through Nextclade [29], a tool that performs sequence alignment, mutation calling, and clade assignment for various pathogens including SARS-CoV-2. We did not use Pangolin here because it struggles with recombinant lineages, making it unsuitable for this analysis. As expected, Nextclade classified the five predicted lineages as BA.1.18 (BA.1), AY.4 (Delta), BA.2.3 (BA.2), B (undetermined synthetic), and XS (Deltacron).

Accession numbers for each of the 34 genomes which were used to simulate the reads are available on GitHub (https://github.com/sgsutcliffe/ww_benchmark/blob/main/consensus_lineages.txt). Each of these sequences was downloaded from GISAID and classified with Nextclade, alongside the five predicted sequences from our method. Nextclade predicted a range of BA.1, BA.2, and Delta sub-lineages.
All deltacrons were predicted to be XS, which corresponded with our predicted deltacron sequence. The synthetic genome was not included since it was not uploaded to GISAID. We downloaded the alignment from Nextclade and built a neighbour joining (NJ) tree using SeaView [30], shown in Fig. 2. Our predicted lineages are highlighted in yellow and clearly cluster with the four major lineages. Additionally, our fourth predicted lineage, “lineage4”, clusters distinctly alone, which would be consistent with a synthetic genome containing random mutations. Therefore, our method was able to pick out the four real major lineages in the simulated dataset, as well as the novel synthetic lineage.

Figure 2. Phylogenetic tree (NJ) showing how the predicted lineages cluster with the isolates that were used to simulate the reads. Predicted lineages are highlighted in yellow. Predicted lineage 4 clusters alone, and is likely the dataset’s novel synthetic lineage.

Finding major VOCs across all samples

Ongoing wastewater collection for surveillance from sites across Ontario (Canada) provided an environmental dataset of raw SARS-CoV-2 amplicon sequencing data. All available data at the time of download (1026 samples collected between October 2021 and June 2022) from our routine sequencing of Ontario wastewater was processed using NMF with 3 components so that 3 lineages would be predicted. We applied the mutations listed in the predicted lineages to the SARS-CoV-2 reference genome to create a FASTA file with the sequence for each NMF-predicted lineage. We ran Pangolin [31] on the resulting FASTA files to assign a lineage to each of them. Pangolin also runs a tool called Scorpio which assigns lineages and provides a confidence score for the particular lineage call. The predicted lineages for Pangolin and Scorpio along with the Scorpio support values are shown in Table 1.

Table 1. Lineage assignments for each of the NMF-predicted lineages from all samples.
The predicted lineages were all highly abundant in Ontario when these samples were collected [32].

The three NMF-predicted lineages were BA.2, BA.1.1, and B.1.617.2. All three of these were highly abundant in clinical sequencing data in Ontario during the time frame that we analyzed. B.1.617.2 is the parent lineage for all Delta sub-lineages, which were dominant in Ontario before being replaced by Omicron (BA.1.1). Eventually BA.2, another Omicron sub-lineage, replaced BA.1.1 [32]. Together, these give an accurate snapshot of the most significant lineages in Ontario between October 2021 and June 2022.

We downloaded the frequency of each mutation for the three predicted lineages from outbreak.info and compared them to the learned mutation values [33]. Figure 3 shows the values for the spike mutations next to the frequency with which those values are observed in clinical sequences. For NMF-predicted lineages, the value is the normalized value from that lineage’s learned NMF component. For known lineages, the value is the proportion of known clinical genomes which contain that mutation. In general, the predicted lineages agree well with their analogous known lineages. The lineages on outbreak.info do not include synonymous mutations or insertions. Some mutations may represent legitimate local variation (like S:A222V), albeit with poor coverage and therefore a small sample size. The confidence and accuracy of the predictions decrease with each consecutive lineage, which is logical because the components in NMF are ranked according to their relative importance.

Figure 3. Heatmap showing the learned spike mutation values of the predicted lineages next to the frequency with which those mutations are observed in the corresponding lineage according to outbreak.info. Nonsynonymous mutations which cause the same amino acid change are grouped together and labelled by the amino acid change.

Figure 4 plots the values for the N gene.
All mutations which are predicted to be significant agree with the outbreak.info data, including the variable presence of N:G215C in Delta. The N gene carries fewer mutations than the S gene and has much better coverage, which probably leads to increased accuracy in the predicted lineages.

Figure 4. Heatmap showing the learned N gene mutation values of the predicted lineages next to the frequency with which those mutations are observed in the corresponding lineage according to outbreak.info. Nonsynonymous mutations which cause the same amino acid change are grouped together and labelled by the amino acid change.

Finding SARS-CoV-2 sub-lineages in a single run

We also ran the NMF method to look for two lineages in a single sequencing run with samples from across Ontario in late June 2022. The lineage predictions are shown in Table 2.

Table 2. Lineage assignments for each of the NMF-predicted lineages from a single run in late June 2022.

Using samples from a single run, the method was able to accurately predict the two major Omicron lineages in Ontario at the time, BA.2 and BA.5. Surprisingly, the method was able to pick up all mutations with enough accuracy to predict specific sub-lineages of the two. Both of these sub-lineages had been identified in Ontario at the time according to outbreak.info, although the prevalence of BA.5.2.1 in clinical cases is lower than the prevalence predicted in wastewater using known lineage prediction pipelines (i.e., Alcov) [11, 34].

We plotted the predicted values of the spike mutations for the two sub-lineages, which are shown in Fig. 5. BA.2 and BA.5 are very similar, which can pose a challenge for the method, but it was able to identify distinguishing mutations such as S:F486V.

Figure 5. Heatmap showing the learned spike mutation values of the predicted lineages in the single run.
Nonsynonymous mutations which cause the same amino acid change are grouped together and labelled by the amino acid change.

It is worth noting that sub-lineages are notoriously difficult to distinguish in wastewater, even when mutations are known. This is because closely related sub-lineages are usually only differentiated by a few mutations, and the frequency of those mutations can vary widely from sample to sample. Our primary aim was to develop a method which is capable of discovering SARS-CoV-2 lineages and mutations without the need for clinical sampling or WBS. Surprisingly, we are able to identify the mutations with such accuracy that not only can we deduce the major lineages which are present in a sample, but also accurately identify the specific sub-lineages which are most abundant without the need for lineage definitions. The accuracy likely comes from the ability of the method to pool information from multiple samples, which works to smooth and reduce some of the noise within individual samples.
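
To illustrate how pooling samples lets NMF separate lineage signatures, the following is a minimal sketch and not the implementation used in this work. It assumes a samples-by-mutations frequency matrix and factors it with the standard multiplicative update rules for NMF, which is a choice made here for illustration; the solver actually used is not specified in this section. The toy signatures, mixing proportions, and all function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, used here as the reconstruction error."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor V (samples x mutations) into W (samples x k) and H (k x mutations).

    Uses multiplicative updates, which keep every entry non-negative.
    The number of components k must be chosen by the user.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy data (hypothetical): two lineage signatures over six mutation sites,
# mixed in different proportions across six samples.
signatures = [[1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 1]]
proportions = [[p / 10, 1 - p / 10] for p in range(0, 11, 2)]
samples = matmul(proportions, signatures)

w, h = nmf(samples, k=2)

# Scale each learned component so its largest entry is 1, giving values
# that can be read as per-mutation weights for that component.
normalized = [[x / max(row) for x in row] for row in h]
```

Each row of `h` plays the role of a predicted lineage (a weight per mutation), and each row of `w` gives the contribution of those components to one sample. Because the factorization is fitted to all samples at once, a mutation that is noisy in one sample is constrained by its behaviour in the others.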
