A program for real-time surveillance of SARS-CoV-2 genetics

An infrastructure for nationwide COVID-19 surveillanceThe rapid emergence and continuous evolution of SARS-CoV-2 necessitated setting up a national surveillance program that could provide real-time epidemiological snapshots across the U.S. In response to the CDC’s basal surveillance program, we organized and implemented a nationwide infrastructure to identify SARS-CoV-2 positive patient samples across the U.S., consolidate these samples and their associated demographic data, and sequence their genomes to determine their SARS-CoV-2 PANGO lineages. Our surveillance system also includes mechanisms for monitoring emerging lineages and their potential impacts on qPCR and sequencing performance. Further, this setup is flexible and modular, enabling rapid responses and developments to our pathogen surveillance.A schematic of this modular infrastructure and its associated sequencing protocols is shown in Fig. 1. In Fig. 1a, we show a high-level view of our surveillance pipeline that begins with sample accessioning and resulting in WGS reports with de-identified sample metadata. Upper respiratory samples (nasopharyngeal (NP) or nasal swabs) collected in 0.9% saline or viral transport media were collected and analyzed through our COVID-19 RT-PCR Test at various Labcorp® service centers and laboratories, and the resulting PCR extraction plates containing SARS-CoV-2-positive samples were then shipped to Labcorp® central locations and consolidated into plates with high viral titer samples (i.e. N1 Ct < 31) using our custom developed plate selector app (Fig. 1b, Methods). The selection of plates using this app is critical, as it de-identifies data thereby ensuring compliance with the U.S. Health Insurance Portability and Accountability Act (HIPAA) and allows flexibility in choosing samples. For example, we can focus on certain geographic regions where outbreaks are underway, put limitations on the amount of data from a state to ensure uniform surveillance, and threshold on the PCR amplification N1 Ct value to ensure sufficient genetic material is available for sequencing in each sample. The resulting consolidated plates of high viral titer samples were then shipped to our sequencing labs where samples were processed through custom tiled Molecular Loop® probe amplification, followed by library preparation and sequencing on the PacBio® Sequel II™ platform in a highly multiplexed fashion (Fig. 1c,d). PacBio® raw data was then processed to generate Circular Consensus Sequencing (CCS) reads which were then analyzed using our custom bioinformatics workflow to generate consensus genomes for each sample (Fig. 1d). Stringent sequence coverage quality control was then applied followed by PANGO lineage determination for each sample (Methods), and results were then merged with patient demographic data and de-identified with a new custom ID generated for each sample (Fig. 1d). Consensus genomes, summary reports, raw CCS reads, and alignment variant calls were then provided to CDC who in turn processed our submission and deposited the relevant data to public repositories, namely the Global Initiative on Sharing All Influenza Data (GISAID)20 and the National Center for Biotechnology Information (NCBI)21.Fig. 1Journey of a sample from raw nasopharyngeal and nasal swabs to CDC reporting. (a) Overview of end-to-end genomic surveillance setup. (b) Consolidation of positive high viral titer samples. (c) Probe amplification protocol. (d) Long read sequencing analysis and bioinformatics workflow to prepare high-quality SARS-CoV-2 genomic sequences with demographic metadata for CDC.High-throughput, high-fidelity SARS-CoV-2 genome surveillance pipelineThe rapid pathogen evolution during a pandemic and the possibility of sporadic outbreaks necessitates a highly robust genomic surveillance pipeline. From January 2021 to July 2023, we have used our Virseq pipeline (Fig. 1) to report 524,498 high-quality SARS-CoV-2 genomes (10X median depth of coverage, > 90% genome coverage, complete S gene coverage) and patient demographic data to the CDC. These sequences captured all major lineages that have emerged throughout the COVID-19 pandemic since the inception of this surveillance effort, including Alpha, Delta, Omicron, and the many Omicron subvariants (Fig. 2, Methods). When overlaying the positivity rate of our diagnostic PCR assays used for sample picking, we observed multiple fluctuations matching variant emerges, such as BA.1, BA.4/BA.5, and XBB.1.5 (Fig. 2b). Stratifying genomes by U.S. Health and Human Services (HHS) region (HHS regions 1–10; Methods), we found that the HHS regions 1 and 2 (corresponding to the U.S. Northeast) often served as a harbinger to predict variant emergence for all other regions (Figure S1). For example, the initial Omicron variant (BA.1) and the more recent XBB.1.5 variant reached ~ 50% prevalence in HHS regions 1 and 2 approximately one week and four weeks prior to all other regions, respectively (Figures S1 and S2).Fig. 2SARS-CoV-2 PANGO lineage analysis of 594,832 high-quality genomes from USA samples collected between January 2021 and July 2023. In each plot samples are grouped and colored by the closest parent listed on the top left. (a) Time-resolved phylogeny of a subset of samples (n = 28,069) clustered based on the oldest collected sample with at most 1,000 samples per month. (b) SARS-CoV-2 lineage proportions across weeks with at least 100 samples (left y-axis) and the PCR positivity rate (%) indicated by the black line (right y-axis).After the BA.1 wave, reports indicated that SARS-CoV-2 infections by Omicron variants exhibited lower viral loads22,23, prompting us to investigate the diagnostic PCR N1 Ct values of our samples, which are a useful proxy for viral load. As expected, lower sample Ct values were correlated with both increased average depth of coverage and higher consensus genome coverage (Figure S3a). We also observed shifts in sample diagnostic N1 Ct values throughout the pandemic that ranged from 20 to 21 in 2021 and exceeded 24 during the BA.1 wave in the 2021–2022 winter season, prior to reaching a steady state between 22 and 23. (Figure S3b). In our later analysis of SARS-CoV-2 co-infections, we found that co-infected sample N1 Ct values followed the same trends as those of samples where only a single SARS-CoV-2 lineage was detected. Together, these results indicate that our surveillance captures the overall kinetics and patterns of circulating variants.This collection of high-quality genomes also uniquely captured demographic trends from all 50 states in the U.S. and the District of Columbia. States with the highest representation include California (n > 62,000) and New Jersey (n > 48,000), while other key states from HHS regions 9 and 10 (WA, n > 19,000; AZ, n > 17,000), HHS region 5 (IL, n > 23,000; OH, n > 10,000), and HHS regions 4 and 6 (NC, n > 45,000; FL, n > 36,000; TX, n > 14,000) also had strong sampling (Fig. 3a, Table S1). We also observed that the number of Virseq-generated genomes has represented a consistent proportion of total SARS-CoV-2-positive samples collected in each HHS region with minor fluctuations (Fig. 3b). These fluctuations are expected as the timing of variant outbreaks (e.g. BA.1) can vary across HHS regions along with concomitant surges in PCR positivity rate (Fig. 2b). Despite these changes, we still observed sustained census normalized sampling across both our diagnostic PCR assays and Virseq (Figure S4a) even with an end to the COVID-19 emergency response and the reduction in our surveillance volume in 2023 (Figure S4b). Notably, we observed a reduction in Virseq surveillance in HHS regions 6–8 at the onset of the BA.2 wave, as BA.2 prevalence spiked elsewhere in the U.S. before affecting these regions (February 2022; Figs. 2, 3b, S1, S2, and S4).Fig. 3U.S. nationwide SARS-CoV-2 genomic surveillance consisting of 594,832 high-quality genomes from samples collected between January 2021 and July 2023. In both panels, the U.S. is divided into 10 Health and Human Services (HHS) regions, each in separate colors. (a) Each state is notated by its two-letter abbreviation and the shade indicates the number of samples collected. Each region uses a shared color gradient, shown on the right in grayscale with maximum and minimum values of 19,000 and 1000, respectively. (b) Line plots showing the ratio of Virseq-generated genomes versus the total SARS-CoV-2-positive RT-PCR samples collected on the log2 scale over each sample collection month. January 2021 and July 2023 were excluded, as they each have fewer than 100 Virseq-generated genomes in the dataset analyzed. Vertical black lines denote the start of years 2022 and 2023.We also found that the age distribution of samples collected in each geographic region shifted throughout the course of the pandemic, with a plurality from pediatrics in 2021, shifting to a more even distribution of ages in 2022, and finally shifting to a plurality from older segments of the population in 2023 (Figures S5 and S6). Interestingly, there was also a modest trend in patient-reported gender, as the proportion of samples from female patients appeared to increase from 2021 to 2022 and once again in 2023 (Figures S5 and S6), despite there not being any difference in PCR positivity between males and females (Figure S7a). We also observed a bifurcation in PCR positivity across age groups in March 2022, as pediatric PCR positivity lowered to approximately half of the positivity in most other age groups and this relative difference has not changed since (Figure S7b). Intriguingly, two months prior (January 2022), the CDC and Food and Drug Administration (FDA) announced multiple expansions of pediatric COVID-19 vaccination availability24,25,26 (Figure S7b). We also observed that in June 2022, 18–19 year-olds had a reduction in PCR positivity relative to older age groups as well, and most recently in June 2023 we observed an increase in geriatric (80 + year olds) PCR positivity (Figure S7b). These demographic- and region-specific trends in variant prevalences and PCR positivity rates require a surveillance apparatus like ours that is flexible and robust to the rapid evolution of the SARS-CoV-2 genome.Robust, mutation-resistant S gene sequencing using a probe-based long-read strategyTo maintain nationwide surveillance of pathogen genome evolution, the selected WGS approach must be able to withstand sudden changes in genetic diversity. Our surveillance apparatus uniquely employs a probe-based long-read sequencing approach that is mutation-resistant by design due to its ~ 22X genome tiling of ~ 99.69% of the SARS-CoV-2 genome (29,809 of 29,903 bases), except for the 5′- and 3′-peripheral genomic regions (Fig. 1c). At the end of 2021, the Omicron variant (BA.1) emerged and swept through the U.S. in a matter of weeks (Figs. 2, S1, and S2), dramatically shifting the diversity of the circulating lineages across the U.S. population (Fig. 4a). Unlike earlier variants like Delta which were predominantly mutated in the ORF1a genic region, the original Omicron variant (BA.1) introduced a surge of novel S gene mutations (27 SNPs and three deletions compared to Delta) (Fig. 4b), raising concern regarding the ability of PCR- and amplicon-based assays to detect BA.1. In fact, the Spike gene target failure (SGTF) genomic signature was so common that it became a useful proxy for the PCR detection of emerging variants, such as Alpha and Omicron27. While interruptions in our surveillance were not observed (Fig. 2), we verified the fidelity of our sequencing of BA.1* using an in silico approach to check for probe dropout caused by lineage defining mutations (Methods). We found that our assay retained ~ 20X in-silico tiling of the S gene during the initial Omicron wave (Fig. 4c), and in the worst-case scenario where we allowed zero SNP tolerance in probe binding regions, we retained a minimum of ~ 10X probe tiling (Figure S8). Furthermore, as the total number of unique mutations and the concurrent prevalence of multiple circulating lineages measured in terms of their entropy continues to increase with more recent XBB subvariants, we continue to observe profoundly stable genome-wide probe tiling in silico (Fig. 4c). This robust probe tiling is especially important as chronic infections and widespread vaccination have altered the evolutionary trajectory of the SARS-CoV-2 genome28 and the receptor binding domain of the S gene remains under considerable selective pressure in the Omicron era8.Fig. 4Pandemic-wide trends in SARS-CoV-2 genomic mutations and the robustness of probe-based WGS. (a) Number of mutations with at least 5% prevalence in the lineage population (black) and circulating lineage shannon diversity (red). (b) Number of mutations with at least 5% prevalence in the lineage population separated by genic origin. (c) Heatmap showing the genome-wide probe coverage of the most common lineage in circulation for each collection week with genomic positions shown 5′ (bottom) to 3′ (top). Probes were considered failures if a deletion, insertion, or > 3 SNPs were detected in either probe arm. Large genic regions (ORF1a, ORF1b, S) are indicated by horizontal lines and are labeled on the right. In all panels, results are stratified by sample collection week with vertical bars separating the major waves of the pandemic with the causal variant shown above. Waves are demarcated using the collection week when the causal variant first reached 5% prevalence.We also confirmed the robustness of our sequencing strategy by analyzing the genome-wide per-base coverage of Variants of Concern (VOCs) that have emerged throughout the pandemic (Figure S9, starting with B.1.1.7 or ‘Alpha’ up until XBB.1.5). We found that overall per-base coverage has remained stable and well above our minimum per-base coverage required for base calling in our consensus genomes (4 CCS reads), despite the heavily mutated S gene of Omicron and subvariants thereof (Figure S9). Together, these results show that the Virseq assay is a stable and effective sequencing strategy and is a critical component of our surveillance apparatus. However, while we are confident in our response to variant outbreaks thus far, it is imperative that proactive measures are taken to preclude future surveillance interruptions.Modeling the performance of the Virseq assay by simulationOne challenge of sustaining continuous WGS surveillance is the need to predict changes in sequencing performance that may occur and its effect on the characterization of the pathogen variants. Many SARS-CoV-2 lineages emerge in regions outside of our surveillance network (i.e. in countries other than U.S.) and may be too rare for detection once they initially spread to our surveilled regions. To address this challenge, we developed a Virseq performance simulator that models our entire production process from raw reads to PANGO lineage determination, capturing the sequencing and other systematic errors that might propagate into consensus genomes. This simulator was constructed using a representative batch of samples and incorporates the per-base coverage and minor allele fractions commonly observed at each genomic position (Methods, Figure S10). Application of this simulator begins with an input sequence that is then mutated to reflect any errors introduced by our production process, which can then be compared with the original sequence via concordance analysis of their PANGO lineage determinations.We routinely use this simulator to monitor newly designated lineages in the PANGO nomenclature (largely VOCs) as well as randomly selected sequences from previous months of surveillance, leveraging the GISAID database20. This routine monitoring mechanism is a crucial component of our FDA Emergency Use Authorization (EUA) and could be used to help maintain future pandemic surveillance networks. In this study, we expanded this analysis to include up to 100 sequences each from 1,899 VOCs (97,421 total sequences, ‘VOC experiment’) and descendant lineages thereof, current and former, and 10,000 randomly selected sequences from each month spanning from January 2021 until July 2023 (310,000 total sequences, ‘retrospective experiment’) (Methods). As expected, we observed similar coverage profiles between the two experiments and the simulator model, indicating that the simulated genomes accurately reflected Virseq-generated sequences (Figure S10b). When assessing the PANGO lineage concordance between the simulated and original genomes, we first checked for exact matches then also checked for parent/child relationships between the lineages compared (e.g. BA.5 is a parent of the child BA.5.1), deeming these parent matches (Methods).Overall, we observed strong concordance in both the retrospective (99.55% exact, 309,900 of 310,000 sequences; 99.97% parent, 308,620 of 310,000 sequences) and VOC experiments (99.06% exact, 96,509 of 97,421 sequences; 99.84% parent, 97,261 of 97,421 sequences) (Fig. 5). In the retrospective experiment we observed strong concordance across all 31 months analyzed with some month-to-month fluctuations (> 98.95% exact, 9895 of 10,000 sequences; > 99.9% parent, 9990 of 10,000 sequences) (Fig. 5a). We also observed that some fluctuations coincided with shifts in circulating lineage diversity and the timing of VOC emergences (Fig. 4, Fig. 5a). Intriguingly, while some VOC emergences resulted in slight reductions of concordance (BA.1, BA.4/BA.5), others counterintuitively coincided with improved concordance (BA.2, BQ, XBB.1.9/XBB.1.16) (Fig. 4, Fig. 5a).Fig. 5Virseq simulation results. Exact and parent concordance for the retrospective (a) and VOC (b) simulation experiments. Retrospective results are shown with exact (red) and parent (blue) concordance as separate lines over the sample collection months analyzed, with years demarcated by vertical lines. VOC results are shown with exact and parent concordance results for each VOC data point with 90% thresholds marked by dashed red lines. VOCs colored black have > 90% exact concordance, those colored yellow only have > 90% parent concordance, and those colored red are below both concordance thresholds. Vertical gray bars separate the major waves of the pandemic with the causal variant shown above. Waves are demarcated using the collection week when the causal variant first reached 5% prevalence.When assessing the concordance of individual VOCs, we found that 98.21% (1865) and 99.63% (1892) of 1,899 VOCs had > 90% exact and parent lineage concordance, respectively (Fig. 5b). In total, we observed seven VOCs with a small number of discordant calls, and four.of these (BA.2.2.1, BA.5.10, BQ.1.19, and BY.1) had UShER tree placement conflicts with their designated hashes, i.e. these sequences were representative of those lineages but could not be placed accordingly in the UShER tree (Table 1). This indicated that the original sequences of these VOCs had unstable PANGO lineage designations. For example, one of four simulated BA.2.2.1 sequences was called a BA.1, and we later found that the original sequence was hashed as BA.2.2.1 but placed in the UShER tree as BA.1 (Table 1). Inspection of the other three VOCs revealed that they were recombinants with discordant calls corresponding to one of the recombined lineages (Table 1), indicating that the simulated genomes had a loss of resolution.Table 1 Summary of VOCs with more frequent discordant PANGO lineage calls. Each row shows a lineage with < 90% parent concordance and the most common discordant lineage observed among the simulated sequences.We then investigated features of the simulated genomes, including key drivers of discordant lineage calls. As expected, we found that genome coverage was significantly lower among genomes that had discordant lineage calls or that were only parent concordant compared to those with exact concordance in both experiments (p < 0.001 in all comparisons, Wilcoxon rank-sum test, Figure S11a-b). Both experiments yielded similar genomic positional dependence of consensus genome errors (Figure S11c), and these errors were often found in regions modeled with poor coverage (5’/3’ peripheral genomic regions) instead of positions modeled with higher base calling errors (Figure S11d-e). These simulation experiments collectively show that the Virseq assay generates consensus genomes with accurate PANGO lineage designations and that accuracy is predominantly driven by genomic coverage, which is expected behavior for the Phylogenetic Assignment of Named Global Outbreak Lineages (pangolin) software in general29. This Virseq simulator is a crucial component of our surveillance apparatus, ensuring that we anticipate potential sequencing interruptions and serving as a model for viral sequencing simulators in general.Detection and haplotype phasing of SARS-CoV-2 mixturesThe concurrent circulation of multiple lineages of the same virus during a pandemic may result in co-infections of different viral lineages, which in some cases result in more severe clinical outcomes in COVID-19 patients30. Thus, another requirement for effective surveillance machinery is the ability to distinguish and characterize co-infections, which may complicate consensus genome generation and PANGO lineage determination. After an initial finding of within-host SARS-CoV-2 diversity31, reports emerged describing patients likely co-infected with co-circulating SARS-CoV-2 lineages32,33,34,35,36,37. Since our SARS-CoV-2 WGS dataset robustly captures these pandemic-wide trends in circulating lineages (Fig. 2) and is highly stable (Figs. 4, 5), we posited that recovery of multiple SARS-CoV-2 lineage detections from the same sample would be possible.To identify these potential mixtures (i.e. co-infections), we developed a custom workflow utilizing freyja14, an off-the-shelf mixture deconvolution algorithm (Methods). We found that deeply sequenced samples (20.7% or 123,373 of 594,832) produced stable lineage mixture results (> 99% genome coverage, 100% S gene coverage, and average depth of coverage > 200X) (Figure S12a). Mixtures were then required to satisfy three criteria: 1) Top lineage relative abundance no greater than 0.8, 2) second lineage relative abundance no less than 0.2, and 3) minimum of 3 UShER55 defining SNPs discriminating the two lineages (heretofore defined as ‘discriminating SNPs’) comprising the mixture (Supplementary Fig. 12b-f). This process yielded a final confident set of 379 mixtures (Supplementary Data) likely of similar quality to their non-mixture counterparts, as they were found to have similar median depth of coverage (mixtures: 338.4, non-mixtures: 335.6, p = 0.46, Wilcoxon rank-sum test) (Figure S12g). These mixtures had similar Ct values as non-mixtures, with the same differences observed between the BA.1 wave samples and those before and afterwards (Figure S3c). We also found that these mixtures had lineage compositions concordant with the original lineage called by pangolin29, since in most cases the pangolin-derived lineage either closely matched the majority mixture lineage (217 of 379, 57.3%) or was a parent of both majority and minority mixture lineages (142 of 379, 37.5%) (Figure S13).Not surprisingly, we detected a plurality of mixtures during the BA.1 wave (159 of 379, 42%), which represents the largest portion of the dataset analyzed (22,169 of 123,373 samples, 18%). We also observed the highest prevalence of mixtures during the BA.1 wave, peaking at 1.4% (35 of 2498 samples) the last week of December 2021 (Fig. 6a). When categorizing the lineages comprised by these mixtures using their lineage groups, we found that most mixtures were of lineages from the same group (347 of 379, 91.6%), e.g. BA.1* mixed with BA.1* (Fig. 6b, Table S2). This is likely due to variants emerging in blocks (Fig. 2), though we did observe some mixtures (32 of 379, 8.4%) of different lineage groups collected during the transitions between these blocks, e.g. one Delta*-Mu* and three BA.1*-Delta* mixtures (Fig. 6b, Table S2).Fig. 6 Haplotype phasing of SARS-CoV-2 mixture samples (i.e. coinfections). a-b) Prevalence and number of mixtures in each sample collection week with major pandemic waves demarcated as they were in Fig. 3. In (a), the weekly average (~0.2%) is shown as a horizontal dashed red line. In (b), mixtures are colored based on the lineage family of the major and minor mixture components, e.g. BA.1*-Delta indicates a mixture of BA.1 and Delta sublineages with majority BA.1. (c) Sample-level comparison of the number of and minimum distance between UShER defining mutations that discriminate the two mixture lineages. Samples are colored based on their number of defining mutations phased: >1 phased (black, n=149), 1 phased only with non-defining mutation(s) (yellow, n=128), and 0 phased (red, n=102). (d) Comparison of sample defining mutation phasing among largest resolved haplotype blocks. Results are compared between original WhatsHap haplotype blocks (“WhatsHap”) and those same haplotype blocks merged using freyja mixture results (“WhatsHap + freyja”). (e) The number of defining mutations phased and the lengths of sample merged haplotype blocks.Since these mixtures have robust sequencing depth (Figure S12g) and comprise lineages with as many as 77 discriminating SNPs (Supplementary Data), we hypothesized that it would be feasible to resolve lineage haplotypes. Using a standard haplotype phasing tool for long reads (Methods), we were able to produce haplotype blocks in most mixtures (277 of 379, 73.1%), favoring mixtures harboring discriminating SNPs greater in number and closer together (Fig. 6c). We also developed a custom approach that employs a greedy strategy to merge haplotype blocks together based on the allele frequencies of the discriminating SNPs in each haplotype block (Methods). This greatly increased haplotype block resolution, on average increasing the number of SNPs in the largest (merged) haplotype block by ~ 112% (3.2 to 6.8 SNPs) and increasing the size of haplotype blocks to as long as 15.8kbp (Fig. 6d–e). These exceptionally large haplotype blocks were found among the BA.1*/Delta* mixtures, which have the largest number of discriminating SNPs among the mixtures identified (Supplementary Data). One example is LC0471172, which is a mixture of BA.1.1.18 and AY.39 that had a final merged haplotype block of length 15.8kbp harboring 61 SNPs and spanning most of ORFs 1a and 1b as well as the entire S gene and 3’ end of the genome (Figure S14). These rare, finely resolved mixture haplotypes are evidence that combining haplotype reconstruction with mixture analysis has potential to unveil unique sample characteristics. This mixture analysis workflow thus provides our surveillance apparatus with the essential ability to detect co-infections and ensures that consensus genomes are correctly reported in such cases.

A program for real-time surveillance of SARS-CoV-2 genetics

Chromosome-level genome assembly of Tritrichomonas foetus, the causative agent of Bovine Trichomonosis

Parasitologists up in arms as NIH ends funding for key database

BigSur – leveraging gene correlations in single cell transcriptomic data

Some common ‘forever chemicals’ may pass through skin

Leveraging gene correlations in single cell transcriptomic data | BMC Bioinformatics

Hot Topics

Chromosome-level genome assembly of Tritrichomonas foetus, the causative agent of Bovine Trichomonosis

Parasitologists up in arms as NIH ends funding for key database

BigSur – leveraging gene correlations in single cell transcriptomic data

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Chromosome-level genome assembly of Tritrichomonas foetus, the causative agent of Bovine Trichomonosis

Parasitologists up in arms as NIH ends funding for key database

BigSur – leveraging gene correlations in single cell transcriptomic data

Some common ‘forever chemicals’ may pass through skin

Popular Articles

Chromosome-level genome assembly of Tritrichomonas foetus, the causative agent of Bovine Trichomonosis

Parasitologists up in arms as NIH ends funding for key database

BigSur – leveraging gene correlations in single cell transcriptomic data