Olivar: towards automated variant aware primer design for multiplex tiled amplicon sequencing of pathogens

Generation of risk arrayA risk array is an array of non-negative real numbers$${{{{{{\bf{r}}}}}}}=[{r}_{1},\,{r}_{2},\,{r}_{3},\,\ldots,\,{r}_{L}]$$
(1)
where L is the length of the targeted sequence. The array r consists of four weighted components,$${{{{{{\bf{r}}}}}}}={a}_{1}\cdot {{{{{{\bf{snp}}}}}}}+{a}_{2}\cdot {{{{{{\bf{egc}}}}}}}+{a}_{3}\cdot {{{{{{\bf{lc}}}}}}}+{a}_{4}\cdot {{{{{{\bf{ns}}}}}}}$$
(2)
where a1, a2, a3 and a4 are set to 1 as default. snp represent SNPs, egc represent extreme GC content, lc represent low sequence complexity and ns represent non-specificity. For each user provided single nucleotide polymorphism (SNP) at position p of the targeted sequence,$${{{{{{\bf{snp}}}}}}}[p]=10\sqrt{{{{{{{\rm{freq}}}}}}}}$$
(3)
where freq could be the frequency of the SNP to be avoided. The root of SNP frequency is taken to amplify low frequency SNPs, and timed by a factor of 10 to be comparable to other risk components. Note that freq could also be user defined (e.g., when certain SNPs are considered more important than others, or simply just scale up all frequencies by a certain factor). To calculate the other three components, a set of equal-length, overlapping, and evenly distributed “words” are generated from the targeted sequence. Suppose the length of each word is ws (28 by default), L is divisible by ws and ws is divisible by a positive integer c (2 by default), then the distance between neighboring words is ws/c, and$${{{{{{\bf{W}}}}}}}= \left\{{{{{{{\bf{seq}}}}}}}[1:{{{{{{\rm{ws}}}}}}}],\right.\\ {{{{{{\bf{seq}}}}}}}\left[1+\frac{{{{{{{\rm{ws}}}}}}}}{c}:{{{{{{\rm{ws}}}}}}}+\frac{{{{{{{\rm{ws}}}}}}}}{c}\right],\ldots,\\ {{{{{{\bf{seq}}}}}}}\left[1+i\cdot \frac{{{{{{{\rm{ws}}}}}}}}{c}:{{{{{{\rm{ws}}}}}}}+i\cdot \frac{{{{{{{\rm{ws}}}}}}}}{c}\right],\ldots,\\ \left.{{{{{{\bf{seq}}}}}}} \, [L-{{{{{{\rm{ws}}}}}}}+1:L]\right\}$$
(4)
where W is the set of words, seq is the targeted sequence and i is a positive integer. If L is not divisible by ws, the targeted sequence is minimally trimmed to satisfy this condition, and corresponding elements in risk array r are set to 0. The value of ws is set as 28 by default based on typical primer length (20-25bp) and the minimum word size (7) of the NCBI BLAST+ program26. c is set as 2 by default so that each base is covered by two words. GC content and sequence complexity of each word is calculated. Sequence complexity is calculated based on Shannon entropy (described below). The number of BLAST hits of each word is acquired with the user-provided BLAST database. The following parameters are set to the NCBI BLAST+ program26 (version 2.12.0): evalue as 5, reward as 1, penalty as −3, gapopen as 5, gapextend as 2. These are the default parameters of the task “blastn-short”. For a nucleotide at position p (\({{{{{{\rm{ws}}}}}}}-\frac{{{{{{{\rm{ws}}}}}}}}{c}+1\le p\le L-{{{{{{\rm{ws}}}}}}}+\frac{{{{{{{\rm{ws}}}}}}}}{c}\)) of the targeted sequence (except for head and tail regions), there is a set of c words overlapping with that nucleotide, denoted as Wp ⊂ W. The average GC content of Wp, the average sequence complexity of Wp and the average number of BLAST hits of Wp are denoted as gc, cmplx and hits, respectively. nsraw is an array of 0s with length of L. Then,$${{{{{{\bf{egc}}}}}}}[ \, p]=\left\{\begin{array}{ll}0\quad &({{{{{{{\rm{gc}}}}}}}}_{\min } \, < \, {{{{{{\rm{gc}}}}}}} \, < \, {{{{{{{\rm{gc}}}}}}}}_{\max })\hfill \\ 1\quad &({{{{{{\rm{gc}}}}}}}\le {{{{{{{\rm{gc}}}}}}}}_{\min }\,{{{{{{\rm{or}}}}}}}\,{{{{{{\rm{gc}}}}}}}\ge {{{{{{{\rm{gc}}}}}}}}_{\max })\end{array}\right.$$
(5)
$${{{{{{\bf{lc}}}}}}}[ \, p]=\left\{\begin{array}{ll}0\quad &({{{{{{\rm{cmplx}}}}}}} \, > \, {{{{{{{\rm{cmplx}}}}}}}}_{{{{{{{\rm{low}}}}}}}})\hfill \\ 1\quad &({{{{{{\rm{cmplx}}}}}}}\le {{{{{{{\rm{cmplx}}}}}}}}_{{{{{{{\rm{low}}}}}}}})\end{array}\right.$$
(6)
$${{{{{{\bf{nsraw}}}}}}}[ \, p]={{{{{{\rm{hits}}}}}}}$$
(7)
where gcmin (0.25 by default), gcmax (0.75 by default) and cmplxlow (0.4 by default) are user defined. nsraw is then normalized to get ns$${{{{{{\bf{ns}}}}}}}=\frac{{{{{{{\bf{nsraw}}}}}}}}{\max ({{{{{{\bf{nsraw}}}}}}})}$$
(8)
Calculation of sequence complexitySequence complexity is calculated based on Shannon’s entropy. For a DNA sequence S of length L (L > 3) with alphabet {A, T, C, G}, the number of each k-mer (k = 1, 2, 3) is stored as an array$${{{{{{{\bf{c}}}}}}}}_{{{{{{{\bf{k}}}}}}}}=[{m}_{0},\,{m}_{1},\,\ldots,\,{m}_{n}]$$
(9)
where mi(i = 0, 1, …, n) is the number of a certain k-mer, and n = 4k. For example, if S = ACGCAGCGAGCAG, then c1 = [4, 4, 5] since there are 4 ‘A’s, 4 ‘C’s and 5 ‘G’s. Then,$${{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{k}}}}}}}}=\frac{{{{{{{{\bf{c}}}}}}}}_{{{{{{{\bf{k}}}}}}}}}{L-k+1}$$
(10)
$${e}_{k}=-\frac{{{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{k}}}}}}}}\cdot {\log }_{{{{{{{\rm{2}}}}}}}}({{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{k}}}}}}}})}{2k}$$
(11)
where ek is the dot product of the two vectors. The complexity of S is the smallest of e1, e2 and e3.Optimization of PDRsPDRs are regions where primer candidates are generated, and primer candidates are subsequences of a PDR. A PDR is defined by its start coordinate p and stop coordinate p + l − 1 (closed interval), where l is the length of the PDR (40 by default). All PDRs have the same length. The risk of the PDR is then defined as sum(r[p : p + l − 1]), where r is the risk array.Generation of one PDREach PDR is randomly selected within a certain region C (Supplementary Fig. 12, also see green boxes in Fig. 2a), the length of C is greater than or equal to l. The risk of all possible PDRs within C is calculated, and PDRs with risk below Xth percentile are considered candidate PDRs. Here percentile is defined as a PDR risk threshold below which percentage X of PDRs fall. In short, not all possible PDRs within region C is included for random selection, and the worst PDRs are excluded to accelerate the optimization process. Therefore, a lower X means a more stringent selection on PDR candidates. X is set to 30 by default to balance optimization efficiency and effectiveness (Fig. 2). This is also discussed in “Optimization of PDRs” in Results.Generation of a set of PDRsSimilar to forward primer (fP) and reverse primer (rP), there are also forward PDR (fPDR) and reverse PDR (rPDR). Here forward means closer to the 5’ end (left end) of the targeted sequence. There are three restrictions for a valid PDR set given a targeted sequence:

1.

PDRs should not overlap with each other (this guarantees that primers will not overlap).

2.

All amplicons generated from a PDR pair should satisfy a range of desired amplicon length, as defined by the user. An amplicon is defined as a pair of fP and rP, and its length is defined from the start of fP to the end of rP.

3.

The targeted sequence are fully covered with inserts. An insert is the region between a PDR pair.

Suppose a valid PDR set consists of N PDR pairs and each PDR pair is generated sequentially. For the kth PDR pair (k ≥ 1), the fPDR is selected from region C2k-1, and the coordinates of the fPDR is [p2k−1, p2k−1 + l − 1]. The rPDR is selected from region C2k, and the coordinates of the rPDR is [p2k, p2k + l − 1]. \({{{{{{{\rm{ALEN}}}}}}}}_{\min }\) is minimum amplicon length and \({{{{{{{\rm{ALEN}}}}}}}}_{\max }\) is maximum amplicon length.Generation of a PDR set starts with the first PDR pair (k = 1). Starting from the left end of the risk array and the targeted sequence, the fPDR of the first PDR pair is selected from region C1 (Supplementary Fig. 12),$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{1}}}}}}}}=[1,\,3l]$$
(12)
Here C1 starts at 1 for simplicity, and region size of 3l gives enough room for selecting the first PDR. There are 2l + 1 possible PDRs within C1, and one of them is selected as the first fPDR, with start coordinate p1, based on the method described in the section above (Generation of one PDR). The first rPDR is selected from region C2 (Supplementary Fig. 13), and based on the amplicon length restriction,$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{2}}}}}}}}=[ \, {p}_{1}+{{{{{{{\rm{ALEN}}}}}}}}_{\min }-l,\,{p}_{1}+{{{{{{{\rm{ALEN}}}}}}}}_{\max }-1]$$
(13)
The start coordinate of the rPDR is selected as p2.For the second PDR pair (k = 2), the fPDR and rPDR are selected from region C3 and C4, respectively. Considering the three restrictions, the boundaries of C3 are$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{3}}}}}}}}= \left[\max ({p}_{1}+l,\,{p}_{2}+3l-{{{{{{{\rm{ALEN}}}}}}}}_{\max }),\right.\\ \left.{p}_{2}-1\right]$$
(14)
Also see Supplementary Fig. 14. The second fPDR is selected from C3 with start coordinate p3.The boundaries of C4 are$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{4}}}}}}}}= \left[\max ({p}_{3}-l+{{{{{{{\rm{ALEN}}}}}}}}_{\min },\,{p}_{2}+2l),\right.\\ \left.{p}_{3}+{{{{{{{\rm{ALEN}}}}}}}}_{\max }-1\right]$$
(15)
Also see Supplementary Fig. 15. The second rPDR is selected from C4 with start coordinate p4.Starting from the third PDR pair (k ≥ 3), the fPDR and rPDR are selected from region C2k-1 and C2k, respectively. Considering the three restrictions, the boundaries of C2k-1 are$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{2k-1}}}}}}}}= \left[\max ({p}_{2k-4}+l,\,{p}_{2k-2}+3l-{{{{{{{\rm{ALEN}}}}}}}}_{\max }),\right.\\ \left.{p}_{2k-2}-1\right]$$
(16)
Also see Supplementary Fig. 16. The kth fPDR is selected from C2k-1 with start coordinate p2k−1.The boundaries of C2k are$${{{{{{{\bf{C}}}}}}}}_{{{{{{{\bf{2k}}}}}}}}= \left[\max ({p}_{2k-1}-l+{{{{{{{\rm{ALEN}}}}}}}}_{\min },\,{p}_{2k-2}+2l),\right.\\ \left.{p}_{2k-1}+{{{{{{{\rm{ALEN}}}}}}}}_{\max }-1\right]$$
(17)
Also see Supplementary Fig. 17. The kth rPDR is selected from C2k with start coordinate p2k.PDR generation will stop when C2k exceeds the limit of the targeted sequence.Loss of a PDR setThe Loss of a PDR set is defined as the squared sum (sum of squares) of the risk of the top 10% high-risk PDRs. Instead of calculating the total risk of all PDRs, we focus the Loss function on high-risk PDRs and amplify the contribution of the worst PDRs with non-linear transformation (e.g., sum of squares). For an input genome of length L, the number of PDR sets generated is \(\lfloor 500L/{{{{{{{\rm{ALEN}}}}}}}}_{\max }\rfloor\). The PDR set with the lowest Loss is selected for downstream design.Primer pool assignmentIf the targeted sequence is fully covered with inserts (region between fP and rP), then amplicons must be overlapping and the kth fP could form a short amplicon with the (k − 1)th rP. These short amplicons are usually much shorter than desired amplicons, having much higher amplification efficiency. To avoid the formation of these short amplicons, most PCR tiling applications assign amplicons into two pools so that amplicons within each pool are non-overlapping. Suppose the first PDR pair is assigned to pool 1 and the second PDR pair is assigned to pool 2, then the (2k − 1)th PDR pair is assigned to pool 1 and the 2kth PDR pair is assigned to pool 2 (k ≥ 2).Optimizing primer candidates with SADDLEIn the previous step the optimal PDR set is separated into two pools of PDRs, and each pool is input to SADDLE separately. Primer candidates are first generated for each PDR, with default parameters: temperature as 60 °C, salinity as 0.18, max primer length as 36, max ΔG (Gibbs free energy) as −11.8. One primer candidate is randomly picked from each PDR to form a primer set. SADDLE minimize primer dimers within the randomly generated primer set by iteratively replacing a primer candidate with another one from the same PDR. The number of iterations is determined by the total number of PDRs in the pool. The final optimized primer set will be the output of Olivar.Prediction of non-specific ampliconsThe prediction of non-specific amplicons starts with running BLAST through each single primer, with the following parameters set to the NCBI BLAST+ program26: evalue as 5000, reward as 1, penalty as −1, gapopen as 2, gapextend as 1. The values of these parameters are determined based on Primer BLAST27. For each BLAST hit, the BLAST program outputs the location and orientation of that hit. A single primer could have multiple BLAST hits, distributed to different chromosomes. For each chromosome, the location and orientation of hits of all primers are analyzed together. If two hits are close enough and the orientations are legitimate for PCR, that pair of hits is reported as a predicted non-specific amplicon.Calculation of Nextstrain nucleotide entropyNextstrain19 calculates Shannon’s entropy for each nucleotide position of the reference SARS-CoV-2 genome based on MSA of genomes available on GISAID9. Here we include global genomes dating from Dec. 21, 2019 to Dec. 04, 2022. Entropy at a given position c is calculated as below. Suppose we have N genomes. At position c, N1 genomes has ‘A’, N2 genomes has ‘T’, N3 genomes has ‘C’, N4 genomes has ‘G’. Suppose N1, N2, N3 and N4 are non-zero.$${{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{c}}}}}}}}=\left[\frac{{N}_{1}}{N},\,\frac{{N}_{2}}{N},\,\frac{{N}_{3}}{N},\,\frac{{N}_{4}}{N}\right]$$
(18)
$${e}_{c}=-{{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{c}}}}}}}}\cdot {{{{\ln }}}}({{{{{{{\bf{p}}}}}}}}_{{{{{{{\bf{c}}}}}}}})$$
(19)
where pc is an array and ec is dot product and entropy at position c. Of 7774 locations with non-zero entropy, we selected 1517 locations with entropy greater than 0.01 and input to Olivar as SNP locations and frequencies. A list of locations and their corresponding entropy can be found in Supplementary Data 1. We used the reference genome provided by Nextstrain (GenBank MN908947.3) without a BLAST database of non-targeted sequences. Other parameters are kept as default.In silico comparison of Olivar and PrimalScheme on Delta and Omicron variantsFor Olivar, we used GISAID genome EPI_ISL_402124 as reference, a BLAST database of human genome assembly GRCh38 as non-targeted sequences, and 440 SNPs of SARS-CoV-2 lineage B.1.617.2 and B.1.1.529 generated with Variant Database (described below). BLAST version 2.12.0 was used to create the BLAST database, with 24 chromosomes and mitochondria of GRCh38 as input sequences. The same BLAST version was used by Olivar to fetch non-specific hits. A different version of BLAST may generate different non-specific results and lead to different Olivar designs. Desired amplicon length was set to 252nt to 420 nt and the –check-var option of was turned on. For the six Olivar runs, random seed was set to 10, 11, 12, 13, 14, 15, respectively. Other Olivar parameters were kept as default. Olivar design of random seed set as 10 was used for comparison with ARTIC v4.1 (Supplementary Data 3 and 4). Olivar version≤1.1.5 should be used to reproduce the results.For PrimalScheme (version 1.4.1), desired amplicon length was also set to 252nt to 420 nt and 100 artificial genomes were used as input (described below). Other parameters were kept as default.SNP calling of Delta and Omicron variants with Variant DatabaseWe downloaded MSA and metadata of GISAID SARS-CoV-2 genomes before Feb. 28, 2022 and used Variant Database20 (version 2.4) to output SNPs of lineage B.1.617.2 (Delta variant, 3,961,817 genomes) and B.1.1.529 (Omicron variant, 1,258,730 genomes), with GISAID genome EPI_ISL_402124 as reference for coordinates. Variant Database compares the differences between reference genome EPI_ISL_402124 and all genomes in lineage B.1.617.2 (or B.1.1.529), and SNP frequency is defined as the percentage of genomes that are different from the reference at a given location. Note that Variant Database version 2.4 does not output insertions, so here SNPs include substitutions and deletions. We used the command “frequencies <cluster>” to list the frequencies of individual mutations for B.1.617.2 and B.1.1.529 separately. Insertions are not generated due to the limitations of the Variant Database. Of the 10,000 output SNPs, we kept 440 SNPs with a frequency greater than 0.01 and input to Olivar. A list of those SNPs can be found in Supplementary Data 2.Input genomes for PrimalSchemePrimalScheme limits the number of input genomes to 100. Since more than 5 million Delta and Omicron genomes are available from GISAID, 99 artificial genomes are created, bearing the same 440 SNPs input to Olivar. Each SNP is then randomly added to n of the 99 copies of the GISAID reference EPI_ISL_402124,$$n=\left\{\begin{array}{ll}\lfloor 100f\rfloor \hfill \quad &(f \, < \, 0.5)\\ \lfloor 25+50f\rfloor \quad &(0.5 \, < \, f \, < \, 1)\end{array}\right.$$
(20)
where f is the frequency of the SNP and ⌊ ⌋ means rounding down. Together with the reference, 100 genomes was input to PrimalScheme. We did not keep the original frequency for high-frequency SNPs in the artificial genomes because PrimalScheme will consider a position to be conserved if the vast majority of genomes have the same SNP and that position will not be avoided.In silico comparison of Olivar and PrimalScheme on 98 variants of interest (VOI)MSA of 98 VOIs was directly downloaded from GISAID on Octomber 23, 2023. 996 SNPs are called from the MSA of all VOIs, including substitutions, insertions and deletions, with GISAID genome EPI_ISL_402124 used as reference for coordinates. For Olivar input, we used GISAID genome EPI_ISL_402124 as reference, as well as the 996 SNPs. No BLAST database was used. The 98 VOIs were directly used as input for PrimalScheme. For Olivar and PrimalScheme (version 1.4.1), desired amplicon length is set to 380nt to 420 nt. Olivar was also run six times, with random seed set to 10, 11, 12, 13, 14, 15, respectively. Other Olivar and PrimalScheme parameters are kept as default. A list of the 98 VOIs and the 996 SNPs can be found in Supplementary Data 7 and 8.Sample preparation and quantificationSynthetic RNA control for SARS-CoV-2 is purchased from Twist Bioscience (part number 105204, GISAID accession: EPI_ISL_6841980, GISAID name: Hong Kong/HKU-211129-001/2021, B.1.1.529/BA.1, Omicron variant). Time-weighted composite samples of raw wastewater were collected every 1h for 24 h from the influent of the two domestic wastewater treatment plants (WWTPs), CB and KB, on two separate dates (Aug. 8 and Aug. 15, 2022). Samples were kept on ice during transport and stored at 4 °C in the laboratory to be processed within 24 h of collection. Sample concentration using HA filtration and bead beating methods, as well as RNA extraction method using the ChemagicTM Prime Viral DNA/RNA 300 Kit H96 (Chemagic, CMG-1433, PerkinElmer) were as decribed by Lou et al.6. To normalize synthetic RNA control to Ct values of 18 and 35, One-step RT-qPCR were performed with qPCRBIO probe 1-Step Go LoROX (PB25.41, PCR Biosystems) on the QuantStudio 3 Real Time PCR System (A28567, Applied Biosystems) as previously described by Lou et al.28. These two Ct bounds were selected to represent the high and low concentrations of SARS-CoV-2. SARS-CoV-2 concentrations of wastewater samples were quantified using RT-ddPCR on a QX200 AutoDG Droplet Digital PCR System (Bio-Rad) and a C1000 Thermal Cycler (Bio-Rad) in 96-well optical plates, as previously described by Lou et al.6.Multiplexed PCR, library preparation and sequencingARTIC v4.1 primer panel was purchased from Integrated DNA Technologies (IDT, Artic v4.1 NCOV-2019 Panel, 500rxn, 10011442). Primer pool was prepared by IDT according to the instructions provided by the ARTIC network. Specifically, additional primers were added and certain primers were pooled with double concentration. A copy of the instructions and the link to the original web page can be found in Supplementary Data 11. Olivar primers (Supplementary Data 3 and 4) were ordered in tubes from Sigma Aldrich and mixed by hand to achieve the final equal concentration of 15 nanomolar (nM) per primer. Reverse transcription of synthetic RNA control and extracted wastewater RNA were conducted using 8 uL of sample RNA and LunaScript RT SuperMix kit (NEB, E3010), as described by Tyson et al.17. To avoid bias attributed to reverse transcription, for each sample, the total volume of 10 uL cDNA product were gently homogenized by pipetting then divided into four 2.5 uL aliquots for the downstream PCR amplification reactions (using primer pool 1 and 2 of ARTIC v4.1, and using primer pool 1 and 2 of Olivar). PCR amplification was also performed using Q5 Hot Start High-Fidelity 2X Master Mix (NEB M0494) as described by Tyson et al.17. PCR products were purified using AmPureTM XP beads (Beckman Coulter Inc., A63880). A high bead-to-sample ratio of 1.8 was applied to maximize the potentials of capturing PCR byproduct. Purified DNA samples were normalized to 20 ng/uL in 25 uL and submitted for amplicon sequencing service at Azenta (EZ-Amplicon), with paired-end (2 × 250bp), quality filtered and adapters trimmed Illumina reads output. The concentration of amplicons was measured using Qubit dsDNA HS kit and a Qubit 2.0 fluorometer (Invitrogen).Analysis of sequencing dataPaired-end sequencing reads were mapped to the reference genome (GISAID ID EPI_ISL_402124) with Bowtie2 (version 2.4.5)29, with maximum fragment length for valid paired-end alignments (-X) set to 1000. Only read pairs that were mapped concordantly were used for downstream analysis. Disconcordant, single-aligned and unmapped read pairs were discarded. Coverage of each reference position was calculated as the number of reads overlapping that position, using PySAM (version 0.19.1)30, and then normalized with median coverage. Note that reads from the two primer pools were merged before calculating coverage. Sequencing reads were also mapped to amplicon sequences with Bowtie2, and the coverage for each amplicon was the number of read pairs mapped to that amplicon. Data about mapping rate and coverage uniformity can be found in Supplementary Data 12 and 13.Statistical testsFor mapping rates, the two primer pools were treated as different samples (n = 8). For coverage uniformity, pool 1 and pool 2 were combined (n = 4), and the percentage of genomic positions with less than 0.05 × median coverage was used as a measurement of coverage uniformity. P values were calculated with a paired, two-tailed t-test. Effect sizes were calculated as Cohen’s d, with the formula below,$$d=\left| \frac{\overline{D}}{{{{{{{{\rm{SD}}}}}}}}_{{{{{{{\rm{D}}}}}}}}}\right|$$
(21)
where d is Cohen’s d, \(\overline{D}\) is the mean of paired difference, and SDD is the standard deviation of paired difference.Taxonomic assignment of unmapped reads in wastewater samplesRead pairs that are not mapped to the reference genome are used for this analysis. Disconcordant or single-aligned read pairs are not included. Unmapped reads are input to SeqScreen21 (v4.1, database v23.3, including Refeq archaea, bacteria, eukaryotes and viral sequences) with default parameters. All reads are assigned to a species by SeqScreen with a confidence score. Assignments with confidence score lower than 25% are discarded. The number of reads assigned to SARS-CoV-2 (NCBI Tax ID 2697049), bacteria (NCBI Tax ID 2), eukaryotes (NCBI Tax ID 2759), viruses (NCBI Tax ID 10239) and archaea (NCBI Tax ID 2157) are counted.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles