Local read haplotagging enables accurate long-read small variant calling

Approximate haplotaggingIn this section, we describe the approximate haplotagging algorithm in detail. The approximate haplotagging is applied on 25kb windows indepedently and all described methods work on the 25kb window only. We utilize a graph data structure to facilitate the haplotagging of reads. In the graph G, a set of vertices V represents alleles at locations where a set of reads R overlaps these vertices. Our objective is to assign a haplotag of 1, 2, or 0 to all vertices, ensuring that this assignment corresponds to the maximum read support. Haplotag 1 is assigned to reads overlapping a majority of the alleles of haplotype 1 in the local 25kb window, while haplotag 2 is assigned to those overlapping a majority of alleles of haplotype 2 in the same window. Haplotag 0 is designated for reads that cannot be conclusively assigned to either of the two haplotypes.For a genomic position n, we look at all possible tuples of vertices (i, j) from the set of vertices at this position. We use a similar approach to the score calculation for each pair at the same position while calculating scores for a pair of vertices at different variant positions. The score calculation for pair of vertices at different variant positions is described in ”How scores are calculated between positions” section. The best score S(Vn,i, Vn,j) is calculated for each tuple where the first vertex is assigned a haplotag 1 and the second vertex is assigned a haplotag 2. The set of supporting reads R(Vn,i), R(Vn,j) are stored for each possible haplotag assignment. Set of reads supporting the assignment R(Vn,i) is calculated as a set of reads overlapping vertex Vn,i and preceding vertex Vn−1,k joined with a set of reads that overlap Vn,i and start after position n − 1.At each step of the dynamic programming algorithm, we extend the best haplotag assignment calculated for the previous position. Final assignment is calculated by backtracking from the best score for the last genomic position.Below, we present a brief definition of variables we used to describe the approximate haplotagging algorithm:

Vn,j be a vertex j at position n.

S(Vn,i, Vn,j) be a score for vertices i, j at genomic position n so that vertex i is haplotag 1 and vertex j is haplotag 2. It is possible that i and j are the same vertex;

Vn be a set of vertices at position n;

R(V) be a set of reads overlapping vertex V;

R(V1, V2) be a set of reads overlapping vertices V1, V2;

R*(Vn,j) be a set of “new” reads that start after position (n − 1) and overlap vertex Vn,j;

InitializationThe two steps, initialization and recursion used are described below. The first genomic position of an interval (n = 1), we initialize the score as:$$S({V}_{1,i},{V}_{1,j})=size(R({V}_{1,i})\cup R({V}_{1,j})) \,\, f\!or\,all\,possible\,haplotagging\,of\,vertices\,i\,and\,j$$Initialization happens at the beginning of an interval or when the haplotagging cannot be extended from the previous position.RecursionDuring recursion, we calculate scores at each position n for all vertex pairs (Vn,i, Vn,j) based on the previously observed vertex Vn−1,k and Vn−1,l where edges E(Vn−1,k, Vn,i) and E(Vn−1,l, Vn,j) exist in the graph connecting previous position n − 1 and current position n. The score is calculated as:$$S({V}_{n,i},{V}_{n,j})= max\{S({V}_{n-1,k},{V}_{n-1,l}) \\ +size(R({V}_{n-1,k},{V}_{n,i})\cup R({V}_{n-1,l},{V}_{n,j})\cup {R}^{*}({V}_{n,i})\cup {R}^{*}({V}_{n,j}))\}$$for all pairs (Vn−1,k) and (Vn−1,l) where edges E(Vn−1,k, Vn,i) and E(Vn−1,l, Vn,j) exist in the graph.Backtracking and haplotag assignmentAfter scores have been calculated for each vertex, we backtrack from the last position to determine the best scores, and subsequently, each pair of alleles is assigned haplotags. Now that all alleles are haplotagged, we assign reads with haplotags based on the set of haplotagged alleles the read overlaps. This assignment is accomplished by enumerating all the alleles a read overlaps and then counting how many of these alleles belong to haplotype 0, haplotype 1, and haplotype 2. Ideally, a read should overlap only alleles from the same haplotype. If a read overlaps alleles from different haplotypes then the haplotag is assigned by majority. If the number of alleles from haplotype 1 and haplotype 2 are equal, the read is assigned a haplotag 0.Example of the approximate haplotagging algorithmIn Fig. 6, we illustrate a graph constructed for approximate haplotagging. In this graph each vertex represents an allele An,m where n is the genomic position and m is the allele. The alleles (m) are numbered for this demonstration. Set of reads overlapping each allele is shown next to it denoted as read1 to read11. The reads are placed as such it represents the alleles it supports in the graph.Fig. 6: Illustration of a graph used to explain approximate haplotagging alogrithm.This figure demonstrates and example of how scores are calculated in the approximate haplotagging algorithm.In the first step, we initialize the scores for the genomic position n = 1:$$S({A}_{1,1},{A}_{1,1})= size (R({A}_{1,1})\cup R({A}_{1,1}))=5 \\ S({A}_{1,1},{A}_{1,2})= size (R({A}_{1,1}) \cup R({A}_{1,2}))=5+6=11 \\ S({A}_{1,2},{A}_{1,2})= size (R({A}_{1,2}) \cup R({A}_{1,2}))=6$$Then we calculate the scores of each pair of vertices based on the previously observed vertices:$$S({A}_{2,1},{A}_{2,1})= max\{S({A}_{1,1},{A}_{1,1})+size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,1},{A}_{2,1})),S({A}_{1,1},{A}_{1,2}) \\ +size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,2},{A}_{2,1}))\}=max\{5+3,11+4\}=15$$$$S({A}_{2,1},{A}_{2,2})= max\{S({A}_{1,1},{A}_{1,1})+size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,1},{A}_{2,2})),S({A}_{1,1},{A}_{1,2}) \\ +size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,2},{A}_{2,2})),S({A}_{1,2},{A}_{1,2}) \\ +size(R({A}_{1,2},{A}_{2,1})\cup R({A}_{1,2},{A}_{2,2}))\} \\ =max\{5+3+2,11+3+5,6+1+5\}=19$$$$S({A}_{2,2},{A}_{2,1})= max\{S({A}_{1,1},{A}_{1,1})+size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,1},{A}_{2,2})),S({A}_{1,1},{A}_{1,2}) \\ +size(R({A}_{1,1},{A}_{2,2})\cup R({A}_{1,2},{A}_{2,1})),S({A}_{1,2},{A}_{1,2}) \\ +size(R({A}_{1,2},{A}_{2,1})\cup R({A}_{1,2},{A}_{2,2}))\} \\ =max\{5+3+2,11+2+1,6+1+5\}=14$$$$S({A}_{2,2},{A}_{2,2})= max\{S({A}_{1,1},{A}_{1,1})+size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,1},{A}_{2,1})),S({A}_{1,1},{A}_{1,2}) \\ +size(R({A}_{1,1},{A}_{2,1})\cup R({A}_{1,2},{A}_{2,1}))\}=max\{5+3,11+4\}=15$$Then we calculate the best score at the last position. The best score for the last position is 19 where allele A2,1 is assigned to haplotag-1 and allele A2,1 is assigned to haplotag-2. Previous pair of alleles for this score is A1,1 with haplotag-1 and A1,2 with haplotag-2.Using the best score, we assign haplotags to the vertices and haplotags to the reads. From the previous step we have alleles A1,1, A2,1 assigned to haplotag-1 and alleles A2,1, A2,2 assigned to haplotag-2. Reads: read1, read2, read3 overlap alleles A1,1, A2,1 both of which are haplotag-1. Therefore read1, read2, read3 are assigned haplotag-1.Reads: read4, read5 overlap alleles A1,1, A2,2 are consecutively haplotag-1 and haplotag-2. In that case read4, and read5 cannot be haplotagged. Following the same logic reads: read7, read8, read9, read10, read11 are assigned haplotag-2, and read6 cannot be haplotagged.Finally, these haplotag associations of the reads are used in the examples we generate for each candidate and the DNN model uses the information to generate accurate genotypes.How scores are calculated between positionsThe haplotagging algorithm iterates through all tuples of edges and stores the best score for each pair of vertices. We describe three cases with examples below.Simple Case (for 2 positions with two edges)In the simple case, we have two vertices per position and only two edges connecting two positions each without overlaps like in Fig. 7. In Fig. 7, vertex A1,1 means vertex 1 in position 1 and A1,2 means vertex 2 in position 1. So here, we would have 4 possible tuples of edges: (A1,1 − A2,1, A1,1 − A2,1), (A1,2 − A2,2, A1,2 − A2,2), (A1,1 − A2,1, A1,2 − A2,2), (A1,2 − A2,2, A1,1 − A2,1). The score is calculated for each tuple of edges. We calculate best scores for the four pairs of vertices: S(A2,1, A2,1), S(A2,2, A2,2), S(A2,1, A2,2), S(A2,2, A2,1).Fig. 7: Illustration of a graph with two vertices with two edges for simple case of score calculation.This figure demonstrates and example of how scores are calculated in a simple case of two vertices with two edges.Complex case (for 2 positions with four edges)In the worst case with four vertices for two positions, we have edges that connect all four vertices shown in Fig. 8. In this case, we will calculate scores for all 16 possible tuples of edges and we store best scores for 4 possible pairs of vertices: S(A2,1, A2,1), S(A2,2, A2,2), S(A2,1, A2,2), S(A2,2, A2,1).Fig. 8: Illustration of a graph with four vertices and four edges that connect all vertices to each other.This figure demonstrates and example of how scores are calculated in a complex case where multiple edges require calculations of all possible combinations.Case of 3 verticesIn the case of three vertices, where we have reference and two alternate alleles. Similar to the previous two cases, we calculate 9 best scores for each possible pair of vertices. Depending on the number of edges, we have to calculate the scores for all possible tuples of edges to get the scores for each possible pair of vertices.Analysis methodsRead alignment and subsamplingWe used pbmm2 version 1.10 and minimap254 version 2.24-r1122 to align reads to the reference genome. We used samtools55 version 1.15 for sampling alignment files at different coverages.Variant calling and haplotaggingWe used PEPPER-Margin-DeepVariant16 version r0.8, Clair321 version v1.0.0 for variant calling, for DeepVariant-WhatsHap-DeepVariant pipeline we used v1.2.0 version of DeepVariant. For haplotagging with WhatsHap, we used WhatsHap version v1.7.Benchmarking variant callsFor benchmarking variant calls, we used hap.py56 version v0.3.12. The hap.py is available through jmcdani20/hap.py:v0.3.12 in dockerhub. For we used GIAB v4.2.1 truth set9 against GRCh38 reference for all samples.Haplotagging accuracy and natural switch determinationWe used https://github.com/tpesout/genomics_scripts/haplotagging_stats.py to calculate the haplotagging accuracy16.Read accuracy estimationWe used Best57 version v0.1.0 for read accuracy analysis. For the analysis, we used GRCh38 as the reference to derive the empirical QV for each read.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles