GCphase: an SNP phasing method using a graph partition and error correction algorithm | BMC Bioinformatics

The algorithm flow of GCphase can be roughly divided into six steps (Fig. 1): (i) Data Preprocessing. GCphase preprocesses the input VCF and BAM files to transform them into a format suitable for the algorithm, facilitating subsequent computations. (ii) Selecting SNPs and long reads. GCphase filters the obtained SNPs and reads, removing ambiguous SNPs and reads that do not cover any SNP site. (iii) Constructing the graph. The alleles of SNP loci are represented as vertices of an undirected graph, and the reads supporting two alleles are constructed as edges. (iv) Partitioning Graph. GCphase divides the graph into two sets as disjoint as possible by utilizing the minimum-cut algorithm on the graph. (v) Correcting Errors. GCphase employs two error correction steps to improve the accuracy of phasing results. (vi) Producing haplotype blocks. GCphase outputs each maximal connected component in the undirected graph as a haplotype block. Now, let us delve into the specific phasing process.Fig. 1Overview of the GCphase workflow. a Data preprocessing. GCphase simplifies the reads into a format that contains only SNP information. b Selecting SNPs and long reads. SNP loci with disproportionately large allele ratios (the number of reads supporting the major allele accounts for more than 85% of the total number of reads) and reads with insufficient SNP information (indicated by red borders in the graph) were removed. c Constructing the Graph. The two alleles of SNP loci are represented as vertices in the graph, and the reads supporting two alleles are represented as edges. d Partitioning Graph. The graph is partitioned into two sets with the smallest intersection using the minimum-cut algorithm. e Correcting errors and Producing haplotype blocks. After undergoing two error correction steps, the algorithm traverses the maximal connected components in the graph to generate haplotype blocks as the outputData preprocessingGCphase (version 1.0) utilizes minimap2 to align long reads to the reference genome. The resulting alignment in SAM format was then converted to BAM format by using samtools [27, 28]. The sorted and indexed BAM file serves as the input of GCphase and uses pysam to extract information from bam files [29]. The SNP location and its corresponding allele are provided as input in the form of a variant call format (VCF) file, which is a standard format for storing genetic variations.The long reads with optimal alignment are retained, and others are filtered. For a given long read, GCphase first identifies all the SNP loci it contains. The i-th SNP in the long read is represented as a two-tuple mi = (vi, k). vi is the SNP identity, k is a binary value, and when k = 0, it means that the base in the long read is the same as the major allele; when k = 1, it means that the base is the same as the minor allele. Finally, the long read can be transformed into a vector (m1, m2… mn). Ultimately, a single read can be simplified into a representation of SNPs along with the alleles supported by the read at those loci, convenient for subsequent processing.Selecting SNPs and long readsAfter GCphase obtains the information of SNPs and reads, not all the information can be utilized in the phasing process. For each SNP locus to be phased, if the proportion of the allele with the highest read support is greater than or equal to 85% of the total read support for both alleles, this SNP locus is removed. Let vi0 and vi1 represent the two alleles of SNP vi. Let n1 and n2 denote the counts of reads supporting vi0 and vi1, respectively. If max(n1,n2)/(n1 + n2) ≥ 0.85, that is, if the proportion of one allele is excessively high, it is considered a fuzzy locus for SNP vi, indicating unclear allele expression. Phasing at this locus is not feasible, and there is a higher likelihood of introducing errors. Therefore, the locus is removed. After iterating through all the SNP loci to be phased and removing all the SNP with imbalanced heterozygosity, it is also necessary to update the read information by removing these fuzzy loci from the previously obtained reads.Since the core algorithm of phasing relies on the extension and expansion of the relative positions between SNP loci, only the relative positions between multiple (greater than or equal to 2) SNP loci can provide effective phase information for the phasing process. For each long reads, if a read ultimately provides less than two SNPs, it cannot provide effective information for phasing and is considered an invalid read. Therefore, all information from that read was deleted (Fig. 2).Fig. 2Filtering criteria for SNP loci and edges. Alleles of the same colour connected by lines represent a single read, where different colours indicate different haplotypes. In the figure, at the fourth SNP (represented by red dashed lines), the major allele ‘A’ has a count of 9, while the minor allele ‘G’ has a count of 1. The proportion of the major allele reaches 90%, which exceeds the set threshold of 85%. Therefore, SNP4 is considered a fuzzy SNP, and phasing cannot be effectively performed. Thus, SNP4 is removed. The read only aligned to SNP8 and represented in red is shown in the figure. However, since they cover only one SNP locus, they cannot provide effective information for SNP phasing. Therefore, these reads were deletedConstructing the graphLet G(V,E) be an undirected graph, where each allele of the SNP loci to be phased is represented as a vertex in graph G. For the i-th SNP loci pi, we construct two vertices vi0 and vi1 in G, which represent the major and minor alleles of pi, respectively. If there is an edge e(vi0,vj1) between vi0 and vj1, it indicates the reads simultaneously supporting the major allele of SNP locus pi and the minor allele of SNP locus pj. The weight w(vi0,vj1) of edge e(vi0,vj1) represents the count of reads that support both alleles simultaneously. After processing all the reads iteratively, an undirected graph can be constructed.After constructing the undirected graph G, it is necessary to perform a filtering process on the edges of the graph. For any two SNP loci pi and pj, there are four vertices: vi0, vi1, vj0, and vj1 in the graph. There can be up to four edges: e(vi0,vj0), e(vi0,vj1), e(vi1,vj0), and e(vi1,vj1) between them. Due to the mutually exclusive nature of the two alleles at the same SNP locus, if allele (vi0,vj0) is present in one haplotype, the corresponding (vi1,vj1) should be present in the other haplotype. Similarly, if allele (vi0,vj1) is present in one haplotype, the corresponding (vi1,vj0) should be present in the other haplotype. Therefore, the four edges can be classified into two sets: N1 = {e(vi0,vj0), e(vi1,vj1)} and N2 = {e(vi0,vj1), e(vi1,vj0)}, where the weight of each set is the sum of the weights of the edges it contains. The phase relationship between the alleles of the two SNP loci is determined by the weight bias between these two sets. If the weight w(N1) is greater than the weight w(N2), the set N1 is correct, and vice versa. If max(w(N1),w(N2))-min(w(N1),w(N2)) ≤ 1, it indicates that the possibilities of N1 and N2 are converging, suggesting an ambiguous phase between SNP loci pi and pj. Phasing these two SNP loci cannot be effectively performed based on the edges between them, and there is a higher likelihood of introducing errors in the phasing process. In that case, all edges between the alleles about pi and pj are removed.Partitioning graphThe minimum cut of a graph refers to dividing the vertices of the graph into two parts in such a way that the sum of the weights of the edges between the two parts is minimized. For the two alleles of an SNP locus, since they belong to two different haplotypes, the weight between alleles within the same haplotype is maximum, while the weight between alleles from different haplotypes is minimum. Therefore, applying the minimum-cut algorithm on the graph involves partitioning the vertices into two subsets that are as disjoint as possible, such that the sum of the edge weights within each subset is maximized, and the sum of the edge weights between the subsets is minimized. For the graph G(V, E) constructed previously, V = {v00, v01, v10, v11…vi0, vi1…,vm0, vm1}, there are m SNP loci and 2 × m vertices. By using the minimum-cut algorithm, V is partitioned into two sets, V0 and V1. For SNP locus pi, if vi0 belongs to V0, vi1 should belong to V1. The optimization objective is shown below.$$ Obj = \min \sum\limits_{i = 0;j = 0}^{m} e \left( {v_{ik} ,v_{jk^{\prime}} } \right); $$where k and k’ ∊ {0, 1}, k ≠ k’, and vik ∊ V0 and vik’ ∊ V1.The Fiduccia-Mattheyses (FM) algorithm is a minimum-cut algorithm for graphs, primarily used to divide a graph into two equally sized parts with the minimum cut value between them [30] (Fig. 3). GCphase revised FM for partitioning V into V0 and V1. The specific implementation process is shown as follows.Fig. 3For the initial set partitioning G0 and G1 of the initial SNP to be phased, there are edges connecting the two alleles vg0 and vg1 of any SNP locus vg with the points in both sets. The cut value is the sum of the weights of the edges connecting the allele vertex with all vertices in its complementary set. Therefore, when the two alleles belong to different sets, different cut values exist. a When vg0 ∊ G0, vg1 ∊ G1, the cut value is the sum of the weights of the red edges, which is equal to 16. b When vg1 ∊ G0, vg0 ∊ G1, the cut value is the sum of the weights of the red edges, which is equal to 5. The cut value (16) of the allocation method in (a) is greater than the cut value (5) of the allocation method in (b). Therefore, the allocation method for the two alleles vg0 and vg1 of SNP tends to be the allocation method in (b), which is vg1 ∊ G0, vg0 ∊ G1Step 1: Initialization. Due to the mutually exclusive nature of the two alleles at SNP loci, both alleles cannot be assigned to the same set. During initialization, GCphase selects the major allele of all the SNP loci into V0, while the minor allele is inevitably assigned to V1. Therefore, only the partitioning of the major alleles needs to be saved.Step 2: We calculate the cut value of vertices and determine whether to perform allele grouping exchanges. For an SNP locus pt with two alleles vg0 and vg1, first, we assume that vg0 belongs to G0 and vg1 belongs to G1.The cut value of vg0 is calculated by the following formula.$$ cut(v_{g0} ) = \sum\limits_{j = 0}^{m} {e(} v_{g0} ,v_{jp} ); $$where vjp belongs to G1 and p ∊ {0,1}.The cut value of vg1 is calculated by the following formula.$$ cut(v_{g1} ) = \sum\limits_{j = 0}^{m} {e(} v_{g1} ,v_{{j{\text{q}}}} ); $$where vjq belongs to G0 and q ∊ {0,1}. Then, we can obtain the weights cut01(vg0,vg1) = cut (vg0) + cut(vg1).Next, we exchange vg0 and vg1, which means vg0 belongs to G1 and vg1 belongs to G0. We recalculate the cut values of vg0 and vg1 and obtain a new value cut10(vg0,vg1).If cut01(vg0,vg1) > cut10(vg0,vg1), we assign vg0 to G1 and vg1 to G0.Step 3: All the SNP loci are processed iteratively by Step 2 until no SNP locus undergoes an allele exchange operation.After these steps, preliminary grouping results are obtained.Correcting errorsThe initial phasing results obtained from the revised FM algorithm may still contain a few errors. As the revised FM algorithm is a heuristic algorithm, it can produce good partitioning results but does not guarantee finding the global optimum. Therefore, GCphase applies two error correction steps to refine the initial phasing results and to obtain final results. In the error correction step, first, each SNP is sorted in ascending order based on its position in the reference sequence, and then, it is corrected based on the information from its connected vertices.Step 1 of error correction: We determine the presence of switch errors between adjacent SNPs.For the four alleles (vi0,vi1,vj0,vj1) of two SNP loci vi and vj, there are two phasing results for these four alleles: set1(vi, vj) = {(vi0,vj0) ∈ V0;(vi1,vj1) ∈ V1} and set2(vi, vj) = {(vi0,vj1) ∈ V0;(vi1,vj0) ∈ V1}. The weight of set1(vi, vj) is w(vi0,vj0) + w(vi1, vj1), and the weight of set2(vi, vj) is w(vi0,vj1) + w(vi1, vj0).When [w(vi0,vj0) + w(vi1, vj1)]—[w(vi0,vj1) + w(vi1, vj0)] ≥ 2, GCphase selects set1 as the final phasing result. When [w(vi0,vj1) + w(vi1, vj0)]—[w(vi0,vj0) + w(vi1, vj1)] ≥ 2, GCphase selects set2 as the final phasing result. The final phasing results about (vi0,vi1,vj0,vj1) are represented as OptimizedSet(vi0,vi1,vj0,vj1);For vi and its adjacent vi+1, if edges exist between their alleles, the phasing result about their alleles obtained by the revised FM algorithm (Sect. 2.4) is assumed to be FM(vi0,vi1,v(i+1)0,v(i+1)1). When FM(vi0,vi1,v(i+1)0,v(i+1)1) ≠ OptimizedSet(vi0,vi1,v(i+1)0,v(i+1)1), there is a switch error at SNP vi+1. If there is no edge connecting the alleles vi and vi+1, GCphase skips vi+1 and considers the next locus vi+2. If FM(vi0,vi1,v(i+2)0,v(i+2)1) ≠ OptimizedSet (vi0,vi1,v(i+2)0,v(i+2)1), there exists the following calculation for all SNP loci vk that are connected to vi+2:$$ Fs(v_{k0} ,v_{k1} ,v_{{\left( {i + 2} \right)0}} ,v_{{\left( {i + 2} \right)1}} ) = \frac{k – (i + 2)}{{\left| {k – (i + 2)} \right|}}*\left( {w(v_{k0} ,v_{(i + 2)0} ) + w(v_{k1} ,v_{(i + 2)1} )} \right); $$when (vk0,v(i+2)0) ∈ V0 and (vk1,v(i+2)1) ∈ V1;$$ Fs(v_{k0} ,v_{k1} ,v_{{\left( {i + 2} \right)0}} ,v_{{\left( {i + 2} \right)1}} ) = \frac{k – (i + 2)}{{\left| {k – (i + 2)} \right|}}*\left( {w(v_{k0} ,v_{(i + 2)1} ) + w(v_{k1} ,v_{(i + 2)0} )} \right); $$when (vk0,v(i+2)1) ∈ V0 and (vk1,v(i+2)0) ∈ V1;Fs(vk0,vk1,v(i+2)0,v(i+2)1) is the support degree of vk for vi+2. If \(\sum\limits_{k}^{{}} {Fs(v_{k} ,v_{i + 2} )} \le 0\), it indicates that the support for locus vi+2 from the upstream loci is less than the support from the downstream loci. In this case, it is also considered to have a switch error at locus vi+1, and locus vi+1 is labelled as an error locus.Following the given procedure, error detection is performed for each SNP locus to be phased. For the error loci, we consider a high probability of switch errors at those loci. Therefore, GCphase partitions the entire set of SNP loci into consecutive reversal blocks, with the error loci as the first loci in each reversal block. The reversal blocks are numbered from 0 in the order of their positions in the reference sequence. We reverse the alleles of SNP loci within the reversal blocks with odd numbers, which means exchanging the grouping of the two alleles.Step 2 of error correction: We determine the support level for each SNP.For a SNP locus vi and a locus vk located before it, there exist edges for the four alleles (vi0, vi1, vk0, vk1). We calculate the sum of the weights Fn(vk, vi) of edges between alleles assigned to the same set, that is:Fn(vk, vi) = w(vk0, vi0) + w(vk1, vi1); when (vk0, vi0) ∈ V0 and (vk1, vi1) ∈ V1;Fn(vk, vi) = w(vk0, vi1) + w(vk1, vi0); when (vk0, vi1) ∈ V0 and (vk1, vi0) ∈ V1;If \(\sum\limits_{k}^{{}} {Fn(v_{k} ,v_{i} )} = 0\), it is considered that the posterior support of SNP vi is greater than the anterior support, and it is considered to have a switch error at locus vi.With these two error correction steps, the switch error and Hamming distance of the initial phasing results can be reduced.Producing haplotype blocksSince GCphase utilizes a graph-based approach for phasing SNPs, each maximum connected subgraph in the graph represents a haplotype block. This means that there are no edges connecting the two haplotype blocks. By performing a breadth-first search to traverse the entire undirected graph, all the maximum connected subgraphs can be obtained. Sorting the SNP loci within each subgraph in ascending order based on their positions on the reference genome, GCphase can obtain all the haplotype blocks.

Hot Topics

Related Articles