Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing

We solve the HLA typing problem by generalizing it to the problem of quantifying the set of haplotypes that best explains the sequencing reads in a given sample. We first define the general problem and describe our solution. Afterwards, we show HLA-typing specific challenges and how to overcome them.

Haplotype quantification

Definitions

General. We assume that for the investigated species a reference genome is available that is complete enough to contain all haplotypes of interest \({\varvec{H}} = \{h_1, h_2, \dots , h_n\}\) (e.g., HLA alleles). Local differences to the reference genome, for example single or multiple nucleotide variants (SNVs or MNVs), insertions, or deletions (indels), are called variants. Each haplotype h can then be considered a sequence of genomic variants. We define \(\{v_1, v_2, \dots , v_k\}\) as the union of all these variants and represent the correspondence of haplotypes and variants as a binary matrix \((V_{i,j})_{i=1,2,\dots ,n,\ j=1,2,\dots ,k}\) with \(V_{i,j} = 1\) if haplotype \(h_i\) has variant \(v_j\) and 0 otherwise. We further define the binary matrix \((C_{i,j})_{i=1,2,\dots ,n,\ j=1,2,\dots ,k}\) with \(C_{i,j} = 1\) if haplotype \(h_i\) spans across the locus of variant \(v_j\) (but does not have the particular variant) and 0 otherwise.

Observed variables. We denote with \({\varvec{Z}} = (Z_1, Z_2, \dots , Z_l)\) the observed DNA fragments in a given sample. For each fragment, we consider a 4-tuple of mapping quality (MAPQ, the posterior probability that the locus the read is aligned to is the wrong one [14]), the alignment of the single read or the read pair (depending on the sequencing protocol), the read sequence(s), and the corresponding base qualities. For each variant \(v_j\), we further denote with \({\varvec{Z}}_{v_j} \subseteq {\varvec{Z}}\) the subset of fragments that span the variant.

Latent variables. We denote with \({\varvec{\Psi }} = (\psi _i \mid \psi _i \in U, i=1,\dots ,n, \sum _{i=1}^n \psi _i = 1)\) the latent fractions of each haplotype in \({\varvec{H}}\) in the given sample. Thereby, U denotes the universe of possible fractions, which can for example be continuous (\(U = [0,1]\)) or discrete (e.g., \(U = \{0, 0.5, 1\}\) for diploid samples). For each variant \(v_j\), we further denote with \(\theta _j\) the variant allele frequency in the given sample. Finally, we define the same binary latent variables as Köster et al. [15] for each observed DNA fragment \(Z_x \in {\varvec{Z}}_{v_j}\). First, we define \(\xi _x \in \{0,1\}\) with \(\xi _x = 1\) if the fragment is associated with the variant (i.e., has been sampled from the variant allele) and \(\xi _x = 0\) otherwise. Second, we define \(\omega _x \in \{0,1\}\) with \(\omega _x = 1\) if the fragment stems from the locus of interest and \(\omega _x = 0\) otherwise.

Problem. Our goal is to find approximations \(\hat{{\varvec{\Psi }}}\) of the true haplotype fractions that best explain the observed sequencing reads \({\varvec{Z}}\) and to calculate the posterior densities of any provided solutions.

Bayesian latent variable model

Under the simplifying assumption that the given haplotypes represent all possible variation of the genome at a locus of interest, the central observation is that the variant allele frequency \(\theta _j\) of variant \(v_j\) is determined by the fractions of the haplotypes, namely

$$\begin{aligned} \theta _j = \frac{ \sum _{i=1}^n V_{i,j} \psi _i }{ \sum _{i=1}^n C_{i,j} \psi _i }. \end{aligned}$$
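To make the relationship concrete, the following minimal sketch computes the fraction-induced allele frequencies for all variants at once, following the formula above. It assumes numpy; the toy matrices and variable names are ours for illustration and not part of Orthanq:

```python
import numpy as np

# Illustrative toy data: n = 3 haplotypes, k = 4 variants.
# V[i, j] = 1 if haplotype i carries variant j.
V = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
# C[i, j]: locus-spanning entries for variant j (definitions as in the text above).
C = np.array([[1, 1, 1, 0],
              [1, 1, 1, 1],
              [0, 1, 1, 1]])
# Latent haplotype fractions, summing to 1 (e.g., a diploid 0.5/0.5 mixture).
psi = np.array([0.5, 0.5, 0.0])

# theta_j = sum_i V[i, j] * psi[i] / sum_i C[i, j] * psi[i]
theta = (V.T @ psi) / (C.T @ psi)
print(theta)
```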
In other words, the main evidence for inferring haplotype fractions are the observed variant allele frequencies (i.e., vertical observations that are orthogonal to the horizontal haplotypes that shall be quantified) and their underlying uncertainty. Nevertheless, our approach can easily incorporate phasing information by combining close SNVs into MNVs, or SNVs and indels into complex replacements (also see “Generation of candidate haplotypes and variants” section).

In the following, let \(Z_x \in {\varvec{Z}}_{v_j}\) be the arbitrary but fixed x-th observed fragment spanning variant \(v_j\). As defined by Köster et al. [15], for \(\omega _x\) we assume

$$\begin{aligned} \omega _x \sim {\text{Bernoulli}}(\pi _x) \end{aligned}$$

with \(\pi _x\) being the complement of the probability denoted by the MAPQ value of fragment x in the BAM file with the fragment alignments of the sample. While in the single sample case of the original paper \(\xi _x \sim {\text{Bernoulli}}(\theta \tau )\) is assumed (with \(\tau\) denoting the probability that, if sampled from the variant-affected copy, the fragment indeed covers the variant [15]), we can here directly relate the distribution of \(\xi _x\) to the haplotype fractions, that is,

$$\begin{aligned} \xi _x \sim {\text{Bernoulli}} \left( \tau \frac{ \sum _{i=1}^n V_{i,j} \psi _i }{ \sum _{i=1}^n C_{i,j} \psi _i } \right) . \end{aligned}$$

Reusing the same modelling as shown by Köster et al. [15], this leads to the likelihood function

$$\begin{aligned} \Pr ({\varvec{Z}}_{v_j} \mid \psi _1, \psi _2, \dots , \psi _n) = \prod _{Z \in {\varvec{Z}}_{v_j}} \Pr \left( Z \mid \theta _j = \frac{ \sum _{i=1}^n V_{i,j} \psi _i }{ \sum _{i=1}^n C_{i,j} \psi _i } \right) . \end{aligned}$$

In practice, this means that the likelihood function can be directly obtained by applying Varlociraptor to the data (see “Utilizing Varlociraptor to obtain variant allele frequency distributions” section). Consequently, the likelihood of the haplotype fractions given all variants \({\varvec{V}}\) is

$$\begin{aligned} \Pr ({\varvec{Z}} \mid \psi _1, \psi _2, \dots , \psi _n) = \prod _{j=1}^k \Pr ({\varvec{Z}}_{v_j} \mid {\varvec{\Psi }}). \end{aligned}$$

Using Bayes' theorem, the posterior is thus

$$\begin{aligned} \Pr ({\varvec{\Psi }} \mid {\varvec{Z}}) = \frac{\Pr (\psi _1, \psi _2, \dots , \psi _n)\Pr ({\varvec{Z}} \mid \psi _1, \psi _2, \dots , \psi _n)}{\Pr ({\varvec{Z}})}. \end{aligned}$$

The marginal probability \(\Pr ({\varvec{Z}})\) depends on the fraction universe. In the continuous case (\(U = [0,1]\)), it is

$$\begin{aligned} \int _0^1 \int _0^1 \dots \int _0^1 \mathbbm {1}_{\sum _{i=1}^n \psi _i = 1} \Pr (\psi _1, \psi _2, \dots , \psi _n) \Pr ({\varvec{Z}} \mid \psi _1, \psi _2, \dots , \psi _n) \, d\psi _1 \, d\psi _2 \dots d\psi _n \end{aligned}$$

with \(\mathbbm {1}_{\sum _{i=1}^n \psi _i = 1}\) being an indicator function that ensures that the chosen haplotype fractions sum to 1. In the discrete case (e.g., \(U = \{0, 0.5, 1\}\)), it is

$$\begin{aligned} \sum _{\psi _1, \psi _2, \dots , \psi _n \mid \psi _i \in U, i=1,\dots ,n, \sum _{i=1}^n \psi _i = 1} \Pr (\psi _1, \psi _2, \dots , \psi _n) \Pr ({\varvec{Z}} \mid \psi _1, \psi _2, \dots , \psi _n). \end{aligned}$$

By choosing the reference genome, the haplotype set, the prior (\(\Pr (\psi _1, \psi _2, \dots , \psi _n)\)), and the universe, the model can be configured for various kinds of scenarios. In this manuscript we focus on HLA typing, but different universes and priors can, for example, be used for virus lineage quantification.
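To illustrate the discrete case (e.g., a diploid universe), the following sketch enumerates all fraction vectors over \(U = \{0, 0.5, 1\}\), computes the marginal as the sum above, and normalizes the posterior. The per-variant likelihood is a deliberately simplified stand-in (a binomial over alternative read counts); in Orthanq it is supplied by Varlociraptor, not computed like this:

```python
from itertools import product
from math import comb

# Toy inputs (hypothetical): the V, C matrices from the sketch above, inlined
# here for self-containedness, plus per-variant read counts.
V = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
C = [[1, 1, 1, 0], [1, 1, 1, 1], [0, 1, 1, 1]]
reads = [(12, 25), (13, 25), (25, 25), (0, 25)]  # (alt reads, spanning reads) per variant
n = len(V)

def likelihood(psi):
    """Simplified stand-in for Pr(Z | psi): binomial likelihood of the alt
    read counts given the fraction-induced allele frequencies theta_j."""
    lik = 1.0
    for j, (alt, depth) in enumerate(reads):
        num = sum(V[i][j] * psi[i] for i in range(n))
        den = sum(C[i][j] * psi[i] for i in range(n))
        if den == 0:
            continue  # locus not covered by any haplotype with nonzero fraction
        theta = num / den
        lik *= comb(depth, alt) * theta**alt * (1 - theta) ** (depth - alt)
    return lik

# Discrete diploid universe U = {0, 0.5, 1}; flat prior over valid fraction vectors.
universe = [0.0, 0.5, 1.0]
valid = [psi for psi in product(universe, repeat=n) if sum(psi) == 1.0]
joint = {psi: likelihood(psi) / len(valid) for psi in valid}  # prior * likelihood
marginal = sum(joint.values())  # Pr(Z), the discrete-case sum above
posterior = {psi: p / marginal for psi, p in joint.items()}
print(max(posterior, key=posterior.get), max(posterior.values()))
```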
Practical considerations for model evaluation

As can be easily seen, calculating the marginal probability is computationally expensive. In the continuous case, it requires \(\mathcal {O}(q^n)\) evaluations, with q being the number of grid points in the integral (e.g., using quadrature integration) and n being the number of considered haplotypes. To speed up the calculation, one can utilize Markov chain Monte Carlo methods (which we want to explore in the future). Further, it is possible to heuristically discard, upfront, haplotypes that obviously do not reflect the observed VAFs at all. This can be done using linear optimization.

First, we consider for each variant \(v_j\) the maximum a posteriori allele frequency \(\hat{\theta }_j\) as it is reported by Varlociraptor [15]. We restrict the set of considered variant indices to \(W \subseteq \{1, 2, \dots , k\}\) such that each variant \(v_j\) with \(j \in W\) is covered by all the considered haplotypes. This is necessary to avoid the normalization factor when calculating the haplotype fraction induced allele frequency (see “Bayesian latent variable model” section) in the linear program. We minimize

$$\begin{aligned} \sum _{j \in W} \left| \sum _{i=1}^n \psi _i V_{i,j} - \hat{\theta }_j \right| \end{aligned}$$

subject to

$$\begin{aligned} & \psi _{i} \in [0,1], \quad i = 1,2, \ldots ,n \\ & \sum \limits _{i=1}^{n} \psi _{i} = 1. \\ \end{aligned}$$

The objective minimizes the distance between the haplotype fraction induced allele frequency and the observed maximum a posteriori estimate of the allele frequency of each variant, while the constraints ensure that the chosen fractions sum up to 1. By keeping only haplotypes with \(\psi _i \ge t\) and extending this set with haplotypes that have either the same set of variants or at most m variants more or less (with both t and m being command line parameters of Orthanq), we obtain a drastically reduced set of haplotypes that excludes those unlikely to play a role in favorable solutions. The search space for the statistical model can be reduced further by grouping these haplotypes into equivalence classes of haplotypes that are sufficiently similar to become de facto mutually exclusive in any solution, and by evaluating combinations of those classes instead of combinations of all LP-selected haplotypes (we will explore this in the future). The absolute values in the objective can be handled with the usual auxiliary-variable linearization, as sketched below.
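A minimal sketch of this prefilter, assuming scipy is available (Orthanq itself is implemented in Rust; the matrices, MAP estimates, and threshold value below are illustrative): each absolute value is replaced by an auxiliary variable \(e_j\) with constraints \(-e_j \le \sum_i \psi_i V_{i,j} - \hat{\theta }_j \le e_j\), so the LP minimizes \(\sum_j e_j\).

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative inputs: V restricted to variants in W (covered by all
# haplotypes), and Varlociraptor's MAP allele frequency estimates for them.
V = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)  # n haplotypes x |W| variants
theta_hat = np.array([0.5, 0.5, 1.0, 0.0])
n, k = V.shape

# Variables: [psi_1..psi_n, e_1..e_k]; minimize sum of the e_j.
c = np.concatenate([np.zeros(n), np.ones(k)])
# |V^T psi - theta_hat| <= e  <=>  V^T psi - e <= theta_hat  and  -V^T psi - e <= -theta_hat
A_ub = np.block([[V.T, -np.eye(k)],
                 [-V.T, -np.eye(k)]])
b_ub = np.concatenate([theta_hat, -theta_hat])
# Constraint: sum_i psi_i = 1.
A_eq = np.concatenate([np.ones(n), np.zeros(k)]).reshape(1, -1)
b_eq = [1.0]
bounds = [(0.0, 1.0)] * n + [(0.0, None)] * k

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
psi = res.x[:n]
t = 0.01  # stand-in for Orthanq's threshold parameter t
kept = [i for i in range(n) if psi[i] >= t]
print(psi, kept)
```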
Utilizing Varlociraptor to obtain variant allele frequency distributions

As outlined above, the likelihood \(\Pr \left( Z \mid \theta _j = \frac{ \sum _{i=1}^n V_{i,j} \psi _i }{ \sum _{i=1}^n C_{i,j} \psi _i } \right)\) can be obtained directly from Varlociraptor. To this end, Varlociraptor offers the ability to specify so-called variant calling scenarios, which allow configuring all relevant aspects of the underlying latent variable model. A scenario consists of a definition of species specific prior assumptions (like the expected heterozygosity), the definition of samples, and the definition of events of interest, all specified in a YAML based domain specific language. The specifications depend on the use case and can be customized in any way. In order to simplify usage, Orthanq offers a subcommand that hides the scenario specification details and applies reasonable defaults. Specifically, we define a single sample, set the universe to be continuous and uniformly distributed, and define a simple event that just checks for the presence of a variant.

HLA-typing specific considerations

In the following, we show how the Orthanq model can be used to perform HLA typing. Figure 1 provides an outline of the involved steps. First, we generate candidate variants by comparing known HLA alleles against the reference genome. Second, we align reads against the reference genome, taking particular care of the homology induced challenges occurring at HLA loci; this step can be shared with other analyses, e.g. variant calling. Finally, the candidate variants are preprocessed and called with Varlociraptor, providing the required input for HLA typing with Orthanq.

Fig. 1 General workflow of HLA typing with Orthanq. The workflow consists of three main parts: “Prepare”, consisting of candidate variant generation; “Align”, containing the two-step alignment process that each sample undergoes; and “Call”, including both Varlociraptor for calling variants and Orthanq for calling haplotypes, with the latter yielding the type allele prediction and diploid priors to be used with healthy samples (steps with an asterisk (*) are performed by respective Orthanq subcommands).

Generation of candidate haplotypes and variants

Orthanq expects candidate haplotypes and variants to be given in BCF/VCF format. It provides a subcommand for automatically generating them from a given FASTA file with haplotype sequences and a common reference genome for the species. The subcommand uses minimap2 [16] to align each haplotype sequence against the reference genome and infers SNVs, MNVs, and indels from the alignments, representing them in a VCF with one column per haplotype. Thereby, the column fields represent the entries of the matrices V (GT field) and C (C field) (see “Definitions” section). As mentioned above, while the central evidence used for haplotype fraction inference are variant allele frequencies, candidate variants can easily be grouped into locally phased representations like MNVs or complex replacements, since Varlociraptor can handle all of these as well.

For the evaluation of HLA typing presented in this paper, we used as input haplotype sequences all known HLA alleles from the IPD-IMGT/HLA sequence database v3.32, excluding unconfirmed alleles and those with a population allele frequency \(< 0.05\) according to the Allele Frequency Net Database [17, 18], and GRCh38 version 106 from Ensembl as reference genome. For now, we did not group individual variants into MNVs or complex replacements (but aim to pursue this in future work, see “Definitions” section). We run Orthanq independently for each HLA locus; therefore, the candidate matrix was intersected with the HLA genes. A sketch of how the per-haplotype VCF columns translate into the V and C matrices follows below.
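A minimal sketch of that correspondence, assuming pysam and a candidate VCF with one sample column per haplotype as described above (the file path is illustrative, and the exact encoding of the FORMAT field named "C" is an assumption based on the text):

```python
import numpy as np
import pysam

# Candidate VCF produced by the candidate-generation subcommand (path is
# illustrative); one sample column per haplotype, GT encoding V and C encoding C.
vcf = pysam.VariantFile("candidates.vcf.gz")
haplotypes = list(vcf.header.samples)
records = list(vcf)

n, k = len(haplotypes), len(records)
V = np.zeros((n, k), dtype=np.int8)
C = np.zeros((n, k), dtype=np.int8)
for j, rec in enumerate(records):
    for i, hap in enumerate(haplotypes):
        call = rec.samples[hap]
        # V[i, j] = 1 if the haplotype carries the variant (any non-ref GT allele).
        gt = call["GT"]
        V[i, j] = int(any(a not in (0, None) for a in (gt or ())))
        # C[i, j]: locus-spanning entry, assumed to be stored as an integer
        # FORMAT field named "C" (an assumption based on the text above).
        C[i, j] = int(call["C"])
```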
Pangenome based two-step alignment strategy

We adopt a two-step alignment strategy to ensure that reads align to the reference genome in the most accurate and efficient way. Naturally, the genetic variation of the investigated individual is not represented in a linear reference genome. The read mapper therefore has to decide about the optimal placement of a read solely based on the linear reference. This can lead to the situation that a variation in the individual's genome generates a homology to a different location of the linear reference genome, pulling the read to a wrong place. The misplaced reads can subsequently lead to false positive or false negative variants, and thereby to wrongly inferred HLA types.

To overcome this issue, we utilize a graph based pangenome read alignment strategy. In graph based reference pangenomes, differential paths represent the known variants. If the pangenome is built to contain most of the relevant genetic variation of a species, wrong read placements can be reduced substantially and more accurate mapping qualities can be reported [19]. Since, at least on some hardware, pangenome read alignment can still be slower than linear reference genome alignment, and we only require maximum alignment accuracy for reads coming from HLA genes, we conduct a two-step approach. Reads are first aligned to the linear genome (GRCh38.p13) by bwa mem with default parameters. Second, we extract reads that map to any HLA class I or II locus based on their genomic coordinates. While HLA class III loci would technically be possible as well, the IPD-IMGT/HLA database currently does not hold corresponding allele sequences. The extracted reads are then realigned to a human pangenome graph that captures the most relevant genetic variation [19] using vg giraffe [20, 21].
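A minimal sketch of the extraction step, assuming pysam; the file names are illustrative, and the HLA intervals below are rough placeholders rather than the exact coordinates used by Orthanq:

```python
import pysam

# Illustrative GRCh38 intervals for a few HLA loci (placeholders; in practice,
# use the exact coordinates of all HLA class I and II genes).
HLA_LOCI = {
    "HLA-A": ("chr6", 29_940_000, 29_950_000),
    "HLA-B": ("chr6", 31_350_000, 31_360_000),
    "HLA-DQB1": ("chr6", 32_655_000, 32_670_000),
}

# Step 1 output: bwa mem alignments against the linear GRCh38 reference.
with pysam.AlignmentFile("sample.linear.bam", "rb") as bam, \
        pysam.AlignmentFile("hla_reads.bam", "wb", template=bam) as out:
    for contig, start, end in HLA_LOCI.values():
        for read in bam.fetch(contig, start, end):
            out.write(read)
# The extracted reads are then converted back to FASTQ and realigned to the
# pangenome graph with vg giraffe (step 2).
```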
