Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index | BMC Bioinformatics

To evaluate the efficiency of the reference sequence index modification, a series of experiments were performed. The modified index contained both SNVs and indels in all experiments except for the one involving population-specific genetic variants. Reference sequences of different versions, namely GRCh38 and GRCh37, were used. Various genetic variant databases were also used for index modification. All experiments were conducted using a computer with two 32-core processors (128 threads) operating at a frequency of 2.9 GHz.Incorporating genetic variants into the GRCh38 referenceTo evaluate the effect of modifying the linear reference sequence index of the GRCh38 (GCA_0000000001405.15) version, two versions of the modified indexes were created. In the first version frequently occurring genetic variants with minor allele frequency greater than 0.05 (the value was chosen based on the results of the experiment, more details in the Supplementary Table 1) from the 1000 Genomes Project (Phase 3) were added to the standard index, taking into account possible haplotypes. For the second version, publicly available assembly for 44 samples (excluding HG002) from the HPRC was used. A series of whole-genome sequencing experiments were performed on paired-end reads with a length of 150 nucleotides for HG002 with 35X coverage from the PrecisionFDA Truth Challenge V2 [21]. The reads were preprocessed using the Cutadapt tool [22].The index was created using the recommended values of k = 21 and w = 11 for aligning short reads. In general, larger values of k lead to a higher probability that reads not containing genetic variants will be aligned, but also increase the probability that reads containing genetic variants will not be aligned [23]. However, our modification of the index should mitigate the negative effect to some extent. Smaller values of k lead to an increase in the algorithmic complexity of the alignment procedure due to the increase in the total number of k-mers. In addition, when k is set to a low value, the number of unique minimizers decreases. Two boundary cases are k = 1 (4 letters) and k = 32 (index out of bounds of the array in the minimap2 code). To demonstrate that modifying the indexing of the minimap2 tool has a positive effect across different values of k and w, not only limited to non-degenerate values, additional experiments were conducted. These experiments included pairs of k and w values obtained by proportionally shifting the original values of k and w by (+ 6, + 3) and (− 6, − 3), resulting in pairs of (15, 8) and (27, 14) for k and w, respectively.The stumbling block in parameter selection during alignment is the mapping parameter f (specified by two integers n and m) which is used to filter out the most frequent minimizers that occur more than n and m times. Such minimizers are typically derived from sequences of repeated elements (if the length of such a sequence exceeds k + w−1, then several equal minimizers will be calculated) and are ignored at the seeding stage for optimisation purposes (the second integer is used in the second round of seeding). Even slight changes in the parameter f can lead to changes in the final results [24]. However, it is unclear what values of f should be used for other sets of k and w values. Assuming that such minimizers are derived from the same regions of the reference genome, in this paper we choose the value of f such that the ratio of ignored minimizers to all other minimizers remains the same.When conducting experiments with a modified reference genome index, the parameter f must also be changed, because adding new minimizers may change the overall ratio of minimizers to each other. To minimize the possible effect of discarding frequently occurring minimizers, for iterations of the modified index experiment, the standard values were increased so that approximately the same number of minimizers was ignored in the seeding step.The Broad Institute’s WARP Whole Genome Germline Single Sample Pipeline [25] was used as the basis for the whole-genome sequencing pipeline. The initial stage of the pipeline was modified to obtain FASTQ-formatted data, which were subsequently aligned using the minimap2 tool. In addition to minimap2, different aligners like Bowtie2, BWA-MEM, BWA-MEM 2 were also added. The aligned reads from the Binary Alignment Map (BAM) file were further processed using the SAMtools tool [17]. Potential sequencing errors were corrected using the GATK BaseRecalibrator and GATK ApplyBQSR tools. Variants were called using the GATK HaplotypeCaller tool. The VCF files obtained from the pipeline were compared to the reference VCF files using the hap.py tool [26] (Fig. 4).Fig. 4Scheme of the whole-genome sequencing pipelineFor all runs, the use of the modified index resulted in a reduced number of multi-mapped reads. (Supplementary Table 2). While the index modified with data from the 1000 Genomes Project showed no significant improvement, the index modified with data from the HPRC showed the best results, outperforming BWA-MEM2. As expected, larger values of k lead to larger increases in metrics and vice versa. The lower F1 score for k = 15, w = 8 may be due to the fact that long indels (with length > 15) are not incorporated into the modified index in this case, or to the poor choice of the genetic variant database. (Fig. 5, Supplementary Table 3). Figure 6 and Supplementary Figs. 1–2 denote an example of genetic variants found using the modified index.Fig. 5The VCF files were compared to the reference files on ‘confident regions’ using HG002 and reference data from the PrecisionFDA Truth Challenge V2. The GRCh38 linear reference sequence (GCA_000001405.15) was usedFig. 6The genetic variants found using the modified index. There are 6 SNVs (chr20: 6,315,599, 6,315,601, 6,316,603, 6,316,649, 6,316,659, 6,316,773), 1 insertion (chr20: 6,316,690) and 2 deletions (chr20: 6,316,667, 6,316,774)Incorporating genetic variants into the GRCh37 referenceSimilar to GRCh38, the GRCh37 (hs37d5) linear reference sequence index was modified. The coordinates of genetic variants from HPRC’s VCF file were preprocessed using the Picard LiftoverVcf tool. Thus, some of the genetic variants were lost during conversion.Iterations of the whole-genome sequencing pipeline were performed on PrecisionFDA Truth Challenge data. Short 150-nucleotide paired-end reads were used for HG002 at 50 × coverage, along with v4.2.1 reference BED (Browser Extensible Data) and VCF files.Experiments performed using the GRCh37 human reference genome showed that the use of the modified index leads to an increase in all metrics for all performed runs.The index modified by genetic variants from HPRC outperforms the BWA-MEM2 results, for all k, w sets used. And the index modified with genetic variants from the 1000 Genomes Project showed better results than the one used for GRCh38, which may be due to both the number of genetic variants added and the database itself. (Fig. 7, Supplementary Table 4).Fig. 7The VCF files were compared to the reference data from the PrecisionFDA Truth Challenge using HG002 as the reference. The linear reference sequence GRCh37 (hs37d5) was usedIncorporating genetic variants into the GRCh37 reference for synthetic dataFor this experiment, short paired-end reads of 150 nucleotides were additionally synthesized based on a 25X-coverage VCF file of the HG002 sample. The ART toolkit [27] was used for synthesis. Reads were not generated for the whole genome but were generated separately for chromosomes 1 and 6. Chromosome 1 was selected because it is the longest chromosome in the genome and chromosome 6 was chosen because it contains the major histocompatibility complex (MHC), a highly variable region. Reference BED and VCF files v4.2.1 were used for validation. In contrast to the other experiments, the standard parameters of the minimap2 tool were used to generate the reference sequence indexes.As in previous experiments with real data, modifying the reference sequence index on synthetic data also resulted in higher final quality metrics (Fig. 8, Supplementary Table 5). However, experiments with synthetic data for other samples may cause difficulties as other publicly available samples may already be included in genetic variant databases, and adding variants associated with these samples will disrupt the entire validity of further experiments.Fig. 8Comparison of the obtained VCF files with the reference HG002 files, synthesized paired-end reads for chromosomes 1 and 6 generated from the reference VCF file of the HG002 sample were used. The linear reference sequence GRCh37 (hs37d5) was usedEvaluating the effect of adding population-specific genetic variantsPrevious experiments added all genetic variants with an MAF > 0.05 from the 1000 Genomes Project dataset to the reference sequence index. The dataset comprises 26 different populations, and some genetic variants are only characteristic of certain populations. In this experiment, we compared the effect of adding population-specific genetic variants on the final quality of the whole-genome sequencing pipeline. To achieve this goal, we generated several modified reference sequence indices by adding genetic variants of a specific population with AF > 0.05 and not found in other specified populations with AF > 0.05 from the gnomAD v3.1.2 dataset: Ashkenazi Jewish, Mixed American, and East Asian. As a result, we obtained three modified indices, each with non-overlapping genetic variants. The intermediate VCF files contained 158,018 single nucleotide variants (SNVs) for the Admixed American population, 311,393 for the Ashkenazi Jewish population, and 487,442 for the East Asian population. In this experiment, as in the first experiment, short paired-end reads of 150 nucleotides were used for HG002 with 35X coverage from the PrecisionFDA Truth Challenge V2, and alignment was performed to the GRCh38 reference sequence (GCA_00000101405.15).According to the experimental findings, the index incorporating mutations from the Ashkenazi Jewish population demonstrated the highest values for Recall, Precision, and F1-Score metrics concerning SNPs (Fig. 9, Supplementary Table 6). However, the results for indels were less unequivocal, which could be attributed to the exclusive use of SNPs in the modification process.Fig. 9The VCF files were compared to the reference on ‘confident regions’ using HG002 and reference data from PrecisionFDA Truth Challenge V2. The GRCh38 linear reference sequence (GCA_000001405.15) was used. The index was modified by adding population-specific mutations from gnomAD v3.1.2

Hot Topics

Related Articles