VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison | BMC Bioinformatics

We developed VCF Observer, a graphical web tool that can analyze, compare, benchmark, and visualize VCF files. Although there are various tools for comparing VCF files, none provide a graphical interface or visualizations. Our software tool provides this functionality and makes working with VCF files more accessible.

User interface

VCF Observer’s user interface is separated into two parts: a navigation bar (navbar) on the left side of the screen and a display area covering the rest. Maintaining visual consistency has been a key consideration throughout development to ensure an intuitive and user-friendly experience. The navbar provides users with a way to navigate the website, and the display area presents the results of user actions, such as uploading files and requesting analyses. Three tabs are available in the navbar: Welcome, Upload, and Analyze. Each updates the display area with its information when selected. When first visiting the website, users are greeted with the Welcome tab, which describes the functionality of the website and its overall layout. They can use the button at the bottom of the navbar to continue to the Upload tab. There, they can upload the files they are interested in working with and move on to the Analyze tab. Lastly, they can select an analysis, request it, and view its result. On subsequent visits, users are automatically taken to the Upload tab.

The Upload tab contains 4 upload boxes for the file categories accepted by the application: the compare set, golden set, metadata, and genomic regions. The display area shows 4 upload result summary cards corresponding to these categories. Files can be dragged and dropped onto the upload boxes, or the boxes can be clicked to open the browser’s file selection dialog. If an upload is not successful, the user is notified with a message below the upload box and details of the problem are shown in the display area.
Upon successfully loading a category of files, the number of files loaded in that category is shown below the upload box. When the compare set or the golden set is successfully loaded, the number of variants present in each uploaded file is shown in the respective upload result summary card, listed in descending order of variant count. VCF files describe variants by their location in the genome and the change detected there. VCF Observer uses this information to assign IDs to the variants in each VCF file for use during analysis. Upon successful loading of metadata, the metadata columns describing the compare set are shown in the display area. For genomic regions, the display area lists the filenames of uploaded files and shows only the aggregate count of regions loaded. The Upload tab and its display area showing the results of uploads for all categories can be seen in Fig. 2.

Fig. 2 The Upload tab of VCF Observer. Successful upload results are shown. The upload status of files can be seen under each upload box in the navbar on the left. The successful upload results for all categories of files can be seen in the display area to the right.

The Analyze tab has a radio selector containing the 4 analysis types offered by VCF Observer: Data Summary, Venn Diagram, Clustergram, and Precision–Recall Plot. Data Summary offers 3 views: variant counts per file, variant counts based on compare set grouping, and listings of loaded data. Variant counts per file in either the compare set or the golden set can be viewed as a histogram or as a table. Variant counts based on compare set grouping are generated using the metadata of files in the compare set. Files are grouped together based on dynamically defined properties in the metadata. There are three methods provided when grouping files: union, intersection, and majority (see Implementation for details).
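The semantics of these three grouping methods can be sketched as set operations over variant IDs. The following is an illustrative Python sketch, not VCF Observer’s actual code; the function name and signature are assumptions, and the majority rule shown (a variant is kept if it is present in more than half of the files) is an assumed interpretation pending the Implementation details.

```python
from collections import Counter

def group_variants(variant_sets, method="union"):
    """Combine per-file variant ID sets into one group-level set.
    Illustrative sketch; names and the majority rule are assumptions."""
    if method == "union":
        return set().union(*variant_sets)
    if method == "intersection":
        return set(variant_sets[0]).intersection(*variant_sets)
    if method == "majority":
        # Keep variants present in more than half of the files.
        counts = Counter(v for s in variant_sets for v in set(s))
        return {v for v, n in counts.items() if n > len(variant_sets) / 2}
    raise ValueError(f"unknown grouping method: {method}")

sets = [{"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
group_variants(sets, "union")         # {'a', 'b', 'c', 'd'}
group_variants(sets, "intersection")  # {'c'}
group_variants(sets, "majority")      # {'b', 'c'}
```

Union is the most permissive (any file contributes), intersection the most conservative (all files must agree), and majority sits between the two.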
This style of grouping is also available for the Venn diagram, clustergram, and precision–recall plot analyses. In this view, there is also the option to pivot the table such that some metadata columns appear along the x-axis rather than the y-axis. The last Data Summary view provides a raw listing of the data loaded into VCF Observer in the form of tables.

Venn diagrams are available for visualizing up to 6 sets (files or groups of files). For cases with 6 sets, there is an option to generate a pseudo-Venn diagram to aid visual clarity. The variants in the intersection of all sets are provided for download, as is the figure image.

Clustergrams (heatmaps with dendrograms visualizing the similarity of rows and columns) are provided to visualize similarity among files in the compare set using their Jaccard distance. Both axes of the heatmap contain all files being compared, and comparisons are shown for each file pair. The labels shown for files can be determined from metadata, such that any combination of metadata columns may be used for labeling. Rows and columns can also be color-coded based on their labels to increase the readability of the clustergram. Various coloring schemes for the heatmap are also provided.

Precision–recall plots are provided for benchmarking. For each file being benchmarked, precision is shown on the x-axis and recall on the y-axis in the form of a scatter plot. Labels can be chosen for each file in the same way as described for clustergrams. Additionally, the shapes and colors of data points on the plot can be set according to the values in one or more metadata columns, making patterns that result from differences in file properties more easily visible. All visualizations come with the option of setting the font size, allowing for effective use in various styles of presentation. Lastly, the bottom of the Analyze tab in the navbar contains options for filtering variants prior to analysis.
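The precision and recall values behind these plots follow the standard definitions over variant sets, with the golden set serving as ground truth. A minimal sketch (the function name and the variant key format are illustrative assumptions, not VCF Observer’s internals):

```python
def benchmark(calls, golden):
    """Precision and recall of a call set against a golden set.
    Variants are identified by keys such as 'CHROM:POS:REF:ALT';
    this key format is an illustrative assumption."""
    tp = len(calls & golden)                       # true positives
    precision = tp / len(calls) if calls else 0.0  # TP / (TP + FP)
    recall = tp / len(golden) if golden else 0.0   # TP / (TP + FN)
    return precision, recall

golden = {"chr1:100:A:G", "chr1:200:C:T", "chr2:50:G:A"}
calls = {"chr1:100:A:G", "chr1:200:C:T", "chr3:10:T:C"}
benchmark(calls, golden)  # precision = recall = 2/3
```

Precision penalizes calls absent from the golden set, while recall penalizes golden variants that were missed, which is why the two axes can diverge sharply for the same file.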
VCF files contain a FILTER column describing whether each variant passed the filters imposed by the calling algorithm that found it. Variants that pass such filtering are marked with a value of PASS in their FILTER columns. In VCF Observer, analysis can be restricted to variants whose FILTER column value is PASS. Variants can also be filtered according to whether they fall inside or outside certain genomic regions, which can be provided by the user or by the server. Filtering options based on variant type and chromosome number are also present. The Analyze tab and its display area showing a generated Venn diagram can be seen in Fig. 3.

Fig. 3 The Analyze interface of VCF Observer. A Venn diagram generated with the settings presented in the navbar on the left is shown. The numbers of variants in each intersection of 5 sets are shown, along with the percentages these represent in parentheses.

For all analysis types, upon successful completion of the analysis, an option to download the resultant data is provided. For clustergrams and precision–recall plots, the download option is given in the figures’ interactive windows. Images are provided as PNG files and text-based results as CSV files. Variant intersection sites computed as part of the Venn diagram are provided as compressed VCF files.

Use cases

VCF Observer’s main utility is comparing and benchmarking VCF files and visualizing the results of these operations. To this end, it provides additional capabilities such as filtering and grouping. It can be used, for example, to determine which set of tools is best suited to call variants from a given set of sequence reads, by benchmarking the results produced by each candidate. As another example, given a set of whole genome variant lists derived from various samples, it can be used to filter out intron variants and show the degree of similarity between the variants in the exomes of these samples using a clustergram.
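The filtering steps in such a workflow (PASS filtering, then restricting to genomic regions such as exome intervals) can be sketched as follows. This is a simplified illustration, not VCF Observer’s implementation: real usage would parse VCF and BED files with a library such as pysam, and BED coordinates are 0-based half-open while VCF positions are 1-based, a subtlety glossed over here.

```python
def pass_filter(variants):
    """Keep only variants whose FILTER field is PASS."""
    return [v for v in variants if v["FILTER"] == "PASS"]

def region_filter(variants, regions, keep_inside=True):
    """Keep variants falling inside (or, with keep_inside=False,
    outside) any of the given (chrom, start, end) regions."""
    def inside(v):
        return any(chrom == v["CHROM"] and start <= v["POS"] <= end
                   for chrom, start, end in regions)
    return [v for v in variants if inside(v) == keep_inside]

calls = [
    {"CHROM": "chr1", "POS": 100, "FILTER": "PASS"},
    {"CHROM": "chr1", "POS": 500, "FILTER": "LowQual"},
    {"CHROM": "chr2", "POS": 100, "FILTER": "PASS"},
]
exome = [("chr1", 50, 300)]
region_filter(pass_filter(calls), exome)
# keeps only the chr1 PASS variant at position 100
```

Setting `keep_inside=False` gives the complementary "outside regions" filter mentioned in the text.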
To concretely demonstrate these capabilities, we used two pre-existing datasets provided by the SEQC2 consortium: WGS-based (whole genome sequencing) germline variant call sets and WES-based (whole exome sequencing) somatic variant call sets [13]. We chose these datasets because they include VCF files produced from the same sequence reads using a variety of tools (differing aligners and callers in both cases). Comparing such files was our primary motivation during development.

From the germline calling dataset we selected 4 VCF files created with two different aligners (BWA and Bowtie2) and two different callers (GATK’s HaplotypeCaller and VarScan2). We chose files that had GRCh38 as their reference and that were marked as being derived from well A01. To observe how these files differed from one another, we produced a Venn diagram (Fig. 4A). Here, we saw that, of the ~ 5.7 million unique variants (~ 19 million including duplicates), ~ 4.1 million were present in all 4 files, corresponding to ~ 71.4% of all unique variants. The two BWA files shared a distinct ~ 5.8% between them, while the Bowtie2 files shared a distinct ~ 1.3%. The VarScan2 files shared another ~ 10.6% in addition to the prior ~ 71.4%, and the HaplotypeCaller files shared another ~ 3.9%. We concluded that the VCF files produced from BWA alignments were more similar amongst themselves than those from Bowtie2 alignments. Similarly, the VCF files generated by VarScan2 were more similar to each other than those generated by HaplotypeCaller. To obtain a simpler overview of this similarity information, we generated a clustergram (Fig. 4B). Here we saw that the lowest similarity score (Jaccard distance) was ~ 0.75, loosely mirroring the ~ 71.4% of variants shared amongst all files. Jaccard distance is calculated pairwise for each file pair and thus is not directly related to overall similarity, although we can observe a loose correlation in this case due to the latter’s high value.
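The pairwise computation just described can be sketched over sets of variant IDs (e.g. CHROM:POS:REF:ALT keys). The similarity score read from the clustergram corresponds to the Jaccard index |A ∩ B| / |A ∪ B|, where 1.0 means identical sets; the function and variable names below are illustrative, not VCF Observer’s code.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two variant ID sets
    (1.0 means identical sets). Illustrative sketch."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def pairwise_scores(named_sets):
    """Score every file pair, as fed into a clustergram."""
    return {(x, y): jaccard(named_sets[x], named_sets[y])
            for x, y in combinations(named_sets, 2)}

files = {
    "bwa_hc": {"chr1:100:A:G", "chr1:200:C:T", "chr2:50:G:A"},
    "bt2_hc": {"chr1:100:A:G", "chr1:200:C:T"},
}
pairwise_scores(files)  # the single pair scores 2/3
```

Because each score involves only two files, a high pairwise score does not by itself imply a large intersection across all files, which is why the clustergram and the Venn diagram are complementary views.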
Files were clustered by their callers, while aligners appeared to have a less significant effect on similarity. Lastly, we wanted to benchmark these VCF files. We chose the highly reproducible regions provided by the SEQC2 consortium [14] as a golden set and produced a precision–recall plot (Fig. 4C). Recall values for all 4 VCF files were > 0.99, with the VCF file generated using BWA and HaplotypeCaller having the highest value. Precision values, on the other hand, showed greater variation, in the range of 0.67–0.81, with VarScan2’s results being worse than HaplotypeCaller’s. To demonstrate VCF Observer’s variant intersection download feature, we downloaded the intersection variant set provided by the Venn diagram analysis described above and uploaded it as a golden set. We then generated a precision–recall plot using this new golden set (Fig. 4D). All recall values were 1.0, because the golden set in this case was a subset of every file. We observed that the VCF file obtained using Bowtie2 and HaplotypeCaller was the most similar to the intersection set of all four files, while the file obtained using Bowtie2 and VarScan2 was the least similar.

Fig. 4 Visualizations generated using 4 VCF files from the SEQC2 consortium’s germline WGS analysis of NA10835. A Venn diagram comparing variants in VCF files. B Clustergram showing pairwise Jaccard distances of VCF files. C Precision–recall plot calculated based on highly reproducible regions created by the SEQC2 consortium. D Precision–recall plot where the golden set is the intersection of all 4 VCF files.

From the somatic calling dataset we selected 12 VCF files created using 2 different aligners (BWA and Bowtie2), 3 different callers (Mutect2, SomaticSniper, and Strelka), and 2 different library preparation methods. To get an overview of the VCF files, we generated a histogram showing variant counts (Fig. 5A).
Here we noticed that some VCF files produced by Strelka had significantly more variants (on the order of 100,000), while those produced by SomaticSniper had significantly fewer (on the order of 1000). The rest of the files had variant counts on the order of 10,000. This indicated the possibility that some of these files had been PASS-filtered by their respective callers while others had not. For this reason, we applied a PASS filter to all files and generated a new histogram (Fig. 5B). In this visualization we saw that all VCF files had variant counts on the order of 1000, confirming our prior conjecture. We performed all subsequent analyses with the PASS filter option enabled. We created a CSV file containing the aligner, caller, and library preparation associated with each file, so that we could group and label the files dynamically. We produced a Venn diagram after grouping the files by caller using the union operation (Fig. 5C). We saw that, of all unique variants, ~ 22.2% were common to all three callers’ groups. Strelka had variants in common with Mutect2 and SomaticSniper at ~ 6.8% and ~ 6.7%, respectively. We also observed that SomaticSniper had twice as many uniquely identified variants as the other two callers. To see the effect of library preparation on the similarity between the VCF files produced, we generated a clustergram where labeling excluded the library preparation, so that files differing only in that aspect were marked with the same color (Fig. 5D). This showed that, except for files produced by Mutect2, the library preparation method explained the least difference between files and the caller explained the most. In the case of Mutect2, however, there was a remarkable degree of similarity (Jaccard distances of ~ 0.81 and ~ 0.82) when the library preparation was the same (we regenerated the figure with library preparation type as a label to be certain of this).
For Mutect2, each file pair sharing a library preparation type (but differing in aligner) was more similar to other callers’ files than to the Mutect2 pair with the other library preparation (Mutect2 & LibPrep1 differed significantly from Mutect2 & LibPrep2). We benchmarked this data using high-confidence regions taken from [15]. We first produced a precision–recall plot showing values for all 12 VCF files, where data points were labeled by their aligners and callers. Data point colors showed caller type and their shapes showed library preparation type (Fig. 6A). All VCF files had recall values > 0.85 and precision values in the range of 0.21–0.38. VCF files produced by SomaticSniper had the lowest precision and recall values. One of the two VCF files produced using BWA and Strelka had the highest recall value at ~ 0.99, while one of the two produced using BWA and Mutect2 had the highest precision value at ~ 0.37. Next, to investigate the effect of grouping pairs of files produced using the same aligner and caller combination, we generated two precision–recall plots in which files differing only in library preparation type were combined. In one plot, grouping was done using the intersection of the files (Fig. 6B), and in the other, it was done using their union (Fig. 6C). In both cases, the highest recall was achieved by the variant list created using BWA and Strelka. Without grouping, the highest recall value for this combination was ~ 0.99. When grouping via union, there was a marginal increase in recall; when grouping via intersection, recall decreased to ~ 0.94. When grouping via union, the highest precision was achieved by the variant list obtained using Bowtie2 and Strelka (~ 0.31, as opposed to ~ 0.32 without grouping), in contrast to BWA and Mutect2, which produced the highest precision without grouping. This can be attributed to the decrease in the precision values of variant lists associated with Mutect2 when grouping via union. This effect, however, was not present when grouping via intersection.
Mutect2’s variant lists produced using BWA and Bowtie2 had higher precision values, at ~ 0.44 and ~ 0.45 respectively, compared to values < 0.40 without grouping.

Fig. 5 Visualizations generated using 12 VCF files from the SEQC2 consortium’s somatic WES analysis of HCC1395BL (normal) and HCC1395 (tumor). B–D were produced after the 12 VCF files were PASS filtered. A Histogram of variant counts for each file with no preprocessing applied. B Histogram of variant counts for each file with the PASS filter applied. C Venn diagram comparing variants, generated after files produced by the same callers were grouped via union. D Clustergram showing pairwise Jaccard distances of VCF files. SomSnip: SomaticSniper

Fig. 6 Precision–recall plots generated using 12 VCF files from the SEQC2 consortium’s somatic WES analysis of HCC1395BL (normal) and HCC1395 (tumor), with high-confidence regions created by the SEQC2 consortium as the golden set. A Scatter plot showing benchmarking results for all 12 VCF files. B Scatter plot showing benchmarking results for the intersections of VCF files sharing the same aligner and caller. C Plot showing benchmarking results for the unions of VCF files sharing the same aligner and caller.

Performance

To produce an overview of VCF Observer’s performance, we tested various aspects of its functionality on a 2022 M2 MacBook Air with 16 GB of RAM. We ran 4 types of tests: comparing two VCF files to generate a Venn diagram, benchmarking a VCF file to produce a precision–recall plot, applying genomic regions filtering (using a BED file) to a VCF file, and applying a PASS filter to a VCF file. Each test was performed using 5 different VCF file sizes, giving a total of 20 test configurations.
Each test configuration was run 10 times, and the averages are presented in Table 2.

Table 2 Performance test results showing times (in seconds (s)) for various operations performed by VCF Observer

The VCF file sizes used were 1000, 10,000, 100,000, 1,000,000, and 10,000,000 variants. For the tests generating Venn diagrams and precision–recall plots, two VCF files were used, each containing the stated number of variants. In the BED filtering test, a BED file listing exome regions was used.

When working with 100,000 variants or fewer, VCF Observer can provide analysis results in less than 3 s (assuming an analysis consists of both filtering options and a visualization). For 1 million variants, it produces results in 10–30 s. For 10 million variants, results are produced in 4 min or less.

We performed the genomic regions filtering tests twice: once with an unoptimized and once with an optimized algorithm (see Implementation for details). Comparing the two sets of results, we saw that the unoptimized version of our algorithm slowed down relative to the optimized version as the number of variants being processed increased. For tests with 1000 and 10,000 variants, the unoptimized algorithm had a shorter run time, whereas it ran twice as slowly for tests with 1 million and 10 million variants. The theoretical time complexity differences described in Implementation were not observable in the test results, because only the variant list sizes were varied while the number of genomic regions was held constant.
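One common way to obtain the crossover behavior observed here (a higher fixed cost, but better scaling in the number of variants) is to sort the regions once and binary-search per variant. The following is a sketch under that assumption, not necessarily VCF Observer’s actual algorithm; chromosome handling is omitted for brevity.

```python
from bisect import bisect_right

def filter_positions(positions, regions):
    """Keep positions that fall in any of the sorted, non-overlapping
    (start, end) half-open regions. Binary search gives O(n log m)
    per-variant lookups instead of the naive O(n * m) scan over all
    regions. Illustrative sketch only."""
    starts = [start for start, _ in regions]
    kept = []
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # rightmost region with start <= pos
        if i >= 0 and pos < regions[i][1]:
            kept.append(pos)
    return kept

filter_positions([5, 15, 25], [(0, 10), (20, 30)])  # -> [5, 25]
```

With the region count m held constant, as in the tests above, both the naive and the binary-search versions scale linearly in the number of variants n, which matches why the theoretical complexity gap did not show up in the measurements.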
Future work

VCF Observer provides comparisons of VCF files and visualizes these comparisons through a user-friendly graphical interface. During future development, we plan to provide more varieties of visualization, such as violin plots to show the read depths of variants in VCF files and idiograms to mark the positions of variants, allowing patterns across different VCFs to be clearly visible. We also plan to normalize variants so that different representations of the same underlying variation are not treated as distinct. Furthermore, we plan to provide a variant comparison methodology capable of assessing calls based on their similarity to expected results. A contemporary tool that provides this functionality on the command line is vcfdist [16].

Implementing a dedicated screen for users to directly add metadata through the web interface would improve user experience and data organization. Providing a metadata extraction option that leverages VCF file headers and filenames to deduce certain aspects of metadata would reduce manual input effort. Providing long-term storage of user data and analyses by implementing user accounts would help users compare their past analyses with one another as well as rerun them with different options.

Commonly used golden sets could be made available by the server directly. The option to use precompiled high-performance software tools for VCF file filtering could be provided to reduce processing times. Lastly, preserving VCF file annotations and allowing their use within the application for filtering and analysis would allow for greater flexibility in VCF Observer’s usage.
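The variant normalization mentioned above typically involves trimming shared prefix and suffix bases and left-aligning indels against the reference genome. A simplified sketch of the trimming step only (left-alignment against the reference sequence is omitted, and the function name is illustrative):

```python
def trim_variant(pos, ref, alt):
    """Trim shared suffix and then prefix bases so that equivalent
    representations of the same change collapse to one form.
    Simplified sketch: full normalization also left-aligns indels
    against the reference genome, which is omitted here."""
    # Trim shared trailing bases.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, advancing the position.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

trim_variant(100, "GCA", "GTA")  # -> (101, "C", "T")
```

Applying such a step before ID assignment would let two callers that report the same SNV with different padding receive the same comparison key.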
