Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants | BMC Bioinformatics

VGC is a tool designed for analyzing variant data and visualizing VCF files. It combines several technologies and libraries to offer a user-friendly experience (Fig. 1).

Fig. 1 Design and integration of VGC. The query pipeline of VGC offers four distinct search options, as well as knowledge-based support with visualization and analysis. Within a given VCF file, users may query a single gene name or genomic location, or multiple genes or genomic locations simultaneously via file upload. Relevant information pertaining to the queried variants is retrieved from stored files, thus allowing for efficient variant extraction from the uploaded VCF. The identified variants may then be displayed using interactive graphics, such as histograms, node graphs, spreadsheets, heat maps, sample comparisons, and gene data visualization. The pipeline is supported by several integrated databases and packages, allowing for rich analyses and visualizations.

Programming languages, applications and libraries

VGC is a desktop application with a JavaScript frontend and a Java backend. The application is currently built using the webpack [17] module bundler, version 5.86.0, and packaged for macOS, Windows, and Linux using electron-forge [18]. Communication between the frontend and backend is handled by the Axios HTTP library [19]. Packaging with Electron allows the tool to be installed and run on a wide range of platforms and operating systems [20].

UI components are created using the React framework [21], version 18.2.0, and styled using Tailwind CSS [22]. To generate interactive and dynamic graphics for data visualization, the application uses several libraries, including Syncfusion [23], react-force-graph [24], and Recharts [25].
These libraries provide tools for the visualization and analysis of complex data sets.

Integration of publicly available databases

VGC draws from a range of public databases, including the Molecular Signatures Database (MSigDB) for GO terms, as well as KEGG, Biocarta, PID, and Reactome [26,27,28,29,30]. Using these databases, VGC provides users with detailed information about the genetic pathways and functions associated with their variant data. VGC also includes a dynamic link to gnomAD for variant information, allowing users to access and explore genetic variation data from this database [31]. Additionally, the tool includes ClinVar data for pathogenic variant information, providing users with different visualization options for identifying and understanding potentially harmful genetic mutations [32]. VGC supports the human genome assemblies GRCh37 and GRCh38, ensuring compatibility with a wide range of data sets. The tool provides a range of options for exploring genetic variation, and can be tailored to the specific needs of the user through optional phenotype input data.

Dynamic link to gnomAD for variant information

VGC's dynamic link to gnomAD, a widely used database of variant information, gives users access to up-to-date and comprehensive variant data. The decision to link specifically to gnomAD, as opposed to other databases, stems from its role as an aggregation database of genetic variation: it consolidates variant information from a variety of sources into a single resource. Through this link, VGC ensures that users have access to the latest information on variant frequencies and population-specific data.
This integration enhances the accuracy and reliability of variant interpretation, allowing researchers to base decisions on the most current genomic data available.

Incorporation of ClinVar data for pathogenic variant information

The inclusion of ClinVar data within VGC provides information on pathogenic variants and their clinical significance, enabling users to assess the potential pathogenicity of identified variants. Users can access curated information on variants that have been associated with specific diseases or conditions. This integration aids in variant prioritization, helping users focus on variants that may have clinical implications and guiding further investigation.

Compatibility with human genome assemblies GRCh37 and GRCh38

VGC is designed to work with the widely used genome assemblies GRCh37 and GRCh38. By supporting both, VGC enables users to analyze genomic variation data generated using different platforms and datasets aligned to either assembly. This compatibility broadens the range of genomics studies and research projects to which VGC can be applied.

User input and preprocessing

Upon opening, VGC displays a “welcome” page, allowing users to begin analyses for genome assemblies GRCh37 or GRCh38 (Fig. 2). For a given analysis, users may input two files: (1) a required VCF file, and (2) an optional supplemental phenotype file specifying sample groupings.

Fig. 2 VGC user interface on startup. Users may begin an analysis by selecting a genome assembly (GRCh37 or GRCh38) and uploading the respective VCF file.

Extraction and indexing of VCF

When a new VCF file is uploaded to the program, VGC processes it to extract pertinent information, which is then stored in the user’s file system.
A new directory named “VGCGeneratedFiles” is created in the user’s home directory, along with a corresponding directory that follows a specific naming scheme: for each VCF file processed, a directory named “VGC_<filename>” is created. Inside each of these directories, two text files, named info_<filename> and index_<filename>, store important data. The info_<filename> file holds overall file information, such as the VCF file version, total number of samples, total number of chromosomes, number of variants, the header line, and a list of chromosomes in the file. The index_<filename> file contains chromosome-specific information. For each chromosome in the VCF file, the index file lists the starting and ending lines, the starting and ending positions, the number of variants marked as “PASS,” and the count of pathogenic variants for that chromosome. This indexing improves response times for future queries.

Customization to suit individual user requirements by incorporating optional phenotype input data

VGC allows users to incorporate additional phenotype information, aligning the analysis with specific research questions or clinical contexts. With phenotype input data, users can explore genetic variation in the context of specific phenotypic traits, improving the understanding of genotype–phenotype relationships. This customization makes VGC adaptable to a variety of research and clinical scenarios.

User queries and visualization

Query options

Users can search for specific genes or defined genomic ranges within the VCF file, enabling focused analysis of variants. When searching by gene, all variants corresponding to that gene within the VCF file are visualized.
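The per-chromosome index described under “Extraction and indexing of VCF” can be illustrated with a short sketch. This is not VGC's implementation (the VGC backend is written in Java, and the on-disk layout of index_<filename> is not specified here); it is a minimal Python illustration under stated assumptions: the VCF is tab-delimited and sorted so that each chromosome's records are contiguous, and the names `ChromIndex`, `build_index`, and `variants_in_range` are hypothetical. The pathogenic-variant count is omitted because it depends on ClinVar data.

```python
from dataclasses import dataclass


@dataclass
class ChromIndex:
    """Index entry for one chromosome (hypothetical structure)."""
    start_line: int  # 1-based line number of the first record
    end_line: int    # 1-based line number of the last record
    start_pos: int   # POS of the first record
    end_pos: int     # largest POS seen for the chromosome
    n_pass: int = 0  # number of records whose FILTER column is "PASS"


def build_index(lines):
    """Build a per-chromosome index from an iterable of VCF lines."""
    index = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, filt = fields[0], int(fields[1]), fields[6]
        entry = index.get(chrom)
        if entry is None:
            entry = index[chrom] = ChromIndex(line_no, line_no, pos, pos)
        else:
            entry.end_line = line_no
            entry.end_pos = max(entry.end_pos, pos)
        if filt == "PASS":
            entry.n_pass += 1
    return index


def variants_in_range(lines, index, chrom, start, end):
    """Return records on `chrom` with start <= POS <= end.

    Only the lines recorded for that chromosome are scanned.
    `lines` must be a list so that it can be sliced.
    """
    entry = index.get(chrom)
    if entry is None:
        return []
    hits = []
    for line in lines[entry.start_line - 1:entry.end_line]:
        pos = int(line.split("\t")[1])
        if start <= pos <= end:
            hits.append(line)
    return hits
```

Because the index records where each chromosome begins and ends, a range query only needs to scan that chromosome's block of lines rather than the whole file, which is the kind of saving the indexing step is meant to provide.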
Users can also specify a genomic range, in which case variants within the defined interval are extracted and visualized.

The variant extraction process uses the information stored in the index_<filename> file, which, as described earlier, provides the starting and ending lines of each chromosome within the VCF file. The relevant variants are retrieved according to the user’s selection of GRCh37 or GRCh38 as the reference genome assembly. Additionally, users can upload a file containing multiple genes or genomic ranges to query them simultaneously. Variants associated with each queried gene or range are then extracted and visualized.

Visualization options

VGC offers a range of visualization options to meet different analytical needs.

When a VCF file is initially uploaded, a default bar graph view displays all variants in the file by chromosome, with each bar corresponding to the number of variants within a specific chromosome. Users can navigate through their viewing history using forward and backward arrows. Hovering over a bar reveals the number of variants displayed as well as the corresponding genomic range. Clicking on a bar zooms in for a closer examination of variants within the selected data.

Variant data may also be presented in a structured table format. Users may choose to filter, sort, export, or otherwise manipulate data in a spreadsheet-like display.

For analysis of case–control studies, sample groupings, or sample genotypes, VGC provides a node graph visualization option. Users may toggle between 2D and 3D views, facilitating interactive exploration of variant relationships. Moreover, the tool provides Fisher’s Exact Test data for each variant relative to sample groups. The test assesses differences in variant abundance between designated groups (e.g., cases vs. controls) through Monte Carlo simulation. By analyzing a 2 × 3 matrix with default simulations (n = 2000), potential associations between variants and sample groups can be discerned, aiding in phenotype-genotype analyses.

Secure and private local environment for data analysis

VGC is designed to run on a local machine or server, so that users can work with their genomic data in a secure and confidential setting. By avoiding the need to upload VCF files to the cloud, VGC protects sensitive genomic data and addresses privacy concerns. With local deployment, users retain control over their data, which stays within their organization’s infrastructure. VGC requires Java version 1.8 or higher to run and is compatible with Windows, macOS, and Linux.
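The Monte Carlo form of Fisher’s Exact Test mentioned under “Visualization options” can be sketched as follows. The details of VGC's own implementation are not given above, so this is only an illustration under explicit assumptions: the 2 × 3 matrix is taken to be two sample groups by three genotype classes, the null distribution is sampled by permuting group labels (which keeps both table margins fixed), and a simulated table counts as extreme when it is no more probable than the observed one. The function names `table_stat` and `fisher_monte_carlo` are hypothetical.

```python
import math
import random


def table_stat(groups, genotypes):
    """Log-probability of the contingency table, up to an additive constant.

    With fixed margins, a table's probability is proportional to
    1 / prod(cell!), so -sum(log(cell!)) ranks tables by probability.
    Empty cells contribute log(0!) = 0 and can be ignored.
    """
    counts = {}
    for group, genotype in zip(groups, genotypes):
        counts[(group, genotype)] = counts.get((group, genotype), 0) + 1
    return -sum(math.lgamma(c + 1) for c in counts.values())


def fisher_monte_carlo(groups, genotypes, n_sim=2000, seed=0):
    """Monte Carlo p-value for association between group and genotype.

    groups    -- one label per sample (assumed: two groups, e.g. case/control)
    genotypes -- one class per sample (assumed: three classes, e.g. 0/1/2)
    n_sim     -- number of simulated tables (2000 matches the stated default)
    """
    rng = random.Random(seed)
    observed = table_stat(groups, genotypes)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # permuting labels preserves both margins
        if table_stat(shuffled, genotypes) <= observed + 1e-9:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (1 + hits) / (n_sim + 1)
```

For example, `fisher_monte_carlo(["case"] * 10 + ["control"] * 10, [1] * 10 + [0] * 10)` returns a small p-value, since a perfect split between groups is very unlikely under random label assignment, whereas samples that all share one genotype give a p-value of 1.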
