PopMLvis: a tool for analysis and visualization of population structure using genotype data from genome-wide association studies | BMC Bioinformatics

Population structure can be inferred from Genome-Wide Association Study (GWAS) data. Its analysis focuses on the genetic variation within and between populations by investigating the distributions of alleles and how their frequencies change over time [1]. Sophisticated algorithms implemented in standalone software are often used to infer population structure. A widely used tool is ADMIXTURE [2], which relies on maximum likelihood techniques [3]. Many of these programs provide complementary results, but, to the best of our knowledge, no system seamlessly visualizes the outputs of multiple programs jointly. Another issue is that many programs, such as ADMIXTURE [2], FASTSTRUCTURE [3], STRUCTURE [4], and STRUCTURESELECTOR [5], rarely provide graphical outputs. Moreover, users cannot easily exploit existing related information (e.g., sex, disease status, known ancestry, etc.) while analyzing and interpreting their outputs, as is possible in tools like ClustVis [6].

Here, we considered all the aforementioned drawbacks and developed an interactive platform, named PopMLvis, which carries out a wide range of tasks that a user may need to infer population structure using GWAS data. PopMLvis is flexible as: (1) It supports a variety of input datasets, i.e., raw genotype data, Principal Components (PCs), and admixture membership coefficient matrix; (2) It performs dimensionality reduction using Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and PC-Air, which is a principal component analysis that accounts for relatedness through the genetic relationship matrix (GRM) [7, 8]; (3) It performs various clustering algorithms (e.g., K-means and Hierarchical Clustering); (4) It detects outliers using Isolation Forest, OneClassSVM, and other metrics; (5) It offers an interactive, zoomable, and user-friendly graphical interface; (6) It produces publication-ready figures in various types and resolutions.
In addition, PopMLvis allows users to: (7) Download output files generated within PopMLvis with all the required information, ready for downstream analysis (e.g., association testing); (8) Link metadata with the obtained clustering results; and (9) Integrate estimated genetic diversity indices generated by genetic structure programs (e.g., ADMIXTURE) with the clustering results. Since PopMLvis has a modular design, it is easy to add new modules (e.g., classification) or a new algorithm to the existing modules (e.g., Uniform Manifold Approximation and Projection). PopMLvis is a secure web-based platform. Because of potential privacy concerns, we also provide an offline version that can be installed locally. PopMLvis can be used without writing any script, which makes it more accessible to researchers.

Implementation

PopMLvis consists of three main panels, each with unique functionalities that the user can perform, as depicted in Fig. 1.

1. Input and Machine Learning (ML) Panel: This panel is composed of three modules:

Choose data: As a first step, users choose between their own data and the example data provided with the PopMLvis platform. Users who upload their own data then specify its type, which can be raw or processed. Processed data includes PCA and/or admixture outputs (i.e., the fraction of ancestral origins as obtained by admixture tools [2,3,4,5]); it is reflected on the visualization panel immediately after the upload. Raw data includes genotype data, GRM, projected dataset, and PCA. Here, the user can also perform dimensionality reduction: PopMLvis supports PCA, PC-Air, and t-SNE in 2D and 3D. In addition, t-SNE can be run on top of PCA results to visualize the data in a further reduced space.

Clustering algorithms module: This module includes the K-means, Fuzzy C-means, and Hierarchical Clustering algorithms. The Fuzzy C-means algorithm is suitable when admixture exists between individuals, because admixed individuals can belong to multiple clusters/ancestries.

Outlier detection module: PopMLvis integrates outlier detection algorithms based on statistical metrics (mean, standard deviation, and covariance matrix) and machine learning techniques such as OneClassSVM, Local Outlier Factor, and Isolation Forest.

2. Visualization panel: This panel supports three interactive plot types: (1) Scatter plots: 1D, 2D, 3D, zoom in/out, legend and label naming, download, etc.; (2) Admixture bar charts: the user can investigate the estimated ancestral fraction for each individual with different certainty values; and finally, (3) Dendrograms to visualize the hierarchical clustering of the data. Scatter plots and admixture bar charts are linked together, so a change in one plot is reflected in the other.

3. Option panel: This panel gives users the option to include additional information on individuals, such as sex, age, disease status, etc. This can be reflected on the plots with color/shape differences. The user has the flexibility to define the plot name, labels, resolution, etc. This makes the PopMLvis graphical outputs ready to be integrated in scientific articles.

Fig. 1 PopMLvis pipeline/workflow: (1) Upload and visualize PCA and Admixture results; (2) Dimensionality reduction: PCA, PC-Air, t-SNE 2D and 3D; (3) Clustering: K-means, Fuzzy C-means, and Hierarchical Clustering; (4) Detecting outliers: Isolation Forest, Local Outlier Factor, and Statistical measures; and (5) Download graphical plots and datasheets
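The flow through the Input and ML panel (dimensionality reduction, then soft clustering, then outlier flagging) can be illustrated with a minimal NumPy sketch. This is not the PopMLvis implementation: the function names, the synthetic two-population genotype matrix, the fuzzifier `m = 2`, and the Mahalanobis threshold of 3 are all assumptions chosen for illustration. PCA is done with a plain SVD, Fuzzy C-means uses the standard membership update, and the outlier rule stands in for the "statistical metrics (mean, standard deviation, and covariance matrix)" option.

```python
import numpy as np


def pca_scores(genotypes, n_components=2):
    """Project samples onto the top principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (centers, memberships); each membership row sums to 1."""
    rng = np.random.default_rng(seed)
    # Random soft assignment to start from (assumed initialisation).
    memberships = rng.dirichlet(np.ones(n_clusters), size=len(points))
    for _ in range(n_iter):
        weights = memberships ** m
        centers = (weights.T @ points) / weights.sum(axis=0)[:, None]
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        memberships = inv / inv.sum(axis=1, keepdims=True)
    return centers, memberships


def mahalanobis_outliers(points, threshold=3.0):
    """Flag samples whose Mahalanobis distance from the mean exceeds threshold."""
    diff = points - points.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(points, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(d2) > threshold


# Synthetic data (assumption): two populations with independent allele
# frequencies, genotypes coded as 0/1/2 copies of the alternate allele.
rng = np.random.default_rng(42)
n_snps = 200
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = rng.uniform(0.1, 0.9, n_snps)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, n_snps)),
    rng.binomial(2, freq_b, size=(30, n_snps)),
]).astype(float)

pcs = pca_scores(genotypes, n_components=2)
centers, memberships = fuzzy_c_means(pcs, n_clusters=2)
labels = memberships.argmax(axis=1)   # hard label = most likely cluster
outliers = mahalanobis_outliers(pcs)  # boolean mask, one entry per sample
```

The membership matrix plays the same role as an admixture coefficient matrix: each row gives the degree to which an individual belongs to each cluster, which is why a soft method suits admixed samples better than a hard assignment.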

PopMLvis system architecture

The architecture of PopMLvis consists of three main components (see Fig. 2):

1. Front-End: The front-end is built using ReactJS. React makes our data visualization attractive and efficient. All communication with the back-end is achieved through REST APIs, benefiting from promise-based HTTP clients for the browser. The website is compatible with different screen sizes, making the visualization dynamic.

2. Back-End: The back-end on the server side is served as a REST API and was developed using Flask. We used Gunicorn, which follows a pre-fork worker model in which a master process manages a set of workers. The number of workers corresponds to the number of concurrent requests that our back-end can handle. Gunicorn should only need 4–12 worker processes to handle hundreds or thousands of requests per second. Python was used for the machine learning and computational algorithms. NumPy, pandas, Matplotlib, scikit-learn, and SciPy are among the libraries that were used. To integrate the PC-Air R package, we needed to add another layer of communication between Python and R. In this case, Flask serves as a middle layer, forwarding the front-end request to R and waiting for its response before sending it back to the front-end.

3. Data layer: PopMLvis can handle several types of data with various file extensions, including PLINK binary data (.bed, .fam, .bim), pre-computed PCA results, Genomic Relationship Matrix (GRM), and admixture results. Most of the data is never stored on the server. It is either encrypted inside the body of the request using the HTTPS protocol, or used only on the front-end. The choose data tab keeps the data in the front-end only, so no requests are made to the back-end when settings change. The clustering algorithms and outlier detection modules require the data to be sent to the back-end for computation, but results are returned to the user without storing or keeping any trace of the data. Because of the various encryptions and file extensions, the dimensionality reduction uploads are stored locally with encrypted filenames, processed, and the results are communicated to the user. All stored data is cleaned up by a scheduled cron job.
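As a small illustration of how sample metadata from the PLINK file set could be linked to clustering results, the sketch below parses a `.fam` file and attaches a cluster label to each individual. A `.fam` file has six whitespace-separated columns (family ID, individual ID, father ID, mother ID, sex code, phenotype), with sex coded as 1 for male, 2 for female, and 0 for unknown. The helper names `read_fam` and `attach_clusters`, the example records, and the cluster labels are hypothetical and are not taken from PopMLvis.

```python
import io

# PLINK sex codes; anything else is treated as unknown.
SEX_CODES = {"1": "male", "2": "female"}


def read_fam(handle):
    """Parse PLINK .fam lines into a list of per-individual dicts."""
    records = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        fid, iid, father, mother, sex, phenotype = fields
        records.append({
            "fid": fid,
            "iid": iid,
            "father": father,
            "mother": mother,
            "sex": SEX_CODES.get(sex, "unknown"),
            "phenotype": phenotype,  # kept as the raw string
        })
    return records


def attach_clusters(records, cluster_by_iid):
    """Return copies of the records with a 'cluster' key (None if missing)."""
    return [dict(rec, cluster=cluster_by_iid.get(rec["iid"])) for rec in records]


# Hypothetical three-sample .fam content and cluster labels.
example_fam = io.StringIO(
    "FAM1 S1 0 0 1 2\n"
    "FAM1 S2 0 0 2 1\n"
    "FAM2 S3 0 0 0 -9\n"
)
samples = read_fam(example_fam)
annotated = attach_clusters(samples, {"S1": 0, "S2": 1})
```

Keying the join on the individual ID keeps the metadata table independent of the order in which samples appear in the clustering output.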

Fig. 2 PopMLvis schematic architecture
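The pre-fork worker model of the back-end can be made concrete with a sketch of a Gunicorn configuration file. Gunicorn reads settings such as `bind`, `workers`, and `timeout` from a Python file, and its documentation suggests `(2 x number of cores) + 1` as a starting point for the worker count. The helper `suggested_workers`, the clamp to the 4–12 range mentioned in the text, the bind address, and the timeout value are assumptions for illustration and not the configuration used by PopMLvis.

```python
# gunicorn.conf.py (hypothetical file name and values, for illustration only)
import multiprocessing


def suggested_workers(cpu_count, lower=4, upper=12):
    """Apply the (2 * cores) + 1 rule of thumb, clamped to [lower, upper]."""
    return max(lower, min(upper, 2 * cpu_count + 1))


# Address the master process listens on (assumed).
bind = "127.0.0.1:8000"

# Number of pre-forked worker processes managed by the master.
workers = suggested_workers(multiprocessing.cpu_count())

# Seconds a worker may stay silent before it is restarted (assumed value,
# raised from the default because ML requests can be slow).
timeout = 120
```

With such a file in place, the server would be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the module and Flask application object.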
