Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures

Construction of G2P3D API

We integrated public databases focusing on genes, transcripts and proteins to build an API for seamless mapping of identifiers for genes, transcripts, protein sequences and structures, referred to as the G2P3D API (Fig. 1). The HGNC22 maintains a curated online repository (https://www.genenames.org/) of approved genes and their unique symbols and names for human loci. The Ensembl genome browser (http://useast.ensembl.org/)23 offers access to a wide range of genomic annotations. UniProtKB (https://www.uniprot.org/)21 provides the most current data on protein sequences and functions. These databases each specialize in different aspects of biology and are regularly updated; thus, gene symbols annotated in UniProtKB may have been changed or withdrawn in the HGNC, and UniProtKB IDs annotated in the Ensembl browser may have become obsolete in the latest release of UniProtKB. To address this issue, the G2P3D API integrates UniProtKB, Ensembl and HGNC to ensure it captures the most comprehensive and up-to-date information on genes, transcripts and proteins.

First, we obtained a list of all human proteins from UniProtKB/Swiss-Prot (indexed by UniProt Accession or UniProtAC) and their corresponding HGNC IDs. Then, we retrieved gene symbols for each protein from HGNC with the provided HGNC ID. Subsequently, all Ensembl and RefSeq transcript identifiers and corresponding UniProtKB protein isoform identifiers were obtained via the Mart View API from Ensembl BioMart56 for the human reference genome GRCh38. These data were processed to map each gene symbol (HGNC) to its encoded UniProtAC and then each protein isoform to its corresponding Ensembl and RefSeq transcripts when available from Ensembl. Additionally, canonical protein isoform annotations, as defined by UniProtKB, and the canonical Ensembl and MANE Select26 annotations of transcripts were assembled.
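The join described above (UniProtAC to HGNC ID, HGNC ID to gene symbol, protein isoform to transcript) can be illustrated with a minimal sketch. The function name, the input shapes and the example identifiers are assumptions made for illustration; they do not reflect the actual G2P3D implementation or the raw export formats of UniProtKB, HGNC or BioMart.

```python
def build_identifier_map(swissprot_rows, hgnc_symbols, biomart_rows):
    """Join three sources into symbol -> UniProtAC -> isoform -> transcripts.

    swissprot_rows: iterable of (uniprot_ac, hgnc_id) pairs   (hypothetical shape)
    hgnc_symbols:   dict of hgnc_id -> approved gene symbol    (hypothetical shape)
    biomart_rows:   iterable of (isoform_id, transcript_id)    (hypothetical shape)
    """
    # Step 1: resolve each protein to its current gene symbol via the HGNC ID.
    ac_to_symbol = {}
    for uniprot_ac, hgnc_id in swissprot_rows:
        symbol = hgnc_symbols.get(hgnc_id)
        if symbol is not None:  # skip IDs that have no current symbol
            ac_to_symbol[uniprot_ac] = symbol

    # Step 2: attach every transcript to its protein isoform and gene.
    mapping = {}
    for isoform_id, transcript_id in biomart_rows:
        uniprot_ac = isoform_id.split("-")[0]  # "P01130-1" -> "P01130"
        symbol = ac_to_symbol.get(uniprot_ac)
        if symbol is None:  # isoform of a protein outside the protein list
            continue
        isoforms = mapping.setdefault(symbol, {}).setdefault(uniprot_ac, {})
        isoforms.setdefault(isoform_id, []).append(transcript_id)
    return mapping


# Illustrative input only; the HGNC ID shown is an assumed example value.
example = build_identifier_map(
    [("P01130", "HGNC:6547")],
    {"HGNC:6547": "LDLR"},
    [("P01130-1", "ENST00000558518")],
)
```

Keying the outer dictionary on the gene symbol mirrors the gene-first lookup of the portal, while proteins whose HGNC ID no longer resolves are dropped rather than guessed.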
Next, the PDB6 identifiers for the experimentally solved protein structures per protein were obtained using the PDBe Graph-API (https://www.ebi.ac.uk/pdbe/graph-api/uniprot/unipdb/:UniProtAC/), and the identifier for the structure predicted by AlphaFold25 was retrieved using the AlphaFold API (https://alphafold.ebi.ac.uk/api/prediction/:UniProtAC/). As of October 2023, the G2P3D API, incorporated into the G2P portal (see an example of the API output in Fig. 1b), links 20,292 HGNC genes (Supplementary Table 2) that encode 20,242 UniProtKB/Swiss-Prot human proteins corresponding to 42,413 protein isoforms, via 53,607 Ensembl transcripts and 57,543 RefSeq transcripts, to 77,923 3D protein structures (58,027 experimentally solved and 19,896 computationally predicted).

The G2P3D API is available at https://g2p.broadinstitute.org/api/gene/:geneName/protein/:UniProtAC/gene-transcript-protein-isoform-structure-map/. The Swagger user interface for the API and its documentation are available at https://g2p.broadinstitute.org/api-docs/.

G2P Google Cloud infrastructure

The schematic of the G2P portal infrastructure is presented in Extended Data Fig. 1. The portal frontend is implemented in React.js, which is served by a Node.js backend running on Google Cloud Platform. The RCSB Saguaro 1D Feature Viewer57 and Mol*58 are adopted and customized as protein sequence and structure viewers, respectively, to visualize the frontend data on protein sequences and structures. The backend runs on Google App Engine, a serverless and on-demand compute offering that launches a variable number of backend instances proportional to usage.

Google Cloud Storage (GCS) is utilized as the primary data store for variant and protein feature annotations per gene/protein, alongside an in-memory datastore used on the backend to track the gene–transcript–protein isoform–protein structure mapping. The static data stored in GCS are collected, processed, formatted and uploaded by the portal admin (Extended Data Fig. 1).
To load static data from GCS, the portal requests files directly from the frontend, which reduces latency by avoiding an additional ‘hop’ where data must first travel to the backend before reaching the frontend. From our testing, the minimum observed time for a backend request is a 60-ms round trip, and by requesting files directly from the frontend, the G2P portal saves a minimum of 60 ms per request. To load data from the in-memory datastore, the portal frontend makes requests to backend APIs, and the backend retrieves and returns the relevant records. The datastore is managed directly by the backend server, not by a separate process. In addition to managed data sources, the portal dynamically requests data from external APIs to provide the most current information possible. The full list of external and internal APIs as well as static and dynamic data maintained in the G2P portal are available in Extended Data Fig. 1.

To this end, the G2P portal web app requests the latest protein sequence and structure records directly from UniProtKB21, PDBe32, AlphaFoldDB25 and EMBL-EBI APIs59. In the ‘Interactive Mapping’ module of the portal, users can provide their data (protein residue-wise annotation of variants, features, scores and protein structures) for joint analysis of user data with G2P-provided resources (‘Resources in the G2P portal’ in Results). The Interactive Mapping module can be securely accessed via Google sign-in, and to further ensure data confidentiality, all user-uploaded data remain within the user’s local browser only; therefore, no user-provided data leaves the user’s device. This ensures that the user has full, secure control over their data while simultaneously providing access to the G2P portal’s variants and protein features for joint analysis.
When a user searches for a gene or protein via the Gene/Protein Lookup or as part of the Interactive Mapping workflow, static mapping information is fetched directly with the G2P3D API to connect gene to protein to transcript to sequence to structure. Subsequently, detailed gene-specific and protein-specific data are fetched as static data from GCS and dynamic data from external APIs.

G2P portal sitemap

The homepage is the central hub for navigating to two primary modules of the G2P portal: (1) Gene/Protein Lookup and (2) Interactive Mapping, complemented by a top navigation bar featuring tabs for About, Documentation, Statistics, API, Release Logs and Feedback (Extended Data Fig. 2). The disclaimer for using data in the G2P portal is available on the About page. The Statistics page shows an overview of the latest data in the portal. Across the two main modules of the portal, a suite of visualization tools has been implemented for intuitive exploration of the data: a protein sequence viewer, variant information and protein feature cards, variant and protein feature tables, a protein structure viewer, and a mutagenesis output viewer. Details of these viewers are available in ‘Data visualization tools in the G2P portal’ in Methods and Extended Data Fig. 3.

Users can access the Gene/Protein Lookup module by searching for a human gene or protein name. Upon valid input, users are directed to the gene/protein overview page containing the gene family and protein class information for the input gene and a navigation bar with tabs for five submodules, as follows. (1) The ‘protein sequence annotations’ tab hosts a protein sequence viewer that displays a complete list of protein features aggregated within the G2P portal (‘Protein features in the G2P portal’ in Methods). Users can choose a protein isoform identifier from the list of isoforms available for the selected protein, according to UniProtKB21. By default, protein features are displayed for the canonical protein isoform.
(2) The ‘variant to protein sequence’ tab permits users to select an RNA transcript ID, to map variants from gnomAD9, ClinVar10 and HGMD11 for the selected transcript onto the protein sequence, and displays the mapped variants on the protein sequence viewer on top of protein features (Fig. 5a and Extended Data Fig. 3a). Users can apply filters on variants (different source databases and database-specific filters, for example, AF for gnomAD and pathogenicity for ClinVar) and protein features from an easily (un-)selectable checklist to the left of the sequence viewer. Variant and protein feature data displayed on the protein sequence viewer can also be explored as a table view and are exportable in CSV and PDF formats. By clicking on a specific variant within the sequence viewer, users can expand the variant and protein feature cards with detailed information on the variant and protein features at the variant position (‘Data visualization tools in the G2P portal’ in Methods and Extended Data Fig. 3c). (3) Under the ‘variant to protein structure’ tab, users can find the list of available PDB and AlphaFold protein structures for the selected gene (Fig. 5b). After selecting a structure, users are directed to the ‘structure_map’ page, where they can map variants and protein features onto structures and view them in the protein structure viewer, coupled with the sequence viewer (Extended Data Fig. 3b). Both the protein sequence and structure viewers support dynamic feature and variant selection as described above. Outputs from the structure viewer are exportable in PyMOL-compatible formats. (4) The ‘gene to transcript to protein isoform mapping’ tab provides a table view of the mapping of identifiers across gene, transcript and protein sequences, downloadable in TSV format. The canonical protein isoform according to UniProtKB, the canonical transcript according to Ensembl and the MANE Select transcript for the input gene are indicated in the table.
(5) The ‘additional resources’ tab offers links to external gene information, such as UCSC60, ChEMBL61, DrugBank62, Orphanet63 and OMIM64. Moreover, the portal integrates MAVE data from MaveDB for 40 genes28 (Supplementary Table 7). When available, the ‘additional resources’ tab displays the MAVE data (that is, mutagenesis scores) as heat maps. Additionally, the portal shows the title, description and a short method text describing the MAVE assay. The raw JSON files of scores are available to download alongside a hyperlink to the original source of data.

In the ‘Interactive Mapping’ module, users can start their exploration from either a gene/protein identifier or their own protein structures (respective case studies are presented in Fig. 6 and Supplementary Fig. 11). When starting with a gene/protein identifier, users can provide their target gene of interest as input and then choose a structure (PDB or AlphaFold structure). The portal retrieves the protein sequence and the list of available structures dynamically from the UniProt sequence API and PDB/AlphaFold APIs, respectively. Alternatively, users can start with their own protein structures and upload them in PDB format. In both scenarios, the final step opens a window for annotations, providing a sample format and allowing users to enter their annotations (variants, scores or features). The resulting data are displayed in the ‘view results’ section (Fig. 6a), featuring both sequence and structure viewers.
When starting with a gene/protein identifier, users can also append additional feature annotations, such as protein features and variants, corresponding to the selected transcript or protein isoforms, and map them simultaneously with the user-uploaded data on protein sequences and structures.

Data visualization tools in the G2P portal

Protein sequence viewer

We adopted the RCSB Saguaro 1D Feature Viewer44 and customized it for online visualization of variants and protein features mapped onto the protein sequence with dynamic applications of filters on variants and protein features, referred to henceforth as the ‘protein sequence viewer’ (Extended Data Fig. 3a). The protein sequence viewer in the G2P portal is highly flexible. Variants and features are grouped under collapsible and expandable headers according to variant databases and feature groups and can be easily filtered in and out from the sequence viewer according to AF or pathogenicity criteria (see ‘Resources in the G2P portal’ in Results and ‘Protein features in the G2P portal’ in Methods for further details on variants and features integrated in the G2P portal). Users can download the customized mapping data as residue-wise annotations in CSV or PDB format. For example, Extended Data Fig. 3a shows the mapping of CBS gnomAD missense variants with the filter ‘singleton’ and ClinVar missense variants with the filter Pathogenic/Likely pathogenic, in the context of UniProt sequence features alone, with other protein features collapsed for clarity.

Protein structure viewer

We integrated the Mol* protein structure viewer58 to visualize variants, protein features and scores on protein structures, simultaneously with protein sequence (Extended Data Fig. 3b). Users can map three types of data from sequence to structure: variants (mutation positions, as spheres), scores (continuous variable, as a heat map) and multiclass features (discrete/categorical variable discretely colored by category).
Users can map, review and recolor features as desired, and apply data filters concurrently. For example, a user can filter CBS ClinVar missense PLP variants (orange spheres) and gnomAD synonymous singletons (green spheres) and map them concurrently with the domain annotation (light blue) from UniProtKB on the protein structure (Extended Data Fig. 3b). In the Interactive Mapping module, users can map user-uploaded annotations on the structure and can further add variant and feature annotations from available databases, to inspect user-uploaded data in the context of existing data.

The structure viewer is interconnected with the sequence viewer; when a user hovers over residues in the sequence, they are highlighted in the structure, and vice versa. The G2P portal is dynamically linked with and loads structures from the PDB6 and AlphaFold25. Many AlphaFold structures show high-confidence structured domains surrounded by low-confidence regions, which make the structure harder to analyze by obscuring structured regions and globular domains. As such, the structure viewer provides additional functionality, allowing users to hide residues on AlphaFold structures based on the AlphaFold confidence of the structure (pLDDT). To export data for subsequent analysis, the structure viewer allows users to download structures and all accompanying features in a prepared PyMOL file, which includes both user-uploaded and G2P portal-provided features as annotations in the PyMOL session.

Variant and protein feature table

Users can view per-residue annotation of variants and protein features per gene (or protein) by clicking ‘view as table’ on top of the protein sequence viewer (Extended Data Fig. 3a). For gnomAD variants, the table includes the HGVS annotation of variants (HGVSp, HGVSc), AC and frequency information, homozygote count, and so on (for example, see https://g2p.broadinstitute.org/table/LDLR/P01130-1/ENST00000558518/missense/).
For ClinVar variants, the details include genomic and protein consequences, ClinVar variation type, and other clinically relevant information as available in ClinVar (for example, clinical significance, phenotypes and review status; see https://g2p.broadinstitute.org/clinvartable/LDLR/P01130-1/ENST00000558518/clinvar_single/). Similarly, for HGMD variants, the table lists the variant consequences (genomic and protein), codon change, HGMD confidence and disease annotations (for example, see https://g2p.broadinstitute.org/hgmdtable/LDLR/P01130-1/ENST00000558518/missense/). The protein feature table (for example, see https://g2p.broadinstitute.org/features/LDLR/P01130/P01130-1/) includes all features described in ‘Protein features in the G2P portal’ in Methods. Data in these tabular views can be downloaded as machine-readable text files for further use, except for the licensed HGMD professional data. Note that all variant-level information reflects data available in the source databases (gnomAD, ClinVar and HGMD), and users are referred to the respective databases for the definitions and details of that information.

Variant information and protein feature cards

From the protein sequence viewer, users can click on a variant position to view detailed variant information and protein features for the variant position as summary reports in ‘variant information’ and ‘protein feature’ cards, respectively (Extended Data Fig. 3c). These cards include details of a selected variant, which are also available in the ‘table view’ for the entire gene or protein (as described above in ‘Variant and protein feature table’). For example, in the case presented in Extended Data Fig. 3c, users can click on the CBS ClinVar missense variant at Gly116 on the protein sequence viewer, and a card will be displayed below revealing details for variant p.Gly116Arg, for example, that p.Gly116Arg has been classified as a PLP variant and is associated with homocystinuria.
At the same time, the protein feature card shows a summary of five categories of protein features for the residue position Gly116. The summary highlights that the variant p.Gly116Arg substitutes a small, flexible amino acid (Gly) with a charged amino acid (Arg) (physicochemical properties), that the variant is located in a buried region of the protein structure with an accessible surface area of 7 Å² (structural features), and that this missense variant substitutes a known PTM site (PhosphoSitePlus PTMs). Whenever available, each piece of variant and feature information in the cards is linked to its original source so that users can check for any update in the original data source (Extended Data Fig. 3c).

Mutagenesis output viewer

We implemented a mutagenesis output viewer to display MAVE data from MaveDB28, when available (Supplementary Table 7). Users can view MaveDB data under the ‘additional resources’ tab of the Gene/Protein Lookup module (Extended Data Fig. 2). For single missense mutations, a 21 × N heat map is displayed, where N is the number of positions covered by the MAVE perturbations and the 21 rows correspond to the 20 different amino acids and the stop codon possible at each position. Each value in the heat map corresponds to the score recorded in the MAVE, or the average of multiple scores if multiple scores were recorded for the same mutation. An example is shown in Extended Data Fig. 3d for CBS MAVE readouts collected via DMS-TileSeq at low levels of vitamin B6. Scores show a clear distinction between residues 90 to 390 (low scores, in blue) and residues at the N terminus and C terminus (high scores, in red). For double mutant MAVEs, where two different residues were perturbed concurrently, an N × N heat map is displayed where the row and column each represent one of the two residue positions perturbed in the experiment.
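A minimal sketch of how the single-mutant 21 × N grid could be assembled, averaging repeated scores for the same mutation, is given here. The record layout `(position, alternate residue, score)`, the row order and the function name are assumptions for illustration and are not taken from the portal code or from the MaveDB file format.

```python
from collections import defaultdict

# Row order is an arbitrary choice for this sketch: 20 amino acids plus stop.
ROWS = list("ACDEFGHIKLMNPQRSTVWY") + ["*"]


def build_single_mutant_heatmap(records, start, end):
    """Return a 21 x N grid (list of rows) of mean scores.

    records: iterable of (position, alt_residue, score) tuples (hypothetical shape)
    start, end: first and last residue positions covered (inclusive)
    Cells without any recorded score are left as None.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for position, alt, score in records:
        if start <= position <= end and alt in ROWS:
            totals[(alt, position)] += score
            counts[(alt, position)] += 1

    n_columns = end - start + 1
    grid = [[None] * n_columns for _ in ROWS]
    for (alt, position), total in totals.items():
        # Average when several scores were recorded for the same mutation.
        grid[ROWS.index(alt)][position - start] = total / counts[(alt, position)]
    return grid


demo = build_single_mutant_heatmap(
    [(90, "A", 1.0), (90, "A", 3.0), (91, "*", -0.5)], start=90, end=91
)
```

Leaving unmeasured cells as `None` rather than zero keeps missing data distinguishable from a genuine score of zero when the grid is colored.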
As with the single missense mutations, the value in the heat map corresponds to the reported score from the mutation or the average of all scores reported for the residue pair. MAVEs that use different techniques have different score scales, and their scores require interpretation in the context of the methodology used by the corresponding MAVE. To this end, the G2P portal includes a brief description of the experimental technique and scoring methodology of the paper, as provided by MaveDB, and additional links to the score set page in MaveDB and the associated publication such that users can best understand the experimental conditions under which any specific score of interest was collected. To facilitate deeper analysis, the portal includes a downloadable JSON file with all coding and noncoding variants from MAVE data.

Variant aggregation

We downloaded raw VCF files (https://gnomad.broadinstitute.org/downloads/) for genome and exome datasets from gnomAD9 v2.1.1 and selectively extracted variants that passed all variant filters for quality control (filter = ‘PASS’ flag) and possessed a valid HGVSp annotation. When the same variant was identified in both genome and exome datasets, we summed the AC and the sample count, subsequently calculating the merged AF value. Variant data from ClinVar10 (October 2023 release) were downloaded directly from the FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variants_summary_txt.gz). Variants were filtered based on the reference genome GRCh38 and valid HGVSp annotation. From the HGMD professional release (version 2023.1)11, variants on GRCh38 and with disease-causing state (variantType = ‘DM’ flag, indicating disease mutation) were extracted. Among those variants, we collected variants that have a valid HGVSp annotation retrieved from the Ensembl Variant Effect Predictor65 REST API (https://rest.ensembl.org/vep/human/hgvs/:hgvs_notation/).
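The merging of a gnomAD variant seen in both the exome and genome datasets can be sketched as follows. This is an illustration under stated assumptions: the ‘sample count’ is treated here as the allele-number denominator (AN), the merged AF is taken as summed AC divided by summed AN, and the dictionary layout and function name are hypothetical.

```python
def merge_exome_genome(exome, genome):
    """Merge per-variant counts from two datasets.

    exome, genome: dict of variant_key -> (ac, an)   (hypothetical shape)
    Returns dict of variant_key -> (ac, an, af), where counts are summed for
    variants present in both datasets and af = ac / an (assumed formula).
    """
    merged = {}
    for key in exome.keys() | genome.keys():
        ac = sum(source[key][0] for source in (exome, genome) if key in source)
        an = sum(source[key][1] for source in (exome, genome) if key in source)
        af = ac / an if an else 0.0  # guard against an empty denominator
        merged[key] = (ac, an, af)
    return merged


merged_counts = merge_exome_genome(
    {"v1": (2, 1000)},
    {"v1": (1, 500), "v2": (3, 300)},
)
```

Recomputing the frequency from the summed counts, rather than averaging the two dataset-level frequencies, weights each dataset by its size.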
Variants were excluded under the following conditions: (1) the reference or altered amino acid is not one of the 20 natural amino acids, or (2) the gene is not included in the list of genes from the G2P3D API, which contains only a reliable set of genes present in both the HGNC22 and UniProtKB21 databases. The resulting variant aggregation spans 18,014,632 gnomAD variants, 1,749,628 ClinVar variants and 312,783 HGMD variants mapped on protein sequences.

Variant and feature mapping onto proteins

Genetic variants are annotated on the transcript; for example, variants sourced from gnomAD9 are annotated on Ensembl23 transcripts (ENST-), and those from ClinVar10 and HGMD11 are annotated on RefSeq24 transcripts (NM-). Each variant aggregated from the databases was linked to its corresponding protein isoform IDs using the in-house G2P3D API (Fig. 1b) and then mapped onto its amino acid position upon fetching the protein sequence using the UniProt REST API (https://rest.uniprot.org/uniprotkb/:UniProtAC.json/). Variants were mapped to both canonical and noncanonical protein sequences but only to structures of canonical protein sequences. Finally, proteins’ functional and structural features were annotated onto variant positions at the protein level (‘Protein features in the G2P portal’). The predicted structures cover the full-length protein sequences; however, the experimental structures often cover only parts of the protein and have gaps. We used the polymer_coverage API (https://www.ebi.ac.uk/pdbe/api/pdb/entry/polymer_coverage/:pdbid) to map experimental structure coverage to the sequence space for each chain. We then mapped protein residue positions from sequence to structure and consequently transferred the variants (that is, protein consequence positions) to protein structures, leveraging Mol*58 functionality to properly align variants to positions before and after gaps.
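The effect of partial structure coverage on variant transfer can be illustrated with a small sketch. It assumes the coverage of one chain has already been reduced to a list of observed, inclusive residue ranges in sequence numbering; this simplified input, the function name and the example ranges are hypothetical and do not reproduce the actual polymer_coverage response or the Mol* alignment logic.

```python
def split_by_coverage(variant_positions, observed_segments):
    """Separate variant positions into those inside and outside observed ranges.

    variant_positions: iterable of 1-based sequence positions
    observed_segments: list of (start, end) inclusive ranges resolved in the
                       experimental structure (hypothetical, simplified input)
    Returns (mappable, in_gap) lists preserving the input order.
    """
    mappable, in_gap = [], []
    for position in variant_positions:
        covered = any(start <= position <= end for start, end in observed_segments)
        (mappable if covered else in_gap).append(position)
    return mappable, in_gap


# A chain resolved for residues 1-20 and 35-60, with a gap at 21-34.
on_structure, missing = split_by_coverage([10, 25, 40], [(1, 20), (35, 60)])
```

Only positions in the first list can be shown on the experimental structure; positions that fall in a gap remain visible on the sequence alone.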
We found some limitations with the polymer_coverage API and Mol* coverage detection; for example, a gap in a PTEN structure, PDB 5BUG (that is, a missing region in the crystallographic structure), is incorrectly reported in the API response and incorrectly aligned in the Mol* software.

Protein features in the G2P portal

The G2P portal provides a comprehensive set of protein features on both protein sequences and structures, which include physicochemical properties of amino acids, sequence annotations collected from external databases such as UniProtKB21 and PhosphoSitePlus27, 3D structural features collected from PDB6 and AlphaFold25 and readouts from the MAVEs when available in MaveDB28.

(1) The physicochemical properties of reference amino acids: The 20 natural amino acids are grouped into six categories based on physicochemical properties of their side chain R-groups: (i) Aliphatic: alanine (Ala/A), isoleucine (Ile/I), leucine (Leu/L), methionine (Met/M) and valine (Val/V); (ii) Aromatic: phenylalanine (Phe/F), tryptophan (Trp/W) and tyrosine (Tyr/Y); (iii) Polar/neutral: asparagine (Asn/N), glutamine (Gln/Q), serine (Ser/S) and threonine (Thr/T); (iv) Positively charged: arginine (Arg/R), histidine (His/H) and lysine (Lys/K); (v) Negatively charged: aspartic acid (Asp/D) and glutamic acid (Glu/E); (vi) Special: proline (Pro/P; has a cyclic side chain and cannot make backbone hydrogen bonds), glycine (Gly/G; lacks a side chain, which allows flexibility) and cysteine (Cys/C; has a reactive sulfhydryl group, -SH, in the side chain). In addition to these groupings, the molar mass (g mol⁻¹) and hydropathy index (a numerical measure reflecting the hydrophobicity of a side chain; the larger the number is, the more hydrophobic the amino acid) of each amino acid are shown for the protein sequence.

(2) 3D structural features: The G2P portal provides precomputed annotations on structural features. These features are computed based on AlphaFold-predicted structures, aiming for extensive coverage. Secondary structures of amino acids refer to the local 3D conformations of the polypeptide backbone. DSSP30 (Define Secondary Structure of Proteins) is the standard tool for determining secondary structure by classifying each residue into a three-class structure (H, helix; B, β-sheet/strand; C, loop/coil) or a nine-class structure (G, 3₁₀-helix; H, α-helix; I, π-helix; P, polyproline helix; B, isolated β-bridge; E, parallel β-sheet; S, bend; T, turn; C, loop/coil). We utilized DSSP to annotate both three-class and nine-class secondary structures on AlphaFold structures. When experimental structures are available (for example, from PDBe/SIFTS31), we provide PDBe/SIFTS secondary structures, which are derived from experimental structures, in a separate track. Additionally, DSSP calculates the accessible surface area (in Å²) and the backbone torsional phi/psi angles (in degrees) for each amino acid position within the context of the protein’s 3D structure. Furthermore, we include a per-residue confidence score produced by AlphaFold, known as pLDDT. The score ranges from 0 to 100 and categorizes the confidence as ‘very high’ (pLDDT > 90), ‘high’ (pLDDT > 70), ‘low’ (pLDDT > 50) or ‘very low’ (pLDDT < 50). Residues are color coded accordingly. It is important to note that residues with very low pLDDT scores may be disordered in isolation.

(3) Sequence annotation from UniProtKB: We gathered the sequence annotations that describe various regions, domains, or sites of interest for a protein, elucidating its function, binding, sequence motif, domain/site/region, molecular preprocessing and more. The G2P portal offers 31 selected sequence annotations: active site, binding site, chain, coiled coil, compositional bias, cross-link, disulfide bond, DNA binding, domain, glycosylation, initiator methionine, intramembrane, lipidation, modified residue, motif, mutagenesis, non-adjacent residues, non-standard residue, non-terminal residue, peptide, propeptide, region, repeat, sequence conflict, sequence uncertainty, signal, site, topological domain, transit peptide, transmembrane and zinc finger.

(4) PTM: PTM refers to the covalent and enzyme-mediated modification of proteins to form mature proteins. We collected amino acid positions of seven different PTM types from the PhosphoSitePlus database: (i) acetylation: addition of an acetyl group; (ii) methylation: addition of a methyl group; (iii) O-GlcNAc: addition of N-acetylglucosamine, also known as O-linked N-acetylglucosamine; (iv) O-GalNAc: addition of N-acetylgalactosamine, also known as O-linked N-acetylgalactosamine; (v) phosphorylation: addition of a phosphoryl group; (vi) SUMOylation: addition of SUMO protein (small ubiquitin-like modifiers); (vii) ubiquitination: attachment of ubiquitin.

(5) Readouts from MAVE: MaveDB28 is a public repository dedicated to housing datasets from MAVEs. These datasets primarily result from deep mutational scanning or massively parallel reporter assay experiments. When a gene/protein is available in MaveDB (Supplementary Table 7), amino acid positions displaying variants whose effect falls within the top and bottom 99th percentile are highlighted in the protein sequence and structure viewer. Only the top and bottom 99th percentile are highlighted to keep the visualization clear, but the full data are displayed as heat maps under the ‘additional resources’ tab of the Gene/Protein Lookup module and are downloadable in JSON format (Extended Data Figs. 2 and 3d).
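Two of the feature definitions in the list above, the six physicochemical groups and the pLDDT confidence bands, can be written as simple lookups. The sketch below is illustrative only: the names are hypothetical, and because the text leaves the exact boundary values (50, 70 and 90) unassigned, the handling of those values here is an assumption.

```python
# Six physicochemical groups, keyed by one-letter amino acid codes.
AA_GROUPS = {
    "Aliphatic": "AILMV",
    "Aromatic": "FWY",
    "Polar/neutral": "NQST",
    "Positively charged": "RHK",
    "Negatively charged": "DE",
    "Special": "PGC",
}

# Invert the table so each residue maps to its group name.
AA_TO_GROUP = {aa: group for group, residues in AA_GROUPS.items() for aa in residues}


def plddt_band(score):
    """Map a pLDDT score (0 to 100) to a confidence band.

    Assumption: values exactly equal to 90, 70 or 50 fall into the lower band.
    """
    if score > 90:
        return "very high"
    if score > 70:
        return "high"
    if score > 50:
        return "low"
    return "very low"


# The p.Gly116Arg example: a 'Special' residue replaced by a charged one.
gly_group, arg_group = AA_TO_GROUP["G"], AA_TO_GROUP["R"]
```

A residue-level filter such as hiding low-confidence AlphaFold regions could then be expressed as a comparison against the band returned by `plddt_band`.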

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
