OmicScope unravels systems-level insights from quantitative proteomics data

Overview

To develop OmicScope, we surveyed computational tools designed for downstream proteomics analysis, with a particular focus on those capable of performing differential analysis. Our survey identified 15 computational environments, which were evaluated based on criteria such as tool distribution (package, web tool, desktop application), input formats, features for conducting differential proteomics analysis, capacity for enrichment analysis, data integration capabilities, meta-analysis functionalities, export options, and code availability. The details of all evaluated tools are provided in Supplementary Data 1, which served as the basis for defining the features of OmicScope.

OmicScope was designed to be an integrative pipeline for proteomics data analysis, encompassing differential proteomics, enrichment analysis, and meta-analysis. Developed as a Python package, the OmicScope pipeline includes three primary components: OmicScope, EnrichmentScope, and Nebula (Fig. 1). Once quantitative data is inserted into the workflow, the OmicScope module determines differentially regulated proteins (DRPs). These DRPs are then subjected to enrichment analysis using the EnrichmentScope algorithm, aiming to elucidate key biological features. Additionally, individual studies analyzed using the OmicScope and/or EnrichmentScope algorithms can be exported and used as input for Nebula. In Nebula, users can analyze multiple studies collectively, establishing correlations and identifying shared features across independent results. Each component, when activated, generates a set of figures and tables, streamlining user interactions for both the package and web application.

Fig. 1: OmicScope workflow. The OmicScope workflow begins with the import of data from various sources, including outputs from proteomics tools and generic formats. Once imported, the OmicScope module determines differentially regulated proteins.
These proteins are then directed to the EnrichmentScope module, which facilitates over-representation and gene-set enrichment analyses. Data derived from both OmicScope and EnrichmentScope can be seamlessly used as input for Nebula, a module that integrates results from multiple studies using a systems biology approach. Each module within OmicScope is equipped with its own visualization toolset and allows for the export of tables, vectorized images, and graphML files.

To make OmicScope’s pipeline accessible to non-programmers, we implemented all of the package’s functionalities in a user-friendly and highly interactive web application (see details in Appendix and Supplementary Figs. 13–15). OmicScope Web allows users to extract proteome information from dynamic plots, including bar plots, dot plots, and networks. In addition to providing explanations for each plot and its corresponding parameters, the web application enables users to customize the OmicScope workflow to meet their specific requirements. To enhance user experience by minimizing clicks and simplifying data handling, the OmicScope web application automatically generates all results and figures based on user input, which are accessible throughout the analysis process.

Furthermore, the OmicScope workflow prioritizes the reporting of proteomics results to the scientific community, providing a broad range of export methods, including tables, figures, and networks. Our tool exports figures in a vectorized and high-definition manner, tables containing the data used for plots, and networks using the universal graphML file format.

Input methods

Proteomics research exhibits substantial diversity in experimental workflows, including mass spectrometer selection, acquisition modes, fragmentation methods, and quantitative approaches.
This inherent diversity demands a wide array of software tools for protein identification and quantitation, each with its strengths and limitations, leading to interoperability challenges [10,23].

To address these challenges, OmicScope offers eight data import methods (see Methods, Appendix, and Fig. 1), including six tailored to widely adopted proteomic software: MaxQuant, PatternLab V, DIA-NN, Proteome Discoverer, FragPipe, and Progenesis QI for Proteomics. These methods import outputs from the respective software, taking their unique characteristics into account. For software not yet integrated into OmicScope, the “General” method allows users to create custom spreadsheets for input into the OmicScope pipeline. This method accepts generic expression files, making OmicScope compatible with data from various omics platforms, such as genomics and transcriptomics. The “General” method can either perform differential proteomics analysis or import existing statistical analyses from the imported spreadsheets.

Aiming to provide an import method that combines succinctness, simplicity, and speed, we implemented the “Snapshot” method, in which users can import proteomics results containing the assessed proteins along with their associated fold changes and statistical outcomes. While Snapshot limits the number of plots that can be generated (refer to Supplementary Data 2), this method substantially improves interoperability across studies, especially given that many studies provide only restricted information from their analyses, as demonstrated in the cases of Nie 2021 and Wang 2021. By integrating all of these input methods, OmicScope stands out among the surveyed platforms as the one capable of handling the widest variety of files (see Supplementary Data 1).

OmicScope: the core module

The central module of OmicScope shares the same name as the algorithm described herein.
This module plays a pivotal role in organizing data, performing normalization and data imputation, filtering proteins, and carrying out differential proteomics analysis. It identifies differentially regulated entities and generates ready-to-publish figures (Fig. 2A).

Fig. 2: OmicScope performs differential proteomics analysis and data visualization. A OmicScope offers various data import methods, including established software and generic approaches. Once data is successfully imported, OmicScope defines the data architecture, performs or imports differential proteomics analysis, filters data, identifies differentially regulated proteins (DRPs), and generates tables, figures, and exports. In the Crunfli study (provided as a Source Data file), a two-tailed t-test was performed, followed by multiple hypothesis correction using the Benjamini-Hochberg (BH) approach. B–E Illustrative figures generated by OmicScope: B Bar plot displaying the count of identified proteins and DRPs. C Volcano plot with accompanying density plot highlighting the top 10 DRPs based on adjusted p value. D Heatmap of DRPs with adjusted p value less than 0.002, with colors representing z-score. COVID-19 patients and controls are denoted as dark cyan and purple, respectively. E Boxplot depicting the abundance of proteins identified from the MAPK family. For this import method, the boxplot considers 38 MS runs coming from 19 subjects. Data are presented as median (center), quartiles (box bounds), whiskers extending to 1.5 × the interquartile range, and outliers beyond the whiskers. F Protein-protein interaction network generated by OmicScope with DRPs having an adjusted p value less than 0.005. In the left graph, proteins are colored based on log2(fold change), while the right graph represents proteins colored according to their communities identified using the Louvain algorithm.

To provide maximum versatility, pre-processing and statistical steps are optional within the OmicScope pipeline.
When no statistical results are provided, OmicScope autonomously conducts statistical analysis, filtering data based on pre-specified parameters and selecting the most suitable statistical test based on the data architecture (see Appendix section for details). This flexible architecture accommodates various experimental designs, including static and longitudinal approaches. In static cases, comparisons between independent groups are typically made using t-tests for binary comparisons or one-way ANOVA for more than two independent conditions. In longitudinal analyses, OmicScope employs the Storey approach [19], which models the change of differentially regulated genes over time using natural cubic splines. In this longitudinal approach, statistical evaluations consider both within-group and between-group comparisons. Once nominal p values are calculated, OmicScope performs Benjamini-Hochberg multiple hypothesis correction [24]. By default, OmicScope designates proteins as differentially regulated if their adjusted p value is below 0.05, although users can define other parameters, such as fold-change and nominal p value cutoffs (Fig. 2A).

The OmicScope module offers a visualization toolkit for data overview, clustering, and protein-specific features (Supplementary Fig. 1). In the overview category, users can generate bar plots, volcano plots, MA-plots, and dynamic range plots, which visualize data distribution and normalization and provide initial insights into the dataset. The clustering category includes functions for hierarchical clustering, principal component analysis, and K-means clustering, allowing users to compare samples based on protein abundances and assess sample clustering. In this category, users can select various metrics and calculation methods to perform clustering analysis for static and longitudinal experimental designs. Lastly, the protein-specific category aims to extract deeper insights about selected proteins, using bar plots and box plots.
In this category, OmicScope also integrates with the STRING API to provide a protein-protein interaction (PPI) network of DRPs, making it one of the few environments that couple quantitative proteomics with PPI analysis (Supplementary Data 1).

To demonstrate the capabilities of OmicScope, we employed previously published COVID-19 studies as illustrative examples (refer to the Methods section for details). These studies employed quantitative proteomics and transcriptomics to investigate SARS-CoV-2’s effects on various tissues. Specifically, we conducted a single analysis example, showcasing both differential proteomics and enrichment analysis, using proteins quantified by Crunfli [25] in the brain tissue of patients who succumbed to SARS-CoV-2 complications. In this study, the authors meticulously detailed the processing parameters and furnished quantitative outputs from the analysis, ensuring reproducibility and enabling result comparisons.

Crunfli’s dataset was imported into OmicScope with default parameters, filtering out contaminants [26], which resulted in the identification of 721 DRPs (Fig. 2B). After OmicScope defines the DRPs, proteomics figures can be generated using a dedicated function for each plot type. For scatter plots and heatmaps, users can specify gene names as arguments to highlight specific target proteins (as demonstrated in Fig. 2C). Additionally, for clustering analyses, users can optionally set a p value cutoff to filter proteins and conduct analyses based on statistical significance (Fig. 2D).

In Crunfli’s dataset, for instance, we selected the MAPK family, including MAPK1, MAPK14, and MAPK9, all of which showed upregulation in SARS-CoV-2 infection compared to the control group (Fig. 2E). Moreover, the protein-specific category includes a function for exploring PPIs by querying the STRING database [27]. In this network analysis, users can identify communities based on the Louvain algorithm [28] and filter data based on protein p values and/or specific proteins.
In our analysis, we filtered proteins based on a p value threshold (adjusted p value < 0.005), applied the Louvain algorithm to conduct modularity analysis, and exported the data to facilitate data visualization (Fig. 2F).

While the Crunfli dataset offers advantages for our pivotal analysis, it does pose a technical limitation due to the relatively small number of proteins evaluated in the study. To address this, we challenged OmicScope against a benchmark dataset provided by Meier [29] and Demichev [30]. Meier spiked two distinct concentrations of yeast digest into HeLa digest, while Demichev employed FragPipe and DIA-NN workflows, resulting in the evaluation of over 12,000 proteins, specifically identifying DRPs from the yeast digest. Using OmicScope, we identified two distinct expression profiles highlighting differential abundance among yeast protein concentrations, as demonstrated by Meier and Demichev (Supplementary Fig. 2). These outcomes highlight OmicScope’s capacity to handle varying data formats and sizes while performing a reproducible analysis of differential expression.

To illustrate OmicScope’s capabilities in conducting longitudinal analysis, we analyzed data provided by Grossegesse [31], who investigated proteome changes induced by SARS-CoV-2 in CaLu cell lines across four time points: 2, 6, 10, and 24 h post-infection (Supplementary Fig. 3A). In this analysis, OmicScope identified 614 proteins that were differentially regulated (p < 0.05) between the SARS-CoV-2 and Mock groups. Examination of the K-means plots revealed three protein clusters, wherein SARS-CoV-2 induced a distinct protein pattern compared to the Mock group (Supplementary Fig. 3B). Further analysis of the PPI network derived from proteins assigned to cluster 0, which exhibited the highest fold-change variation, demonstrated up-regulation of proteins associated with interferon signaling during SARS-CoV-2 infection, consistent with previous findings [31,32].
These results underscore OmicScope’s integrative feature, wherein proteins identified through K-means clustering can be leveraged to explore PPIs and elucidate molecular mechanisms underlying biological phenomena.

EnrichmentScope: enhancing biological insights

One of the critical and challenging aspects of omics studies is extracting meaningful biological insights from hundreds or even thousands of differentially regulated entities. A commonly applied method for this purpose is enrichment analysis, wherein experimental gene or protein sets are compared against pre-established datasets, which may encompass biological pathways, molecular functions, kinase-associated genes, and other relevant categories. EnrichmentScope addresses this challenge by furnishing specialized enrichment analysis capabilities.

After executing the OmicScope module, users can proceed to perform enrichment analysis with the EnrichmentScope module, choosing between two approaches: Over-Representation Analysis (ORA, conventional enrichment) or Gene Set Enrichment Analysis (GSEA). Then, users must select specific databases, choosing among the 224 libraries offered by Enrichr [18]. Optionally, EnrichmentScope can also consider all proteins evaluated in the study as the background for enrichment analysis. Once the analysis is performed, the module provides a result table and a toolkit of visualization functions, including the ability to export quantitative and enrichment data (Fig. 3A, Supplementary Fig. 4).

Fig. 3: EnrichmentScope employs a systems biology approach for enrichment analysis based on data provided by OmicScope. A EnrichmentScope performs Over-Representation Analysis (ORA) or Gene Set Enrichment Analysis (GSEA) using Enrichr libraries. For Crunfli’s dataset (provided as a Source Data file), we applied the ORA workflow, performing a Fisher’s exact test with multiple hypothesis correction using the BH approach. B–F Depiction of figures generated using the EnrichmentScope module.
B Dot plot illustrating the top 10 enriched terms in the analysis. C Dot plot showcasing the count of differentially regulated proteins in terms related to neurodegenerative diseases. D Heatmap of differentially regulated proteins associated with terms from (C). E Network connecting enriched terms with their respective proteins, colored based on fold change. The labeled proteins are shared among all processes. F EnrichmentMap displaying all enriched pathways, colored by modules defined using the Louvain algorithm. Term labels were determined based on intra-module connectivity and p value.

EnrichmentScope offers visualization tools such as dot plots, facilitating the assessment of enrichment statistics and the number of proteins considered for enrichment (Fig. 3B, C). Users can select top enriched terms based on adjusted p values to identify relevant biological processes (Fig. 3B). Another dot plot option allows users to explore protein regulation in depth, illustrating the number of DRPs in each enriched term (Fig. 3C). In Crunfli’s study, for instance, the top 10 enriched terms using the KEGG database were filtered, and pathways related to neurodegenerative diseases were selected, showing the ratio of up- and down-regulated proteins (Fig. 3B, C).

EnrichmentScope also generates heatmaps and network graphs linking enriched terms to their respective proteins (Fig. 3D, E). These visualizations reveal protein fold changes and protein overlap among groups, shedding light on key factors in biological events. In the previously chosen pathways, proteins related to processes such as the proteasome, electron transport chain, and cytoskeleton were shared across all neurodegenerative processes, offering insights into the effects of SARS-CoV-2 on COVID-19 patients.
Following this analysis, users can further investigate proteins of interest within the OmicScope module using functions such as box plots and PPI networks.

A challenge encountered in enrichment analysis is dealing with data redundancy, particularly prevalent in hierarchical databases such as Reactome [33] and Gene Ontology [34], which can lead to an overwhelming amount of information, as many pathways indicate a similar biological function (Supplementary Fig. 5). To address this limitation, EnrichmentScope applies a systems biology approach similar to that proposed by EnrichmentMap, wherein enrichment terms are represented as nodes within a network [35] (Fig. 3F). Besides providing a simplified network representation, this strategy also simplifies information extraction, reduces data redundancy without omitting data, and aids in selecting targets for further experimental validation. To connect the enriched terms in the network, the algorithm calculates pairwise Jaccard similarity indices from the genes/proteins shared between each pair of terms (see Appendix). By default, EnrichmentScope establishes a link between two terms when their Jaccard similarity index exceeds 0.25, enabling graph construction. Additionally, EnrichmentScope automatically searches for communities within the enrichment map, labeling the nodes (terms) that present the highest intra-module degree (Fig. 3F). By integrating quantitative and enrichment data and offering a wide selection of libraries, two enrichment approaches, and network visualization capabilities, our implementation distinguishes itself from other platforms (see Supplementary Data 1).

Nebula: from singular studies to meta-analysis

The advent of omics platforms has exponentially increased the accumulation of data over the years, driving scientists to develop tools capable of comparing independent studies or even integrating experiments in a multi-omics fashion.
Therefore, OmicScope introduces the Nebula module, designed to enhance data integration, interpretability, and comparison between studies. Although evaluating multiple independent proteomes simultaneously is a common approach, our software survey revealed that meta-analysis is a rare feature among computational proteomics tools (see Supplementary Data 1) [9,36,37].

The Nebula workflow utilizes the outputs of OmicScope/EnrichmentScope for data integration and visualization. These outputs have the extension “.omics” and can be generated by running the OmicScope module, which returns quantitative data, or the EnrichmentScope module, which provides both quantitative and enrichment results. For each independent analysis, one of these modules must be executed, and Nebula will read each output file to compile them into a unified object. Once the files are imported into Nebula, a set of visualization functions becomes available for conducting study comparisons at the protein and/or enrichment levels (Fig. 4A, Supplementary Fig. 6).

Fig. 4: Nebula, the meta-analysis module, compares independent studies utilizing data outputs from OmicScope and EnrichmentScope. A Nebula facilitates comparative analysis of independent studies based on OmicScope and EnrichmentScope outputs (provided as a Source Data file). B–H Figures generated using Nebula. B Bar plot depicting the count of whole (gray) and differentially regulated proteins/genes (colored) across various studies, as well as the combined count. C Dot plot showing the count of up-regulated and down-regulated entities. D Top 10 enriched pathways according to the KEGG database for all organs. Upset plots for (E) proteins and (F) enrichment terms, illustrating overlapping sizes among conditions. G Circular plot displaying all differentially regulated proteins and their shared relationships among evaluated groups (cyan links), along with shared enrichment terms among groups (black links).
Each protein is annotated with its respective fold change. H Circular plot depicting proteins differentially regulated in Oxidative phosphorylation among studies, with accompanying fold change values. Source data are provided as a Source Data file.

To demonstrate Nebula’s capabilities, we used data from Crunfli 2022, Nie 2021, and Wang 2021. These selected studies assessed the effects of SARS-CoV-2 on patients’ tissues, with Crunfli examining proteomic signatures in the brain, Wang evaluating proteomics and transcriptomics effects in the lungs, and Nie reporting the liver as the most affected organ in proteomics terms. In Nie’s and Wang’s studies, the authors provided only DRPs and genes, enabling the application of the Snapshot method for ORA (Fig. 4B).

Nebula’s pipeline supports various plots that facilitate the simultaneous comparison of all target groups. Bar plots and dot plots offer an initial overview of the groups by comparing the number of proteins and pathways evaluated in each condition, serving as initial steps in establishing associations between studies (Fig. 4B–D). In the selected datasets, the lungs exhibited the highest number of DRPs and genes, followed by the liver and brain (Fig. 4B). Using the Nebula integrative analysis approach, we noted that all examined tissues present a consistently higher number of up-regulated entities than down-regulated ones (Fig. 4C). When filtering enrichment terms to highlight the top 10 pathways identified in each condition, Nebula can pinpoint several potential pathways worthy of further investigation (Fig. 4D).

To delve deeper into comparisons, Nebula offers tools for examining overlaps at both protein and enrichment levels. While Venn diagrams are commonly used for visualizing overlaps, they have limitations when comparing more than four conditions, producing illegible plots (Supplementary Fig. 7). To overcome this limitation, Nebula includes circular plots and Upset plots in its pipeline (Fig.
4E, F). In the Upset plot [38], each condition is depicted in a row, while columns illustrate non-zero intersections exclusively among the labeled groups specified in the frame (Fig. 4E, F). The advantage of the Upset plot lies in its readability and the absence of limitations regarding the number of groups analyzed. In the example datasets, only 19 proteins and genes exhibited dysregulation in all tissues, whereas the largest overlap encompassed 467 DRPs between the lung and liver proteomes (Fig. 4E). On the other hand, when examining overlaps in enrichment terms, the highest overlap was found between brain and lung proteomes, with 34 terms exclusively shared between these two tissues.

In addition to the Upset plot, Nebula can also perform comparisons across groups using circular plots. In this plot, Nebula links each group with lines, with each link representing a protein that overlaps between those conditions. Each protein also displays its fold change in the respective study, generating a circular heatmap. This circular plot complements the Upset plot by providing a view of the proportion of up- and down-regulated proteins shared among groups. As expected, in the studies under evaluation, most of the shared proteins were up-regulated (Fig. 4G).

Nebula also offers a three-dimensional interpretation of data, considering groups, proteins, and enrichment terms simultaneously. Our circular diagram allows users to specify an enrichment term to be searched in all datasets, followed by the filtering of proteins associated with that term in each study. Nebula then generates a circular plot that connects studies and proteins, color-coding them based on their respective fold changes per group. In the example datasets, “oxidative phosphorylation”, enriched in all studies, was chosen to demonstrate that major proteins in this pathway were indeed up-regulated in all organs (Fig.
4H).

To provide systems-level information about multiple studies, Nebula’s array of visual representations also comprises network and statistical analyses. Similar to the methodology employed in EnrichmentScope, Nebula generates a graphical representation that establishes connections between studies and their corresponding DRPs, which can also be exported to third-party software tools (Fig. 5A).

Fig. 5: Systems biology approach with Nebula. A Nebula employs a systems biology approach, presenting proteins differentially regulated for each study as networks, enabling a detailed exploration of shared proteins among groups (provided as a Source Data file). B Differentially regulated proteins can be compared pairwise using statistical tests or similarity indices, with a heatmap displaying each pairwise result. Users have the option to define the background size or utilize all identified genes in all conditions as background for Fisher’s exact test. Here, we performed Fisher’s exact test to compare groups using the human proteome background length (20,423). C Based on results from similarity indices or Fisher’s exact test, users can generate a network, establishing links between studies according to parameter thresholds.

Two other systems biology strategies employed by Nebula to assess the similarity between studies are similarity analysis and statistical tests. In pairwise similarity analysis, Nebula computes similarity indices across the target studies using the Jaccard algorithm by default [35]. Nebula is also capable of using alternative metrics, such as Pearson and Euclidean, to calculate the similarity index from protein fold changes. On the other hand, when performing statistical tests, Nebula applies Fisher’s exact test to compare the overlap between studies by considering the entire set of imported proteins as the background, which results in pairwise p values.
Similar to conventional enrichment analysis, users can optionally specify alternative background sizes, such as the number of reviewed proteins in a specific organism according to the UniProt database. Alternatively, Nebula also encompasses other statistical analyses, such as the t-test, Wilcoxon test, or Kolmogorov-Smirnov test, using the fold-change distribution to compare studies. The results from the similarity and statistical analyses can be visualized using heatmaps and graphs. In the network representation, each node represents a group, while links are drawn according to pre-defined thresholds for similarity indices or p values (Fig. 5B, C).

In the example discussed here, DRPs from the four groups were compared using the Fisher’s exact test approach, utilizing the reviewed human proteome database as the background (proteome size: 20,423 proteins). The heatmap showcases all pairwise p values generated in this analysis (Fig. 5B), whereas the network representation filters p values below 0.05 and connects each group accordingly (Fig. 5C). This analysis illustrates that the effect triggered by SARS-CoV-2 exhibits a stronger relationship at the protein level, particularly between the liver and lung proteomes, as previously suggested by other Nebula plots.
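
The two pairwise strategies described for Nebula can be illustrated with a minimal sketch. This is not OmicScope's implementation: the function names (`jaccard`, `fisher_overlap_pvalue`, `pairwise_comparison`), the toy protein identifiers, and the toy background size are hypothetical, and the choice of a one-sided test (probability of an overlap at least as large as observed) is an assumption, since the text does not state the alternative hypothesis. Only the Python standard library is used.

```python
from itertools import combinations
from math import comb


def jaccard(a, b):
    """Jaccard similarity index: |A and B| / |A or B| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fisher_overlap_pvalue(a, b, background_size):
    """One-sided Fisher's exact test for the overlap of two DRP sets.

    Assumption: both sets are drawn from a common background of
    `background_size` proteins, and the p value is the hypergeometric
    probability of observing an overlap at least as large as the real one.
    """
    a, b = set(a), set(b)
    if background_size < len(a | b):
        raise ValueError("background must contain every protein in both sets")
    overlap = len(a & b)
    n_a, n_b = len(a), len(b)
    # Count the ways to draw n_b proteins with i of them falling in set a,
    # for every overlap i from the observed one up to the maximum possible.
    favorable = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(overlap, min(n_a, n_b) + 1)
    )
    return favorable / comb(background_size, n_b)


def pairwise_comparison(studies, background_size):
    """Return {(study_a, study_b): (jaccard_index, p_value)} for all pairs."""
    return {
        (name_a, name_b): (
            jaccard(set_a, set_b),
            fisher_overlap_pvalue(set_a, set_b, background_size),
        )
        for (name_a, set_a), (name_b, set_b) in combinations(studies.items(), 2)
    }


# Toy data: hypothetical identifiers, not results from the studies above.
TOY_STUDIES = {
    "brain": {"P1", "P2", "P3"},
    "lung": {"P2", "P3", "P4"},
    "liver": {"P5"},
}
TOY_RESULTS = pairwise_comparison(TOY_STUDIES, background_size=10)
```

For a real comparison, the background size would be replaced by the value appropriate to the analysis, for example the 20,423 reviewed human proteins used above. The resulting dictionary of pairwise values corresponds to what the heatmap displays, and keeping only pairs below a p value threshold (or above a similarity threshold) yields the edges of the study network.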
