mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate | BMC Bioinformatics

Approaches implemented in the mulea packageThe mulea R package provides two different approaches for functional enrichment analysis. For unranked sets of elements, such as significantly up- or downregulated genes, mulea employs the set-based overrepresentation analysis (ORA). Alternatively, if the data consists of ranked elements, for instance, genes ordered by p-value or log fold-change calculated by the differential expression analysis, mulea offers the gene set enrichment (GSEA) approach.The unordered set-based overrepresentation analysis (ORA)Mulea implements a set-based enrichment analysis approach that utilises the hypergeometric test (which is analogous to the one-tailed Fisher’s exact test) to identify statistically significant overrepresentation of elements from a target set (e.g., significantly up- or downregulated genes) within a background set (e.g., all genes that were investigated in the experiment). Therefore, a predefined threshold value—such as 0.05 for the corrected p-values and z-scores, or twofold change—has to be used in the preceding analysis.Addressing multiple testing: p-value correction and eFDR in the ORA analysisPerforming numerous statistical tests, such as evaluating enrichment across all ontology entries, leads to an inflated number of significant results (p-values < 0.05) due to chance, even if all null hypotheses are true. This phenomenon, known as the multiple testing problem, necessitates p-value correction. mulea offers various methods, including Bonferroni and Benjamini–Hochberg p-value correction, and empirical false discovery rate (eFDR).However, Bonferroni and Benjamini–Hochberg methods assume independent tests, which rarely holds true in functional enrichment analyses. For example, GO categories exhibit a hierarchical structure, potentially leading to unnecessary exclusion of significant results (enriched ontology entries). Therefore, mulea implements the robust, resampling-based eFDR, which takes into account the distribution of test statistics, making it better suited for analysing gene sets and ontologies typically employed by biologists. The eFDR implementation is based on the method described by Reiner et al. [14] and Hastie et al. [13].The empirical false discovery rate (efdr)For each ontology entry (j = 1,2,…,J) and the investigated target set (e.g., significantly differentially expressed genes), mulea calculates a p-value (pj) based on the hypergeometric test. To assess the unbiased statistical significance of each ontology entry, we compute the empirical false discovery rate (eFDRj) using a resampling-based approach.First, we determine the rank of each ontology entry’s p-value relative to the p-values of all ontology entries. Rj refers to the rank of the p-value of the jth ontology entry. Here, we do note the indicator function with Iverson brackets: I():$${R}_{j}={\sum }_{i=1}^{J}I\left({p}_{i}\le {p}_{j}\right) , j=1…J$$To calculate the expected rank (\({\overline{R}}_{j}\)) of the p-value of the jth ontology entry, we apply a resampling strategy, where resampling steps are indexed with s = 1,2,…,S. In each resampling step, we generate a simulated target set with the same size as the original target set, but with randomly selected elements from the background set. Then we recalculate the hypergeometric tests and the ranks of the p-values (\({R}_{j}^{s})\) for each resampling step. Let \({\overline{R}}_{j}\) be the expectation of the \({R}_{j}^{s}\) values over s:$${\overline{R}}_{j}=\frac{{\sum }_{s=1}^{S}{R}_{j}^{s}}{S}$$The eFDR of the jth ontology entry (eFDRj) is calculated as the ratio of the expected rank (\({\overline{R}}_{j}\)) to the actual rank (Rj). If the calculated eFDRj exceeds 1, it is truncated to 1.$${eFDR}_{j}=min\left(\frac{{R}_{j}}{{\overline{R}}_{j}}, 1\right)$$To enhance clarity, a simplified R code illustrating the functional enrichment test and eFDR calculations is provided in the Supplementary Note.Calculating the eFDR for all ontology entries is computationally demanding, especially for large ontologies and numerous resampling steps (mulea recommends at least 10,000). To significantly improve the processing speed, mulea implements the eFDR functionality in efficient C +  + code which can be run on multiple threads. While similar approaches exist in tools like Gowinda [15] and FuncAssociate [16], mulea offers advantages in terms of data type compatibility and offline usability.The ranked list-based gene set enrichment analysis (GSEA)Mulea facilitates ranked list-based enrichment analysis using the GSEA approach. This method requires an ordered list of elements (e.g., genes or proteins) as input, where the order reflects the user’s prior analysis (e.g., p-values and/or fold-changes). The list should encompass all elements involved in the analysis, such as all expressed genes in a differential expression study. mulea leverages the Kolmogorov–Smirnov statistic coupled with a permutation test [5] to assess enrichment within gene sets. This implementation is achieved through the integration of the fgsea Bioconductor package [17].Input types and formatsIn the mulea, the overrepresentation analysis is implemented in the ora function which requires three inputs: (1) an ontology (e.g., GO) data frame that fits the investigated taxa and the applied gene or protein identifier type; (2) a vector containing the list of elements under investigation (e.g., differentially expressed genes), named as target set; and (3) a vector containing the background set of elements for the analysis (e.g., all genes that were investigated in the experiment). Likewise, the gsea function, which calculates the enrichment of ranked elements, requires three inputs: (1) an ontology data frame; (2) a vector containing the names or identifiers of the elements (genes or proteins); and (3) a vector containing values corresponding to their order (e.g., p-values or log fold-change values).OntologiesMulea R package offers a rich collection of ontologies to enhance functional enrichment analyses. These ontologies span across diverse species: Arabidopsis thaliana, Bacillus subtilis, Bacteroides thetaiotaomicron, Bifidobacterium longum, Bos taurus, Caenorhabditis elegans, Chlamydomonas reinhardtii, Danio rerio, Daphnia pulex, Dictyostelium discoideum, Drosophila melanogaster, Drosophila simulans, Escherichia coli, Gallus gallus, Homo sapiens, Macaca mulatta, Mus musculus, Mycobacterium tuberculosis, Neurospora crassa, Pan troglodytes, Rattus norvegicus, Saccharomyces cerevisiae, Salmonella enterica, Schizosaccharomyces pombe, Tetrahymena thermophila, Xenopus tropicalis, Zea mays, and cover various biological concepts:

(1)

Gene Ontology (GO): Ontologies are available for 13 species, ranging from Escherichia coli bacteria to humans, and are provided as separate ontologies for investigating biological processes, cellular components, and molecular functions. Additionally, combined versions for all three aspects are included.

(2)

Pathway databases: Pathway information from various resources like Pathway Commons [18], Reactome [19], SignaLink [20], and WikiPathways [21] is incorporated for 14 species, providing insights into molecular networks.

(3)

Transcription factor regulation: This category integrates data from ATRM [22], dorothEA [23], RegulonDB [24], TFLink [25], TRRUST [26], and Yeastract [27] databases, covering the regulatory influence of transcription factors for 8 species.

(4)

MicroRNA regulation: For 8 species, mulea offers microRNA regulation data sourced from miRTarBase [28], aiding in understanding microRNA-mediated gene regulation.

(5)

Gene expression data: Specifically for fruit flies (Drosophila melanogaster), gene expression data from Flyatlas [29] and Modencode [30] is included.

(6)

Genomic location data: This category, downloaded from Ensembl [31], provides detailed chromosomal location information (bands and consecutive genes) for 26 species.

(7)

Protein domain content: Protein domain information retrieved from the PFAM database [32] is available for 24 species, facilitating the identification of functional domains within sets of proteins.

The wide range of ontologies we provide, are stored in a standardised GMT format. These ontologies are also accessible through the muleaData ExperimentData Bioconductor package for added convenience. To cater to various research needs, the ontologies are available in all major gene and protein identifiers, including UniProt protein ID, Entrez ID, Gene Symbol, and Ensembl gene ID. We applied the API of UniProt (https://www.uniprot.org/help/id_mapping) to map between different identifiers. For direct access, 879 pre-defined GMT files are hosted on a dedicated GitHub repository (https://github.com/ELTEbioinformatics/GMT_files_for_mulea).The GMT format contains collections of genes or proteins associated with specific ontology entries in a tab-delimited text file. The GMT file can be read to R as a data frame using the read_gmt function of mulea or the ExperimentHub identifier of the ontology using the muleaData package. Each term is represented by a single row in the GMT file and in the data frame as well. Each row includes three types of elements: (1) The ontology identifier, referred to as “ontology_id”, which uniquely identifies the element within the file or data frame. (2) The ontology name or description, referred to as “ontology_name”, provides a user-friendly label or textual description that clarifies the meaning of each term. (3) A list of associated gene or protein identifiers (separated by spaces) belonging to each term.Additionally, in the GMT files, comment lines starting with “#” provide supplementary information about the referenced ontology: e.g., type, source, species, version, and identifier type. Similar information is also available in the muleaData package when using the query function of the ExperimentHub library [33]. We recommend using the “primary identifiers” (listed in each GMT file and the Supplementary Table 1) for each ontology whenever possible. It is important to note that using alternative identifiers might lead to slightly different enrichment results due to potential inconsistencies in identifier coverage.To enhance the biological interpretation of genomic and proteomic ndata, mulea’s adoption of the standardised GMT format allows the seamless integration of external data resources. This includes established databases like MsigDB [5] and Enrichr [6], as well as KEGG pathways (a conversion script is available in the https://github.com/ELTEbioinformatics/GMT_files_for_mulea GitHub repository). Notably, mulea accommodates user-defined ontologies, facilitates the translation between list variables and data frames with the list_to_gmt function, and has a dedicated function (write_gmt) to save GMT files, for further expanding its analytical capabilities.Refining enrichment analysis by filtering ontology entriesEnrichment analysis results can sometimes be skewed by overly specific or too broad ontology entries. mulea empowers users to address this issue by enabling the exclusion of such entries from the analysis. This filtering capability allows researchers to ensure that the results better match the expected scope.

Hot Topics

Related Articles