AutoFocus: a hierarchical framework to explore multi-omic disease associations spanning multiple scales of biomolecular interaction

Description of AutoFocus frameworkThe AutoFocus tool enables fast clustering and phenotype association of multiple omics datasets, accompanied by an intuitive, interactive application for result exploration. Preprocessed, matched-sample omics datasets from any specimen, body fluid, or platform, are combined and pairwise correlated (Fig. 1b, c). These correlations are transformed into a distance metric that is used to structure all molecules into a single dependency tree based on well-established hierarchical clustering (Fig. 1d). Univariate associations of each molecule with a desired phenotype of interest are calculated, and significantly associated molecules, which are the “leaves” of the tree, are annotated at the bottom of the diagram (Fig. 1d–f).The tree is then scanned from top to bottom. For each internal node of the tree, the leaves descending from that node create a cluster (see highlighted parts of Fig. 1f). An enrichment analysis of significant hits is performed on the molecules within that cluster. If a user-defined enrichment threshold is reached, that internal node is labeled as an “enrichment peak” (Fig. 1f). Finally, functional annotation is performed on the modules associated with each peak along with a phenotype “driver” analysis (Fig. 1g). Drivers are defined as module members sharing a direct, unconfounded correlation edge with the disease phenotype based on a mixed-distribution graphical model.All AutoFocus functionalities are available as an R package at https://github.com/krumsieklab/autofocus. As input, the method accepts Excel sheets of preprocessed measurements from multiple omics datasets along with dataset-specific molecular annotations and sample-specific annotations, including phenotype(s) of interest and covariate information. Accompanying the workflow in Fig. 1 is an interactive Shiny application that allows a user to set an enrichment threshold and easily explore the resulting functional modules.Intra- and inter-dataset relationships in the 12-dataset multi-omics QMDiab studyAs all data-driven clustering methods depend on the similarity relationships between the measured variables, we first explored the correlation structure of a dataset with various omics layers to get an overview of the highly complex underlying statistical structures. The QMDiab dataset consists of 5135 biomolecules from 8 metabolomics datasets (5 different platforms performed on plasma, 2 on urine, and 1 on saliva), 3 blood glycomics datasets, and 1 blood proteomics dataset (Table 1). These 12 datasets were combined and the pairwise biomolecule correlations were calculated. A systemic correlation bias was detected across the various assays: Intra-dataset correlations were systematically higher than inter-dataset correlations (Fig. 2a). This bias persisted even in instances where the same molecule was measured on different platforms. For example, when analyzing two specific molecules, valine and leucine, measured on two almost identical plasma metabolomics platforms, we observed that valine had a higher correlation with leucine measured on the same platform than its correlation with itself measured on a different platform (Fig. 2b). As a consequence of this bias, molecules from the same dataset tended to be in close proximity in a hierarchical structure (Fig. 2c). This poses a problem when using correlation networks to statistically extract interactions between these molecules, a common approach to inferring biological relationships. We systematically probed the QMDiab correlation network for the optimal statistical cutoff to create a network whose edge set best models ground truth interactions. We found that this optimal cutoff differs between ground truth annotations for intra-dataset edges (metabolite pathways) and ground truth annotations for inter-dataset edges (KEGG and Recon3D-based gene-metabolite edges, Supplementary Fig. 1)32. This indicates the inability of a single statistical cutoff to recover biologically relevant interactions in a multi-dataset context.Table 1 Overview of -omic types, sample types, platforms, collection methods, and analyte count for the QMDiab and ROS/MAP datasetsFig. 2: Correlation values within and across datasets.a Proportion of significant correlations between biomolecules within and across datasets. For every dataset, the proportion of significant correlation coefficients within each dataset is substantially larger than across datasets. Consequently, statistical methods that depend on correlations will be biased towards intra-dataset interactions in a multi-omics setting. b Example correlations between two molecules measured on the sample blood samples using two similar metabolomics platforms, Metabolon Plasma HD2 and Metabolon Plasma HD4. Valine on the HD2 platform correlated stronger with Leucine measured on the same platform than with Valine on the HD4 platform. This further illustrates the tendency for stronger correlations within a dataset than between datasets. c Dataset distribution in the correlation-based hierarchical structure formed on the QMDiab dataset. Strong intra-dataset correlations can be seen for lipids (brown) and to a lesser extent for proteomics (light green), as these two datasets have dense regions where they segregate from the other -omics datasets which are otherwise thought to be well integrated.Notably, this observation of higher correlations within the same omics layers appears to be a natural feature of multi-omics datasets. Neither AutoFocus, nor any other clustering-based method can sufficiently remove this bias, as all clustering methods rely on the associations between measured molecules. However, unlike existing methods, the hierarchical framework of AutoFocus does not apply a fixed statistical cutoff to correlations between analytes, allowing any intra- and inter- dataset relationships to naturally emerge from the data structure. As the AutoFocus method evaluates clusters at every internal node of a hierarchy, clusters formed by nodes closer to the root of the tree will encompass molecules spanning different -omics, fluids, and datasets whose relationships would have been excluded when using statistical significance-based cutoffs (Fig. 1c). The resulting clusters are more representative of multi-omic, multi-fluidic, and multi-dataset biological interactions as compared to cutoff-dependent clustering methods.AutoFocus analysis on QMDiab reveals impact of type 2 diabetes at multiple levels of molecular interactionsSystems-level analysis of type 2 diabetesThe phenotype of interest used for this analysis was Type 2 Diabetes (T2D) diagnosis. After correcting for age, sex, and BMI, 188 of the 5135 molecules were found to be significantly associated with Type 2 Diabetes (p < 0.05, Bonferroni adjusted), covering 10 of the 12 omics datasets. The IgG and IgA glycomics datasets showed no significant associations with T2D. We observed a broad distribution of signal across the hierarchical tree (Fig. 3a), suggesting a system wide T2D effect across omics and body fluids. Certain regions of the tree had substantially denser distributions of significantly associated biomolecules, suggesting hotspots of T2D perturbation.Fig. 3: AutoFocus on the QMDiab dataset.The dataset included a total of 388 samples and 5135 biomolecules from 12 datasets: 5 metabolomics platforms on plasma, 2 on urine, and 1 on saliva, 3 blood glycomics datasets and 1 blood proteomics dataset. a View of the full hierarchical structure created from the QMDiab dataset. Magenta circles at the bottom of the tree indicate significant molecules, circles within the tree indicate modules that passed the enrichment threshold. Significant molecules were dispersed throughout the leaves of the tree and enriched modules were scattered throughout the hierarchy at a wide range of heights. The high-density region of significant molecules towards the right corresponds to the largest enriched module at the highest height. Below is a zoomed view of this module, with the left sub-tree in yellow and the right sub-tree in pink. b Pie charts of the dataset and pathway makeup of the two largest modules along with their size and significant-node enrichment fraction. Pathway annotations were only available for the metabolites measured by Metabolon. c Confounder-corrected mixed graphical model of the molecules in the largest module with phenotype. The zoomed-in view is of nodes with edges to the Type 2 Diabetes phenotype which include 1,5-AG in saliva, ornithine in urine, and the CXCL12 protein, along with the confounder age and 2 unknown molecules. As these molecules are directly connected to the T2D phenotype, we mark them as statistical “drivers” of the disease in this module.Type 2 diabetes modulesTo identify T2D modules for the QMDiab dataset, we applied a “majority vote” enrichment threshold of 0.5, where at least 50% of a cluster’s members must be significantly associated with T2D for it to be designated as a T2D module. The AutoFocus method identified 21 modules, ranging in size from 2 to 192 biomolecules (Supplementary Fig. 2 and Supplementary Data 1). In addition, there were 33 single-molecule modules, identified as T2D-associated molecules that did not belong to any of the 21 modules. The identified T2D modules substantially ranged in scale (Fig. 3a), from very high correlation near the leaves at tree height 0 to low correlations near the root at tree height 1. This shows that Type 2 Diabetes manifests at various levels of the biological hierarchy, from closely connected molecules to larger pathways.As expected, most of the smaller, highly correlated modules tended to contain molecules from only one dataset, due to the aforementioned within-dataset correlation biases, most notably within the lipidomic and proteomics datasets. However, AutoFocus identified six T2D modules that contained molecules from multiple omics or fluids (Supplementary Fig. 2). The smaller of these modules included molecules that were measured multiple times but on different platforms, e.g., one module which was made up of pyroglutamine measured on the Metabolon HD2 and HD4 platforms. The largest module with 192 molecules (Fig. 3a), comprising of the bulk of the T2D-associated analytes in the QMDiab dataset, brought together molecules from both metabolomic and proteomic datasets and all three body fluids in QMDiab (Fig. 3b).This 192-analyte module contained two sub-modules, each with substantially different functional components. The larger, right-hand “child” tree (Fig. 3a, pink) contained molecules involved in energy metabolism, including various carbohydrates, such as mannose, glucose, and 1,5-anhydroglucitol, which are known biomarkers of diabetes33,34, as well as TCA cycle metabolites like pyruvate and lactate in plasma. In addition, this module showed significant changes in the abundance of ketone bodies acetoacetate and 3-hydroxybutyrate in urine, supporting the prevalence of ketosis and ketone body secretion in T2D patients35.The left sub-tree (Fig. 3a, yellow) in this module contained biomolecules related to bone growth, mineralization, and degradation, as well as some chemokines and endothelial cell proteins. The bone degradation molecules included the proteins Osteomodulin (OMD), Integrin-binding sialoprotein (IBSP), and C-type lectin domain protein (Clec11A) and the metabolite prolylhydroxyproline in both plasma and urine36,37,38,39. Osteoporesis has a well-documented relationship to T2D, and although the mechanisms are not established, hypotheses for the link include inflammation and microangiopathy40. The presence of chemokines Stromal cell-derived factor 1 (CXCL12) and C-C motif chemokine 22 (CCL22), as well as Endothelial cell-specific molecule 1 (ESM1) in this sub-module presented potential osteoporosis links to inflammation and microangiopathy, respectively.Type 2 diabetes module driver analysisWe further analyzed this module using a mixed graphical model (MGM) approach, which allowed us to differentiate direct correlations between biomolecules and T2D from indirect, statistically confounded correlations. We identified and labeled as drivers those molecules that had a direct correlation with T2D diagnosis, signified by sharing an edge in the MGM network. The MGM identified 5 biomolecules showing direct correlations with T2D, including CXCL12 and ornithine in urine, 1, 5-anhydroglucitol in saliva, as well as 2 unknown urine metabolites (Fig. 3c). The variety of drivers likely reflects the multiple functional components associated with T2D (such as hyperglycemia and inflammation).Stability of QMDiab hierarchyThe modules identified by the AutoFocus algorithm are highly dependent on the hierarchical structure of the data. A bootstrapping method was used to assess the stability of the hierarchical structure in the QMDiab dataset. Briefly, this involved generating bootstrap samples by randomly sampling with replacement and constructing hierarchical trees from these samples. The cophenetic correlation between each bootstrapped tree and the original tree was calculated, and the procedure was repeated 100 times. The results showed an average cophenetic correlation of 0.924 with a standard deviation of 0.006, indicating high stability of the QMDiab hierarchical structure.Taken together, the AutoFocus analysis on this large T2D dataset showed the benefits of exploring multi-omics datasets with a hierarchical algorithm: First, AutoFocus was able to cluster and draw links between multiple omics and fluids into functional modules in T2D at a variety of scales within the hierarchy. Second, the granularity of the hierarchical structure allowed us to explore the functional sub-modules of an identified enrichment peak separately and in detail. For the largest QMDiab cluster associated with T2D, we were able to identify that one sub-module was enriched for energy metabolism molecules and the other for bone growth and degradation, while the peak annotations showed us how these two processes interacted together at a larger biological scale. Finally, mixed graphical models allowed us to perform a driver analysis on each module, indicating which molecules had direct statistical links to the T2D phenotype in each module, and which ones were confounded correlations.AutoFocus on ROS/MAP dataset shows Alzheimer’s disease phenotype impact at different levels of the biological hierarchyAs another use case, AutoFocus was applied to an Alzheimer’s disease (AD) dataset of brain samples from the Religious Order Study (ROS) and Rush Memory and Aging Project (MAP) cohorts31. This dataset consisted of 8,193 biomolecules from one metabolomics and one proteomics platform, both performed on brain tissue from post-mortem samples (Table 1). For this analysis, we examined the association between the biomolecules and two clinical AD phenotypes simultaneously: 1) Neurofibrillary tangles (NFT), defined by the immunohistochemistry-based overall paired helical filament tau tangles load from post-mortem pathology, and 2) cognitive decline (CD), defined by the rate of change in global cognition over lifetime. These phenotypes were chosen because they represent two distinct effects of AD, molecular and cognitive.Systems-level analysis of AD phenotypesOf the 8193 molecules in the ROS/MAP dataset, 887 molecules significantly associated with NFT and 763 molecules significantly associated with CD (p < 0.05, adjusted p-values). All statistical models were corrected for age at death, sex, BMI, post-mortem interval, years of education, and APOE genotype. To maintain consistency with previous studies published on the ROS/MAP dataset41, the FDR p-value correction method was used instead of Bonferroni, leading to a dense distribution of significant hits across the tree (Fig. 4a). Both phenotypes had robust metabolic associations, as metabolites made up 20% and 26% of significant hits in NFT and CD, respectively, even though metabolites only made up 8.14% of the underlying dataset. There were 358 overlapping molecules significantly associated with both phenotypes.Fig. 4: Results of running the AutoFocus method on the ROS/MAP dataset.The dataset included a total of 500 samples, which contained 8193 biomolecules from a metabolomics platform a proteomics platform performed on post-mortem brain tissue. a View of the full hierarchical structure created from the ROS/MAP dataset with two phenotypes annotated and dataset distribution below. Magenta circles represent the neurofibrillary tangles phenotype, green circles represent cognitive decline, and orange circles are overlaps between the two. Significant molecules are dispersed densely throughout the tree and enriched modules are scattered throughout the hierarchy at a large range of heights. b Zoomed-in view of a metabolomics module enriched for significant hits associated with cognitive decline. This module contained metabolites related to oxidative stress and lipid peroxidation. c Zoomed in view of the largest module found in the dataset which was enriched for metabolites and proteins significantly associated with neurofibrillary tangles. d Zoomed-in view of the largest module enriched for both phenotypes with the left sub-tree (yellow) enriched for mitochondrial proteins and the right sub-tree (pink) enriched for proteins related to synaptic vesicle exocytosis and inhibitory neurotransmission.AD modulesA total of 171 modules were identified with a “majority vote” enrichment threshold of 0.5, with 83 modules unique to the NFT phenotype, 82 unique to the CD phenotype, and 6 modules associated with both phenotypes. There were 466 single-molecule modules that did not belong to any of the 171 modules. The multi-molecule modules ranged in size from 2 to 165 biomolecules (Supplementary Fig. 3, Supplementary Data 2). Similar to the QMDiab dataset, modules associated with both phenotypes ranged drastically in tree height across the tree, from 0 to 0.83 (Fig. 4a).An interesting feature arising from applying AutoFocus on two phenotypes was the nesting of enrichment peaks, where the peak of one phenotype was a descendant of a peak of the other. As the NFT and CD phenotypes had a large overlap of significantly associated molecules, their enriched modules tended to occupy similar regions of the tree. Despite this considerable overlap, only 6 internal nodes were identified as enrichment peaks for both phenotypes (Fig. 4a, orange nodes). For all other regions of the hierarchy where both phenotypes had overlapping significant hits, modules enriched for one phenotype contained descendent sub-modules enriched for the other phenotype (Fig. 4b, c). This nesting highlights how different phenotypes within a single disease can manifest at different scales of biological processes, where cognitive decline may be associated with a biological process at a higher level than neurofibrillary tangles pathology, and vice versa.The ROS/MAP hierarchical structure was strongly affected by the dataset correlation bias as metabolites were condensed within the tree, leading to only 6 of the 170 modules being multi-omic (Fig. 4a, Supplementary Data 2). Within the dense metabolomic region of the ROS/MAP hierarchy, CD had more significant metabolite associations, which resulted in a higher enrichment peak (larger cluster) than the NFT phenotype. This 36-metabolite module was enriched for antioxidants and lipid peroxidation metabolites42,43, indicating that CD interacts with oxidative stress metabolism at a higher biological scale than NFT (Fig. 4b, Supplementary Data 2).One of the 6 modules identified for both phenotypes was a metabolomic module enriched for amino acids, with 22 out of 29 members being amino acids or their derivatives (Fig. 4b). These include branched-chain amino acids (valine, leucine, and isoleucine) whose dysregulation is a well-known marker of AD pathology, as well as the aromatic amino acids (tyrosine, phenylalanine, and tryptophan) which are substrates for neurotransmitters like serotonin, dopamine, and norepinephrine (Fig. 4c, Supplementary Data 2)44. As this was a module where both phenotypes were enriched at the same level, these processes do not seem to be specific to either phenotype but are a shared feature of both AD traits. Interestingly, this metabolomic module breaks away from the metabolite-dense portion of the hierarchy, and is surrounded by proteins, indicating a closer functional relationship of amino acids to proteomic processes (Fig. 4a).Of the 6 multi-omic modules was the largest module in the tree, which significantly associated with the NFT phenotype (Fig. 4d). This module contained proteins and metabolites involved in a variety of processes; one sub-module showed multi-omic dysregulation of arginine flux, degradation, and metabolism45,46,47,48, one sub-module contained proteins associated with inflammatory mediator TNF-α49,50,51, while an adjacent sub-module contained glycosylation proteins52. In contrast, the CD phenotype had enrichment peaks for the NFT sub-modules involved in arginine metabolism and inflammation, but not for the region associated with protein glycosylation. This indicates that protein glycosylation has an NFT-specific association, and thus the NFT phenotype is associated at a higher level in the biological hierarchy for this process than the CD phenotype.Stability of ROS/MAP hierarchyThe bootstrapping procedure applied to the ROS/MAP dataset resulted in a mean cophenetic correlation of 0.62 with a standard deviation of 0.02. These values suggest that the hierarchical structure is less stable for the ROS/MAP dataset compared to the QMDiab dataset. The lower mean cophenetic correlation indicates greater variability in the hierarchical structure when subjected to bootstrapping. While the reason for this remains unclear, insight can be drawn from the differences in the QMDiab and ROS/MAP hierarchical structures. The internal nodes of the QMDiab hierarchy are evenly distributed across the height of the tree. In ROS/MAP, on the other hand, the internal nodes are densest higher up in the tree. This indicates weaker correlations within the data that could lead to a more variable structure when clustering the multi-omics markers.In summary, overlaying these two phenotypes on the ROS/MAP hierarchy demonstrated the difference in biological manifestation of cognitive decline and tau neurofibrillary tangles in Alzheimer’s disease. These differences highlight phenotype-specific processes, while modules equally enriched for both phenotypes indicate more universal disease processes that may not be attributable to a single phenotype.Comparison with other clustering methodsTo adequately compare the performance of AutoFocus against other methods, we chose existing clustering algorithms that met the following criteria: First, the methods must have been designed to cluster biomolecular features (instead of samples). Second, the methods must have infrastructure for a multi-scale examination of the identified clusters. We identified three statistical methods that met these criteria and are commonly used to identify clusters that show phenotype associations: (1) MoDentify12, a partial correlation network-based method that uses phenotype association to define modules. (2) MEGENA11 (Multiscale Embedded Gene Co-Expression Network Analysis) because of its ability to identify multi-scale clustering structures. (3) WGCNA10 (Weighted Gene Co-Expression Network Analysis) as it is similarly a hierarchical method, in addition to being a widely used tool for clustering omics data.Structural comparisonBoth MoDentify and MEGENA rely on underlying network structures upon which to perform their clustering methods. MoDentify models molecular interactions between biomolecules by transforming omics datasets into a partial correlation network with edges selected using a p-value inclusion criterion12. Similarly, MEGENA models these interactions using a similarity metric (filtered by an FDR p-value cutoff) to construct a Planar Filtered Network (PFN)11. Despite the difference in network derivation methods, the networks of both methods are still largely affected by the correlation bias discussed in Results “Intra- and inter-dataset relationships in the 12-dataset multi-omics QMDiab study”, with intra-dataset edges represented at a much higher rate than inter-dataset edges (Supplementary Fig. 4).In contrast, WGCNA derives a hierarchical structure using a “topological overlap matrix” (TOM) of a co-expression network10. The dataset correlation bias persisted in the WGCNA hierarchy, as evident by the proteomics dataset being highly segregated in WGCNA’s TOM-based tree (Supplementary Fig. 5a). Taken together, the correlation bias introduced when combining multiple omics datasets will affect the underlying correlation and cluster structure, regardless of the method. The ability to capture the relationships between these datasets then relies on the clustering approach.Cluster comparisonThe three methods compared to AutoFocus have vastly different designs for identifying clusters.MoDentify identifies clusters by selecting seed nodes in their network and performing a greedy neighborhood search, integrating each new visited node into a cluster, and testing for significant association with phenotype based on a conglomerative measure12. Because of this stepwise network expansion design, MoDentify does not identify nested structures. Due to these discrepancies between the approaches, the resulting clusters from MoDentify and AutoFocus are substantially different (Fig. 4a). Of note, MoDentify was considerably computationally expensive on the QMDiab dataset, taking 10 h over 4 2.7 GHz Quad-Core Intel Core i7 CPUs.MEGENA initially splits its PFN into clusters based on Newman’s modularity measure, then iteratively splits those clusters into subclusters based on a compactness evaluation11. Consequently, the MEGENA method produces nested clusters, similar to AutoFocus. This leads to a larger similarity between MEGENA and AutoFocus clusters, with two-thirds of MEGENA’s clusters with a Jaccard similarity score of over 0.5 with at least one cluster in AutoFocus’s hierarchy (Fig. 5b). However, MEGENA lacks the functionality to organize sub-clusters within their parent clusters, leaving the multi-scale relationships between clusters to manual inspection. AutoFocus allows the user to analyze these relationships in the R Shiny app, which visualizes nested clusters within their parent clusters and contextualizes their relationships.Fig. 5: Comparison of AutoFocus with MEGENA, WGCNA, and MoDentify.Highest Jaccard similarities of clusters from MoDentify (a), MEGENA (b), and WGCNA (c) with clusters from AutoFocus hierarchy. d Proportion of significantly associated molecules in associated clusters found by the four methods. AutoFocus’ proportion increases with a more stringent threshold, surpassing the proportion in WGCNA, MoDentify and MEGENA comfortably after a threshold of 0.35.WGCNA produced the most similar results to AutoFocus, likely due to the use of hierarchical clustering in both methods10. This similarity is evident as 40% of WGCNA clusters have an exact match in the AutoFocus tree, and 90% have a Jaccard similarity of at least 0.5 (Fig. 5C). However, each analyte is placed into a single cluster and the resulting clusters have no overlap. This leads to WGCNA results being a single scale per cluster, with no regard for potential sub- or super-processes. In fact, 17 of the 26 WGCNA clusters that significantly associated with type 2 diabetes had their highest Jaccard similarity score to sub-clusters found within the largest AutoFocus cluster described in Fig. 3. AutoFocus was not only able to identify the processes found within WGCNA clusters but was also able to contextualize them into the broader system in which they participate.Further, the AutoFocus framework can add this multi-scale context to a WGCNA hierarchy. The method’s framework is applicable to any input tree structure, including the hierarchical structure derived from the “topological overlap matrix” (TOM) of a co-expression network used in WGCNA10. The QMDiab dataset was rerun using WGCNA’s TOM-based hierarchical structure, the results and a comparison to the original analysis of which can be seen in Supplementary Fig. 5.Beyond the clustering approaches, all three methods use aggregated metrics of cluster members to associate clusters with disease, rather than looking at the individual molecules within. As such, the interpretability of individual node association to disease is lost, and the association signal of a single member can be dispersed into large clusters that are otherwise full of noise. The use of enrichment as an association metric, alongside the added functionality of ‘piggy-backers’ in the hierarchy leads AutoFocus to have the most signal-dense clusters among the four methods, even at moderately low thresholds (Fig. 5d).In summary, compared to MoDentify, MEGENA, and WGCNA, AutoFocus provides clusters that are denser in signal and more interpretable at the node level, alongside a visualization tool that is able to contextualize the scale of clusters and how they interact.

Hot Topics

Related Articles