Meta graphical lasso: uncovering hidden interactions among latent mechanisms

Synthesized datasetHere, we utilize synthetic data generated based on latent factors with sparse dependencies to validate whether these sparse dependencies can be accurately estimated.Data generationWe generate K datasets consisting of N samples, each containing V observed variables. These datasets share M common latent factors, among which we assume sparse dependencies. The procedure for generating these datasets is outlined below:Creating latent factor To generate a matrix \({\varvec{D}}\) that satisfies the orthonormal condition \({\varvec{D}}^T{\varvec{D}}={\varvec{I}}\), we first create a matrix \({\varvec{Z}}\) of the same size as \({\varvec{D}}\) with random values sampled from a uniform distribution over [0, 1). Then, using \({\varvec{P}}\) and \({\varvec{Q}}\) obtained from SVD \({\varvec{Z}} = {\varvec{P}}{\varvec{S}}{\varvec{Q}}^T\), we generate \({\varvec{D}} = {\varvec{P}}{\varvec{Q}}^{T}\). Here, \({\varvec{P}}^T{\varvec{P}}={\varvec{I}}\) and \({\varvec{Q}}^T{\varvec{Q}}={\varvec{I}}\), where \({\varvec{S}}\) is a square matrix with non-diagonal elements being zero.Generating precision matrix For each of the K datasets, we generate the precision matrix of latent factors, \({\varvec{\Sigma }}_k^{-1}\). Specifically, we first generate 20 M-dimensional vectors with random values sampled from a normal distribution with mean 0 and standard deviation 1. The precision matrix for these samples is then estimated using Graphical Lasso with an L1 regularization weight of 0.3, and this estimated matrix is designated as \({\varvec{\Sigma }}_k^{-1}\).Generating white noise The common variance \(\varepsilon _k\) for the remaining \(V-M\) dimensions of latent factors is generated as the absolute value of random values sampled from a normal distribution with mean 0 and standard deviation 0.1. This assumes that the variance of the remaining latent factors is smaller than that of the M-dimensional latent factors.Synthesizing samples The covariance matrix \(\varvec{\Phi }_k\) is defined as \(\varvec{\Phi }_k = {\varvec{D}}{\varvec{\Sigma }}_k{\varvec{D}}^T + \varepsilon _k ({\varvec{I}} – {\varvec{D}}{\varvec{D}}^T)\), and N samples are drawn from a Gaussian distribution represented by \(\mathcal {N}\left( \textbf{0}, \varvec{\Phi }_k\right) \).In this experiment, the number of latent factors M is set to 10, and the number of datasets K is 10. The number of observed variables V is tested in variations of 100, 1000, and 10, 000. Additionally, the number of samples N is tested for 200 and 1000. The MGLASSO model is trained using SGD for 5000 epochs, ensuring sufficient convergence. The L1 regularization weight \(\rho \) is selected for each V and N to achieve sparsity levels comparable to the ground truth. The resultant latent factors are reordered to closely match the ground truth latent factors. Specifically, we first select the latent factor closest in Euclidean distance to any of the ground truth latent factors as its corresponding latent factor and subsequently match the remaining factors based on the closest Euclidean distance. We report the average results obtained from repeating this verification process 10 times.Comparison methodsOur goal is to propose a method using latent factors derived through linear projection, which is one of the effective ways to infer latent mechanisms in a form easily understandable by humans. Therefore, we compare our proposed method with the main traditional methods using latent factors through linear projection, such as PCA and ICA, to demonstrate that our approach can appropriately estimate the sparse dependencies among latent factors.We apply traditional methods to the same synthetic data used in the evaluation of MGLASSO for our assessment. However, since PCA and ICA can only consider a single dataset, we apply PCA and ICA to a concatenated dataset comprising all individual datasets to estimate latent factors. It was observed that directly concatenating leads to an inability to estimate the dependencies among latent factors of datasets other than those with significant variance. Thus, we normalize the variance of the variables in each dataset to 1 before concatenation. For PCA, the major M principal components are regarded as latent factors. In the case of ICA, the M independent components extracted using the FastICA7 algorithm are considered latent factors. Furthermore, similar to the procedure with MGLASSO, the estimated latent factors are reordered according to their correspondence with the ground truth. The estimated latent factors are then projected into an M-dimensional space using a projection matrix \({\varvec{D}}’\), comprising column vectors of the latent factors, and a precision matrix is obtained by applying Graphical Lasso to each dataset.Results on synthesized datasetsThe results of the evaluation experiments using synthetic data are displayed in Table 1. Table 1 initially shows that a larger sample size N facilitates better alignment with the ground truth, however, increasing the number of observed variables V complicates this alignment. The proposed method consistently shows better results across all evaluation metrics compared to PCA or ICA. Notably, although the alignment of space projected by M latent factors (Space) is relatively good in PCA and ICA, the inner product of matched latent factors (Dist) and the Kullback-Leibler divergence of the Gaussian distributions (KLD) significantly outperform those in PCA or ICA. This indicates that, while PCA and ICA capture subspaces with a significant variance of latent factors, MGLASSO significantly excels in identifying latent factors from those subspaces with sparse dependencies across all datasets. It can also be confirmed that the KL divergence between the learned multivariate Gaussian distribution and the ground truth multivariate Gaussian distribution is smallest when \(M=10\) (Table 2).An example of sparse dependencies in the estimated latent space is shown in Fig. 2, illustrating the case when \(V=10,000, N=1000\). Here, the partial correlation coefficients, commonly used to understand dependencies between variables, are displayed. When the precision matrix is \(\varvec{\Lambda }\), the partial correlation coefficient between the ith and jth variables is given by \(-\frac{\varvec{\Lambda }{ij}}{\sqrt{\varvec{\Lambda }{ii}}\sqrt{\varvec{\Lambda }_{jj}}}\). It is observed that PCA and ICA are largely unsuccessful in accurately estimating the ground truth dependencies. In contrast, MGLASSO is almost entirely successful in accurately estimating the ground truth dependencies.In MGLASSO, the weight \(\rho \) for L1 regularization is confirmed to generally provide optimal BIC for each V and N (Table 3). Additionally, the number of latent factors M generally shows optimal BIC when \(M=10\), which corresponds to the ground truth (Table 4). However, for \(V=10,000\), the smallest BIC is obtained with \(M=5\), with \(M=10\) resulting in the next smallest BIC. Overall, these findings suggest that BIC serves as a reasonably good criterion for selecting \(\rho \) and M when the dataset adheres to the assumption of being generated based on a multivariate Gaussian distribution.Experiments with synthetic data strongly suggest that accurately estimating sparse dependencies among latent factors across numerous datasets is extremely challenging with traditional methods of extracting latent factors, yet the proposed method is a potent solution to this challenge.Table 1 Evaluation using synthetic data.Table 2 KL divergence between the multivariate Gaussian distribution \(\mathcal {N}\left( \textbf{0}, \varvec{\Phi }_k\right) \) with different M obtained using MGLASSO and that of the ground truth for synthetic data.Table 3 BIC values for MGLASSO applied to synthetic data using different \(\rho \).Table 4 BIC values when applying MGLASSO with different M on synthetic data.Figure 2Graphical representation of the dependencies between latent factors estimated from synthetic data. Edges represent non-zero partial correlation coefficients between latent factors, with blue edges indicating positive partial correlations and red edges indicating negative ones. The thickness of an edge corresponds to the magnitude of the absolute value of the partial correlation coefficient. Additionally, the size of a node reflects the variance of the latent factor, that is, the magnitude of the diagonal elements in the covariance matrix. PCA and ICA fail to accurately capture the ground truth dependencies, whereas MGLASSO successfully approximates them with high fidelity. GT ground truth, MGLASSO, PCA, ICA dependencies between latent factors estimated using MGLASSO, PCA, and ICA, respectively.Stock price datasetWe demonstrate that our proposed method can unveil significant societal trends behind the movements of numerous stocks by applying it to data on the movements of a large number of stocks. We apply our method to stock price data from the US market from 2009 to 2015 downloaded from the corresponding URL (https://github.com/eliangcs/pystock-data), selecting \(V=4,234\) stocks with data available on the same days. We use this data as datasets divided by year. Specifically, we create one dataset for each year from 2009 to 2015, each containing \(N_0=252\), \(N_1=252\), \(N_2=252\), \(N_3=250\), \(N_4=252\), \(N_5=252\), and \(N_6=54\) days, respectively. The stock prices are converted to logarithmic values and standardized so that the mean is 0 for each year. We use \(M=10\) as the number of latent factors, a number that is manageable for human analysis. For the L1 regularization weight, we adopt \(\rho =3\), indicating moderate sparsity. It can be observed that the BIC shows better values compared to when using different \(\rho \) (Table 5). On the other hand, although the number of latent factors M does not always yield the best BIC, it provides relatively reasonable BIC values compared to poorly fitting models such as \(M=2\) (see Supplementary Information). Considering that real-world data is not as simple as being represented by a single multivariate Gaussian distribution, it can be said that striving for the best BIC and other criteria is not very meaningful. Therefore, from the perspective of interpretability, we provide an analysis using \(M=10\) in this case. The model is learned over 5,000 steps using stochastic gradient descent.Figure 3 shows the daily magnitude of latent factors estimated using the projection matrix derived by MGLASSO. Latent factors with clear trends and those without can be observed each year, with factors showing clear trends likely having dependencies on that year. This suggests that our method proactively extracts latent factors that become sparsely dependent annually. Indeed, years with significant variance in latent factors can be seen from Fig. 4. Moreover, the sparsity of dependencies between latent factors can be confirmed from Fig. 5. Thus, our method successfully extracts latent factors that show significant movements each year and demonstrate sparse dependencies annually.Table 6 presents the characteristics of stocks that significantly contribute to each latent factor. In this data, all stocks are classified by sector, and stocks within each sector are further categorized by industry. Analyzing the behavior of latent factors shown in Fig. 3,  4, and  5 based on these characteristics leads to the following interpretations:

1.

Latent factor #1 is associated with the “Other Precious Metals & Mining Silver” industry, and latent factor #3 is linked to the “Electronic Components” industry. In 2010, these industries showed a dependency relationship, which reversed by 2012. This could reflect the impact of significant international contexts, such as China’s export restrictions on rare metals in 2010 and the tensions over rare metals between the US and Europe in 201222.

2.

Latent factor #8 is strongly related to the energy sector, particularly the “Oil & Gas Drilling” industry, showing significant movements in 2012 and 2014. This period marked a crucial phase for the U.S. oil and gas drilling industry, notably due to the “Shale Revolution,” which garnered significant attention for its rapid production increases, economic benefits, and environmental impacts23. Additionally, latent variable #9, moving in strong correlation with #8 in the same years, is closely associated with healthcare, suggesting potential insights from companies related to both sectors.

3.

Latent factor #2 is strongly associated with the “Gold” industry and exhibited significant variation in 2013. While gold prices are influenced by various factors and are difficult to attribute to any single cause, 2013 was marked by significant government interventions in gold prices, potentially reflecting these actions’ strong impact.

4.

Latent variables #0 and #5 experienced significant fluctuations with a strong dependency relationship between them in 2009 and 2011. Since no closely related industry could be identified for these latent variables, the analysis presents challenges. However, examining related companies from different perspectives may yield intriguing insights.

Such insights could never be obtained from traditional analyses focused solely on the movements of a few representative stocks. Additionally, while stock price movements are strongly influenced by major societal situations occurring each year, it is challenging to understand stock price movements over several years with a consistent societal situation by identifying major latent factors annually. Our proposed method provides a consistent explanation for all stock price movements over a certain period, including several years, and objectively estimates major societal situations affecting stock prices through characteristic interactions each year, without being hindered by experts’ preconceptions.Table 5 BIC values for MGLASSO applied to U.S. stock price data and gene expression data using different \(\rho \).Table 6 Sectors and industries significantly associated with the 10 latent factors identified by MGLASSO from US stock price data.Figure 3Daily plots of the values of latent factors extracted by MGLASSO. Horizontal axis: Days arranged in chronological order. Vertical axis: Values of the latent factors.Figure 4The variances of the latent factors estimated from US stock price data. (A) The variance of latent factors, i.e., the diagonal components of the covariance matrix \({\varvec{\Sigma }}_k\) of latent factors, plotted annually. (B) The common variance \(\varepsilon _k\) of the remaining latent factors plotted annually.Figure 5Graph representation of the dependency among latent factors estimated from the stock price data of the United States. Edges represent the nonzero partial correlation coefficients between latent factors. Blue edges indicate positive, and red edges indicate negative partial correlation coefficients, with the thickness of the edges representing the magnitude of the absolute value of the partial correlation coefficients. Additionally, the size of the nodes reflects the variance of the latent factors, i.e., the magnitude of the diagonal components of the covariance matrix.Gene expression datasetWe demonstrate that applying our proposed method to gene expression data of cancer cells observed in actual drug resistance of cancer yields surprisingly profound insights.Figure 6The variances of the latent factors estimated from gene expression data. (A) The variance of latent factors, i.e., the diagonal components of the covariance matrix \({\varvec{\Sigma }}_k\) of latent factors, plotted annually. (B) The common variance \(\varepsilon _k\) of the remaining latent factors plotted annually.Pathway network analysis for uncovering mechanisms of acquired 5-FU resistanceWe apply our method to analyze pathway networks varying depending 5-Fluorouracil (5-FU) sensitivities of cell lines for uncovering the mechanisms of acquired 5-FU resistance of cancer cell lines. We use dataset from the “Sanger Genomics of Drug Sensitivity in Cancer (GDSC) dataset from the Cancer Genome Project (downloaded from “https://www.cancerrxgene.org/downloads/bulk download”), i.e., expression levels of \(V = 17,737\) genes for 1018 cell lines and IC50 values of anti-cancer drugs. We focus on the 5-FU that is a chemotherapeutic drug commonly used for solid cancers24. The dataset is divided into four bins according to the value of IC50 of 5-FU, i.e., samples with IC50 of \((-\infty , -\sigma )\) (B1), \([-\sigma , 0)\) (B2), \([0, \sigma )\) (B3), \([\sigma , \infty )\) (B4), where \(\sigma \) is the standard deviation of IC50 values. The datasets B1, B2, B3, and B4 include \(N_0 = 160\), \(N_1 = 266\), \(N_2 = 305\), and \(N_3 = 160\) cell lines, respectively. We try \(M=20\) as the feasible number of latent factors for humans to analyze the dependencies between the factors, and extract the 20 latent factors by using the MGLASSO. We choose \(\rho =30\) so that the graph of dependencies between the latent factors is sparse enough but not completely independent of each other. As with the Stock Price dataset, it can be observed that the BIC shows better values compared to when using different \(\rho \) (Table 5). On the other hand, although the number of latent factors M does not always yield the best BIC (see Supplementary Information), from the perspective of interpretability, we provide an analysis using \(M=20\) in this case, similar to the approach taken with the Stock Price dataset. The model is trained by SGD with 20, 000 epochs. Note that we observe that the loss function seems almost completely converged at the 15, 000 epoch and few dependencies are added or removed after that. Additionally, the resulting \({\varvec{D}}\), \(\varepsilon _k\) and \({\varvec{\Sigma }}_k^{-1}\) are almost the same for three trials.We perform the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis to reveal biological pathways behind the latent factors. For the 100 genes having the largest scores of each latent factor, KEGG pathway analysis is performed. Table 7 shows the most enriched pathways of each latent factor (i.e., each pathway shows the smallest p.value). That is, the latent factors can be interpreted as the corresponding biological pathways, and the 20 pathways in Table 7 may be crucial to understand molecular mechanisms of 5-FU resistance of cancer cell lines. In order to uncover molecular mechanism of the 5-FU resistance of cancer cell lines hidden behind the molecular interplays between thousands of genes, we construct the networks of the extracted 20 latent factors (i.e., pathway networks). Figure 7 shows the estimated pathway networks varying depending on 5-FU sensitivity. The variances of these factors in each bin are shown in Fig. 6.Table 7 The most enriched pathways of 20 latent factors.Figure 7Networks of latent factors of expression levels of genes corresponding from 5-FU sensitive to resistance cell lines (i.e., B1 \(\rightarrow \) B2 \(\rightarrow \) B3 \(\rightarrow \) B4), where each edge indicates the partial correlation, the element of the precision matrix divided by the square root of the corresponding diagonal element. Blue edges indicate the positive partial correlations and red edges indicate the negative. The size of the nodes indicates the variances of the latent factors, the diagonal elements of the covariance matrix.We focus on the following interactions between the latent factors.

0 (hsa04512: ECM-receptor interaction) and 2 (hsa04145: phagosome)

2 (hsa04145: phagosome) and 6 (hsa05202: transcriptional misregulation in cancer)

2 (hsa04145: phagosome) and 12 (hsa04612: antigen processing and presentation)

1 (hsa04510: focal adhesion) and 11 (hsa04974: protein digestion and absorption)

The positive interaction between the latent factors “0 (hsa04512: ECM-receptor interaction) and 2 (hsa04145: phagosome)” can be considered as 5-FU sensitive specific characteristic. The strong positive interaction becomes weaker from 5-FU-sensitive to non-sensitive cell lines, and the interaction disappeared in 5-FU resistant cell lines ( i.e., B4). Furthermore, the interaction between the latent factors “2 (hsa04145: phagosome) and 6 (hsa05202: transcriptional misregulation in cancer)” and “2 (hsa04145: phagosome) and 12 (hsa04612: antigen processing and presentation)” are also considered as the 5-FU sensitive specific interplays. We also consider the interaction between the latent factors “1 (hsa04510: focal adhesion) and 11 (hsa04974: protein digestion and absorption)” as a crucial interplay for uncovering the mechanisms of 5-FU sensitivity in cancer cell lines, because the negative interaction in 5-FU sensitive cell lines ( i.e., B1) disappeared in B2, and becomes the positive interaction in 5-FU resistance cell lines (B3 and B4).In the gene regulatory network, the hub genes connected with many other genes are considered as crucial markers that play vital roles in biological process and gene regulation25. Thus, the hub-pathway can be also considered a crucial marker in the pathway networks. The hubness of the pathways “hsa04145: phagosome”, “hsa05202: transcriptional misregulation in cancer” and “hsa04612: antigen processing and presentation” are considered as 5-FU sensitive specific characteristic, while the their hubnesses disappeared in 5-FU resistance cell lines. It implies that their activities are 5-FU sensitive specific characteristics.Table 8 shows the enriched genes in the identified pathways, where the column “Evidences” indicates previous studies that recovered the pathways as key components to understand the mechanisms of gastric cancer and/or 5-FU.Table 8 The enriched genes of the pathways and evidences related to 5-FU sensitivity, where \(*\) indicates the 5-FU sensitive-specific pathways (ı.e., their strong interactions and hubness are observed in the networks of 5-FU sensitive cell lines).

hsa04145: Phagosome
Guo et al.26 demonstrated that TPM4 expression is related to 5-FU drug sensitivity and TPM4 is a target to predict the 5-FU sensitivity in gastric cancer. The positively co-expressed genes with TPM4 are enriched in “hsa04145: Phagosome” and “hsa05202: Transcriptional misregulation in cancer”. Differentially expression genes (DEGs) between fluorouracil-resistant and -sensitive gastric cell lines are enriched in the pathway “hsa04145: Phagosome”28.

hsa05202: Transcriptional misregulation in cancer
This pathway was identified as one of biological process of DEGs between multi-drug resistance cell (i.e., 5-FU, paclitaxel, mitomycin, vinorelbine tartrate, cisplatin and gemcitabine) and cisplatin resistance cell29.

hsa04612: Antigen processing and presentation
The putative targets of DEGs between control and treatment with 5-FU involve in “hsa04612: Antigen processing and presentation”27.

It can be seen through the Table 8 that the identified pathways have strong evidences as crucial pathways to understand the 5-FU sensitivity. It implies that the proposed MGLASSO provides biologically reliable results for identifying crucial pathways to understand the complex mechanisms of 5-FU resistance of cancer cell lines.Figure 8 shows expression levels of the genes enriched in the 5-FU sensitive-specific pathways, where red and blue boxes indicate expression levels in 5-FU sensitive and resistance cell lines, respectively. The enriched genes in drug sensitive-specific pathways show relatively high expression levels in 5-FU sensitive cell lines compared with resistance cell lines. Furthermore, high expression variances of the genes can be also observed in drug sensitive cell lines. In short, the enriched genes in 5-FU sensitive-specific pathways show high activities and diversity in drug sensitive cell lines. It implies that the identified pathways and enriched genes may have key information to uncover 5-FU sensitive/resistance mechanisms of cancer cell lines.Figure 8Expression levels of genes in 5-FU sensitive-specific pathways, where red and blue boxes indicate 5-FU sensitive (ST) and resistance (RS) cell lines, respectively.We suggest though our results and literature that catalyzing the identified 5-FU-specific sensitive pathways (hsa04145: Phagosome, hsa05202: Transcriptional misregulation in cancer, hsa04612: Antigen processing and presentation) and expressions of the enriched genes in the pathways may play crucial roles to prevent acquisition of 5-FU resistance of cancer cells.To the best of our knowledge, this is the first computational study on biological pathways network for uncovering mechanisms of acquired drug resistance of cancer cell lines. We expect that our method will be a useful tool to reveal complex biological mechanisms hidden behind molecular interplays between giant number of genes.

Hot Topics

Related Articles