Targeted co-expression networks for the study of traits

Bootstrapped LASSO decreases algorithm’s instability when identifying seed transcriptsThe first step to generate a TGCN on a transcript expression profile is to select the transcripts that contribute to predict the trait of interest T with LASSO regression. We are concerned with the stability of LASSO when selecting the seeds. Our algorithm would be stable if a slight variation in the sample set does not significantly vary the seed transcripts set composition. To empirically assess stability, we used the gene expression profiles of 13 different brain regions from GTEx (see supplementary Table S1) to predict the age of the donors at the time of death and created 13 age-TGCNs. The empirical instability of LASSO is illustrated in Fig. 1A. Each curve corresponds to a different value of the number of times the transcripts are selected or ratio of appearance, r. We observe that, when the number of LASSO runs (x axis) increases, the number of seeds keeps increasing at high pace, preventing the seed geneset from becoming stable (see curve for r value of 1 in Fig. 1A). However, when we bootstrap the LASSO, then the curves stop growing early on (see curves for r values 6, 7 and 8 in Fig. 1A). We experimented with LASSO runs values up to 50 in all 13 brain regions and observed that, amongst all seeds detected with 10 LASSO runs, the ones that were selected in at least 75% of the models remain selected at the same rate after running 50 LASSO runs. This suggests no more than 10 bootstraps are needed. This strategy allowed us to detect the more confident and universal transcripts to predict the age of the donors in each brain region of the GTEx dataset. Additionally, we observed that, for most of the brain regions, the age-predictive models created with only the repeatedly selected transcripts as predictors (seed transcripts models) reported a lower RMSE on test compared with the LASSO models created for the same brain region (7.42 and 8.16 RMSE on test on average, respectively) (see supplementary table S2). Among the different brain regions, the highest R2 was observed for the cerebellum region, with a mean R2 of 0.732 ± 0.089, where the standard deviation represents the error of the cross validation. On the opposite side, the substantia nigra brain region reported the worst performance to predict the age of the donors, reporting a mean R2 of 0.193 ± 0.155. On average, the seed transcripts models reported an average R2 of 0.435 across the different brain regions.Figure 1Study of lasso stability in high-dimensionality problems. (A) The number of new transcripts selected by a LASSO run when increasing the number of runs in comparison to only one iteration for GTEx cortex tissue with donor´s age as the predicted variable. Increasing the number of runs shows a strong increase in the number of transcripts selected at least once (red curve). We are interested in the transcripts selected by LASSO in a reasonable number of iterations, e.g. 6, 7 or 8 in the graph. (B) The plot shows a positive association, R2 0.279 (P < 8.66 × 10–38) between the eligibility of a transcript (i.e., number of times it is selected across the b LASSO bootstraps) and the average size of the coefficients for that transcript within the linear models that predict the trait. This plot corresponds to the age-TGCN for GTEx cortex. Each point is a transcript, the x axis represents eligibility and the y-axis represents the mean coefficient across linear models predicting age (see supplementary table S3). (C) Both plots show the relation between the number of transcripts used to predict age and the adjusted R2 for all age-TGCNs for GTEx brain tissues. We see a clear negative relation between adjusted R2 at predicting age as the ratio of appearance increases (left), that is, because the number of transcripts decreases within the model (right). Plots in B and C illustrate a trade-off in using TGCNs. The larger the seed size, the better performer is the final linear model to predict the trait, but the TGCN gets also larger and difficult to interpret. (D) The influence of the number of samples used to train the model, m, is represented in terms of LASSO stability. For five different brain tissues of the GTEx dataset, 10 LASSO runs were applied using 25, 50, 75 and 100% of the samples for the training. We observe that, the higher the m, the higher the LASSO stability. Jaccard represents the overlap of relevant transcripts between pairs of iterations, nzero represents the number of transcripts selected through the 10 iterations and RMSE train and test represent the train and test error of the LASSO models.Interestingly, we observe that the number of times that a transcript t was selected in the bootstrap is positively correlated with the magnitude for the coefficient in the LASSO models for the corresponding transcript t (see Fig. 1B and supplementary table S3). This suggests that bootstrapped LASSO tends to select the most relevant transcripts in the prediction of the trait. On the other hand, we empirically observe the trade-off between increasing values of \(t(r)\) and lower R2 (Fig. 1C, left panel). Alternatively, large seed set sizes naturally increase R2 (Fig. 1C, right panel).Finally, we also investigated how sample size influences stability. We use the Jaccard coefficient across runs as an estimate. Figure 1D includes experiments for brain regions that showed the highest stability (cortex), the lowest stability (spinal cord) and middle stability (caudate, hippocampus and amygdala in this order). We observe that, by increasing the training sample size, we get an increase of stability (cor = 0.535, P < 5.78 × 10–62). This trend was seen for all the explored brain regions except for the spinal cord. This tissue showed the lowest performance when predicting the age of death.Seed transcripts are representatives of independent pathwaysSeed transcripts are expected to be important transcripts of each pathway involved in the trait or phenotype of interest. When creating TGCNs, we implicitly assume that biological pathways influence the phenotype of interest somewhat independently and additively. Accordingly, their respective representative transcripts, the seeds, should be independent of each other. In terms of how correlated seeds are pairwise, we observe a significant difference (T-test P < 0.05) in the mean of the maximum pairwise correlation of LASSO selected transcripts across runs (0.489) and that observed within randomly chosen genesets paired in size (0.756) for cortex tissue from GTEx dataset (Fig. 2A, supplementary table S4, see “Methods”). Therefore, the bootstrapped LASSO identifies unassociated transcripts.Figure 2Independence of seeds and overlap across modules. (A) For all 13 brain tissues, we obtained the pairwise correlation between transcripts selected by LASSO (magenta) and compared them with the same number of pairs of randomly chosen transcripts (turquoise) across runs. We consistently see across runs that LASSO gets transcripts with less association (absolute value of maximum Spearman correlation between pairs) than we would expect by random chance (see supplementary table S4). This illustrates that seeds show no association and it is maintained across bootstraps. (B) For each brain region of GTEx, enrichment (y axis) is shown for the age-specific seed transcripts grouped per ratio of appearance (x axis). Enrichment is measured as the sum of the − log10P of all the annotations normalized by group size in transcripts. No geneset enrichment reaches the required cut-off for significance, − log10(0.05). (C) Average pairwise overlap between modules in all age-TGCNs from GTEx brain tissues. For all tissues, we observe an overlap, as estimated by Jaccard, close to zero. This suggests each module represents different gene functions and pathways. (D) Overlap between modules within the substantia nigra age-TGCN. Each row and column refer to a module, led by the corresponding seed, at the row and column label. Each cell shows the number of common transcripts. Color scale represents the adjusted -log10P of the overlap based on a Fisher Exact test. (E) Same plot as D but for the hippocampus age-TGCN. We observed that CLEC5A module showed a significant overlap with CDKN2A (P < 1.58 × 10–21), VENTX (P < 3.65 × 10–91) and MARCO (P < 2.01 × 10–20) modules with a Fisher exact test after p-value adjustment. In addition, we also observe a significant overlap between VENTX and MARCO modules (P < 1.40 × 10–25). (F) PCA projection at transcript level with the union of the transcript of modules VENTX, MARCO and CLEC5A from the age-targeted network of the hippocampus brain tissue. We observe that the transcripts of module CLEC5A go in one direction, those of MARCO in another, those of VENTX in a different direction while the transcripts common to these three modules appear at the intersection of the point clouds.If seed transcripts represent independent pathways and we try to annotate them functionally as a geneset, we should not observe any significant enrichment. Figure 2A shows enrichment of GO, KEGG, REAC and HP terms found in seed transcripts grouped by ratio of appearance from 1 to 10. None of the seed transcript sets showed relevant functional abundance (see also methods). This shows that these transcripts are not involved in the same pathways. Therefore, they might be involved in different non overlapping pathways.Modules within a targeted GCN show low or null overlapFor each brain region of the GTEx dataset, the transcripts selected by LASSO in at least 8/10 iterations were kept as seed transcripts since we considered these are reliable and we also obtained a good balance between the number of modules created and the RMSE of the model (see supplementary table S2). Interestingly, we observed that the seed transcripts detected per brain region replicate, on average, in seven of the 12 different brain regions of the study (for more information, see supplementary Fig. S1).During the process of module creation, potential module member transcripts may show high correlation with more than one seed. TGCN allows transcripts to belong to more than one module, although this seldom happens. Our main interest is the characterization of pathways involved in the trait and we know genes are multifunctional29. What is relevant is that the independence of seed transcripts leads to quite different modules. In Fig. 2B, we show pairwise Jaccard coefficients between modules within each brain region-specific network. For all brain regions, the overlap between modules is practically zero, therefore the modules hold different transcripts (see supplementary table S5). Extreme brain regions regarding modules composition overlap are the age-TGCN in substantia nigra, whose modules show no overlap (Fig. 2C) and the age-TGCN in hippocampus, which showed a significant pairwise overlap (Fisher exact test P < 0.05) between three of its modules, CLEC5, MARCO and VENTX modules (Fig. 2D).Targeted GCNs provide more specific modules than WGCNAWGCNA is the most widely extended hypothesis free approach to create genome wide gene co-expression networks. Since TGCNs are focused on a specific trait and its transcript pool is restricted to gene expression pathways associated with the trait, TGCNs are smaller and much more specific than hypothesis free GCNs which act on the whole transcriptome. For example, on average, the GTEx WGCNA networks we created for the 13 brain tissues use transcript pools of 15,281 genes in size and have nine modules average each network, with a mean size for each module of 1787 transcripts. The corresponding age-TGCNs use, on average, 426 transcripts in total and have 12 modules average each network with a mean module size of 37 transcripts. 35.9-fold more transcripts in WGCNA GCNs (see supplementary tables S6 and S7). Remarkably, even though TGCN modules are smaller in transcripts, TGCN achieves similar or higher functional abundance across all 13 brain regions, on average 67.95 and 44.02 functional abundance per module in TGCN and WGCNA respectively.Seed transcripts in targeted GCNs are orthogonal to WGCNA hub transcriptsWe evaluated the relevance of the GTEx age-targeted seeds transcripts into the corresponding WGCNA modules. We found those transcripts we consider as seeds in TGCNs play no particular relevant role in their WGCNA module, showing a modest module membership, with an average percentile of 14.51 (see supplementary table S8). A plausible explanation for this is that in this case, the trait, age at death, has a subtle association with the genome wide gene expression profile.We also investigated the possible role of WGCNA hubs transcripts into the age-TGCNs in brain samples. For each WGCNA module, we selected the top 10 transcripts with highest adjacency values (hub transcripts) and checked the overlap with the TGCNs modules from the same brain region. We found that only a small proportion (18.75%) of WGCNA hub transcripts significantly overlapped with at least one TGCN module from the same brain region (see supplementary table S9).TGCN modules tend to be subsets of single WGCNA modulesIn order to better understand the relation between WGCNA modules and TGCN modules, we investigated how transcripts in a TGCN spread across the WGCNA GCNs created from the same gene expression profile. We found that, across all GTEx brain regions, 75.95% age-TGCN modules are included within a single WGCNA module while 16.46% WGCNA modules contribute significantly to more than one age-TGCN.See, for example, the crosstab plot (see “Methods”) comparing the WGCNA GCN (6 modules) and the age-TGCN within the caudate brain region (16 modules) (supplementary Fig. S2).All the transcripts not included within the age-TGCN fell into the cells of the rightmost column and only the MT1G, GDF15 and IGF2BP2 modules significantly overlapped with more than one module while the rest of modules overlapped with just a single module. WGCNA modules which include any transcript appearing in the corresponding TGCN, show significant overlap with more than one TGCN module. See for example, the yellow module including entirely six modules out of the 16 TGCN total modules. At the same time, only 53.58% of the transcripts in that module are used in TGCN modules. This suggests that TGCN modules’ functions are local to single WGCNA modules and by virtue of using only those transcripts that contribute to predict the trait, and those that co-express with them, they filter out all functions not relevant to the trait.TGCNs show specificity to the trait and brain regionTGCNs are a set of transcript-based modules, led by seed transcripts with no association to any other seed through mRNA levels, and that also resemble independent pathways within each module. We hypothesize that the pathways play a biological role, mediated by gene expression, in relation to the targeted trait. Therefore, the APP-targeted GCN that models the role of APP in the AD phenotype is assumed to describe the pathways APP is involved with. To what extent, do those pathways emerge from the fact that we are predicting a trait that is really associated with the gene expression of the samples used? What would happen if instead of using APP as the target, we would use as target a totally irrelevant gene in the omics plus trait context involved? Would the TGCN approach be specific enough to detect no relevant signal when there is actually none? To shed some light about these questions, we conducted an experiment with six genes not previously related with AD. The selection of control genes has been based on two criteria: (1) the selected control genes must be well studied, and their function characterized, to minimize the chance of eventually finding them associated with AD; (2) they must be functional, in the sense that they are associated with diseases other than of neurodegenerative nature. In supplementary table S10, the disease associated with each control gene is available.Firstly, within all 10 LASSO bootstraps for APP, the minimum test set R2 for the linear model using the seeds selected through that bootstrap is 0.935, mean of 0.952. In all 10 LASSO bootstraps for all the other six genes, the maximum R2 was 0.702 and the minimum was 0.149 (see supplementary table S10, supplementary Fig. S3A). Secondly, the seeds detected by the method in all ten LASSO bootstraps for APP were nine. In three of the six other TGCNs, the method detected no seeds in any of the runs. In the other three, the method detected only two seeds in each (supplementary table S10, supplementary Fig. S3B). Finally, the average functional abundance of modules in the APP-TGCN, measured as the sum of -log10(Pval) for all enriched terms, is 293.9. We see practically no average enrichment in the other six networks, with 10.8 enrichment on average for the best network, TNNT2-TGCN. This is a 26-fold difference (see supplementary table S10, supplementary Fig. S3C), which we consider as a strong indication of TGCN specificity.Seed transcripts and co-expression modules of the APP-TGCN replicate in two alternative AD cohortsTGCNs are constructed in three steps, by (1) detecting the seeds with model bootstrapping, (2) constructing the co-expression modules from those seeds and (3) annotating them (see Fig. 3). We constructed an APP-TGCN on the ROSMAP cohort (see materials) for r values of 8, 9 and 10. Thoroughly the paper, we focused on the ratio of appearance of eight, as it is a good trade-off between APP transcript expression prediction and complexity (i.e., it has 25 modules and 880 transcripts in total). We wanted to shed light about the APP transcript role in AD from brain samples gene expression. To assess whether this TGCN replicates, we used two alternative cohorts, Mayo and MSBB (see “Materials”). The seeds detected in ROSMAP with a minimum ratio of appearance of eight lead to a linear model with high APP predictivity (R2 = 0.97). These seeds also showed a high APP-predictivity in Mayo and MSBB cohorts, with R2 of 0.82 and 0.86, respectively (see supplementary table S11 for details on all models for r values of 8, 9 and 10). To properly assess whether those values of R2 were better than random chance, a permutation analysis was applied. To this end, we created 100,000 APP-predicting linear models with randomly chosen seed transcripts (same size of relevant seeds) and compared their performance with the APP-predictive model created with the relevant seed transcripts. This procedure was applied separately for both Mayo and MSBB. Results showed that the APP-predictive model created with the relevant transcripts performs better than APP-predictive models created with randomly chosen seed transcripts, leading to empirical p-values of statistical significance P < 0.059 and P < 4 × 10–5, respectively (see supplementary table S11, supplementary Fig. S4A,B). This suggests that seed transcripts discovered in ROSMAP reasonably explain APP transcript expression at the other two cohorts as well, nominally in Mayo and clearly in MSBB. Additionally, we evaluated how critical it would be to not find some of the seed transcripts in the expression matrix of the replication set. To this end, a permutation test was applied to compare the performance of the model created with the real seeds with the models created (1) removing each seed individually; (2) removing sets of five seeds (10,000 simulations); (3) removing sets of 10 seeds (10,000 simulations). This strategy was applied for Mayo and MSSB cohorts separately. We observed that, since it is an additive model, the effects of missing seeds are not noticeable (see supplementary table S12).Figure 3Workflow for the generation of targeted co-expression networks. The pipeline to create a TGCN is split into three steps (1) select the specific transcripts for the trait of interest (i.e. seed transcripts) based on a bootstrapped LASSO, (2) create the TGCN around those transcripts and (3) annotate the network modules. Step 1: to identify the transcripts that best predict the trait of interest, bootstrapped LASSO is used based on gene expression. Bootstrapped LASSO increases reliability of selected transcripts (seeds): the LASSO model is created b times with different train and test resamples. After all bootstraps, the TGCN software reports transcripts selected one or more times, along with how many times they were selected. Seed transcripts are selected on the basis of their ratio of appearance across bootstraps, r. Those transcripts are then used within a linear regression model as predictors to evaluate their prediction power on the trait and, therefore, the relevance of gene expression predicting that trait. Step 2: each seed transcript is the seed for a co-expression module. Modules are completed by adding the genes more co-expressed with the seed based on the absolute Pearson correlation. To determine how many transcripts to add to each module, the algorithm offers three options. In the first option, all seed transcripts are equally relevant within the network, therefore all modules should have the same size (i.e. 100 transcripts). In the second option, the higher the magnitude of the coefficient in the linear model, the more relevant the seed is within the TGCN. Therefore, the seed transcript with the highest coefficient receives the maximum number of transcripts (i.e. 100 transcripts). In the third option, we look for the minimum size of the module that reports the best enrichment possible.In order to assess whether co-expression modules are also present in the replication cohorts, for each APP-TGCN module built on the ROSMAP dataset, we tested for significant overlaps in the modules of Mayo and MSBB APP-TGCNs (see supplementary Fig. S4C,D). Out of the 25 modules identified in ROSMAP, seven modules did not show a significant overlap with any other module of the replication networks. The small size of the unreplicated co-expression modules (average size is 40) might explain, in part, their low replicability. Regardless, the replicability of 18 out of 25 modules across three brain transcriptomic AD datasets, demonstrated the robustness of this new method.Insights into the biology of APP and its role in ADThe TGCN R software package generates, for each targeted network, a whole bundle of information, in HTML format, on the TGCN construction process and results annotation (see the TGCN GitHub for the one generated for APP-TGCN). All this is summarized in a table with an entry for each module (see supplementary table S13). The complete APP centric network is composed of 880 transcripts, spread across 25 modules, whose size varies in the [10, 80] genes range (see “Methods”). The 25 seeds alone can explain most of the APP gene expression variation in a linear model (R2 0.96). The CellTypeEnriched column of supplementary table S13 shows four modules that clearly contain cell marker enrichment, including the ATP1B1 module enriched for dopaminergic neuron markers (Fisher’s exact P < 4.55.10–3), the ABCD3 module, enriched for astrocytic markers (P < 1.083.10–3), the FRZB module enriched for mural cell markers (7.761.10–4), and the NFASC module enriched for oligodendrocyte markers (3.899.10–4). This adds evidence to the already known complex expression profile of APP in AD brains and different expression patterns across brain cell types. After multiple test corrections, the eigengenes of two of these modules were significantly correlated with AD status, i.e., the ATP1B1 and NFASC modules (P < 0.038 y P < 7.02.10–4, respectively). The ATP1B1 module is enriched for GO biological processes like synaptic vesicle cycle (GO:0,099,504, P < 3.65.10–4) and vesicle mediated transport in synapse (GO:0,099,003, P < 4.46.10–4). The NFASC module seems to be involved in the development of neurons, particularly through axons in the neurons since this module is enriched for terms like main axon (GO:0,044,304, P < 9.32.10–5) and axon development (GO:0,061,564, P < 6.49.10–4). Both ATP1B1 and NFASC modules replicated in two independent cohorts. Specifically, AT1BP1 module replicated in Mayo and MSBB cohorts with empirical p-values of P < 2.76.10–9 and P < 1.96.10–10, respectively. NFASC module replicated in Mayo and MSBB cohorts, reporting P < 4.51.10–4 and P < 1.03.10–11, respectively.Additionally, the HNRNPA2B1 module, showed a strong association with AD case/control status (P < 2.13.10–5), and seems to be involved in the metabolism of RNA through GO BP terms like RNA metabolic process (GO:0,016,070, P < 1.23.10–10), gene expression (GO:0,010,467, P < 1.73.10–8), particularly seems to be specialized in splicing, through the reactome term mRNA splicing (REAC:RHSA-72163, P < 6.89.10–4). This module also replicated in Mayo and MSBB cohorts (P < 3.16.10–4 and P < 1.83.10–15, respectively).

Hot Topics

Related Articles