Prediction of plant complex traits via integration of multi-omics data

Prediction of complex traits using individual omics dataThe six traits, namely flowering time (days until the first flower was open), rosette leaf number (RLN), cauline leaf number (CLN), diameter of the rosette (DoR), rosette branch number (RBN), and stem length (SL), were collected for 383 Arabidopsis accessions from published studies9,10,11 (Fig. 1a, Supplementary Data 1). Samples for G, T, and M were taken from mixed rosette leaves harvested just before bolting at 22 °C, flowering time was measured at 10 °C, and the other five traits were measured at 16 °C9,10,11. Before investigating the utility of different omics data for plant complex trait prediction, we first examined the omics data structure among accessions (Fig. 1b), with the assumption that accessions with more similar trait values are expected to have more similar genetic information (i.e., G, T, or gbM). However, G, T, or gbM data alone explained only a small amount of variation, as there was no obvious relationship between the trait and omics similarity matrices: the Pearson’s r between phenotypic trait similarity among accessions (pCor; e.g., pCorflowering time see Supplementary Fig. 1a) and the corresponding similarity of G (kinship, Supplementary Fig. 1b), T (eCor, Supplementary Fig. 1c), and gbM (mCor, Supplementary Fig. 1d) only ranged from −0.02 to 0.17 (Fig. 2a). This weak correlation is consistent with the findings in our previous study of yield, height, and flowering time in maize3, and is expected because only a subset of G/T/gbM variants (e.g., a few SNPs) are expected to contribute to the variation in a complex trait, and the linear correlation between the whole set of variants (e.g., all SNPs) and complex trait values is low. In contrast, kinship and mCor were correlated with each other at a higher level (r = 0.43, Fig. 2a), indicating that gbM is more heritable than phenotypic traits, or that the M data were confounded by G, which will be discussed later on.Fig. 2: Relationships between omics data and traits, and trait prediction model performance.a Correlations between kinship (K), expression similarity (eCor), methylomic similarity (mCor), and trait similarity (pCor) matrices. Color in the heatmap: Pearson’s r between omics similarity matrices. FLT: flowering time; RLN: rosette leaf number; CLN: cauline leaf number; DoR: diameter of rosette; RBN: rosette branch number; SL: stem length. b Performance of prediction models for six traits. Models were built using genomic (G), transcriptomic (T), gene-body methylation (gbM) data, or omics data similarity matrices (K, eCor, mCor) with the ridge regression Best Linear Unbiased Prediction algorithm (for models built with Random Forest, see Supplementary Fig. 2a). The Pearson Correlation Coefficient (PCC, to be distinguished from the Pearson’s r between omics data similarity and trait similarity matrices in a) between true and predicted trait values on the test set (20%, held out before training the model) was used to evaluate model performance. The PCCs of models built using population structure (first five principal components of genetic variation) are indicated by blue dashed lines. Bar and error bar: average PCCtest and standard deviation of replicate runs (n = 10). Colors: models built using G- (blue), T- (red), and gbM-related (yellow) features. All: all three types of gbM data, namely, CG-, CHG-, and CHH-types. Source data are provided as a Source Data file.While the overall correlations are low, the likelihood that predictive information is encoded within the overall G/T/gbM data led us to build machine learning models to take advantage of all features from single omics data for trait prediction. We established single omics data-based trait prediction models using the algorithms ridge regression Best Linear Unbiased Prediction (rrBLUP)12 and Random Forest (RF)13, and the model performance was assessed using a hold-out test dataset and measured as the Pearson Correlation Coefficient (PCC) between true and predicted trait values (see Methods and Fig. 1c). In a previous study we found that, despite their simplicity, rrBLUP and RF outperformed other commonly used algorithms for most species and traits tested14; furthermore, RF has the advantage of allowing interpretation of the resulting models, in particular allowing identification of non-linear interactions between predictors. In addition to the whole set of features for each omics data, the similarity matrices of omics data (i.e., kinship, eCor, and mCor, which are derived from G, T, and gbM, respectively) were also used to build models. As expected, the higher the correlation between the omics data and trait values (Fig. 2a), the higher the model performance was for most traits and omics data types (Fig. 2b, Supplementary Fig. 2a–d). For each trait, model performance was similar regardless of algorithm or whether omics data values or value similarities among accessions were used as predictive features (Fig. 2b, Supplementary Fig. 2a, e). Most importantly, G-, T-, and gbM-based models had comparable performances (Fig. 2b, Supplementary Fig. 2a). For example, G- and T-based rrBLUP models had the highest performance for flowering time and CLN, respectively, whereas gbM-based RF models had the highest performance for flowering time. This is consistent with the findings of our previous maize study using both G and T data for trait prediction3.Since CG, CHG, and CHH methylation have different regulatory mechanisms and functions in plants15, we also established models using the different gbM types as separate features. Models using individual gbM data types tended to have similar or poorer performances than models using combined gbM data, and models using CHH methylation tended to have the worst performance (Fig. 2b, Supplementary Fig. 2a). Because of the complexity of M data (e.g., heterogeneity in numbers of methylated cytosine sites within genic regions), we also explored six additional derived M features (single site-based M, hereafter referred to as ssM, see Methods) for predicting flowering time as an example. rrBLUP models using ssM features had higher prediction accuracies than those using gbM (Supplementary Fig. 3a, b), but the prediction accuracies for RF models were not higher for unknown reasons (Supplementary Fig. 3c, d). The improved prediction of ssM-based rrBLUP models may be because the ssM features captured the distribution of methylation across each gene, which provided more detailed information about methylation patterns than the gbM data. Taken together, these results indicate that, similar to G and T data3,16, M data are also useful for predicting plant traits, and that M data need to be represented in different ways to maximize their predictive power.Distinct contributions of omics data to complex trait predictionsTo identify the informative variants embedded in the G/T/gbM data, we investigated the importance of features for trait prediction by interpreting the prediction models. Three measures were used to evaluate feature importance: (1) coefficients of features in the rrBLUP models (Supplementary Fig. 4a–c); (2) gini importances in the RF models (Fig. 3a–c); and (3) the average absolute SHapley Additive exPlanations (SHAP) values17 obtained from RF models (Supplementary Fig. 4d–f). Here, we focus on features important for predicting flowering time (for feature importances for the other five traits, see Supplementary Data 2–7) because there is abundant knowledge about the genetic control of this trait, which is crucial for interpreting the important features. To allow for a comparison of importances with T and gbM features, which are gene-based, G variants were mapped to genic regions (see Methods). Genes corresponding to or harboring important G/T/gbM features (defined as those with >95th percentile importance values) were considered important for flowering time prediction and are hereafter referred to as important genes. We found weak or no correlation between importance scores from models built using different types of omics data, regardless of the feature importance measure examined (Spearman’s ρ: −0.07–0.11), and there was little overlap of important genes between models (Fig. 3, Supplementary Fig. 4a–f).Fig. 3: Correlation of feature importance between different types of omics data.a–c Density scatter plots showing the Spearman’s rank correlation (ρ) of the importance scores between genomic (G), transcriptomic (T), and gene-body methylation (gbM) features of genes when Random Forest gini importance scores were used as measures of feature importance (n = 24,175). a G vs. T. b G vs. M. c T vs. gbM. All feature importance values < 10−6 were assigned a value of 10−6. Red dashed line: 95th percentile of feature importance. Color: gene density. SOC1: SUPPRESSOR OF OVEREXPRESSION OF CO 1; FT: FLOWERING LOCUS T; FLC: FLOWERING LOCUS C; SPL15: SQUAMOSA PROMOTER BINDING PROTEIN-LIKE 15; AGL15: AGAMOUS-LIKE 15; DOG1: DELAY OF GERMINATION 1. Source data are provided as a Source Data file.These results suggest that the similar trait prediction accuracy of models built with different omics data is not due to shared features, consistent with the findings of the maize study3. However, the correlation between the importance of G and gbM variants (Fig. 3b, Supplementary Fig. 4b, e) was higher than that for other comparisons, consistent with the relatively higher correlation between G and gbM similarities across accessions (i.e., kinship and mCor, Pearson’s r = 0.43, Fig. 2a) and suggesting potential confounding effects of G on gbM data. To disentangle gbM from G data, we built trait prediction models with the mCor residuals to exclude the kinship effects (Methods). Consistent with a confounding effect of G on gbM, these new models had significantly lower performance compared with models based on the mCor matrix for all traits (differences in PCC scores, which were used as measures of prediction accuracy: 0.01–0.18, p from two-sided Wilcoxon rank sum test <0.05) (Supplementary Fig. 4g). However, mCor-based models with confounding effects of kinship removed still performed better at predicting flowering time and RLN than those using the first five principal components of G data to approximate population structure (Supplementary Fig. 4g, the performance of the population structure-based model is used as the baseline for trait prediction). Thus, the components of gbM independent from G are also important for trait prediction.Benchmark flowering time genes identified as important features in flowering time prediction modelsIn the previous section we showed that models built using different omics data identified different important genes (Fig. 3, Supplementary Fig. 4a–f). We next asked how many genes with known functions in flowering time were identified as important genes. We downloaded 426 benchmark flowering time genes from FLOR-ID (http://www.phytosystems.ulg.ac.be/florid/)18 and TAIR (https://www.arabidopsis.org/) (Supplementary Data 8), and found that 169 were identified as important according to at least one of the three importance measures for at least one of the three individual omics datasets (Fig. 4, for full gene list, see Supplementary Data 9). Only two genes, FLOWERING LOCUS C (FLC) and its paralog MADS AFFECTING FLOWERING 2 (MAF2), were identified as important by all three independent omics datasets (orange font, Fig. 4). This is consistent with the roles of FLC10,19,20 and MAF221,22 in flowering time regulation being established through studies of genetic variation, transcript levels, and methylation levels. Another 27 genes (blue font, Fig. 4) were identified by two independent omics datasets. For instance, FLOWERING CONTROL LOCUS A (FCA), which increases H3K4 dimethylation in the central region of FLC and regulates its expression23, was considered important in the G and gbM models. The remaining 140 genes were dataset specific (black font in Fig. 4, and Supplementary Data 9), such as SUPPRESSOR OF OVEREXPRESSION OF CO 1 (SOC1), which was only identified as important in T models. This is consistent with our observations that there is little overlap between important genes identified by G, T, and gbM models (Fig. 3, Supplementary Fig. 4a–f), and that gbM data, although it is confounded with G information (Supplementary Fig. 4g), can make unique contributions.Fig. 4: Example benchmark flowering time genes identified using different importance measures and datasets.The heatmap shows how often benchmark flowering time genes were identified as important using different datasets. For each omics dataset, there are three feature importance measures: (1) RF – Random Forest gini values; (2) rrBLUP – absolute value of ridge regression Best Linear Unbiased Prediction coefficient; and (3) SHAP – average absolute SHapley Additive exPlanations values. Color in the heatmap indicates importance value percentile: dark lawn green, ≥99th percentile; light lawn green, ≥95th and <99th percentile; gray, <95th percentile. Font color indicates the number of omics datasets that identified a benchmark flowering time gene as important: orange, all three datasets; blue, two datasets; black, a single dataset. G, T, and gbM: genomic, transcriptomic, and gene-body methylomic data, respectively. All: all three types of gbM data, namely, CG-, CHG-, and CHH-types. Source data are provided as a Source Data file.We next asked whether significantly more benchmark flowering time genes than random chance were identified by our models. When the 95th percentile was used as the threshold, only G and T models with RF gini importance scores identified significantly higher proportions of benchmark genes (8.78% and 8.62%, respectively) than expected by random chance (p = 1.22e-03 and 1.76e-03, respectively, Fisher’s exact test, Supplementary Data 10). To understand why no more benchmark genes were identified than expected by random chance for most models and importance measures, we explored the following potential factors (for the rationale and analysis process, see Methods): cut-off threshold (95th or 99th percentile), the number of accessions used for training models, and difference in gene contributions to flowering in short days (SDs) and/or long days (LDs). The former two factors had little impact on the identification of benchmark genes (Supplementary Data 10–12), except for SHAP values when the 99th percentile was used, where significantly more benchmark genes than expected by random chance were identified for the G and T models (Supplementary Data 11). In addition, genes that when mutated or overexpressed had flowering time phenotypes in two conditions (SDs and LDs) were more likely to be identified using our approaches than those that only had a phenotype when mutated or overexpressed in a single condition (SDs or LDs) (Supplementary Data 10, 11).One thing worth noting is that gbM-based models identified no more benchmark genes as important than expected by random chance, regardless of the threshold used (Supplementary Data 10–12). This may be because gbM does not adequately represent the methylation profiles of a given gene. To assess this, we also interpreted ssM-based models in which the methylation profile across a gene region was represented (see Methods). An additional 144 benchmark flowering time genes were recovered as important by at least one ssM feature type/feature importance measure combination (Supplementary Data 9), and more benchmark genes were identified as important for a single ssM model (a maximum of 44 genes, Supplementary Data 13) than a single gbM model (a maximum of 24 genes, Supplementary Data 10). One example gene is CLEAVAGE STIMULATION FACTOR 77 (CSTF77), which is involved in the 3’ processing of antisense FLC mRNA24. Accessions with CG-type methylation at two different sites in CSTF77 had significantly longer flowering times than those without (Supplementary Fig. 5a, b), but there were no significant differences in flowering time between accessions with different SNP alleles (Supplementary Fig. 5c, d). In addition, there was no correlation between the expression levels or gbM levels of CSTF77 and flowering time (Supplementary Fig. 5e–h). These results explain why CSTF77 was uniquely identified by ssM-based models (Supplementary Data 9).Another potential reason for the low degree of enrichment of identified benchmark genes for most models is the fact that the plants scored for flowering time phenotypes were grown at 10 °C and those used for the G, T, and M data were grown at 22 °C. We also built models using the same G/T/gbM data to predict flowering time phenotypes recorded at 16 °C10, and found that the temperature at which flowering time was scored affected gene importance for flowering time prediction (Supplementary Data 14,15). For example, the expression of FRIGIDA (FRI), which, when functional, confers a vernalization requirement25, had a SHAP value of zero for flowering time prediction at 10 °C but a SHAP value of 0.075 (ranked 14th) at 16 °C (Supplementary Fig. 6a). This is consistent with the larger differences in flowering time between accessions with functional and non-functional FRI copies at 16 °C than at 10 °C (a temperature at which the vernalization requirement is met for accessions with functional FRI); the larger SHAP values for FRI in the 16 °C model indicate that it makes a higher contribution to flowering time prediction when the plants are not vernalized (Supplementary Fig. 6b). Furthermore, the higher importance rank of FRI features at 16 °C compared with 10 °C was also observed for all G- and T-models regardless of the importance measure examined (Supplementary Fig. 6c, d, and Supplementary Data 14, 15).Since a number of benchmark flowering time genes were identified when RLN and CLN were used as proxies for flowering time in some previous studies, we also asked how many benchmark flowering time genes could be identified by the RLN and CLN models. When only G-, T-, and gbM-based models were examined, 42 genes were identified as important by all flowering time, RLN, and CLN models (Supplementary Data 16); these are generally hub genes in the flowering time regulation network, such as FT, FLC, SOC1, FRI, BROTHER OF FT AND TFL1 (BFT), and SHORT VEGETATIVE PHASE (SVP)26. Consistent with the importance of FRI for predicting flowering at 16 °C, FRI was important for predicting RLN and CLN in both G- and T-models, probably because the temperature (16 °C) was the same as that at which RLN and CLN were measured. An additional 20 benchmark genes were identified as important by both RLN and CLN models, and 37 and 49 were specifically identified by RLN and CLN models, respectively. For example, TIMING OF CAB EXPRESSION 1 (TOC1) was previously shown to control photoperiodic flowering response when RLN was used as a proxy for flowering time27. Consistent with this, TOC1 only had importance ranks above 95th percentile in G models for RLN. In our prediction models, TERMINAL FLOWER 1 (TFL1) was only important for CLN prediction, which is consistent with the previous finding that tfl1 mutants showed significantly decreased CLN, but not RLN or days to bolting (another proxy for flowering time), compared with wild type (WT)28. Taken together, these results indicate that different omics datasets, ways to represent data, importance measures, environmental factors, and ways to measure traits must be considered to better capture the genetic basis of Arabidopsis flowering time and likely other complex traits.Identification of additional genes involved in regulating flowering timeTo determine whether all the features relevant to benchmark flowering time genes are sufficient to predict flowering time, we built RF models using G, T and gbM features for 426 benchmark genes separately or combined (hereafter referred to as benchmark gene-based models). Compared with the corresponding full models built using the features for all genes, the benchmark gene-based models had significantly lower performance for gbM-based and the combined models (Fig. 5a), suggesting that genes in addition to the 426 benchmark genes are involved in regulating flowering time. To test this, we selected the 426 most important genes that were not benchmark genes (hereafter referred to as “top non-benchmark” genes) from the full model, which was built using G/T/gbM features combined for all genes (Supplementary Data 17). We found that the top non-benchmark gene-based gbM model (built using the gbM features of these top non-benchmark genes) performed significantly better than the benchmark gene-based gbM model, but not the corresponding G-, T-, and combined models (Fig. 5a). This may be attributed to the fact that most benchmark flowering time genes were identified using genetic approaches (e.g., via forward genetic screens or GWAS) and/or transcriptomic data (e.g., via gene differential expression analysis), rather than methylomic data.Fig. 5: Identification of additional genes involved in flowering time regulation.a Performance of Random Forest (RF) flowering time prediction models using all features (dark slategray), only features related to 426 benchmark flowering time genes (dark sea green), or only features related to the top 426 non-benchmark genes (light gray). PCCtest: Pearson correlation coefficient between true and predicted trait values for accessions in the test set. G, T, and gbM: genomic, transcriptomic, and gene-body methylomic data, respectively. p-values are from two-sided Wilcoxon rank sum tests. Bar and error bar: average PCCtest and standard deviation of replicate runs (n = 10). b–h Statistical analysis of flowering time in single-gene mutants of important genes (bold font) and their paralogs (regular font), double mutant (DM) with loss of function of both paralogs, and wild-type (WT) plants. Numbers of individuals for each genotype are shown in the parenthesis. Flowering time was normalized across flats within blocks (see Methods). p-values are from two-sided Wilcoxon Rank Sum tests. Horizontal line in the box: median value; box range: interquartile range (IQR), 25th (Q1) to 75th percentile (Q3); whisker below box: Q1–1.5 IQR to Q1; whisker above box: Q3 to Q3 + 1.5 IQR; violin plot: distribution of datapoint values; dot: datapoint from an individual plant. Color of violin plot: light blue, single-gene mutant; light green, double mutant; light orange, wild type. Source data are provided as a Source Data file.To validate the functions of important genes in flowering time, we took advantage of an existing dataset in our lab to compare flowering time in mutants of 21 non-benchmark genes that were identified as important and the WT (Methods, Supplementary Data 18, 19). Six of these 21 genes affected flowering time when mutated (Fig. 5b–f). For example, a loss-of-function mutant of AT4G11070 (WRKY41), which controls seed dormancy29, flowered significantly earlier than WT (Fig. 5c). Consistent with this, WRKY41’s homolog Dlf1 regulates flowering in rice30. Another three genes had loss-of-function effects on flowering when mutated along with their paralogs (Fig. 5g, h). The remaining 12 genes did not have a significant effect on flowering time when mutated, alone or with their paralogs (Supplementary Data 18). One potential explanation for why no flowering time phenotype was observed for these 12 important genes is that our validation experiments were conducted in the Col-0 genetic background, but the important genes were identified in models built across multiple accessions.To check whether our models perform well in identifying genes that function in flowering time, we also measured the flowering time of the loss-of-function mutants of 37 “non-important” (importance rank ≤95th percentile) genes and their paralogs. We found that 43.2% (16 genes) had significantly altered flowering time when mutated (Supplementary Data 18, 19). This percentage is only slightly lower than that of experimentally validated important genes (42.9%, 9 out of 21 genes). One potential explanation for the similar percentages of predicted important and “non-important” genes with experimentally validated roles in flowering time is that, as discussed above, the importance of these genes may be dependent on the accessions and/or environmental conditions. In addition, there are far more features (e.g., >20,000 T features) than instances (only 383 accessions) in our models. Therefore, we do not have enough power to detect variants with significant but lower degrees of contribution to flowering time. Our false negative rate when using importance rank as a criterion is expected to be high.Accession-dependent contributions of genes to flowering time predictionThus far, we have evaluated the contribution of G, T, and gbM features associated with genes to the prediction of flowering time across accessions by dissecting the models through global interpretation31. However, some genes may be important contributors to flowering time only in specific accessions. To assess this, we determined the contributions of important features to flowering time in each accession (local interpretation) by examining the SHAP value for each feature in each accession (individual SHAP values, as opposed to the averaged value discussed earlier, see Methods). Here, a positive SHAP value for a feature in an accession means that the value of the feature in that accession contributed to a higher predicted trait value, i.e., longer flowering time. A negative SHAP indicates the opposite: the feature value contributed to reduced flowering time in an accession. The absolute SHAP value describes the degree of feature contribution to trait prediction. Here we first present the SHAP values of the top 20 important genes from the T model in detail as an example because more benchmark genes were among the top 20 genes in this model (Supplementary Fig. 7a) than in the G (Supplementary Fig. 7b) and gbM models (Supplementary Fig. 7c). Organizing the accessions into clusters based on SHAP values of the top 20 T features allowed us to examine the way that different features contribute to flowering time prediction. The accessions we examined formed eight clusters (Fig. 6a), and flowering time varied greatly across these clusters (Fig. 6b).Fig. 6: Accession-dependent effects of transcriptome features on flowering time.a Heatmap shows the SHapley Additive exPlanations (SHAP) values of the top 20 genes (y-axis) for each accession (x-axis) in the T model. Benchmark gene names are in red on the left of the heatmap. Color scale of the heatmap: SHAP values; colors of the branches: different clusters. All values > 0.5 were assigned a value of 0.5, and values < −0.5 were assigned a value of −0.5. SOC1: SUPPRESSOR OF OVEREXPRESSION OF CO 1; FT: FLOWERING LOCUS T; FLC: FLOWERING LOCUS C; SPL15: SQUAMOSA PROMOTER BINDING PROTEIN-LIKE 15; PIF3: PHYTOCHROME INTERACTING FACTOR 3; AGL15: AGAMOUS-LIKE 15. b Violin plot shows the distribution of flowering time of accessions within clusters shown in (a). Colors of the violin plots: different clusters. Horizontal line in the box: median value; box range: interquartile range (IQR), 25th (Q1) to 75th percentile (Q3); whisker below box: Q1–1.5 IQR to Q1; whisker above box: Q3 to Q3 + 1.5 IQR; violin plot: distribution of datapoint values. Numbers of accessions in each cluster are shown in the parenthesis. c Pearson correlation (r) between SOC1 (left), FT (middle), and FLC (right) expression levels (x-axis, loge [transcript per million + 1]) and flowering time (n = 306). Color scale: SHAP values for SOC1, FT, and FLC when using expression to predict flowering time (y-axis). d, e Relationships between SOC1 and FT SHAP values among accessions (n = 306); example accessions are labeled. Colors in (d): membership of accessions in clusters specified in (a); color scale in (e): flowering time. Source data are provided as a Source Data file.In cluster 1 and 2 accessions (Fig. 6a, b, d, e), the SHAP values for SOC1, FT, and FLC expression were all positive, indicating they contribute to longer flowering time. Their positive contribution is mainly due to lower SOC1 and FT expression and higher FLC expression compared with other accessions (Fig. 6c). In cluster 8 accessions (Fig. 6a, b), all three genes have negative SHAPs, and the shorter flowering time (Fig. 6e) is due to higher SOC1 and FT expression but lower FLC expression (Fig. 6c). This is consistent with earlier findings that SOC1 and FT promote flowering32, FLC represses flowering33, and the expression levels of SOC1 and FT are negatively correlated with FLC expression34. This coupling between SOC1, FT, and FLC in terms of expression (Fig. 6c, Supplementary Fig. 8a, b, f, g), SHAP values (Fig. 6d, Supplementary Fig. 8c), and flowering time contribution (Fig. 6e, Supplementary Fig. 8e) is the predominant flowering time regulatory mechanism for most accessions. However, the action of these components can be decoupled. For example, in cluster 4, 5, and 6 accessions (dots in the second and fourth quadrants, Fig. 6d), the SHAP values of SOC1 and FT have opposite signs, which is associated with moderate flowering time values (Fig. 6b, e). Another example is cluster 7 (orange dots in the third quadrant, Fig. 6d) where, despite the positive contribution of FLC (Fig. 6a, Supplementary Fig. 8c), both SOC1 and FT contribute negatively to flowering time (Fig. 6d). This coupling and uncoupling between SOC1, FT, and FLC expression across accessions indicates that regulation of flowering time may be more complex than what has been reported. Accession-dependent contributions of other important T features are also observed, although to a much lower extent. An example is SQUAMOSA PROMOTER BINDING PROTEIN-LIKE 15 (SPL15); clusters 2 and 4–8 can be divided into sub-clusters depending on whether SPL15 expression contributes positively to flowering time in an accession (Fig. 6a).We also clustered accessions using SHAP values of the top 20 important features from G and gbM models, and obtained eight and nine clusters, respectively (Supplementary Fig. 9a, c), which resembled the distribution of T model-based clusters in that the clusters could be characterized by a few top features. For example, the second (AT4G38550) and the third (AT1G35610) most important genes from the gbM model can be coupled or uncoupled with FLC in a similar fashion as SOC1 and FT (Supplementary Fig. 9c). The different omics data types yielded clusters with different accessions (Supplementary Fig. 10). These findings further demonstrate that different omics data reveal different contributors to flowering time variation among accessions, and that the different effects of genes on flowering time among accessions can be disentangled through model interpretation. In addition, knowledge about flowering regulation in some accessions may not be generalizable to others, as demonstrated by the identification of different benchmark genes when different sets of accessions were used (Supplementary Data 10, 12). This may be explained by the differences in genetic backgrounds among accessions25 and the high complexity of genetic interaction networks regulating flowering35. The accession-dependent effects of genes on flowering may also partially explain the low degree of enrichment of benchmark genes among important genes in our original G, T, and gbM models because these benchmark genes were predominantly discovered in the Col-0 accession.Genetic interactions revealed through integration of multi-omics dataIn the above sections we showed that different types of omics data revealed overlapping but mostly distinct genes impacting flowering time. Next, we asked whether combining different types of omics data improves the prediction accuracy. We found that combining G and T or all three datasets improved model performance for RF models, but not for rrBLUP models (Fig. 7a). Because the RF algorithm considers non-linear feature combinations while rrBLUP does not, the better RF model performance suggests that the inclusion of interactions between features from different omics data may have improved model performance. To evaluate this possibility, we established an additional RF model, which only included G + T + gbM features relevant to the 426 benchmark flowering time genes (see Methods) to facilitate model interpretation. We used the SHAP approach36 to identify feature interactions (see Methods), where the contribution of feature X to trait prediction is influenced by values of feature Y. The SHAP feature interaction can help us identify potential genetic interactions between genes or variants, such as epistasis37. We identified 7,186 feature interactions, including all six possible combinations between omics data types: G-G, T-T, gbM-gbM, G-T, G-gbM, and T-gbM (Supplementary Data 20). T-gbM, G-T, and T-T interactions were the most prevalent (Fig. 7b). The T-features of three most important genes in the T model—SOC1, FT, and FLC (Fig. 6a)—had the highest numbers of feature interactions (707, 638, and 279, respectively), consistent with their reported functions as floral integrators32, which receive floral promotion or inhibitory signaling inputs from distinct pathways (Fig. 7c, Supplementary Data 20).Fig. 7: Prediction models integrating all omics data types and putative interactions between flowering time genes.a Prediction accuracy of flowering time (Pearson Correlation Coefficient between true and predicted values for accessions in the test set, PCCtest) for models built using single (peachpuff) and combined (cornsilk) omics data. Left panel: ridge regression Best Linear Unbiased Prediction (rrBLUP) models, right panel: Random Forest (RF) models; G, T, and gbM: genomic, transcriptomic, and gene-body methylomic data, respectively; bar and error bar: average PCCtest and standard deviation of replicate runs (n = 10); gray dashed line: the highest average PCCtest for flowering time models when single omics data were used. p-values are from two-sided Wilcoxon rank sum tests. b Distributions of SHapley Additive exPlanations (SHAP) interaction values for different types of feature interactions. X-axis label: interaction type and number of identified interactions. Orange and blue lines: 99th and 95th percentiles of interaction values, respectively; numbers in orange and blue font: numbers of feature interactions above the 99th and 95th percentiles, respectively, for each type of interaction. c Network of features with ≥3 interaction values above the 95th percentile. Blue, red, and orange points: G, T, and gbM features, respectively; node size: number of interactions; edge thickness: log10 (interaction value + 1)/2; blue edges: 18 interactions that ranked above 20th and had nodes with size ≥ 3; gray edges: all the other interactions; black font: three genes with the highest numbers of interactions. d–f Interactions between SUPPRESSOR OF OVEREXPRESSION OF CO 1 (SOC1) expression (n = 306) and FLOWERING LOCUS T (FT) expression (d), ZEITLUPE (ZTL) CG-type gene-body methylation (e), and a NUCLEAR PORE ANCHOR (NUA) single nucleotide polymorphism (SNP) (f). Equations in (d,e) are for regression of the SHAP values (y-axis) on the feature values (x-axis) of a given feature in accessions with certain FT (d) and SOC1 (e) expression values (loge [transcript per million + 1]): green font, FT expression ≥ 0.5 or SOC1 expression ≥ 3; blue font, FT < 0.5 or SOC1 < 3. The numbers of accessions with certain feature values are shown to the left of the arrows. Major: major allele; minor: minor allele. Source data are provided as a Source Data file.Among the top 20 interactions with the highest interaction values, five were T-T interactions of SOC1 with FT (ranked 1st), MIR172B (2nd), SPL5 (13th), FLC (19th), and PHYTOCHROME INTERACTING FACTOR 3 (PIF3) (20th) (Fig. 7c, d, Supplementary Fig. 11a–e). SOC1 was previously shown to regulate MIR172B38 and SPL539 by directly binding to their promoters. However, no direct biological interaction between SOC1 and PIF3 has been reported. Also among the top 20 interactions, two were T-gbM and T-G interactions of SOC1 with ZEITLUPE (ZTL, ranked 5th, Fig. 7e) and Nuclear Pore Anchor (NUA, ranked 6th, Fig. 7f). ZTL and its two homologs, LOV KELCH PROTEIN 2 (gbM-T interaction with SOC1 ranked 121st, Supplementary Fig. 11f) and FLAVIN-BINDING, KELCH REPEAT, F-BOX 1 (gbM-T interaction with SOC1 ranked 3333rd, Supplementary Fig. 11g), function as E3 ubiquitin ligases and regulate flowering time regulators, such as CONSTANS and FT 40. No direct biological interaction has been reported between SOC1 and ZTL or its two homologs, whereas the interaction between SOC1 and NUA was reported before: expression of SOC1 was higher in nua-1/4 mutants compared with WT41. In addition to these T-T, T-gbM, and T-G interactions, we also observed G-G, G-gbM, and gbM-gbM interactions that have not been reported before: a G-G interaction between AT1G11520 and HASTY 1 ranked 8th (Supplementary Fig. 11h), gbM-gbM interaction between VIN3-LIKE 3 and INDUCER OF CBF EXPRESSION 1 ranked 25th (Supplementary Fig. 11i), and G-gbM interaction between AT1G11520 and FY ranked 45th (Supplementary Fig. 11j). These interactions might indicate potential biological interactions between genes.Furthermore, by examining the feature interactions in detail, we found some interesting patterns. For example, the T-gbM interaction between SOC1 and ZTL illustrates how SOC1 and ZTL interact across accessions (Fig. 7e): in accessions with higher SOC1 expression, contributions of ZTL CG-type gbM were near-zero regardless of the methylation levels of ZTL, whereas in accessions with lower SOC1 expression, the contributions were larger either positively or negatively. A similar pattern was also observed for other interactions, such as a T-T interaction between SOC1 and MIR172B (Supplementary Fig. 11b). Taken together, these findings show that potential interactions at different molecular levels can be identified on a large scale by interpreting computational models integrating multiple omics data.

Hot Topics

Related Articles