Comprehensive mapping and modelling of the rice regulome landscape unveils the regulatory architecture underlying complex traits

Charting a reference atlas of chromatin accessibility in riceTo generate a comprehensive landscape of accessible chromatin in rice (Oryza sativa), we took advantage of an improved ATAC-seq protocol (UMI-ATAC-seq12, which incorporates unique molecular identifiers to the regular ATAC-seq technique for accurate quantification and footprinting) to perform chromatin accessibility profiling in 23 tissues/organs spanning the entire life cycle of rice. The representative tissues include callus, radicle, plumule, leaf, leaf sheath, root, apical meristem (AM1/AM2), dormant buds (DBuds), shoot apical meristem (SAM1/SAM2/SAM3), panicle neck node (PNN), stem, young panicle (Panicle1/Panicle2/Panicle3/Panicle4), lemma, palea, pistil, stamen and seed coat (Seed1/Seed2/Seed3). The experiments were conducted in three representative rice varieties, namely Nipponbare (NIP; japonica subspecies), Minghui 63 (MH63; indica subspecies type II), and Zhenshan 97 (ZS97; indica subspecies type I), with each experiment consisting of at least two biological replicates (Fig. 1a and Supplementary Data 1). In total, 145 genome-wide chromatin accessibility datasets with high sequencing depth (~30.7 M reads on average) were generated. We applied the ENCODE standards13,14 to establish the analysis pipeline (see Methods). Compared to published ATAC-seq datasets in the plants as deposited in the ChIP-Hub database14, our data exhibited a significantly higher signal-to-noise ratio (Supplementary Fig. 1). Through data analysis using the corresponding reference genomes of the three cultivars15,16, we identified on average of 40,676 (ranging from 28,991 to 49,737) reproducible OCRs (with an Irreproducible Discovery Rate [IDR]17 < 0.05) per experiment (Fig. 1b). As expected, the identified OCRs from all experiments predominantly located either in the proximal upstream regions of the transcription start site (TSS) or the distal intergenic regions (Fig. 1b, f, Supplementary Fig. 2 and Supplementary Data 2), resembling promoters or enhancers, respectively18. Of note, OCRs from intragenic regions accounted for a relatively small proportion (about 15.7%), while most of these OCRs originated from intronic regions (Supplementary Fig. 2b). These observations indicate that the vast majority of OCRs originate from noncoding regions of the rice genome.Fig. 1: Characterization of an open chromatin landscape in rice.a ATAC-seq and RNA-seq experiments were conducted in three varieties (Nipponbare, Minghui 63, and Zhenshan 97) of rice in various tissues across the entire life. See Supplementary Data 1 for detailed descriptions of sample collection. Consistent tissue color code is used throughout the figure. b Bar plot showing the number of reproducible OCRs identified from each tissue in the three rice varieties. The OCRs are classified into three categories based on the distance of the OCR summit to its closest transcription start site (TSS): distal (>1 kb), proximal (<=1 kb), and intragenic. No data from the tissues of SAM3 (NIP), Seed1 (ZS97), and Stem (ZS97). c The proportion of the rice genome annotated as open chromatin regions (OCRs) in our study. d The accumulative number of unique OCRs in each tissue, calculated by excluding OCRs that overlap with the OCR superset. e Density plot showing the enrichment of TF binding sites (TFBSs) around the OCRs in Nipponbare (NIP). TFBSs were predicted either by ChIP-seq datasets for 56 distinct TFs (left) or DNA motifs for 458 TFs (right), which were obtained from the ChIP-Hub database14. The flanking area on both sides is 1 kb. f The distribution of the distance of OCR summit to its closest TSS in the three rice varieties. Published open chromatin data14 in rice (NIP) were included for comparison. Based on the distribution, a cutoff of 1 kb (dashed line) was used to distinguish the proximal and distal regulatory OCRs. g The distribution of the conservation PhastCons score19 around the NIP OCRs. h The t-SNE plot showing an unsupervised clustering analysis of chromatin accessibility across different samples. Each dot represents one replicate. Color code as in a. i Boxplot showing the distribution of tissue specificity score of intragenic (n = 14239), proximal (n = 29524) and distal (n = 57153) OCRs (left) or the median score in each tissue. P1 = 4.01e-39, p2 = 2.13e-96, p3 = 1.23e-95. All p-values were calculated by two-sided Mann–Whitney U test between proximal and distal OCRs in terms of specificity. Tissue color code as in a. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). Source data are provided as a Source Data file.We estimated that approximate 15% of the rice genome could be annotated as OCRs, with a consistent pattern observed across each variety (Fig. 1c), and the estimation appeared to have reached saturation in rice (Fig. 1d). OCRs contain multiple TF binding sites and are responsible for regulating the expression of target genes6,18. We collected publicly available ChIP-seq data for 56 distinct TFs (Supplementary Data 3) and predicted DNA motifs for 458 TFs in rice from the ChIP-Hub database14, and showed that OCRs were significantly enriched for TF binding sites (Fig. 1e). Furthermore, we found that OCRs are highly evolutionarily constrained compared to flanking genomic regions (Fig. 1g), supporting previous findings that conserved noncoding sequences (CNSs) are predictive of OCRs in plants19,20.We next assessed the overall similarities and differences of chromatin accessibility across varieties and tissues. We quantified all datasets based on the merged OCRs (n = 117,176) called from the same reference genome (i.e., Nipponbare) and visualized their global patterns using t-distributed stochastic neighbor embedding (t-SNE)21. While dimension 1 and 2 of t-SNE results generally reflected differences between the indica (MH63 and ZS97) and japonica (NIP) subspecies, dimensions 2 and 3 primarily delineated distinct clusters among tissue types (Fig. 1h). For instance, the chromatin accessibility patterns of vegetative and productive tissues of NIP were separated into distinct clusters, whereas young panicles and callus tissues exhibited similar patterns regardless of their variety origin. We further calculated the tissue specificity of each OCR based on the Jensen-Shannon divergence (JSD) index. Obviously, distal OCRs showed significantly higher specificity scores than proximal OCRs (Fig. 1i and Supplementary Fig. 3a, b), consistent with previous findings14,18,22.In short, the comprehensive accessible chromatin landscape in rice represents a value resource for crop functional genomic studies.Linking open chromatin regions to target genesTo decipher which genes these OCRs may regulate, we generated matched RNA-seq datasets for the investigated tissues in each rice variety (Supplementary Fig. 3c and Supplementary Data 4). We adopted a strategy23 to predict OCR-to-gene links based on correlation analysis between the OCR accessibility and gene expression across all samples (Fig. 2a; see Methods). Genes can be regulated by multiple OCRs (including promoters and enhancers) through chromatin interactions, which are supposed to occur within topologically associated domains (TADs). Since the size of TADs in the rice genome was estimated to be ranging from 35 kilobase pair (kb) to 45 kb based on Hi-C data24,25, we restricted our analysis to 40 kb (i.e., from 20 kb upstream to 20 kb downstream of the TSS) to predict target genes of OCRs. Using a cutoff of absolute Pearson correlation coefficient |R|>= 0.4 and P < 0.05, we obtained a total of 59,075 unique links between OCRs (n = 38,437, 32.8% of all OCRs) and genes (n = 18,781, 48.1% of annotated genes; Supplementary Fig. 4a, b and Supplementary Data 5). As expected, the OCR-to-gene links tended to occur more frequently in the proximal OCRs, and consequently the correlation between the gene expression and chromatin accessibility is higher for proximal links (Supplementary Fig. 4c–f).Fig. 2: Tissue-specific OCRs.a Schematic diagram illustrating the correlation-based approach to link ATAC-seq OCRs to target genes based on correlation analysis between chromatin accessibility and gene expression. b Heatmap showing the tissue-specific OCR-to-gene links (R > = 0.4, P < 0.05, two-tailed Z-test). Each row in the left panel is a unique OCR. Each row in the middle panel is a gene, corresponding to target genes for OCRs in the left panel. Representative genes are shown on the right. c Examples of tissue-specific OCRs (in the dashed box) regulating dynamic expression of the corresponding target genes. The orange lines indicate the OCR-to-gene links, and the deeper the line the higher the correlation between the chromatin accessibility and gene expression. d Enrichment of biological processes gene ontology (GO) terms for target genes in each OCR cluster in b. The asterisk (*) denotes P < 0.05 (P-values were calculated by Hypergeometric test after Bonferroni correction). e Bar plot showing the percentage of OCRs from different categories based on the genomic location. Source data are provided as a Source Data file.Genetic variants within OCRs can contribute to changes in gene expression levels through expression quantitative trait loci (eQTL). We colocalized the identified OCR-to-gene links from our study with published eQTL data in rice26, and we found a significant overlap (Chi-squared test, P < 1.55e − 06) between OCR-to-gene links and eQTL-gene pairs (Supplementary Fig. 4g). As expected, the correlation coefficients of colocalized OCR-to-gene links with eQTLs are significantly higher than those without colocalization (Wilcoxon test, P = 4.11e−38; Supplementary Fig. 4h). We identified numerous known regulatory variants that influence the expression of genes associated with agronomic traits. To name a few, a variant within a distal regulatory region ( ~ 12 kb upstream) of qSH1 modulates its expression dynamics, leading to change the seed shattering in rice9. Accordingly, there is a positive correlation (R = 0.47, P < 0.013) between the accessibility of this enhancer and the expression of qSH1 in various tissues, particularly in SAM where gene expression increases (Supplementary Fig. 4i, l). Similarly, OsLG1 is tightly linked to upstream regulatory regions that colocalize with a strong QTL associated with the panicle shape trait27 (Supplementary Fig. 4j, l). IPA1 showed significantly positive correlation (R = 0.84, P < 2.95e−8) between its enhancer activity and gene expression, with increased expression in yield-related tissues (Supplementary Fig. 4k, l), confirming an important role of IPA1 to shape rice ideal plant architecture (IPA) and thus to enhance grain yield28.Taken together, the predicted OCR-to-gene links provide regulatory insights into agronomic trait development in rice and highlight targetable OCRs of important genes for genome editing.Dissecting tissue-specific and stage-specific regulatory grammarThe comprehensive chromatin accessibility landscape of representative tissues gave us an opportunity to uncover tissue-specific regulatory grammar. We quantified the tissue-specificity of OCRs by utilizing the JSD score, which enables the discrimination of target genes from housekeeping (e.g., GAPDH29 and OsGOGAT130) to tissue-specific (e.g., OsYABBY531 and OsWRKY4732) according to the above predicted OCR-to-gene links (Supplementary Fig. 5 and Supplementary Data 6). We have specifically focused on analyzing highly tissue-specific OCRs (n = 6686 with a cutoff of JSD > 0.08, ~ 7% of all OCRs) as they may encode the tissue-specific regulatory grammar. These OCRs were further annotated as promoters (n = 2322) or enhancers (n = 4364) according to the genomic distance to the TSS. By performing joint clustering analysis of chromatin accessibility and target gene expression using OCR-to-gene links, we identified 20 distinct clusters of OCRs (Fig. 2b and Supplementary Data 7). Each cluster had 200~500 OCR-to-gene links that were highly activated in specific tissues, and showed a high degree of consistency with the known biological characteristics of the corresponding tissues (Fig. 2b–d). For instance, the palea- and lemma-specific links in cluster 5 (C5) contained promoter-enhancer interactions at the locus of GW8, which is a known gene controlling grain weight in rice33 (Fig. 2c). Accordingly, GW8 was highly expressed in pistil, lemma, and palea. Gene ontology (GO) enrichment analysis using genes from C5 revealed that biological processes such as ‘pollen−pistil interaction’ and ‘pollination’ were overrepresented (Fig. 2d). Similarly, we identified a number of OCRs in C19 that were highly and specifically accessible in meristem-like tissues (including young panicle and shoot apical meristem), and the associated target genes showed significant enrichment for functions related to ‘reproductive system development’, ‘flower development’, and ‘shoot system development’ (Fig. 2b, d). Notably, RFL, a crucial regulator for plant architecture and flowering time34,35, was among these target genes (Fig. 2c). Interestingly, we observed that a higher proportion (28.9%) of tissue-specific OCRs originated from distal intergenic regions compared to constitutive OCRs (12.3%). In contrast, approximately 85% of constitutive OCRs were derived from the proximal-promoter regions. (Fig. 2e).To delineate the TFs that may bind to these tissue-specific OCRs, we used GimmeMotifs36, a versatile tool can detect tissue-specific TF binding motifs by comparing TF binding activity across multiple experiments. We restricted our analysis to the top 2,500 OCRs in each tissue, as determined by their specificity measurement (SPM) score37. The predicted regulatory motifs showed significant enrichments in a tissue-specific manner in matching tissue types (Supplementary Fig 6 and Supplementary Data 8). We narrowed our focus to the top enriched regulators in each tissue type, and found many of the inferred links correspond to known regulatory relationships (Fig. 3a). For example, OsIDS1, a gene that plays a vital role in shaping inflorescence structure and establishing floral meristems38,39, exhibited relatively high activity in the panicle. OsbZIP72, enriched in plumule tissue, has been found to regulate plumule length by modulating abscisic acid (ABA) signaling and promote seed germination40,41. Notably, the tissues of seed and pistil demonstrated a co-enrichment pattern of crucial regulators involved in flower and seed development, including MFO1 and MADS6342,43,44 from the MADS gene family (Fig. 3a). For each tissue type, we performed a systematic analysis to calculate the relative preference of regulators within TF families. Our analysis revealed distinct tissue-specific TF binding patterns, indicating clear preferences for specific regulators in different tissues (Fig. 3b). For instance, the TCP TF family showed a preference for enrichment in stem, stamen, and panicle neck node (PNN) tissues. This observation aligns with the known biological function of TCP genes, specifically their role in regulating cell proliferation in developing tissues45.Fig. 3: Tissues-specific and stage-specific regulatory elements.a Enrichment of TF motifs in tissue-specific OCRs. Only top 5 enriched TFs in each tissue are shown. See Supplementary Data 8 for the full list. The thickness of edges is proportion to the corresponding enrichment score. b The relative preference of regulators within TF families in each tissue type. Only the top 100 TF motifs in each tissue were used for analysis. c The scatter plot showing the distribution of the Pearson correlation coefficient between TF footprint score and its expression. Only absolute values of correlation coefficients greater than 0.5 are marked. d The scatter plot showing the distribution of TF footprint score and its gene expression in NIP, MH63, and ZS97(left). The error bands indicate 95% confidence intervals. Distribution of Tn5 cuts around the footprint of DL and OsSPL9 at different stages of young panicle (right). Source data are provided as a Source Data file.Analyzing temporal ATAC-seq data through footprinting could assist in identifying key regulators, such as pioneer factors, that control developmental progression and transition46. We generated temporal open chromatin data from the young panicle, which is a crucial organ determining the yield of rice47,48, across four successive developmental stages (<1 mm, 1–2 mm, 3–5 mm, and 5–10 mm; Fig. 1a). We endeavored to identify regulatory motifs that exhibited either positive or negative correlation with the young panicle developmental stage in terms of enrichment, using dynamically changing OCRs (n = 9244; Fig. 3c, Supplementary Fig. 7a and Supplementary Data 9). The regulators that were most enriched displayed predominantly positive correlations, indicating their function as transcriptional activators. Conversely, a subset of factors exhibited negative correlations, suggesting a repressive role. In this regard, DL (encoding OsYABBY49), OsSPL950, and OsSPL1451 were identified as representative positive regulators, during the development of young panicles in rice (Fig. 3d and Supplementary Fig. 7b). However, further experimental data is necessary to validate the potential involvement of these TFs in young panicle development.Overall, the above results provide a valuable resource that can help guide studies of candidate key regulators for tissue-specific gene regulation.Systemic localization of GWAS variants in tissue-specific regulatory DNAGenome-wide association studies (GWAS) have identified numerous natural variations linked to various agronomic traits in rice3. To systematically colocalize GWAS-associated variants with the above annotated regulatory elements, especially those from noncoding regulatory regions, we compiled a comprehensive rice GWAS catalog from recent genome-wide association meta-analysis studies2,52,53,54 as well as the NGDC GWAS Atlas database55. In total, we collected 4831 significant (P < 1e−5) and representative (only considering lead SNP) associations for 209 distinct quantitative traits which can be classified into seven major categories56: morphological characteristics, physiological features, yield components, grain quality, resistance, coloration, and others (Fig. 4a and Supplementary Data 10). In a nutshell, these GWAS SNPs dominantly located in intergenic noncoding regions (Fig. 4b and Supplementary Fig. 8a) and 24.5% of them were either situated within a noncoding OCR (21.1%) or located in linkage disequilibrium (LD) with SNPs in a neighboring OCR (3.4%) (Fig. 4c). Moreover, OCRs revealed significantly higher enrichment of GWAS SNPs than protein-coding sequences (Fig. 4d), highlighting the crucial function of regulatory variants in determining phenotypic characteristics.Fig. 4: GWAS-associated variants localize in tissue-specific OCRs.a Categorical proportions of lead SNP in each GWAS. The inner circle indicates the proportions of the seven major categories, and the outer circle indicates the subcategories contained in each major category. Only high proportions are marked in the outer circle. b Distribution of curated lead SNPs by genomic context. All lead SNPs are the same as in a. c Overlap proportions of lead SNPs and sets of SNPs with strong linkage disequilibrium (LD > 0.8) with lead SNPs with ATAC-seq OCRs, ChIP-seq peaks and footprints identified by NIP ATAC-seq, respectively. d The barplot showing the SNP density of OCR and CDS regions at different GWAS P-value thresholds. The error bars are the standard deviations of the SNP densities in the six GWAS catalogs from a. Data represents the mean ± SD of 6 independent GWAS catalogs. The P values were calculated by two-tailed Student’s t-test. e Boxplots showing the tissue-specificity score distribution of OCRs that overlap with grain width53 and leaf blade width52 GWAS SNPs. For grain width, the sample sizes for the “with” and “without” groups are 896 and 4480, respectively. For leaf blade width, the sample sizes for the “with” and “without” groups are 2864 and 5728, respectively. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). The P-values were calculated by two-tailed Student’s t-test. f The enrichment of GWAS SNPs2 in OCRs with different GWAS P-value threshold. g Manhattan plot showing the GWAS signal distribution of vg0724670482 and the LD distribution of its surrounding SNPs. The track plot demonstrates that the OCR where this SNP is located has a higher accessibility in palea tissue. “O2G” represents OCR-to-gene links. h Same meaning as g, except that vg0431203743 has a higher accessibility in SAM and young panicle. Source data are provided as a Source Data file.Furthermore, our findings demonstrated that OCRs containing GWAS SNPs exhibited greater tissue specificity (Fig. 4e, f and Supplementary Fig. 8b-d). For instance, one of the OCRs containing a GWAS lead variant vg072467105553 (C/T, GWAS P < 9.27e−8) significantly associated to panicle number. This OCR was found to be highly accessible specifically to young panicle tissues and its accessibility showed a positive OCR-to-gene link with the expression of GW7 (R = 0.59, P < 9.14e−5; Fig. 4g). In another example, the GWAS lead variant vg0431427332 is significantly associated to leaf blade width52 (P < 1.58e−8), which was located in a SAM/Panicle-specific OCR to positively regulate the expression of NAL1 (R = 0.72, P < 1.16e−6) (Fig. 4h). The previous studies have shown that NAL1 is not only associated with leaf width but also with yield52 and has natural variations in expression levels26. More examples of validated OCR-related associations are presented in Supplementary Fig. 8e.Tissue-specific regulatory variants explain agronomic trait associationsThe variation in DNA sequences within OCRs plays a significant role in driving phenotypic innovation through altering chromatin state and gene expression patterns, which usually occurs in a tissue-specific manner. To investigate the relationship between genetic variations associated with agronomic traits and tissue-specific OCRs, we calculated the enrichment of genetic variations within OCRs in a tissue-specific manner. It turned out that significant GWAS SNPs were frequently enriched in OCRs of trait-relevant tissues (Fig. 4f and Supplementary Fig. 8d). For example, GWAS variants associated with spikelet traits were highly enriched in OCRs specific to the tissues of SAM1, pistil and panicle. Motivated by this observation, we performed an enrichment analysis of GWAS-identified SNPs in OCRs from various tissues, using a SNP enrichment method termed CHEERS57 (Supplementary Fig. 9). Of the 209 curated GWAS-related traits, ~78% (163 of 209) phenotypic traits showed GWAS SNP enrichment in at least one tissue (Supplementary Fig. 10 and Supplementary Data 11). The observed enrichment of agronomic trait-associated variants in regulatory elements was highly specific to tissue types, and the association is largely compatible with our current understanding of the tissue function (Fig. 5a). For example, in various GWAS studies, regulatory variants associated with plant height was enriched in stem-related tissues; while genetic associations for grain-related traits (such as grain thickness, grain width, grain length, blighted grains per plant, and filled grains per plant) were highly enriched in OCRs specific to the tissues of seed, lemma, pistil, and stamen (Fig. 5a). Meanwhile, we found that variants associated with root length were predominantly enriched in the root tissue. Specifically, a significant SNP (vg080620195758, P < 3.98e-8) located in a root-specific enhancer of OsHAK12, which has been shown to be involved in K+ uptake in roots59 (Supplementary Fig. 11a).Fig. 5: Association of tissue type with complex traits.a GWAS SNPs enrichments for ATAC-seq OCRs of different tissues. The heatmap showing the significant tissue-specific enrichment results. The values are transformed by -log10(P) and then normalized by row. Those marked with an asterisk represent P < 0.05 for this result. The P values were calculated by Kolmogorov-Smirnov test. Only tissue data for the NIP variety were used for this analysis. The full list for GWAS enrichment result could be accessed by Supplementary Data 11. b One representative examples of genomic tracks at loci OsbZIP06 showing that GWAS lead SNP is located in tissue-specific OCRs. The GWAS study name and SNP location (denoted by red dashed line) are shown at the top of panel c. Haplotype distribution of vg0131729028 in the population. This result was obtained from the RiceVarMap 2.0 database7. d Identification of mutation information of two OsbZIP06 mutants based on Sanger sequencing. e The images show seed germination rates of wild type and mutants of OsbZIP06. f The line graph showing the germination rates of different mutants osbzip06 at different days of imbibition. “OE” represents overexpression. g Boxplot showing the enrichment results of proximal and distal OCRs with 209 GWAS results respectively. Only results where GWAS was significantly enriched with at least one of proximal and distal OCRs are shown. The sample size of each group is 764. The P value was calculated by Student’s t-test. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). h Venn plot showing the number of results significantly enriched (P < 0.05, Kolmogorov-Smirnov test) by proximal and distal OCRs. i Enrichment of GWAS SNPs in TSS proximal and distal OCRs. The names of the GWAS are marked at the top of the panel. The grey dashed line indicates the P-value threshold of 0.05. The P values were calculated by Kolmogorov-Smirnov test. Source data are provided as a Source Data file.In the case of seed germination percentage, GWAS SNPs were most significantly enriched in plumule-specific OCRs (Fig. 5a). We noted a lead SNP (vg013172902860, A/G, P < 8.4e−8) localized within an intronic OCR of OsbZIP06, where the intronic OCR and OsbZIP06 formed a positive OCR-to-gene link (R = 0.82, P < 2.55e-7) with high tissue specificity in plumule and radicle (Fig. 5b). The minor allele (G) of vg0131729028 was present in a very small proportion (0.3%) in the XI population, but in 65.80% of the Aus population (Fig. 5c). We mutated the coding region (mainly 1st exon) of OsbZIP06 with CRISPR/Cas9 and found that the germination rate was higher in two frameshift mutations (osbzip06-1 and osbzip06-2) than in the wild type (Fig. 5d-f and Supplementary Data 12). In contrast, overexpression of the OsbZIP06 resulted in a lower germination rate (Fig. 5e, f). Therefore, the integration of publish GWAS data and our chromatin landscape can greatly facilitate the identification of candidate genes and the functional annotation of noncoding variants.Furthermore, when we divided the OCRs into proximal (<3 kb from the TSS, 60,006 OCRs) and distal OCRs (>3 kb from TSS, 35,691 OCRs) before using CHEERS to do enrichment analysis. We observed that the proximal OCRs are more enriched in GWAS SNPs (Fig. 5g–i and Supplementary Fig. 11b). This implies that the enrichment above is mainly driven by the OCR close to the TSS and this result is consistent with previous studies57,61.Deep learning models accurately predict differences in chromatin accessibility between tissues and unveil common regulatory grammar among varietiesWe further investigated whether the tissue- and stage-specific regulatory grammar can be modelled. Deep learning has been successfully utilized to learn and identify essential features in genomic sequences, such as the identification of cis-elements62,63. Our previous study demonstrated that the Basenji deep learning framework64 is powerful for modelling epigenomic data in rice, such as the ability to accurately predict chromatin accessibility and to assess the impacts of variants7. Therefore, we optimized the Basenji framework to effectively model our ATAC-seq datasets from multiple tissues (Supplementary Fig. 12a,b). Three distinct models were trained for the varieties of NIP, MH63, and ZS97, demonstrating high accuracy with the mean AUROC values of 0.931, 0.921, and 0.928, respectively (Fig. 6a and Supplementary Fig. 12c). We observed that the Pearson’s correlation coefficient between the predicted and observed values of chromatin accessibility at different locations on the genome reached approximately 0.81, with the best prediction at the location of <1 kb upstream regions (Fig. 6b and Supplementary Fig. 12d). This implies that the regulatory syntax patterns within promoter regions could carry more significant information encoded in sequences, which can be effectively captured by deep learning models. Furthermore, the predicted signals from the test sets exhibit the ability to discern between distinct tissues and closely align with the clustering results of the actual values (Fig. 6c). For example, the root-specific expressed gene RCc3, responsible for regulating lateral root growth65, exhibits distinct chromatin accessibility patterns specifically in root (Fig. 6d and Supplementary Fig. 13).Fig. 6: Using deep learning model to predict chromatin accessibility across tissues and varieties.a Receiver operating characteristic curves for different tissues in the NIP cultivar. The average AUORC value was 0.931. b Distribution of Pearson correlation coefficients between predicted and true signal values for different genomic regions using NIP model. Each point represents one tissue (n = 24). Data are displayed as mean ± SD. c Comparison of clustering results based on predicted and true signal values using NIP model. d The genomic tracks show the signal values predicted by NIP model versus the true signal values for Panicle1, PNN and Root, respectively. The shaded area is labelled with the gene region of RCc3. The heatmap below the tracks show the expression of the RCc3 in NIP varieties. e The boxplot showing the distribution of Pearson’s correlation coefficients for the models of NIP, MH63 and ZS97 tested separately using sequences from the other two varieties. The red dashed line represents a correlation coefficient at 0.80. Each sample consists of 24 observations. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). f The genomic tracks showing the signal values predicted with the ZS97 model for NIP, MH63 and ZS97 sequences versus the true signal values in Stamen and Stem tissues, respectively. The shaded area represents the orthologous region of GSE9 in NIP, MH63 and ZS97 varieties. The heatmap below the tracks show the expression of the GSE9 in NIP, MH63, and ZS97 varieties. g Comparison of OCRs in the three rice cultivars (NIP, MH63, and ZS97). For each cultivar, OCRs from all tissues were merged and then compared based on whole genome sequence alignments. h Ternary plot showing the chromatin accessibility of orthologous OCRs among the three rice cultivars with Panicle1 tissue. i Comparison of the SNP density within the balanced (n = 19793) and unbalanced (n = 8385) orthologous OCRs. The P value was calculated by two-tailed Student’s t test. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). j Sankey diagram showing the true chromatin accessibility difference and the chromatin accessibility difference predicted by the deep learning model for orthologous OCRs in NIP, MH63, and ZS97. The color representation is categorized in the same way as in h. Source data are provided as a Source Data file.Subsequently, for each variety-specific model, we used test sets from the remaining two varieties to evaluate the model’s capacity for making predictions across different varieties. Our analysis revealed high Pearson correlation coefficients (about 0.8) between the predicted and observed signals (Fig. 6e). Notably, in the GSE9 promoter region, there is divergence between indica and japonica rice, marked by a 9 bp deletion and several SNPs in MH63 when compared to the sequences of NIP and ZS9766. The ZS97 model predicted the chromatin accessibility of this region in MH63 with weak signals. Contrarily, the ZS97 model accurately predicted the chromatin accessibility in NIP and ZS97, showing strong signals (Fig. 6f and Supplementary Fig. 14). These results suggest that the deep learning model can effectively make accurate predictions across varieties, implying that shared regulatory grammar across rice varieties.We next performed comparative analyses on ATAC-seq data of 22 matched tissues/organs in both japonica rice (NIP) and indica rice (MH63 and ZS97), utilizing their respective reference genomes (Fig. 1a, b). We found that roughly 60% (60,764 out of 95,697) of OCRs were shared across all three cultivars (Fig. 6g and Supplementary Data 13). The indica varieties MH63 and ZS97 exhibited a higher proportion of shared OCRs compared to NIP from different subspecies (Fig. 6g). We next sought to compare chromatin accessibility dynamics of the 1:1:1 orthologous OCRs across the three varieties (referred to as triads; see Methods). To investigate the accessible bias of orthologous OCRs, we compared the chromatin accessibility of orthologous OCRs in each individual tissue (Fig. 6h). Orthologous OCRs were assigned into seven categories on the ternary plot based on their relative accessibility, including a balanced category and six dominated or suppressed categories in specific cultivars (Supplementary Fig. 15). The proportion of OCR triads assigned to unbalanced categories varied among different tissues, ranging from 3.2% in plumule to 24.8% in AM1 (Fig. 6h and Supplementary Fig. 16a). While promoters generally display balanced OCRs, indicating consistent accessibility across different cultivars, enhancers frequently exhibit unbalanced OCRs, reflecting cultivar-specific regulation (Supplementary Fig. 16b). Interestingly, unbalanced OCRs harbored more genotypic variations in terms of SNPs (Fig. 6i). This observation led us to suppose whether sequence variation among different varieties caused the differences in chromatin accessibility of these OCR orthologs. Therefore, we used NIP-based deep learning model to predict the chromatin accessibility signals of sequences from orthologous OCRs in NIP, MH63 and ZS97, respectively, and then compare these predictions. The results showed that about 50% of the differences in orthologous OCRs could be successfully resolved in terms of sequence variation (Fig. 6j and Supplementary Fig. 17).In summary, the above results illustrate that deep learning models could accurately predict chromatin accessibility across tissues and varieties. The high accuracy of the models also indicates the high quality of our data.Elucidate key genetic changes underlying cis-regulatory divergence by deep learning modelsGenetic variants and de novo mutations in regulatory regions may lead to cis-regulatory divergence and thus changes in gene expression and organismal phenotypes67. We systematically dissected the cis-regulatory divergence due to genomic sequence changes (e.g., SNPs) in regulatory regions, which could be inferred from ATAC-seq data. To measure the effect of the variant on chromatin accessibility, we extracted variants that differed in the three varieties. The effect of different alleles of each variant on chromatin accessibility was evaluated using the deep learning models. We found that unbalanced OCRs had a higher absolute effect score than the balanced OCRs (Supplementary Fig. 18a) and these large-effect loci were significantly enriched for eQTLs26,68 (Supplementary Fig. 18b). This observation suggests that these putative large-effect variants are associated with changes in chromatin accessibility and gene expression. Meanwhile, we performed separate OCR-to-gene correlation analysis for each of the three varieties. We then identified conserved OCR-to-gene links and compared the correlation coefficients between them (Fig. 7a). Notably, OCRs with significant differences in correlation coefficients exhibited higher SNP density (Fig. 7b), and the OCR-to-gene links with large differences in correlation coefficients between MH63 and ZS97 were significantly enriched for differential cis-eQTL between MH63 and ZS97 (Fisher’s exact test, odds ratio = 1.81 and P < 1.83e−28)69. These suggesting that regulatory sequence variations among different varieties could influence gene expression. For instance, we observed that a SNP (vg0336150781, G/A) located in the GNP1 promoter region control grain number and plant height70. Among the OCR-to-gene links we inferred, the allele in NIP (‘G’ at this SNP) correlated with GNP1 (R = 0.59, P < 6.48e−04), whereas the allele (‘A’ at this SNP) did not show OCR-to-gene correlation in MH63 (R = 0.01, P = 0.99) and ZS97 (R = 0.17, P = 0.34) (Fig. 7c). In addition, eGWAS also demonstrated that this SNP affects GNP1 expression (Fig. 7d). When we evaluated the effects of this SNP with the deep learning model, we found that mutation of this SNP from “G” to “A” in Panicle2 significantly reduced chromatin accessibility (Fig. 7e). We also found that this variant overlaps with the footprint of OsSPL10 identified in Panicle2, and its binding site shows the typical “GTAC” motif of the SBP TF family. These results suggest that mutations control gene expression by affecting TF binding to alter chromatin accessibility.Fig. 7: Genomic mutations contribute to cis-regulatory divergence.a Density plot showing the difference in Pearson correlation coefficients (R) between the OCR-to-gene of NIP, MH63, and ZS97, respectively. The R of OCR-to-gene is not less than 0.4 we consider large differences while R located between −0.05 and 0.05 we consider no difference. b Boxplots showing the density of SNP differences between big and small difference groups. Comparisons are made by two-tailed Student’s t-test. Sample sizes for each group are labeled above their respective boxes. Boxplot shows the median (horizontal line), second to third quartiles (box), and Tukey-style whiskers (beyond the box). c The dot plot demonstrates that the GNP1 gene associates to an OCR (chr3:36150374-36152039) in NIP, but not in MH63 and ZS97 due to the presence of a variant (vg0336150781, G/A). Pearson’s correlation coefficient is used for the test. The error bands indicate 95% confidence intervals. The P-values were calculated by a two-tailed Z-test. d Manhattan plot showing local eGWAS results for GNP1. The eGWAS results were obtained from Ming et al. 68. e Changes in chromatin accessibility using deep learning models for mutations of 100 bp each on the left and right of vg0336150781. “Loss” represents reduced chromatin accessibility after the mutation compared to before the mutation, and “gain” represents increased. The figure shows the change in chromatin accessibility before and after the mutation in Panicle2. f The treemap showing the proportion and composition of OCRs without structural variants (SV) and OCRs with SV. Here we only consider deletions (DEL), inversions (INV), and duplications (DUP) for SV. OCRs were considered SV-related when it overlaps with DEL, DUP, and INV by at least 1 bp. g The heatmap showing the 12,313 OCR-to-gene links (R > = 0.4, P < 0.05, two-tailed Z-test) associated with SV. They were grouped into 6 clusters based on their chromatin accessibility. The number of OCRs in each cluster and the number of target genes are labeled on the right side of the heatmap. h The doughnut showing the proportion of DEL, DUP, and INV in each cluster. i Scatter plot demonstrates Pearson correlation coefficients (R = 0.83, P < 6.94e−09) between tissues for the accessibility of OCR associated with deletion and the expression of target genes (Oshsp18.0-CII). The error bands indicate 95% confidence intervals. The P-values were calculated by two-tailed Z-test. j Genome Browser showing ATAC-seq signal distribution in the vicinity of gene Oshsp18.0-CII. The gray dashed bracket represents the absence of this OCR in MH63 and ZS97 due to the deletion of this sequence. The barplot on the right shows the expression of the gene in each tissue. Source data are provided as a Source Data file.Besides point mutations, small genomic alterations (including short insertions/deletions, inversions, and duplications) may abolish OCRs and thus confer an important avenue of regulatory divergence. We quantified all OCRs based on the NIP reference genome and investigated whether their regulatory activity dynamics were associated with short genomic alterations, which were determined by whole genome comparison across different cultivars (see Methods). In total, we found that nearly one third (26.6%) of the OCRs harbored small alterations (Fig. 7f). The regulatory activity of these mutation-associated OCRs is positively correlated with their surrounding gene expression patterns in a cultivar-specific manner (Fig. 7g, h), as exemplified at the loci of Oshsp18.0-CII and MAG2 (Fig. 7i, j and Supplementary Fig. 19a). Notably, GO analysis showed that these genes were highly enriched for various ‘response’ related functions (Supplementary Fig. 19b and Supplementary Data 14). Further investigation revealed that the identified mutation-embedded OCRs were significantly overlapped with transposable elements (TEs) (Supplementary Fig. 19c). The above results indicate that TEs may contribute to modification of regulatory sequences, fine-tuning gene expression networks and driving new functions71.

Comprehensive mapping and modelling of the rice regulome landscape unveils the regulatory architecture underlying complex traits

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models