A foundational large language model for edible plant genomes

Model
In our study, we use language models (LMs) as the primary method for learning foundational representations of plant genomes. LMs have demonstrated remarkable performance and generalization in domains such as natural language processing, proteomics, and, more recently, in our work in genomics15,23,81. Our previous work showed that our pre-trained DNA LMs can match or outperform task-specific models on a variety of human genomics tasks. LMs, as statistical systems, estimate the likelihood of a sequence of tokens occurring given some token vocabulary, where tokens can represent words, amino acids, or nucleotide k-mers. There are two main approaches to language modeling: masked language modeling (MLM), in which a portion of the input sequence is masked and/or randomly replaced with other tokens15, and causal language modeling (CLM), in which a sequence of tokens is used to predict the next token82. More precisely, we use a deep learning architecture called the transformer, whose primary component, the self-attention mechanism, is capable of modeling global dependencies within sequences83. Our training technique closely follows BERT, the Bidirectional Encoder Representations from Transformers, which employs MLM and thus benefits from the bidirectional property of self-attention15.

Architecture
Our model is an encoder-only transformer comprising one billion trainable parameters. The tokenizer's vocabulary consists of all possible 6-mers of the four nucleotides A, T, C and G, additional special tokens, and N to represent unknown nucleotides in a DNA sequence. The tokenization algorithm works by first splitting genomic DNA sequences on N's and subsequently identifying 6-mers in the chunked sequences. The tokenization strategy relies exclusively on non-overlapping 6-mers. The model takes as input a sequence of 1025 tokens (including the class token), which is passed into an embedding layer that projects each token into a 1500-dimensional space; a learned positional embedding is then added. This combined embedding is fed through 40 attention blocks, each consisting of a layer normalization layer, a multi-headed attention layer, another layer normalization layer, and a multi-layer perceptron (MLP). The final layer, a RoBERTa LM head84, transforms the output of the final attention block into a distribution over the tokenizer's vocabulary at each position of the input sequence. See Supplementary Table 1 for model architecture details and Supplementary Fig. 1 for a model schematic.

Pre-training dataset
Our pre-training dataset was built from (mostly) edible plant reference genomes contained in the Ensembl Plants database85. The dataset consists of approximately 10.5 million genomic sequences across 48 different species. See Supplementary Data 1 for dataset details, including genome annotation versions and sources. For data preparation, we followed the strategy employed in our previous work23. Once we obtained the reference genome FASTA file for each species, we assembled them into a single FASTA file in which all nucleotides other than A, T, C, G were replaced by N. We used a tokenizer to convert strings of letters into sequences of tokens. The tokenizer's alphabet consisted of the 4^6 = 4096 possible 6-mer combinations of A, T, C, G, as well as five additional tokens representing standalone A, T, C, G, and N. It also included four special tokens: the padding [PAD], masking [MASK], beginning-of-sequence (also called class; [CLS]), and unknown [UNK] tokens. This resulted in a vocabulary of 4105 tokens. To tokenize an input sequence, the tokenizer started with a class token and then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter N was present or when the sequence length was not a multiple of 6). Similar to our previous work, the first step of pre-training involved splitting each reference genome into chunks of 6100 nucleotides, where each chunk shared its first and last 50 nucleotides with the preceding and following chunks, respectively. As a data augmentation step, for each epoch and chunk, a starting nucleotide index was randomly sampled between 0 and 100, and the sequence was then tokenized starting from that nucleotide. At each step, a batch of sequences randomly sampled from the epoch set was fed to the model.
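To make the tokenization scheme concrete, the following is a minimal Python sketch of the non-overlapping 6-mer tokenization described above. The fallback behavior for N-containing or partial k-mers, the function name, and the special-token handling are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the non-overlapping 6-mer tokenization described above.
# The exact fallback rules of the real tokenizer may differ.
def tokenize(sequence: str, k: int = 6) -> list[str]:
    tokens = ["[CLS]"]
    i = 0
    while i < len(sequence):
        kmer = sequence[i:i + k]
        # Emit a full k-mer token only if it is exactly k bases and contains no N;
        # otherwise fall back to a standalone single-nucleotide token.
        if len(kmer) == k and "N" not in kmer:
            tokens.append(kmer)
            i += k
        else:
            tokens.append(sequence[i])
            i += 1
    return tokens

print(tokenize("ATGCGTNNACGTACG"))
# ['[CLS]', 'ATGCGT', 'N', 'N', 'ACGTAC', 'G']
```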
Pre-training strategy
The MLM objective was used to pre-train our model in a self-supervised manner. In a self-supervised learning setting, annotations (supervision) for each sequence are not needed, as we can mask some proportion of the sequence and use the information contained in the unmasked portion to predict the masked positions. This allows us to leverage the vast amount of unlabeled genomic sequencing data available. Specifically, 15% of the tokens in the input sequence are selected for corruption, with 80% of these replaced by a mask token, 10% randomly replaced by another token from the vocabulary, and the final 10% left unchanged. The tokenized sequence is passed through the model and a cross-entropy loss is computed over the masked tokens. Updates to the model parameters are computed through gradient descent with Adam, an adaptive learning rate optimizer. Pre-training was carried out with a sequence length of 1025 tokens and an effective batch size of 1.5M tokens for 315k update steps, resulting in the model training on a total of 472.5B tokens. The Adam optimizer was used with a modified square decay learning rate (lr) schedule, in which a warm-up period with a linear increase over 64K steps took the lr from an initial value of 0.00005 to an end value of 0.0001; a square decay function was then applied to the lr for all subsequent steps. The exponential decay rates β1 = 0.9 and β2 = 0.999 were used in the optimizer. See Supplementary Table 1 for training hyperparameter details.
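As a concrete illustration of the token corruption scheme used for the MLM objective (15% of tokens selected; of those, 80% masked, 10% randomly replaced, 10% kept), here is a minimal NumPy sketch. The token ids, the label convention, and the function name are illustrative assumptions, not the exact implementation.

```python
import numpy as np

# Illustrative sketch of the 80/10/10 MLM corruption scheme described above.
def mask_tokens(token_ids: np.ndarray, mask_id: int, vocab_size: int,
                rng: np.random.Generator, select_prob: float = 0.15):
    token_ids = token_ids.copy()
    labels = np.full_like(token_ids, fill_value=-100)   # -100: position ignored by the loss

    selected = rng.random(token_ids.shape) < select_prob
    labels[selected] = token_ids[selected]               # loss is computed only at selected positions

    action = rng.random(token_ids.shape)
    replace_mask = selected & (action < 0.8)                      # 80% -> [MASK]
    replace_rand = selected & (action >= 0.8) & (action < 0.9)    # 10% -> random token
    # the remaining 10% of selected tokens are left unchanged

    token_ids[replace_mask] = mask_id
    token_ids[replace_rand] = rng.integers(0, vocab_size, size=replace_rand.sum())
    return token_ids, labels

# Example usage (mask_id and vocabulary size are placeholder values):
rng = np.random.default_rng(0)
corrupted, labels = mask_tokens(np.array([17, 423, 9, 1031]), mask_id=4104,
                                vocab_size=4105, rng=rng)
```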
Fine-tuning strategy
After pre-training our model with the MLM objective, we further fine-tune it in a supervised learning setting using several annotated plant genomics datasets. We fine-tune our pre-trained model with IA3 (ref. 30), a method that injects re-scaling weight vectors at the model's activations, allowing us to achieve high levels of performance while computing gradient updates for only approximately 1% of the model parameters. The tasks on which we fine-tune include single-output regression, multi-output regression, binary classification, and multi-label classification.

Hardware
Model pre-training was carried out using Google TPU v4 accelerators, specifically a TPU v4-1024 machine containing 512 devices. Since the model was able to fit on a single TPU device, pre-training was carried out in a data-parallel fashion in which model parameters were replicated and the effective batch was sharded across the 512 devices, with subsequent gradient accumulation. We trained for a total of approximately four days.

Polyadenylation site prediction
We obtained data for alternative polyadenylation (APA) sites from PlantAPAdb, which currently represents the most comprehensive manually curated catalog of APA sites in several plants. The dataset is available at http://www.bmibig.cn/plantAPAdb. For our APA site prediction, we focused on five specific species: Oryza sativa L. (japonica and indica), Arabidopsis thaliana, Chlamydomonas reinhardtii, Medicago truncatula, and Trifolium pratense. We downloaded all available sequences around polyadenylation (poly[A]) sites for each of these species. These sequences encompassed different genomic regions such as 3ʹ UTR, 5ʹ UTR, CDS, intron, and exon. Each sequence had a length of 400 bp, comprising the 300 bp upstream and 100 bp downstream of the respective poly(A) site cluster (PAC), with the poly(A) site itself located at the 301st position. To obtain a negative set, we randomly shifted the position of the poly(A) site by negative or positive values between 1 and 50 and then extracted a 400 bp sequence based on these shifted positions. The number of negative sequences was twice that of the positive sequences. It is important to note that this strategy of generating negative sequences is more challenging than the approaches used in other studies, because the negative sequences still contain similar gene-related sequences, including those from 3ʹ UTR, 5ʹ UTR, CDS, intron, and exon regions. Furthermore, we emphasize that the model was trained to predict whether or not a given sequence had a poly(A) site at the 301st position. To prevent data leakage, we limited the test set, for each plant species, to sequences exclusively derived from a single sampled chromosome. Due to the imbalance in our dataset resulting from the negative-sequence generation strategy, we evaluated the performance of the models using both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Additionally, we recreated the dataset utilized in Gao et al.33 for APA site prediction. The positive sequences were sampled from the 16K dataset originally introduced in Loke et al.86, which consists of over 16,000 downstream sequences from the 3ʹ UTR of Arabidopsis thaliana. Following the approach in Gao et al., we initially selected sequences longer than 162 base pairs and then trimmed them to a length of 162 base pairs. These trimmed sequences included 131 base pairs upstream and 31 base pairs downstream of the annotated poly(A) site. To create negative samples, we randomly sampled 9704 coding sequences, 3222 intronic sequences, and 501 5ʹ-UTR sequences from The Arabidopsis Information Resource (TAIR) database. This sampling ensured a balanced dataset, as described in Gao et al. As with the positive sequences, we trimmed the negative sequences to a length of 162 base pairs. Since the Gao et al. study did not specify the training and test splitting ratio, we used a standard 0.9/0.1 split ratio, maintaining the same proportion of negative coding, intronic, and 5ʹ-UTR sequences in both the training and test datasets.
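The following is a minimal sketch of the negative-set construction for the PlantAPAdb task, in which the poly(A) site position is randomly shifted by 1 to 50 bp in either direction before a 400 bp window is extracted. Coordinate conventions and the function name are illustrative assumptions.

```python
import random

# Illustrative sketch: build a shifted negative 400 bp window for a poly(A) site.
# 0-based chromosome coordinates are assumed; the real pipeline may differ.
def negative_window(chrom_seq: str, pac_pos: int, up: int = 300, down: int = 100) -> str:
    shift = random.choice([-1, 1]) * random.randint(1, 50)  # shift of 1-50 bp, either direction
    fake_site = pac_pos + shift
    start, end = fake_site - up, fake_site + down            # 400 bp window, shifted site at the 301st position
    return chrom_seq[start:end]
```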
Splicing site prediction
We used the dataset compiled in Baten et al.87 for splice site prediction, which is available at https://public.bmi.inf.ethz.ch/user/behr/splicing/. The dataset consists of sequences of 398 bp each, containing both acceptor and donor sites, as well as non-acceptor and non-donor sequences across the Arabidopsis thaliana genome. To prevent data leakage, the test set was limited to chromosome 5. Given the highly imbalanced nature of the dataset, with 76,871 acceptor sequences and 2,157,898 non-acceptor sequences, as well as 76,659 donor sequences and 3,311,934 non-donor sequences, we evaluated the performance of the models using both the AUROC and AUPRC.

Long non-coding RNA prediction
We obtained data for long non-coding RNAs (lncRNAs) from the Green Non-Coding (GreeNC) database version 2.037, which is available at http://greenc.sequentiabiotech.com/. Our lncRNA prediction focused on six specific species: Glycine max, Manihot esculenta, Solanum lycopersicum, Sorghum bicolor, Triticum aestivum, and Zea mays. For each of these species, we selected lncRNA sequences shorter than 6000 bp. Additionally, we only included lncRNA sequences reported on scaffolds present in the version of the associated annotation file for the reference assembly listed in the GreeNC database. To construct the set of negative sequences, we extracted all mRNA sequences from the reference genome, using the assembly listed in the GreeNC database, and removed mRNA sequences longer than 6000 bp to maintain consistency with the lncRNA dataset. To prevent the model from relying solely on sequence length for classification and to encourage the learning of more meaningful features, we performed a matching process between lncRNA and mRNA sequences based on both sequence length and GC content. Specifically, during the matching process, an mRNA sequence was selected if its length was within 100 bp of that of the corresponding lncRNA sequence and its GC content was within 1% (above or below) of the GC content of the corresponding lncRNA sequence. To prevent data leakage, we created a test set for each plant species that exclusively consisted of sequences derived from a single sampled chromosome. In the case of Triticum aestivum, the test set was limited to chromosomes 1A, 1B, and 1D. We evaluated the performance of the models using both the AUROC and AUPRC. To assess the model's robustness, we also independently resampled matched sets of mRNAs used as negative sequences five times, separately for each species. In addition, we replicated the dataset compiled in Meng et al.38 to compare our lncRNA prediction performance with that of PlncRNA-HDeep. This dataset was also based on the GreeNC database, but included sequences exclusively from Zea mays. It consisted of 18,110 lncRNA sequences from the GreeNC database that were downsampled to generate a positive set of 18,000 samples. The negative set was based on 57,776 mRNA sequences obtained from the RefSeq database, also downsampled to create a balanced dataset. To ensure a fair comparison, we used the same training and test datasets available at https://github.com/kangzhai/PlncRNA-HDeep/. The only difference was that we trimmed sequences longer than 6000 bp, since our model does not handle longer sequences. This trimming affected 43 sequences in the training dataset and 9 sequences in the test dataset, and is therefore unlikely to greatly affect the performance comparison between models.
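Below is a minimal sketch of the length- and GC-matching step used to pair candidate negative mRNAs with lncRNAs. The absolute-difference reading of the 1% GC criterion, the data structures, and the function names are illustrative assumptions.

```python
# Illustrative sketch: select a length- and GC-matched mRNA for a given lncRNA.
# Interpreting the 1% GC criterion as an absolute difference is an assumption.
def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

def match_negative(lnc_seq: str, mrna_seqs: list[str]) -> str | None:
    lnc_gc = gc_content(lnc_seq)
    for mrna in mrna_seqs:
        if abs(len(mrna) - len(lnc_seq)) <= 100 and abs(gc_content(mrna) - lnc_gc) <= 0.01:
            return mrna
    return None  # no suitable match found
```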
Promoter and terminator strength prediction
For promoter strength prediction, we used the publicly available self-transcribing active regulatory region sequencing (STARR-seq) assays for Arabidopsis thaliana, Zea mays and Sorghum bicolor40. The STARR-seq assays included all known promoter regions for these species, defined as -165 to +5 bp relative to annotated transcription start sites (TSSs), and their strength was measured in two systems: tobacco leaves and maize protoplasts. To enable direct comparison with the CNN model trained in the original publication, we downloaded their training and test datasets, available at https://github.com/tobjores/Synthetic-Promoter-Designs-Enabled-by-a-Comprehensive-Analysis-of-Plant-Core-Promoters/, which include the 170 bp promoter sequences and their measured STARR-seq strength in the tobacco leaf and maize protoplast systems. For terminator strength prediction, we used data compiled from a second STARR-seq-based study41 that measured the activity of over 50,000 terminators from Arabidopsis thaliana and Zea mays. These terminators were defined as 170-nucleotide sequences spanning positions -150 to +20 relative to a cleavage and polyadenylation site. As for the previous study, and to enable a comparison with the CNN model trained in the original publication, we downloaded the training and test datasets available at https://github.com/lampoona/Terminators-Plant-STARR-seq. As in both of these studies, we used the coefficient of determination (R2) to evaluate the performance of promoter and terminator strength prediction.

Chromatin profiles prediction
We used the dataset compiled in Zhao et al.11 for chromatin profile prediction. The compiled dataset is based on assay for transposase-accessible chromatin with sequencing (ATAC-seq) data from multiple tissues of six plant species: Arabidopsis thaliana, Oryza sativa, Zea mays, Setaria italica, Sorghum bicolor, and Brachypodium distachyon. The NCBI and SRA accession numbers of the resources and publications used to compile the dataset can be found in the original publication. For data generation, we followed the same strategy as outlined in that publication. We generated the same number of training sequences, each 1000 bp in size, retrieved from the reference genome. For each ATAC-seq sample, we labeled a training sequence as 1 (i.e., positive sample) if its middle 200 bp region overlapped an open chromatin region (OCR) by more than 50% of the sequence length, and as 0 (i.e., negative sample) otherwise. To ensure a fair comparison with the Zhao et al. predictions, we selected the same chromosomes used as validation and test sets for each plant species. As in the Zhao et al. study, we evaluated the performance of the models using the AUROC and AUPRC metrics.
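A minimal sketch of the labeling rule for the ATAC-seq sequences is shown below, assuming that the "more than 50% of the sequence length" criterion refers to the 200 bp central region (i.e., more than 100 bp of overlap); interval conventions and the function name are illustrative assumptions.

```python
# Illustrative sketch: label a 1000 bp training sequence by the overlap of its
# central 200 bp region with open chromatin regions (OCRs).
# Half-open, 0-based intervals are assumed.
def label_sequence(seq_start: int, ocrs: list[tuple[int, int]],
                   seq_len: int = 1000, core: int = 200) -> int:
    core_start = seq_start + (seq_len - core) // 2
    core_end = core_start + core
    for ocr_start, ocr_end in ocrs:
        overlap = min(core_end, ocr_end) - max(core_start, ocr_start)
        if overlap > core * 0.5:          # more than 50% of the 200 bp core
            return 1                      # positive sample
    return 0                              # negative sample
```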
Tissue-specific quantitative gene expression level prediction
To test our performance on predicting tissue-specific gene expression levels, we collected data from large-scale expression studies in five plants (Supplementary Data 2). All data were downloaded as raw sequencing files and reprocessed with a uniform pipeline. We downloaded all publicly available RNA-seq experiments from the Arabidopsis thaliana tissue atlas88 stored in ArrayExpress89 (https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7978), which included 56 experiments across 27 tissues. For Zea mays, we downloaded the RNA-seq data published by Walley et al.90 stored in the GEO database91 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50191); this dataset consisted of 68 RNA-seq experiments across 23 tissues. For Oryza sativa, we collected samples from two NCBI BioProjects (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB47919 and https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB32629), totaling 20 experiments across seven tissues. For Solanum lycopersicum, we downloaded the RNA-seq experiments associated with the whole-genome sequencing project92 and stored in the Gene Expression Atlas93 (https://www.ebi.ac.uk/gxa/experiments/E-MTAB-4812), which included ten tissues with one replicate each. Finally, we retrieved the Glycine max tissue atlas94 from the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/view/SRA012188), which included 14 samples across nine tissues. The genome and annotation versions used for the gene expression analyses were downloaded from Ensembl Plants version 5685 and are listed in Supplementary Data 3. RNA-seq read mapping was performed with the STAR aligner95 version 2.7.10b. The genome indices for each species were prepared using STAR's --runMode genomeGenerate mode with the options --sjdbOverhang 99 --genomeSAindexNbases 12, using the genome files and GTF genome annotation files listed in Supplementary Data 3. The raw RNA-seq reads were mapped using STAR's --quantMode GeneCounts mode with the options --outFilterType BySJout --outFilterMultimapNmax 40 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000. Within each species, the read counts across all tissues and replicates were imported into R and normalized with DESeq296 version 1.38.3. Size factors were first estimated using the estimateSizeFactors() method and then used to normalize the reads using the counts() method with the normalized=TRUE option. For species with multiple replicates per tissue, we averaged the resulting counts within each tissue to obtain one value per tissue. Finally, a pseudocount of one was added to each count and the counts were log2-transformed. For training the model, we followed a strategy similar to that outlined in Washburn et al.51. We used promoter-proximal sequences based on the TSS annotations and ensured that the sequence overlapped the coding sequence by extracting 5000 bp upstream and 1000 bp downstream of the TSS, resulting in a sequence of 6000 bp. To prevent potential data leakage, we divided the promoter sequences into gene families using the Washburn et al. pipeline available at https://bitbucket.org/bucklerlab/p_strength_prediction/. Specifically, we conducted an all-by-all BLAST on a given plant's proteome to evaluate pairwise similarity between proteins. Because a given gene can encode multiple protein isoforms, the results of the BLAST search were collapsed to gene-level similarity. A graph was then constructed with nodes representing genes and edges connecting paralogous genes. The graph was further divided into clusters (i.e., gene families) using a clustering algorithm, and genes not assigned to a family were considered a family of their own. Using this set of partly independent gene sequences based on their family assignments, we allocated promoter sequences to training and test datasets, thereby ensuring that similar promoter sequences did not leak between the two sets. Since the number of promoter sequences allocated to families may differ across species, we randomly chose a number of families that would result in an 80%/20% split for each plant species separately.
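For illustration, a minimal sketch of extracting the 6000 bp promoter-proximal window (5000 bp upstream and 1000 bp downstream of the TSS) is given below. The 0-based coordinates and the reverse-complementing of minus-strand genes are assumptions, as these details are not specified above.

```python
# Illustrative sketch: extract a promoter-proximal window around a TSS.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def promoter_proximal(chrom_seq: str, tss: int, strand: str,
                      upstream: int = 5000, downstream: int = 1000) -> str:
    if strand == "+":
        window = chrom_seq[tss - upstream:tss + downstream]
    else:  # minus strand: upstream lies to the right of the TSS (assumption)
        window = chrom_seq[tss - downstream:tss + upstream]
        window = window.translate(COMPLEMENT)[::-1]
    return window  # 5000 bp upstream + 1000 bp downstream = 6000 bp
```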
For Zea mays, we also examined the percentage of genes in our test set that belonged to a PFAM family present in the training set. This analysis revealed that only 430 of the 4483 genes in the test set (i.e., less than 10%) were part of such PFAM families, indicating that the approach employed here generated gene families similar to those found in PFAM. Lastly, for all species, we used the coefficient of determination (R2) to evaluate the performance of gene expression prediction. To further evaluate the performance of AgroNT in predicting gene expression and compare it to a previously developed method, we trained a model specifically for predicting expressed and non-expressed genes. We used the dataset compiled by Washburn et al.51, which is based on 422 individual samples from seven studies and provides a comprehensive collection of Zea mays tissues at various developmental stages. The NCBI and SRA accession numbers can be found in the original publication. The dataset reportedly included 39,470 sequences with gene expression labels, but sequences were available for only 35,402 of them. As we did not retrieve the missing sequences for training, the comparison to the reported performance is somewhat more challenging because our training dataset was reduced. Following the methodology described by Washburn et al., we trained two separate models, one on promoter sequences and one on terminator sequences, each with a fixed size of 1500 bp, and classified each sequence as associated with an expressed or a non-expressed gene. As in the Washburn et al. study, we evaluated the performance of the models using the area under the receiver operating characteristic curve (AUC).

Token importance for gene expression level prediction
We also evaluated the importance of genomic regions within the 6000 bp promoter-proximal sequence by performing an importance analysis similar to the one described in29. To do this, we replaced each of the 1000 possible tokens within the promoter-proximal sequence with a random token, ensuring that the newly generated token differed from the original one. We then compared the predicted expression levels of the sequence carrying the randomly altered token to the predicted expression levels of the original sequence. In other words, for a given sequence, we generated 1000 new sequences, each differing from the original sequence at a single token. We carried out this importance analysis using 100 randomly selected promoter-proximal sequences, drawn only from the test sets, for each of the four plant species used to train the gene expression prediction models. Given that the gene expression models for these four plant species included various tissues, we averaged the differences across tissues and sequences to derive the final importance scores.
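A minimal sketch of this single-token perturbation analysis is shown below. The model interface (a predict function returning per-tissue expression values), the use of the absolute change, and the function name are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: score each token position by the change in predicted
# expression when that token is replaced with a different random token.
# `predict` is an assumed interface returning a vector of per-tissue predictions.
def token_importance(token_ids: np.ndarray, predict, vocab_size: int,
                     rng: np.random.Generator) -> np.ndarray:
    baseline = predict(token_ids)
    scores = np.zeros(len(token_ids))
    for pos in range(len(token_ids)):
        new_token = rng.integers(0, vocab_size)
        while new_token == token_ids[pos]:        # ensure the token actually changes
            new_token = rng.integers(0, vocab_size)
        perturbed = token_ids.copy()
        perturbed[pos] = new_token
        # average the (absolute) change in prediction across tissues
        scores[pos] = np.mean(np.abs(predict(perturbed) - baseline))
    return scores
```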
Zero-shot scores for variant consequence prediction
To calculate zero-shot scores for a site of interest, we followed these steps. First, for each single nucleotide polymorphism (SNP), we retrieved a 6000 bp sequence centered on the SNP of interest from the reference genome of the given species. Next, we created two sequences: one carrying the reference allele and the other carrying the alternative allele at the SNP position. We then computed several zero-shot scores that capture different aspects of the vector distances between these two sequences in the embedding space. These scores include the L1 (Manhattan) distance, the L2 (Euclidean) distance, the cosine similarity, and the dot product (unnormalized cosine similarity). Additionally, we computed a loss-based score, namely the log-likelihood ratio (LLR), which compares the probabilities of the alternate and reference alleles. For the evaluation of zero-shot scores on potentially deleterious variants in Arabidopsis thaliana, we used the genomes from the 1001 Arabidopsis Genome Project (accessible at https://1001genomes.org/). For Oryza sativa, we used the recently compiled database RiceVarMap249, which includes over 4,000 individual samples from multiple populations (accessible at http://ricevarmap.ncpgr.cn/). In this analysis, we only considered bi-allelic SNPs and removed any SNP with a genotyping rate lower than 95% in Arabidopsis thaliana and 99% in Oryza sativa to ensure reliable allele frequencies. We used a less stringent rate in Arabidopsis thaliana because a threshold of 99% resulted in only 286,672 SNPs. We evaluated a total of 6,494,574 SNPs for Arabidopsis thaliana and 3,128,064 SNPs for Oryza sativa in these analyses. We note that the number of SNPs in Oryza sativa represents a 50% sample of the more than 6 million SNPs that remained after the genotyping rate filter. To annotate missense and non-synonymous SNPs, we used the General Feature Format (GFF) files from the TAIR10 and IRGSP-1.0 reference genomes obtained from Ensembl Plants. For the assessment of zero-shot scores on phenotype-associated variants, which was conducted only in Arabidopsis thaliana, we used the AraGWAS catalog found at https://aragwas.1001genomes.org. Specifically, we considered all variants annotated as "top association". As in the previous analysis, we focused on biallelic SNPs. To distinguish between coding and non-coding associated variants, we used the gene coordinates from the GFF of the TAIR10 reference genome. Variants that overlapped gene coordinates were labeled as coding, while those that did not were labeled as non-coding. The LLR score, referred to as the Genomic Pre-trained Network (GPN) score in28, as well as the phastCons and phyloP conservation scores, were also included as additional scores for comparison. The GPN scores were computed as described in28. Specifically, for each SNP, we retrieved a 512 bp sequence centered on the SNP, which corresponds to the context limit of the GPN model, to estimate the loss of the alternate and reference alleles. To assess the reliability of our computed GPN scores, we checked the correlation of the GPN scores that we computed for Arabidopsis thaliana against those provided by the authors at https://huggingface.co/datasets/gonzalobenegas/processed-data-arabidopsis/. We found a strong correlation (r = 0.97) between our GPN scores and the pre-computed GPN scores, indicating that the GPN scores computed for Oryza sativa accurately represent the scores intended by the authors. The less-than-perfect correlation between our computed GPN values and the pre-computed values in Arabidopsis thaliana is likely due to the fact that, in the original study, the loss of the alternate and reference alleles was based on the average of the forward and reverse strands, a step we omitted to speed up computation. The pre-computed phastCons and phyloP conservation scores were downloaded from PlantRegMap (accessible at https://plantregmap.gao-lab.org/). Because the two conservation scores rely on multiple alignments across species, some SNPs lack annotations; to ensure a fair comparison across methods, we therefore considered only variants with associated scores across all methods.
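To make the embedding-based scores concrete, the following NumPy sketch computes the four distance/similarity measures between reference- and alternate-allele sequence embeddings. The use of a single pooled embedding vector per sequence is an assumption, as the pooling strategy is not detailed here; the loss-based LLR score is computed separately from the model's token probabilities and is not shown.

```python
import numpy as np

# Illustrative sketch of the embedding-based zero-shot scores. `emb_ref` and
# `emb_alt` are assumed to be pooled per-sequence embedding vectors for the
# reference- and alternate-allele sequences, respectively.
def zero_shot_scores(emb_ref: np.ndarray, emb_alt: np.ndarray) -> dict:
    diff = emb_alt - emb_ref
    dot = float(np.dot(emb_ref, emb_alt))
    cosine = dot / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_alt))
    return {
        "l1": float(np.sum(np.abs(diff))),   # Manhattan distance
        "l2": float(np.linalg.norm(diff)),   # Euclidean distance
        "cosine": float(cosine),             # cosine similarity
        "dot": dot,                          # unnormalized cosine similarity
    }
```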
In silico mutagenesis in the cassava genome
To leverage the capability of AgroNT to provide accurate predictions across a variety of regulatory features, we further trained AgroNT to predict enhancer elements and gene expression across multiple tissues in the understudied crop Manihot esculenta (cassava). For enhancer element prediction, we used the PRO-seq data generated in Manihot esculenta seedlings in the study by Lozano et al.97. Specifically, we used a set of 9665 intergenic regulatory element (IRE) regions, which the authors defined as regions located at least 1000 bp away from any gene. To maintain consistency and avoid reprocessing the PRO-seq data, we used the Manihot esculenta reference genome v6.1 from Phytozome, which corresponds to the reference genome employed in that study. The training dataset was constructed as follows: for positive sequences, representing positive enhancer samples, we extracted a 1000 bp sequence centered on the middle of the PRO-seq peak. To create a matched negative sequence for each positive sequence, we randomly selected a 1000 bp intergenic sequence from the same chromosome, ensuring that it neither overlapped any PRO-seq peak nor differed substantially in GC content from the corresponding positive sequence; specifically, the GC content of the negative sequence was within 5 percentage points of that of the positive sequence. This approach resulted in a balanced and GC-matched dataset. We then split the training and test sets by chromosome, ensuring they were strictly non-overlapping: chromosome 17 was designated as the test set, chromosome 9 as the validation set, and the remaining chromosomes were used as the training set. The model's performance was evaluated using the AUROC metric. For gene expression prediction, we used the gene expression atlas from the study by Wilson et al.98, which measured gene expression levels across 11 tissue/organ types. We used the reference genome and annotation from Ensembl Plants version 56 and followed the same pipeline described above to process the RNA sequencing data. As for our previous gene expression models, we used promoter-proximal sequences based on TSS annotations to obtain sequences of 6000 bp and followed the same clustering approach to assign each promoter sequence to a gene family. We then used this information to produce training and test sets, using an 80%/20% split. Based on these trained models, we performed a comprehensive profiling of regulatory potential. Specifically, for a given sequence obtained from the reference genome, we systematically mutated each site to all three possible alternative nucleotides. In our initial motif-detection analyses, we predicted the impact of all possible single-nucleotide substitutions within 200 bp upstream and 200 bp downstream of the midpoint of the 1000 best-predicted enhancer regions. Similarly, for the gene expression analyses, we evaluated all possible single-nucleotide substitutions across promoter sequences located within 200 bp upstream and 200 bp downstream of the TSS, selecting the 1000 promoter sequences of genes with the highest average gene expression levels across all tissues. To measure the impact of a mutation, we computed the log-fold change as LFC = log2(P1/P0), where P0 represents the probability (or gene expression level) predicted for the original sequence and P1 represents the probability (or gene expression level) predicted for the mutated sequence.
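The in silico saturation mutagenesis and LFC scoring can be sketched as follows. The predict interface, the return structure, and the function name are illustrative assumptions; only the LFC = log2(P1/P0) formula is taken directly from the text above.

```python
import numpy as np

# Illustrative sketch: saturation mutagenesis of a sequence window with
# LFC = log2(P1 / P0) scoring. `predict` is an assumed interface returning the
# predicted probability (or expression level) for a nucleotide sequence.
def saturation_mutagenesis(sequence: str, predict) -> list[tuple[int, str, float]]:
    p0 = predict(sequence)
    results = []
    for pos, ref_base in enumerate(sequence):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue  # only the three alternative nucleotides are scored
            mutated = sequence[:pos] + alt_base + sequence[pos + 1:]
            p1 = predict(mutated)
            results.append((pos, alt_base, float(np.log2(p1 / p0))))
    return results
```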
Given that AgroNT appears to have learned the importance of critical motifs in enhancer and promoter-proximal regions for gene expression, we also assessed all possible single-nucleotide substitutions within 50 bp upstream and 50 bp downstream of the midpoint of all enhancer regions, and, similarly, all possible single-nucleotide substitutions within 50 bp upstream and 50 bp downstream of the TSS for all promoter-proximal regions. This resulted in a total of approximately 2.9 million and 7.3 million mutations, respectively, which we have made available for future research purposes. We visualized the mutation scores of a given sequence using the ggseqlogo function from the R package ggseqlogo. To identify transcription factor (TF) motifs corresponding to patterns predicted to be important for enhancer or promoter activity, we mapped the positions of 656 plant TF motifs from the non-redundant JASPAR CORE 2022 collection (https://jaspar.genereg.net/download/data/2022/CORE) in these sequences. We accomplished this using the matchMotifs function from the R package motifmatchr, with the parameters p.cutoff = 1e-04 and bg = "even". The figure panel highlights TF motif types that match positions with high predicted mutation scores.

Statistics and reproducibility
For the long non-coding RNA prediction analyses, we used a Wilcoxon rank-sum test from the R package stats (version 4.2.1) to assess the difference in length between mRNA and long non-coding RNA sequences. For the promoter and terminator strength analyses, when comparing observed and predicted values, we estimated the coefficient of determination (R2) using the lm function in the R package stats (version 4.2.1). For the chromatin accessibility analyses, we used a Wilcoxon rank-sum test from the R package stats (version 4.2.1) to assess the difference in performance between AgroNT and a model based on the DeepSEA architecture. For the gene expression analyses, when comparing observed and predicted values, we estimated the coefficient of determination (R2) using the lm function in the R package stats (version 4.2.1). For the zero-shot prediction analyses, the error bars around the point estimates were based on a sampling-with-replacement strategy using the sample function in the R package base (version 4.2.1). For the correlation between zero-shot scores, Pearson's product-moment correlation coefficient was computed using the cor.test function in the R package stats (version 4.2.1). The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) for all classification tasks were computed using the roc.curve and pr.curve functions in the R package PRROC (version 1.3.1). Data manipulation and processing were performed using the dplyr (version 1.1.1), tidyr (version 1.3.0), purrr (version 1.0.1), tibble (version 3.2.1), and stringr (version 1.5.0) packages in R. The graphs in Figs. 2–8 were created using the ggplot2 (version 3.4.3) package in R. R version 4.2.1 was used throughout all analyses. Additional data manipulation analyses were conducted using the packages pandas (version 1.5.3) and NumPy (version 1.23.5) in Python (version 3.10.9).
