DNA language model GROVER learns sequence context in the human genome

Unless otherwise specified as being written in R (v.4.2.1), all code is written in Python (v.3.12.2) with Scikit-learn (v.1.4.2).

Data
We used the Homo sapiens (human) genome assembly GRCh37 (hg19) and only took into account the sequences that contain A, C, G and T. Tokenization was performed as described below. Each chromosome is split into windows varying between 20 and 510 tokens in size. Specifically, with a 50% probability, the size of a window is 510; with the remaining 50% probability, its size is a random integer between 20 and 510. Eighty percent of the windows are taken as a training set and the remaining windows as a test set.

Byte-pair tokenization
BPE or byte-pair tokenization was originally developed as a text compression algorithm (ref. 14) and is now widely adopted as a tokenization strategy for transformer models on natural language, for example in GPT-3 (ref. 4). We adapted the tokenizer from ref. 24 for genome sequence and used up to 5,000 cycles of tokenization. Tokens are visualized in a word cloud with the R package wordcloud2 (https://github.com/Lchiffon/wordcloud2). The optimal vocabulary was selected through performance assessment with next-k-mer prediction with GROVER.

The DLM GROVER
GROVER adapts the transformer encoder architecture of BERT (ref. 9). It takes as input tokenized sequences of up to 510 tokens in length. In addition to the vocabulary generated from the genome, GROVER uses five special tokens that are common in language models: CLS, the classification token; PAD, used for padding the right side of a sequence that is shorter than the maximum input length of the model; UNK, for sequences of nucleotides that do not belong to the vocabulary; SEP, used to indicate the end of a sequence; and MASK, which represents masked tokens. The model is trained for masked token prediction. In a given input sequence, 2.2% of the tokens are selected, of which 80% are substituted with the special mask [MASK] token and 10% are randomly replaced with standard tokens (that is, any token different from the class [CLS], pad [PAD] or mask [MASK] tokens). To pretrain the model, we gather more than 5 million samples from the genome. Training was carried out on clusters of A100 graphics processing units, with batches of size 64, an Adam optimizer, a learning rate of 4 × 10−4, epsilon 10−6, beta 0.99, a maximum input length of 50, a dropout probability of 0.5 and a masking probability of 0.022.
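The window sampling described under 'Data' can be sketched as follows. This is a minimal illustration under our own assumptions: the tokenized chromosome is taken as a plain list of tokens, the function name is ours, and dropping short trailing windows at the chromosome end is a simplification rather than a detail of the published code.

```python
import random

def sample_windows(tokens, min_len=20, max_len=510):
    """Split a tokenized chromosome into windows of 20-510 tokens.

    With 50% probability a window is 510 tokens long; otherwise its
    length is a random integer between 20 and 510.
    """
    windows, start = [], 0
    while start < len(tokens):
        if random.random() < 0.5:
            size = max_len
        else:
            size = random.randint(min_len, max_len)
        window = tokens[start:start + size]
        if len(window) >= min_len:       # drop a too-short trailing window (our simplification)
            windows.append(window)
        start += size
    random.shuffle(windows)
    split = int(0.8 * len(windows))      # 80% training, 20% test
    return windows[:split], windows[split:]
```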
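The masking scheme can be illustrated with the sketch below. It implements the 2.2% selection and the 80%/10% mask/random-replacement split described above; leaving the remaining 10% of selected tokens unchanged follows the standard BERT recipe and is our assumption, as are the function and argument names.

```python
import random

def mask_tokens(token_ids, vocab_ids, mask_id, special_ids, p_select=0.022):
    """Return masked input and labels for masked-token prediction.

    2.2% of positions are selected; of these, 80% become [MASK] and
    10% are replaced with a random standard (non-special) token.
    """
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)      # -100 = ignored by PyTorch cross-entropy loss
    standard_ids = [i for i in vocab_ids if i not in special_ids]
    for pos, tok in enumerate(token_ids):
        if tok in special_ids or random.random() >= p_select:
            continue
        labels[pos] = tok                 # predict the original token at this position
        r = random.random()
        if r < 0.8:
            input_ids[pos] = mask_id      # 80%: replace with [MASK]
        elif r < 0.9:
            input_ids[pos] = random.choice(standard_ids)  # 10%: random standard token
        # remaining 10%: keep the original token (assumed, standard BERT behaviour)
    return input_ids, labels
```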
Next-k-mer prediction
For model validation and selection of the optimal vocabulary, we used a fine-tuning task of next-k-mer prediction that we previously developed (ref. 10). It allows us to compare different foundation models that rely on context learning, independent of how their vocabulary was generated, the size of the vocabulary or the learning parameters. The task is not dependent on a specific biological question. The principle of next-k-mer prediction is to take the pretrained language model and fine-tune it to predict the next k-mer, where k is 2, 3, 4, 5 or 6. Chromosome 21 is split into sequences of 510 nucleotides, of which we keep the first 56 nucleotides. We randomly select 500,000 sequences, 80% for training and 20% for testing. The samples are defined as the first 50 nucleotides of each sequence; the labels are the k nucleotides that follow. The next-k-mer model therefore has 4^k different classes (that is, 16, 64, 256, 1,024 and 4,096, respectively), corresponding to all possible k-mers. We use an Adam optimizer with a learning rate of 1 × 10−6, epsilon 10−8, beta 0.99, a maximum input length of 50, a dropout probability of 0.5 and a batch size of 64. For NT, HyenaDNA and DNABERT-2, the models are fine-tuned from the pretrained models provided by the authors. From the models, we extract performance metrics and use the accuracy of token prediction to select the optimal vocabulary and to compare tokenization strategies. For comparison of byte-pair tokenization cycles and visualization with ggplot2 (v.3.4.1) in R (v.4.2.1), we use a loess fit with a 95% confidence interval.
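As an illustration of the next-k-mer data preparation, the following sketch builds samples and labels from chromosome 21 as described above. It assumes the chromosome is available as a plain string, skips sequences containing non-ACGT characters, and omits tokenization and fine-tuning; the function name is ours.

```python
import random

def build_next_kmer_dataset(chr21_seq, k, n_samples=500_000, seed=0):
    """Create (sample, label) pairs for next-k-mer prediction.

    Samples are the first 50 nucleotides of 510-nt sequences from
    chromosome 21; labels are the k nucleotides that follow, giving
    4**k classes.
    """
    random.seed(seed)
    starts = list(range(0, len(chr21_seq) - 510, 510))
    random.shuffle(starts)
    pairs = []
    for start in starts:
        chunk = chr21_seq[start:start + 56]            # keep the first 56 nt
        if len(chunk) < 50 + k or set(chunk) - set("ACGT"):
            continue                                    # skip ambiguous bases
        pairs.append((chunk[:50], chunk[50:50 + k]))    # sample, next k-mer label
        if len(pairs) == n_samples:
            break
    split = int(0.8 * len(pairs))                       # 80% training, 20% test
    return pairs[:split], pairs[split:]
```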
k-mer models
For the k-mer models, we use the same parameters and samples as for GROVER, differing only in vocabulary and tokenization. Tokenization is performed with non-overlapping k-mers of four, five and six nucleotides in length. The vocabularies consist of all permutations of k consecutive nucleotides (that is, 256, 1,024 and 4,096 tokens, respectively) as well as the five special tokens described above.

W2V
For comparison of token embeddings, we use W2V as a static word embedding tool (ref. 20) that maps each word to a single vector. In general, this mapping does not account for lexical ambiguity, that is, that identical letter sequences can have multiple interpretations or different grammatical roles. We use W2V with a continuous bag-of-words (CBOW) approach for learning representations of words. CBOW predicts the middle word from the surrounding words in a sentence, using a small number of words as context; the order of the words is not taken into consideration. To generate the W2V embeddings, 300,000 sequences are randomly chosen from the training set. We use the W2V module of Gensim (https://radimrehurek.com/gensim/models/word2vec.html) with the following parameters: min_count = 1, vector_size = 768, window = 5.

Model embedding
We obtain a contextualized word representation as the token embedding of the BERT model. To obtain the trained token embeddings of the model, we extract the weights of the layer 'word_embeddings'. Where needed, we derive either the embeddings of all 12 transformer layers, a summarized version for each token sequence or a summarized version for sequences of 510 tokens in length with the classification token (CLS).

Dimensionality reduction
We obtain dimensionality reductions from average token embeddings that are represented as vectors of length 768. PCA and UMAP were performed in R (v.4.2.1) with the packages 'stats' (v.4.2.1) and 'umap' (v.0.2.10.0), respectively, with default parameters. MEV as a measure of context learning (ref. 21) was extracted as the variance explained by PC1. Clustering of the token embeddings and self-similarity was performed with hierarchical clustering of Euclidean distances with pheatmap (v.1.0.12). Dimensionality reduction of extracted embeddings of genome windows was performed in Python. The whole genome was split into non-overlapping bins of 510 tokens, and these sequences were used as input to the model to obtain the CLS token embedding of the last layer. The 768 dimensions of the embedding matrix were reduced to two dimensions with UMAP, configured with standard parameters, including 15 nearest neighbours and a minimum distance of 0.1.

Self-similarity
Self-similarity was assessed as the cosine similarity of different embeddings of the same token, separately for all 12 transformer layers of the BERT architecture. Five thousand embeddings per token were gathered from the test set, and pairwise cosine similarity for each token was computed in every layer.
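The non-overlapping k-mer tokenization of the baseline models can be sketched as follows; the function is illustrative, and mapping a short trailing fragment to UNK is our assumption.

```python
def kmer_tokenize(sequence, k):
    """Split a nucleotide sequence into non-overlapping k-mers.

    A trailing fragment shorter than k is mapped to the UNK token
    (our assumption about out-of-vocabulary handling).
    """
    tokens = []
    for i in range(0, len(sequence), k):
        kmer = sequence[i:i + k]
        tokens.append(kmer if len(kmer) == k else "UNK")
    return tokens

# Example: 4-mer tokenization of a short sequence
print(kmer_tokenize("ACGTACGTAC", 4))  # ['ACGT', 'ACGT', 'UNK']
```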
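The W2V embeddings can be generated with Gensim roughly as below; the tiny sentence list stands in for the 300,000 tokenized training windows, and CBOW is the Gensim default.

```python
from gensim.models import Word2Vec

# Tokenized genome windows; in practice 300,000 windows from the training set.
sentences = [["ACG", "T", "GGC", "A"], ["TT", "ACG", "GGC"]]  # tiny illustrative input

# CBOW is the Gensim default (sg=0); parameters follow the Methods text.
w2v = Word2Vec(sentences, min_count=1, vector_size=768, window=5)

vector = w2v.wv["ACG"]  # static 768-dimensional embedding of the token 'ACG'
```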
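Extracting per-window CLS embeddings and reducing them with umap-learn can be sketched as follows, assuming the pretrained model is available as a Hugging Face checkpoint (the path is a placeholder) and the tokenizer prepends the CLS token; the static weights of the 'word_embeddings' layer can likewise be read from model.get_input_embeddings().weight.

```python
import numpy as np
import torch
import umap
from transformers import AutoModel, AutoTokenizer

def window_cls_umap(model_path, windows):
    """Embed tokenized genome windows via the CLS token and reduce to 2D.

    model_path: path to the pretrained checkpoint (placeholder);
    windows: list of token lists, each up to 510 tokens long.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)
    model.eval()
    cls_vectors = []
    with torch.no_grad():
        for tokens in windows:
            inputs = tokenizer(" ".join(tokens), return_tensors="pt")
            hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
            cls_vectors.append(hidden[0, 0].numpy())     # CLS is the first position
    cls_matrix = np.vstack(cls_vectors)                  # (n_windows, 768)
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)    # standard parameters
    return reducer.fit_transform(cls_matrix)             # (n_windows, 2)
```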
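Per-layer self-similarity can be computed as the mean pairwise cosine similarity between contextual embeddings of the same token; a minimal sketch, assuming the embeddings of one token in one layer are stacked in a NumPy array:

```python
import numpy as np

def self_similarity(embeddings):
    """Mean pairwise cosine similarity of embeddings of one token.

    embeddings: array of shape (n_occurrences, 768) holding contextual
    embeddings of the same token in one transformer layer.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                       # cosine similarity matrix
    n = sims.shape[0]
    off_diagonal = sims[~np.eye(n, dtype=bool)]    # exclude self-comparisons
    return off_diagonal.mean()
```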
Genome annotation
Tokens and sequence windows were annotated with token characteristics and functional genomics data in R (v.4.2.1) with the GenomicRanges (v.1.50.2) package, using BSgenome.Hsapiens.UCSC.hg19 (v.1.4.3) and TxDb.Hsapiens.UCSC.hg19.knownGene (v.3.2.2) for genome information. Sequence was derived with Biostrings (v.2.66.0) to obtain GC, AC, AG and nucleotide content as well as k-mer frequencies. Gene element annotation was performed with ChIPseeker (v.1.34.1). Regression of GC content was performed using the residuals of a loess regression from the stats (v.4.2.1) package. Strand annotation relative to transcription was obtained from TxDb.Hsapiens.UCSC.hg19.knownGene (v.3.2.2). ChromHMM annotation of the GM12878 lymphoblastoid cell line was used from ENCODE, downloaded via the UCSC genome browser (https://www.genome.ucsc.edu/), from which the RepeatMasker track was also obtained for annotating repeats. Repeat classes were differentiated into the displayed categories, which were pooled with the respective '?' category; categories that are not displayed separately were pooled as 'other'. Replication strand and timing were obtained from OK-seq data from K562 chronic myeloid leukaemia cells from the reanalysis by ref. 25, downloaded via GEO (GSE131417). Tokens were annotated by determining the proportion of tokens overlapping with a specific annotation. Genome regions were annotated by determining overlaps. For differential assignment of annotation by strand, only bins that overlap the annotation in a unique direction are visualized. Replication timing is visualized only for early- and late-replicating DNA, which is clearer than showing all four categories. Of the 1,378,385 total ranges, 995 (0.07%) were regarded as outliers and not visualized; their inclusion did not add information but impaired visibility.

Fine-tuning task promoter identification, Prom300
Promoter sequences were obtained from the EPD database (https://epd.epfl.ch/) and lifted over to the hg19 assembly. Promoters were defined as the TSS −249/+50 bp. The corresponding byte-pair sequences were retrieved from the byte-pair-tokenized genome or converted into k-mer-tokenized sequences with k = [4, 5, 6]. To generate negative-class samples, each tokenized sequence was split into eight chunks, and the tokens of six randomly selected chunks were shuffled among those six chunks (a sketch of this chunk shuffling is given below). For the classification task of correct versus shuffled sequence, the dataset was split 80%–10%–10% for training, validation and test. For NT, HyenaDNA and DNABERT-2, the models are fine-tuned from the pretrained models provided by the authors.

Fine-tuning task promoter scanning, PromScan
The same human promoter annotations as used in the Prom300 task were taken in 10 kb windows around the TSS. The sequence was divided into overlapping 1,001 bp windows with a step size of 300 bp. Training was performed for classification of the presence of a TSS in a given window; only the central TSS was considered, even in the presence of more than one TSS. The dataset was split 80%–10%–10% for training, validation and test. Training was done with the 'Trainer' class of the 'Transformers' library, with the hyperparameters learning_rate = 1 × 10−6, batch size 8, warmup steps 50 and weight decay 0.01. For NT and HyenaDNA, the models are fine-tuned from the pretrained models provided by the authors. DNABERT-2 struggles with the memory required for the larger ranges and sample sizes of this task and was therefore excluded.
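The chunk shuffling used to build Prom300 negative samples can be sketched as follows; the function name and the equal-sized chunking are our choices.

```python
import random

def shuffle_promoter(tokens, n_chunks=8, n_shuffled=6):
    """Create a Prom300 negative sample from a tokenized promoter.

    The token sequence is split into eight chunks; the tokens of six
    randomly selected chunks are pooled and shuffled among those chunks.
    """
    bounds = [round(i * len(tokens) / n_chunks) for i in range(n_chunks + 1)]
    chunks = [tokens[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]
    selected = random.sample(range(n_chunks), n_shuffled)
    pool = [tok for i in selected for tok in chunks[i]]
    random.shuffle(pool)
    for i in selected:                       # redistribute the shuffled tokens
        size = len(chunks[i])
        chunks[i], pool = pool[:size], pool[size:]
    return [tok for chunk in chunks for tok in chunk]
```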
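The PromScan window construction can be illustrated as below; coordinates are taken relative to the start of the 10 kb region, and labelling a window as positive whenever it contains the central TSS is our simplification.

```python
def promscan_windows(region_seq, tss_offset, window=1001, step=300):
    """Split a 10 kb promoter region into overlapping windows.

    Returns (window_sequence, label) pairs, where the label is 1 if the
    central TSS (at position tss_offset within the region) falls inside
    the window and 0 otherwise.
    """
    samples = []
    for start in range(0, len(region_seq) - window + 1, step):
        seq = region_seq[start:start + window]
        label = int(start <= tss_offset < start + window)
        samples.append((seq, label))
    return samples
```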
CTCF motif binding
CTCF ChIP–seq peaks in HepG2 cells were derived from the ENCODE project (https://www.encodeproject.org/experiments/ENCSR000BIE/). The human CTCF motif was retrieved from the JASPAR database (https://jaspar.genereg.net/matrix/MA0139.1/), and motif occurrences in the hg19 genome were derived with FIMO from the MEME suite using the default background model and P < 10−4. Motifs that intersect with a CTCF peak were considered CTCF bound. Classification was performed for bound versus unbound CTCF motifs in a 1 kb window around the binding motif, with an 80%–10%–10% split for training, validation and test. Training was done with the 'Trainer' class of the 'Transformers' library, with the hyperparameters learning_rate = 1 × 10−6, batch size 16, warmup steps 50 and weight decay 0.01 (a generic Trainer sketch for these fine-tuning tasks is given below). For NT, HyenaDNA and DNABERT-2, the models are fine-tuned from the pretrained models provided by the authors.

Fine-tuning task GUE
Datasets were obtained from https://github.com/MAGICS-LAB/DNABERT_2. We selected only the tasks with human data: promoter detection, core promoter detection, transcription factor binding site prediction and splice site prediction. Training was done with the 'Trainer' class of the 'Transformers' library, with the hyperparameters warmup_steps = 50, weight_decay = 0.01, learning_rate = 1 × 10−4 and adamw_torch as the optimizer.

Fine-tuning task NT
Datasets were obtained from https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks with the 'load_dataset' method and InstaDeepAI/nucleotide_transformer_downstream_tasks as the parameter. The test split was used to report the results, and 5% of the training split was used for validation. We selected only the tasks with human data: promoter sequence prediction (promoter_all, promoter_no_tata, promoter_tata), enhancer sequence prediction (enhancers, enhancers_types) and splice site prediction (splice_sites_acceptors, splice_sites_all, splice_sites_donors). Training was done with the 'Trainer' class of the 'Transformers' library, with the hyperparameters learning_rate = 1 × 10−5 and batch size 8.

TF-IDF
Initially, the sequences are tokenized with non-overlapping k-mers (k = 2, 3, 4, 5 and 6) and with the BPE of GROVER. We use 'TfidfVectorizer' from scikit-learn (v.1.5) with default parameters to extract the features of each sequence of the training set. From these features, we train a random forest classifier and choose the best number of estimators between 100 and 2,000. This model is trained purely on token frequencies, without any sense of grammar, syntax or overall 'language' context between tokens. It thus serves as a negative control for tasks that benefit from the application of a language model.
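The Trainer-based fine-tuning used for PromScan, the CTCF task and the GUE benchmarks follows the same pattern; the sketch below uses the Hugging Face Trainer with the CTCF hyperparameters, a binary classification head, and placeholder model path and dataset arguments.

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def finetune(model_path, train_dataset, val_dataset, test_dataset):
    """Fine-tune a sequence classifier with the CTCF hyperparameters.

    model_path and the dataset objects are placeholders; datasets are
    assumed to be already tokenized for the respective model.
    """
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)
    args = TrainingArguments(
        output_dir="finetune_out",
        learning_rate=1e-6,
        per_device_train_batch_size=16,
        warmup_steps=50,
        weight_decay=0.01,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=val_dataset)
    trainer.train()
    return trainer.evaluate(test_dataset)   # metrics on the held-out test split
```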
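Loading one of the NT downstream tasks with the 'datasets' library can be sketched as follows; selecting the task via a configuration name and building the validation split with train_test_split are our assumptions about the exact call.

```python
from datasets import load_dataset

# Load one of the human downstream tasks, e.g. promoter_all (configuration name assumed).
dataset = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    "promoter_all",
)

# 5% of the training split is held out for validation; the test split is used for reporting.
splits = dataset["train"].train_test_split(test_size=0.05, seed=0)
train_split, val_split = splits["train"], splits["test"]
test_split = dataset["test"]
```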
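The TF-IDF baseline can be sketched with scikit-learn as below; the sequences are assumed to be pre-tokenized and joined into whitespace-separated strings, and the grid of estimator counts is illustrative rather than the exact search used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def fit_tfidf_baseline(train_docs, train_labels):
    """TF-IDF features plus a random forest, tuning the number of trees.

    train_docs: whitespace-joined token strings (k-mer or BPE tokens);
    train_labels: class labels of the fine-tuning task.
    """
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),             # default parameters, as in the Methods
        ("rf", RandomForestClassifier()),
    ])
    search = GridSearchCV(
        pipeline,
        param_grid={"rf__n_estimators": [100, 500, 1000, 2000]},  # illustrative grid
        cv=3,
    )
    search.fit(train_docs, train_labels)
    return search.best_estimator_
```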
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.