Distinguishing word identity and sequence context in DNA language models

The final source code is available as: Sanabria Melissa, Hirsch Jonas, & Poetsch Anna R. (2023). Distinguishing word identity and sequence context in DNA language models—the code to the paper. Zenodo. https://doi.org/10.5281/zenodo.8407874 [13]. A tutorial for performing next-k-mer prediction is available as: Sanabria Melissa, Hirsch Jonas, & Poetsch Anna R. (2023). Next-kmer-prediction fine-tuning to compare DNA language models, a tutorial. Zenodo. https://doi.org/10.5281/zenodo.8407817 [14].

DNA language model architecture

We train with the Homo sapiens (human) genome assembly GRCh37 (hg19), only taking into account the sequences that contain A, C, G and T. We use the pre-trained models and code of the DNA language model DNABERT [6], provided by the authors (https://github.com/jerryji1993/DNABERT; June 2023). DNABERT is based on a Bidirectional Encoder Representations from Transformers (BERT) [7] model that takes as input tokenized sequences of up to 510 tokens. Tokenization is performed with overlapping k-mers of 4, 5 and 6 nucleotides, selected based on the performance metrics in the original study. The vocabularies consist of all permutations of k consecutive nucleotides (i.e. 256, 1024 and 4096 tokens, respectively) as well as five special tokens: CLS, PAD, UNK, SEP and MASK. CLS represents the classification token, PAD is used for padding the right side of a sequence in case it is shorter than the maximum input length of the model, UNK is for sequences of nucleotides that do not belong to the vocabulary (which in practice does not occur, because the sequence is limited to A, C, G and T), SEP indicates the end of a sequence, and MASK represents the masked tokens.

Masked token prediction

To extract what the model has learned, we examine what the model predicts over the mask. Each chromosome is split into sub-sequences whose length varies between 20 and 510 tokens: with 50% probability the length of a sub-sequence is 510, and with the remaining 50% probability its length is a random integer between 20 and 510. Then 20% of the sub-sequences are taken as the dataset for this task, which amounts to around one million samples. For each sample we randomly choose a token and mask a number of tokens equal to the size of the k-mer, using the following pattern: 4mer: −1, 0, +1, +2; 5mer: −2, −1, 0, +1, +2; 6mer: −2, −1, 0, +1, +2, +3. Position 0 represents the chosen token, "−" the tokens preceding the central token and "+" the following tokens. This way the central nucleotide of the mask does not overlap with any token outside the mask.

Next k-mer prediction

To build a fine-tuning task that allows comparison of different foundation models, relies on context learning, and is not dependent on a biological question, we established next-k-mer prediction. We take the pre-trained language models (4mer, 5mer and 6mer) and fine-tune every model to predict the next k-mer, where k is 2, 3, 4, 5 and 6. To create the data for this task, chromosome 21 is split into sequences of 510 nucleotides, of which we keep the first 56 nucleotides. These sequences are randomly shuffled. The final dataset is composed of 500,000 sequences, of which 80% are used for training and 20% for testing. The samples are defined as the first 50 nucleotides of each sequence; the labels are the k (2, 3, 4, 5 or 6) nucleotides that follow these 50 nucleotides.
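A minimal sketch of this dataset construction, assuming the 56-nucleotide chunks from chromosome 21 are already available as Python strings (the variable and function names are illustrative, not taken from the released code):

import random

K = 6  # next-k-mer size; the paper uses k = 2, 3, 4, 5 and 6

def make_sample(seq_56nt):
    # Sample = first 50 nucleotides, label = the k nucleotides that follow.
    return seq_56nt[:50], seq_56nt[50:50 + K]

# Illustrative stand-in for the 56-nt chunks cut from chromosome 21.
chr21_chunks = ["".join(random.choice("ACGT") for _ in range(56)) for _ in range(1000)]

random.shuffle(chr21_chunks)
pairs = [make_sample(s) for s in chr21_chunks]
split = int(0.8 * len(pairs))
train, test = pairs[:split], pairs[split:]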
The next-k-mer model has 4^k different classes, i.e. 16, 64, 256, 1024 and 4096, respectively, which are all the permutations of k nucleotides. The models are trained with cross-entropy loss on the prediction of the next k-mer using the Adam optimizer with a learning rate of 10^−6, an epsilon of 10^−8, and a beta of 0.99. The model accepts a maximum input length of 50 tokens. The dropout probability of the classification layer is 0.5. We use a batch size of 64 and train for 150 iterations. Performance is assessed with accuracy, an easily interpretable metric: it reflects how often the model picks the correct token and can thus be directly compared to a random pick of a token, i.e. 1/4^k.

Promoter identification

The Prom300 task was adapted with minor modifications from Ji et al. [6]. The modification concerns how the sequence is disrupted. In short, we use the human data (hg19) from the Eukaryotic Promoter Database (https://epd.epfl.ch/human/human_database.php?db=human) to obtain annotation of 30,000 intact promoter sequences, which we define as 300 bp long ranges from −249 to +50 bp around the Transcriptional Start Site (TSS). For the definition of non-promoter samples, we apply a shuffling strategy of nucleotides rather than mutation, to prevent changes in the nucleotide composition of the sequence: we divide each sequence into 20 parts of equal size and shuffle 15 of them. The sequences are tokenized according to each model, i.e. divided into overlapping k-mers. For the prediction, we add a classification layer with one neuron. The model is trained with cross-entropy loss, using an Adam optimizer with a learning rate of 10^−6, an epsilon of 10^−8, and a beta of 0.99. The model accepts a maximum input length of 50. We use a batch size of 64 and train for 10 epochs.

Word2Vec

For comparison of token embeddings, we use Word2Vec, a static word embedding tool [10] that maps each word to a single vector. In general, this mapping function does not account for lexical ambiguity, i.e. identical letter sequences can have multiple interpretations or different grammatical roles. We implemented Word2Vec with a continuous bag-of-words (CBOW) approach for learning representations of words, which uses the surrounding words in the sentence to predict the middle word. The context includes a window of 5 words with the current word in the center. This architecture is referred to as a bag-of-words model because it does not consider the order of words in the context. To generate the Word2Vec (W2V) embeddings, each chromosome is first split into sub-sequences whose length varies between 20 and 510 tokens: with 50% probability the length of a sub-sequence is 510, and with the remaining 50% probability its length is a random integer between 20 and 510. Then 300,000 of the sub-sequences are randomly chosen as the dataset for this task. We tokenize each sequence with overlapping tokens and create three datasets, one for each k-mer (4mer, 5mer and 6mer). We use the Word2Vec module of Gensim (https://radimrehurek.com/gensim/models/word2vec.html) with the following parameters: min_count = 1, vector_size = 768, window = 5.
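A minimal sketch of this step with Gensim, assuming the tokenized sub-sequences are available as lists of overlapping k-mer strings (the variable names are illustrative):

from gensim.models import Word2Vec

# Illustrative stand-in for the tokenized sub-sequences (lists of overlapping 6-mers).
sentences = [
    ["ACGTAC", "CGTACG", "GTACGT"],
    ["TTGACA", "TGACAT", "GACATT"],
]

# CBOW training (sg=0 is Gensim's default) with the parameters reported above.
w2v = Word2Vec(sentences, min_count=1, vector_size=768, window=5, sg=0)

embedding = w2v.wv["ACGTAC"]  # 768-dimensional static vector for one token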
Model embedding

Unlike static word embeddings, dynamic word embeddings aim to capture word semantics in different contexts, addressing issues like the context-dependent nature of words. We obtain a summarized version of the contextualized word representations, namely the token embedding of the BERT model. To obtain the token embeddings, we extract from DNABERT [6] the weights of the word_embeddings layer of each k-mer model.

Dimensionality reduction and maximum explainable variance

Both the W2V and the DNA language model embeddings are represented as vectors of size 768. Average distances between tokens are therefore interrogated through the dimensionality reduction algorithms Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), implemented in R with the packages 'stats' (4.2.1) and 'umap' (0.2.10.0), respectively. As a measure of context learning, the Maximum Explainable Variance (MEV) [9] was extracted as the variance explained by the first principal component.
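As an illustration, both steps can be sketched in Python (the paper performs PCA in R; the HuggingFace transformers call assumes a locally available DNABERT checkpoint, and the path shown is a placeholder):

import numpy as np
from transformers import BertModel
from sklearn.decomposition import PCA

# Placeholder path to a pre-trained DNABERT k-mer checkpoint.
model = BertModel.from_pretrained("path/to/dnabert-6mer")

# Token embedding matrix of shape (vocabulary size, 768), taken from the word_embeddings layer.
token_embeddings = model.get_input_embeddings().weight.detach().numpy()

# Maximum Explainable Variance (MEV): variance explained by the first principal component.
pca = PCA(n_components=1).fit(token_embeddings)
mev = pca.explained_variance_ratio_[0]
print(f"MEV = {mev:.3f}")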
