Generative language models on nucleotide sequences of human genes

The dataset and the code implementing all of the techniques described in this paper are available online at https://github.com/boun-tabi/GenerativeLM-Genes, making the results reproducible.

N-gram language models

For N-gram language models [40], we experimented with six different values of N, from one up to six. The perplexity values obtained are given in Table 4. The Natural Language Toolkit (NLTK) library [41] was helpful both for building the N-gram models and for calculating perplexity.
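A minimal sketch of such an N-gram model in NLTK, assuming each nucleotide is treated as a single token (the toy data and variable names below are illustrative, not taken from the repository). Perplexity is the exponentiated average negative log-probability of the tokens, so lower values indicate a better fit:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

n = 3  # order of the model; the paper tries N = 1..6

# Toy corpus: each gene tokenized into single nucleotides (an assumption).
train_genes = [list("ATGCGTAA"), list("ATGCCTGA")]
test_gene = list("ATGCGTGA")

# Pad each gene with <s>/</s> markers and build training n-grams and vocabulary.
train_data, vocab = padded_everygram_pipeline(n, train_genes)
model = MLE(n)
model.fit(train_data, vocab)

# Perplexity on a held-out gene; any n-gram unseen in training makes this
# infinite under a pure maximum-likelihood model.
test_ngrams = list(ngrams(pad_both_ends(test_gene, n=n), n))
print(model.perplexity(test_ngrams))
```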
Table 4 Results for different N values.

N-gram language models with Laplace smoothing

Obtaining a perplexity of infinity for N = 6 was an interesting result, since the training set contained thousands of genes, each thousands of nucleotides long, so encountering an unseen subsequence of length six was not expected. Inspecting the cause revealed that the problem was related to modeling the beginning and end of a gene, which is a crucial component of the task. Laplace smoothing [42] was therefore used to avoid this problem, with N values ranging from six up to eight. The corresponding perplexity values are given in Table 5.
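Switching to add-one (Laplace) smoothing only requires swapping the model class in the sketch above; nltk.lm.Laplace gives every n-gram a small nonzero probability, so the perplexity stays finite even for unseen sequences:

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 6  # with smoothing, the paper tries N = 6..8
# `train_genes` as in the previous sketch.
train_data, vocab = padded_everygram_pipeline(n, train_genes)
model = Laplace(n)  # add-one smoothing: no n-gram gets zero probability
model.fit(train_data, vocab)
```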
Table 5 Laplace smoothing results for different N values.

The best model we obtained had about 66,000 parameters.

Long short-term memory based language model

After experimenting with N-gram language models, we continued with deep learning-based approaches. The first option was a Recurrent Neural Network (RNN) based model, specifically a Long Short-Term Memory (LSTM) [43] based model. A layer learning both embedding and position information was used at the input. Then one or more LSTM layers were applied, depending on the hyperparameter configuration. Since the aim was to predict the next token, a final linear layer with as many units as the vocabulary size was used with Softmax to obtain the predictions. The model architecture is visualized in Fig. 3. Adam was chosen as the optimizer.

Fig. 3 Long short-term memory based architecture.

Hyperparameter optimization was a crucial component. The basic procedure was as follows: we started with a hyperparameter configuration that looked reasonable; if the model underfit, the relevant hyperparameters were changed to increase its capacity, and when overfitting was observed, the opposite was done. The details can be seen in Table S1. There were three main hyperparameters to optimize: the embedding dimension, the number of LSTM layers, and the LSTM dimension. For each hyperparameter configuration, the model was run for 40 epochs, and the result where the absolute difference between training and validation perplexities was less than 0.02 was chosen as the result of that trial. The perplexity values obtained for the best model are given in Table 6.
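A minimal Keras sketch of this architecture, again assuming single-nucleotide tokens; the dimensions below are illustrative placeholders rather than the tuned values from Table S1:

```python
import keras_nlp
from tensorflow import keras

VOCAB_SIZE = 8   # assumption: A, C, G, T plus special tokens such as start/end
SEQ_LEN = 256    # assumption: length of the input windows
EMBED_DIM = 64   # placeholder hyperparameters; the paper tunes these three
LSTM_DIM = 128
NUM_LAYERS = 2

inputs = keras.Input(shape=(SEQ_LEN,), dtype="int32")
# One layer that learns token embeddings and position information together.
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE, sequence_length=SEQ_LEN, embedding_dim=EMBED_DIM
)(inputs)
for _ in range(NUM_LAYERS):  # one or more LSTM layers, per the configuration
    x = keras.layers.LSTM(LSTM_DIM, return_sequences=True)(x)
# Linear layer with vocabulary-size units and Softmax for next-token prediction.
outputs = keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```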
Table 6 Long short-term memory result.

The best model we obtained had 8,140,552 parameters.

Transformer based language model

After the Long Short-Term Memory based models, transformer-based models were the next candidate. Most of the procedure was very similar to the previous case. A layer learning both embedding and position information was used at the input, followed by one or more transformer blocks, depending on the hyperparameter configuration. Since the aim was again to predict the next token, a final linear layer with as many units as the vocabulary size was used with Softmax to obtain the predictions. The model architecture is visualized in Fig. 4. Adam was chosen as the optimizer.

Fig. 4 Transformer architecture.

Hyperparameter optimization was again crucial and followed the same procedure: start from a reasonable configuration, increase the model's capacity when it underfits, and reduce it when it overfits. The details can be seen in Table S2. There were four main hyperparameters to optimize: the embedding dimension, the number of transformer blocks, the feed-forward dimension of the transformer blocks, and the number of attention heads per block. For each hyperparameter configuration, the model was run for 40 epochs, and the result where the absolute difference between training and validation perplexities was less than 0.02 was chosen as the result of that trial. The perplexity values obtained for the best model are given in Table 7.

Lastly, the Tensorflow library [44], its high-level API Keras [45], and its NLP tool KerasNLP [46] were very beneficial for implementing our deep learning based models. The best model we obtained had 1,844,988 parameters.
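A corresponding sketch using KerasNLP's decoder-style transformer block, which applies causally masked self-attention when no cross-attention input is given; the dimensions are again placeholders, not the tuned values from Table S2:

```python
import keras_nlp
from tensorflow import keras

VOCAB_SIZE, SEQ_LEN = 8, 256  # same assumptions as in the LSTM sketch
EMBED_DIM = 64   # placeholder values for the four tuned hyperparameters
NUM_BLOCKS = 2
FF_DIM = 128     # feed-forward (intermediate) dimension of each block
NUM_HEADS = 4    # attention heads per block

inputs = keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE, sequence_length=SEQ_LEN, embedding_dim=EMBED_DIM
)(inputs)
for _ in range(NUM_BLOCKS):
    # With only one input, TransformerDecoder acts as a causal self-attention
    # block, which is exactly what next-token prediction requires.
    x = keras_nlp.layers.TransformerDecoder(
        intermediate_dim=FF_DIM, num_heads=NUM_HEADS
    )(x)
outputs = keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```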
Table 7 Transformer result.

The test set perplexities obtained for the different models are visualized in Fig. 5 using the Matplotlib library [47].

Fig. 5 Comparison of different methods based on test set perplexity.

Performances on real life tasks

Evaluating the success of the methods is not trivial, so a real-life task could be very useful. For this reason, we tried to find a real-life problem where modeling the nucleotide sequences of human genes could be helpful. Distinguishing the nucleotide sequence of a real gene from a mutated version seemed like an exciting task, since doing it well requires some level of understanding of the data we were working on. In fact, what we did is similar to the evaluation of GPT-2 [34] on tasks like the Winograd Schema Challenge [48], in the sense that our model is trained as a generative model but can be adapted to discriminative tasks when the correct setting is prepared.

Our strategy was to obtain pairs such that each pair consists of the nucleotide sequence of an actual human gene and a mutated version of it. The obtained models would then assign a probability or perplexity to both sequences and predict which one is real and which one is mutated based on these values. If a model uses probability, the sequence with the higher predicted probability is selected as the real one; if it uses perplexity, the sequence with the lower value is chosen, since lower is better for the perplexity metric. The accuracy of these predictions is then checked to measure the success of a given model. After deciding on this strategy, the only remaining step was to obtain a dataset of real and mutated gene pairs.

Synthetic mutation dataset

One way to obtain a dataset consisting of real and mutated pairs is to generate the mutated samples from the real ones with a random process. The method was as follows. First, some hyperparameters are determined: the number of samples in the dataset and the maximum number of changes to apply to a selected real gene. We set the first to 1000. For the second, we experimented with three different values, 1, 5, and 10, to see how the performance changes with this parameter. Lastly, since the process involves random number generation, a random seed was used to obtain reproducible results; here, too, we used three different seeds to capture the general behavior.

After determining the hyperparameters, for each sample selected from the test set, the number of changes to apply was drawn uniformly between 1 and the maximum number of changes (both inclusive). For each change, one of three options was chosen with equal probability: adding a new randomly chosen nucleotide at a random position, deleting the nucleotide at an arbitrary position, or changing the nucleotide at a randomly selected position to another one. The described operations are visualized in Fig. 6, and the algorithm is described in Fig. 7 and sketched in code below.

Fig. 6 Synthetic mutation dataset operation types.

Fig. 7 Procedure for generating a synthetic mutation dataset.

As stated, this procedure was carried out for all three values of the maximum number of mutations and all three seeds, leading to nine different synthetic mutation datasets.
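A minimal sketch of the generation procedure (function and variable names are ours, not from the repository; `test_genes` is assumed to hold the nucleotide sequences of the test set):

```python
import random

NUCLEOTIDES = "ACGT"

def mutate(gene: str, max_changes: int, rng: random.Random) -> str:
    """Apply 1..max_changes mutations, each an insertion, deletion, or
    substitution at a uniformly chosen position."""
    seq = list(gene)
    for _ in range(rng.randint(1, max_changes)):  # both bounds inclusive
        op = rng.choice(["insert", "delete", "substitute"])
        if op == "insert":
            seq.insert(rng.randrange(len(seq) + 1), rng.choice(NUCLEOTIDES))
        elif op == "delete":
            del seq[rng.randrange(len(seq))]
        else:  # substitute with a different nucleotide
            pos = rng.randrange(len(seq))
            seq[pos] = rng.choice([n for n in NUCLEOTIDES if n != seq[pos]])
    return "".join(seq)

# One of the nine datasets: seed 419432 (one of the paper's seeds),
# a maximum of 5 changes, and 1000 genes taken from the test set.
rng = random.Random(419432)
pairs = [(g, mutate(g, max_changes=5, rng=rng)) for g in test_genes[:1000]]
```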
For the nine synthetically obtained datasets of real and mutated gene pairs, the best models from the N-gram with Laplace smoothing, recurrent neural network (i.e., LSTM in our case), and transformer-based language model families were chosen. Each model's prediction was taken for every pair, and its accuracy was calculated. The results based on perplexity for the three random seeds, which were 419432, 623598, and 638453 in the code, are given in Table 8.
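The pairwise evaluation then reduces to a simple comparison; in the sketch below, `perplexity` stands for any trained model's scoring function (an assumed interface, not an actual function from the repository):

```python
def accuracy_by_perplexity(perplexity, pairs):
    """Fraction of pairs in which the real gene receives a lower
    (i.e., better) perplexity than its mutated counterpart."""
    correct = sum(perplexity(real) < perplexity(mutated) for real, mutated in pairs)
    return correct / len(pairs)
```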
Table 8 Synthetic mutation dataset accuracy results based on perplexity predictions. The accuracy scores for each of the three seeds are shown in parentheses.

The accuracy results for the different models on the synthetic mutation datasets, averaged over the seeds, are visualized in Fig. 8 using the Matplotlib library.

Fig. 8 Averaged accuracy results using perplexity for predictions.

Real mutation dataset

Although synthetic datasets are useful for measuring the success of the different alternatives, checking performance on real data is also very important. Consequently, we needed to find real mutations for the genes in our test set. The Human Gene Mutation Database [49,50,51] was a valuable resource with almost no alternative for this goal. Even though we could find only 10 mutations for genes in our test set, which is very limited, they still helped to show that our models can make good predictions on real examples as well. Since the database is not publicly available, we are unfortunately unable to share the obtained mutation dataset. Details about the mutations, and which model made a mistake on which example, can be found in Table 9.
Table 9 Correctness of perplexity-based predictions for different models.
