A long-context language model for deciphering and generating bacteriophage genomes

megaDNA allows zero-shot prediction of gene essentialityTo construct the training dataset, we collected bacteriophage genomes with high confidence from three sources including the NCBI GenBank, the metagenomic gut virus (MGV) catalog10, and the gut phage database (GPD)11 (Supplementary Fig. 1). After data cleaning, we constructed a dataset of 99.7 K bacteriophage genomes to pre-train our model (Methods), and the training data was nucleotide-level tokenized, where each nucleotide is treated as a separate token.Traditionally, transformer-based language models only process a few thousand tokens of context because the computational cost of the self-attention mechanism scales quadratically with sequence length. This context window is not sufficient to model nucleotide-level tokenized phage genomes. To overcome this problem, we employed a multi-scale transformer structure9 to model the long-range context information. This architecture consists of three decoder-only transformer layers with multi-head attention, and each layer captures sequence information at different resolutions: the local layer processes embeddings of tokenized sequences within a 16 bp window. Its output serves as the input to the middle layer, which has a context window of 1024 bp. Finally, the global layer utilizes information from the middle layer to model sequence dependencies across the whole input context (96 K bp).We hypothesize that our pretrained language model captures the structural patterns of bacteriophage genomes in our training dataset, allowing the model’s loss to approximate the fitness of genome sequences. To test this hypothesis, we conducted in silico mutagenesis analysis to predict essential genes in the lambda phage genome12 (Fig. 1b). Without any supervised training, we found mutations within the coding sequences of essential genes result in higher losses than non-essential genes (Fig. 1c). Consequently, changes in model loss can be used as a zero-shot predictor of essential genes (AUROC: 0.86, Fig. 1d). Similarly, mutations in the start and stop codons of essential genes lead to higher model losses than non-essential genes (Fig. 1d, Supplementary Fig. 2). We further analyzed the similarity of sequences in the training dataset to the lambda phage genome (Supplementary Fig. 3). We found that 847 training sequences aligned with the lambda phage genome through BLAST analysis13. Among these sequences, 50 show a sequence identity above 0.4, and 8 have an identity exceeding 0.9. This finding indicates that our model’s performance benefits from a broad spectral of related references in the training dataset, predominantly consisting of low-similarity sequences and supplemented by a small proportion of highly similar ones. In addition, about 34% of the phage genomes have more representations in the training sequences than the lambda phage (Supplementary Fig. 3), suggesting that the megaDNA model could be potentially utilized to study essential genes within these genomes.Fig. 1: Foundational capacities of our language model.a Overview of the model applications. b In silico mutagenesis analysis to identify essential genes in the bacteriophage genome. c Model loss variation across the lambda phage genome in the mutagenesis analysis. Upper, essential, and non-essential genes in the genome. Lower: changes in model loss for 50 bp non-overlapping windows across the genome (blue). The step size is 50 bp, and moving averages of model loss across 5000 bp windows are denoted in red. d Zero-shot prediction of essential genes by calculating the effects of mutations in the gene coding region (blue), start codon (orange) and stop codon (green). Area under the ROC curve (AUROC) scores are reported. e Prediction of mutational effects on protein functions using model embeddings. f Prediction of mutational effects for the deep mutational scanning experiment of the infA gene. Spearman correlation coefficients of the predicted and reported fitness from fivefold cross-validation tests are reported (Blue: megaDNA, gray: DeepSequence). n is the number of training samples. g Prediction of the impacts of Single Nucleotide Polymorphisms (SNPs) in the T7 bacteriophage genome. Spearman correlation of the predicted and reported fitness from fivefold cross-validation tests is reported. h Prediction of regulatory element activity using model embeddings. i Prediction of translation efficiencies for non-model organisms and high-throughput gene expression libraries. For K. oxytoca, P. protegens, and E. coli DH10B, we evaluated the model performance on endogenous genes. Fivefold cross-validation tests were used for all calculations. j Classifying taxonomies of unannotated sequences using model embeddings. k UMAP visualization of model embeddings for sequences from bacteriophages, bacteria, and archaea (model middle layer, sample size: n = 5000 per group). For f, g, and i, data are presented as mean values ± SEM from fivefold cross-validation tests (n = 5 folds). Source data are provided as a Source Data file.megaDNA learns the functional properties of proteins and regulatory elementsSequence embeddings are high-dimensional representations of the model input that capture rich contextual information. These embeddings can be used to make predictions about the quantitative property related to the original input. Since our megaDNA model takes DNA sequences as the input, we could harness the model’s learned representations for a wide range of predictive tasks. We first evaluated our model’s ability to predict the effects of sequence mutations on protein functions using a deep mutational scanning (DMS) dataset for the E. coli essential gene infA14 (Fig. 1e). This dataset includes all possible single codon mutations for infA and the corresponding mutational effects measured as fitness values through a growth competitive assay. To model protein fitness, mutated gene coding sequences were used as inputs. Then a linear regression model was trained on sequence embeddings derived from the internal activities of neurons within the megaDNA model to predict the mutational effects. The standard deviation of the prediction error is 0.16, which is much smaller than the full dynamic range of the protein fitness measurement (Supplementary Fig. 4). Our model’s prediction performance closely matched the state-of-the-art model DeepSequence15 (Fig. 1f), including for a protein not existing in the training dataset (Supplementary Fig. 5). Moreover, our model successfully predicted the impact of SNPs across the T7 bacteriophage genome16 (Fig. 1g). In this case, the mutated gene sequences were used as model input and the sequence embeddings were utilized to train regression models that predict SNP impacts, quantified as rates of mutability16. It is worth noting that this dataset covers only about 15% of all possible SNPs (Supplementary Fig. 6), which is substantially smaller than the DMS dataset. Despite this constraint, the Spearman correlation coefficient between predicted and measured impact is 0.71 for gene 1.3 and 0.54 for T7 RNAP (Supplementary Fig. 4). For genes with lower prediction accuracy, such as gene 17, our model still identifies mutations with significant fitness effects.Expression of phage genes relies on the protein synthesis machinery in host bacteria cells. We investigated the potential of the model embeddings to predict regulatory element activity in bacteria genomes (Fig. 1h). The 5′UTR sequences were used as model input to derive embeddings, which were then used to predict their translation efficiencies via linear regression. We leveraged the paired RNA-seq and ribosome profiling datasets to quantify the translation efficiency of 5′UTR across three bacteria genomes17,18. The translation efficiency was defined as the ratio of the normalized ribosome density and the RNA level for the genes. Our model effectively predicted the translation efficiencies of 5′UTR in both model and non-model organisms, including K. Oxytoca, P. Protegens, and E. coli (Fig. 1i). In addition to the endogenous regulatory elements, we also benchmarked the predictive performance of our approach on the high-throughput measurement of the translational activity of a random 5’UTR library in E. coli19 (Fig. 1i). The Spearman correlation coefficients range from 0.62 to 0.82 for these datasets, illustrating our model’s capacity to capture translation-related sequence features. We also found that the model’s performance is robust to the training sample size (Supplementary Fig. 7).Lastly, we extended our model to identify the taxonomy of unannotated sequences at the domain level (Fig. 1j). We collected unannotated sequences from bacteriophage, bacteria, and archaea genomes, which were used as model input to calculate their embeddings. Then these embeddings were mapped into a low-dimensional space, where we observed clear separations among different domains (Fig. 1k). By training logistic regression models based on sequence embeddings and the domain labels, we achieved a high classification accuracy in the cross-validation tests (average AUROC of 0.98, Supplementary Fig. 8). This high level of accuracy was consistent across different layers of our model (Supplementary Fig. 8). The prediction performance of the embeddings from the local and global layer was slightly lower compared to the middle layer. This difference may result from the local layer’s short context window and the global layer’s limited resolution for local sequence features. In contrast, the middle layer’s context window provides an optimal balance of length and resolution, enabling effective distinction of sequences from different domains. We further weighted the model predictions on the test sequences based on their similarity to the training datasets. A weight of 0 excludes test sequences that have at least one matched training sequence, and a weight of 1 includes all test sequences in the AUROC calculation (methods). Our results indicate that completely ruling out similar test sequences only results in a reduction of AUROC by 0.03, 0.01, and 0.02 for different model layers (Supplementary Fig. 9). Since the training data doesn’t contain genome sequences of bacteria or archaea, these results demonstrate the broad applicability of our model.megaDNA generates de novo genomic sequencesOur approach enables de novo generation of the genome sequence of bacteriophage (Fig. 2a). We generated a total of 1024 sequences longer than 1 K bp and applied geNomad20 for functional annotation of the generated sequences. The average sequence length is 43 K bp, and the average number of predicted genes per sequence is 67, which is similar to the training dataset (mean length: 48 K bp, average number of predicted genes: 68). The gene length distribution of the generated sequences is close to that of the training dataset (Fig. 2b, average gene length: 558 bp vs 646 bp), while the gene number distribution shows wider spread (Supplementary Fig. 10). In addition, the gene densities of the generated and training sequences are close with each other (Supplementary Fig. 10). The median virus score of all generated sequences is 0.75 and the maximum score is 0.97. These scores are lower than those of natural bacteriophage genomes, which have a median value of 0.98 (Fig. 2c). 228 out of all generated sequences (22%) are predicted to be Caudoviricetes by geNomad (Fig. 2d). As a comparison, 98% of the genomes in the training dataset were classified as Caudoviricetes. We further analyzed the generated sequences using vContact221 and found that all generated sequences form singletons, indicating a high diversity among them.Fig. 2: Language model generates sequences with functional genomic structures.a The workflow schematic. b Comparison of gene length distributions between randomly sampled subsets of predicted genes in generated sequences and training dataset (sample size: n = 2000). Two-sided Kolmogorov–Smirnov test: p value = 0.15. c Comparison of the predicted virus scores for generated sequences (sample size: n = 1024) and the training dataset (sample size: n = 99,429). Median virus scores are indicated by white dots. Black bars denote the interquartile ranges (25th to 75th percentiles). d Predicted taxonomy for the generated sequences predicted as viral. Only taxonomies with >1 sequence are shown. geNomad20 was used to produce results in (c, d). e Functional annotation of a selected sequence fragment (generated sequence #87). f Predicted promoter activity for all the 5′UTRs in the generated sequence #87 (sample size: n = 44), along with the promoter activity of the random sequences with the same length. Promoter activities were calculated using the Promoter Calculator22. Two-sided Kolmogorov–Smirnov test: p value = 6.3 × 10−8. g Proportions of adenine (A) and guanine (G) nucleotides preceding the start codon of all the predicted genes in the generated sequence #87. h Mean predicted pLDDT scores for the ESMFold predicted structures23. We focused on proteins with geNomad markers from generated sequences (sample size: n = 343; median value: 28) against random peptide sequences of the same lengths (sample size: n = 343; median value: 18). A sample generated protein is shown on the right. Two-sided Kolmogorov–Smirnov test: p value = 6.7 × 10−42. i Top 10 predicted functions of proteins derived from the generated sequences, as identified by phold24. Source data are provided as a Source Data file.We then examined the 5′UTR of the annotated genes in the generated sequences to determine if they contain potential regulatory elements such as promoters and RBS to initiate transcription and translation. We chose the generated sequence #87 for further analysis due to its high predicted virus score (0.96) and its relatively small size (28 K bp). Using a machine learning tool (Promoter Calculator)22, we identified the −35 box and −10 box of the promoter within the 5’UTR of the predicted phage stabilization protein. Notably, their sequences are close to the established consensus motifs: TTGACA and TATAAT (Fig. 2e). Prior to the start codon of the same gene, we observed a region enriched in adenine (A) and guanine (G) nucleotides, indicative of a functional ribosome binding site (Fig. 2e). Analyzing all 5’UTR of the predicted genes from this sequence, we found a significantly higher promoter activity compared to random sequences of the same length (Fig. 2f). Intriguingly, the proportion of A and G nucleotides peaked around 10 bp upstream of the start codon, aligning closely to the optimal position for an RBS to drive translation initiation (Fig. 2g). In another sequence example (#212), we observed similar promoter activities and RBS characteristics, and we found a consensus −10 sequence TATAAT that located upstream of the N-terminus of the major capsid protein (Supplementary Fig. 11). This trend of A/G enrichment and high promoter activity within the 5′UTRs is also consistent across all the generated sequences (Supplementary Figs. 12 and 13). We found a low correlation between the promoter activities and virus scores, which may suggest that the model learns the two features independently (Supplementary Fig. 12). In short, our generated sequences harbor potential regulatory sequences that could enable the expression of the predicted genes.In the generated sequences, 343 annotated genes were predicted to match geNomad markers. These genes share very little homology with the training dataset (Supplementary Table 1). The query genes and their matches do not belong to the same gene family. We employed ESMFold23 to predict their structures and calculated the average predicted Local Distance Difference Test (pLDDT) scores. This score reflects the confidence of ESMFold in the predicted structures. The median pLDDT score for these proteins is higher than that of random peptide sequences (28 vs 18, Fig. 2h). We further randomly sampled 10 K annotated genes from all the generated sequences and found high pLDDT scores for them (median value of 36, Supplementary Fig. 14). These results may suggest inherent similarities between the generated sequences and sequences used in training ESMFold. We further annotated gene functions using phold24. In brief, phold leverages a protein language model25 to derive structural information from protein sequences. This information is compared against a structural database via Foldseek26 to obtain PHROG annotations27. Although the generated sequences do not include a coherent set of genes necessary for the full phage life cycle, our analysis reveals several large protein families associated with phage-related functions, such as head and packaging, and nucleotide metabolism (Fig. 2i). Among these, several proteins were predicted to have DNA-binding activity, and the predicted structure also resembles the canonical helix-turn-helix (HTH) domain in this protein family (Supplementary Fig. 15). Moreover, the predicted structure of a generated protein aligns with the experimental structure in the PDB database, further demonstrating the model’s ability to capture biologically relevant features (Supplementary Fig. 15).

Hot Topics

Related Articles