StructmRNA uses advanced computational techniques to analyze and embed mRNA sequences and structures. It employs a comparative framework for RNA degradation prediction based on BERT with a dual-level masking strategy. The methodology covers the masking thresholds, model architecture, training protocols, dataset configuration, and data loader setup used to enhance mRNA sequence analysis.

Dual-level masking process

The dual-level masking process in StructmRNA integrates sequence and structural data to produce accurate mRNA sequence embeddings. This section details sequence-level and structure-level masking. Sequence-level masking is inspired by BERT: nucleotides are randomly replaced by a masking token, prompting the model to predict them from the surrounding context and thereby learn sequence dependencies. It is grounded in the methodology described in4. Complementing sequence-level masking, structure-level masking targets elements of the mRNA structure. This approach helps the model learn how sequences fold into structural motifs, highlighting the role of structural context in understanding mRNA function. We set a 25% masking probability for each nucleotide or structural element to balance uncertainty with informative data.

Our random sequence masking strategy evaluates each nucleotide against a random number, as follows. For a sequence of nucleotides \(S = \{s_1, s_2, \ldots , s_n\}\), each nucleotide \(s_i\) is compared against a randomly generated number \(r_i\) uniformly distributed between 0 and 1. If \(r_i < p\), where p is the masking probability, then \(s_i\) is replaced with a [MASK] token. This is formalized as Eq. (1):

$$\begin{aligned} s_i^{\prime} = {\left\{ \begin{array}{ll} \texttt {[MASK]} & \text {if } r_i < p, \\ s_i & \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)
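The following Python sketch illustrates how the rule in Eq. (1) can be applied to an mRNA sequence; the function name and the use of a plain string with a [MASK] placeholder are illustrative assumptions, not the authors' exact implementation.

```python
import random

MASK = "[MASK]"

def mask_sequence(seq, p=0.25, rng=None):
    """Apply the per-nucleotide random masking of Eq. (1).

    Each nucleotide s_i is replaced by the [MASK] token when a
    uniform random draw r_i falls below the masking probability p.
    """
    rng = rng or random.Random()
    return [MASK if rng.random() < p else s for s in seq]

# Example: mask roughly 25% of the positions in a short mRNA fragment.
masked = mask_sequence("GGAAAGCUCUGGCAUU", p=0.25, rng=random.Random(0))
print(masked)
```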
Advanced masking techniques such as conditional and dynamic pattern masking address nucleotide-specific significance and replicate RNA variability. There is a moderate positive correlation between sequence and structure masking, indicating that increased sequence masking often coincides with increased structure masking and emphasizing the need to integrate both in modeling. The Pearson correlation coefficient \(\rho _{\text {seq, struct}}\) is computed as in Eq. (2):

$$\begin{aligned} \rho _{\text {seq, struct}} = \frac{\text {cov}(\text {seq}_D, \text {struct}_D)}{\sigma _{\text {seq}_D} \cdot \sigma _{\text {struct}_D}}, \end{aligned}$$
(2)
where \(\text {cov}\) represents covariance and \(\sigma\) denotes standard deviation.
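As a concrete illustration, the sketch below computes Eq. (2) from per-sequence masked fractions of the sequence and structure tracks; the variable names, the example values, and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

def masking_correlation(seq_masked_frac, struct_masked_frac):
    """Pearson correlation (Eq. 2) between the fraction of masked
    nucleotides and the fraction of masked structure symbols,
    computed over a collection of sequences."""
    seq_d = np.asarray(seq_masked_frac, dtype=float)
    struct_d = np.asarray(struct_masked_frac, dtype=float)
    cov = np.cov(seq_d, struct_d, bias=True)[0, 1]   # population covariance
    return cov / (seq_d.std() * struct_d.std())

# Example with hypothetical masked fractions for five sequences.
rho = masking_correlation([0.24, 0.27, 0.22, 0.30, 0.25],
                          [0.23, 0.28, 0.21, 0.31, 0.26])
print(f"rho_seq,struct = {rho:.3f}")
```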
This interdependence between sequence and structure reflects the model’s ability to predict masked parts from context, improving generalization. With dual-level masking and mRNA-specific complexities, the model identifies key patterns such as secondary structure motifs, regulatory elements, splice sites, codon biases, and degradation signals. This capability facilitates RNA structure prediction from sequences alone, which is crucial when structural data are missing, and enhances mRNA sequence and structure analysis to provide a better understanding of their functional roles. Figure 1 illustrates the dual-level masking process applied to a sample mRNA sequence, showcasing the approach we employ to mimic the natural variability in RNA sequences.

Fig. 1 A sample mRNA sequence and structure after dual-level masking. (A) Before masking. (B) After masking.

Conditional masking

This technique employs a masking likelihood that varies with nucleotide type, enabling conditional masking tailored to molecular structures and functions. It selectively targets nucleotides such as guanine, which are crucial for stability and function, reflecting their biological significance and variability in RNA sequences. This approach enhances realism by simulating the natural variability observed in RNA sequences. To formalize the process, we introduce a function \(P(s_i)\) that defines the masking probability for each nucleotide type; for guanine (G), for example, the probabilities satisfy \(P(G) > P(A) = P(C) = P(U)\). The masking decision for a nucleotide \(s_i\) is then based on whether a random number \(r_i\) is less than \(P(s_i)\).

Data preparation and processing pipeline for StructmRNA

In applying dual-level masking to our RNA dataset, we generate two key columns, masked_sequence and masked_structure, containing the modified RNA sequences and structures; both use the same masking token. We use the BERT tokenizer to map RNA sequences into token formats for training and prediction. Additionally, we developed a custom PyTorch Dataset class, RNADataset, to manage the “PyTorch mRNA Dataset” specifically designed for our mRNA data; it handles masked sequences and structures so that they align seamlessly with the BERT model’s input requirements. To optimize training, we integrate a DataLoader with a custom collate function for batch-wise processing of tokenized RNA sequences and structures, ensuring efficient grouping while preserving BERT input integrity. We use a batch size of 16 to balance computational efficiency with learning capability: larger batches might speed up training but reduce learning detail, while smaller ones slow training. This data configuration supports streamlined, effective training and ensures accurate and efficient model predictions.

Tokenizer configuration

In our study, we developed a tokenization method for RNA sequence and structural data. Each nucleotide and structural symbol is converted into a unique numerical identifier using a custom dictionary, token2int, which includes a special [MASK] token. This [MASK] token is crucial for training, akin to BERT’s masked language modeling, enabling context-based prediction. This method bridges RNA sequence complexity with transformer models, ensuring effective model training.
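The sketch below ties together the conditional masking, token2int tokenization, and RNADataset/DataLoader steps described above. The vocabulary contents, the specific masking probabilities, the padding scheme, and the class and function bodies are illustrative assumptions rather than the authors’ exact implementation.

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader

# Vocabulary for nucleotides, dot-bracket structure symbols, and special tokens.
token2int = {"[PAD]": 0, "[MASK]": 1,
             "A": 2, "C": 3, "G": 4, "U": 5,
             ".": 6, "(": 7, ")": 8}

# Assumed nucleotide-specific masking probabilities, with P(G) > P(A) = P(C) = P(U).
mask_prob = {"G": 0.35, "A": 0.25, "C": 0.25, "U": 0.25}

def conditional_mask(seq):
    """Mask each nucleotide with a type-dependent probability P(s_i)."""
    return ["[MASK]" if random.random() < mask_prob[s] else s for s in seq]

def encode(tokens):
    """Map tokens (nucleotides, structure symbols, [MASK]) to integer IDs."""
    return [token2int[t] for t in tokens]

class RNADataset(Dataset):
    """Pairs of masked-sequence and masked-structure token IDs."""
    def __init__(self, masked_sequences, masked_structures):
        self.seqs = [encode(s) for s in masked_sequences]
        self.structs = [encode(s) for s in masked_structures]

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        return self.seqs[idx], self.structs[idx]

def collate(batch):
    """Pad a batch of (sequence, structure) ID lists to a common length."""
    max_len = max(len(seq) for seq, _ in batch)
    pad = token2int["[PAD]"]
    seqs = [seq + [pad] * (max_len - len(seq)) for seq, _ in batch]
    structs = [st + [pad] * (max_len - len(st)) for _, st in batch]
    return torch.tensor(seqs), torch.tensor(structs)

# Example: build the dataset from masked columns and iterate in batches of 16.
masked_seqs = [conditional_mask("GGAAAGCUCUGG"), conditional_mask("AUGGCUAACGUA")]
masked_structs = [list("((....))...."), list("..((....))..")]
loader = DataLoader(RNADataset(masked_seqs, masked_structs),
                    batch_size=16, collate_fn=collate)
for seq_batch, struct_batch in loader:
    print(seq_batch.shape, struct_batch.shape)
```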
We optimized the hyperparameters of the StructmRNA model to improve prediction accuracy, as measured by MCRMSE, while ensuring efficient training. We performed automatic hyperparameter tuning using a grid search and conducted an ablation study to evaluate the importance of the various model components. The optimal settings were as follows: hidden layer size 256, 8 layers, 8 attention heads, and intermediate layer size 500. Adding more layers or attention heads offered minimal MCRMSE improvement but increased training time, and a vocabulary size over 800 led to overfitting, longer training times, and higher memory use. An AdamW initial learning rate of 1e-5, a OneCycleLR maximum learning rate of 1e-4, and 50 epochs with early stopping yielded the best results. This tuning ensures optimal performance and efficiency. The specific hyperparameters are listed in Table 1, and a configuration sketch reflecting these settings is given after Fig. 3 below.

Table 1 Hyperparameters and training parameters for the StructmRNA model and various baseline models utilizing embedding methods for the RNA degradation prediction task over 400 training epochs.

Figure 2 presents the flowchart of the data configuration and model training process used in our StructmRNA research. It begins with the “Original RNA dataset,” which undergoes a “Masking Process” to generate the data columns mentioned above, masked_sequence and masked_structure. These modified columns simulate scenarios in which certain nucleotides or structural elements are unknown, providing a realistic training environment for our model. “BERT Tokenization” follows the masking process and breaks sequences and structures down into forms that are usable for model training and prediction. The tokenized data are then managed within a custom PyTorch Dataset class specifically designed to handle the complexities of RNA data and to facilitate efficient management during the training phase. The DataLoader, set with a batch size of 16, processes the “PyTorch mRNA Dataset” in batches using a custom collate function, optimizing batch-wise processing and maintaining the integrity of the sequences. The final step, “Model Training,” trains the BERT-based deep learning model on the prepared data, translating these computational preparations into practical outcomes and advancing our understanding of RNA degradation mechanisms. Figure 3 illustrates the architecture of the BERT model used in StructmRNA, designed for sequence and structural prediction in a masked language modeling context; it shows the progression from the input of original sequences, through embedding layers and multiple transformer blocks, to the final prediction of masked tokens.

Fig. 2 StructmRNA pipeline from mRNA generation to model evaluation. NCBI and GAN-generated sequences undergo structure prediction via ViennaRNA, followed by sequential, structural, and conditional masking. Tokenized data are organized into a PyTorch dataset, processed through a DataLoader, and used for model training. Evaluation uses the OpenVaccine dataset with MCRMSE for mRNA degradation prediction.

Fig. 3 StructmRNA’s sequence and structure masking process: (1) Original sequence and structure, (2) Masked, (3) Token embedding, (4) Positional embedding, (5) Concatenation, (6) MLM prediction, (7) Predicted vs. original tokens.
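To make the tuned settings above concrete, the following sketch assembles a BERT-style configuration together with the AdamW optimizer and OneCycleLR scheduler using the reported values; the use of the Hugging Face transformers API, the vocabulary size, and the steps-per-epoch value are illustrative assumptions rather than the authors’ exact implementation.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# BERT-style configuration mirroring the reported optimal settings.
config = BertConfig(
    vocab_size=500,            # placeholder; the study keeps the vocabulary below 800
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=500,
)
model = BertForMaskedLM(config)

epochs = 50                    # with early stopping, as reported
steps_per_epoch = 1000         # placeholder; depends on dataset size and batch size 16

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)
```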
Data augmentation with generative adversarial networks

In bioinformatics, limited datasets constrain predictive models. StructmRNA leverages GANs to augment data by replicating the statistical properties of real mRNA sequences, enriching datasets with diverse samples. In this way, dataset scarcity is addressed and the use of synthetic sequences in research is explored. GAN-generated sequences increase the volume of training data and improve model generalization, robustness, and biological relevance. Combining BERT’s context-sensitive learning with GAN-based data augmentation makes StructmRNA a pioneering advance in bioinformatics, highlighting the potential of interdisciplinary strategies for analyzing mRNA sequences and structures. The use of GANs in StructmRNA raises concerns about the biological viability of the generated sequences, so rigorous validation is crucial for ensuring that these sequences are statistically accurate and biologically plausible46. Examining the biological significance of GAN-generated sequences highlights our commitment to responsibly and effectively harnessing the full potential of synthetic biology45,47.

We chose a transformer-based GAN because it can handle sequential data with self-attention, which is crucial for mRNA sequences. It maintains nucleotide order and sequence structure through positional encoding, enhancing biological plausibility over simpler GANs such as CycleGAN. Figure 4a shows the process of integrating GAN data augmentation into our StructmRNA model, and Figure 4b details the generator and discriminator architecture in the transformer GAN framework for synthetic mRNA sequence generation; a minimal sketch of such a generator and discriminator pair follows the figure caption below.

Fig. 4 (a) Workflow for augmenting mRNA sequence and structure data: train an mRNA classifier, apply it to the training set, generate synthetic sequences with the transformer GAN, and evaluate them with the classifier. (b) Generator and discriminator architecture of the transformer GAN for synthetic mRNA generation.
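As an illustration of the transformer GAN idea described above, the sketch below pairs a transformer-encoder generator that emits nucleotide-token logits with a transformer-encoder discriminator that scores sequences as real or synthetic. The layer sizes, sequence length, module names, and single adversarial step are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

VOCAB = 4          # A, C, G, U
SEQ_LEN = 107      # placeholder, e.g. an OpenVaccine-style sequence length
EMBED = 128

class Generator(nn.Module):
    """Maps a noise sequence to nucleotide-token logits via self-attention."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED, EMBED)
        self.pos = nn.Parameter(torch.randn(1, SEQ_LEN, EMBED))  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=EMBED, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED, VOCAB)

    def forward(self, noise):                  # noise: (batch, SEQ_LEN, EMBED)
        h = self.encoder(self.proj(noise) + self.pos)
        return self.head(h)                    # (batch, SEQ_LEN, VOCAB) logits

class Discriminator(nn.Module):
    """Scores a (soft) one-hot nucleotide sequence as real or synthetic."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, EMBED)
        self.pos = nn.Parameter(torch.randn(1, SEQ_LEN, EMBED))
        layer = nn.TransformerEncoderLayer(d_model=EMBED, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED, 1)

    def forward(self, one_hot_seq):            # (batch, SEQ_LEN, VOCAB)
        h = self.encoder(self.embed(one_hot_seq) + self.pos)
        return self.head(h.mean(dim=1))        # (batch, 1) real/fake score

# One discriminator step on a hypothetical batch of real and generated sequences.
gen, disc = Generator(), Discriminator()
real = torch.nn.functional.one_hot(
    torch.randint(0, VOCAB, (8, SEQ_LEN)), VOCAB).float()
fake = torch.softmax(gen(torch.randn(8, SEQ_LEN, EMBED)), dim=-1)
loss = (nn.functional.binary_cross_entropy_with_logits(disc(real), torch.ones(8, 1))
        + nn.functional.binary_cross_entropy_with_logits(disc(fake.detach()),
                                                          torch.zeros(8, 1)))
print(loss.item())
```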