CircCNNs, a convolutional neural network framework to better understand the biogenesis of exonic circRNAs

Construction of BS and LS exon pairs dataset

A recent study highlights significant disparities among the existing circRNA databases13. To enhance our understanding of exonic circRNA biogenesis, we manually curated a dataset of BS and LS exon pairs through the most comprehensive data integration among similar studies (see Fig. 1). To construct the BS exon pairs dataset (see Fig. 1a), we integrated data from the five largest circRNA databases: circBase21 (92,375 circRNAs based on hg19), circRNADb20 (32,914 circRNAs based on hg19), circAtlas22 (580,718 circRNAs based on hg38), CIRCpedia v223 (914,458 circRNAs based on hg19), and CSCD2.024 (1,881,089 circRNAs based on hg38, unique to normal samples or common to both tumor and normal samples), along with 3,517 curated circRNAs from a recent study13. All circRNAs annotated on hg38 were converted to hg19 with pyliftover (https://pypi.org/project/pyliftover/), yielding 580,059 circRNAs from circAtlas and 1,879,701 circRNAs from CSCD2.0, respectively. After removing duplicates based on the genomic coordinates of the splicing sites, a total of 2,471,730 circRNAs remained. To identify exonic circRNAs, we filtered these circRNAs against the hg19 exon boundaries obtained from the UCSC table browser25. Specifically, a circRNA was considered exonic if its first splicing site was within a 5-bp distance of any exon starting position and its second splicing site was within a 5-bp distance of any exon ending position from the same transcript (see Fig. 1a). The longest transcript of a gene was used to annotate each BS exon pair. Since the majority of circRNAs involve multiple exons, we focused on circRNAs consisting of two or more exons, removing duplicated exon pairs as illustrated in Fig. 1a.
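The 5-bp boundary-matching filter can be sketched as follows (a minimal illustration with a hypothetical helper and toy coordinates; in the actual pipeline the per-transcript exon coordinates come from the UCSC table browser):

```python
def is_exonic(circ_start, circ_end, transcripts, tol=5):
    """Return the ID of a transcript whose exon boundaries match the
    circRNA's splice sites within `tol` bp, or None (hypothetical helper)."""
    for tx_id, exons in transcripts.items():  # exons: list of (start, end)
        hit_start = any(abs(circ_start - s) <= tol for s, _ in exons)
        hit_end = any(abs(circ_end - e) <= tol for _, e in exons)
        if hit_start and hit_end:  # both sites must match the SAME transcript
            return tx_id
    return None

# toy example: one transcript with three exons
transcripts = {"tx1": [(100, 200), (300, 400), (500, 600)]}
print(is_exonic(98, 402, transcripts))  # within 5 bp of exon1 start / exon2 end -> tx1
print(is_exonic(90, 402, transcripts))  # first site 10 bp off -> None
```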
To ensure high confidence in identifying exon pairs potentially involved only in LS (namely, LS exon pairs), we did not randomly select exon pairs absent from the previously obtained set of BS exon pairs, because such random selection cannot guarantee true negative data (namely, LS exon pairs) if more tissue- or condition-specific RNA-Seq data indicate otherwise. Instead, we leveraged available LS data for the human genome hg38 from RJunBase26. As shown in Fig. 1b, we retained the LS sites that were not specific to tumors (to minimize the potential effects of tumor-associated genomic mutations on the splicing sites) and converted them to hg19, retaining 644,719 of the 682,017 reported LS sites. To confirm that these splicing sites indeed connect the 3′ end of the upstream exon to the 5′ end of the downstream exon linearly, we utilized the exon boundaries of hg19 for filtering. We retained the splicing sites and annotated the connecting exon pairs if the first splicing site was within a 5-bp distance of any exon ending position and the second splicing site was within a 5-bp distance of the starting position of any downstream exon from the same transcript (see Fig. 1b; note the splicing sites defined for LS differ from those for BS). The longest transcript of a gene was employed to annotate the connecting exons. Based on these connecting exon pairs, we generated potential LS exon pairs such that the 5′ end of the first chosen exon is required to be linearly spliced with the 3′ end of any upstream exon and the 3′ end of the second chosen exon is linearly spliced with the 5′ end of any downstream exon according to RJunBase26 (see Fig. 1b).
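The two-sided LS-evidence rule can be sketched as follows (toy exon list and junction set; in the pipeline the junctions come from RJunBase and the exons from the longest-transcript annotation):

```python
def candidate_ls_pairs(exons, ls_junctions):
    """Hypothetical sketch. exons: ordered (start, end) tuples for one
    transcript; ls_junctions: observed linear-splicing junctions as
    (end_of_upstream_exon, start_of_downstream_exon) pairs.
    An exon pair (i, j) is kept when exon i's 5' end is reached by LS from
    some upstream exon AND exon j's 3' end splices linearly to some
    downstream exon."""
    pairs = []
    for i in range(len(exons)):
        for j in range(i + 1, len(exons)):
            up_ok = any((exons[k][1], exons[i][0]) in ls_junctions
                        for k in range(i))
            down_ok = any((exons[j][1], exons[m][0]) in ls_junctions
                          for m in range(j + 1, len(exons)))
            if up_ok and down_ok:
                pairs.append((i, j))
    return pairs

# toy transcript with four exons and two observed LS junctions
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
ls = {(200, 300), (600, 700)}   # exon0 -> exon1 and exon2 -> exon3
print(candidate_ls_pairs(exons, ls))  # [(1, 2)]
```

Only the middle pair qualifies: exons 0 and 3 lack LS evidence on their outer sides, so they cannot anchor a high-confidence LS pair.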
The LS exon pairs defined by this procedure are less likely to be involved in back-splicing (BS) because their splicing sites are already engaged in linear splicing (LS), and there is often a competitive relationship between the BS and LS processes, where the same splicing sites may be utilized for either BS or LS15. To mitigate potential bias arising from characteristics of different transcripts, we focused on comparing LS and BS exon pairs derived from the same transcript. Between the LS and BS exon pair datasets, we identified a total of 15,700 common transcripts. For each common transcript, we distinguished unique LS exon pairs, existing solely in the LS dataset, from unique BS exon pairs, found exclusively in the BS dataset (see Fig. 1c). Furthermore, we retained only the common transcripts that have more than one exon pair of both types. Some exon pairs shared identical genomic coordinates despite being unique when considering transcript ID and exon numbers together, likely due to overlapping transcript annotations in hg19. These redundant exon pairs were removed from both datasets. Additionally, to meet the minimum sequence length requirement of CircCNNs, we only kept exon pairs with at least a 200-bp separation (namely, from the 5′ end of the first exon to the 3′ end of the second exon). Finally, to ensure balanced representation of the LS and BS classes, an equal number of exon pairs of each splicing type was selected for a given transcript (see Fig. 1c).

Figure 1. The construction of the BS and LS exon pairs datasets by comprehensive data integration. (a): BS exon pairs creation based on existing circRNA databases. The diagram on the top depicts the BS exon pairs and the exonic circRNAs focused on in this study. (b): LS exon pairs creation based on RJunBase. The diagram on the bottom depicts the methods used for LS exon pairs selection based on LS evidence from RJunBase.
(c): two-way filtering to create the final BS and LS exon pairs datasets as positive and negative datasets, respectively.

Construction of base models

All models in this study were constructed using the PyTorch framework27 (version 1.13.1) with the CUDA Toolkit (version 11.4) for efficient GPU acceleration. To facilitate cross-validation (CV), we employed the KFold function from sklearn28 (version 1.0.2) to randomly split the dataset. Optuna (https://github.com/optuna/optuna) was used for hyperparameter searching. Two distinct sets of models were created to incorporate information from both the junction sequences of splicing sites and the RNA secondary structure data (e.g., RCM) for classifying BS and LS exon pairs. As shown in Fig. S1b, we focused on RCM data crossing and within the flanking regions of the exon pair. Models processing the junction sequences were termed base models, while those handling RCM data were labeled RCM models. The optimal base model and RCM model were integrated as combined models to further improve performance. As shown in Fig. 2a, to effectively train and evaluate these models, the BS and LS exon pairs were randomly divided into non-overlapping training and testing sets, comprising 22,000 and 2,226 exon pairs, respectively. To assess the impact of different training sizes on model performance, three distinct training sets were created from the 22,000 exon pairs: training set1 (16,000 exon pairs), training set2 (18,000 exon pairs), and training set3 (20,000 exon pairs). The remaining exon pairs were allocated to the corresponding combining sets: combining set1 (6,000 exon pairs), combining set2 (4,000 exon pairs), and combining set3 (2,000 exon pairs). All testing, training, and combining sets were balanced in the numbers of BS and LS exon pairs.
Optimal hyperparameters for base models and RCM models were determined separately on these three training sets using 3-fold CV, while optimal hyperparameters for the combined models were obtained on the combining sets.

Previous studies12,19 have underscored the importance of junction sequences around the splicing sites for distinguishing between circRNAs and lncRNAs. To select effective CNN structures capable of discerning BS from LS exon pairs based on these junction sequences, we devised two base models with distinct configurations. For base model1, mirroring the model structure of DeepCircCode19, a single CNN comprising two convolutional layers was employed. Its input consists of the concatenated junction sequences around the two splicing sites of a BS or LS exon pair. Specifically, for each splicing site (two per exon pair), we extracted 100 bps of sequence upstream and downstream of the splicing site (referred to as junction seq1 and junction seq2 in Fig. 2b). These two junction sequences were then concatenated and one-hot encoded to serve as the input for base model1 (see Fig. 2c for the model architecture). In essence, base model1 can be expressed as follows:

$$\hat{y}_i = f^{Sigmoid}(f^{FC2\_BN\_ReLU}(f^{FC1\_BN\_ReLU}(f^{Flatten}(f^{Conv2\_BN\_ReLU\_Maxpool2}(f^{Conv1\_BN\_ReLU\_Maxpool1}(\mathbf{X}_i))))))$$

where \(\mathbf{X}_i\) is the 4 × 400 one-hot encoded input matrix based on the concatenated junction seq1 and seq2 shown in Fig. 2b for the ith training example, and \(\hat{y}_i\) is the corresponding predicted label from base model1.
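A PyTorch sketch of this architecture follows (channel counts, kernel sizes, and FC widths are illustrative placeholders, not the Optuna-selected hyperparameters from Table S1):

```python
import torch
import torch.nn as nn

class BaseModel1(nn.Module):
    """Sketch of base model1: two Conv-BN-ReLU-MaxPool blocks over the
    one-hot 4 x 400 concatenated junction sequence, then two FC-BN-ReLU
    layers and a sigmoid output. Layer sizes are illustrative only."""
    def __init__(self, ch1=32, ch2=64, k=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, ch1, k, padding='same'), nn.BatchNorm1d(ch1),
            nn.ReLU(), nn.MaxPool1d(2),                       # 400 -> 200
            nn.Conv1d(ch1, ch2, k, padding='same'), nn.BatchNorm1d(ch2),
            nn.ReLU(), nn.MaxPool1d(2),                       # 200 -> 100
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(ch2 * 100, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 32), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):          # x: (batch, 4, 400) one-hot matrix
        return self.classifier(self.features(x))

model = BaseModel1()
out = model(torch.randn(8, 4, 400))
print(out.shape)  # torch.Size([8, 1]), each entry a probability in [0, 1]
```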
For base model2, separate CNN structures are applied to each splicing site in BS or LS exon pairs, aiming to capture the potentially different sequence motifs around each site and to mitigate spurious motifs learned from artificially concatenated sequences, a potential issue mentioned in the DeepCircCode study19. In particular, for base model2, the input for each CNN is the one-hot encoded junction sequence from one of the splicing sites (i.e., junction seq1 for the upstream CNN and junction seq2 for the downstream CNN, respectively, as illustrated in Fig. 2b and d). The outputs of these two CNNs are concatenated before the final classification, as shown in the model architecture depicted in Fig. 2d. Essentially, base model2 can be expressed as follows:

$$\hat{y}_{i_{up}} = f^{Flatten}(f^{Conv2\_BN\_ReLU\_Maxpool2}(f^{Conv1\_BN\_ReLU\_Maxpool1}(\mathbf{X}_{i_{up}})))$$

$$\hat{y}_{i_{down}} = f^{Flatten}(f^{Conv2\_BN\_ReLU\_Maxpool2}(f^{Conv1\_BN\_ReLU\_Maxpool1}(\mathbf{X}_{i_{down}})))$$

$$\hat{y}_i = f^{Sigmoid}(f^{FC2\_BN\_ReLU}(f^{FC1\_BN\_ReLU}(f^{Concat}(\hat{y}_{i_{up}}, \hat{y}_{i_{down}}))))$$

where up stands for upstream, down stands for downstream, \(\mathbf{X}_{i_{up}}\) is the 4 × 200 one-hot encoded input matrix based on junction seq1 shown in Fig. 2b for the ith training example, and \(\hat{y}_{i_{up}}\) is the corresponding output of flatten layer 1 (see Fig. 2d).
Similarly, \(\mathbf{X}_{i_{down}}\) is the 4 × 200 one-hot encoded input matrix based on junction seq2 shown in Fig. 2b for the same ith training example, and \(\hat{y}_{i_{down}}\) is the corresponding output of flatten layer 2. \(\hat{y}_i\) is the corresponding predicted label from base model2. The rectified linear unit (\(f^{ReLU}(z) = \max(0, z)\)) was used as the default activation function, and the sigmoid function (\(f^{Sigmoid}(z) = \frac{e^z}{1 + e^z}\)) was used to make the final prediction for both base models. Base model1 and model2 were optimized separately on each of the 3 training sets, utilizing the available choices of hyperparameters outlined in Table S1. The base models trained on these training sets are denoted as base1_16000, base1_18000, and base1_20000 for base model1, and base2_16000, base2_18000, and base2_20000 for base model2. The highest average 3-fold CV accuracy on these training sets was used to select the corresponding optimal model hyperparameters. Given the computational constraints of exploring the entire hyperparameter space, Optuna (https://github.com/optuna/optuna), a hyperparameter optimization framework, was employed; 500 random trials were conducted for each model, where each trial corresponds to a different combination of hyperparameters.

Figure 2. The construction of the training, combining, and testing sets and of base model1 and model2. (a): Separation of training, combining, and testing datasets for base, RCM, and combined model training and final model evaluation. (b): Schematic representation of the input sequences for base models. (c): The architecture of base model1.
(d): The architecture of base model2.

Numerical methods to detect reverse complementary matches (RCMs)

RCMs crossing the flanking intron regions of BS sites (or BS exon pairs) are believed to facilitate circRNA formation by bringing the two splicing sites of an exon pair into proximity for back-splicing, typically from the 3′ end of one exon to the 5′ end of an upstream exon29. Additionally, it has been suggested that RCMs within each flanking region of the exon pair (i.e., the upstream or downstream intron of one exon pair) are involved in linear splicing15. Consequently, an RCM within one flanking region may compete with RCMs crossing the flanking regions, potentially influencing the likelihood of BS or LS for a given exon pair (see Fig. S1b). A previous study on classifying circRNAs versus lncRNAs demonstrated the importance of RCM abundance crossing the flanking regions of splicing sites11. To integrate the RCM model with the base models, we modified our previously established fast numerical methods for inverted repeat detection30,31,32 to determine RCM kmer pairs crossing the flanking regions as well as within each flanking region of the BS/LS exon pairs. Briefly, by converting nucleotide bases to imaginary numbers coupled with fast vector calculation, we are able to obtain the cumulative scores of kmers for any given K, thus significantly reducing the search space for potential RCM kmer pairs. We implemented the method in a Python class and utilized the ray library (https://github.com/ray-project/ray) for parallelization. To calculate the cumulative score for all kmers of length K in a given sequence of length L (denoted as S in Fig. 3a), we first convert all the nucleotide bases to a vector of real and imaginary numbers using the following mapping rule: A: 1, T: −1, C: 1j, G: −1j, N: 0. This transformation is also applied to the corresponding augmented sequence (i.e., "N" + sequence, denoted as AS in Fig. 3a) of length L+1.
Then, we compute the cumulative sum vectors for both S and AS; each element of a cumulative sum vector is the sum of all elements up to that position. Next, we calculate the cumulative score for all kmers of length K. We take a slice of the original cumulative sum vector from position K to position L, denoted as vector V1, and a slice of the same length (L−K+1) of the augmented cumulative sum vector from position 1 to position L−K+1, denoted as vector V2. The element-wise difference between V1 and V2 (i.e., V1−V2) yields the cumulative score vector (denoted as C in Fig. 3a) of length L−K+1 for all kmers of length K in the given sequence S.
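Under the mapping above, this computation reduces to a few NumPy operations; a minimal sketch (the paper's class-based, ray-parallelized implementation is more elaborate):

```python
import numpy as np

MAP = {'A': 1, 'T': -1, 'C': 1j, 'G': -1j, 'N': 0}

def kmer_scores(seq, K):
    """Cumulative score vector C for all kmers of length K in seq.
    Since cumsum(AS) equals [0] followed by cumsum(S), the slices V1
    (positions K..L of cumsum(S)) and V2 (positions 1..L-K+1 of cumsum(AS))
    are two offsets of one array, and V1 - V2 gives the complex sum of
    every kmer in O(L)."""
    acs = np.concatenate(([0], np.cumsum([MAP[b] for b in seq])))  # cumsum of AS
    return acs[K:] - acs[:-K]                                      # length L-K+1

C = kmer_scores('ATCGG', 3)
# C[0] = score('ATC') =  1 - 1 + 1j =  1j
# C[1] = score('TCG') = -1 + 1j - 1j = -1
# C[2] = score('CGG') =  1j - 1j - 1j = -1j
```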
To find the RCM kmer pairs of length K crossing sequence 1 of length L1 (namely, the upstream intron of the exon pair) and sequence 2 of length L2 (namely, the downstream intron of the exon pair), we first obtain the cumulative score vectors for both sequences separately (of length L1−K+1, denoted as C1, and of length L2−K+1, denoted as C2, in Fig. 3b). These cumulative score vectors capture the potential presence of RCMs of length K at each position along the respective sequences. Next, we reshape these cumulative score vectors into a column vector of dimension (L1−K+1) × 1 for C1 and a row vector of dimension 1 × (L2−K+1) for C2, respectively. Then, we add these two vectors together by broadcasting to obtain the sums of all possible pairwise kmer scores in a matrix of dimension (L1−K+1) × (L2−K+1) (see Fig. 3b). To find candidate RCM kmer pairs crossing sequence 1 and sequence 2, we add the absolute values of the real part and the imaginary part of each element in this summation matrix (see Fig. 3b). For each entry in the summation matrix, if this sum is less than a predefined mismatch score, we define the corresponding kmer pair as a candidate RCM kmer pair crossing sequence 1 and sequence 2. We then extract the position indices of the candidate RCM kmer pairs and verify them base by base to obtain the final RCM kmer pairs crossing the two sequences (see the diagram in Fig. 3b). Similarly, to find the RCM kmer pairs within a sequence of length L, we follow the same steps with sequence 1 and sequence 2 replaced by the same sequence. After identifying RCM kmer pairs crossing two sequences or within a sequence, we can extend these kmer pairs inwardly to find all extended subsequences that are reverse complementary to one another, constrained by a predefined overall mismatch score. Nevertheless, a previous study indicated that the absolute number of RCM kmer pairs is more informative than the longest RCM subsequences for the classification between circRNAs and lncRNAs11.
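The broadcasting step and the base-by-base verification can be sketched as follows (toy sequences; the real inputs are flanking introns, and the paper's implementation also retains positions for the distribution matrices described below):

```python
import numpy as np

MAP = {'A': 1, 'T': -1, 'C': 1j, 'G': -1j, 'N': 0}
COMP = str.maketrans('ATCGN', 'TAGCN')

def kmer_scores(seq, K):
    acs = np.concatenate(([0], np.cumsum([MAP[b] for b in seq])))
    return acs[K:] - acs[:-K]

def rcm_kmer_pairs(seq1, seq2, K, mismatch=0):
    """A kmer and its reverse complement have complex scores that cancel
    (value(complement(b)) = -value(b)), so |Re| + |Im| of the pairwise sum
    at or below the mismatch score flags a candidate, which is then
    verified base by base."""
    C1, C2 = kmer_scores(seq1, K), kmer_scores(seq2, K)
    M = C1[:, None] + C2[None, :]                    # (L1-K+1) x (L2-K+1)
    cand = np.argwhere(np.abs(M.real) + np.abs(M.imag) <= mismatch)
    return [(int(i), int(j)) for i, j in cand
            if seq2[j:j + K] == seq1[i:i + K].translate(COMP)[::-1]]

# 'GAT' (position 2 of seq1) is the reverse complement of 'ATC' (position 1
# of seq2); the candidate ('TGA', 'CAT') also sums to zero but is rejected
# by the base-by-base verification.
print(rcm_kmer_pairs('TTGATTT', 'CATCG', 3))  # [(2, 1)]
```

The zero-sum test is necessary but not sufficient (e.g., 'TGA' and 'CAT' cancel without being reverse complements), which is why the final base-by-base check is needed.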
Therefore, for each BS or LS exon pair, we obtained the RCM kmer pairs crossing the flanking intron regions as well as those within each flanking intron region of the splicing sites. For comprehensiveness, the flanking regions were defined with various window sizes ranging from 100 to 3,000 bps (see the diagram in the left part of Fig. 3c). For a given window size and kmer length, we determined the distribution of RCM kmer pairs crossing the flanking regions and within each flanking region by defining 5 equal-distance segments along the sequence and tallying the frequency of detected RCM kmer pairs in each segment, represented as a 5 × 5 matrix (see Fig. 3c). In the previous study, the number of comparisons required to find RCM kmer pairs grows exponentially with kmer length (i.e., \(4^K\)), making the determination of longer RCM kmer pairs computationally prohibitive11. In contrast, our fast numerical methods enable us to handle larger kmer lengths easily. We used kmer lengths of 5, 7, 9, 11, and 13 bps without allowing any mismatches. To incorporate the distribution information of RCM kmer pairs into the model effectively, we concatenated the RCM kmer pair distribution matrices crossing the flanking regions of the splicing sites for each window size and kmer length, resulting in a 5 × (25 × number of window sizes) matrix (5 different kmer lengths per window, denoted as RCM kmer crossing scores in Fig. 3c). Similar concatenations were performed for the RCM kmer pair distribution matrices in the upstream and downstream flanking regions of the splicing sites separately (see RCM kmer upstream scores and downstream scores, respectively, in Fig. 3c). To facilitate model training, all RCM kmer scores were log-transformed after adding 1 to avoid taking the logarithm of 0.

Figure 3. The numerical methods to determine the distribution information of RCM kmer pairs crossing the flanking regions as well as within each flanking region of BS or LS exon pairs.
(a): Cumulative score vector creation for one input sequence. (b): Determination of RCM kmer pairs based on the cumulative score vectors. (c): The creation of RCM kmer crossing, upstream, and downstream scores based on the RCM kmer pair distribution matrices.

Construction of RCM models

To evaluate the relative importance of the RCM distribution within different window sizes around the splicing sites, 3 separate sets of window sizes were used. The first set includes flanking sequences with lengths of 100, 200, 300, 400, and 500 bps (denoted as small), the second set lengths of 1,000, 1,500, 2,000, 2,500, and 3,000 bps (denoted as large), and the last set the lengths defined in both the small and large sets (denoted as all). To investigate the relative importance of RCMs crossing versus within each flanking region of the exon pairs, two different RCM models were created for each set of window sizes: (1) the first model used only the RCM kmer crossing scores as input (denoted as RCM_crossCNN, see Fig. 4a), and (2) the second model used the RCM kmer crossing, upstream, and downstream scores together as input (denoted as RCM_triCNN, see Fig. 4b). The hyperparameters for these 6 different settings (3 sets of window sizes × 2 RCM models) were optimized separately on each of the 3 training sets. This optimization was done via 500 random trials using Optuna, following the same training strategy as shown in Table S1. The optimal hyperparameter set was chosen for the model with the highest average 3-fold CV accuracy.

Figure 4. RCM models and combined models' construction. (a): Architecture of the RCM_crossCNN model, which takes only the RCM crossing scores as input. (b): Architecture of the RCM_triCNN model, which takes the RCM crossing, upstream, and downstream scores as input.
(c): Architecture to integrate the base model with the RCM_triCNN model.

Motif extraction from retrained optimal base models

After optimizing the hyperparameters of each base model on each training set, the optimized hyperparameter set was used to retrain the model on the corresponding entire training set. Subsequently, sequence motifs potentially important for the classification task were extracted from each retrained model by applying the model weights to the BS or LS exon pairs of the corresponding entire training set separately. To extract the motif features learned by each base model in the first CNN layer, we applied the motif visualization techniques used in a previous study33. Briefly, using the PyTorch forward_hook27 functionality, we obtained the activation values of all channels of the first convolutional layer with the BS exon pairs in each training set as the model input. For each channel, we obtained the subsequences corresponding to the maximal positive activation value, with a length equal to the optimal kernel size in each model. Since we used the same-padding approach, we restricted these subsequences to the junction sequences to avoid interference from the padded regions. The subsequences were used to generate position probability matrices (PPMs), representing the proportions of the 4 nucleotides at each position along the subsequences for each channel. These PPMs were then searched against the Ray2013 Homo sapiens RNA motif database using Tomtom (https://meme-suite.org/meme/tools/tomtom) for similarity, based on the Pearson correlation coefficient. Significant matches were selected based on an E-value < 0.05, as done in the previous study19. The learned motifs from each retrained base model were then compared.
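A minimal sketch of the activation capture with a forward hook (a toy convolutional layer and random tensors stand in for the retrained model and the one-hot junction sequences):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv1d(4, 8, kernel_size=7, padding='same')  # toy "first CNN layer"
acts = {}
# forward hook signature: hook(module, input, output)
conv1.register_forward_hook(lambda m, inp, out: acts.update(out=out.detach()))

x = torch.randn(16, 4, 200)          # stand-in for one-hot junction sequences
conv1(x)                             # hook fires and stores the activations
out = acts['out']                    # (16 examples, 8 channels, 200 positions)

# For each channel, locate the maximal positive activation and note which
# example/position it came from; the kernel-sized subsequence window at that
# position would then be collected to build the channel's PPM.
for ch in range(out.shape[1]):
    best = out[:, ch, :].reshape(-1).argmax()
    example, pos = divmod(best.item(), out.shape[2])
    # subsequence = seqs[example][pos : pos + 7]   # then stack into a PPM
```

With same-padding, positions near the sequence edges overlap the padded region, which is why the paper restricts the extracted windows to the junction sequences themselves.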
The same motif extraction strategy was also applied to LS exon pairs to identify motif-associated genes potentially important for LS.

Integration of retrained optimal base models with RCM models

After obtaining the optimal base model structures to process junction sequences and the optimal RCM model to process RCM information, the base and RCM models were integrated to further improve model performance. Specifically, after retraining on each training set (see Fig. 2a), the model weights of the optimal base and RCM models were fixed from the convolutional layers to the second fully connected (FC) layer. These optimized model weights capture the higher-level information on junction sequences and RCMs, respectively (see Figs. 2 and 4 for the structures of the base and RCM models). Subsequently, the outputs of FC layer 2 from both the base and RCM models are concatenated, and this concatenated output is passed through two additional FC layers and a final classification layer, as illustrated in Fig. 4c. These additional layers are crucial for integrating the higher-level information extracted by the previously optimized model weights. As shown in Table S1, the structures of these 2 additional FC layers and the corresponding dropout rates were optimized using the same training strategy: 3-fold CV with 500 random Optuna trials was conducted on the corresponding combining sets (see Fig. 2a) to select the optimal combining strategy, i.e., the combined model achieving the highest average CV accuracy.

Evaluation of model performance on the testing set

To ensure an unbiased evaluation of model performance, a separate testing set (as depicted in Fig. 2a) that was not involved in model training, retraining, or model combining was utilized. To quantify the performance of the retrained base models, RCM models, and combined models, a range of widely used performance metrics was employed.
These metrics include ACC (prediction accuracy), specificity, recall, precision, FPR (false positive rate), FNR (false negative rate), MCC (Matthews correlation coefficient), and F1 score. These metrics are calculated from the TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives) derived from the confusion matrix on the testing set as follows:

$$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}; \quad \text{Specificity} = \frac{TN}{TN + FP}; \quad \text{Recall} = \frac{TP}{TP + FN}; \quad \text{Precision} = \frac{TP}{TP + FP};$$

$$\text{FPR} = \frac{FP}{FP + TN}; \quad \text{FNR} = \frac{FN}{FN + TP}; \quad \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}; \quad \text{F1} = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$

In addition, AUROC (area under the ROC curve) and AUPRC (area under the precision-recall curve), which are based on different classification thresholds, were also used for performance comparison.
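These definitions translate directly into code; a small helper for illustration (the confusion-matrix counts below are made-up numbers, not results from the paper):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the threshold-based metrics defined in the text from
    confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * precision * recall / (precision + recall)
    return dict(ACC=acc, Specificity=specificity, Recall=recall,
                Precision=precision, FPR=fpr, FNR=fnr, MCC=mcc, F1=f1)

m = metrics(tp=90, tn=80, fp=20, fn=10)   # illustrative counts only
print(round(m['ACC'], 3), round(m['F1'], 3))  # 0.85 0.857
```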
