SpliceTransformer predicts tissue-specific splicing linked to human diseases

Data representationThe input for the model is pre-mRNA sequences. The nucleotides A, C, G, and T are one-hot encoded, represented as [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively. These encoded nucleotides are combined to form a 4 × N matrix, where N stands for the length of the sequence. [0,0,0,0] are used for padding sequences with insufficient length or to present “unknown” nucleotides in unclear regions. The model utilizes this input matrix to capture sequence features. Specifically, we denote the length of the input sequence as N = Ncontext + Ntarget + Ncontext, where Ntarget represents the length of the target region that we aim to predict, and Ncontext represents the length of the flanking sequence to each side of the target region, which we consider as “sequence context”. For the training and testing data, each nucleotide in the target region is assigned a splice label set [SN, SA, SD] and a numerical tissue usage label set [S1, S2, …, St]. Splice labels SA, SD, SN represent the possibility that a position is an “acceptor”, “donor” or “neither” (not a splice site), respectively. Tissue usage labels S1, S2, …, St, where t is the number of tissues, indicate the possibility that a position is used as a splice site in a certain tissue. These labels range in [0,1]. Consequently, the model produces a matrix with the shape of (3 + t) × Ntarget as its output. We applied Ntarget = 1000, Ncontext = 4000, and t = 15 in our work. Specifically, for an RNA sequence of length 9000 nt, SpTransformer utilized the sequence context to predict splicing sites and their corresponding usage in 15 tissues for the central 1000 nt sequence.Data collection and labelingWe collected splicing data from various sources, including the GTEx V822 database and an independent large-scale RNA-seq dataset covering various species. GTEx V8 (dbGaP Accession phs000424.v8. p2) consists of 838 donors, providing 17,382 samples from 53 tissues and two cell lines. To obtain meaningful splice sites and the corresponding tissue usage ratio, we processed the exon-exon junction read counts file from the dataset for 15 representative tissue types. Initially, sequences around splice junctions in the GRCh38 reference genome were extracted. The base preceding and following each splice junction (i.e., the 5’ and 3’ ends of exons) were defined as splice sites. Only samples from the 15 selected target tissues were considered. A splice site position was labeled as “acceptor” or “donor” label if it was supported by any sample and had no conflict. The splice site at the exon start site was labeled as “acceptor”, and the splice site at the end site was labeled as “donor”. All other positions were labeled as “neither”. Subsequently, the tissue usage label was calculated for each splice site, representing the proportion of samples belonging to the tissue that contained corresponding splice junctions, i.e.,:$$usag{e}_{t}=\frac{{\Sigma }_{sample\in t[readcoun{t}_{sample}\ge 0]}}{{\Sigma }_{sample\in t}},\, t\in tissues$$
(1)
The tissue classes analyzed in this study included: adipose tissue, blood, blood vessels, brain, colon, heart, kidney, liver, lung, muscle, nerve, small intestine, skin, spleen, and stomach, which were frequently investigated for splicing events or had abundant samples in GTEx V8 dataset. The relationship between the class labels in SpTransformer and detailed tissue types in the GTEx was documented in Supplementary Data 1. The SpTransformer code frameworks also support other combinations of tissue types. During the training and testing processes, any splice site with a maximum usage label of less than 0.05 across all tissue classes was excluded and re-labeled as “neither” class.The independent RNA-seq dataset underwent similar processing steps. We utilized mammalian organ transcriptomes, including samples from humans, rhesus macaque, mice, and rats. For additional species, the genes that show orthology or paralogy to human genes in the test dataset were excluded. A splice site was identified if it was in the gene body and supported by at least one split read in each of at least two different samples.The dataset only included tissue usage labels for 4 tissues: Brain, Heart, Liver, and Kidney. Hence, the label for each nucleotide was represented as [SN, SA, SD, S1, S2, S3, and S4], in accordance with the aforementioned definition. Despite the presence of additional tissue samples in the data source, such as forebrain or hindbrain, they were omitted due to their absence in all species. Since there were directly accessible RNA-seq data, the tissue usage label was estimated by splice site strength calculated with SpliSER59. The dataset was partitioned following the same strategy as the GTEx dataset. The part of the training data was considered an extension of the training dataset, while the test data segment remained unused.We chose sequences from chromosomes 2, 4, 6, 8, 10–22, X, and Y to create the training dataset. Sequences belonging to chromosomes 1, 3, 5, 7, and 9 were compiled as the testing dataset. To ensure the test data was not exposed in training data, we cross-referenced the sequence strands with the Ensemble database (release 110)60. We excluded gene sequences that have paralogs from the testing dataset. Despite splitting the two datasets independently, we made sure that there was no overlap between the training and testing data after the steps to split data by chromosomes and paralogs.To compile the datasets, the pre-mRNA sequences of each gene were extracted. For genes with multiple transcripts, the extracted sequence began from the most upstream site observed across all transcripts and ended at the most downstream site observed across all transcripts. Subsequently, each sequence was divided into blocks of length 1000 nt. Blocks that did not contain any splice sites were discarded. For each remaining block, the flanking sequence with 4000 nt + 4000 nt and the corresponding 1000 nt label was packaged as a single training (testing) data entry.SpTransformer model structureSpTransformer is a large deep neural network model that consists of two main modules: an encoder module and a transformer module. The encoder module incorporates a series of residual networks, known as ResBlock, with inspiration originating from classic ResNet structure61 in computer vision tasks. Previous studies have shown the effectiveness of convolution-based methods in capturing sequence or “motif” features in DNA/RNA sequences7,62. However, as the depth of the network increases, a brutal stack of convolution layers may lead to a rapid decrease in the network’s accuracy due to the vanishing or exploding gradient problem61. A residual block is composed of a series of convolution layers, interspersed with several batch-normalization layers and RELU activation function layers. Stacking these residual blocks can help mitigate the gradient issue and enhance the model’s ability to handle complex features61.The second component of our model mainly consists of a Sinkhorn Transformer module and a multi-layer perception module. Transformers have become a fundamental component in the field of modern NLP63. Their “Attention” mechanism efficiently solved sequence-based tasks such as the Named Entity Recognition problem, which is similar to our challenge of recognizing splicing sites based on RNA sequences. The Sinkhorn Transformer64 is a variant of the transformer model that is designed to handle sequences of considerable length (8192 nt in our model).The architecture of our model is shown in Supplementary Fig. 1. The input is an RNA sequence of length N = Ncontext + Ntarget + Ncontext, where Ntarget denotes the length of the target sequence, and Ncontext represents the sequence context of the target region (Ncontext = 4000 in our pipeline). The parameter L dictates the number of channels in each convolution layer, with L1 = 192 and L2 = 64 used in this study. The convolution layer in ResBlocks is characterized by parameters L, W, and D, which represent the number of channels, kernel width, and dilation, respectively. Following the calculation of the encoder module, a concatenation, and a truncation operation are applied to ensure that the input to the attention module does not exceed a length of 8192. The Sinkhorn Transformer module has 256 channels, 8 attention heads (including two local attention heads) per layer, and eight layers of depth. The final output is a (N × 3) shaped matrix and a (N × 15) shaped matrix, representing splice site prediction and tissue usage prediction, respectively.The hyperparameters were selected by multiple processes: 1) Ntarget and Ncontext: the determination of these parameters was influenced by the model’s architecture. While our goal was to allow the model to process long sequences, practical constraints—such as GPU memory limitations—necessitated a compromise. In that case, we set Ntarget = 1000, Ncontext = 4000 so that each target nucleotide has at least 4000 + 4000 available context information, taken considering that 8000 nt context is enough for the model (Supplementary Fig. 3). 2) The dimension of encoder layers was gained by grid-search in {32, 64, 128, 256}. Multiple hyperparameters of the transformer module were also tried, including: layers of transformer = {4, 6, 8, 10}, and number of attention heads = {6, 8, 10}. During this process, batch size = 12 and learning rate = 0.001 were used in order to keep consistent with SpliceAI. 3) Other parameters was selected from: batch size = {6, 12, 18, 24}, learning rate = {0.01, 0.005, 0.001, 0.0001}, decay factor of learning rate = {0.9, 0.7, 0.5}. Those options were established in reference to previous publications, suggestions from Transformer models, and the memory limitations of GPUs. From all those options, the combination with the best performance on the validation dataset was subsequently used.Further details of input and output have been provided in the “Data representation” section. Different measurement was applied to the output scores. For splice site prediction, we applied a Softmax activation function to produce probability prediction of “Acceptor”, “Donor”, and “Neither”. For tissue usage prediction, we applied a sigmoid activation for each tissue type.Model training and testingSpTransformer underwent two stages of training. In the first stage, two convolution-based encoder modules with identical structures were trained separately. One module learned weights from the GTEx training dataset, while the other module learned weights from the excessive training dataset, due to the fact that the two datasets have different label formats. To accommodate the different data formats of the two datasets, temporary convolution layers with 1 × 1 kernel size and particular output channels (18 and 7, respectively) were appended to each convolution module to align the outputs with the labels. The output of the temporary layers was used directly for loss calculation and back propagation. The two modules were assigned 128 and 64 channels (i.e., the parameter “L” in Supplementary Fig. 1), respectively, based on the difference in tissue numbers. In the second stage, the temporary layers were removed. The two modules with frozen parameters were directly integrated into a 192-channel convolution network, as shown in Supplementary Fig. 1. The whole network was then trained on the GTEx training dataset to get the final model. This approach enabled SpTransformer to learn from multiple datasets with similar biological meanings but different data formats, minimizing the need for extensive coding or conversion when differently sourced data was received. Despite both datasets being under the sequence model, the diverse splicing representations and distinct data content encourage the deep model to comprehend latent sequence features from various aspects, akin to a visual model examining a human face in multiple ways. The strategy improved the model’s performance compared to only using one dataset, demonstrating the potential to integrate diverse bioinformatics data for a single task.Special loss functions are used in the backpropagation of deep learning. In the first stage, the loss function is:$${{\rm{Loss}}}={\Sigma }_{i=1}^{n}CE({s}_{i},\, {A}_{i})+{\Sigma }_{t=1}^{T}{\Sigma }_{i=1}^{n}{{\rm{BCE}}}({u}_{ti},\, {B}_{ti})$$
(2)
Each sequence in the training dataset is a contiguous nucleotide sequence of length n. The i-th position has a splicing label Ai and a tissue-usage label Bti for T different tissues. For this position, the model outputs si for splice site prediction, and outputs uti for tissue-usage prediction. For splicing prediction, we compute the categorical cross-entropy loss, as it is a multi-class classification task. In contrast, for tissue-usage prediction, which is a multi-label classification task (meaning one sample can belong to multiple classes), we calculate the Binary Cross-Entropy loss. We apply mean reduction for the two loss functions.The loss function above is sufficient for the encoders to learn basic sequence patterns. However, as the number of supported tissues increases, new challenges emerge. First, models struggle to learn the features of different tissues in a balanced manner during the training process, occasionally demonstrating superior predictive performance for specific tissues only. Second, there is a relative scarcity of samples with strong tissue specificity compared to those with weak tissue specificity. This imbalance leads to models tending to produce similar tissue usage scores for splice sites. During the second stage, we apply a special loss function, in order to persuade the transformer module to overcome these difficulties:$${{\rm{Loss}}}={\Sigma }_{i=1}^{n}{{\rm{CE}}}({s}_{i},\, {A}_{i})+\root T \of {{\Pi }_{t}^{T}({\Sigma }_{i=1}^{n}{w}_{i}\times BCE({u}_{ti},\, {B}_{ti}))}$$
(3)
$${w}_{i}=\frac{1}{T}{\Sigma }_{t=1}^{T}{({B}_{ti}-{{\rm{average}}}({B}_{i}))}^{2}+0.001,\, t\in T$$
(4)
The formula calculates the weighted geometric mean for loss of tissue usage. The idea is motivated by previous computer vision research65. We use this method to encourage the model to balance the performance on multiple tissues and pay attention to those tissues that harder to classify. For the i-th position of the sequence, a weight wi was multiplied to encourage the model to pay more attention to splice sites with stronger tissue specificity. wi related with the variance of tissue-usage labels of i-th position.The model was trained for 12 epochs in each stage. Adam optimizer was used to minimize the combined loss. The learning rate was set to 0.005, and was multiplied by 0.7 after every epoch.Model performance evaluation and ablation testWe evaluated the performance of SpTransformer on two tasks using the compiled test dataset: 1) splice site prediction in long sequences: the model took each pre-mRNA sequence as input and identified every splice acceptor and donor within a target region of 1000 nt. Given that most positions in the sequences are not splice sites, we computed the top-k accuracy and the area under the precision-recall curve (AU-PRC) for splice site prediction. The top-k accuracy was defined as follows: if a sequence has k positive positions that truly belong to the class, a threshold is selected so that exactly k positions are predicted to be positive. The fraction of these k predicted positions that truly belong to the class is reported as the top-k accuracy. We calculated the top-k accuracy and AU-PRC value for the acceptor and donor classes separately, and reported the average performance of the two classes. 2) Tissue usage level prediction: the model was tasked with predicting the usage level in each of the 15 tissue classes for each position of the sequence. Given the absence of a widely accepted “tissue usage” protocol, we divided all splice sites in the test dataset based on their tissue usage label. Since most tissue usage labels were close to (or equal to) 0 or 1, we defined a tissue usage label greater than 0.5 as “high usage”, while the remaining sites were classified as “low usage.” The usage was set to 0 if a position was not a splice site. The model was then tasked to classify the usage of each position in the test dataset. Furthermore, positions that did not pass the top-k threshold in task 1 were forcibly masked as negative in the prediction result. We calculated AU-PRC for each tissue class. According to the two tasks outlined above, we conducted a comparative analysis, including an ablation test to highlight the advantages of our method. Firstly, we prepared three different versions of the SpTransformer model: SpTransformer-noextra, which was trained only on the GTEx training dataset; SpTransformer-extra1, which used an extra training dataset of two species, human and macaque; and SpTransformer, which used the full training dataset of four mammalian species. All versions were trained with the same configuration.Secondly, we applied the published version of SpliceAI to the tasks for comparison. Furthermore, we prepared a similar CNN-based network, named as SpliceAI-modified. The key difference was a specific alteration: the output channel number of the final convolution layer was adjusted from 3 to 18. This modification enabled the model to predict tissue usage at each splice site. The modification was a permissible adjustment within the conventional design framework of CNN. This network was trained on the GTEx dataset with the same hyperparameter and loss functions as the first stage of SpTransformer. We also retrained SpliceAI on our training dataset using the same hyperparameters as the original version, named SpliceAI-retrained. We then included the published version of Pangolin. Pangolin supports the prediction of four tissues (brain, heart, liver, and testis) and does not distinguish between acceptor and donor. Thus, the Brain, Heart, and Liver models of Pangolin were used. The maximum splice scores of them were used in task 1. The tissue scores of them were used in task 2. It is worth noting that SpliceAI-modified has a remarkably similar structure to Pangolin, despite the differences in their last output layers. SpliceAI-modified predicts splicing effects across 15 tissues using a single model, whereas Pangolin utilizes four distinct models, each dedicated to a specific tissue.Finally, we adapted the task for earlier methods that were not entirely compatible with our tasks. MMSplice, HAL, and MaxEntScan were designed for classifying a single position with short flanking sequences. We modified the input format to enable them to predict each position of the long sequences individually. For each position, the lengths of input sequence were carefully selected based on the recommendations provided by their respective publications. Since HAL does not score 3’ splice sites, we excluded the “Acceptor” class when evaluating HAL. The “MMSplice_MTSplice” tool is a combined model where “MMSplice” predicts splice sites and “MTSplice” predicts tissue-specific usage of those splice sites. During testing, we evaluated MMSplice in two different scenarios: the full dataset, as well as a simpler task where it was restricted to predicting positions within a 20 nucleotide range of each splice site, rather than the whole sequences. The performance on the simpler task was marked as “MMSplice-short”. MTSplice was able to predict usage scores in 56 detailed tissue types. We took the maximum score of corresponding types as the prediction of 15 classes in our dataset. Since “MMSplice-short” exhibited an advantage against “MMSplice”, MTSplice was also tested on the restricted task.We further investigated the importance of context in flanking RNA sequences of splice sites. All splice sites in chromosome 1 were extracted from the GTEx test dataset. Several test routines were performed after masking flanking sequences over certain values of the range. AU-ROC was calculated for each test routine in the same way as performance testing before (Supplementary Fig. 3).Estimate sequence weights by in-silico mutagenesisTo identify important regions within the input sequences, we performed a procedure referred to as “in silico mutagenesis”. The “mutagenesis weight” of a nucleotide with respect to a splice site is defined as follows: Let sref denote the splice (or tissue usage) score of the target splice site. The score is recalculated by replacing the nucleotide under consideration with A, C, G, and T. Let these scores be denoted by sA, sC, sG, and sT, respectively. The mutagenesis weight of the nucleotide is estimated as:$$w={s}_{{\rm{ref}}}-\frac{{s}_{A}+{s}_{C}+{s}_{G}+{s}_{T}}{4}$$
(5)
Under our hypothesis, a larger mutagenesis value indicates that “the splice site is more likely to be lost if the position is mutated”. We calculated the difference between the maximum and mean tissue usage values across all 15 tissues for each splice site in the GTEx dataset. Those with the top 1% difference value were selected, including 2754 splice donors and 2670 splice acceptors. For each selected splice site, in silico mutagenesis was performed for the sequence from 80 nt upstream to 20 nt downstream of the site. For each sequence, let maxsref denote the maximum sref value among all the 15 tissues. For each tissue and each sequence, all subregions with 8 nt length and with average mutagenesis weight greater than maxsref were extracted. The motif analysis tool XSTREME28 was utilized to enrich motifs from those extracted subregions. Multiple associating motifs were enriched for each tissue through the outlined methodology. Despite the different PWM matrices, the motifs were treated as the same term if their IUPAC codes (given by XSTREME) have a Levenshtein Distance not greater than 1.Quantifying splicing change resulting from variantsIn order to quantify the splicing alterations caused by SNVs, we calculated the difference in scores between the original sequences and the alternated sequences. Firstly, we predicted the 2R + 1 length of the reference sequence surrounding the mutation (R represents the length of flanking nucleotides on each side, and was set to 100 by default in our analysis) using SpTransformer. We then used the alternative sequence for prediction, resulting in prediction scores for each position of the sequence. Regardless of the “not a splice site” class, scores for each class were represented by vectors in the shape of (2R + 1) × 1. For reference and alternated sequence, there were scores named Refacceptor, Refdonor, Altacceptor, Altdonor, Refbrain, Refheart, Altbrain, Altheart, …, for two splice site types and 15 tissue types. We evaluated the following quantities, using brain tissue as an example:$$\Delta Acceptor=max(abs(Al{t}_{{\rm{acceptor}}}-Re{f}_{{\rm{acceptor}}}))$$
(6)
$$\Delta Donor=max(abs(Al{t}_{{\rm{donor}}}-Re{f}_{{\rm{donor}}}))$$
(7)
$$\Delta Brian=max(abs(Al{t}_{brain}-Re{f}_{brain}))$$
(8)
Δscore for other tissues was calculated in the same way, where abs() refers to the function to calculate the absolute value for each position, and max() is the function to find the max value in the vector. We defined ΔSplice as the max value between ΔAcceptor and ΔDonor to quantify the effect of splice alteration caused by a variant:$$\Delta Splice=max(\Delta Acceptor,\, \Delta Donor)$$
(9)
The Δscore for each tissue, for example, ΔBrain, was processed to quantify the change in tissue usage.Quantify tissue-specificity by tissue z-scoreTo quantify tissue-specificity, we designed a z-score. We created a reference mutation set with GTEx SNVs data. Upon checking the GTEx Genotype calls vcf file, we identified a total of 734,509,842 variants. We selected SNVs that met the following conditions: 1) For each tissue, there should be at least one available RNA-seq data from an individual carrying the SNV. 2) For each tissue, there should be at least one available RNA-seq data from an individual not carrying the SNV. 3) Within a 100nt range of the SNV location, a significant change in splice site usage, either increase or decrease, should be observable across all tissues when comparing RNA-seq data between individuals with and without the SNV. The “low” and “high” were under the same definition as those used in the training dataset. Finally, we filtered out 27,843 mutations that cause splice alterations in all tissue classes. These mutations were expected to have minimal impact on tissue specificity, and their Δscore was used to build a reference distribution hereafter. SpTransformer predicted the ΔSplice score and tissue Δscores for each mutation. A tissue z-score was then calculated based on the following formula. Here we use adipose as an example.$$s{p}_{i}=\Delta Adipose-\mu (\Delta {t}_{i}),\, t\in tissues,\, i\in [1,\, n]$$
(10)
$${\mu }_{{\rm{adipose}}}=\frac{1}{n}{\Sigma }_{i=1}^{n}s{p}_{i}$$
(11)
$${\sigma }_{{\rm{adipose}}}=\sqrt{\frac{1}{n}{\Sigma }_{i=1}^{n}{(s{p}_{i}-{\mu }_{{\rm{adipose}}})}^{2}}$$
(12)
$${Z}_{i}=\frac{s{p}_{i}-{\mu }_{{\rm{adipose}}}}{{\sigma }_{{\rm{adipose}}}},\, i\in [1,\, n]$$
(13)
Here we denote μ as the average values, σ as the standard error, and Zi representing the adipose tissue z-score for the i-th mutation.The distribution of the tissue z-score for each tissue was defined as the reference distribution mentioned above (Supplementary Fig. 7). For any new SNVs analyzed, we similarly calculated its z-score using previously calculated μtissue and σtissue. For each tissue, we consider any real SNVs with a tissue z-score greater than X% of the reference distribution to be tissue-specific. In all analyses in this work, X = 99 was used by default.Quantifying gene expression level by ASCOT databaseWe employed the normalized area under the curve (NAUC) as a measure of gene expression levels. The introduction of NAUC by ASCOT66 enables users to explore gene expression in a wide range of cell types and tissues. The dataset was derived from the GTEx project, which included 9662 human RNA-Seq samples across 53 tissues. In our study, we calculated the average NAUC value across the corresponding tissue types for each of the 15 tissue classes. The correlation between the tissue classes and their corresponding detailed tissue types is presented in Supplementary Data 1. We classified gene expression into “Low” (0–1 NAUC), “Moderate” (1–20 NAUC), and “High” (over 20 NAUC) according to the recommended standard. This classification was used during analysis to exclude genes with “Low” expression in the tissues of interest from our investigation. All the gene expression values presented in the figures were based on the NAUC values.Finding splicing altering SNVs in the ClinVar databaseWe utilized SpTransformer to predict splicing altering variants in the ClinVar database (the 20220917 version with hg38 genome annotation). We analyzed a total of 1,273,053 SNVs. These SNVs were annotated using SnpEff67 and only variants with consequence annotations “splice acceptor variant”, “splice donor variant”, “intron variant”, “synonymous variant”, “missense variant”, and “nonsense” were considered in all analyses. We employed ΔSplice to distinguish variants of different pathogenicity (Fig. 3a). The SNVs were categorized based on their consequence annotations (e.g., “Intron”, “Synonymous”, etc.) and clinical significance labels (e.g., “Benign”, “Pathogenic”, etc.). SNVs with ambiguous labels such as “Conflicting interpretations of pathogenicity” or “Likely risk allele” were excluded.We further utilized SNVs from the ClinVar dataset to establish a score threshold for SpTransformer (Supplementary Fig. 5). The “Strict” panel presents the performance of our SpTransformer model in distinguishing between pathogenic and benign mutations, while the “Soft” panel presents the performance in distinguishing between pathogenic/likely pathogenic/uncertain mutations and benign/likely benign mutations.The classification cutoff was determined by considering the F1-score, recall, and precision in these tasks. We employed a cutoff value of 0.27, ensuring that the model demonstrated equal precision and recall in the “Strict” task, based on the following three considerations: 1) to specifically investigate splicing events. 2) The F1-score showed only slight differences across a wide range of delta splice values, while recall dropped off more significantly (Supplementary Fig. 5c). 3) In a clinical setting, we had a preference for higher recall over higher precision, in order to identify critical variants. This cutoff value allowed us to discern variants that impact splicing. When analyzing Figs. 4 and 5, the cutoff value served as a filter to identify mutations that have the highest likelihood of being implicated in human disease through their impact on splicing. We also provided other cutoffs for users. E.g., Δsplice = 0.54 for the max F1-score, and Δsplice = 0.09 for distinguishing “likely pathogenic” and “likely benign” mutations. Similarly, a cutoff = 0.32 was chosen for the SpliceAI-retrained model, to reproduce SpTransformer’s result as a comparison (Supplementary Fig. 6).Enriching genes in tissue-specifically splice-altering SNVsTo identify genes in ClinVar associated with tissue-specific splicing, we performed an enrichment pipeline for SNVs. This pipeline incorporated three filters to identify tissue-specific splicing SNVs: ΔSplice score = 0.27, tissue z-score > 99%, and ASCOT gene expression NAUC > 1. Subsequently, we conducted a hypergeometric test on the genes that contained those predicted SNVs. To validate the association between gene-tissue disease, we referred to the Human Phenotype Ontology (HPO) database68. For each gene enriched with splicing alterations SNVs, we examined HPO terms for tissue-related phenotypes in the corresponding Mendelian disorders.Process and analysis of three different brain disordersWe collected variants related to three brain disorders: ASD, SCZ, and BD from three different public databases ASC, SCHEMA, and BipEx, respectively. The ASD data was obtained from the analysis of de novo variants called in family-based data consisting of 6430 probands with ASD and 2179 unaffected siblings, as well as rare variants called in 5556 ASD cases and 8809 ancestry-matched controls. The SCZ data was obtained from the GATK analysis of WES of 24,248 cases and 97,322 controls, as well as de novo mutations from 3402 parent-proband trios. The BP data was obtained from variants called from WES of 14,210 cases and 14,422 controls, post-quality control. Due to limited raw sample data, we were unable to segregate mutations into case and control groups using the GATK pipelines and calculate p-values. Instead, we calculated the odds ratio (OR) using the allele count and allele frequency of SNVs in the case and control samples. SNVs with an OR greater than 3.5 were classified as case SNVs. Following this pipeline, we employed SpTransformer to predict the splicing effect. After prediction, we applied a series of filters: 1) SNVs with ΔSplice greater than 0.27 (the cutoff determined in Supplementary Fig. 5) were considered splice-affecting SNVs. 2) SNVs with brain tissue z-score above 99% and the corresponding genes had moderate expression (NAUC > 1) were finally categorized as brain-specific splicing variants. The two-proportion z-test used in Fig. 5c is a statistical test designed to determine whether the proportions of categories in two group variables significantly differ from each other. In this study, we employed this test to examine whether there exists a larger proportion of brain-specific splicing variants among case SNVs that are related to splicing. Specifically, Group A consisted of all SNVs with OR > 3.5 and ΔSplice ≥ 0.27. while Group B included all SNVs with OR ≤ 3.5 and ΔSplice ≥ 0.27.We sorted all filtered SNVs by brain tissue z-score and selected the top 300 related genes for each disorder. We applied the online tool Metascape69 for GO terms enrichment analysis, disease–gene association enrichment analysis, and corresponding network figures. We categorized tissue-specific expression patterns based on the ASCOT NAUC values across 15 tissue classes. A gene was designated as having ’brain-specific expression’ if its NAUC value exceeded 1 solely in brain tissue. Similarly, if a gene’s NAUC value exceeded 1 in any other tissue, it was classified as expressed in that specific tissue. We further manually verified whether each gene listed in Fig. 5g had been previously reported to be associated with the respective disorders. Genes were classified as “previously reported” if any publication in PubMed explicitly stated in the abstract, results, or discussion that the gene is associated with the disorders.DN variant enrichment and RNA-seq validationWe processed WES data from DN patients for downstream analysis. Genomic DNA isolated from whole blood was used for genotyping. Ninety-five DN patients were genotyped using Agilent SureSelect Human All Exon V6 kit (Agilent Technologies) and sequenced on an Illumina Hiseq3000 with Hiseq3000SBS&Clusterkit. The human kidney tissues used were obtained from DN patients with renal biopsy-proven, sourced from Nanjing Glomerulonephritis Registry, National Clinical Research Center for Kidney Diseases, Jinling Hospital, Nanjing, China. RNA sequencing was performed on micro-dissected tubulointerstitial tissue from 95 patients diagnosed with DN. Total RNA was extracted using an RNAeasy Micro kit (Cat#74,004, QIAGEN Science) according to the manufacturer’s instructions. RNA quality was assessed with the Agilent Bioanalyzer 2100. Only samples with RIN scores above seven were used for cDNA production. Qualified total RNA was further purified by RNAClean XP Kit (Cat A63987, Beckman Coulter, Inc. Kraemer Boulevard Brea, CA, USA) and RNase-Free DNase Set (Cat#79254, QIAGEN, GmBH, Germany). Strand-specific, polyA-enriched RNA-seq libraries were generated using the VAHTS Stranded mRNA-seq Library Prep Kit for Illumina. FastQC (v0.11.9) was used for quality control. STAR (v2.7.8a)70 was used for alignment against the human Grch38 assembly. Transcripts of low quality were filtered out, and the read counts were normalized to transcripts per million (TPM) within each sample. These values were then log2 transformed (log2 (TPM + 1)). To identify kidney-specific splice-altering SNVs, we performed the following steps: 1) applied the GATK variant caller and standard quality control procedures; 2) retained the SNVs with an MAF of less than 0.01 in the Asian population; and 3) retained the SNVs with an ΔSplice greater than 0.27, a kidney tissue z-score above 95% and the corresponding genes had expression NAUC score > 1 in the kidney. These procedures resulted in 47 kidney-specific splice-altering SNVs corresponding to 46 genes expressed in the kidney tissue (Fig. 6b). We validated these retained SNVs using RNA-seq data, excluding those without sufficient coverage at the mutation site and adjacent exon regions. Following this pipeline, 12 SNVs corresponding to 12 genes were validated. Each SNV was manually checked in IGV browser71 and visualized with the R package ggsashimi72.Statistical informationStatistical tests are described in the figure captions, main text, and corresponding “Method” sections. To summarize, we used a one-tailed hypergeometric test to enrich tissue-specific splicing-altering SNVs in genes from the ClinVar database. We used a two-sided z-test for two groups to measure the enrichment of tissue-specific SNVs in the three brain disorder datasets. For tissue z-scores defined in the main text, we use sentences like “99% significance” to represent tissue z-scores that are greater than 99% of mutations in the reference distribution. The GO term enrichment analysis (related to Figs. 5e, f and 6e, f) was calculated by Metascape based on the hypergeometric test and Benjamini–Hochberg p-value correction algorithm.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles