SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies | BMC Bioinformatics

Inputs description

SimSpliceEvol2 takes the same inputs as SimSpliceEvol1. Namely, it requires a guide gene tree with branch lengths in the NHX or Newick format. Here, a guide tree refers to the topology of a gene tree that represents the evolution of a gene family, with gene labels at the leaves and with branch lengths. The branch lengths are used to calculate the number of mutation events acting on the gene sequences, on the exon-intron structures of genes, and on the sets of transcripts produced from genes during evolution along each branch of the gene tree [22, 23].

There are four categories of optional user-defined parameters. For space reasons, we recall here only the notation for the optional parameters used in the simulation of transcript evolution. A comprehensive explanation of each user-defined parameter and its impact on the simulated data is available at https://simspliceevol.cobius.usherbrooke.ca. The first category of optional parameters includes constant factors used to compute the number of mutation events on branches of the gene phylogeny. In particular, the user-defined constant \(\texttt{k\_tc}\) helps determine the number of events that change the set of transcripts produced from a gene.

The second and third categories of optional parameters include those used in the simulation of exon sequence evolution and in the simulation of exon-intron structure evolution.

We now describe in detail the fourth category, which includes the parameters used to compute the number of events of each type acting on the set of transcripts produced from a gene. There are seven such event types: five types of alternative splicing (AS) events, a transcript loss event, and a transcript gain event.
The five types of alternative splicing events, which explain the differences between any two transcripts of a gene, are alternative 3’ (a3) or 5’ (a5) splice-site selection, exon skipping (es), mutually exclusive (me) exons, and intron retention (ir). Thus, the seven parameters in this category are denoted by \(\texttt{tc\_a5}\), \(\texttt{tc\_a3}\), \(\texttt{tc\_es}\), \(\texttt{tc\_me}\), \(\texttt{tc\_ir}\), \(\texttt{tc\_tl}\), and \(\texttt{tc\_rs}\), corresponding respectively to the seven types of mutation events that can change the set of transcripts produced from a gene. They define the relative proportions, respectively, of alternative splicing events of types a5, a3, es, me, and ir, of transcripts lost (tl), and of transcripts randomly selected (rs), in other words gained, among all possible isoforms. For instance, the expected number of transcripts undergoing es events for a gene with n transcripts on a branch of the gene tree of length \(\texttt{c\_s\_r}\) is given by the formula \(n \times \texttt{tc\_es} \times \texttt{k\_tc} \times \texttt{c\_s\_r}\).

Like SimSpliceEvol1, SimSpliceEvol2 exposes a large number of user parameters so that users can simulate and test various frequencies for the evolutionary events.

Transcript evolution framework

SimSpliceEvol2 introduces improvements in simulating the evolution of sets of transcripts compared to SimSpliceEvol1.

Fig. 1 Illustration of the transcript evolution simulation framework. The figure depicts the phylogeny resulting from the simulated evolution of transcripts in a guide gene tree. The guide gene tree, depicted as three cylinders, describes the evolution of two extant genes, Gene2 and Gene3, from an ancestral gene, Gene1. The bottom surfaces of the cylinders represent the two leaves (Gene2 and Gene3) of the guide gene tree and their ancestor (Gene1). The legend at the bottom of the figure shows the meaning of each graphical element.
The exon-intron structure of each gene is displayed, as well as the exon composition of each transcript. The evolutionary history consists of evolutionary stages. The root nodes of the transcript phylogeny correspond to transcript gains. The values of the user input parameters are \(\texttt{tc\_rs} \ne 0.0\), \(\texttt{tc\_tl}=0.1\), \(\texttt{tc\_a5}=0.1\), \(\texttt{tc\_a3}=0.1\), \(\texttt{tc\_es}=0.2\), \(\texttt{tc\_me}=0.1\), and \(\texttt{tc\_ir}=0.1\), and the values of the constant factors (\(\texttt{k\_tc}\) and \(\texttt{c\_s\_r}\)) are such that \(\texttt{k\_tc} \times \texttt{c\_s\_r} = 1\), where \(\texttt{c\_s\_r}\) is the length of the branch. For instance, regarding transcripts in Gene2, the number of transcripts undergoing the intron retention event is equal to 1, which is the ceiling of \(\texttt{k\_tc} \times \texttt{c\_s\_r} \times n \times \texttt{tc\_ir}\), where \(n=7\) is the number of source transcripts at this particular evolutionary stage ({1#1, 2#0, 2#2, 2#3, 2#4, 2#5, 2#6}).

Simulation of the set of transcripts at the root of the guide tree

The simulation of the set of transcripts at the root of the guide tree proceeds in two steps. In the first step, transcripts are generated by randomly selecting them from all potential transcript isoforms at the root (all \(2^m - 1\) possible non-empty combinations of exons, where m is the number of exons in the gene structure). If \(\texttt{tc\_rs}\) is zero, then only one transcript, composed of all exons in the gene structure, is selected. Otherwise, a heuristic is used for the random selection of transcripts from the pool of \(2^m - 1\) possible isoforms. This heuristic first involves the random selection of a number n of transcripts, following a normal distribution with a mean of 1.45 and a standard deviation of 1.08.
These parameter values were determined in [20] and correspond to the mean and the standard deviation of the number of transcripts per gene in eukaryotes, based on data from the Ensembl Compara database [24]. In the second step, the remaining transcripts are produced by applying AS events to the transcripts randomly selected in the first step. Note that a transcript can undergo more than one AS event during the simulation. Figure 1 provides an illustration. At the root of the tree, the randomly selected transcripts are 1#1 and 1#2, and both underwent a sequence of AS events, giving rise, for example, to transcripts 1#4, 1#5 and 1#7 from 1#1, and to 1#3 and 1#6 from 1#2. For example, transcript 1#6 is derived from transcript 1#2 by first applying an es event to yield 1#3, and then an a3 event to yield 1#6. The loss of transcripts does not occur at the root of the tree, since the root is assumed to represent the starting point of the evolution of a set of homologous transcripts. Consequently, all relative proportions of AS events are applied directly to the overall count of existing transcripts at a given evolutionary stage. An evolutionary stage is defined as the period between the time when a set of source transcripts is available for a gene and the time when a new set of sink transcripts is available after applying mutation events to the source transcripts, as illustrated in Fig. 1. Each sink transcript is then linked to at most one source transcript from which it is derived. For example, in Fig. 1, at the root of the gene tree, the number of transcripts undergoing the es event is the ceiling of \(n \times \texttt{tc\_es}\), where \(n=2\) is the number of source transcripts at this stage of evolution and \(\texttt{tc\_es}=0.2\). In this case, only one transcript is subject to an es event, and the next stage of evolution then has a total of 3 source transcripts.
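
The ceiling rule used in these examples can be written down in a few lines. The following is a minimal sketch, not the SimSpliceEvol2 implementation: the function name `stage_event_counts`, the dictionary layout, and the rounding guard against floating-point noise are assumptions made for illustration. Only the formula itself and the proportions from the caption of Fig. 1 come from the text.

```python
import math

# The five AS event types named in the text. The tuple and the dictionary
# layout below are assumptions of this sketch, not the tool's data model.
AS_EVENT_TYPES = ("a5", "a3", "es", "me", "ir")


def stage_event_counts(n_source, proportions, k_tc=1.0, branch_length=1.0):
    """Count source transcripts undergoing each AS event type at one stage.

    Applies ceil(k_tc * c_s_r * n * tc_x) for each event type x. Leaving
    k_tc and branch_length at 1.0 gives the root rule ceil(n * tc_x).
    """
    counts = {}
    for event in AS_EVENT_TYPES:
        expected = k_tc * branch_length * n_source * proportions.get(event, 0.0)
        # Round away floating-point noise before taking the ceiling, so that
        # a value such as 3.0000000000000004 is not pushed up to 4
        # (a guard chosen for this sketch).
        counts[event] = math.ceil(round(expected, 9))
    return counts


# Relative proportions listed in the caption of Fig. 1.
fig1 = {"a5": 0.1, "a3": 0.1, "es": 0.2, "me": 0.1, "ir": 0.1}

root_counts = stage_event_counts(2, fig1)    # root stage, n = 2
branch_counts = stage_event_counts(3, fig1)  # Gene1 -> Gene2, n = 3, k_tc * c_s_r = 1
late_counts = stage_event_counts(7, fig1)    # later Gene2 stage, n = 7

print(root_counts["es"], branch_counts["es"], late_counts["ir"])
```

Because of the ceiling, every event type with a non-zero proportion affects at least one transcript at each stage, however short the branch.
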
This approach is used to determine the number of transcripts undergoing a given type of AS event at each evolutionary stage throughout the simulation.

Simulation of the set of transcripts at nodes other than the root

For an internal node or a leaf node of the guide gene tree, the exon-intron structure of the node is compared to the exon-intron structure of its parent node. First, a transcript loss is inferred for every transcript of the parent node that contains one or more exons lost on the branch between the parent gene node and the child gene node. The remaining transcripts are called conserved transcripts. For example, in Fig. 1, the transcripts 1#1 and 1#7 from Gene1 are conserved on the branch leading to Gene2, while transcripts 1#2, 1#3, 1#4, 1#5, and 1#6 are lost because of the absence of exon4 in the structure of Gene2.

Subsequently, the conserved transcripts continue their evolution. A conserved transcript may still be lost, since a second mechanism of transcript loss is governed by the user-defined parameter \(\texttt{tc\_tl}\). For instance, in Fig. 1, the transcript 1#7 is initially conserved at the start of the branch between Gene1 and Gene2, but it is lost in the next evolutionary stage. After these steps, transcripts randomly selected from all potential transcript isoforms are added to the set of conserved transcripts. The remaining transcripts are then generated by applying AS events to the set of available transcripts, following a process similar to the one used at the root of the guide tree. However, the process now takes into account the length \(\texttt{c\_s\_r}\) of the branch between the parent and child nodes in the guide gene tree, as well as the constant factor \(\texttt{k\_tc}\) provided by the user. For instance, in Fig. 1, the number of transcripts undergoing es on the branch from Gene1 to Gene2 equals the ceiling of \(\texttt{k\_tc} \times \texttt{c\_s\_r} \times n \times \texttt{tc\_es}\), where n is the number of source transcripts at the given evolutionary stage. The result is 1 given \(\texttt{k\_tc} \times \texttt{c\_s\_r} = 1\), \(\texttt{tc\_es}=0.2\) and \(n=3\).

The simulation framework adheres to the Dollo parsimony principle, according to which an exon cannot be regained during evolution after being lost. At the end of the simulation process, SimSpliceEvol2 yields a forest of transcript trees that describes the evolutionary history of transcripts in the gene family. In contrast to SimSpliceEvol1, which mainly focused on generating sequences of alternative transcripts at the leaves of the guide tree, SimSpliceEvol2 explicitly infers the transcript phylogeny.

Software design

Web server

The current web server was implemented by deploying a Linux-based Apache 2.4.41 web server on an Ubuntu 20.04.6 LTS system. It is fully compatible with standard desktop PC systems. We have updated the user interface of the web server provided for SimSpliceEvol2 to make it more user-friendly. The web interface is now built using the ViteJS 4.1 framework. Users can now enter their input guide tree manually, in addition to uploading it as a file. Users can also save their query parameters for future reference and analysis. Once users have set the parameters, they can launch the simulation using the “compute” button. When the simulation is finished, users can download the simulated datasets from the “results” section, as illustrated in Fig. 2. The web server is available at https://simspliceevol.cobius.usherbrooke.ca.

Fig. 2 Web server screenshot. The screenshot presents two main sections of the web server. The input section, highlighted in red, allows users to set parameter values.
Users can launch the program (illustrated by a green arrow) and save the query parameters for future use (illustrated by a purple arrow). The results section, highlighted in blue, displays the data available for download once they are ready, as indicated by the blue arrow.

Standalone software and graphical user interface

We have improved the software by integrating a model of transcript evolution, as detailed in Sect. 2.2. We have included it within a standalone graphical user interface (GUI) in order to distribute a single software package, as shown in Fig. 3. The software is available for download at https://simspliceevol.cobius.usherbrooke.ca. It is a standalone application designed to run the simulation program locally without any dependencies. Currently, it has been developed and tested on Linux (Ubuntu 22.04.3 LTS and later versions) and Windows 11 operating systems, and no user login is required. SimSpliceEvol2 can generate figures that display the transcript phylogenies together with the corresponding multiple sequence alignments. For example, in Fig. 4 (bottom), the phylogenetic tree is displayed with leaf colors that highlight, in yellow, a group of orthologous transcripts, including transcript_4-0, transcript_5-0, transcript_6-0, and transcript_7-0, alongside the multiple sequence alignment of the transcripts present in the tree. Referring to Fig. 1, two transcripts are orthologs if they are derived from distinct genes and there are no AS events on the path of branches connecting them, i.e., the path contains only black edges in the figure. A group of orthologous transcripts is a set of transcripts that are pairwise orthologs. In Fig. 1, the only group of orthologous transcripts is \(\{\textbf{2\#1}, \textbf{3\#1}\}\). For more detailed documentation on the software, the reader is invited to consult https://simspliceevol.cobius.usherbrooke.ca.

Fig. 3 SimSpliceEvol2 GUI screenshot.
The graphical user interface of SimSpliceEvol2 lets users browse the file system to select the input guide tree file (indicated by the red arrow) and the output directory (indicated by the green arrow). It also lets users set parameter values (indicated by the blue arrow). Upon clicking the “generate command” button, the corresponding command line is produced (indicated by the orange arrow). Users can copy this command for future use by clicking the “copy command” button. Help related to the program or its options is provided at the top of the interface (indicated by the black arrow). The outputs generated from running the simulation (indicated by the purple arrow) correspond to the figures shown in Fig. 4.

SimSpliceEvol2 improves on the output of SimSpliceEvol1 by incorporating a visualization component through its GUI, and by providing more relevant data for transcript phylogeny analysis. Within the GUI, a carousel window is provided to help users explore phylogenies. Additionally, the software displays the alignment of transcripts to illustrate the conservation of exon structure across all simulated transcripts.

Fig. 4 Outputs of SimSpliceEvol2 using the GUI with the default options (k_tc=5, tc_rs=1, tc_es=0.25, tc_me=0.15, tc_a5=0.15, tc_a3=0.15, tc_ir=0.15, tc_tl=0.05). (Top left) Visualization of the simulated transcript phylogeny (multiple transcript trees that form a transcript forest) through a carousel interface within the GUI (figures generated using ETE 3 [25]). (Top right) An exon alignment for all transcripts generated at the leaves of the phylogeny is shown alongside the guide gene tree. (Bottom) The software outputs figures that show the multiple sequence alignment of transcripts generated at the leaves of each transcript tree (figure also generated using ETE 3 [25]).
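
The definition of orthologous transcripts given in the standalone software section (distinct genes, and no AS event on the connecting path) lends itself to a short sketch. The code below is an illustration under stated assumptions, not the way SimSpliceEvol2 computes its groups: the edge-list representation, the function name `orthologous_groups`, and the toy transcript labels are all hypothetical. It removes every edge that carries an AS event and reports the leaves that remain connected. In a tree, if the paths a to b and b to c are both free of AS events, then so is the path a to c, which is why connected components give pairwise orthologs.

```python
from collections import defaultdict


def orthologous_groups(edges, leaf_gene):
    """Group leaf transcripts joined by paths that carry no AS event.

    edges: iterable of (parent, child, event) triples, where event is None
           for a conservation edge (a black edge in Fig. 1) or an AS label
           such as "es".
    leaf_gene: dict mapping each leaf transcript to the gene it belongs to.
    """
    # Keep only conservation edges and treat them as undirected.
    neighbours = defaultdict(set)
    for parent, child, event in edges:
        if event is None:
            neighbours[parent].add(child)
            neighbours[child].add(parent)

    seen, groups = set(), []
    for start in leaf_gene:
        if start in seen:
            continue
        # Depth-first walk restricted to conservation edges.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        leaves = sorted(t for t in component if t in leaf_gene)
        genes = {leaf_gene[t] for t in leaves}
        # Simplification of this sketch: report a group only if it has at
        # least two leaves and all of them come from distinct genes.
        if len(leaves) > 1 and len(genes) == len(leaves):
            groups.append(leaves)
    return groups


# Hypothetical toy forest: ancestral transcripts A1 and A2, leaves in genes B and C.
edges = [
    ("A1", "B1", None),  # A1 conserved in gene B
    ("A1", "C1", None),  # A1 conserved in gene C
    ("A1", "A2", "es"),  # A2 created from A1 by exon skipping
    ("A2", "B2", None),  # A2 conserved in gene B
    ("A2", "C2", "ir"),  # C2 created from A2 by intron retention
]
leaf_gene = {"B1": "B", "C1": "C", "B2": "B", "C2": "C"}

groups = orthologous_groups(edges, leaf_gene)
print(groups)
```

In this toy forest only B1 and C1 form a group, since B2 and C2 are each separated from every other leaf by at least one AS edge.
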
