SmithHunter: a workflow for the identification of candidate smithRNAs and their targets | BMC Bioinformatics

Fourteen candidate smithRNAs were identified in the Manila clam [19]. Subsequently, the activity of two of these genes was validated in vivo by [31], thus establishing this species as the reference organism for smithRNA studies. Given the absence of alternative software for smithRNA detection, evaluating the reproducibility of the results obtained by [19] for the Manila clam, with special attention given to the two validated smithRNAs, appears to be the most effective way to assess and present the functionalities of the SmithHunter pipeline.We reproduced the analysis performed by [19] using the same data, with minimal modifications due to specific SmithHunter functionalities and parameters. Specifically, raw reads of the small transcriptome of R. philippinarum were downloaded from the NCBI Short Read Archive (SRA; accession numbers SRR3662624-SRR3662629), while male and female mitochondrial genomes, as well as the nuclear genome of the species, were recovered from GenBank (accession numbers AB065375.1, AB065374.1, and GCA_026571515.2, respectively). The original, unannotated transcriptome used in [19] was downloaded from GenBank (accession numbers JO101212-JO124029; [56]). In the absence of the original UTR annotations, updated annotations, produced by the same authors, were retrieved from [57]. The SmithHunter analysis was performed, independently and in parallel, on the male and female small transcriptome and mitochondrial genomes (see Commands in Supplementary Materials). Note that the species name Venerupis philippinarum, used in some records, is a synonym of Ruditapes philippinarum (WORMS database; [58]).Additional testing was performed on data from P. streckersoni, focussing on a male-related smithRNA that, while not biologically validated in strict terms, received substantial support as a masculinizing agent and regulator of the nuclear transcript for GCNT1 [37]. Raw reads of the small transcriptome were obtained from SRA (accession numbers SRR23195578-SRR23195582, SRR23195559), while the male and female mitochondrial genomes, as well as the nuclear genome, were recovered from GenBank (accession numbers ON881148, MW413895, and JAEAOA01, respectively). Transcriptome UTR sequences were in turn retrieved from Supplementary Materials to [37] (GitHub: https://github.com/raqmejtru/mitonuclear-sd). The SmithHunter analysis was performed, as above, independently and in parallel on the male and female small transcriptome and mitochondrial genomes, under parameters designed to replicate [37] as well under a more stringent parameter set stemming from our parameter optimization (see below).Trimming and remappingA total of approximately 69 and 72 million reads belonging to male and female individuals of R. philippinarum, respectively, (21 to 27 million in individual replicates) were analyzed using the procedure implemented in the first module of SmithHunter. Reads were trimmed in SE-mode (option −T SE) and aligned to the mitochondrial genomes of both sexes, as well as to the nuclear genome of the species. There were 207,149 and 276,748 reads mapped on the mitochondrial genome for the male and female, respectively, representing 0.31% and 0.39% of the trimmed reads. Among these, 187,735 and 259,719, representing 0.28% and 0.36%, respectively, of the trimmed reads, were identified as mitochondrial-unique reads: i.e., they did not align to the nuclear genome (Table S1, Fig. 2).Fig. 2Remapping of the small transcriptome over the male and female mitochondrial genomes. Coverage along the mitochondrial genome (all replicates, combined) is shown based on mitochondrial reads (red) and mitochondrial-unique reads (blue). The black horizontal line represents the T1 coverage threshold (−S 0.50). Image from the standard SmithHunter outputThe addition of a remapping step to the nuclear genome, which was not available to [19], confirmed what was hypothesized herein, i.e., that the source of the vast majority of mitochondrion-remapping reads is the mitochondrial genome and not the nuclear genome. In fact, only 9.37% of the mitochondrial reads mapping to the male mitochondrial genome, and 6.15% of reads mapping to the female genome, map to the nuclear genome as well. Despite suggestive evidence in [59], these reads are cautionary considered to be of uncertain origin in the SmithHunter pipeline, possibly originating from nuclear mitochondrial pseudogenes (Numts; [45]), and are discarded.For each of the six sequencing libraries and for the two sexes, the number of raw reads, the number/percentage of reads after trimming, the number/percentage of trimmed reads mapping to the mitochondrial genome and the number/percentage of trimmed reads uniquely mapping to the mitochondrial genome (i.e., not mapping to the nuclear genome), is reported in Table S1.ClusteringMitochondrial-unique reads were clustered, and each cluster was filtered for size using the procedure implemented in the first module of SmithHunter. In line with [19], clustering was performed at 99% identity (option −I 0.99). The stringency was set as the 50th percentile of unique cluster depth (option −S 0.50), which corresponds to 125 and 116 reads for males and females, respectively. The minimum number of replicates was not enforced (option −M 1). A total of 89 and 97 clusters were observed that passed all the thresholds for the male and female genomes, respectively. All the clusters appear to be on the plus strand, the same strand from which all the mitochondrial genes are transcribed in the species (Fig. 3; [60]).Fig. 3Distribution and depth of clusters along the male and female mitochondrial genomes. Grey peaks in the background represent total coverage. The horizontal black line represents the T1 coverage threshold (−S 0.50). All clusters are observed on the plus strand and are shown in red (clusters on the minus strand would be shown in blue). To improve readability, only clusters giving rise to candidate smithRNAs are numbered. Image from the standard SmithHunter outputAmong these clusters, 42 and 32 exhibited conserved 5′ transcription ends in the male and female genomes, respectively, and their centroid sequences were considered presumptive smithRNAs. All 14 smithRNAs selected by [19] (hereafter referred to as reference smithRNAs) were found to be presumptive smithRNAs in our analyses, and their coverage at the 5′ and 3′ transcription ends was almost identical to that previously reported (Fig. 4).Fig. 4Transcription end conservation of the 14 reference smithRNAs. Names of smithRNAs are shown following the nomenclature of SmithHunter as well as following [19]. Red and blue bars represent the distribution of unique start/ends of reads mapping on the mitochondrial genomes, respectively. The black dotted line represents overall, per base, coverage. Genome positions on the horizontal axis refer to sequences AB065375.1 (male mitochondrial genome) and AB065374.1 (female). Image from the standard SmithHunter outputTarget predictionPresumptive smithRNAs were searched against UTR regions of the R. philippinarum transcriptome using the procedure implemented in the second module of SmithHunter. A total of 39 and 30 presumptive smithRNAs from the male and female genomes, respectively, were hypothesized to target at least one nuclear transcript and are here referred to as candidate smithRNAs. Of the total of 69 candidates, 12 were reference smithRNAs (see [19]), and 10 of these were associated with the same 13 nuclear target hypothesized herein (Table 1). Most importantly, the two reference smithRNAs that were validated in vivo by [31] (M_smithRNA106t and M_smithRNA145t) passed through all the filters and were associated with the previously documented targets. In particular, M-smithRNA106t was hypothesized to target the Manila clam homolog of the human Histone-lysine N-methyltransferase and M-smith145t was hypothesized to target the Manila clam homolog of the human polymerase epsilon (Table 1). Incidentally, 11 out of 24 smithRNA-target pairs reported by [19] were not identified in the current analyses. These sequences were considered individually, and the smithRNA sequence was searched over the UTR regions of the R. philippinarum transcriptome using BLAST. A total of 10 out of the 11 smithRNAs actually matched the previously reported target in the plus-plus orientation, suggesting a problem with a directionality filter in the original analysis; these were subsequently excluded from the analysis.Table 1 Reference smithRNAs-target pairs from [19] recovered using SmithHunterIn its basic implementation, SmithHunter was therefore shown to largely duplicate the results of [19]. In the following section we aim at additional testing and the deployment of additional features implemented in SmithHunter.Parameter optimizationThe effects of user choice on selectivity related parameters, a new implementation of SmithHunter that was not available to [19] and [31], were evaluated by repeating the analysis under 27 different parameter combinations as well as allowing 0 to 2 mismatches in the seed region (see below).All combinations of cluster identity (option −I; tested: 0.99, 0.95, 0.90), stringency (option −S; tested: 0.2, 0.5, 0.8) and minimum number of replicates (option −M; tested: 1, 2, 3) were studied on both the male and female data. Predictably, the number of presumptive smithRNAs identified by the first module was directly correlated with the identity (I) parameter and inversely correlated with the replicates (M) and stringency (S) parameters. By adopting the less selective parameter combination (−I 0.99 −S 0.2 −M 1) a total of 71 and 65 presumptive smithRNAs were identified for males and females, respectively. Out of this set, 67 candidate smithRNAs, targeting 320 nuclear genes, were identified for males and 59 candidate smithRNAs, targeting 323 nuclear targets for females (Table S2, Fig. 5). In contrast, by using the most selective parameter combination (−I 0.90 −S 0.8 −M 3), eight and six presumptive smithRNAs were found in male and female individuals, respectively. Six and five of these were identified as candidate smithRNAs and were associated with 25 and 16 nuclear targets, respectively, in males and females (Table S2, Fig. 5).Fig. 5Number of candidate smithRNAs and of their associated nuclear targets recovered under different parameter combinations. Parameter combinations are indicated on the horizontal axis as follows: I[identity parameter]_S[stringency parameter]_M[replicates parameter]Taking the number of presumptive smithRNAs produced at the end of the first module as a proxy for selectivity, it was possible to visualize the effects of different parameters. Stringency (S) had the most substantial effect, leading to the filtering, in the range examined, of 74.5% of clusters. The number of replicates (M) had a more limited effect, with the filtering of 48.9% of clusters. The cluster identity parameter (I), on the other hand, displayed a minimal effect, at least on the Manila clam data, leading to the filtering of 11.5% of clusters (Figure S1).Notably, reference smithRNAs from [19] were often identified even under the most selective criteria. Eight out of twelve reference smithRNAs were identified under all combinations tested, while the remaining four were identified under a minimum of 21 parameter combinations (Fig. 6). Moreover, candidate smithRNAs identified under the most selective criteria were also identified under all the tested combinations. This finding suggested that the most selective parameter combinations were optimal starting points and that the resulting candidate smithRNAs, in turn, constitute a reasonably reliable group and starting point for subsequent validation analyses.Fig. 6Candidate smithRNAs recovered under different parameter combinations for the male and female genomes. Rows represent all candidate smithRNAs recovered under all parameter combinations. Reference smithRNAs are indicated by an asterisk. Columns represent parameter combinations and are labelled as follows: I[identity parameter]_S[stringency parameter]_M[replicates parameter]. White color in cross cells means that the candidate smithRNA was recovered under the specific parameter combination, black color means that it was not recoveredLoosening of the selectivity criteria, in turn, leads to an expansion of the results from the highly confident core. Given the possibility of customizing selectivity parameters, users have the flexibility to balance the number of resulting candidates with their level of confidence by acting on the parameters. However, the level of selectivity should be consciously evaluated case by case, depending on the biological system under scrutiny, the data, the number of biological replicates and, in turn, the intended use of the results.In the end, smithRNAs and their associated nuclear targets identified under the most selective parameter combination (−I 0.90 −S 0.80 −M3) are reported in Table 2 and Table S3.Table 2 SmithRNAs and their targets identified under the most stringent parameter combination in males and femalesTheir length distribution, compared with those of the total reads mapped to mitochondrial genomes, is shown in Figure S2. The length distribution compares well with the results of [19] (compare to Fig. 1 therein) over the 22–35 bp range. This finding further suggested that, unlike miRNAs, smithRNAs may exhibit broad variation in length, with substantial peaks in the 20–34 bp range. Noteworthy, this may have a relevance in the context of smithRNA maturation and AGO2 binding, as proposed by [61].At least six candidate smithRNAs obtained under the most selective parameter combination were predicted to form bona fide pre-miRNA-like harpins (Fig. 7). Their free energy is generally lower than that presented in the original study and their shape, in most cases, better conforms to the expectation of a long hairpin. This, in turn, supports the tentative identification of pre-smithRNAs based on position rather than on the span of tRNA and unassigned region annotations in the mitochondrial genome.Fig. 7Putative secondary structure of pre-smithRNAs for smithRNAs obtained under the most selective parameter combination. Structure labels report the name of the gene, dG of the folded structure, sex and, if available, the corresponding name in [19]. SmithRNA sequences are highlighted in red. Image modified from the standard SmithHunter outputIn order to evaluate the effects of allowing non-perfect alignments in the seed region in the target identification step, target identification was repeated under all parameter combinations allowing no mismatch in the seed region (as above) as well as allowing 1 or 2 mismatches (option −m).The number of candidate smithRNAs did not increase significantly when mismatches were allowed in the seed region. With one mismatch allowed, up to two additional candidate smithRNAs were identified under the less selective parameter combinations, and no additional candidate was identified by allowing two mismatches. Conversely, the number of targets identified for each candidate smithRNA increased significantly if nonperfect alignments in the seed region were considered (Fig. 8). Compared to the case in which no mismatch was allowed, allowing for one mismatch led to the identification of almost twice as many targets (average 1.76 × in male data and 2.03 × in female data across different parameter combinations). Allowing two mismatches, in turn, did not lead to a further increase in the number of identified targets (average 1.01 × and 1.05 ×, respectively; Fig. 8). According to this evaluation, allowing mismatches in the seed region appears to have a marginal effect on candidate smithRNA identification. On the other hand, in the context of target identification, allowing mismatches in the seed region leads to a—possibly unwarranted—increase in targets that are less, or marginally, supported. As such, our advice is not to use this option in standard applications of SmithHunter, i.e. where the purpose is to identify a restricted number of high confidence candidate smithRNAs and their targets. This option, in turn, may be used in specific contexts, such as the evaluation of imperfect alignments in identifying one specific target that has previously been validated based on external evidence, or a study set of known targets for parameter optimization. Moreover, given that no difference was observed between one or two mismatches allowed, we suggest using one mismatch, if any, to reduce the run time of the second module.Fig. 8Number of targets identified as a function of the number of mismatches allowed in the seed region. Parameter combinations, on the horizontal axis, are labelled as follows: I[identity parameter]_S[stringency parameter]_M[replicates parameter]Additional testing on P. streckersoni
Additional testing in P. streckersoni entailed the use of two different sets of analytical parameters and was conducted on both male and female data, focusing on the retrieval of smithRNA M-9, that received substantial support in [37]. The first parameter combination was designed to be grossly similar, in terms of stringency, and acknowledged the differences among the SmithHunter procedure and the procedure applied in [37], to the analysis performed in [37]. Stringency parameters were applied as: −I 0.99 −S 0.5 −M 1, the nuclear filter was not applied, and end conservation was evaluated following the guidelines in [37], that appear to be more relaxed than in our protocol. Compared to our testing regime, this corresponds to a medium/low stringency. A total of 28 out of the 33 smithRNAs identified in [37], namely 8/9 in males and 20/24 in females, were recovered by SmithHunter as candidate smithRNAs.The second parameter combination applied (−I 0.90 −S 0.8 M 2) corresponds the most stringent parameter set in our testing regime. Here, and at variance with [37], the nuclear filter was also applied to exclude small RNAs of uncertain (nuclear or mitochondrial) origin, and end conservation was evaluated following the guidelines in [19]. A total of 10 out of the 33 smithRNAs identified in [37], 5/9 in males and 5/24 in females, were recovered as candidate smithRNAs by SmithHunter. Most importantly, the focal smithRNA M-9, that was singled out in [37] as the best candidate based on differential, sex related, expression, as well as its putative role in the sex determining pathway, was recovered by SmithHunter even under this most stringent parameter combination in association with its proposed target (GCNT1) according to [37].Based on a comparison among the two runs, the reduction of the number of candidate smithRNA from the medium/low stringency to high stringency was due to the nuclear filter (5 candidates), coverage thresholds and replicate filter (4) and end conservation (4) or a combination thereof (5).

Hot Topics

Related Articles