DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms | BMC Bioinformatics

Our method is > 1000× faster than the assembly-based approach while being more accurate

We set out to build a quick-and-dirty differential gene expression analysis pipeline by avoiding transcriptome assembly and replacing the slower alignment step with quasi-mapping. As the D. grimshawi and D. ananassae plots in Fig. 4 show, even with the sensitive full-alignment step replaced by quasi-mapping, our pipeline for identifying differentially expressed genes outperforms the assembly-based approach. This gain comes with a dramatic speed-up. Transcriptome assembly of a dataset containing roughly 120 million read pairs, with 4 cores and 8 threads engaged, took more than 7 days to complete. Our mapping handles the same data in less than 10 minutes, making it more than 1000 times faster (7 days is 10,080 minutes).

When compared to other aligners/mappers, our method provides a trade-off between speed and sensitivity

As seen in the run-time plot of Fig. 3, our method runs the fastest among all the tools: > 2.5× faster than the next-fastest tool, Kaiju, > 4× faster than DIAMOND, and > 100× faster than LAST. This might not be surprising, as DIAMOND and LAST compute alignments using a seed-and-extend approach, and LAST additionally computes an appropriate alignment scoring scheme as well as alignment column probabilities. Our method and Kaiju, on the other hand, rely on finding exact matches. However, our speed-up comes at the cost of lower mapping accuracy. As can be seen in Fig. 4, when the proteome of a close relative (D. ananassae or D. grimshawi) is used as reference, our method is overall less sensitive and precise than LAST and DIAMOND, while performing roughly on par with Kaiju. When using the proteome of the distant relative A. gambiae, the performance of our method degrades the most among all aligners/mappers. The results of Fig. 4 essentially capture the differences in mapping performance shown in Fig. 2.
There is a stark performance gap compared to LAST and DIAMOND, with our method correctly mapping 10–15% fewer reads than either. Compared to Kaiju, the difference in mapping performance is small for close relatives, but Kaiju performs better as the reference becomes more distant.

We note that while we chose various reference proteomes to demonstrate the effect of varying levels of evolutionary divergence, part of the differences we see in Figs. 2 and 4 might be attributable to differences in the assembly and annotation pipelines employed to generate those reference sequences.

Reduced alphabet

We implemented quasi-mapping on the reduced amino-acid alphabet proposed by DIAMOND: {K,R,E,D,Q,N}, {C}, {G}, {H}, {I,L,V}, {M}, {F}, {Y}, {W}, {P}, {S,T,A}, where characters in the same set are treated as equivalent. The results are shown in Fig. 5. We observed that for the reduced alphabet, a coverage threshold of 40 (the rightmost point on each curve) results in a substantial increase in incorrect mappings compared to the non-reduced alphabet. At a coverage threshold of 50, \(k=11\) for the reduced alphabet comes close to the performance of \(k=7\) for the non-reduced alphabet, while being almost 2.5× faster.

Fig. 5 Comparing the performance of our method on the full amino acid alphabet versus a reduced one of size 11. For each curve, the three points from left to right correspond to coverage thresholds of 60, 50, and 40, respectively

Effect of reference proteome

The availability of an accurate reference protein database is key to the performance of all methods evaluated in this paper, but more so for our method, which relies on exact k-mer matches. This is seen, perhaps unsurprisingly, in our evaluations, where our method seems to be the most sensitive to evolutionary divergence. Another factor that could affect performance is the presence of a large number of highly similar sequences in the reference.
This could result from a high level of recent gene/genome duplication, a large amount of repetitive sequence in transcripts due to transposable elements, or a reference that spans multiple species and contains orthologs (e.g. UniProtKB). For all methods, it would be interesting to investigate the extent of performance degradation due to these factors, both at the level of mapping and in downstream functional analysis. As a practical remedy for a multi-species reference, it might be better to first reduce redundancy by running sequence clustering tools (e.g. CD-HIT [26] or MMSeq [27]) or to find orthogroups (e.g. using OrthoFinder [28]).

Future directions

It would be interesting to reduce the gap in sensitivity compared to traditional seed-and-extend methods without raising computational cost. Some promising directions include using spaced k-mers [29], syncmers [30], and minimally overlapping words [31]. Spaced k-mers, for example, have been shown to be more sensitive than contiguous ones in other alignment-free sequence comparison applications [32], and are in fact also implemented in DIAMOND and LAST.

We also note that we chose the augmented suffix array data structure without memory usage in mind, especially since our method uses negligible memory compared to de novo transcriptome assembly. It might be interesting to explore more compact text indexes for cases where memory is a concern.
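
To make the reduced alphabet and the spaced k-mers discussed above more concrete, the following is a minimal Python sketch, not the implementation used in the paper (which relies on an augmented suffix array that is not reproduced here). The alphabet groups are the ones listed in the Reduced alphabet section; the function names `reduce_sequence`, `kmers`, and `spaced_kmers`, the choice of the first letter of each group as its representative, and the toy peptides are all assumptions made purely for illustration.

```python
# Toy sketch: exact k-mer sharing on a reduced amino-acid alphabet,
# plus spaced k-mers. All names here are hypothetical, not the paper's code.

# Groups as listed in the text (reduced alphabet proposed by DIAMOND).
GROUPS = ["KREDQN", "C", "G", "H", "ILV", "M", "F", "Y", "W", "P", "STA"]

# Assumption: each amino acid is represented by the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(protein):
    """Rewrite a protein over the 11-letter reduced alphabet.

    Characters outside the 20 standard amino acids are left unchanged.
    """
    return "".join(REDUCE.get(aa, aa) for aa in protein.upper())


def kmers(seq, k):
    """Return the set of contiguous k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def spaced_kmers(seq, mask):
    """Return the set of spaced k-mers of seq for a 0/1 mask.

    Positions marked '1' are kept; positions marked '0' are ignored,
    so a mismatch under a '0' does not break the match.
    """
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    return {
        "".join(seq[i + j] for j in keep)
        for i in range(len(seq) - span + 1)
    }


if __name__ == "__main__":
    # Two made-up peptides differing only by within-group substitutions.
    ref = "MKVLSTAGEDWPHY"
    read = "MRILTSAGDEWPHY"

    full = kmers(ref, 5) & kmers(read, 5)
    reduced = kmers(reduce_sequence(ref), 5) & kmers(reduce_sequence(read), 5)
    print("shared 5-mers, full alphabet:   ", len(full))
    print("shared 5-mers, reduced alphabet:", len(reduced))

    # A single mismatch under the '0' of the mask is tolerated.
    print(spaced_kmers("ACDEFGH", "1110111") & spaced_kmers("ACDQFGH", "1110111"))
```

In this toy pair, every substitution stays within one alphabet group, so the two peptides share no 5-mers on the full alphabet but become identical after reduction. Real diverged sequences will behave less cleanly. The sketch also hints at the cost of the added tolerance: unrelated sequences collide more often after reduction, which is consistent with the longer k used for the reduced alphabet in the text.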
