METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data | BMC Bioinformatics

Simulated dataA key to the success of our approach lies in the ability to collect the best possible set of metagenome reads for each of the seed sequences. On the one hand we need to collect enough read pairs to be able to assemble and extend the seed into a near full-length 16S, but on the other hand, collecting too many incorrect reads, i.e., reads that do not truly come from the 16S of the seed, will turn the assembled sequence into a mosaic of little value. To investigate the potential pitfalls here, we started with simulated data. This allows us to annotate every single read with the exact genome from which it originated, facilitating us in assessing how many correct and incorrect reads were collected for each seed, given necessary filtering options.Figure 2 depicts the variation in the number of reads collected correctly and incorrectly with respect to the minimum alignment identity. Note that since in the simulated data, all reads are tagged by which genome it originated, making it possible to see if a read is ‘correct’ or ‘incorrect’ with respect to the seed it matches. Apparently, more incorrect reads align with the seeds than with correct reads, and when the identity surpasses 98%, there is a decrease in the number of correct reads. Thus, a 98% identity threshold was used below unless otherwise stated. We also notice the number of reads that aligned to the seeds for low -diversity samples was greater than that for high-diversity samples.Fig. 2The effect of sequence identity on the collection of metagenome reads matching the seeds. Different minimum identity thresholds (percent) were tested (x-axis) and the numbers of correctly and incorrectly collected reads were counted for all seeds in all 40 simulated samples. All metagenome samples contained 10 million read pairs. The green and red boxplots at each minimum identity indicate the spread between the samples. The left panel shows 20 high-diversity samples, while the right panel shows for 20 low-diversity samplesWe found that many reads matched two or more seeds equally well, a potential explanation for the results shown in Fig. 2. We then tried to collect only reads who aligned uniquely to one seed only. In the left panel of Fig. 3, we show the effect of employing this uniqueness criterion. A comparison of the boxplots for no such filtering (None) to requiring uniqueness (Unique) revealed a dramatic decrease in the number of incorrectly collected reads (red boxplots). However, we also observed a substantial decrease in correctly collected reads (green boxplots), indicating that this criterion is perhaps a bit too strict. Then we explored the relaxation of the uniqueness criterion by using the seed abundance information from the 16S rRNA gene data. If a read matches two or more seeds equally well, we always assign the read to the seed with the highest abundance in the sample. The logic is simply that the read is more likely to come from a more abundant seed, given that it matches several equally well. In the right panel of Fig. 3, we highlight how the number of collected reads increases by allowing reads to match up to 2, 3, …,10 different seeds. Allowing up to three matches, increases the number of correctly collected reads without significantly increasing the number of incorrect ones.Fig. 3The number of correctly and incorrectly collected reads under various filtering regimes. In the left panel (A) we show the effect of collecting all reads (filtering None) versus only collecting reads who give a unique match against one single seed (filtering Unique). Again, we split the samples by diversity. In the right panel (B) we relax the uniqueness by allowing reads to match from 1 up to 10 different seeds, but then assign the read to the seed with the largest abundance, based on the amplicon data. The result for 1 match in the right panel is identical to the uniqueness criterion in the left panelAfter setting the criterion where we allow up to 3 matches, and assigning reads by seed-abundance, we collected the read pairs and assembled them for each seed and sample. By comparing the resulting reconstructed sequence to the 16S in the genomes from which we simulated, we found that by far most of the reconstructed gave an excellent match (> 99% identity, alignment covering > 99% of the sequence). In Fig. 4 we instead show how the number of collected read pairs affect the reconstruction of the 16S. It is obvious that we need more than 10 read pairs, and in most cases more then 40, to be able to reconstruct a 16S with some substantial length. As expected, the more (correct) reads we collect, the longer the reconstructed sequence.Fig. 4The number of read pairs required to assemble the 16S. Each dot is an assembled 16S contig from one of the samples of simulated data. Along the x-axis is the number of read pairs collected and the y-axis is the length of the resulting assembled contig. Contigs of length 0 indicate the assembly failed, and we observe this is the case when the number of read pairs decreases to less than approximately 40. Whenever a contig is assembled, its length tends to grow by the number of read pairsMetagenomic dataset of seafloor sedimentsTo demonstrate the 16S reconstruction on real data we started out with 16S amplicon sequencing (Illumina MiSeq) of 286 samples taken from seafloor sediments, an example of an environment with rather large diversity and with many ‘unknown’ 16S variants. Our 16S pipeline produced a total of 70,578 OTUs (99% identity clustering) from these data. A subset of 24 samples were then selected based on having amplicons not matching (> 99% identity) known 16S in the SILVA database [13]. These 24 samples cover a total of 9296 of the OTUs. The eDNA from these samples were then subject to full metagenome sequencing (Illumina Novaseq 6000). From this we selected the top 300 most abundant OTUs as our seed sequences, from which we want to reconstruct as much as possible of the full 16S. In real life, only the seed sequences not matching anything in the public databases would be of interest, but in this case, we included all the top 300 seeds regardless of this. This is because, it gives us preliminary indications about whether we reconstructed something comparable to what the full-length sequence in the public database suggests.Reconstructing 16S rRNA gene from amplicon and shotgun data using METASEED pipelineWe employed the strictness criteria learnt from the simulated data results above, i.e. we collect metagenome reads who align with at least 98% identity to some seed, and allow alignment against up to 3 different seeds, but then assign the read to the seed with largest abundance in the sample in question. In Fig. 5, we show the log of the number of read pairs collected, and the length of the reconstructed 16S for each seed. As we observed for the simulated data, the length of the reconstructed 16S sequence tended to increase as we collected more reads, with substantial variation. The METASEED resulted in 282 of the 300 seeds having a reconstructed sequence. Of these, 185 had a minimum length of 800 bp. We use this as a minimum length of interest, since this means it covers > 50% of the full length. Also, the seeds are themselves already 400–450 bases long. The chimera check showed that none of the reconstructed 16S were chimeric.Fig. 5The figure shows the log number of metagenome read pairs (x-axis) collected for each seed (dots) along with length of the reconstructed 16S for each seed (y-axis). The black dots represent reconstructed 16S with a minimum length of 800 bp. Dots at zero length indicates seeds where no reconstruction was possibleSince we did not filter the seeds for already known 16S sequences, some of them matched full-length 16S sequences in the SILVA database. BLASTing the seeds behind the 185 reconstructed 16S against the SILVA database revealed a near-exact match (> 99% identity) for 125 of them. For a sanity check, these 125 seeds were also blasted against the corresponding reconstructed 16S sequences, to verify the reconstructed 16S sequences indeed had the best match to their corresponding original seeds. A second blast analysis using the corresponding reconstructed 16S sequences from these 125 seeds against the SILVA database, revealed that these also had the best match to the same sequences, or to a similar SILVA sequence of the same species, where the seeds matched. Figure 6 shows the BLAST identities for these 125 cases. The green dots indicate the seed, and the red dots represent the corresponding reconstructed 16S. The position along the x-axis for each green–red pair indicated the length of the reconstructed sequence. The figure demonstrates how well the seeds, and its reconstructed sequences matched the same SILVA sequence (colored dots) in terms of sequence identity. In addition, the trend line indicates there is little loss in identity for longer reconstructed sequences. We also tried to vary the uniqueness criterion when collecting reads, and the results of this are shown in supplementary Table S1.Fig. 6In total 125 seeds achieved a reconstructed sequence of at least 800 bases length and at the same time had a match (> 99%) against at least one full-length SILVA 16S sequence. The 125 green dots indicate the identity of this match (y-axis). The red dots indicate how well the corresponding reconstructed sequence matched the corresponding SILVA sequence. The location along the x-axis marks the length of the reconstructed sequence for each seed/reconstructed sequence pair. Additionally, the trend line indicates a very weak drop in identity for longer reconstructed sequencesAlternative ways to obtain 16S rRNA gene from shotgun dataOnce we have the metagenome data for the 24 samples, it is of course natural to assemble these and see if we find many 16S sequences in the resulting contigs. The assembly pipeline resulted in a large number of contigs for each sample, in total 26,133,575 contigs. We used the Barrnap software to search all contigs for 16S. This resulted in the detection of 195 bacterial sequences and 186 archaeal sequences. All these 381 sequences were non chimeric and above 800 bases in length. However, since we must expect several of these being 16S variants from the same species, the Barrnap sequences were clustered using VSEARCH with 99% identity. This resulted in 150 distinct 16S sequences. To see to what extent these are already known 16S sequences, we did a BLAST analysis of the Barrnap sequences against the SILVA database. From this we found that 128 matched something (> 99% identity). We then compared the Barrnap sequences to the OTUs from our amplicon data and found only 51 matches (> 99% identity). We then made a similar study based on our simulated data sets, since here we know exactly how the full-length 16S sequences look like. The simulated shotgun data were assembled in the same way as the real data, and Barrnap was used to collect 16S from the contigs. Of the 253 distinct 16S sequences found this way, only 1 gave a > 99% match to some of the actual 16S sequences, indicating very clearly that a short-read metagenome assembly of 16S easily becomes a mosaic of several sequences.The de-novo assembly using the MATAM tool was performed on the 24-shotgun metagenome samples. In MATAM a contig of at least 500 bp is considered as a reconstructed 16S. The MATAM tool found 14 non-chimeric 16S sequences of this minimum length, the longest of which was 649 bases long.Comparing METASEED to existing toolsWe compared METASEED to EMIRGE and MATAM using the simulated data mentioned above. In total EMIRGE reconstructed 1589 sequences, MATAM 992 and METASEED 1631 from all samples. Comparing these to the actual 16S behind the data we required > 99% identity for count it as a successful reconstruction. This resulted in discarding 471 of the EMIRGE, 160 for MATAM and 97 from METASEED. In the upper panels of Fig. 7 we display the number of successfully re-constructed for the various methods in each sample. The lower panels show the length distribution of the successfully re-constructed sequences.Fig. 7Displays the count and average length of reconstructed 16S sequences obtained from three methods, each with at least 99% identity to the actual 16S sequences across various samples. The upper panel illustrates sample numbers on the X-axis and the corresponding count of reconstructions on the Y-axis. The lower panel represents the methods utilized on the X-axis and the mean length of the reconstructed 16S sequences for each sample on the Y-axis. Different methods are denoted by color in both panels

Hot Topics

Related Articles