A catalog of small proteins from the global microbiome

Collection of global metagenomic datasets and prediction of smORFsIn total, 63,410 publicly available global assembled metagenomes from the SPIRE database21 collection were used. Briefly, the assembled metagenomes have been generated through the following methods: publicly available data (as of 1 January 2020) were downloaded from the European Nucleotide Archive (ENA) and short reads that were at least 60 bps after trimming positions with quality <2565 were assembled into contigs using MEGAHIT 1.2.966. Additionally, we downloaded 87,920 high-quality isolate microbial genomes from the ProGenomes2 database34.We then used the modified version of Prodigal35 in Macrel 0.536 to predict ORFs ≥ 30 base pairs (bps) on the assembled contigs as well as those from Progenomes2 database. This version of Prodigal uses the same algorithm as the standard version of Prodigal, but with a lower limit on the size of genes. We used command line parameters to only predict closed genes, to not predict genes with N as a base, to perform a full motif scan, in metagenomics mode (-c -m -n -p meta). The ORFs encoding small proteins (here defined as those up to 100 amino acids) were considered smORFs.We recorded the habitats of smORFs according to their source samples using the habitat microontology introduced in SPIRE database21. We further grouped the habitats into 8 broad categories: mammal gut, anthropogenic, other-human, other-animal, aquatic, human gut, soil/plant, and other. We used GeoPandas67 to present geographic coordinates of samples.Non-redundant smORFs catalog construction and method validationAll the smORFs were first deduplicated at 100% amino-acid identity and 100% coverage. Then we hierarchically clustered the non-singletons at 90% amino-acid identity and 90% coverage using Linclust55,68 with the following parameters: -c 0.9, –min-seq-id 0.9. Linclust is a single-linkage approach, whereby sequences are clustered together if they share a common representative with candidate representatives being chosen heuristically.Of these clusters, 47.5% contain a single sequence (singleton clusters). To rule out the possibility that this was due to the fact that Linclust55,68 is a heuristic method that is not specifically designed for short sequences, we estimated the rate of false negatives (i.e., sequences that were marked as singleton even though they should have been clustered with another one). We aligned a randomly selected 1000 singleton clusters against the representative sequences of non-singleton clusters (i.e., those containing ≥2 sequences) using SWIPE69 with the following parameters: -a 18 -m ‘8 std qcovs’ -p 1. The alignment threshold was E value < \({10}^{-5}\), identity ≥90%, and coverage ≥90% (Supplementary Fig. 1a).In addition, to estimate the rate of false positive clusterings (sequences that were assigned to a cluster even though they do not share the required identity with the cluster representative), 1000 sequences were randomly selected and aligned against the representative sequences of their clusters using SWIPE69 with the following parameters: -a 18 -m ‘8 std qcovs’ -p 1. The alignment threshold was E value < \({10}^{-5}\), identity ≥90%, and coverage ≥90% (Supplementary Fig. 1b).When clustering, we initially discarded the singletons because singletons are enriched in artifactual smORFs36. However, we considered that singletons that are homologous to larger clusters should not be discarded as the homology itself provides further evidence of biological relevance. Therefore, we aligned singletons to the representative sequences of clusters with 90% sequence identity and 90% coverage using DIAMOND54 using parameters: -e \({10}^{-5}\) –id 90 -b 12 -c 1 –query-cover 90 –subject-cover 90. We combined the homolog singletons and the non-singleton sequences identified earlier and termed them the smORFs catalog containing 964,970,496 smORFs.Sample-based smORFs rarefaction curvesSamples were randomly permuted 24 times to calculate the total number of non-redundant smORFs captured as the number of samples increased. We took the average across the permutations as the final estimate.Quality control of the catalogWe conducted several in silico quality tests and matched genomic predictions to other publicly available experimental data.A smORF predicted at the start of a contig that is not preceded by an in-frame STOP codon risks being a false positive originating from an interrupted fragment. Therefore, we checked for the presence of an upstream in-frame STOP. For smORFs without an upstream in-frame STOP, however, we could not determine whether there were other genes present upstream of them (Supplementary Fig. 3a).To avoid spurious smORFs, we used HMMSearch70 with the –cut_ga option to search smORFs against the AntiFam 7.0 database71, which contains a series of confirmed spurious protein families.We used RNAcode37, a tool to predict the coding potential of sequences based on evolutionary signatures, to identify the coding potential of 25,744,932 smORF families containing ≥8 sequences. The smORF families with P value < 0.05 were considered to have coding potential, as in a previous study33 (Supplementary Fig. 4a).Furthermore, we searched for evidence that these smORFs are transcribed and/or translated. For this step, we downloaded 221 publicly available metatranscriptomic datasets from the NCBI database paired with the metagenomic samples we used in our catalog (Supplementary Data 3). These samples are from the human gut, peat, plant, and symbionts. To keep the procedure computationally feasible, we mapped reads against the representative sequences of smORF families by BWA72. Then we used NGLess65 with ‘unique_only’ for the ‘multiple’ argument of the count built-in function to only count uniquely mapped inserts. A smORF family was considered to have transcriptional evidence if its representative has reads mapped to it in at least 2 samples (Supplementary Fig. 4b). Furthermore, we mapped reads against the smORFs in paired metagenomic and metatranscriptome samples, separately. On average, 58.6% of the smORFs in each paired sample are mapped.We downloaded 142 publicly available Ribo-Seq datasets from the NCBI database (Supplementary Data 3). We also mapped reads against representative sequences of smORF families by BWA72. Then we used NGLess65 with ‘unique_only’ for the ‘multiple’ argument of the count built-in function to only count uniquely mapped inserts. A smORF family was considered to have translation evidence only if its representative has reads mapped to it in at least 2 samples (Supplementary Fig. 4c).Moreover, we downloaded peptide datasets from 108 metaproteomic projects from the PRIDE database73 (Supplementary Data 3). We matched GMSC smORFs to the identified peptides of each project. If the total k-mer coverage of peptides on a smORF is greater than 50%, then the smORF is considered translated and detected, as in a previous study74 (Supplementary Fig. 4d).Sequences that passed all in silico tests above as well as matching transcriptional or translational data were regarded as high-quality predictions.Comparison with reference small protein datasetsWe downloaded bacterial and archaeal protein sequences from RefSeq in March 202338, consensus sequences of NMPFamsDB61 and sequences for each FESNov gene family28. The sequences with fewer than 100 amino acids are considered small proteins, and redundancy was subsequently removed with 100% amino-acid identity and 100% coverage. A total of 16,333,323 bacterial small proteins, 368,769 archaeal small proteins from RefSeq, 56,786 small proteins from NMPFamsDB, and 630,375 small proteins from FESNov families were included in the comparison. We also included the 444,053 small protein clusters and 4539 conserved small protein families from Sberro’s human microbiome study33. We compared our smORFs catalog to these datasets using DIAMOND with the ‘–more-sensitive’ mode, retaining significant hits (E value < \({10}^{-5}\)). In addition, we compared our smORFs catalog with small protein sequences provided in current small protein database mainly about eukaryotic organisms. We downloaded small proteins from human, mouse, yeast, rat, E. coli, C. elegans, fruitfly, zebrafish, and small proteins from LiteratureMining, KnownDatabase, and MSfragments from SmProt2 database62; small proteins from human, mouse, rat, zebrafish, fruitfly, C. elegans of sORF.org database64; and all predicted refprots, altprots, and isoforms sequences with all annotations from human, chimp, rat, mouse, zebrafish, fruitfly, C. elegans, and yeast from OpenProt2.0 database63. After filtering small proteins by length and removing redundancy as above, 788,586 small proteins from SmProt2 database, 4,377,422 small proteins from sORF.org database, and 1,781,907 small proteins from OpenProt2.0 database were included in the comparison. As above, we compared our smORFs catalog to these datasets using DIAMOND with the ‘–more-sensitive’ mode, retaining significant hits (E value < \({10}^{-5}\)).Conserved domain annotationWe downloaded the CDD39 from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/Cdd_LE.tar.gz in September 2022, which contains models from CD curated at NCBI, Pfam45, SMART75, COGs48, PRK76, and TIGRFAMs77. All the representative sequences of small protein families were searched against the CDD by RPS-BLAST40,41. In order to establish a comparison baseline, we additionally randomly selected 10,000 prokaryotic proteins from the global microbial gene catalog v1.020 and searched them against the CDD by RPS-BLAST40,41. Hits with an E-value maximum of 0.01 and at least 80% of coverage of PSSM’s length were retained and considered significant. Pfam accessions were grouped by Pfam clan78 or the first phrase before the comma in their short description.Taxonomic annotation and taxonomic breadth analysisThe taxonomy of assembled contigs encoding the small proteins was annotated using MMseqs2 taxonomy42 against the GTDB database43 release r95. However, in figures and text, we used updated taxon names (e.g., Bacillota instead of Firmicutes). We characterized the taxonomy of predicted smORFs based on the taxonomy of contigs and microbial genomes34 from which the smORFs were predicted. We subsequently assigned taxonomy for GMSC smORFs and families using the lowest common ancestor, ignoring the un-assigned ranks to make them more specific.The small protein families with at least three members were subsequently used to perform taxonomic breadth analysis. Each family was classified according to (i) whether it is single or multi-habitat; (ii) whether it is single or multi-genus; and (iii) whether it is annotated with a Pfam domain45. Multi-genus families are more common in multiple habitats than the entire families (61.8% vs. 52.0%; P value < \({10}^{-308}\), hypergeometric test). Multi-genus families are annotated with Pfam domains at a higher rate (9.91% vs 7.52%; P value < \({10}^{-308}\), hypergeometric test). As these results could have been confounded by differences in family size distributions, we randomly downsampled the data to keep the same number of families at each size between multi-genus families and the whole families. In that case (as presented in the main text), the difference was 61.8% vs. 57.5% (P value < \({10}^{-308}\), hypergeometric test) for the proportion of families in multiple habitats and 9.91% vs. 8.15% (P value < \({10}^{-308}\), hypergeometric test) for the proportion of Pfam annotated families.Density calculationThe density of smORFs was defined as \(\rho\) = \({n}_{{smORFs}}\) / \(L\), where \({n}_{{smORFs}}\) is the number of redundant smORFs and L is the assembled megabase pairs (Mbps)32,46. The density was calculated by summing all assembled base pairs for contigs assigned to each taxonomic rank. We assume a scenario where the starting positions of smORFs in an assembled large contig are independent and uniformly random. Therefore, the standard sample proportion error was calculated as \({ST}{D}_{{err}}=\sqrt{\frac{\rho*\left(1-\rho \right)}{L}}\) and was used to calculate the margin of error at a 95% confidence interval (Z = 1.96). We did not further consider the calculated values with a margin of error >10%.Cellular localization predictionTo detect potential transmembrane proteins, we ran TMHMM-2.049 on the representative sequences of small protein families. Then, to identify potentially secreted small proteins, we used SignalP-5.050 on the representative sequences of small protein families. For families classified as archaea, we used ‘-org arch’, while for the others we combined the outputs of ‘-org gram + ’ and ‘-org gram-’ modes.Construction and evaluation of GMSC-mapperGMSC-mapper supports assembled contigs, smORF sequences, or protein sequences as inputs. It uses Pyrodigal35,53, which is a faster implementation of the Prodigal algorithm, to predict ORFs potentially coding for small proteins (those with fewer than 100 amino acids) from contigs. Gene prediction is skipped when inputs are smORF or protein sequences. Then DIAMOND54 or MMseqs255 are used for homologous alignment against GMSC. Finally, it combines all the alignment hits information and provides detailed annotation of small proteins.To determine the optimal default sensitivity mode, we tested different sensitivity parameters for DIAMOND and MMseqs2. We aligned 10,000 randomly selected sequences back to the family representatives and counted the number of recovered sequences while monitoring the computational time. We use the “–sensitive” mode as the default sensitivity parameter for DIAMOND, which provides the best balance between sensitivity and speed. The use of more-sensitive modes resulted in little or almost no increase in the number of recovered sequences, but a substantial increase in time usage. For MMseqs2, we keep the original default sensitivity parameter (5.7) considering that the number of recovered sequences and the time both increase with the increase of sensitivity (Supplementary Fig. 6a-d).We then tested the time costs among different numbers of input sequences using the “–sensitive” mode of DIAMOND and the default sensitivity parameter (5.7) of MMseqs2. GMSC-mapper can annotate 100,000 input sequences in approximately 20 minutes with 20 threads.Furthermore, we compared the number of recovered sequences with different identities using different alignment tools. We randomly selected and mutated 10,000 sequences of different lengths (20, 30, 40, 60, and 80) from the family representatives, with different identities from 10% to 90%. We aligned them back to the family representatives using DIAMOND, MMseqs2, and BLAST56, respectively. When the query sequence and the target sequence are the same, we consider them as the recovered sequences.Timing measurements were performed using a server equipped with an AMD EPYC 7763 64-Core processor and 2TB of RAM memory.Statistics and reproducibilityStatistical analyses were carried out in Python 3.8.5, using Pandas79 1.1.3, NumPy80 1.24.4, and SciPy81 1.10.1. No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.GMSC web resourceGMSC webserver is hosted at the address https://gmsc.big-data-biology.org, where an implementation of GMSC-mapper can be accessed. The website implementation is based on Elm-Lang. The API implementation is based on Python.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles