Small proteins – big potential, the global microbial small ORFs catalog

There is a bigger unknown world beyond the world we see.
Our research group established the Global Microbial Gene Catalog (GMGC) to explore the distribution patterns of microbial genes worldwide. In 2020, inspired by Sberro et al.’s research on small proteins of the human microbiome, we turned our attention to microbial small proteins, which have often been overlooked in previous research. Few studies have explored microbial small proteins both broadly and in depth, so when I first entered this field, my first question was simply what a small protein is. I found that the definition varies between studies: some use 50 amino acids as the size cut-off, while others use 70 or 100. Here, we define small proteins as proteins containing fewer than 100 amino acids. Some small proteins are peptides produced by the hydrolysis of larger precursor proteins, but our research focuses on small proteins translated directly from small open reading frames (smORFs).
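To make the size cut-off concrete, here is a minimal Python sketch of what "a small protein translated directly from a smORF" means in practice. It is only an illustration and not the method used to build our catalog: the function name `find_smorfs`, the `min_aa` lower bound, and the restriction to the forward strand with ATG as the only start codon are all simplifying assumptions.

```python
# Illustrative sketch only: scans the forward strand for ATG...stop ORFs
# that encode fewer than 100 amino acids. Real gene predictors also handle
# the reverse strand, alternative start codons, and statistical scoring.

STOP_CODONS = {"TAA", "TAG", "TGA"}   # stop codons of the standard genetic code
MAX_SMALL_PROTEIN_AA = 100            # small protein: fewer than 100 amino acids


def find_smorfs(dna, min_aa=10):
    """Return (start, end, n_aa) tuples for candidate smORFs.

    start/end are 0-based, end-exclusive nucleotide positions (stop codon
    included); n_aa counts encoded residues, excluding the stop codon.
    min_aa is a hypothetical lower bound chosen for this example.
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3        # residues before the stop codon
                if min_aa <= n_aa < MAX_SMALL_PROTEIN_AA:
                    hits.append((start, i + 3, n_aa))
                start = None                   # look for the next ORF
    return hits


if __name__ == "__main__":
    toy_sequence = "ATG" + "GCT" * 14 + "TAA"  # made-up sequence, 15 residues
    print(find_smorfs(toy_sequence))
```

With this definition, an ORF encoding 99 residues counts as a smORF, while one encoding 100 or more does not.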
I also found that small proteins are widely present in all three domains of life. In eukaryotes, the small proteins of model organisms such as Arabidopsis thaliana and Mus musculus have been identified and validated systematically. In prokaryotes, small proteins have been discovered by chance in some bacteria, where they perform important physiological functions such as regulating gene expression, stabilizing large protein complexes, and participating in signal transduction pathways. Small proteins can also exhibit antibacterial activity or serve as part of toxin/antitoxin (TA) systems.
These findings sparked my interest in studying small proteins. However, I found that both traditional gene annotation methods and experimental methods are poorly suited to small protein research. To prevent false positive predictions, gene prediction tools often ignore small proteins. Because of their size, small proteins are also hard to enrich experimentally, which leaves small protein databases incomplete. In turn, methods based on mass spectrometry are hampered by these incomplete databases.
Therefore, we set out to extend the Global Microbial Gene Catalog (GMGC), an integrated, consistently processed gene catalog of the microbial world that combines metagenomes and high-quality sequenced isolates. We constructed the Global Microbial smORFs Catalog (GMSC) from 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes from 75 global habitats. In the catalog, we provide comprehensive annotations for smORFs, including taxonomic classification, habitat assignment, quality assessment, and conserved functional domain annotations. Based on this resource, we also developed the GMSC-mapper tool, which provides detailed smORF annotations for microbial genomes to facilitate the understanding of microbial gene diversity.

We then analyzed the ecological distribution patterns of smORFs. We found that archaea harbor proportionally more smORFs than bacteria, and that a higher proportion of archaeal than bacterial small proteins is predicted to be transmembrane or secreted. We also explored the functions of small protein families found across multiple habitats and phyla. Even conserved small protein families often still lack functional domain annotations. The small proteins that are annotated are mainly related to ribosomal proteins and DNA binding.

In summary, we constructed the Global Microbial smORFs Catalog (GMSC) to study the presence, distribution, prevalence, and potential ecological roles of microbial small proteins on a global scale. In this study, we revealed the vast and unexplored diversity of microbial small proteins across habitats and taxa. In another study, we focused on a class of small proteins with a specific function, antimicrobial peptides (AMPs), and successfully validated the antibacterial activity of some of them.
There is still great potential for exploration in this field, and many questions about the world of small proteins remain open. Some are basic biological questions: do the distribution and roles of small proteins vary with the lifestyle or habitat of microorganisms, and how does the evolution of small proteins resemble or differ from that of larger proteins? Because small proteins generally contain only one domain, they are also promising molecules for studying protein folding. In addition, some small proteins are transmembrane, which makes them potential targets for biomedical applications.
