Physics-informed machine learning predicts protein function

Understanding protein function is essential for unraveling biological processes, disease mechanisms, and evolutionary pathways at the molecular level. Despite advances in sequencing and computational methods, accurately annotating protein functions, particularly at the residue level, remains challenging. The majority of proteins lack detailed functional annotations, which hinders comprehensive insight into their roles in cellular activities. Classical methods for function annotation are limited by sequence complexity, prompting the development of computational approaches, including deep learning, which excel at predicting protein structures but struggle with function prediction. In a recent publication in Nature Communications, we developed PhiGnet, a novel physics-informed learning approach that leverages evolutionary data through graph convolutional networks to enhance the precision of function annotation at the residue level. We showed that, by capturing coevolutionary relationships between residues, PhiGnet not only identifies functional sites within proteins but also quantifies the significance of individual residues in specific biological functions.
The Problem
Proteins are the workhorses of biological systems, playing indispensable roles in virtually every cellular process, from catalyzing reactions to transmitting signals and providing structural support. Understanding their functions is crucial for deciphering fundamental biological mechanisms, addressing diseases, and engineering novel therapeutics. A protein’s amino acid sequence contains the necessary information for its three-dimensional structure and governs how it interacts with other molecules, thereby enabling it to carry out its specific functions within cells. Despite the monumental efforts in genome sequencing that have yielded an immense database of protein sequences, functional annotation remains a significant challenge. As of recent estimates, the UniProt database contains over 356 million protein entries, with approximately 80% lacking detailed functional annotations beyond their primary sequences. This gap underscores a critical bottleneck in translating genomic data into actionable biological knowledge.
Computational approaches have emerged as promising alternatives to classical annotation methods. Deep learning methods have revolutionized protein structure prediction by learning from vast datasets without a priori assumptions about sequence-structure relationships. These methods leverage neural networks with millions of parameters to predict protein structures with unprecedented accuracy, often rivaling experimental methods. Yet accurately predicting protein functions remains elusive, primarily because of the complex and multifaceted functional diversity encoded within protein sequences.
The challenge lies not only in predicting functions accurately but also in interpreting the biological significance of these predictions. Computational tools must distinguish residues that are essential for protein function from those that are merely structurally conserved. This distinction is key to understanding the mechanisms underlying protein activity, identifying disease-associated variants, and engineering proteins with desired functionalities for biotechnological applications. Moreover, the disparity between the abundance of sequenced proteins and the scarcity of experimentally determined structures further complicates function prediction. While computational models can predict structures with high accuracy, how reliably those predicted structures translate into accurate function annotations varies significantly. Factors such as the confidence scores of predicted structures and the inherent variability of computational modeling make it difficult to achieve consistent and reliable function predictions across diverse protein families.
Our Method
To address these challenges, we introduce PhiGnet, a physics-informed learning approach devised to annotate protein functions at the residue level. PhiGnet leverages evolutionary couplings between residues across diverse protein sequences, which reflect coevolutionary relationships shaped by functional constraints over evolutionary time scales. These coevolutionary signals indicate residues that interact or collaborate to maintain protein structure and function, even across large evolutionary distances. PhiGnet centers on two stacked graph convolutional networks (GCNs) that are designed to capture the relationships within evolutionary couplings and the hierarchical couplings within residue communities. The first GCN extracts features from the protein sequence and its evolutionary couplings, encapsulating the coevolutionary patterns that underpin functional relationships. The second GCN then integrates the hierarchical couplings within residue communities to identify functional sites. By combining these features, PhiGnet learns to generalize across diverse protein sequences and accurately predict functional annotations.
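The two-stage design described above can be illustrated with a minimal NumPy sketch. This is not the published PhiGnet implementation: the thresholding of couplings into graph edges, the mean pooling, the layer sizes, and every name (`normalize_adjacency`, `gcn_layer`, `stacked_gcn`, the toy `coupling` and `community` matrices, the random weights) are assumptions made only to show how one graph convolution over evolutionary couplings can feed a second one over residue communities.

```python
import numpy as np

def normalize_adjacency(scores, threshold=0.5):
    """Build a symmetrically normalized adjacency from residue-residue scores."""
    adj = (scores >= threshold).astype(float)  # keep strong pairs as edges (assumed rule)
    np.fill_diagonal(adj, 1.0)                 # self-loops, so every degree is >= 1
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, features, weights):
    """One graph convolution: aggregate neighbors, project, apply ReLU."""
    return np.maximum(a_hat @ features @ weights, 0.0)

def stacked_gcn(coupling, community, features, w1, w2, w_out):
    """Two stacked graph convolutions followed by a protein-level readout."""
    h1 = gcn_layer(normalize_adjacency(coupling), features, w1)   # evolutionary couplings
    h2 = gcn_layer(normalize_adjacency(community), h1, w2)        # residue communities
    logits = h2.mean(axis=0) @ w_out                              # mean-pool over residues
    return 1.0 / (1.0 + np.exp(-logits))                          # one probability per function

# Toy inputs: 6 residues, 4 features per residue, 3 hypothetical function labels.
rng = np.random.default_rng(0)
n_res, n_feat, n_hidden, n_func = 6, 4, 8, 3
coupling = rng.random((n_res, n_res))
coupling = (coupling + coupling.T) / 2              # coupling scores are symmetric
community = np.kron(np.eye(2), np.ones((3, 3)))     # two communities of three residues
features = rng.random((n_res, n_feat))
w1 = 0.1 * rng.normal(size=(n_feat, n_hidden))
w2 = 0.1 * rng.normal(size=(n_hidden, n_hidden))
w_out = 0.1 * rng.normal(size=(n_hidden, n_func))

probs = stacked_gcn(coupling, community, features, w1, w2, w_out)
print(probs.shape)
```

In a trained model the weights would be learned from annotated proteins rather than drawn at random; here they only make the sketch executable.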
Furthermore, PhiGnet introduces interpretability into its predictions by quantifying the significance of each residue with respect to specific biological functions. This capability not only aids in prioritizing functionally important residues for further experimental validation but also provides insights into the molecular mechanisms governing protein activity.
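One generic way to obtain a per-residue significance score is occlusion sensitivity: mask one residue at a time and record how much the predicted probability for a given function drops. The sketch below illustrates that idea only. It is an assumed stand-in, not necessarily how PhiGnet computes its scores, and `predict_function_score`, `occlusion_scores`, and the toy feature matrix are hypothetical.

```python
import numpy as np

def predict_function_score(features, weights):
    """Hypothetical stand-in for a trained model: probability of one function."""
    per_residue = features @ weights                  # one contribution per residue
    return 1.0 / (1.0 + np.exp(-per_residue.mean()))

def occlusion_scores(features, weights):
    """Score each residue by the drop in prediction when its features are zeroed."""
    baseline = predict_function_score(features, weights)
    drops = np.empty(features.shape[0])
    for i in range(features.shape[0]):
        masked = features.copy()
        masked[i, :] = 0.0                            # occlude residue i
        drops[i] = baseline - predict_function_score(masked, weights)
    span = drops.max() - drops.min()
    if span == 0:
        return np.zeros_like(drops)                   # all residues equally (un)important
    return (drops - drops.min()) / span               # rescale to [0, 1]

# Toy protein with 5 residues and 2 features; residue 2 carries most of the signal.
features = np.array([[0.1, 0.2],
                     [0.2, 0.1],
                     [2.0, 1.5],
                     [0.1, 0.1],
                     [0.3, 0.2]])
weights = np.array([1.0, 1.0])

scores = occlusion_scores(features, weights)
top_residue = int(np.argmax(scores))
print(top_residue, np.round(scores, 2))
```

Residues with scores close to 1 would be the first candidates for experimental validation under this scheme.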
Overall, PhiGnet represents a novel approach that bridges the gap between sequence and function by harnessing evolutionary insights. By combining advanced machine learning techniques with evolutionary data, PhiGnet offers a promising pathway towards enhancing our understanding of protein function diversity and complexity, thereby advancing biomedical research and biotechnological applications.
Our Results
PhiGnet accurately assigns function annotations to proteins. In evaluations on benchmark datasets and comparisons with existing methods, PhiGnet consistently outperforms state-of-the-art (SOTA) approaches in predicting functional annotations. We attribute this improvement to PhiGnet’s ability to leverage evolutionary data. Moreover, PhiGnet can identify functionally relevant residues within proteins. This capability provides insight into the molecular basis of protein activity and helps pinpoint crucial residues in catalytic sites, ligand-binding pockets, and allosteric sites, all of which are fundamental to drug discovery and enzyme engineering. Overall, by harnessing evolutionary information effectively, PhiGnet not only improves the accuracy of function prediction but also quantifies the significance of individual residues.
Outlook
Looking forward, future developments could focus on enhancing PhiGnet’s interpretability, scalability, and application across various biological contexts. Integrating multi-omics data and refining evolutionary insights could further boost its predictive power and expand its applicability in understanding complex biological systems. 
