Quantum chemical calculation dataset for representative protein folds by the fragment molecular orbital method

The three-dimensional structures of biological macromolecules such as proteins and nucleic acids are crucial for understanding their functions. These structures can be determined experimentally using X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy. As a result of such studies, more than 200,000 structures are available from the Protein Data Bank (PDB) on the websites of the wwPDB group members1,2,3. Recently, AlphaFold24 has made it possible to generate accurate protein model structures even in the absence of experimental information. UniProt5 provides a database of AlphaFold2 model structures, called the AlphaFold Protein Structure Database (AlphaFold DB)6. Because new insights obtained from such reliable structures are useful, the accumulation of computational data from simulations is expected to become increasingly important.

There are two major computational methodologies for biomacromolecules: molecular dynamics (MD) simulations7, which investigate dynamic behavior, and quantum mechanical (QM) calculations, which determine precise electronic states. MD simulations are used to study loop flexibility, molecular conformation in solvents, and especially interactions with ligand molecules. Although MD simulations account for dynamic structural changes, they typically employ fixed charges. However, biological macromolecules perform their functions by forming specific atomic networks, including hydrogen bonds, ionic bonds, and nonpolar interactions, all of which involve the structure-dependent electronic state. QM is a promising non-empirical approach through which the electronic state of a given molecular conformation can be determined. In general, the computational cost of QM calculations is approximately proportional to the fourth to sixth power of the number of basis functions; therefore, QM is mostly applied to small molecules. Several methods have been developed to overcome this limitation.
QM/MM techniques such as ONIOM are hybrid approaches that logically partition molecules, enabling quantum chemical calculations in targeted regions and molecular force field calculations in others. Such methods have also been used to study chemical and enzymatic reactions8.

Currently, the fragment molecular orbital (FMO) method9 is a promising full-QM method applicable to biological macromolecules. The FMO method divides biological macromolecules such as proteins and nucleic acids into residue-based fragments and performs quantum chemical calculations on them (Fig. 1a). The FMO method has been implemented in software programs such as GAMESS10,11,12 and ABINIT-MP13,14,15 and is still under development.

Fig. 1 Summary of the dataset of QM-based energies of protein structures by the FMO method. (a) The structure of a protein can be divided into fragments based on amino acid units. (b) IFIE/PIEDA data are calculated based on interactions between fragments. (c) The dataset includes protein atomic coordinates and their IFIE/PIEDA energy data.

The data obtained from the FMO method include the inter-fragment interaction energy (IFIE, also called the pair interaction energy (PIE)), total energy, and atomic charges. IFIE/PIE has the advantage of describing residue-by-residue interactions and facilitating the energy interpretation of inter- and intramolecular interactions (Fig. 1b). Pair interaction energy decomposition analysis (PIEDA)16 decomposes the IFIE into electrostatic interaction (ES), exchange repulsion (EX), charge transfer with higher-order mixed-term interactions (CT + mix), and dispersion interaction (DI) components, and can be used to quantitatively determine which of these components contributes most strongly to the binding between fragments. For example, hydrogen bonds, which frequently occur in the main and side chain interactions of amino acid residues, can be evaluated in terms of the ES and CT + mix components.
The DI component is particularly suitable for evaluating nonpolar interactions and contributes strongly to CH/π and π–π interactions17,18,19,20,21. Computational simulations of protein-ligand binding based on experimental structures have been reported22,23.

The IFIE and PIEDA in the FMO method are related as follows. The total energy of a molecule can be calculated using the following equation9:
$${E}_{{\rm{total}}}\approx {\sum }_{I > J}^{N}\left({E}_{{IJ}}^{{\prime} }-{E}_{I}^{{\prime} }-{E}_{J}^{{\prime} }\right)+{\sum }_{I > J}^{N}{\rm{Tr}}\left({\triangle D}^{{IJ}}{V}^{{IJ}}\right)+{\sum }_{I}^{N}{E}_{I}^{{\prime} }$$
(1)
where ${E}_{{IJ}}^{{\prime} }$, ${E}_{I}^{{\prime} }$, and ${E}_{J}^{{\prime} }$ are the energies, without the environmental electrostatic potential, of the fragment pair IJ, fragment I, and fragment J, respectively; N is the number of fragments in the molecule; ${\triangle D}^{{IJ}}$ is the difference density matrix; and ${V}^{{IJ}}$ is the electrostatic potential of the surrounding fragments. The IFIE is defined using the following equation:
$${\triangle E}_{{IJ}}=\left({E}_{{IJ}}^{{\prime} }-{E}_{I}^{{\prime} }-{E}_{J}^{{\prime} }\right)+{\rm{Tr}}\left({\triangle D}^{{IJ}}{V}^{{IJ}}\right)$$
(2)
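As a minimal numerical sketch of how Eqs. (1) and (2) fit together: substituting Eq. (2) into Eq. (1) shows that the total energy is approximately the sum of all pairwise IFIEs plus the sum of the monomer energies. The function names and every numerical value below are hypothetical and chosen only for illustration; they are not taken from the dataset or from the output of any FMO program.

```python
from itertools import combinations


def ifie(e_dimer, e_i, e_j, trace_term):
    """Eq. (2): IFIE of a fragment pair from dimer and monomer energies.

    trace_term stands for Tr(dD^IJ V^IJ), supplied here as a plain number.
    """
    return (e_dimer - e_i - e_j) + trace_term


def fmo2_total_energy(monomer_energies, pair_ifie):
    """Eq. (1) rewritten with Eq. (2): pair IFIEs plus monomer energies."""
    n = len(monomer_energies)
    # Each unordered fragment pair is counted exactly once (keys are (i, j), i < j).
    pair_sum = sum(pair_ifie[(i, j)] for i, j in combinations(range(n), 2))
    return pair_sum + sum(monomer_energies)


# Hypothetical three-fragment system; all energies in hartree.
monomers = [-1.0, -2.0, -3.0]
pairs = {
    (0, 1): ifie(e_dimer=-3.012, e_i=-1.0, e_j=-2.0, trace_term=0.002),
    (0, 2): ifie(e_dimer=-4.000, e_i=-1.0, e_j=-3.0, trace_term=0.000),
    (1, 2): ifie(e_dimer=-5.025, e_i=-2.0, e_j=-3.0, trace_term=0.005),
}

total = fmo2_total_energy(monomers, pairs)
print(f"E_total = {total:.3f} hartree")
```

In this toy system the pair terms are small corrections to the sum of monomer energies, which is why the IFIE can be read directly as a residue-by-residue interaction energy.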
The components of the PIEDA16 can be obtained from the following equation:
$${\triangle E}_{{IJ}}=\triangle {E}_{{IJ}}^{{\rm{ES}}}+\triangle {E}_{{IJ}}^{{\rm{EX}}}+\triangle {E}_{{IJ}}^{{\rm{CT}}+{\rm{mix}}}+\triangle {E}_{{IJ}}^{{\rm{DI}}}$$
(3)
where the IFIE is described by four types of energy terms.

Among quantum chemistry datasets, QM9 is well known; it contains quantum chemical calculation values for molecular structures consisting of up to nine non-hydrogen atoms24. Our group also provides FMO calculation data through a database, FMODB, which contains the electronic states of biological macromolecules25. As of 23 July 2024, FMODB included 37,450 entries constructed from 7,783 unique PDB entries. Such datasets are used for machine learning applications, and all-electron data on proteins are already being used for the construction of artificial intelligence platforms and other purposes26. The data registered in the FMODB depend on the interests of researchers. For example, there are many calculations for the protein kinase family (e.g., CDK2, p38 MAP, and Aurora), the nuclear receptor family (e.g., ERα and ERβ), SARS-CoV-2-related proteins27, and apoproteins from X-ray crystal structure data25,28. The authors aim to make FMO calculation results available for all structures deposited in the PDB to support a wide range of applications of the FMO method. As of September 2024, there were more than 220,000 entries in the PDB; however, analyzing all entries is only possible if sufficient computing resources, such as supercomputers, can be used without restrictions. Because the convergence of FMO calculations depends on the atomic coordinates of proteins and can be unpredictable for individual proteins owing to variations in amino acid sequences and in experimental conditions such as resolution, it is advisable to gather data on the convergence rate and distribution of FMO-based energies for representative structures before performing FMO calculations for all proteins in the PDB.

SCOP2, a database of protein folds, was selected as the source of structures in this study to provide FMO calculation data for a wide range of proteins29,30.
SCOP2 is a hierarchical classification of protein folds based on their structural and evolutionary relationships. It was derived from a subset of experimentally determined protein structures deposited in the PDB. The database is updated periodically to incorporate new families and structures. As of 29 June 2022, SCOP2 comprised 5,936 families. In this study, we present a comprehensive FMO computational dataset that encompasses all the experimentally characterized protein folds. This dataset, derived from protein structures associated with SCOP2 families, serves as a valuable resource for assessing the current capabilities of FMO methods and enables researchers to readily access quantum chemistry data for folds of interest.

In the FMO method, as in any QM calculation, the choice of calculation method and basis set is paramount for obtaining reliable and accurate results. The Hartree–Fock (HF) method is a fundamental ab initio quantum chemical method that approximates the ground-state wave function of a molecular system with a single Slater determinant. Although the STO-3G minimal basis set offers computational cost advantages, at least a double-zeta basis set with polarization functions is required to describe the various interactions in biomolecules. In the context of FMO calculations, the MP2/6-31G* level of theory (FMO-MP2/6-31G*) is preferred because of its balance between accuracy and computational cost. This is because, in contrast to the HF method, the MP2 method (second-order Møller–Plesset perturbation theory)31,32,33 accounts for electron correlation, and the 6-31G* basis set incorporates polarization functions for non-hydrogen atoms. FMO-MP2/6-31G* is frequently applied to the study of medium-sized organic compounds and the analysis of intermolecular interactions, including hydrogen bonding, CH/π34, and π–π interactions, between small molecules and proteins35,36.
In addition, all of the data published in the FMODB use this level of theory25. The validation of energy values derived from the FMO method with various combinations of calculation methods and basis sets has been confined to a limited number of systems37. However, the recent development of supercomputers has enabled the use of higher levels of theory.

Basis functions are mathematical representations that approximate the spatial distribution of electrons within atomic orbitals. These functions are employed to express the molecular orbitals as linear combinations of atomic orbitals. The characteristics of the basis sets used in this study are listed in Table 1. In this study, we augmented the 6-31G basis set by incorporating polarization functions either for non-hydrogen atoms only (6-31G*) or for both non-hydrogen and hydrogen atoms (6-31G**), thereby enhancing the accuracy of the electronic structure calculations. In addition, we used the correlation-consistent polarized valence double-zeta (cc-pVDZ) basis set, which was specifically designed to account for electron correlation effects. Consequently, our dataset encompasses the FMO-MP2/6-31G*, FMO-MP2/6-31G**, and FMO-MP2/cc-pVDZ levels of theory. While MP2/6-31G* includes polarization functions only for non-hydrogen atoms (additional d-type functions), both MP2/cc-pVDZ and MP2/6-31G** also include them for hydrogen atoms (additional p-type functions). The cc-pVDZ basis set is distinguished by its use of Dunning-type functions and its design as a correlation-consistent basis set38.
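To illustrate how the three basis sets differ in size, the following sketch counts contracted basis functions for a small molecule. The per-atom counts are standard textbook values under the assumption of six Cartesian d functions for the Pople-type sets and five spherical d functions for cc-pVDZ; individual programs may use other conventions, and these numbers are not taken from Table 1. The helper name and the choice of glycine as the example are illustrative assumptions.

```python
# Contracted basis functions per atom (assumed conventions: 6 Cartesian d
# functions for 6-31G* and 6-31G**, 5 spherical d functions for cc-pVDZ).
BASIS_FUNCTIONS = {
    "6-31G*":  {"H": 2, "C": 15, "N": 15, "O": 15, "S": 19},
    "6-31G**": {"H": 5, "C": 15, "N": 15, "O": 15, "S": 19},
    "cc-pVDZ": {"H": 5, "C": 14, "N": 14, "O": 14, "S": 18},
}


def count_basis_functions(composition, basis):
    """Total number of basis functions for an element -> atom-count mapping."""
    per_atom = BASIS_FUNCTIONS[basis]
    return sum(per_atom[element] * n for element, n in composition.items())


# Glycine, C2H5NO2: five non-hydrogen atoms and five hydrogen atoms.
glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
counts = {basis: count_basis_functions(glycine, basis) for basis in BASIS_FUNCTIONS}

for basis, n_functions in counts.items():
    print(f"{basis:8s} {n_functions:4d} basis functions")
```

Under these assumptions, the only difference between 6-31G* and 6-31G** is the three p-type polarization functions added to each hydrogen atom, so the gap between the two grows with the number of hydrogen atoms in the fragment.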
Because CH/π and π–π interactions, which arise from dispersion forces related to electron correlation, as well as hydrogen bonds contribute to protein folding, the use of either 6-31G** or cc-pVDZ is considered necessary to properly evaluate the polarization of hydrogen atoms.

Table 1 Properties of the basis sets used in this study.

In summary, there is currently no quantum chemical dataset encompassing over 5,000 protein structures classified into diverse families and computed using multiple quantum chemical levels of theory. This dataset is not only instrumental for protein function and interaction analysis but is also anticipated to serve as training data for the development of machine learning models for protein charge prediction. Notably, providing energy values calculated using three distinct basis sets for the same fragment pairs facilitates the analysis of the effects of hydrogen atom polarization and electron correlation on intermolecular interactions.
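The kind of cross-basis comparison described above can be sketched as follows. The record layout, the function name, and all energy values are hypothetical placeholders (in kcal/mol) and do not come from the dataset; the sketch only assumes that each fragment pair carries the four PIEDA components of Eq. (3) at each level of theory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pieda:
    """PIEDA components of one fragment pair (kcal/mol)."""
    es: float      # electrostatic
    ex: float      # exchange repulsion
    ct_mix: float  # charge transfer with higher-order mixed terms
    di: float      # dispersion

    @property
    def ifie(self):
        # Eq. (3): the IFIE is the sum of the four components.
        return self.es + self.ex + self.ct_mix + self.di


def component_shift(reference, other):
    """Per-component change when going from one basis set to another."""
    return {
        "ES": other.es - reference.es,
        "EX": other.ex - reference.ex,
        "CT+mix": other.ct_mix - reference.ct_mix,
        "DI": other.di - reference.di,
    }


# Hypothetical values for a single fragment pair at three levels of theory.
pair = {
    "6-31G*":  Pieda(es=-12.0, ex=6.0, ct_mix=-2.5, di=-3.0),
    "6-31G**": Pieda(es=-12.4, ex=6.1, ct_mix=-2.6, di=-3.3),
    "cc-pVDZ": Pieda(es=-11.8, ex=5.9, ct_mix=-2.4, di=-3.6),
}

shift = component_shift(pair["6-31G*"], pair["6-31G**"])
for basis, pieda in pair.items():
    print(f"{basis:8s} IFIE = {pieda.ifie:7.2f} kcal/mol")
print("6-31G* -> 6-31G** shift:", {k: round(v, 2) for k, v in shift.items()})
```

Because the components sum to the IFIE, the per-component shifts also sum to the change in IFIE, which makes it possible to attribute a basis-set effect to a specific type of interaction.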
