Quantum Topological Atomic Properties of 44K molecules

Comparison to QM9 database

Since the AIMEl database described here is a subset of the QM9 database, validation is carried out by comparing typical molecular properties: total molecular weight; counts of H acceptors, H donors, electronegative atoms, rotatable bonds, and small rings; the partition coefficient between n-octanol and water (cLogP); and total surface area. The analysis is presented in Fig. 3, where the median and mean values are shown as red lines and white triangles, respectively. Because the AIMEl subset was randomly selected, substantial overlap with the QM9 property set was expected. Indeed, the two data sets present nearly identical median and mean values for most properties. The most notable discrepancy is the number of rotatable bonds: although the mean values are very close, the median is zero in AIMEl and one in QM9. Nevertheless, the AIMEl data set contains diverse structures, including molecules with two or more rotatable bonds.

The original QM9 database included unstable structures. As a first filter, only chemically sound molecules were retained; for instance, structures with non-bonded atoms or overly strained rings were eliminated. This refinement removed 13,402 molecules, leaving a total of 45,900. After further eliminating the molecules that included non-nuclear attractors (NNA), the data set comprises 44,470 molecules.

Fig. 3. Comparison of molecular properties between QM9 and AIMEl databases.

Validation of QTAIM calculations

To check the integrity of the data generated using AIMAll, we performed an error analysis for the energy of the system. We compare the total molecular energy E(mol), obtained from the molecular electronic structure calculation, with the sum of the atomic energies for each molecule, \(\sum_{\Omega}^{N_{\Omega}} E(\Omega)\), where \(N_{\Omega}\) is the total number of atoms. The difference between E(mol) and \(\sum_{\Omega}^{N_{\Omega}} E(\Omega)\) reflects the quality of the atomic integration process.
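The energy check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the function names, the choice of Hartree as the input unit, and the example numbers are all hypothetical. The 1.0 kcal/mol default tolerance simply mirrors the bound the text reports for the AIMEl molecules.

```python
import math

# Assumption: input energies are in Hartree; the text reports errors in kcal/mol.
HARTREE_TO_KCAL_MOL = 627.5095

def integration_error(e_mol, atomic_energies):
    """Return E(mol) minus the sum of atomic basin energies E(Omega), in kcal/mol."""
    # math.fsum limits floating-point round-off when adding many basin energies.
    total_atomic = math.fsum(atomic_energies)
    return (e_mol - total_atomic) * HARTREE_TO_KCAL_MOL

def within_tolerance(error_kcal, tolerance=1.0):
    """True when the absolute integration error is below the tolerance (kcal/mol)."""
    return abs(error_kcal) < tolerance

# Made-up numbers for a three-atom molecule (illustrative only, not from the data set).
e_mol = -76.4000
basins = [-75.6000, -0.4000, -0.3995]
err = integration_error(e_mol, basins)
print(f"integration error: {err:.3f} kcal/mol, acceptable: {within_tolerance(err)}")
```

A molecule whose basin energies fail to add up to E(mol) within the tolerance would indicate a poorly integrated atomic basin rather than a problem with the electronic structure calculation itself.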
For this reason, the error E(mol) − \(\sum_{\Omega}^{N_{\Omega}} E(\Omega)\) is a useful quantity to validate the calculated atomic properties. The error distributions are presented in Fig. 4. Figure 4A shows a direct comparison between E(mol) and \(\sum_{\Omega}^{N_{\Omega}} E(\Omega)\). The results reveal that molecules with larger errors (absolute error above 0.80 kcal/mol) are observed within the range of −3000.0 to −2000.0 kcal/mol. All molecules in the AIMEl data set present error values below 1.0 kcal/mol. These errors follow a normal distribution, as shown in Fig. 4B.

Fig. 4. Error distributions for the molecules in the AIMEl dataset. The absolute error is presented in (A). Here, molecules with larger energy differences appear, ranging from −3000.0 to −2000.0 kcal/mol. In (B), the distribution shows that the highest number of occurrences lies around 0.00 kcal/mol.

S. Senthil et al.28 have studied the lack of chemical sense in molecules within the QM9 dataset. They found that the use of ωB97XD/6-31G(2df,p) leads to geometrically stable structures. In this work, we have found that using a \(\bar{x}\pm 4\sigma\) approximation refines the chemical structures, filtering out molecules with geometric instabilities. In addition, to compare our results with this level of theory, we carried out single-point calculations on a validation subset of 4,397 molecules and obtained their QTAIM descriptors. The results are shown in Table 3. There are no meaningful discrepancies in this comparison. Although larger differences are observed in the magnitude of the dipole and quadrupole moments, the metrics for atomic population and energy are close.
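For reference, the two metrics used in this comparison can be written out as a short sketch. The helper names and the sample values below are hypothetical; they stand in for one atomic property (for example N(Ω)) computed for the same atoms at two levels of theory and are not taken from Table 3.

```python
import math

def mae(reference, other):
    """Mean absolute error between two equally long sequences of property values."""
    if len(reference) != len(other) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - o) for r, o in zip(reference, other)) / len(reference)

def rmse(reference, other):
    """Root mean squared error between two equally long sequences of property values."""
    if len(reference) != len(other) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - o) ** 2 for r, o in zip(reference, other))
    return math.sqrt(squared / len(reference))

# Hypothetical atomic populations for three atoms at two levels of theory.
level_a = [1.0, 6.0, 8.5]
level_b = [1.1, 5.9, 8.3]
print(f"MAE = {mae(level_a, level_b):.4f}, RMSE = {rmse(level_a, level_b):.4f}")
```

Because RMSE squares each deviation before averaging, it is never smaller than MAE and grows faster when a few atoms deviate strongly, which is why the two are usually reported together.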
The larger differences in the dipole and quadrupole magnitudes can be attributed to the nature of these properties, as a small perturbation in the electron density can lead to a significant change in them.

Table 3. Comparison of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of QTAIM properties between the ωB97XD/6-31G(2df,p) and B3LYP/6-31G(2df,p) levels of theory for a subset of 4,397 molecules.

Database characterization

Figure 5 presents the correlation matrix between the atomic properties. This matrix shows a strong correlation between the atomic energy E(Ω) and the atomic population N(Ω). This relationship is interesting because it suggests that the energy of an atomic basin can be estimated from its atomic population alone; that is, the stabilization of the atom correlates directly with its electronic population. The weakest correlations occur with the quadrupole moment ∣Q(Ω)∣, followed by ∣μ(Ω)∣, which implies that these properties can be used to characterize the distribution of the data set.

Fig. 5. Correlation matrix between atomic properties presented in the database.

We present the normalized histograms of the properties studied for the atoms H, C, O, and N in Fig. 6. The values are obtained by filtering the entire database, where each property of each atom is selected using a \(\bar{x}\pm 4\sigma\) approximation; \(\bar{x}\) and σ represent the mean and standard deviation of the sample. In the case of the atomic population, H and C present many instances centered around 1 and 6 a.u., corresponding to their respective atomic numbers. In contrast, for O and N, the distributions are broader and skewed toward larger atomic populations, which could be related to their electronegative nature. Similarly, lower atomic dipoles are observed for H and C, for which most cases range from 0 to 5 a.u. For N and O, the atomic dipoles span a wider range of values (0 to 15 a.u.).
This observation can be attributed to the generally higher reactivity of the N and O atoms29. Concerning the atomic quadrupole moment, the histograms span different ranges depending on the atom, and no clear structure is observed in the distributions. However, in the cases of N and O, the values exhibit distinct regions within the broader range, reflecting the possible diversity of structures within the database. Finally, the atomic energies broadly resemble the population distributions. Hydrogen and carbon are centered around approximately −0.62 Ha and −38.0 Ha, respectively, consistent with the small spread in population values for these two atoms. Nitrogen and oxygen exhibit more diverse distributions with at least two observed peaks, ranging from approximately −55.5 to −54.5 Ha for N and approximately −76.2 to −75.6 Ha for O.

Fig. 6. Histograms of properties. (a) Atomic population, N(Ω); (b) magnitude of the total dipole moment, ∣μ(Ω)∣; (c) magnitude of the total quadrupole moment, ∣Q(Ω)∣; (d) atomic energy, E(Ω). The bins have been adjusted to ensure that the total sum of bar heights equals 1.

Finally, a three-dimensional visualization is presented in Fig. 7, which shows three electronic properties: N(Ω), ∣Q(Ω)∣, and ∣μ(Ω)∣. The colors represent atom types. As expected, the atomic population groups atoms by kind. In this regard, N(Ω) can characterize atoms within a molecule, since its values are not widely distributed. In contrast, the dipole and quadrupole magnitudes show a wider distribution, which captures the diversity of atoms in each group. Therefore, ∣Q(Ω)∣ and ∣μ(Ω)∣ can be used to characterize the diversity of the database and illustrate the broad reactivity of the analyzed molecules.

Fig. 7. 3D distribution of the calculated atomic properties, grouped by atom kind. All properties are presented in atomic units.

This study introduces a novel data set of atomic properties for approximately 44K organic molecules.
The data provide fundamental information on the atomic properties based on the Quantum Theory of Atoms in Molecules (QTAIM). In particular, the data set includes atomic basin energies, populations, dipole moments, and quadrupole moments. The data set can enable powerful new machine-learning models for predicting atomic properties and chemical reactivity directly from molecular structure. The public availability of this large database could facilitate new studies in chemical informatics, machine learning applied to chemistry, and computational molecular design.
