Quantum Chemistry Dataset with Ground- and Excited-state Properties of 450 Kilo Molecules

