Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning

