Synthetic data in biomedicine via generative artificial intelligence

Rubin, D. B. Statistical disclosure limitation. J. Off. Stat. 9, 461–468 (1993).
Google Scholar 
Yoon, J., Drumright, L. N. & Van Der Schaar, M. Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE J. Biomed. Health Inform. 24, 2378–2388 (2020).Article 

Google Scholar 
Bond-Taylor, S., Leach, A., Long, Y. & Willcocks, C. G. Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2021.3116668 (2021).Xu, D., Yuan, S., Zhang, L. & Wu, X. Fairgan: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data 570–575 (IEEE, 2018).Xu, D., Wu, Y., Yuan, S., Zhang, L. & Wu, X. Achieving causal fairness through generative adversarial networks. In Proc. International Joint Conference on Artificial Intelligence 1452–1458 (IJCAI, 2019).van Breugel, B., Kyono, T., Berrevoets, J. & van der Schaar, M. DECAF: generating fair synthetic data using causally-aware generative networks. Adv. Neural Inform. Process. Syst. 34, 22221–22233 (2021).
Google Scholar 
Antoniou, A., Storkey, A. & Edwards, H. Data augmentation generative adversarial networks. Preprint at https://doi.org/10.48550/arXiv.1711.04340 (2017).Dina, A. S., Siddique, A. B. & Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access. 10, 96731–96747 (2022).Article 

Google Scholar 
Das, H. P. et al. Conditional synthetic data generation for robust machine learning applications with limited pandemic data. In Proc. AAAI Conference on Artificial Intelligence 36, 11792–11800 (AAAI, 2021).Bing, S., Dittadi, A., Bauer, S. & Schwab, P. Conditional generation of medical time series for extrapolation to underrepresented populations. PLoS Digital Health 1, e0000074 (2022).Article 

Google Scholar 
van Breugel, B., Seedat, N., Imrie, F. & van der Schaar, M. Can you rely on your model evaluation? Improving model evaluation with synthetic test data. Adv. Neural Inform. Process. Syst. 36, 1889–1904 (2023).
Google Scholar 
Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754 (2016).Article 

Google Scholar 
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug. Discov. 18, 463–477 (2019).Article 

Google Scholar 
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).Article 

Google Scholar 
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE, 2022).Wang, H. et al. Predicting the epidemics trend of COVID-19 using epidemiological-based generative adversarial networks. IEEE J. Sel. Top. Signal Process. 16, 276–288 (2022).Article 

Google Scholar 
Morbiducci, U. et al. Synthetic dataset generation for the analysis and the evaluation of image-based hemodynamics of the human aorta. Med. Biol. Eng. Comput. 50, 145–154 (2012).Article 

Google Scholar 
Frangi, A. F., Tsaftaris, S. A. & Prince, J. L. Simulation and synthesis in medical imaging. IEEE Trans. Med. Imaging 37, 673–679 (2018).Article 

Google Scholar 
Bray, A. et al. Pulse physiology engine: an open-source software platform for computational modeling of human medical simulation. SN Compr. Clin. Med. 1, 362–377 (2019).Article 

Google Scholar 
Webb, J. B. et al. Computational simulation to assess patient safety of uncompensated COVID-19 two-patient ventilator sharing using the Pulse Physiology Engine. PLOS ONE 15, e0242532 (2020).Article 

Google Scholar 
Patki, N., Wedge, R. & Veeramachaneni, K. The synthetic data vault. In Proc. 3rd IEEE International Conference on Data Science and Advanced Analytics (DSAA 2016) 399–410 (IEEE, 2016).Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).Article 

Google Scholar 
von Platen, P. et al. Diffusers: state-of-the-art diffusion models. GitHub github.com/huggingface/diffusers (2022).Qian, Z., Davies, R. & van der Schaar, M. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. Adv. Neural Inform. Process. Syst. 36, 3173–3188 (2023).
Google Scholar 
Dwork, C. & Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 211–487 (2014).Article 
MathSciNet 

Google Scholar 
Qu, Y. et al. GAN-DP: generative adversarial net driven differentially privacy-preserving big data publishing. In 2019 IEEE International Conference on Communications (ICC) (IEEE, 2019).Nikolenko, S. I. Synthetic Data for Deep Learning SOIA Vol. 174 (Springer, 2021).Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).Article 

Google Scholar 
Jordon, J. et al. Synthetic data — what, why and how? Preprint at https://doi.org/10.48550/arxiv.2205.03257 (2022).Alloza, C. et al. A case for synthetic data in regulatory decision-making in Europe. Clin. Pharmacol. Ther. 114, 795–801 (2023).Article 

Google Scholar 
Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digital Med. 6, 186 (2023).Article 

Google Scholar 
Savage, N. Synthetic data could be better than real data. Nature https://doi.org/10.1038/d41586-023-01445-8 (2023).Rocher, L., Hendrickx, J. M. & de Montjoye, Y.-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 3069 (2019).Article 

Google Scholar 
Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. & Rankin, D. Synthetic data generation for tabular health records: a systematic review. Neurocomputing 493, 28–45 (2022).Article 

Google Scholar 
Li, J., Cairns, B. J., Li, J. & Zhu, T. Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. npj Digital Med. 6, 98 (2023).Article 

Google Scholar 
Theodorou, B., Xiao, C. & Sun, J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat. Commun. 14, 5305 (2023).Article 

Google Scholar 
Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning (ICML) 290–306 (PMLR, 2022).Stadler, T., Oprisanu, B. & Troncoso, C. Synthetic data — anonymisation Groundhog Day. In 31st USENIX Security Symp. (USENIX, 2022).Dressel, J. & Farid, H. The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 4, eaao5580 (2018).Article 

Google Scholar 
Dastin, J. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters (11 October 2018).Lu, K., Mardziel, P., Wu, F., Amancharla, P. & Datta, A. Gender bias in neural natural language processing. In Logic, Language, and Security: Essays Dedicated to Andre Scedrov on the Occasion of his 65th Birthday 189–202 (Springer International Publishing, 2020).de Vassimon Manela, D., Errington, D., Fisher, T., van Breugel, B. & Minervini, P. Stereotype and skew: quantifying gender bias in pre-trained and fine-tuned language models. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics (ECACL) 2232–2242 (ACL, 2021).Kadambi, A. Achieving fairness in medical devices. Science 372, 30–31 (2021).Article 

Google Scholar 
Abid, A., Farooqi, M. & Zou, J. Persistent anti-Muslim bias in large language models. In Proc. 2021 AAAI/ACM Conference on AI, Ethics, and Society 9, 298–306 (ACM, 2021).Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54 (ACM, 2021).Grgic-Hlaca, N., Zafar, M. B., Gummadi, K. P. & Weller, A. The case for process fairness in learning: feature selection for fair decision making. In Symposium on Machine Learning and the Law at the 29th Conference on Neural Information Processing Systems (NIPS, 2016).Barocas, S. & Selbst, A. D. Big data’s disparate impact. Calif. Law Rev. 104, 671 (2016).
Google Scholar 
Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning fair representations. In International Conference on Machine Learning 325–333 (PMLR, 2013).Alessandra, A. M. When doctrines collide: disparate treatment, disparate impact, and Watson v. Fort Worth Bank & Trust. Univ. Pennsylvania Law Rev. 137, 1755 (1988).Article 

Google Scholar 
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C. & Venkatasubramanian, S. Certifying and removing disparate impact. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 259–268 (ACM, 2015).Saxena, N. A. et al. How do fairness definitions fare? Testing public attitudes towards three algorithmic definitions of fairness in loan allocations. Artif. Intell. 283, 103238 (2020).Article 
MathSciNet 

Google Scholar 
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. & SMOTE Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).Article 

Google Scholar 
Draghi, B., Wang, Z., Myles, P. & Tucker, A. BayesBoost: identifying and handling bias using synthetic data generators. In Proc. 3rd Int. Worksh. on Learning with Imbalanced Domains: Theory and Applications 49–62 (PMLR, 2021).Waheed, A. et al. CovidGAN: data augmentation using auxiliary classifier GAN for improved Covid-19 detection. IEEE Access. 8, 91916–91923 (2020).Article 

Google Scholar 
Mahmood, F. et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans. Med. Imaging 39, 3257–3267 (2020).Article 

Google Scholar 
Shen, T., Hao, K., Gou, C. & Wang, F. Y. Mass image synthesis in mammogram with contextual information based on GANs. Comput. Meth. Prog. Biomed. 202, 106019 (2021).Article 

Google Scholar 
Tang, Y., Tang, Y., Zhu, Y., Xiao, J. & Summers, R. M. A disentangled generative model for disease decomposition in chest X-rays via normal image synthesis. Med. Image Anal. 67, 101839 (2021).Article 

Google Scholar 
van Breugel, B., Qian, Z. & van der Schaar, M. Synthetic data, real errors: how (not) to publish and use synthetic data. In Proc. 40th International Conference on Machine Learning (PMLR, 2023).Manousakas, D. & Aydöre, S. On the usefulness of synthetic tabular data generation. Preprint at https://doi.org/10.48550/arXiv.2306.15636 (2023).Liu, M. Y. & Tuzel, O. Coupled generative adversarial networks. Adv. Neural Inform. Process. Syst. 469, 477 (2016).
Google Scholar 
Kim, T., Cha, M., Kim, H., Lee, J. K. & Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In 34th International Conference on Machine Learning 4, 2941–2949 (PMLR, 2017).Zhu, J. Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE International Conference on Computer Vision 2017, 2242–2251 (IEEE, 2017).Liu, M. Y., Breuel, T. & Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inform. Process. Syst. 30, 701–709 (2017).
Google Scholar 
Choi, E. et al. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare 286–305 (PMLR, 2017).Yoon, J., Jordon, J., Van Der Schaar, M. & RadialGAN Leveraging multiple datasets to improve target-specific predictive models using generative adversarial networks. In 35th International Conference on Machine Learning 13, 9060–9068 (PMLR, 2018).Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4217–4228 (2018).Article 

Google Scholar 
Ali, M. B. et al. Domain mapping and deep learning from multiple MRI clinical datasets for prediction of molecular subtypes in low grade gliomas. Brain Sci. 10, 463 (2020).Article 

Google Scholar 
Ge, C., Gu, I. Y.-H., Jakola, A. S. & Yang, J. Enlarged training dataset by pairwise GANs for molecular-based brain tumor classification. IEEE Access. 8, 22560–22570 (2020).Article 

Google Scholar 
Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).Article 

Google Scholar 
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).Article 

Google Scholar 
Tao, F. et al. Digital twin-driven product design, manufacturing and service with big data. Int. J. Adv. Manuf. Technol. 94, 3563–3576 (2018).Article 

Google Scholar 
Corral-Acero, J. et al. The ‘Digital Twin’ to enable the vision of precision cardiology. Eur. Heart J. 41, 4556–4564 (2020).Article 

Google Scholar 
Eddy, D. M. & Schlessinger, L. Validation of the Archimedes diabetes model. Diabetes Care 26, 3102–3110 (2003).Article 

Google Scholar 
Laubenbacher, R., Sluka, J. P. & Glazier, J. A. Using digital twins in viral infection. Science 371, 1105–1106 (2021).Article 

Google Scholar 
Chan, A., Bica, I., Hüyük, A., Jarrett, D. & van der Schaar, M. The medkit-learn(ing) environment: medical decision modelling through simulation. In Adv. Neural Inf. Process. Syst. Track on Datasets and Benchmarks 1 (Curran Associates, 2021).Berrevoets, J., Jarrett, D., Chan, A. J. & Schaar, M. van der. AllSim: Simulating and benchmarking resource allocation policies in multi-user systems. Adv. Neural Inf. Proces. Syst. 36, 851–866 (2023).
Google Scholar 
Zhang, J. et al. Combining mechanistic and machine learning models for predictive engineering and optimization of tryptophan metabolism. Nat. Commun. 11, 4880 (2020).Article 

Google Scholar 
Allen, A. et al. A digital twins machine learning model for forecasting disease progression in stroke patients. Appl. Sci. 11, 5576 (2021).Article 

Google Scholar 
Bertolini, D. et al. Forecasting progression of mild cognitive impairment (MCI) and Alzheimer’s disease with digital twins. Alzheimer’s Dement. 17, e054414 (2021).Article 

Google Scholar 
Tang, Y. et al. GANDA: a deep generative adversarial network conditionally generates intratumoral nanoparticles distribution pixels-to-pixels. J. Control. Rel. 336, 336–343 (2021).Article 

Google Scholar 
Du, P., Zhu, X. & Wang, J.-X. Deep learning-based surrogate model for three-dimensional patient-specific computational fluid dynamics. Phys. Fluids 34, 081906 (2022).Article 

Google Scholar 
Donovan-Maiye, R. M. et al. A deep generative model of 3D single-cell organization. PLoS Comput. Biol. 18, e1009155 (2022).Article 

Google Scholar 
Pearl, J. Causality (Cambridge Univ. Press, 2009).Yang, Y. & Perdikaris, P. Physics-informed deep generative models. Preprint at https://doi.org/10.48550/arXiv.1812.03511 (2018).Johansson, F., Shalit, U. & Sontag, D. Learning representations for counterfactual inference. In International Conference on Machine Learning 3020–3029 (PMLR, 2016).Hüllermeier, E. & Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach. Learn. 110, 457–506 (2021).Article 
MathSciNet 

Google Scholar 
Tsialiamanis, G., Wagg, D. J., Dervilis, N. & Worden, K. On generative models as the basis for digital twins. Data Centric Eng. 2, e11 (2021).Article 

Google Scholar 
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arxiv.2204.06125 (2022).Chambon, P. et al. RoentGen: vision-language foundation model for chest X-ray generation. Preprint at https://doi.org/10.48550/arXiv.2211.12737 (2022).Pérez-García, F. et al. Radedit: stress-testing biomedical vision models via diffusion image editing. In Eur. Conf. on Computer Vision (ECCV) (Springer Science, 2024).Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://doi.org/10.48550/arXiv.2305.09617 (2023).Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).Article 

Google Scholar 
Chen, Y. T. & Zou, J. GenePT: a simple but effective foundation model for genes and cells built from ChatGPT. Preprint at bioRxiv https://doi.org/10.1101/2023.10.16.562533 (2023).Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y. & Yoo, J. Reliable fidelity and diversity metrics for generative models. In Proc. 37th International Conference Machine Learning Vol. 119, 7176–7185 (PMLR, 2020).Kahveci, Z. Ü. Attribution problem of generative AI: a view from US copyright law. J. Intellect. Property Law Pract. 18, 796–807 (2023).Article 

Google Scholar 
Thorp, H. H. ChatGPT is fun, but not an author. Science 379, 313–313 (2023).Article 

Google Scholar 
Susnjak, T. ChatGPT: the end of online exam integrity? Education Sciences 14, 656 (MDPI, 2024).van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R. & Bockting, C. L. ChatGPT: five priorities for research. Nature 614, 224–226 (2023).Article 

Google Scholar 
Gates, B. The age of AI has begun. Gates Notes https://www.gatesnotes.com/The-Age-of-AI-Has-Begun (21 March 2023).Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J. & Aila, T. Improved precision and recall metric for assessing generative models. Adv. Neural Inf. Process. Syst. 32 (2019).Sajjadi, M. S. M. et al. Assessing generative model precision and recall. Adv. Neural Inf. Process. Syst. 31, 3927–3936 (2018).
Google Scholar 
Gretton, A. et al. A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012).MathSciNet 

Google Scholar 
Arora, S., Ge, R., Liang, Y., Ma, T. & Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In 34th International Conference on Machine Learning 1, 322–349 (PMLR, 2017).Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. Invariant risk minimization. Preprint at https://doi.org/10.48550/arXiv.1907.02893 (2019).Gulrajani, I., Raffel, C. & Metz, L. Towards GAN benchmarks which require generalization. In 7th International Conference on Learning Representations (ICLR, 2019).Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30, 6627–6638 (2017).
Google Scholar 
Theis, L., Van Den Oord, A. & Bethge, M. A note on the evaluation of generative models. In 4th International Conference on Learning Representations (ICLR, 2016).Lee, J. & Clifton, C. How much is enough? Choosing ε for differential privacy. Lecture Notes Comput. Sci. 7001, 325–340 (2011).Article 

Google Scholar 
Hayes, J., Melis, L., Danezis, G., De Cristofaro, E. & LOGAN Membership inference attacks against generative models. Proc. Priv. Enhancing Technol. 2019, 133–152 (2019).Article 

Google Scholar 
Hilprecht, B., Härterich, M. & Bernau, D. Monte Carlo and reconstruction membership inference attacks against generative models. In Proc. Conference on Privacy Enhancing Technologies https://doi.org/10.2478/popets-2019-0067 (De Gruyter Open/Sciendo, 2019).Chen, D., Yu, N., Zhang, Y. & Fritz, M. GAN-leaks: a taxonomy of membership inference attacks against generative models. In Proc. ACM Conference on Computer and Communications Security 343–362 (ACM, 2019).Liu, K. S., Xiao, C., Li, B. & Gao, J. Performing co-membership attacks against deep generative models. In Proc. IEEE International Conference on Data Mining (ICDM) 459–467 (IEEE, 2019).Hu, H. & Pang, J. Membership inference attacks against GANs by leveraging over-representation regions. In Proc. ACM Conference on Computer and Communications Security 2387–2389 (ACM, 2021).van Breugel, B., Sun, H., Qian, Z. & van der Schaar, M. Membership inference attacks against synthetic data through overfitting detection. In Proc. 26th International Conference on Artificial Intelligence and Statistics (AISTATS) (PMLR, 2023).Sweeney, L. k-anonymity: a model for protecting privacy. Int. J. Uncertainty Fuzziness Knowledge-based Syst. 10, 557–570 (2002).Article 
MathSciNet 

Google Scholar 
Machanavajjhala, A., Gehrke, J., Kifer, D. & Venkitasubramaniam, M. ℓ-diversity: privacy beyond k-anonymity. In Proc. International Conference on Data Engineering 2006, 24 (IEEE, 2006).Ninghui, L., Tiancheng, L. & Venkatasubramanian, S. t-closeness: privacy beyond k-anonymity and ℓ-diversity. In Proc. International Conference on Data Engineering 106–115 https://doi.org/10.1109/ICDE.2007.367856 (IEEE, 2007).Rubin, D. B. & Schenker, N. Multiple imputation in health-care databases: an overview and some applications. Stat. Med. 10, 585–598 (1991).Article 

Google Scholar 
Räisä, O., Jälkö, J. & Honkela, A. On consistent Bayesian inference from synthetic data. In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI (2023).Hansen, L., Seedat, N., van der Schaar, M. & Petrovic, A. Reimagining synthetic tabular data generation through data-centric AI: a comprehensive benchmark. Adv. Neural Inf. Process. Syst. 36, 33781–33823 (2023).
Google Scholar 
Franceschelli, G. & Musolesi, M. Copyright in generative deep learning. Data Policy 4, e17 (2022).Article 

Google Scholar 
Kasneci, E. et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 103, 102274 (2023).Article 

Google Scholar 
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surveys 55, 248 (2023).Article 

Google Scholar 
Bohnet, B. et al. Attributed question answering: evaluation and modeling for attributed large language models. Preprint at https://doi.org/10.48550/arXiv.2212.08037 (2022).Gao, T., Yen, H., Yu, J. & Chen, D. Enabling large language models to generate text with citations. In The 2023 Conference on Empirical Methods in Natural Language Processing (ACL, 2023).Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).OpenAI, R. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).Anil, R. et al. Palm 2 technical report. Preprint at https://doi.org/10.48550/arXiv.2305.10403 (2023).Jiang, Z., Zhang, Y., Liu, C., Zhao, J. & Liu, K. Generative calibration for in-context learning. In Findings of the Association for Computational Linguistics (EMNLP 2023) 2312–2333 (ACL, 2023).Gao, L. et al. The pile: an 800Gb dataset of diverse text for language modeling. Preprint at https://doi.org/10.48550/arXiv.2101.00027 (2020).Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1290–1299 (IEEE, 2022).Oquab, M. et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2024).Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020).
Google Scholar 
Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning 28492–28518 (PMLR, 2023).Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning Vol. 139, 8748–8763 (PMLR, 2021).Driess, D. et al. Palm-e: an embodied multimodal language model. Preprint at https://doi.org/10.48550/arXiv.2303.03378 (2023).Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Google Scholar 
van Breugel, B. & van der Schaar, M. Why tabular foundation models should be a research priority. In International Conference on Machine Learning (PMLR, 2024).Ye, C. et al. Towards cross-table masked pretraining for web data mining. In Proc. ACM Web Conference 2024 (WWW ’24) (ACM, 2023).Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M. & Kasneci, G. Language models are realistic tabular data generators. In 11th International Conference on Learning Representations (ICLR, 2023).Eggert, G., Huo, K., Biven, M. & Waugh, J. TabLib: A dataset of 627M tables with context. Preprint at https://doi.org/10.48550/arXiv.2310.07875 (2023).Schneider, G. & Fechner, U. Computer-based de novo design of drug-like molecules. Nat. Rev. Drug. Discov. 4, 649–663 (2005).Article 

Google Scholar 
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. & Borgwardt, K. M. Weisfeiler–Lehman graph kernels. J. Mach. Learn. Res. 12, 2539–2561 (2011).MathSciNet 

Google Scholar 
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
Google Scholar 
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).Article 

Google Scholar 
Blaschke, T. et al. REINVENT 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).Article 

Google Scholar 
Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet — a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).Article 

Google Scholar 
Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).Article 

Google Scholar 
Satorras, V. G., Hoogeboom, E. & Welling, M. E(n) equivariant graph neural networks. In Proc. 38th International Conference on Machine Learning Vol. 139, 9323–9332 (PMLR, 2021).Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).Article 

Google Scholar 
Kayala, M. A., Azencott, C.-A., Chen, J. H. & Baldi, P. Learning to predict chemical reactions. J. Chem. Inf. Model. 51, 2209–2222 (2011).Article 

Google Scholar 
Genheden, S. et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J. Chemoinf. 12, https://doi.org/10.1186/s13321-020-00472-1 (2020).Oglic, D., Garnett, R. & Gaertner, T. Active search in intensionally specified structured spaces. In Proc. AAAI Conference on Artificial Intelligence (AAAI, 2017).Schneider, G. & Böhm, H.-J. Virtual screening and fast automated docking methods. Drug. Discov. Today 7, 64–70 (2002).Article 

Google Scholar 
Hartenfeller, M. et al. DOGS: reaction-driven de novo design of bioactive compounds. PLoS Comput. Biol. 8, 1–12 (2012).Article 

Google Scholar 
Reker, D. & Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug. Discov. Today 20, 458–465 (2015).Article 

Google Scholar 
Oglic, D. et al. Active search for computer-aided drug design. Mol. Inform. 37, https://doi.org/10.1002/minf.201700130 (2018).Buterez, D., Janet, J. P., Kiddle, S. J., Oglic, D. & Lio, P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat. Commun. 15, 1517 (2024).Article 

Google Scholar 
Ucar, T. et al. Improving antibody humanness prediction using patent data. In 41st International Conference on Machine Learning (PMLR, 2024).Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).Article 

Google Scholar 
Kovaltsuk, A. et al. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J. Immunol. 201, 2502–2509 (2018).Article 

Google Scholar 
Dunbar, J. et al. SAbDab: the structural antibody database. Nucleic Acids Res. 42, D1140–D1146 (2013).Article 

Google Scholar 
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).Article 

Google Scholar 
Tang, L. Large models for genomics. Nat. Meth. 20, 1868 (2023).Article 

Google Scholar 
Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).Article 

Google Scholar 
John, B. et al. Human microRNA targets. PLOS Biol. 2, e363 (2004).Article 

Google Scholar 
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).Article 

Google Scholar 
Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).Article 

Google Scholar 
Baker, D. & Sali, A. Protein structure prediction and structural genomics. Science 294, 93–96 (2001).Article 

Google Scholar 
McKinney, B. A., Reif, D. M., Ritchie, M. D. & Moore, J. H. Machine learning for detecting gene–gene interactions. Appl. Bioinform. 5, 77–88 (2006).Article 

Google Scholar 
Van Steen, K. Travelling the world of gene–gene interactions. Brief. Bioinform. 13, 1–19 (2012).Article 

Google Scholar 
Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Meth. 17, 184–192 (2019).Article 

Google Scholar 
Martinkus, K. et al. AbDiffuser: full-atom generation of in-vitro functioning antibodies. Adv. Neural Inf. Process. Syst. 36, 40729–40759 (2023).
Google Scholar 
Raybould, M. & Deane, C. The therapeutic antibody profiler for computational developability assessment. Methods in Molecular Biology 13, 115–125 (2022).Article 

Google Scholar 
Abanades, B., Georges, G., Bujotzek, A. & Deane, C. M. ABlooper: fast accurate antibody CDR loop structure prediction with accuracy estimation. Bioinformatics 38, 1877–1880 (2022).Article 

Google Scholar 
Gong, J. et al. xTrimoGene: an efficient and scalable representation learner for single-cell RNA-seq data. Adv. Neural Inf. Process. Syst. 36, 69391–69403 (2023).
Google Scholar 
Baldi, P. & Chauvin, Y. Neural networks for fingerprint recognition. Neural Comput. 5, 402–418 (1993).Article 

Google Scholar 
Ciresan, D., Giusti, A., Gambardella, L. & Schmidhuber, J. Deep neural networks segment neuronal membranes in electron microscopy images. Adv. Neural Inf. Process. Syst. 25, 2843–2851 (2012).
Google Scholar 
Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. Mitosis detection in breast cancer histology images with deep neural networks. In Medical Image Computing and Computer-Assisted Intervention 411–418 (Springer, 2013).Wang, J. et al. Detecting cardiovascular disease from mammograms with deep learning. IEEE Trans. Medical Imaging 36, 1172–1181 (2017).Article 

Google Scholar 
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).Article 

Google Scholar 
Klang, E. et al. Deep learning algorithms for automated detection of Crohn’s disease ulcers by video capsule endoscopy. Gastrointest. Endosc. 91, 606–613.e2 (2020).Article 

Google Scholar 
Ackerman, M. J. The visible human project: a resource for education. Acad. Med. 74, 667–670 (1999).Article 

Google Scholar 
Lundervold, A. S. & Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 29, 102–127 (2019).Article 

Google Scholar 
Liu, S. et al. Deep learning in medical ultrasound analysis: a review. Engineering 5, 261–275 (2019).Article 

Google Scholar 
Brattain, L. J., Telfer, B. A., Dhyani, M., Grajo, J. R. & Samir, A. E. Machine learning for medical ultrasound: status, methods, and future opportunities. Abdom. Radiol. 43, 786–799 (2018).Article 

Google Scholar 
Ng, K. et al. PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. J. Biomed. Inform. 48, 160–170 (2014).
Steinhubl, S. R., Wolff-Hughes, D. L., Nilsen, W., Iturriaga, E. & Califf, R. M. Digital clinical trials: creating a vision for the future. npj Digit. Med. 2, 126 (2019).
Dunn, J. et al. Wearable sensors enable personalized predictions of clinical laboratory measurements. Nat. Med. 27, 1105–1112 (2021).
Steinhubl, S. R. et al. Effect of a home-based wearable continuous ECG monitoring patch on detection of undiagnosed atrial fibrillation. JAMA 320, 146–155 (2018).
Pandit, J. A., Radin, J. M., Quer, G. & Topol, E. J. Smartphone apps in the COVID-19 pandemic. Nat. Biotechnol. 40, 1013–1022 (2022).
Strain, T. et al. Wearable-device-measured physical activity and future health risk. Nat. Med. 26, 1385–1391 (2020).
Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022).
Stahlschmidt, S. R., Ulfenborg, B. & Synnergren, J. Multimodal deep learning for biomedical data fusion: a review. Brief. Bioinform. 23, bbab569 (2022).
Zhavoronkov, A. Artificial intelligence for drug discovery, biomarker development, and generation of novel chemistry. Mol. Pharm. 15, 4311–4313 (2018).
Mann, M., Kumar, C., Zeng, W.-F. & Strauss, M. T. Artificial intelligence for proteomics and biomarker discovery. Cell Syst. 12, 759–770 (2021).
Mandair, D., Reis-Filho, J. S. & Ashworth, A. Biological insights and novel biomarker discovery through deep learning approaches in breast cancer histopathology. npj Breast Cancer 9, 21 (2023).
Lin, Q., Oglic, D., Lam, H.-K., Curtis, M. & Cvetkovic, Z. A hybrid GCN-LSTM model for ventricular arrhythmia classification based on ECG pattern similarity. In 46th Annual International Conference IEEE Engineering in Medicine and Biology Society (EMBC 2024) (IEEE, 2024).
Beaulieu-Jones, B. K., Greene, C. S. & Pooled Resource Open-Access ALS Clinical Trials Consortium. Semi-supervised learning of the electronic health record for phenotype stratification. J. Biomed. Inform. 64, 168–178 (2016).
Bent, B. et al. Non-invasive wearables for remote monitoring of HbA1c and glucose variability: proof of concept. BMJ Open Diabetes Res. Care 9, e002027 (2021).
Smit, L. C., Dikken, J., Schuurmans, M. J., de Wit, N. J. & Bleijenberg, N. Value of social network analysis for developing and evaluating complex healthcare interventions: a scoping review. BMJ Open 10, e039681 (2020).
Gupta, A. & Katarya, R. Social media based surveillance systems for healthcare using machine learning: a systematic review. J. Biomed. Inform. 108, 103500 (2020).
Jensen, A. B. et al. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 5, 4022 (2014).
Miotto, R., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016).
Lee, C. K., Hofer, I., Gabel, E., Baldi, P. & Cannesson, M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology 129, 649–662 (2018).
Pham, T., Tran, T., Phung, D. & Venkatesh, S. Predicting healthcare trajectories from medical records: a deep learning approach. J. Biomed. Inform. 69, 218–229 (2017).
Van Der Schaar, M. & Alaa, A. M. Synthetic healthcare data generation and assessment: challenges, methods, and impact on machine learning. In International Conference on Machine Learning (PMLR, 2021).
Weng, L. What are diffusion models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models (2021).
Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In 31st International Conference on Machine Learning 4, 3057–3070 (PMLR, 2014).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR, 2014).
Goodfellow, I. et al. Generative adversarial networks. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014).
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In 32nd International Conference on Machine Learning 3, 2246–2255 (PMLR, 2015).
Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32, 11918–11930 (2019).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. Pixel recurrent neural networks. Proc. 33rd International Conference on Machine Learning 48, 1747–1756 (PMLR, 2016).
Liu, J. et al. Towards out-of-distribution generalization: a survey. Preprint at https://doi.org/10.48550/arXiv.2108.13624 (2021).
Bayer, J. et al. Universal ventricular coordinates: a generic framework for describing position within the heart and transferring data. Med. Image Anal. 45, 83–93 (2018).
Kovatchev, B. A century of diabetes technology: signals, models, and artificial pancreas control. Trends Endocrinol. Metab. 30, 432–444 (2019).
Ghaffarizadeh, A., Heiland, R., Friedman, S. H., Mumenthaler, S. M. & Macklin, P. PhysiCell: An open source physics-based cell simulator for 3-D multicellular systems. PLOS Comput. Biol. 14, e1005991 (2018).
