Synthetic data in biomedicine via generative artificial intelligence

Rubin, D. B. Statistical disclosure limitation. J. Off. Stat. 9, 461–468 (1993).
Yoon, J., Drumright, L. N. & Van Der Schaar, M. Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE J. Biomed. Health Inform. 24, 2378–2388 (2020).
Bond-Taylor, S., Leach, A., Long, Y. & Willcocks, C. G. Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2021.3116668 (2021).
Xu, D., Yuan, S., Zhang, L. & Wu, X. FairGAN: fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data 570–575 (IEEE, 2018).
Xu, D., Wu, Y., Yuan, S., Zhang, L. & Wu, X. Achieving causal fairness through generative adversarial networks. In Proc. International Joint Conference on Artificial Intelligence 1452–1458 (IJCAI, 2019).
van Breugel, B., Kyono, T., Berrevoets, J. & van der Schaar, M. DECAF: generating fair synthetic data using causally-aware generative networks. Adv. Neural Inform. Process. Syst. 34, 22221–22233 (2021).
Antoniou, A., Storkey, A. & Edwards, H. Data augmentation generative adversarial networks. Preprint at https://doi.org/10.48550/arXiv.1711.04340 (2017).
Dina, A. S., Siddique, A. B. & Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access 10, 96731–96747 (2022).
Das, H. P. et al. Conditional synthetic data generation for robust machine learning applications with limited pandemic data. In Proc. AAAI Conference on Artificial Intelligence 36, 11792–11800 (AAAI, 2021).
Bing, S., Dittadi, A., Bauer, S. & Schwab, P. Conditional generation of medical time series for extrapolation to underrepresented populations. PLoS Digital Health 1, e0000074 (2022).
van Breugel, B., Seedat, N., Imrie, F. & van der Schaar, M. Can you rely on your model evaluation? Improving model evaluation with synthetic test data. Adv. Neural Inform. Process. Syst. 36, 1889–1904 (2023).
Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754 (2016).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug. Discov. 18, 463–477 (2019).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE, 2022).
Wang, H. et al. Predicting the epidemics trend of COVID-19 using epidemiological-based generative adversarial networks. IEEE J. Sel. Top. Signal Process. 16, 276–288 (2022).
Morbiducci, U. et al. Synthetic dataset generation for the analysis and the evaluation of image-based hemodynamics of the human aorta. Med. Biol. Eng. Comput. 50, 145–154 (2012).
Frangi, A. F., Tsaftaris, S. A. & Prince, J. L. Simulation and synthesis in medical imaging. IEEE Trans. Med. Imaging 37, 673–679 (2018).
Bray, A. et al. Pulse physiology engine: an open-source software platform for computational modeling of human medical simulation. SN Compr. Clin. Med. 1, 362–377 (2019).
Webb, J. B. et al. Computational simulation to assess patient safety of uncompensated COVID-19 two-patient ventilator sharing using the Pulse Physiology Engine. PLOS ONE 15, e0242532 (2020).
Patki, N., Wedge, R. & Veeramachaneni, K. The synthetic data vault. In Proc. 3rd IEEE International Conference on Data Science and Advanced Analytics (DSAA 2016) 399–410 (IEEE, 2016).
Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).
von Platen, P. et al. Diffusers: state-of-the-art diffusion models. GitHub github.com/huggingface/diffusers (2022).
Qian, Z., Davies, R. & van der Schaar, M. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. Adv. Neural Inform. Process. Syst. 36, 3173–3188 (2023).
Dwork, C. & Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 211–487 (2014).
Qu, Y. et al. GAN-DP: generative adversarial net driven differentially privacy-preserving big data publishing. In 2019 IEEE International Conference on Communications (ICC) (IEEE, 2019).
Nikolenko, S. I. Synthetic Data for Deep Learning SOIA Vol. 174 (Springer, 2021).
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
Jordon, J. et al. Synthetic data – what, why and how? Preprint at https://doi.org/10.48550/arxiv.2205.03257 (2022).
Alloza, C. et al. A case for synthetic data in regulatory decision-making in Europe. Clin. Pharmacol. Ther. 114, 795–801 (2023).
Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digital Med. 6, 186 (2023).
Savage, N. Synthetic data could be better than real data. Nature https://doi.org/10.1038/d41586-023-01445-8 (2023).
Rocher, L., Hendrickx, J. M. & de Montjoye, Y.-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 3069 (2019).
Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. & Rankin, D. Synthetic data generation for tabular health records: a systematic review. Neurocomputing 493, 28–45 (2022).
Li, J., Cairns, B. J., Li, J. & Zhu, T. Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. npj Digital Med. 6, 98 (2023).
Theodorou, B., Xiao, C. & Sun, J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat. Commun. 14, 5305 (2023).
Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning (ICML) 290–306 (PMLR, 2022).
Stadler, T., Oprisanu, B. & Troncoso, C. Synthetic data – anonymisation Groundhog Day. In 31st USENIX Security Symp. (USENIX, 2022).
Dressel, J. & Farid, H. The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 4, eaao5580 (2018).
Dastin, J. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters (11 October 2018).
Lu, K., Mardziel, P., Wu, F., Amancharla, P. & Datta, A. Gender bias in neural natural language processing. In Logic, Language, and Security: Essays Dedicated to Andre Scedrov on the Occasion of his 65th Birthday 189–202 (Springer International Publishing, 2020).
de Vassimon Manela, D., Errington, D., Fisher, T., van Breugel, B. & Minervini, P. Stereotype and skew: quantifying gender bias in pre-trained and fine-tuned language models. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2232–2242 (ACL, 2021).
Kadambi, A. Achieving fairness in medical devices. Science 372, 30–31 (2021).
Abid, A., Farooqi, M. & Zou, J. Persistent anti-Muslim bias in large language models. In Proc. 2021 AAAI/ACM Conference on AI, Ethics, and Society 9, 298–306 (ACM, 2021).
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54 (ACM, 2021).
Grgic-Hlaca, N., Zafar, M. B., Gummadi, K. P. & Weller, A. The case for process fairness in learning: feature selection for fair decision making. In Symposium on Machine Learning and the Law at the 29th Conference on Neural Information Processing Systems (NIPS, 2016).
Barocas, S. & Selbst, A. D. Big data’s disparate impact. Calif. Law Rev. 104, 671 (2016).
Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning fair representations. In International Conference on Machine Learning 325–333 (PMLR, 2013).
Alessandra, A. M. When doctrines collide: disparate treatment, disparate impact, and Watson v. Fort Worth Bank & Trust. Univ. Pennsylvania Law Rev. 137, 1755 (1988).
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C. & Venkatasubramanian, S. Certifying and removing disparate impact. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 259–268 (ACM, 2015).
Saxena, N. A. et al. How do fairness definitions fare? Testing public attitudes towards three algorithmic definitions of fairness in loan allocations. Artif. Intell. 283, 103238 (2020).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Draghi, B., Wang, Z., Myles, P. & Tucker, A. BayesBoost: identifying and handling bias using synthetic data generators. In Proc. 3rd Int. Worksh. on Learning with Imbalanced Domains: Theory and Applications 49–62 (PMLR, 2021).
Waheed, A. et al. CovidGAN: data augmentation using auxiliary classifier GAN for improved Covid-19 detection. IEEE Access 8, 91916–91923 (2020).
Mahmood, F. et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans. Med. Imaging 39, 3257–3267 (2020).
Shen, T., Hao, K., Gou, C. & Wang, F. Y. Mass image synthesis in mammogram with contextual information based on GANs. Comput. Meth. Prog. Biomed. 202, 106019 (2021).
Tang, Y., Tang, Y., Zhu, Y., Xiao, J. & Summers, R. M. A disentangled generative model for disease decomposition in chest X-rays via normal image synthesis. Med. Image Anal. 67, 101839 (2021).
van Breugel, B., Qian, Z. & van der Schaar, M. Synthetic data, real errors: how (not) to publish and use synthetic data. In Proc. 40th International Conference on Machine Learning (PMLR, 2023).
Manousakas, D. & Aydöre, S. On the usefulness of synthetic tabular data generation. Preprint at https://doi.org/10.48550/arXiv.2306.15636 (2023).
Liu, M. Y. & Tuzel, O. Coupled generative adversarial networks. Adv. Neural Inform. Process. Syst. 29, 469–477 (2016).
Kim, T., Cha, M., Kim, H., Lee, J. K. & Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In 34th International Conference on Machine Learning 4, 2941–2949 (PMLR, 2017).
Zhu, J. Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE International Conference on Computer Vision 2017, 2242–2251 (IEEE, 2017).
Liu, M. Y., Breuel, T. & Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inform. Process. Syst. 30, 701–709 (2017).
Choi, E. et al. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare 286–305 (PMLR, 2017).
Yoon, J., Jordon, J. & Van Der Schaar, M. RadialGAN: leveraging multiple datasets to improve target-specific predictive models using generative adversarial networks. In 35th International Conference on Machine Learning 13, 9060–9068 (PMLR, 2018).
Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4217–4228 (2018).
Ali, M. B. et al. Domain mapping and deep learning from multiple MRI clinical datasets for prediction of molecular subtypes in low grade gliomas. Brain Sci. 10, 463 (2020).
Ge, C., Gu, I. Y.-H., Jakola, A. S. & Yang, J. Enlarged training dataset by pairwise GANs for molecular-based brain tumor classification. IEEE Access 8, 22560–22570 (2020).
Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).
Tao, F. et al. Digital twin-driven product design, manufacturing and service with big data. Int. J. Adv. Manuf. Technol. 94, 3563–3576 (2018).
Corral-Acero, J. et al. The ‘Digital Twin’ to enable the vision of precision cardiology. Eur. Heart J. 41, 4556–4564 (2020).
Eddy, D. M. & Schlessinger, L. Validation of the Archimedes diabetes model. Diabetes Care 26, 3102–3110 (2003).
Laubenbacher, R., Sluka, J. P. & Glazier, J. A. Using digital twins in viral infection. Science 371, 1105–1106 (2021).
Chan, A., Bica, I., Hüyük, A., Jarrett, D. & van der Schaar, M. The medkit-learn(ing) environment: medical decision modelling through simulation. In Adv. Neural Inf. Process. Syst. Track on Datasets and Benchmarks 1 (Curran Associates, 2021).
Berrevoets, J., Jarrett, D., Chan, A. J. & van der Schaar, M. AllSim: simulating and benchmarking resource allocation policies in multi-user systems. Adv. Neural Inf. Process. Syst. 36, 851–866 (2023).
Zhang, J. et al. Combining mechanistic and machine learning models for predictive engineering and optimization of tryptophan metabolism. Nat. Commun. 11, 4880 (2020).
Allen, A. et al. A digital twins machine learning model for forecasting disease progression in stroke patients. Appl. Sci. 11, 5576 (2021).
Bertolini, D. et al. Forecasting progression of mild cognitive impairment (MCI) and Alzheimer’s disease with digital twins. Alzheimer’s Dement. 17, e054414 (2021).
Tang, Y. et al. GANDA: a deep generative adversarial network conditionally generates intratumoral nanoparticles distribution pixels-to-pixels. J. Control. Rel. 336, 336–343 (2021).
Du, P., Zhu, X. & Wang, J.-X. Deep learning-based surrogate model for three-dimensional patient-specific computational fluid dynamics. Phys. Fluids 34, 081906 (2022).
Donovan-Maiye, R. M. et al. A deep generative model of 3D single-cell organization. PLoS Comput. Biol. 18, e1009155 (2022).
Pearl, J. Causality (Cambridge Univ. Press, 2009).
Yang, Y. & Perdikaris, P. Physics-informed deep generative models. Preprint at https://doi.org/10.48550/arXiv.1812.03511 (2018).
Johansson, F., Shalit, U. & Sontag, D. Learning representations for counterfactual inference. In International Conference on Machine Learning 3020–3029 (PMLR, 2016).
Hüllermeier, E. & Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach. Learn. 110, 457–506 (2021).
Tsialiamanis, G., Wagg, D. J., Dervilis, N. & Worden, K. On generative models as the basis for digital twins. Data Centric Eng. 2, e11 (2021).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arxiv.2204.06125 (2022).
Chambon, P. et al. RoentGen: vision-language foundation model for chest X-ray generation. Preprint at https://doi.org/10.48550/arXiv.2211.12737 (2022).
Pérez-García, F. et al. RadEdit: stress-testing biomedical vision models via diffusion image editing. In Eur. Conf. on Computer Vision (ECCV) (Springer Science, 2024).
Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://doi.org/10.48550/arXiv.2305.09617 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Chen, Y. T. & Zou, J. GenePT: a simple but effective foundation model for genes and cells built from ChatGPT. Preprint at bioRxiv https://doi.org/10.1101/2023.10.16.562533 (2023).
Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y. & Yoo, J. Reliable fidelity and diversity metrics for generative models. In Proc. 37th International Conference Machine Learning Vol. 119, 7176–7185 (PMLR, 2020).
Kahveci, Z. Ü. Attribution problem of generative AI: a view from US copyright law. J. Intellect. Property Law Pract. 18, 796–807 (2023).
Thorp, H. H. ChatGPT is fun, but not an author. Science 379, 313–313 (2023).
Susnjak, T. ChatGPT: the end of online exam integrity? Education Sciences 14, 656 (MDPI, 2024).
van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R. & Bockting, C. L. ChatGPT: five priorities for research. Nature 614, 224–226 (2023).
Gates, B. The age of AI has begun. Gates Notes https://www.gatesnotes.com/The-Age-of-AI-Has-Begun (21 March 2023).
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J. & Aila, T. Improved precision and recall metric for assessing generative models. Adv. Neural Inf. Process. Syst. 32 (2019).
Sajjadi, M. S. M. et al. Assessing generative model precision and recall. Adv. Neural Inf. Process. Syst. 31, 3927–3936 (2018).
Gretton, A. et al. A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012).
Arora, S., Ge, R., Liang, Y., Ma, T. & Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In 34th International Conference on Machine Learning 1, 322–349 (PMLR, 2017).
Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. Invariant risk minimization. Preprint at https://doi.org/10.48550/arXiv.1907.02893 (2019).
Gulrajani, I., Raffel, C. & Metz, L. Towards GAN benchmarks which require generalization. In 7th International Conference on Learning Representations (ICLR, 2019).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30, 6627–6638 (2017).
Theis, L., Van Den Oord, A. & Bethge, M. A note on the evaluation of generative models. In 4th International Conference on Learning Representations (ICLR, 2016).
Lee, J. & Clifton, C. How much is enough? Choosing ε for differential privacy. Lecture Notes Comput. Sci. 7001, 325–340 (2011).
Hayes, J., Melis, L., Danezis, G. & De Cristofaro, E. LOGAN: membership inference attacks against generative models. Proc. Priv. Enhancing Technol. 2019, 133–152 (2019).
Hilprecht, B., Härterich, M. & Bernau, D. Monte Carlo and reconstruction membership inference attacks against generative models. In Proc. Conference on Privacy Enhancing Technologies https://doi.org/10.2478/popets-2019-0067 (De Gruyter Open/Sciendo, 2019).
Chen, D., Yu, N., Zhang, Y. & Fritz, M. GAN-leaks: a taxonomy of membership inference attacks against generative models. In Proc. ACM Conference on Computer and Communications Security 343–362 (ACM, 2019).
Liu, K. S., Xiao, C., Li, B. & Gao, J. Performing co-membership attacks against deep generative models. In Proc. IEEE International Conference on Data Mining (ICDM) 459–467 (IEEE, 2019).
Hu, H. & Pang, J. Membership inference attacks against GANs by leveraging over-representation regions. In Proc. ACM Conference on Computer and Communications Security 2387–2389 (ACM, 2021).
van Breugel, B., Sun, H., Qian, Z. & van der Schaar, M. Membership inference attacks against synthetic data through overfitting detection. In Proc. 26th International Conference on Artificial Intelligence and Statistics (AISTATS) (PMLR, 2023).
Sweeney, L. k-anonymity: a model for protecting privacy. Int. J. Uncertainty Fuzziness Knowledge-based Syst. 10, 557–570 (2002).
Machanavajjhala, A., Gehrke, J., Kifer, D. & Venkitasubramaniam, M. ℓ-diversity: privacy beyond k-anonymity. In Proc. International Conference on Data Engineering 2006, 24 (IEEE, 2006).
Li, N., Li, T. & Venkatasubramanian, S. t-closeness: privacy beyond k-anonymity and ℓ-diversity. In Proc. International Conference on Data Engineering 106–115 https://doi.org/10.1109/ICDE.2007.367856 (IEEE, 2007).
Rubin, D. B. & Schenker, N. Multiple imputation in health-care databases: an overview and some applications. Stat. Med. 10, 585–598 (1991).
Räisä, O., Jälkö, J. & Honkela, A. On consistent Bayesian inference from synthetic data. In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI (2023).
Hansen, L., Seedat, N., van der Schaar, M. & Petrovic, A. Reimagining synthetic tabular data generation through data-centric AI: a comprehensive benchmark. Adv. Neural Inf. Process. Syst. 36, 33781–33823 (2023).
Franceschelli, G. & Musolesi, M. Copyright in generative deep learning. Data Policy 4, e17 (2022).
Kasneci, E. et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 103, 102274 (2023).
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surveys 55, 248 (2023).
Bohnet, B. et al. Attributed question answering: evaluation and modeling for attributed large language models. Preprint at https://doi.org/10.48550/arXiv.2212.08037 (2022).
Gao, T., Yen, H., Yu, J. & Chen, D. Enabling large language models to generate text with citations. In The 2023 Conference on Empirical Methods in Natural Language Processing (ACL, 2023).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).
OpenAI. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
Anil, R. et al. PaLM 2 technical report. Preprint at https://doi.org/10.48550/arXiv.2305.10403 (2023).
Jiang, Z., Zhang, Y., Liu, C., Zhao, J. & Liu, K. Generative calibration for in-context learning. In Findings of the Association for Computational Linguistics (EMNLP 2023) 2312–2333 (ACL, 2023).
Gao, L. et al. The Pile: an 800GB dataset of diverse text for language modeling. Preprint at https://doi.org/10.48550/arXiv.2101.00027 (2020).
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1290–1299 (IEEE, 2022).
Oquab, M. et al. DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research (2024).
Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020).
Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning 28492–28518 (PMLR, 2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning Vol. 139, 8748–8763 (PMLR, 2021).
Driess, D. et al. PaLM-E: an embodied multimodal language model. Preprint at https://doi.org/10.48550/arXiv.2303.03378 (2023).
Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
van Breugel, B. & van der Schaar, M. Why tabular foundation models should be a research priority. In International Conference on Machine Learning (PMLR, 2024).
Ye, C. et al. Towards cross-table masked pretraining for web data mining. In Proc. ACM Web Conference 2024 (WWW ’24) (ACM, 2023).
Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M. & Kasneci, G. Language models are realistic tabular data generators. In 11th International Conference on Learning Representations (ICLR, 2023).
Eggert, G., Huo, K., Biven, M. & Waugh, J. TabLib: a dataset of 627M tables with context. Preprint at https://doi.org/10.48550/arXiv.2310.07875 (2023).
Schneider, G. & Fechner, U. Computer-based de novo design of drug-like molecules. Nat. Rev. Drug. Discov. 4, 649–663 (2005).
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. & Borgwardt, K. M. Weisfeiler–Lehman graph kernels. J. Mach. Learn. Res. 12, 2539–2561 (2011).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Blaschke, T. et al. REINVENT 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).
Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet – a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
Satorras, V. G., Hoogeboom, E. & Welling, M. E(n) equivariant graph neural networks. In Proc. 38th International Conference on Machine Learning Vol. 139, 9323–9332 (PMLR, 2021).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Kayala, M. A., Azencott, C.-A., Chen, J. H. & Baldi, P. Learning to predict chemical reactions. J. Chem. Inf. Model. 51, 2209–2222 (2011).
Genheden, S. et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J. Chemoinf. 12, https://doi.org/10.1186/s13321-020-00472-1 (2020).
Oglic, D., Garnett, R. & Gaertner, T. Active search in intensionally specified structured spaces. In Proc. AAAI Conference on Artificial Intelligence (AAAI, 2017).
Schneider, G. & Böhm, H.-J. Virtual screening and fast automated docking methods. Drug. Discov. Today 7, 64–70 (2002).
Hartenfeller, M. et al. DOGS: reaction-driven de novo design of bioactive compounds. PLoS Comput. Biol. 8, 1–12 (2012).
Reker, D. & Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug. Discov. Today 20, 458–465 (2015).
Oglic, D. et al. Active search for computer-aided drug design. Mol. Inform. 37, https://doi.org/10.1002/minf.201700130 (2018).
Buterez, D., Janet, J. P., Kiddle, S. J., Oglic, D. & Lio, P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat. Commun. 15, 1517 (2024).
Ucar, T. et al. Improving antibody humanness prediction using patent data. In 41st International Conference on Machine Learning (PMLR, 2024).
Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Kovaltsuk, A. et al. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J. Immunol. 201, 2502–2509 (2018).
Dunbar, J. et al. SAbDab: the structural antibody database. Nucleic Acids Res. 42, D1140–D1146 (2013).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Tang, L. Large models for genomics. Nat. Meth. 20, 1868 (2023).
Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
John, B. et al. Human microRNA targets. PLOS Biol. 2, e363 (2004).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
Baker, D. & Sali, A. Protein structure prediction and structural genomics. Science 294, 93–96 (2001).
McKinney, B. A., Reif, D. M., Ritchie, M. D. & Moore, J. H. Machine learning for detecting gene–gene interactions. Appl. Bioinform. 5, 77–88 (2006).
Van Steen, K. Travelling the world of gene–gene interactions. Brief. Bioinform. 13, 1–19 (2012).
Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Meth. 17, 184–192 (2019).
Martinkus, K. et al. AbDiffuser: full-atom generation of in-vitro functioning antibodies. Adv. Neural Inf. Process. Syst. 36, 40729–40759 (2023).
Raybould, M. & Deane, C. The therapeutic antibody profiler for computational developability assessment. Methods in Molecular Biology 13, 115–125 (2022).
Abanades, B., Georges, G., Bujotzek, A. & Deane, C. M. ABlooper: fast accurate antibody CDR loop structure prediction with accuracy estimation. Bioinformatics 38, 1877–1880 (2022).
Gong, J. et al. xTrimoGene: an efficient and scalable representation learner for single-cell RNA-seq data. Adv. Neural Inf. Process. Syst. 36, 69391–69403 (2023).
Baldi, P. & Chauvin, Y. Neural networks for fingerprint recognition. Neural Comput. 5, 402–418 (1993).
Ciresan, D., Giusti, A., Gambardella, L. & Schmidhuber, J. Deep neural networks segment neuronal membranes in electron microscopy images. Adv. Neural Inf. Process. Syst. 25, 2843–2851 (2012).
Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. Mitosis detection in breast cancer histology images with deep neural networks. In Medical Image Computing and Computer-Assisted Intervention 411–418 (Springer, 2013).Wang, J. et al. Detecting cardiovascular disease from mammograms with deep learning. IEEE Trans. Medical Imaging 36, 1172–1181 (2017).Article

Google Scholar
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).Article

Google Scholar
Klang, E. et al. Deep learning algorithms for automated detection of Crohn’s disease ulcers by video capsule endoscopy. Gastrointest. Endosc. 91, 606–613.e2 (2020).Article

Google Scholar
Ackerman, M. J. The visible human project: a resource for education. Acad. Med. 74, 667–670 (1999).Article

Google Scholar
Lundervold, A. S. & Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 29, 102–127 (2019).Article

Google Scholar
Liu, S. et al. Deep learning in medical ultrasound analysis: a review. Engineering 5, 261–275 (2019).
Brattain, L. J., Telfer, B. A., Dhyani, M., Grajo, J. R. & Samir, A. E. Machine learning for medical ultrasound: status, methods, and future opportunities. Abdom. Radiol. 43, 786–799 (2018).
Ng, K. et al. PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. J. Biomed. Inform. 48, 160–170 (2014).
Steinhubl, S. R., Wolff-Hughes, D. L., Nilsen, W., Iturriaga, E. & Califf, R. M. Digital clinical trials: creating a vision for the future. npj Digit. Med. 2, 126 (2019).
Dunn, J. et al. Wearable sensors enable personalized predictions of clinical laboratory measurements. Nat. Med. 27, 1105–1112 (2021).
Steinhubl, S. R. et al. Effect of a home-based wearable continuous ECG monitoring patch on detection of undiagnosed atrial fibrillation. JAMA 320, 146–155 (2018).
Pandit, J. A., Radin, J. M., Quer, G. & Topol, E. J. Smartphone apps in the COVID-19 pandemic. Nat. Biotechnol. 40, 1013–1022 (2022).
Strain, T. et al. Wearable-device-measured physical activity and future health risk. Nat. Med. 26, 1385–1391 (2020).
Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022).
Stahlschmidt, S. R., Ulfenborg, B. & Synnergren, J. Multimodal deep learning for biomedical data fusion: a review. Brief. Bioinform. 23, bbab569 (2022).
Zhavoronkov, A. Artificial intelligence for drug discovery, biomarker development, and generation of novel chemistry. Mol. Pharm. 15, 4311–4313 (2018).
Mann, M., Kumar, C., Zeng, W.-F. & Strauss, M. T. Artificial intelligence for proteomics and biomarker discovery. Cell Syst. 12, 759–770 (2021).
Mandair, D., Reis-Filho, J. S. & Ashworth, A. Biological insights and novel biomarker discovery through deep learning approaches in breast cancer histopathology. npj Breast Cancer 9, 21 (2023).
Lin, Q., Oglic, D., Lam, H.-K., Curtis, M. & Cvetkovic, Z. A Hybrid GCN-LSTM model for ventricular arrhythmia classification based on ECG pattern similarity. In 46th Annual International Conference IEEE Engineering in Medicine and Biology Society (EMBC 2024) (IEEE, 2024).
Beaulieu-Jones, B. K., Greene, C. S. & Pooled Resource Open-Access ALS Clinical Trials Consortium. Semi-supervised learning of the electronic health record for phenotype stratification. J. Biomed. Inform. 64, 168–178 (2016).
Bent, B. et al. Non-invasive wearables for remote monitoring of HbA1c and glucose variability: proof of concept. BMJ Open Diabetes Res. Care 9, e002027 (2021).
Smit, L. C., Dikken, J., Schuurmans, M. J., de Wit, N. J. & Bleijenberg, N. Value of social network analysis for developing and evaluating complex healthcare interventions: a scoping review. BMJ Open 10, e039681 (2020).
Gupta, A. & Katarya, R. Social media based surveillance systems for healthcare using machine learning: a systematic review. J. Biomed. Inform. 108, 103500 (2020).
Jensen, A. B. et al. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 5, 4022 (2014).
Miotto, R., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016).
Lee, C. K., Hofer, I., Gabel, E., Baldi, P. & Cannesson, M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology 129, 649–662 (2018).
Pham, T., Tran, T., Phung, D. & Venkatesh, S. Predicting healthcare trajectories from medical records: a deep learning approach. J. Biomed. Inform. 69, 218–229 (2017).
Van Der Schaar, M. & Alaa, A. M. Synthetic healthcare data generation and assessment: challenges, methods, and impact on machine learning. In International Conference on Machine Learning (PMLR, 2021).
Weng, L. What are diffusion models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models (2021).
Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In 31st International Conference on Machine Learning 4, 3057–3070 (PMLR, 2014).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR, 2014).
Goodfellow, I. et al. Generative adversarial networks. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014).
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In 32nd International Conference on Machine Learning 3, 2246–2255 (PMLR, 2015).
Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32, 11918–11930 (2019).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. Pixel recurrent neural networks. Proc. 33rd International Conference on Machine Learning 48, 1747–1756 (PMLR, 2016).
Liu, J. et al. Towards out-of-distribution generalization: a survey. Preprint at https://doi.org/10.48550/arXiv.2108.13624 (2021).
Bayer, J. et al. Universal ventricular coordinates: a generic framework for describing position within the heart and transferring data. Med. Image Anal. 45, 83–93 (2018).
Kovatchev, B. A century of diabetes technology: signals, models, and artificial pancreas control. Trends Endocrinol. Metab. 30, 432–444 (2019).
Ghaffarizadeh, A., Heiland, R., Friedman, S. H., Mumenthaler, S. M. & Macklin, P. PhysiCell: An open source physics-based cell simulator for 3-D multicellular systems. PLOS Comput. Biol. 14, e1005991 (2018).
