A vision–language foundation model for the generation of realistic chest X-ray images

References

1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10674–10685 (IEEE, 2022).
2. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125v1 (2022).
3. Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022).
4. Schuhmann, C. et al. LAION-5B: an open large-scale dataset for training next generation image–text models. In Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 25278–25294 (Curran Associates, Inc., 2022).
5. Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258v3 (2022).
6. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).
7. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
8. Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 6, 1346–1352 (2022).
9. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
10. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 1–8 (2019).
11. Cohen, J. P. et al. TorchXRayVision: a library of chest X-ray datasets and models. GitHub https://github.com/mlmed/torchxrayvision (2022).
12. Chambon, P., Cook, T. S. & Langlotz, C. P. Improved fine-tuning of in-domain transformer model for inferring COVID-19 presence in multi-institutional radiology reports. J. Digit. Imaging 36, 164–177 (2023).
13. Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 4228–4238 (Association for Computational Linguistics, 2021).
14. Miura, Y., Zhang, Y., Tsai, E., Langlotz, C. & Jurafsky, D. Improving factual completeness and consistency of image-to-text radiology report generation. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 5288–5304 (Association for Computational Linguistics, 2021).
15. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).
16. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
17. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (2020).
18. Zhang, Y., Merck, D., Tsai, E., Manning, C. D. & Langlotz, C. Optimizing the factual correctness of a summary: a study of summarizing radiology reports. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 5108–5120 (Association for Computational Linguistics, 2020).
19. Delbrouck, J.-B. et al. Improving the factual correctness of radiology report generation with semantic rewards. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 4348–4360 (Association for Computational Linguistics, 2022).
20. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D. & Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. In Proc. 7th Machine Learning for Healthcare Conference Vol. 182 (eds Lipton, Z., Ranganath, R., Sendak, M., Sjoding, M. & Yeung, S.) 2–25 (PMLR, 2022).
21. Endo, M., Krishnan, R., Krishna, V., Ng, A. Y. & Rajpurkar, P. Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In Proc. Machine Learning for Health Vol. 158 (eds Roy, S. et al.) 209–219 (PMLR, 2021).
22. Huang, S.-C. et al. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. npj Digit. Med. 6, 74 (2023).
23. van der Sluijs, R., Bhaskhar, N., Rubin, D., Langlotz, C. & Chaudhari, A. Exploring image augmentations for siamese representation learning with chest X-rays. In Medical Imaging with Deep Learning Vol. 227 (eds Oguz, I. et al.) 444–467 (PMLR, 2024).
24. Alain, G. & Bengio, Y. Understanding intermediate layers using linear classifier probes. Preprint at https://arxiv.org/abs/1610.01644v4 (2016).
25. Müller-Franzes, G. et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 13, 12098 (2023).
26. Ktena, I. et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med. 30, 1166–1173 (2024).
27. Goyal, P., Mahajan, D., Gupta, A. & Misra, I. Scaling and benchmarking self-supervised visual representation learning. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 6391–6400 (IEEE, 2019).
28. Dominic, J. et al. Improving data-efficiency and robustness of medical imaging segmentation using inpainting-based self-supervised learning. Bioengineering 10, 207 (2023).
29. Zhang, A., Xing, L., Zou, J. & Wu, J. C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng. 6, 1330–1345 (2022).
30. Li, A. C., Prabhudesai, M., Duggal, S., Brown, E. & Pathak, D. Your diffusion model is secretly a zero-shot classifier. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 2206–2217 (IEEE, 2023).
31. Graham, M. S. et al. Denoising diffusion models for out-of-distribution detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2947–2956 (IEEE, 2023).
32. Rahman, A., Valanarasu, J. M. J., Hacihaliloglu, I. & Patel, V. M. Ambiguous medical image segmentation using diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11536–11546 (IEEE, 2023).
33. Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. 3rd Machine Learning for Health Symposium Vol. 225, 353–367 (PMLR, 2023).
34. Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
35. Liu, C., Shah, A., Bai, W. & Arcucci, R. Utilizing synthetic data for medical vision-language pre-training: bypassing the need for real images. Preprint at https://arxiv.org/abs/2310.07027 (2023).
36. Gu, Y. et al. BiomedJourney: counterfactual biomedical image generation by instruction-learning from multimodal patient journeys. Preprint at https://arxiv.org/abs/2310.10765v3 (2023).
37. Carlini, N. et al. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23) 5253–5270 (USENIX Association, 2023).
38. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
39. Lee, K. et al. Aligning text-to-image models using human feedback. Preprint at https://arxiv.org/abs/2302.12192v1 (2023).
40. Clark, K., Vicol, P., Swersky, K. & Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. In The Twelfth International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=1vmSEVL19f (2024).
41. Xu, J. et al. ImageReward: learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 15903–15935 (Curran Associates, Inc., 2023).
42. Nguyen, H. Q. et al. VinDr-CXR: an open dataset of chest X-rays with radiologist's annotations. Sci. Data 9, 429 (2022).
43. von Platen, P. et al. Diffusers: state-of-the-art diffusion models. GitHub https://github.com/huggingface/diffusers (2022).
44. Delbrouck, J.-B. et al. ViLMedic: a framework for research at the intersection of vision and language in medical AI. In Proc. 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (eds Basile, V., Kozareva, Z. & Stajner, S.) 23–34 (Association for Computational Linguistics, 2022).
45. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) 6000–6010 (Curran Associates, Inc., 2017).
46. Liu, L., Ren, Y., Lin, Z. & Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. In The Tenth International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=PlKWVd2yBkY (2022).
47. Ruiz, N. et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 22500–22510 (IEEE, 2023).
48. Chambon, P., Bluethgen, C., Langlotz, C. P. & Chaudhari, A. Adapting pretrained vision-language foundational models to medical imaging domains. In NeurIPS 2022 Foundation Models for Decision Making Workshop https://openreview.net/forum?id=QtxbYdJVT8Q (2022).
49. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2826 (IEEE, 2016).
50. Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T. & Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=4oXTQ6m_ws8 (2023).
51. Wang, Z., Simoncelli, E. P. & Bovik, A. C. Multiscale structural similarity for image quality assessment. In Thirty-Seventh Asilomar Conference on Signals, Systems and Computers 2003 Vol. 2, 1398–1402 (IEEE, 2003).
52. Pinaya, W. H. et al. Brain imaging generation with latent diffusion models. In Deep Generative Models. DGM4MICCAI 2022. Lecture Notes in Computer Science Vol. 13609 (eds Mukhopadhyay, A., Oksuz, I., Engelhardt, S., Zhu, D. & Yuan, Y.) (Springer, 2022).
53. Smit, A. et al. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1500–1519 (Association for Computational Linguistics, 2020).
54. Sechidis, K., Tsoumakas, G. & Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science Vol. 6913 (eds Gunopulos, D., Hofmann, T., Malerba, D. & Vazirgiannis, M.) (Springer, 2011).
55. Szymański, P. & Kajdanowicz, T. A network perspective on stratification of multi-label data. In Proc. First International Workshop on Learning with Imbalanced Domains: Theory and Applications Vol. 74 (eds Torgo, L., Branco, P. & Moniz, N.) 22–35 (PMLR, 2017).
56. Chen, X. & He, K. Exploring simple siamese representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 15750–15758 (IEEE, 2021).
57. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning Vol. 119 (eds Daumé III, H. & Singh, A.) 1597–1607 (PMLR, 2020).
58. Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with momentum contrastive learning. Preprint at https://arxiv.org/abs/2003.04297v1 (2020).
59. Mitchell, M. et al. Model cards for model reporting. In Proc. Conference on Fairness, Accountability, and Transparency 220–229 (Association for Computing Machinery, 2019).
60. Tang, R. et al. What the DAAM: interpreting stable diffusion using cross attention. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 5644–5659 (Association for Computational Linguistics, 2023).
