A survey of datasets in medicine for large language models

  • Deshiwei Zhang
  • Xiaojuan Xue
  • Peng Gao
  • Zhijuan Jin
  • Menghan Hu*
  • Yue Wu
  • Xiayang Ying*
  • *Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

8 Scopus citations

Abstract

With the advent of models such as ChatGPT, large language models (LLMs) have demonstrated unprecedented capabilities in understanding and generating natural language, presenting novel opportunities and challenges within the medical domain. While many studies have focused on the use of LLMs in medicine, comprehensive reviews of the datasets used in this field remain scarce. This survey seeks to address this gap by providing a comprehensive overview of the medical datasets fueling LLMs, highlighting their unique characteristics and the critical roles they play at different stages of LLMs’ development: pre-training, fine-tuning, and evaluation. Ultimately, this survey aims to underline the significance of datasets in realizing the full potential of LLMs to innovate and improve healthcare outcomes.

Original language: English
Pages (from-to): 457-478
Number of pages: 22
Journal: Intelligence and Robotics
Volume: 4
Issue number: 4
State: Published - Dec 2024

Keywords

  • Large language models (LLMs)
  • NLP
  • Q&A system in medicine
  • dataset in medicine
