An astronomical question answering dataset for evaluating large language models

Jie Li, Fuyong Zhao, Panfeng Chen, Jiafu Xie, Xiangrui Zhang, Hui Li, Mei Chen, Yanhao Wang, Ming Zhu

Research output: Contribution to journal › Article › peer-review


Abstract

Large language models (LLMs) have recently demonstrated exceptional capabilities across a variety of linguistic tasks, including question answering (QA). However, it remains challenging to assess their performance in astronomical QA due to the lack of comprehensive benchmark datasets. To bridge this gap, we construct Astro-QA, the first benchmark dataset specifically for QA in astronomy. The dataset contains 3,082 questions of six types in both English and Chinese, along with standard (reference) answers and related material. These questions encompass several core branches of astronomy, including astrophysics, astrometry, celestial mechanics, history of astronomy, and astronomical techniques and methods. Furthermore, we propose a new measure called DGscore that integrates different measures for objective and subjective questions and incorporates a weighting scheme based on type- and question-specific difficulty coefficients to accurately assess the QA performance of each LLM. We validate the Astro-QA dataset through extensive experimentation with 27 open-source and commercial LLMs. The results show that it can serve as a reliable benchmark dataset for evaluating the capabilities of LLMs in instruction following, knowledge reasoning, and natural language generation in the astronomical domain, helping to calibrate current progress and facilitate future research on astronomical LLMs.
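The abstract describes DGscore only at a high level, so the exact formula is not reproduced here. As an illustration of the general idea of difficulty-weighted scoring, the following minimal sketch assumes that each question already has a raw score in [0, 1] (for example, accuracy for an objective question or a text-similarity measure for a subjective one) and that the weight of a question is the product of a type-level and a question-level difficulty coefficient. The names `ScoredQuestion` and `weighted_score`, the product weighting, and the normalization are all assumptions made for this example, not the definition used in the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredQuestion:
    """One evaluated question (hypothetical structure for illustration)."""
    raw_score: float            # in [0, 1]; how it is computed depends on the question type
    type_difficulty: float      # assumed coefficient shared by all questions of one type
    question_difficulty: float  # assumed coefficient specific to this question


def weighted_score(items: Sequence[ScoredQuestion]) -> float:
    """Difficulty-weighted mean of raw scores.

    Assumption: weight = type_difficulty * question_difficulty.
    This is an illustrative choice, not the published DGscore formula.
    """
    weights = [q.type_difficulty * q.question_difficulty for q in items]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one question must have a positive weight")
    return sum(w * q.raw_score for w, q in zip(weights, items)) / total


if __name__ == "__main__":
    results = [
        ScoredQuestion(raw_score=1.0, type_difficulty=1.0, question_difficulty=1.0),
        ScoredQuestion(raw_score=0.0, type_difficulty=2.0, question_difficulty=1.5),
    ]
    # The easy question was answered correctly, the hard one was not,
    # so the weighted score falls below the unweighted mean of 0.5.
    print(weighted_score(results))
```

Under this weighting, a model that answers only easy questions correctly scores lower than its plain accuracy would suggest, which is the kind of behavior a difficulty-aware measure is meant to capture.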

Original language: English
Article number: 447
Journal: Scientific Data
Volume: 12
Issue number: 1
State: Published - Dec 2025
