TY - GEN
T1 - HalluScope
T2 - 21st International Conference on Intelligent Computing, ICIC 2025
AU - Zhao, Chen
AU - Zeng, Biao Jie
AU - Chen, Kedi
AU - Lin, Xin
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - This study presents HalluScope, a benchmark specifically developed for evaluating hallucinations in large language models. HalluScope comprises 800 adversarially designed questions spanning multiple domains, systematically categorized into selective, temporal, imitative, factual, and overconfidence hallucinations. The dataset was constructed through automated question generation with mutual supervision between models, enabling both generation and evaluation. The evaluation adopts a multiple-choice format, requiring models to select all correct answers from option sets that may contain more than one correct choice, thereby providing a more nuanced assessment of model confidence and judgment under uncertainty. Extensive experiments were conducted on 12 large language models, including ERNIE-Bot, ChatGLM, Qwen, and XVerse, with nine models exhibiting hallucination-free rates below 50%, underscoring the benchmark's difficulty. Furthermore, HalluScope offers insights into hallucination-prone domains and hallucination types, providing guidance for fine-tuning models to mitigate hallucinations effectively.
AB - This study presents HalluScope, a benchmark specifically developed for evaluating hallucinations in large language models. HalluScope comprises 800 adversarially designed questions spanning multiple domains, systematically categorized into selective, temporal, imitative, factual, and overconfidence hallucinations. The dataset was constructed through automated question generation with mutual supervision between models, enabling both generation and evaluation. The evaluation adopts a multiple-choice format, requiring models to select all correct answers from option sets that may contain more than one correct choice, thereby providing a more nuanced assessment of model confidence and judgment under uncertainty. Extensive experiments were conducted on 12 large language models, including ERNIE-Bot, ChatGLM, Qwen, and XVerse, with nine models exhibiting hallucination-free rates below 50%, underscoring the benchmark's difficulty. Furthermore, HalluScope offers insights into hallucination-prone domains and hallucination types, providing guidance for fine-tuning models to mitigate hallucinations effectively.
UR - https://www.scopus.com/pages/publications/105012430042
U2 - 10.1007/978-981-96-9994-0_39
DO - 10.1007/978-981-96-9994-0_39
M3 - Conference contribution
AN - SCOPUS:105012430042
SN - 9789819699933
T3 - Communications in Computer and Information Science
SP - 466
EP - 477
BT - Advanced Intelligent Computing Technology and Applications - 21st International Conference, ICIC 2025, Proceedings
A2 - Huang, De-Shuang
A2 - Chen, Haiming
A2 - Li, Bo
A2 - Zhang, Qinhu
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 26 July 2025 through 29 July 2025
ER -