TY - GEN
T1 - MoCA-Dialog
T2 - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
AU - Zhang, Yunjia
AU - Zhu, Junyi
AU - Wang, Rui
AU - Zhuang, Tianai
AU - Sang, Jinqiu
AU - Kim, Ha Kyung
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Current benchmarks for evaluating large language models (LLMs) in medicine primarily focus on static question-answering, neglecting the dynamic, interactive nature of many clinical workflows. This is particularly evident in cognitive assessments, where nuanced dialogue and multimodal interpretation are critical, yet no specialized benchmark exists to evaluate these capabilities. To address this gap, we introduce MoCA-Dialog, the first large-scale, multimodal benchmark featuring 5000 high-fidelity simulated dialogues for the Montreal Cognitive Assessment (MoCA). The benchmark includes tasks of increasing complexity: accurate scoring, cognitive profile generation, and clinical error attribution, allowing for a fine-grained evaluation across seven cognitive domains. Our comprehensive evaluation of state-of-the-art multimodal LLMs reveals significant performance disparities; while models excel at simple recall tasks, they consistently fail in domains requiring executive function and abstract reasoning. A novel, clinically driven error analysis further indicates that these failures stem not from knowledge deficits, but from fundamental difficulties in interpreting nuanced cues and applying domain-specific reasoning. MoCA-Dialog provides a crucial tool for assessing the clinical readiness of LLMs and highlights that future progress depends on enhancing their core reasoning and interpretive abilities, not just expanding their knowledge base. We release our demo and prompt examples at https://mocadialogue.github.io.
AB - Current benchmarks for evaluating large language models (LLMs) in medicine primarily focus on static question-answering, neglecting the dynamic, interactive nature of many clinical workflows. This is particularly evident in cognitive assessments, where nuanced dialogue and multimodal interpretation are critical, yet no specialized benchmark exists to evaluate these capabilities. To address this gap, we introduce MoCA-Dialog, the first large-scale, multimodal benchmark featuring 5000 high-fidelity simulated dialogues for the Montreal Cognitive Assessment (MoCA). The benchmark includes tasks of increasing complexity: accurate scoring, cognitive profile generation, and clinical error attribution, allowing for a fine-grained evaluation across seven cognitive domains. Our comprehensive evaluation of state-of-the-art multimodal LLMs reveals significant performance disparities; while models excel at simple recall tasks, they consistently fail in domains requiring executive function and abstract reasoning. A novel, clinically driven error analysis further indicates that these failures stem not from knowledge deficits, but from fundamental difficulties in interpreting nuanced cues and applying domain-specific reasoning. MoCA-Dialog provides a crucial tool for assessing the clinical readiness of LLMs and highlights that future progress depends on enhancing their core reasoning and interpretive abilities, not just expanding their knowledge base. We release our demo and prompt examples at https://mocadialogue.github.io.
KW - cognitive assessment
KW - large language model
KW - medical benchmark
KW - MoCA
UR - https://www.scopus.com/pages/publications/105033547191
U2 - 10.1109/BIBM66473.2025.11356537
DO - 10.1109/BIBM66473.2025.11356537
M3 - Conference contribution
AN - SCOPUS:105033547191
T3 - Proceedings - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
SP - 6654
EP - 6659
BT - Proceedings - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
A2 - Liu, Juan
A2 - Huang, Jingshan
A2 - Wang, Xiaowo
A2 - Zhang, Fa
A2 - Zou, Xiufen
A2 - Tian, Tian
A2 - Hu, Xiaohua
A2 - Hu, Bin
A2 - Xiong, Yi
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 December 2025 through 18 December 2025
ER -