TY - JOUR
T1 - Evaluating Psychological Competency via Chinese Q&A in Large Language Models
AU - Gao, Feng
AU - He, Yishen
AU - Chen, Qin
AU - Liu, Feng
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/8
Y1 - 2025/8
AB - Recently, the application of large language models (LLMs) in psychology has gained increasing attention. However, their psychological competence still requires further investigation. This study explores this issue through the lens of Chinese psychological knowledge question answering (QA). Specifically, we constructed a dedicated dataset based on Chinese qualification examinations for psychological counselors and psychotherapists. Subsequently, we evaluated dense, Mixture-of-Experts, and reasoning LLMs with varying parameter sizes and evaluation modes in the Chinese context, measuring answer accuracy in both closed-ended and open-ended settings. The experimental results showed that larger and more recent LLMs achieved higher accuracy in psychological QA. While few-shot learning led to improvements in accuracy, Chain-of-Thought prompting and reasoning LLMs provided only limited gains. Notably, LLMs achieved higher accuracy in closed-ended settings than in open-ended ones. Furthermore, error analysis indicated that LLMs can produce incorrect or hallucinated responses, primarily due to insufficient psychological knowledge and conceptual confusion. Although current LLMs show promise in psychological QA tasks, users should remain cautious about over-reliance on their responses. A complementary, human-AI collaborative approach is recommended for practical use.
KW - LLM evaluation
KW - large language models
KW - psychological question answering
UR - https://www.scopus.com/pages/publications/105014397703
DO - 10.3390/app15169089
M3 - Article
AN - SCOPUS:105014397703
SN - 2076-3417
VL - 15
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 16
M1 - 9089
ER -