TY - GEN
T1 - TOREE
T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Zhuang, Xinlin
AU - Wu, Hongyi
AU - Shen, Xinshu
AU - Yu, Peimin
AU - Yi, Gaowei
AU - Chen, Xinhao
AU - Hu, Tu
AU - Chen, Yang
AU - Ren, Yupei
AU - Zhang, Yadong
AU - Song, Youqi
AU - Liu, Binxuan
AU - Lan, Man
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Topic relevance of an essay demands that the composition adheres to a clear theme and aligns well with the essay prompt requirements, a critical aspect of essay quality evaluation. However, existing research of Automatic Essay Scoring (AES) for Chinese essays has overlooked topic relevance and lacks detailed feedback, while Automatic Essay Comment Generation (AECG) faces much complexity and difficulty. Additionally, current Large Language Models, including GPT-4, often make incorrect judgments and provide overly impractical feedback when evaluating topic relevance. This paper introduces TOREE (Topic Relevance Evaluation), a comprehensive dataset developed to assess topic relevance in Chinese primary and middle school students' essays, which is beneficial for AES, AECG and other applications. Moreover, our proposed two-step method utilizes TOREE through a combination of Supervised Fine-tuning and Preference Learning. Experimental results demonstrate that TOREE is of high quality, and our method significantly enhances models' performance on two designed tasks for topic relevance evaluation, improving both automatic and human evaluations across four diverse LLMs.
UR - https://www.scopus.com/pages/publications/85205281714
U2 - 10.18653/v1/2024.findings-acl.342
DO - 10.18653/v1/2024.findings-acl.342
M3 - Conference contribution
AN - SCOPUS:85205281714
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 5749
EP - 5765
BT - The 62nd Annual Meeting of the Association for Computational Linguistics
A2 - Ku, Lun-Wei
A2 - Martins, Andre
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -