TY - GEN
T1 - FlaCGEC
T2 - 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023
AU - Du, Hanyue
AU - Zhao, Yike
AU - Tian, Qingyuan
AU - Wang, Jiani
AU - Wang, Lei
AU - Lan, Yunshi
AU - Lu, Xuesong
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s).
PY - 2023/10/21
Y1 - 2023/10/21
N2 - Chinese Grammatical Error Correction (CGEC) has recently attracted growing attention from researchers. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset featuring fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, edit sentences via rules, and manually refine the generated samples, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset, and their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.
AB - Chinese Grammatical Error Correction (CGEC) has recently attracted growing attention from researchers. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset featuring fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, edit sentences via rules, and manually refine the generated samples, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset, and their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.
KW - Chinese Grammatical Error Correction
KW - Deep Learning
KW - Fine-grained Linguistic Annotation
UR - https://www.scopus.com/pages/publications/85178164806
U2 - 10.1145/3583780.3615119
DO - 10.1145/3583780.3615119
M3 - Conference contribution
AN - SCOPUS:85178164806
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 5321
EP - 5325
BT - CIKM 2023 - Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 21 October 2023 through 25 October 2023
ER -