FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation

  • Hanyue Du
  • Yike Zhao
  • Qingyuan Tian
  • Jiani Wang
  • Lei Wang
  • Yunshi Lan*
  • Xuesong Lu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Scopus citations

Abstract

Chinese Grammatical Error Correction (CGEC) has recently attracted growing attention from researchers. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset featuring fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, apply rule-based edits to the sentences, and manually refine the generated samples, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on FlaCGEC, and their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.

Original language: English
Title of host publication: CIKM 2023 - Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 5321-5325
Number of pages: 5
ISBN (Electronic): 9798400701245
State: Published - 21 Oct 2023
Event: 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023 - Birmingham, United Kingdom
Duration: 21 Oct 2023 to 25 Oct 2023

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023
Country/Territory: United Kingdom
City: Birmingham
Period: 21/10/23 to 25/10/23

Keywords

  • Chinese Grammatical Error Correction
  • Deep Learning
  • Fine-grained Linguistic Annotation
