TY - GEN
T1 - UMRSpell
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
AU - He, Zheyu
AU - Zhu, Yujin
AU - Wang, Linlin
AU - Xu, Liang
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled characters in Chinese texts. As an important step for various downstream tasks, CSC faces two challenges: 1) Character-level errors include not only spelling errors but also missing and redundant characters, which cause the input and output texts to differ in length; most CSC methods cannot handle these well because their inherent detection-correction framework requires input and output texts of the same length. Consequently, these two error types are considered out of scope and left to future work, despite being widely found and closely bound to the CSC task in Chinese industrial scenarios such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR). 2) Most existing CSC methods focus on either the detector or the corrector and train a separate model for each, leading to insufficient parameter sharing. To address these issues, we propose a novel model, UMRSpell, which learns the detection and correction parts jointly from a multi-task learning perspective using a detection transmission self-attention matrix, and flexibly handles missing, redundant, and spelling errors through re-tagging rules. Furthermore, we build a new dataset, ECMR-2023, containing five kinds of character-level errors to bring the CSC task closer to real-world applications. Experiments on both the SIGHAN benchmarks and ECMR-2023 demonstrate the significant effectiveness of UMRSpell over previous representative baselines.
AB - Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled characters in Chinese texts. As an important step for various downstream tasks, CSC faces two challenges: 1) Character-level errors include not only spelling errors but also missing and redundant characters, which cause the input and output texts to differ in length; most CSC methods cannot handle these well because their inherent detection-correction framework requires input and output texts of the same length. Consequently, these two error types are considered out of scope and left to future work, despite being widely found and closely bound to the CSC task in Chinese industrial scenarios such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR). 2) Most existing CSC methods focus on either the detector or the corrector and train a separate model for each, leading to insufficient parameter sharing. To address these issues, we propose a novel model, UMRSpell, which learns the detection and correction parts jointly from a multi-task learning perspective using a detection transmission self-attention matrix, and flexibly handles missing, redundant, and spelling errors through re-tagging rules. Furthermore, we build a new dataset, ECMR-2023, containing five kinds of character-level errors to bring the CSC task closer to real-world applications. Experiments on both the SIGHAN benchmarks and ECMR-2023 demonstrate the significant effectiveness of UMRSpell over previous representative baselines.
UR - https://www.scopus.com/pages/publications/85174419519
U2 - 10.18653/v1/2023.acl-long.570
DO - 10.18653/v1/2023.acl-long.570
M3 - Conference contribution
AN - SCOPUS:85174419519
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 10238
EP - 10250
BT - Long Papers
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -