TY - JOUR
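T1 - Can Phonetic and Orthographic Features Effectively Enhance Chinese Character Representations? A Verification Based on the Named Entity Recognition Task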
AU - Duan, Yufeng
AU - Zhang, Meicong
AU - Liu, Yanzuo
AU - He, Guoxiu
N1 - Publisher Copyright:
© 2024 Chinese Academy of Sciences. All rights reserved.
PY - 2024/10/25
Y1 - 2024/10/25
N2 - [Objective] This study investigates whether phonetic and orthographic features effectively enhance the representation of Chinese characters. [Methods] Using the Named Entity Recognition (NER) task, we built a baseline model consisting of a general embedding module as the embedding layer, a bidirectional LSTM module as the context-encoding layer, and a fully connected network with Softmax activation as the decoding layer. We then compared the changes in Micro-F1 scores and entity-specific F1 scores after enhancing the character embeddings with Chinese pinyin, character images, Wubi input codes, Four-Corner codes, Cangjie codes, and radicals, on datasets including MSRA, PeopleDaily, CCKS2017, Resume, and E-Commerce. [Results] Embeddings enhanced with phonetic and orthographic features decreased performance by nearly 0.01 on the MSRA and PeopleDaily datasets, while performance on the CCKS2017, Resume, and E-Commerce datasets showed no statistically significant change. [Limitations] Using only 32×32-pixel images of simplified Chinese characters may affect the extraction of orthographic features. [Conclusions] Although phonetic and orthographic features can enhance the representation of Chinese characters, they also introduce noise, and their impact on model performance varies across corpora and entity types.
KW - Character Glyph
KW - Character Pronunciation
KW - Feature Fusion
KW - Glyph Embedding
KW - Named Entity Recognition
UR - https://www.scopus.com/pages/publications/85214483811
U2 - 10.11925/infotech.2096-3467.2023.0665
DO - 10.11925/infotech.2096-3467.2023.0665
M3 - Article
AN - SCOPUS:85214483811
SN - 2096-3467
VL - 8
SP - 100
EP - 111
JO - Data Analysis and Knowledge Discovery
JF - Data Analysis and Knowledge Discovery
IS - 10
ER -