TY - JOUR
T1 - Few Clean Instances Help Denoising Distant Supervision
AU - Liu, Yufang
AU - Huang, Ziyin
AU - Wang, Yijun
AU - Sun, Changzhi
AU - Lan, Man
AU - Wu, Yuanbin
AU - Mou, Xiaofeng
AU - Wang, Ding
N1 - Publisher Copyright:
© 2022 Proceedings - International Conference on Computational Linguistics, COLING. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that besides getting a more convincing evaluation of models, a small clean dataset also helps us to build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances (which is more informative than loss-level evidence). We also propose a teacher-student mechanism for controlling purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performances on both denoising real (NYT) and synthetic noisy datasets.
AB - Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that besides getting a more convincing evaluation of models, a small clean dataset also helps us to build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances (which is more informative than loss-level evidence). We also propose a teacher-student mechanism for controlling purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performances on both denoising real (NYT) and synthetic noisy datasets.
UR - https://www.scopus.com/pages/publications/85165725888
M3 - Conference article
AN - SCOPUS:85165725888
SN - 2951-2093
VL - 29
SP - 2528
EP - 2539
JO - Proceedings - International Conference on Computational Linguistics, COLING
JF - Proceedings - International Conference on Computational Linguistics, COLING
IS - 1
T2 - 29th International Conference on Computational Linguistics, COLING 2022
Y2 - 12 October 2022 through 17 October 2022
ER -