Few Clean Instances Help Denoising Distant Supervision

  • Yufang Liu*
  • Ziyin Huang*
  • Yijun Wang
  • Changzhi Sun
  • Man Lan
  • Yuanbin Wu
  • Xiaofeng Mou
  • Ding Wang

* Corresponding author for this work

  • East China Normal University
  • Shanghai Jiao Tong University
  • ByteDance Ltd.
  • Midea Group

Research output: Contribution to journal › Conference article › peer-review

Abstract

Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that besides giving a more convincing evaluation of models, a small clean dataset also helps us build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances (which is more informative than loss-level evidence). We also propose a teacher-student mechanism for controlling the purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performance on denoising both real (NYT) and synthetic noisy datasets.
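
To make the idea of influence-based selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes the classic influence-function approximation (score = -grad_clean^T H^-1 grad_i) applied to a toy L2-regularized logistic regression; the abstract does not specify the exact criterion, model, or data, so every name here (fit_logreg, influence_scores, the synthetic data, the 30% flip rate) is hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(X, w, l2):
    # Hessian of the mean logistic loss plus the L2 term.
    p = sigmoid(X @ w)
    return (X.T * (p * (1.0 - p))) @ X / len(X) + l2 * np.eye(X.shape[1])

def fit_logreg(X, y, l2=0.01, steps=25):
    # Newton's method for L2-regularized logistic regression.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(X) + l2 * w
        w -= np.linalg.solve(hessian(X, w, l2), grad)
    return w

def influence_scores(X_train, y_train, X_clean, y_clean, w, l2=0.01):
    # Approximate effect of upweighting each training instance on the
    # mean loss over the clean set. Negative = helpful, positive = harmful.
    H = hessian(X_train, w, l2)
    g_clean = X_clean.T @ (sigmoid(X_clean @ w) - y_clean) / len(y_clean)
    s = np.linalg.solve(H, g_clean)
    per_sample_grad = (sigmoid(X_train @ w) - y_train)[:, None] * X_train
    return -per_sample_grad @ s

# Synthetic stand-in for a distantly supervised set (assumption, not NYT).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X_train = rng.normal(size=(400, 2))
y_true = (X_train @ w_true > 0).astype(float)
flipped = rng.random(400) < 0.3               # simulated label noise
y_noisy = np.where(flipped, 1.0 - y_true, y_true)

# Small clean set with trusted labels.
X_clean = rng.normal(size=(50, 2))
y_clean = (X_clean @ w_true > 0).astype(float)

w = fit_logreg(X_train, y_noisy)
scores = influence_scores(X_train, y_noisy, X_clean, y_clean, w)

# Candidate clean instances: those whose upweighting lowers clean-set loss.
selected = np.where(scores < 0)[0]
```

In this toy setup, instances with flipped labels tend to receive higher (more harmful) scores than correctly labelled ones, which is the sample-level signal an influence-based criterion relies on. The teacher-student bootstrapping step described in the abstract is not modelled here.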

Original language: English
Pages (from-to): 2528-2539
Number of pages: 12
Journal: Proceedings - International Conference on Computational Linguistics, COLING
Volume: 29
Issue number: 1
State: Published - 2022
Event: 29th International Conference on Computational Linguistics, COLING 2022 - Hybrid, Gyeongju, Korea, Republic of
Duration: 12 Oct 2022 to 17 Oct 2022
