Abstract
Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that, besides enabling a more convincing evaluation of models, a small clean dataset also helps us build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances, which is more informative than loss-level evidence. We also propose a teacher-student mechanism for controlling the purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performance in denoising both real (NYT) and synthetic noisy datasets.
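To make the idea of influence-based, sample-level evidence concrete, here is a minimal sketch. It uses the standard influence-function formula, score(z) = -∇L_clean(θ)ᵀ H⁻¹ ∇L(z, θ), on a toy two-feature logistic regression. It is not the paper's actual criterion or model: the data, the function names (`fit`, `influence_scores`, and so on) and the hyperparameters are all assumptions made for illustration. Under this sign convention, a negative score means that upweighting the training instance would lower the loss on the small clean set, while a positive score marks an instance that looks harmful.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad(theta, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
    return [(p - y) * x[0], (p - y) * x[1]]

def fit(data, l2=0.1, lr=0.5, steps=2000):
    """Gradient descent on the mean logistic loss plus an L2 penalty."""
    theta = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        g = [l2 * theta[0], l2 * theta[1]]
        for x, y in data:
            gi = grad(theta, x, y)
            g[0] += gi[0] / n
            g[1] += gi[1] / n
        theta = [theta[0] - lr * g[0], theta[1] - lr * g[1]]
    return theta

def hessian(theta, data, l2=0.1):
    """2x2 Hessian of the regularised mean training loss."""
    n = len(data)
    h = [[l2, 0.0], [0.0, l2]]
    for x, _ in data:
        p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
        w = p * (1.0 - p) / n
        for i in range(2):
            for j in range(2):
                h[i][j] += w * x[i] * x[j]
    return h

def solve2(h, v):
    """Solve h @ s = v for a 2x2 matrix h in closed form."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(h[1][1] * v[0] - h[0][1] * v[1]) / det,
            (h[0][0] * v[1] - h[1][0] * v[0]) / det]

def influence_scores(train, clean, l2=0.1):
    """Influence of upweighting each training instance on the clean-set loss."""
    theta = fit(train, l2)
    h = hessian(theta, train, l2)
    # Mean gradient of the loss over the small clean set.
    gc = [0.0, 0.0]
    for x, y in clean:
        g = grad(theta, x, y)
        gc[0] += g[0] / len(clean)
        gc[1] += g[1] / len(clean)
    s = solve2(h, gc)  # H^{-1} applied to the clean-set gradient
    scores = []
    for x, y in train:
        g = grad(theta, x, y)
        scores.append(-(s[0] * g[0] + s[1] * g[1]))
    return scores

# Hypothetical toy data: six correctly labelled points and one flipped label.
TRAIN = [
    ((1.0, 1.2), 1), ((0.8, 1.0), 1), ((1.2, 0.9), 1),
    ((-1.0, -1.1), 0), ((-0.9, -1.2), 0), ((-1.1, -0.8), 0),
    ((1.0, 0.8), 0),  # label deliberately flipped to simulate noise
]
CLEAN = [((1.1, 1.0), 1), ((-1.0, -1.0), 0)]

scores = influence_scores(TRAIN, CLEAN)
for (x, y), sc in zip(TRAIN, scores):
    print(x, y, round(sc, 3))
```

In this toy setting the flipped instance should receive the largest score, so ranking instances by score separates it from the correctly labelled ones. A real extractor has far too many parameters for an explicit Hessian, so the inverse-Hessian-vector product would need to be approximated.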
| Original language | English |
|---|---|
| Pages (from-to) | 2528-2539 |
| Number of pages | 12 |
| Journal | Proceedings - International Conference on Computational Linguistics, COLING |
| Volume | 29 |
| Issue number | 1 |
| State | Published - 2022 |
| Event | 29th International Conference on Computational Linguistics, COLING 2022, Hybrid, Gyeongju, Republic of Korea (12 Oct 2022 → 17 Oct 2022) |
| Title | Few Clean Instances Help Denoising Distant Supervision |