TY - GEN
T1 - Farewell to Aimless Large-scale Pretraining
T2 - Findings of the Association for Computational Linguistics, ACL 2023
AU - Wang, Xiao
AU - Zhou, Weikang
AU - Zhang, Qi
AU - Zhou, Jie
AU - Gao, Songyang
AU - Wang, Junzhe
AU - Zhang, Menghan
AU - Gao, Xiang
AU - Chen, Yunwen
AU - Gui, Tao
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method, which drastically reduces the time needed to compute influence. With only 0.45% of the data and a computational cost three orders of magnitude lower, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
UR - https://www.scopus.com/pages/publications/85175480227
U2 - 10.18653/v1/2023.findings-acl.35
DO - 10.18653/v1/2023.findings-acl.35
M3 - Conference contribution
AN - SCOPUS:85175480227
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 555
EP - 568
BT - Findings of the Association for Computational Linguistics, ACL 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -
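
The abstract's gradient-matching idea can be sketched in a few lines: each candidate pretraining sample is scored by how well its loss gradient aligns with the end-task loss gradient, and the best-aligned samples form the selected subset. The sketch below is a minimal illustration under that reading; the function names (`flat_grad`, `select_subset`), the cosine-similarity scoring, and the toy demo are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of gradient-matching influence scoring (illustrative only;
# names and details are assumptions, not the paper's released code).
import torch
import torch.nn.functional as F


def flat_grad(loss: torch.Tensor, params) -> torch.Tensor:
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def select_subset(model, loss_fn, corpus, end_task_batch, budget: int):
    """Keep the `budget` candidate samples whose loss gradients align best
    (by cosine similarity) with the end-task loss gradient."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_task = flat_grad(loss_fn(model, end_task_batch), params)  # computed once
    scored = []
    for z in corpus:
        g_z = flat_grad(loss_fn(model, z), params)  # per-candidate gradient
        scored.append((F.cosine_similarity(g_z, g_task, dim=0).item(), z))
    scored.sort(key=lambda t: t[0], reverse=True)  # most positive influence first
    return [z for _, z in scored[:budget]]


if __name__ == "__main__":
    # Toy demo: linear model, squared-error loss, random data.
    model = torch.nn.Linear(8, 1)
    loss_fn = lambda m, batch: F.mse_loss(m(batch[0]), batch[1])
    corpus = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(100)]
    end_task = (torch.randn(16, 8), torch.randn(16, 1))
    subset = select_subset(model, loss_fn, corpus, end_task, budget=5)
    print(f"kept {len(subset)} of {len(corpus)} candidates")
```

The cosine-similarity proxy above is only one common way to instantiate gradient matching; the paper's actual estimator may differ in how gradients are batched, normalized, or approximated to reach the reported three-orders-of-magnitude cost reduction.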