
Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

  • Xiao Wang
  • Weikang Zhou
  • Qi Zhang*
  • Jie Zhou
  • Songyang Gao
  • Junzhe Wang
  • Menghan Zhang
  • Xiang Gao
  • Yunwen Chen
  • Tao Gui*
  • *Corresponding author for this work
  • Fudan University
  • Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, and this has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language model, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, the ISS selects the samples that will provide the most positive influence on the performance of the end-task. Furthermore, we design a gradient matching based influence estimation method, which can drastically reduce the computation time of influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
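
The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' ISS implementation, whose details are not given in this record. It assumes a toy logistic-regression model and uses a plain first-order gradient-alignment score: a pretraining sample scores high when its gradient points in the same direction as the mean end-task gradient, so a gradient step on that sample would also lower the end-task loss to first order. All function names (`logistic_grad`, `mean_grad`, `influence_scores`, `select_subset`) and the toy data are hypothetical.

```python
import math

def logistic_grad(w, x, y):
    """Gradient of the logistic loss for one (x, y) pair with respect to w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def mean_grad(w, samples):
    """Average per-sample gradient over a list of (x, y) pairs."""
    grads = [logistic_grad(w, x, y) for x, y in samples]
    return [sum(col) / len(grads) for col in zip(*grads)]

def influence_scores(w, pretrain, task):
    """Dot product of each pretraining gradient with the mean end-task gradient."""
    g_task = mean_grad(w, task)
    return [
        sum(a * b for a, b in zip(logistic_grad(w, x, y), g_task))
        for x, y in pretrain
    ]

def select_subset(w, pretrain, task, k):
    """Indices of the k pretraining samples with the highest alignment score."""
    scores = influence_scores(w, pretrain, task)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]

# Toy data (assumed for illustration only).
w = [0.0, 0.0]
task = [([1.0, 0.0], 1)]
pretrain = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
print(select_subset(w, pretrain, task, 2))  # prints [0, 2]
```

In this toy case, sample 0 matches the end-task example and gets the highest score, sample 1 has the opposite label and gets a negative score, and sample 2 is orthogonal to the task and scores zero.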

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, ACL 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 555-568
Number of pages: 14
ISBN (Electronic): 9781959429623
DOI
Publication status: Published - 2023
Externally published
Event: Findings of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada
Duration: 9 Jul 2023 - 14 Jul 2023

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: Findings of the Association for Computational Linguistics, ACL 2023
Country/Territory: Canada
City: Toronto
Period: 9/07/23 - 14/07/23
