Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training

  • Taolin Zhang
  • Junwei Dong
  • Jianing Wang
  • Chengyu Wang*
  • Ang Wang
  • Yinghui Liu
  • Jun Huang
  • Yong Li
  • Xiaofeng He

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Recently, knowledge-enhanced pre-trained language models (KEPLMs) improve context-aware representations via learning from structured relations in knowledge graphs, and/or linguistic knowledge from syntactic or dependency analysis. Unlike English, there is a lack of high-performing open-source Chinese KEPLMs in the natural language processing (NLP) community to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic knowledge is effectively injected into CKBERT based on two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on the above two pre-training paradigms and our in-house implemented TorchAccelerator, we have pre-trained base (110M), large (345M) and huge (1.3B) versions of CKBERT efficiently on GPU clusters. Experiments demonstrate that CKBERT outperforms strong baselines for Chinese over various benchmark NLP tasks and in terms of different model sizes.

Original language: English
Title of host publication: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Subtitle of host publication: Industry Track
Publisher: Association for Computational Linguistics (ACL)
Pages: 570-580
Number of pages: 11
ISBN (electronic): 9781952148255
DOI
Publication status: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 to 11 Dec 2022

Publication series

Name: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7 Dec 2022 to 11 Dec 2022
