Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training

  • Taolin Zhang
  • Junwei Dong
  • Jianing Wang
  • Chengyu Wang*
  • Ang Wang
  • Yinghui Liu
  • Jun Huang
  • Yong Li
  • Xiaofeng He

  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

Recently, knowledge-enhanced pre-trained language models (KEPLMs) have improved context-aware representations by learning from structured relations in knowledge graphs and/or from linguistic knowledge derived from syntactic or dependency analysis. Unlike for English, the natural language processing (NLP) community lacks high-performing open-source Chinese KEPLMs to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic knowledge are effectively injected into CKBERT through two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on these two pre-training paradigms and our in-house implemented TorchAccelerator, we have efficiently pre-trained base (110M), large (345M), and huge (1.3B) versions of CKBERT on GPU clusters. Experiments demonstrate that CKBERT outperforms strong Chinese baselines across various benchmark NLP tasks and model sizes.
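The abstract does not give implementation details for the two pre-training tasks. As a rough illustration only, the span-level idea behind linguistic-aware masked language modeling — masking whole linguistically meaningful units (e.g., words or entities from a parser) rather than isolated characters — could be sketched as follows. The function name, the span format, and the masking budget are all assumptions for this toy example, not the paper's actual procedure:

```python
import random

MASK = "[MASK]"

def linguistic_aware_mask(tokens, spans, mask_ratio=0.15, seed=0):
    """Toy sketch (hypothetical, not the paper's code): mask entire
    linguistic spans instead of random single tokens.

    tokens: list of token strings (e.g., Chinese characters)
    spans:  list of (start, end) half-open index pairs marking
            linguistic units, assumed to come from a parser
    Returns the masked sequence and a {position: original token}
    dict serving as MLM prediction targets.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))  # rough token budget
    labels = {}
    # Visit candidate spans in random order; mask each one whole.
    for start, end in rng.sample(spans, len(spans)):
        if len(labels) >= budget:
            break
        for i in range(start, end):
            labels[i] = masked[i]
            masked[i] = MASK
    return masked, labels
```

Because each sampled span is masked in full before the budget is re-checked, the model must recover complete linguistic units at prediction time, which is the intuition behind span-level knowledge masking.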

Original language: English
Title of host publication: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Subtitle of host publication: Industry Track
Publisher: Association for Computational Linguistics (ACL)
Pages: 570-580
Number of pages: 11
ISBN (Electronic): 9781952148255
DOIs
State: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 – 11 Dec 2022

Publication series

Name: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7/12/22 – 11/12/22

