EMBERT: A Pre-trained Language Model for Chinese Medical Text Mining

Zerui Cai, Taolin Zhang, Chengyu Wang, Xiaofeng He

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

Medical text mining aims to learn models that extract useful information from medical sources. A major challenge is that obtaining large-scale labeled data in the medical domain for model training is highly expensive. Recent studies show that pre-training language models on massive unlabeled corpora via self-supervised learning alleviates this problem. In this paper, we propose EMBERT, an entity-level knowledge-enhanced pre-trained language model that leverages several distinct self-supervised tasks for Chinese medical text mining. EMBERT captures fine-grained semantic relations among medical terms through three self-supervised tasks: i) context-entity consistency prediction (deciding whether entities are equivalent in meaning given certain contexts), ii) entity segmentation (segmenting entities into fine-grained semantic parts), and iii) bidirectional entity masking (predicting the atomic or adjective terms of long entities). Experimental results demonstrate that our model achieves significant improvements over five strong baselines on six public Chinese medical text mining datasets.
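All three tasks operate on medical entity spans within a token sequence. As a rough illustration of the span-masking idea behind the third task, the sketch below masks an entity and records the tokens the model would have to recover; the function name and simplified setup are our assumptions, not the authors' implementation.

```python
# Minimal sketch of entity-span masking, illustrating the idea behind
# entity masking objectives; this is NOT EMBERT's actual implementation.

MASK = "[MASK]"

def mask_entity_span(tokens, start, end):
    """Replace the tokens of the entity span [start, end) with [MASK].

    Returns the masked input sequence and a dict mapping each masked
    position to its original token (the prediction targets).
    """
    masked = list(tokens)
    targets = {}
    for i in range(start, end):
        targets[i] = masked[i]  # the model must recover these tokens
        masked[i] = MASK
    return masked, targets

# Example: mask the entity 糖尿病 ("diabetes") in a short Chinese sentence.
tokens = ["患", "者", "诊", "断", "为", "糖", "尿", "病"]
masked, targets = mask_entity_span(tokens, 5, 8)
```

Here `masked` ends in three `[MASK]` tokens and `targets` holds the entity characters at positions 5–7; a bidirectional variant would additionally condition the prediction of each entity token on the others from both directions.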

Original language: English
Title of host publication: Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Proceedings
Editors: Leong Hou U, Marc Spaniol, Yasushi Sakurai, Junying Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 242-257
Number of pages: 16
ISBN (Print): 9783030858957
State: Published - 2021
Event: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021 - Guangzhou, China
Duration: 23 Aug 2021 - 25 Aug 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12858 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Country/Territory: China
City: Guangzhou
Period: 23/08/21 - 25/08/21

Keywords

  • Chinese medical text mining
  • Deep context-aware neural network
  • Pre-trained language model
  • Self-supervised learning
