CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding

  • Shu Liu
  • Yongnan Jin
  • Harry Lu
  • Shangqing Zhao
  • Man Lan*
  • Yuefeng Chen
  • Hao Yuan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts and patents. However, no high-quality domain-specific datasets are available for languages other than English. For languages such as Chinese, existing English datasets are hard to use because of significant differences in writing norms and layout formats. To mitigate this issue, we introduce the Chinese Commercial Contracts (CCC) dataset to explore better visual document layout understanding modeling for Chinese commercial contracts. This dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-training task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https://github.com/yysirs/CLUPT.

Original language: English
Title of host publication: Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings
Editors: Fei Liu, Nan Duan, Qingting Xu, Yu Hong
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 704-716
Number of pages: 13
ISBN (Print): 9783031446955
DOIs
State: Published - 2023
Event: 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023 - Foshan, China
Duration: 12 Oct 2023 to 15 Oct 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14303 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023
Country/Territory: China
City: Foshan
Period: 12/10/23 to 15/10/23

Keywords

  • Chinese Layout Understanding Pre-train Transformer
  • Chinese commercial contract
  • Visual Document Understanding
