TY - GEN
T1 - CCC
T2 - 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023
AU - Liu, Shu
AU - Jin, Yongnan
AU - Lu, Harry
AU - Zhao, Shangqing
AU - Lan, Man
AU - Chen, Yuefeng
AU - Yuan, Hao
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.
PY - 2023
Y1 - 2023
N2 - In recent years, visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts and patents. However, high-quality domain-specific datasets are available only for English. For other languages such as Chinese, it is hard to use current English datasets because of significant differences in writing norms and layout formats. To mitigate this issue, we introduce the Chinese Commercial Contracts (CCC) dataset to explore better visual document layout understanding modeling for Chinese commercial contracts. This dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-training task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https://github.com/yysirs/CLUPT.
AB - In recent years, visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts and patents. However, high-quality domain-specific datasets are available only for English. For other languages such as Chinese, it is hard to use current English datasets because of significant differences in writing norms and layout formats. To mitigate this issue, we introduce the Chinese Commercial Contracts (CCC) dataset to explore better visual document layout understanding modeling for Chinese commercial contracts. This dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-training task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https://github.com/yysirs/CLUPT.
KW - Chinese Layout Understanding Pre-train Transformer
KW - Chinese commercial contract
KW - Visual Document Understanding
UR - https://www.scopus.com/pages/publications/85174707178
U2 - 10.1007/978-3-031-44696-2_55
DO - 10.1007/978-3-031-44696-2_55
M3 - Conference contribution
AN - SCOPUS:85174707178
SN - 9783031446955
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 704
EP - 716
BT - Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings
A2 - Liu, Fei
A2 - Duan, Nan
A2 - Xu, Qingting
A2 - Hong, Yu
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 12 October 2023 through 15 October 2023
ER -