TY - GEN
T1 - UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
T2 - Findings of the Association for Computational Linguistics: EMNLP 2023
AU - Ye, Jiabo
AU - Hu, Anwen
AU - Xu, Haiyang
AU - Ye, Qinghao
AU - Yan, Ming
AU - Xu, Guohai
AU - Li, Chenliang
AU - Tian, Junfeng
AU - Qian, Qi
AU - Zhang, Ji
AU - Jin, Qin
AU - He, Liang
AU - Lin, Xin
AU - Huang, Fei
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Text is ubiquitous in our visual world, conveying crucial information in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on a Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of visually-situated language understanding tasks via a unified instruction format. To enhance visual text and semantic understanding, we further apply two auxiliary tasks in the same format, namely text reading and key points generation. We design a shape-adaptive cropping module before the encoder-decoder architecture of the MLLM so that the frozen low-resolution vision encoder can process high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Code and instruction-tuning datasets are released at https://github.com/LukeForeverYoung/UReader.
UR - https://www.scopus.com/pages/publications/85178114404
DO - 10.18653/v1/2023.findings-emnlp.187
M3 - Conference contribution
AN - SCOPUS:85178114404
T3 - Findings of the Association for Computational Linguistics: EMNLP 2023
SP - 2841
EP - 2858
BT - Findings of the Association for Computational Linguistics: EMNLP 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -