UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan*, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, Fei Huang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

54 Scopus citations

Abstract

Text is ubiquitous in our visual world, conveying crucial information in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on a Multimodal Large Language Model (MLLM). By leveraging the shallow text-recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than in previous work that follows a domain-specific pretraining and finetuning paradigm. Concretely, UReader is jointly finetuned on a wide range of visually-situated language understanding tasks via a unified instruction format. To enhance visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation. We design a shape-adaptive cropping module, placed before the encoder-decoder architecture of the MLLM, so that the frozen low-resolution vision encoder can process high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Code and instruction-tuning datasets are released at https://github.com/LukeForeverYoung/UReader.
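A minimal sketch of how such a shape-adaptive cropping step might work is given below. It assumes a frozen 224×224 vision encoder, a budget of at most 20 crops, and a simple aspect-ratio score standing in for the paper's exact grid-matching criterion; all names here (ENC_W, select_grid, shape_adaptive_crops) are illustrative, not taken from the released code.

```python
# Illustrative sketch of shape-adaptive cropping (not the released
# implementation). Assumptions: a frozen vision encoder with a fixed
# 224x224 input, a budget of at most 20 crops, and an aspect-ratio
# score in place of the paper's exact grid-matching rule.
from PIL import Image

ENC_W, ENC_H = 224, 224   # frozen encoder input size (assumption)
MAX_CELLS = 20            # crop budget (assumption)

def candidate_grids(max_cells: int):
    """All (rows, cols) grids whose cell count fits the budget."""
    return [(r, c)
            for r in range(1, max_cells + 1)
            for c in range(1, max_cells + 1)
            if r * c <= max_cells]

def select_grid(img_w: int, img_h: int):
    """Pick the grid whose overall shape best matches the image's
    aspect ratio; ties are broken in favor of more cells (more detail)."""
    target = img_w / img_h
    def score(grid):
        rows, cols = grid
        grid_ratio = (cols * ENC_W) / (rows * ENC_H)
        return (abs(grid_ratio - target), -(rows * cols))
    return min(candidate_grids(MAX_CELLS), key=score)

def shape_adaptive_crops(image: Image.Image):
    """Resize the image onto the selected grid, cut it into
    encoder-sized crops, and prepend one global low-resolution view."""
    rows, cols = select_grid(*image.size)
    resized = image.resize((cols * ENC_W, rows * ENC_H))
    crops = [resized.crop((c * ENC_W, r * ENC_H,
                           (c + 1) * ENC_W, (r + 1) * ENC_H))
             for r in range(rows) for c in range(cols)]
    global_view = image.resize((ENC_W, ENC_H))  # whole-image context
    return [global_view] + crops  # each goes to the frozen encoder
```

Because every crop matches the encoder's native input resolution, high-resolution documents can be processed without unfreezing or retraining the vision encoder, which is what keeps the finetuned parameter count so small.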

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: EMNLP 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 2841-2858
Number of pages: 18
ISBN (Electronic): 9798891760615
DOIs
State: Published - 2023
Event: 2023 Findings of the Association for Computational Linguistics: EMNLP 2023 - Hybrid, Singapore
Duration: 6 Dec 2023 - 10 Dec 2023

Publication series

Name: Findings of the Association for Computational Linguistics: EMNLP 2023

Conference

Conference: 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
Country/Territory: Singapore
City: Hybrid
Period: 6/12/23 - 10/12/23
