Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

  • Zhen Zhao
  • Jingqun Tang*
  • Chunhui Lin
  • Binghong Wu
  • Can Huang
  • Hao Liu
  • Xin Tan
  • Zhizhong Zhang
  • Yuan Xie*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

28 Scopus citations

Abstract

Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed 'In-Context Learning' (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E²STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E²STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E²STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 15567-15576
Number of pages: 10
ISBN (Electronic): 9798350353006
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 to 22/06/24

Keywords

  • in-context learning
  • multi-modal learning
  • text recognition
