TY - GEN
T1 - Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
AU - Zhao, Zhen
AU - Tang, Jingqun
AU - Lin, Chunhui
AU - Wu, Binghong
AU - Huang, Can
AU - Liu, Hao
AU - Tan, Xin
AU - Zhang, Zhizhong
AU - Xie, Yuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed 'In-Context Learning' (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E2STR, an STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E2STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E2STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR.
AB - Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed 'In-Context Learning' (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E2STR, an STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E2STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E2STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR.
KW - in-context learning
KW - multi-modal learning
KW - text recognition
UR - https://www.scopus.com/pages/publications/85202367368
U2 - 10.1109/CVPR52733.2024.01474
DO - 10.1109/CVPR52733.2024.01474
M3 - Conference contribution
AN - SCOPUS:85202367368
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 15567
EP - 15576
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -