Enhancing scene text script identification through multi-task self-supervised learning

Jin Huang, Li Liu, Yue Lu, Ching Y. Suen

Research output: Contribution to journal › Article › peer-review

Abstract

This paper proposes a multi-task self-supervised learning framework for scene text script identification, aimed at addressing the challenges posed by diverse fonts, complex backgrounds, low resolutions, and frequent distortions in natural scenes. By leveraging unlabeled data, our approach learns robust image representations tailored for script identification. Three complementary tasks are designed: an advanced Jigsaw puzzle task to capture both local and global features, a spatial alignment-enhanced SwAV task to generalize across transformations while retaining spatial details, and a rotation prediction task to enhance spatial reasoning. Experiments on four benchmark datasets demonstrate that our method achieves state-of-the-art results, outperforming existing approaches. Ablation studies confirm the effectiveness of each module within our framework, showcasing its potential to reduce reliance on labeled data and enhance script identification in real-world applications. The code can be accessed at https://github.com/jin-or-king/Enhancing_scene_text_script_identification_through_multi-task_self-supervised_learning.
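As an illustration of the rotation prediction task mentioned in the abstract, the following is a minimal sketch of how labelled training pairs can be generated from unlabeled images. It is not taken from the authors' repository. The four-way rotation by multiples of 90 degrees, the function name `make_rotation_batch`, and the image size used in the example are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def make_rotation_batch(images, seed=0):
    """Build (rotated_images, labels) for a rotation-prediction pretext task.

    Assumption: each image is rotated by k * 90 degrees with k in {0, 1, 2, 3},
    and k serves as the classification label. The paper may use a different set
    of angles.
    """
    rng = np.random.default_rng(seed)
    rotated, labels = [], []
    for img in images:
        k = int(rng.integers(0, 4))  # draw one of four rotation classes
        # Rotate in the height/width plane; the channel axis is left untouched.
        rotated.append(np.rot90(img, k=k, axes=(0, 1)))
        labels.append(k)
    return rotated, np.array(labels)


# Hypothetical input: four wide word crops of shape (height, width, channels).
images = [
    np.arange(32 * 96 * 3, dtype=np.float32).reshape(32, 96, 3) + i
    for i in range(4)
]
rotated, labels = make_rotation_batch(images)
```

The rotated images are returned as a list rather than a stacked array because a 90 or 270 degree rotation of a non-square crop swaps its height and width. A network trained to predict `labels` from `rotated` needs no human annotation, which is the sense in which the task is self-supervised.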

Original language: English
Pages (from-to): 9571-9586
Number of pages: 16
Journal: Visual Computer
Volume: 41
Issue number: 12
State: Published - Sep 2025

Keywords

  • Advanced Jigsaw puzzle
  • Multi-task self-supervised learning
  • Scene text script identification
  • Spatial alignment-enhanced SwAV
