TY - JOUR
T1 - Enhancing scene text script identification through multi-task self-supervised learning
AU - Huang, Jin
AU - Liu, Li
AU - Lu, Yue
AU - Suen, Ching Y.
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
PY - 2025/9
Y1 - 2025/9
AB - This paper proposes a multi-task self-supervised learning framework for scene text script identification, aimed at addressing the challenges posed by diverse fonts, complex backgrounds, low resolutions, and frequent distortions in natural scenes. By leveraging unlabeled data, our approach learns robust image representations tailored for script identification. Three complementary tasks are designed: an advanced Jigsaw puzzle task to capture both local and global features, a spatial alignment-enhanced SwAV task to generalize across transformations while retaining spatial details, and a rotation prediction task to enhance spatial reasoning. Experiments on four benchmark datasets demonstrate that our method achieves state-of-the-art results, outperforming existing approaches. Ablation studies confirm the effectiveness of each module within our framework, showcasing its potential to reduce reliance on labeled data and enhance script identification in real-world applications. The code can be accessed at https://github.com/jin-or-king/Enhancing_scene_text_script_identification_through_multi-task_self-supervised_learning.
KW - Advanced Jigsaw puzzle
KW - Multi-task self-supervised learning
KW - Scene text script identification
KW - Spatial alignment-enhanced SwAV
UR - https://www.scopus.com/pages/publications/105005807435
U2 - 10.1007/s00371-025-03982-x
DO - 10.1007/s00371-025-03982-x
M3 - Article
AN - SCOPUS:105005807435
SN - 0178-2789
VL - 41
SP - 9571
EP - 9586
JO - Visual Computer
JF - Visual Computer
IS - 12
ER -