TY - JOUR
T1 - NDOrder
T2 - Exploring a novel decoding order for scene text recognition
AU - Zhong, Dajian
AU - Zhan, Hongjian
AU - Lyu, Shujing
AU - Liu, Cong
AU - Yin, Bing
AU - Shivakumara, Palaiahnakote
AU - Pal, Umapada
AU - Lu, Yue
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/9/1
Y1 - 2024/9/1
N2 - Text recognition in scene images is still considered a challenging task for the computer vision and pattern recognition community. For text images affected by multiple adverse factors, such as occlusion (due to obstacles) and poor quality (due to blur and low resolution), the performance of state-of-the-art scene text recognition methods degrades. The key reason is that the existing encoder–decoder framework follows a fixed left-to-right decoding order, which lacks sufficient contextual information. In this paper, we present a novel decoding order in which good-quality characters are decoded first, followed by low-quality characters, which preserves the contextual information irrespective of the aforementioned difficult scenarios. Our method, named NDOrder, extracts visual features with a ViT encoder and then decodes with the Random Order Generation module (ROG), which learns to decode with random decoding orders, and the Vision-Content-Position module (VCP), which exploits the connections among visual information, content, and position. In addition, a new dataset named OLQT (Occluded and Low-Quality Text) is created by manually collecting text images that suffer from occlusion or low quality from several standard text recognition datasets. The dataset is now available at https://github.com/djzhong1/OLQT. Experiments on OLQT and public scene text recognition benchmarks show that the proposed method achieves state-of-the-art performance.
AB - Text recognition in scene images is still considered a challenging task for the computer vision and pattern recognition community. For text images affected by multiple adverse factors, such as occlusion (due to obstacles) and poor quality (due to blur and low resolution), the performance of state-of-the-art scene text recognition methods degrades. The key reason is that the existing encoder–decoder framework follows a fixed left-to-right decoding order, which lacks sufficient contextual information. In this paper, we present a novel decoding order in which good-quality characters are decoded first, followed by low-quality characters, which preserves the contextual information irrespective of the aforementioned difficult scenarios. Our method, named NDOrder, extracts visual features with a ViT encoder and then decodes with the Random Order Generation module (ROG), which learns to decode with random decoding orders, and the Vision-Content-Position module (VCP), which exploits the connections among visual information, content, and position. In addition, a new dataset named OLQT (Occluded and Low-Quality Text) is created by manually collecting text images that suffer from occlusion or low quality from several standard text recognition datasets. The dataset is now available at https://github.com/djzhong1/OLQT. Experiments on OLQT and public scene text recognition benchmarks show that the proposed method achieves state-of-the-art performance.
KW - Contextual information
KW - Decoding order optimization
KW - Random order generation
KW - Scene text recognition
KW - Transformer
UR - https://www.scopus.com/pages/publications/85188690155
U2 - 10.1016/j.eswa.2024.123771
DO - 10.1016/j.eswa.2024.123771
M3 - Article
AN - SCOPUS:85188690155
SN - 0957-4174
VL - 249
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 123771
ER -