TY - JOUR
T1 - Adaptive feature fusion for scene text script identification
AU - Peng, Fuyou
AU - Ma, Hui
AU - Liu, Li
AU - Lu, Yue
AU - Suen, Ching Y.
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024/7
Y1 - 2024/7
AB - Script identification is an essential preliminary step in multilingual OCR systems. This paper focuses primarily on tackling the challenging problem of script identification in scene text images, which are usually characterized by low image quality, diverse text styles, and complex backgrounds. Furthermore, script identification becomes a fine-grained classification problem when some scripts share common characters. To address this issue, we propose a novel end-to-end CNN comprising two streams for extracting distinct types of features, namely, visual features and spatial features. In the visual stream, we introduce an enhanced Squeeze-and-Excitation (SE) channel attention mechanism to emphasize valuable features and suppress irrelevant ones. The enhanced SE is composed of squeeze and excitation steps. The squeeze step employs adaptive average pooling for information aggregation. Two 1x1 convolutional layers are used to derive channel weights in the excitation step. In the spatial stream, we perform efficient analysis of the spatial dependencies within the text lines based on LSTM. Finally, we propose an adaptive fusion approach that combines probability vectors from the two streams. Instead of being fixed, the weight assigned to each probability vector is learned during network training. To validate our proposed method, we conduct extensive tests on four publicly available datasets, viz. MLe2e, RRC-MLT2017, SIW-13, and CVSI-2015. Our proposed method achieves accuracies of 97.66%, 90.24%, 96.66%, and 98.44% on these four datasets, respectively, which compare favorably with state-of-the-art methods. The two streams have demonstrated complementarity. Moreover, ablation experiments have been conducted to verify the effectiveness of each component in the proposed method.
KW - Adaptive fusion
KW - Enhanced SE
KW - Script identification
KW - Two streams
UR - https://www.scopus.com/pages/publications/85181729957
U2 - 10.1007/s11042-023-17986-z
DO - 10.1007/s11042-023-17986-z
M3 - Article
AN - SCOPUS:85181729957
SN - 1380-7501
VL - 83
SP - 62677
EP - 62699
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 23
ER -