TY - GEN
T1 - DLE
T2 - 2024 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2024
AU - Quan, Jiahao
AU - Wang, Hailing
AU - Wu, Chunwei
AU - Cao, Guitao
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Document images captured with mobile devices in natural environments are often affected by various types of illumination degradation. This degradation diminishes the clarity and readability of document images, complicating downstream OCR tasks. Existing methods typically address only one or a few degradation types and do not account for the diversity of illumination degradation. Additionally, these methods often rely on a fixed, pre-trained sub-network to estimate background light or shadows, which lacks flexibility and adaptability. To overcome these challenges, this study proposes a novel framework named DLE, which comprises a two-loop generative adversarial network and a multi-modal discriminator. Specifically, to improve the quality of image representation, a mask extractor is embedded before the generator's image input. This forces the model to focus on distinctive features in the image, enhancing the representation of anomalously illuminated and degraded regions. The mask extractor generates a luminance mask to evaluate the illumination difference between the input and target images. Subsequently, the consistency loss computation incorporates dynamic optimization of the mask extractor, strengthening its ability to estimate illumination-degraded regions. Moreover, a pre-trained vision-language model is introduced into the multi-modal discriminator, leveraging its robust cross-modal alignment capability to improve the semantic consistency of the generated images with the preset input text. Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance in terms of edit distance (ED) and character error rate (CER).
AB - Document images captured with mobile devices in natural environments are often affected by various types of illumination degradation. This degradation diminishes the clarity and readability of document images, complicating downstream OCR tasks. Existing methods typically address only one or a few degradation types and do not account for the diversity of illumination degradation. Additionally, these methods often rely on a fixed, pre-trained sub-network to estimate background light or shadows, which lacks flexibility and adaptability. To overcome these challenges, this study proposes a novel framework named DLE, which comprises a two-loop generative adversarial network and a multi-modal discriminator. Specifically, to improve the quality of image representation, a mask extractor is embedded before the generator's image input. This forces the model to focus on distinctive features in the image, enhancing the representation of anomalously illuminated and degraded regions. The mask extractor generates a luminance mask to evaluate the illumination difference between the input and target images. Subsequently, the consistency loss computation incorporates dynamic optimization of the mask extractor, strengthening its ability to estimate illumination-degraded regions. Moreover, a pre-trained vision-language model is introduced into the multi-modal discriminator, leveraging its robust cross-modal alignment capability to improve the semantic consistency of the generated images with the preset input text. Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance in terms of edit distance (ED) and character error rate (CER).
UR - https://www.scopus.com/pages/publications/85217852232
U2 - 10.1109/SMC54092.2024.10831684
DO - 10.1109/SMC54092.2024.10831684
M3 - Conference contribution
AN - SCOPUS:85217852232
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 3701
EP - 3707
BT - 2024 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 October 2024 through 10 October 2024
ER -