TY - GEN
T1 - Causal Fusion of Convolutional Neural Network and Vision Transformer for Image Anomaly Detection and Localization
AU - Zhang, Shuo
AU - Hu, Xiongpeng
AU - Liu, Jing
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - To address the challenge of visual anomaly detection under complex background interference, we first construct a structural causal model for anomaly detection in this setting and propose an intervention strategy that blocks the interference of background features. We then build an anomaly feature-sensitive neural network (AFSNN) containing two feature extraction modules based on this causal intervention strategy. Given the limitations of convolutional neural networks in capturing global features with spatial location dependence, and the substantial data requirements of vision transformers, we adopt an enhanced Swin Transformer module and a deformable convolutional network encoder module to extract global features and local details, respectively. We also design a cross-attention mechanism to fuse these two scales of feature representation. Finally, we introduce a causality-sensitive learning module that differentiates the outputs of the two feature extraction modules and constructs a causality-sensitive loss function by maximizing their output differences. This approach blocks background features and enhances sensitivity to anomaly features during training. Experiments show that AFSNN can effectively attenuate the confounding interference of background patterns.
AB - To address the challenge of visual anomaly detection under complex background interference, we first construct a structural causal model for anomaly detection in this setting and propose an intervention strategy that blocks the interference of background features. We then build an anomaly feature-sensitive neural network (AFSNN) containing two feature extraction modules based on this causal intervention strategy. Given the limitations of convolutional neural networks in capturing global features with spatial location dependence, and the substantial data requirements of vision transformers, we adopt an enhanced Swin Transformer module and a deformable convolutional network encoder module to extract global features and local details, respectively. We also design a cross-attention mechanism to fuse these two scales of feature representation. Finally, we introduce a causality-sensitive learning module that differentiates the outputs of the two feature extraction modules and constructs a causality-sensitive loss function by maximizing their output differences. This approach blocks background features and enhances sensitivity to anomaly features during training. Experiments show that AFSNN can effectively attenuate the confounding interference of background patterns.
KW - Swin Transformer
KW - anomaly detection
KW - causal inference
KW - cross-attention mechanism
UR - https://www.scopus.com/pages/publications/85206576902
U2 - 10.1109/ICME57554.2024.10687979
DO - 10.1109/ICME57554.2024.10687979
M3 - Conference contribution
AN - SCOPUS:85206576902
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -