TY - GEN
T1 - DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
T2 - 2024 SIAM International Conference on Data Mining, SDM 2024
AU - Chu, Zhenzhen
AU - Chen, Jiayu
AU - Chen, Cen
AU - Wang, Chengyu
AU - Wu, Ziheng
AU - Huang, Jun
AU - Qian, Weining
N1 - Publisher Copyright:
Copyright © 2024 by SIAM.
PY - 2024
Y1 - 2024
AB - Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing, and with the development of various ViT structures they have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to effectively learn visual features. In this paper, we propose a lightweight and efficient vision transformer model called DualToken-ViT that leverages the advantages of both CNNs and ViTs. DualToken-ViT fuses tokens carrying local information, obtained by a convolution-based structure, with tokens carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effectiveness of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection, and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
KW - Attention
KW - Vision Transformers
UR - https://www.scopus.com/pages/publications/85193498161
M3 - Conference contribution
AN - SCOPUS:85193498161
T3 - Proceedings of the 2024 SIAM International Conference on Data Mining, SDM 2024
SP - 688
EP - 696
BT - Proceedings of the 2024 SIAM International Conference on Data Mining, SDM 2024
A2 - Shekhar, Shashi
A2 - Papalexakis, Vagelis
A2 - Gao, Jing
A2 - Jiang, Zhe
A2 - Riondato, Matteo
PB - Society for Industrial and Applied Mathematics Publications
Y2 - 18 April 2024 through 20 April 2024
ER -
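
The abstract above describes DualToken-ViT's core mechanism: fusing local tokens produced by a convolution branch with global tokens produced by a self-attention branch, with position-aware global tokens carried through all stages. Below is a minimal PyTorch sketch of that dual-token fusion idea, reconstructed from the abstract alone rather than from the authors' code; the module names, dimensions, attention wiring, and fusion-by-addition choice are all illustrative assumptions, not the paper's exact design.

# Minimal sketch of the dual-token fusion idea from the abstract.
# All sizes and the fusion-by-addition choice are assumptions.
import torch
import torch.nn as nn


class DualTokenBlockSketch(nn.Module):
    def __init__(self, dim: int = 64, num_global_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Convolution branch: a depthwise 3x3 conv captures local information.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Learnable position-aware global tokens; the paper carries these
        # across all stages, but here they live in one block for brevity.
        self.global_tokens = nn.Parameter(torch.randn(1, num_global_tokens, dim) * 0.02)
        # Self-attention branch: image tokens query the global tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        local = self.local(x)  # tokens with local information
        # Flatten spatial positions into a token sequence: (B, H*W, C).
        tokens = x.flatten(2).transpose(1, 2)
        g = self.global_tokens.expand(b, -1, -1)  # (B, G, C)
        # Each image token attends to the global tokens for global information.
        global_out, _ = self.attn(self.norm(tokens), g, g)
        global_out = global_out.transpose(1, 2).reshape(b, c, h, w)
        # Fuse the local and global branches (elementwise addition assumed).
        return local + global_out


if __name__ == "__main__":
    block = DualTokenBlockSketch()
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])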