TY - JOUR
T1 - A General Framework for Efficient Medical Image Analysis via Shared Attention Vision Transformer
AU - Liu, Yihang
AU - Wen, Ying
AU - Yang, Longzhen
AU - He, Lianghua
AU - Zhou, Mengchu
N1 - Publisher Copyright:
© 1982-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Vision Transformers (ViTs) demonstrate significant promise in medical image analysis but face two critical challenges: 1) their limited ability to capture local features in data-scarce scenarios, leading to data inefficiency, and 2) the high computational and storage demands of full fine-tuning in transfer learning, resulting in parameter inefficiency. To achieve efficient and accurate medical image analysis, we propose the Shared Attention Vision Transformer (SAViT), which comprises three innovative modules: i) Shared Prior Attention (SPA), which enhances data efficiency by employing a visual prompt to sequentially share consistent attention weights across local image regions, thereby enabling the learning of translational invariance to capture locality; ii) MixPool, which preserves global modeling ability by aggregating local features after SPA through a multi-pooling mechanism, thus effectively facilitating long-range dependencies across local image regions; and iii) Low-rank Multi-head Self-Attention (Lr-MSA), which improves parameter efficiency by using low-rank weights in multi-head self-attention, reducing computational complexity while maintaining accuracy in medical image analysis. SAViT demonstrates strong generalization across multiple medical imaging modalities, including retinopathy, dermoscopy, and radiography. Extensive experiments show its high data efficiency and outstanding performance compared with more than 20 medical-specific and ViT-based models when all are trained from scratch. It excels in parameter-efficient tuning, surpassing 17 models across 6 datasets in transfer learning with only 0.17M/0.23M trainable parameters on ViT-B/SwinViT-B backbones, which require 86.60M/88.00M parameters in total. Source code can be found at: https://github.com/LYH-hh/SAViT.
KW - Data Efficiency
KW - Medical Image Analysis
KW - Parameter Efficiency
KW - Prompt Learning
UR - https://www.scopus.com/pages/publications/105025134058
U2 - 10.1109/TMI.2025.3644949
DO - 10.1109/TMI.2025.3644949
M3 - Article
AN - SCOPUS:105025134058
SN - 0278-0062
JO - IEEE Transactions on Medical Imaging
JF - IEEE Transactions on Medical Imaging
ER -