TY - JOUR
T1 - SARCLIP
T2 - The First Vision-Language Foundation Model for SAR Image
AU - Wang, Pengfei
AU - Lu, Zhuhao
AU - Li, Yajun
AU - Ding, Baogang
AU - Zhang, Ding
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Foundation models have achieved remarkable breakthroughs across various domains, driven by the wide use of masked image modeling (MIM) and self-supervised learning (SSL). However, these models lack language comprehension, which limits their ability to generalize to downstream tasks and zero-shot applications. The contrastive language-image pretraining (CLIP) model overcomes these challenges by integrating textual information and performs exceptionally well in remote sensing. However, no such multimodal model exists for the synthetic aperture radar (SAR) domain. To bridge this gap, we propose SARCLIP, a vision-language model specifically designed for SAR, which leverages textual supervision to enhance visual representation and achieves fine-grained vision-language alignment through modality alignment. First, to overcome the lack of SAR image-text datasets, we propose the detection-to-caption (D2C) algorithm, which transforms heterogeneous object detection annotations into diverse captions, ultimately creating the vision-language dataset MMSAR. In addition, to address the distortion and information loss caused by downsampling high-resolution SAR images, we propose AnyReader, a dynamic module capable of processing images of any size through a low-resolution encoder. We validate SARCLIP's generalization and performance on five downstream tasks: image-to-text retrieval, zero-shot classification, few-shot classification, linear probing, and object counting. In particular, we introduce a new counting benchmark, SARCount. Our results demonstrate that SARCLIP consistently outperforms the CLIP baseline across various model sizes. Notably, SARCLIP achieves a 7.07% higher average recall on the retrieval task than the largest CLIP model. In zero-shot classification, SARCLIP surpasses CLIP by 7.8% in average accuracy across three downstream datasets.
AB - Foundation models have achieved remarkable breakthroughs across various domains, driven by the wide use of masked image modeling (MIM) and self-supervised learning (SSL). However, these models lack language comprehension, which limits their ability to generalize to downstream tasks and zero-shot applications. The contrastive language-image pretraining (CLIP) model overcomes these challenges by integrating textual information and performs exceptionally well in remote sensing. However, no such multimodal model exists for the synthetic aperture radar (SAR) domain. To bridge this gap, we propose SARCLIP, a vision-language model specifically designed for SAR, which leverages textual supervision to enhance visual representation and achieves fine-grained vision-language alignment through modality alignment. First, to overcome the lack of SAR image-text datasets, we propose the detection-to-caption (D2C) algorithm, which transforms heterogeneous object detection annotations into diverse captions, ultimately creating the vision-language dataset MMSAR. In addition, to address the distortion and information loss caused by downsampling high-resolution SAR images, we propose AnyReader, a dynamic module capable of processing images of any size through a low-resolution encoder. We validate SARCLIP's generalization and performance on five downstream tasks: image-to-text retrieval, zero-shot classification, few-shot classification, linear probing, and object counting. In particular, we introduce a new counting benchmark, SARCount. Our results demonstrate that SARCLIP consistently outperforms the CLIP baseline across various model sizes. Notably, SARCLIP achieves a 7.07% higher average recall on the retrieval task than the largest CLIP model. In zero-shot classification, SARCLIP surpasses CLIP by 7.8% in average accuracy across three downstream datasets.
KW - Multimodal
KW - SARCLIP
KW - vision-language foundation model
UR - https://www.scopus.com/pages/publications/105020907936
U2 - 10.1109/TGRS.2025.3630131
DO - 10.1109/TGRS.2025.3630131
M3 - Article
AN - SCOPUS:105020907936
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5223211
ER -