SARCLIP: The First Vision-Language Foundation Model for SAR Image

Pengfei Wang*, Zhuhao Lu, Yajun Li, Baogang Ding, Ding Zhang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Foundation models have achieved remarkable breakthroughs across various domains, with the widespread use of masked image modeling (MIM) and self-supervised learning (SSL). However, these models lack language comprehension, which limits their ability to generalize in downstream tasks and zero-shot applications. The contrastive language-image pretraining (CLIP) model overcomes these challenges by integrating textual information and performs exceptionally well in remote sensing. However, no such multimodal model exists for the synthetic aperture radar (SAR) domain. To bridge this gap, we propose SARCLIP, a vision-language model specially designed for SAR, which leverages textual supervision to enhance visual representation and achieves fine-grained vision-language alignment through modality alignment. First, to overcome the lack of SAR image-to-text datasets, we propose the detection-to-caption (D2C) algorithm, which transforms heterogeneous object detection annotations into diverse captions, ultimately creating the vision-language dataset MMSAR. In addition, to address the distortion and information loss caused by downsampling high-resolution SAR images, we propose AnyReader, a dynamic module capable of processing images of any size through a low-resolution encoder. We validate SARCLIP's generalization and performance on five downstream tasks: image-to-text retrieval, zero-shot classification, few-shot classification, linear probing, and object counting. In particular, we introduce a new counting benchmark, SARCount. Our results demonstrate that SARCLIP consistently outperforms the CLIP baseline across various model sizes. Notably, SARCLIP achieves a 7.07% higher average recall on the retrieval task than the largest CLIP model. In zero-shot classification, SARCLIP surpasses CLIP by 7.8% in average accuracy across three downstream datasets.
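As a rough illustration of the detection-to-caption idea described in the abstract (this is not the authors' published D2C algorithm; the annotation format, function name, and templates below are hypothetical), a minimal sketch could count the object classes in a detection label and fill a caption template:

import random
from collections import Counter

# Hypothetical per-image annotation: a list of (class_name, bounding_box) pairs.
# The real MMSAR/D2C format is not described in the abstract; this is illustrative only.
def detections_to_caption(detections, templates=None):
    """Turn object-detection labels into a simple caption by counting classes."""
    if templates is None:
        templates = [
            "A SAR image containing {objects}.",
            "This SAR scene shows {objects}.",
            "A SAR scene in which {objects} can be seen.",
        ]
    counts = Counter(cls for cls, _ in detections)
    parts = [f"{n} {cls}{'s' if n > 1 else ''}" for cls, n in counts.items()]
    objects = ", ".join(parts) if parts else "no annotated objects"
    return random.choice(templates).format(objects=objects)

if __name__ == "__main__":
    sample = [("ship", (10, 20, 40, 60)),
              ("ship", (100, 120, 130, 160)),
              ("bridge", (200, 50, 400, 80))]
    print(detections_to_caption(sample))  # e.g. "A SAR image containing 2 ships, 1 bridge."

Sampling from several templates, as sketched here, is one way detection annotations could be turned into the "diverse captions" the abstract mentions for contrastive image-text pretraining.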

Original language: English
Article number: 5223211
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
State: Published - 2025

Keywords

  • Multimodal
  • SARCLIP
  • vision-language foundation model
