SARCLIP: The First Vision-Language Foundation Model for SAR Image

  • East China Normal University

Research output: Contribution to journal › Article › peer-review

Abstract

Foundation models have achieved remarkable breakthroughs across various domains, driven by the wide use of masked image modeling (MIM) and self-supervised learning (SSL). However, these models lack language comprehension, which limits their ability to generalize to downstream tasks and zero-shot applications. The contrastive language-image pretraining (CLIP) model overcomes these challenges by integrating textual information, and it performs exceptionally well in remote sensing. Yet no such multimodal model exists for the synthetic aperture radar (SAR) domain. To bridge this gap, we propose SARCLIP, a vision-language model specifically designed for SAR, which leverages textual supervision to enhance visual representation and achieves fine-grained vision-language alignment through modality alignment. First, to overcome the lack of SAR image-to-text datasets, we propose the detection-to-caption (D2C) algorithm, which transforms heterogeneous object detection annotations into diverse captions, ultimately creating the vision-language dataset MMSAR. In addition, to address the distortion and information loss caused by downsampling high-resolution SAR images, we propose AnyReader, a dynamic module capable of processing images of any size through a low-resolution encoder. We validate SARCLIP's generalization and performance on five downstream tasks: image-to-text retrieval, zero-shot classification, few-shot classification, linear probing, and object counting. In particular, we introduce a new counting benchmark, SARCount. Our results demonstrate that SARCLIP consistently outperforms the CLIP baseline across various model sizes. Notably, SARCLIP achieves a 7.07% higher average recall on the retrieval task than the largest CLIP model. In zero-shot classification, SARCLIP surpasses CLIP by 7.8% in average accuracy across three downstream datasets.
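
To make the detection-to-caption idea concrete, the following is a minimal illustrative sketch of turning object detection labels into a template caption. It is not the authors' D2C algorithm, whose details the abstract does not give; the function name `detections_to_caption`, the caption template, and the naive pluralization are all assumptions made for illustration.

```python
from collections import Counter


def detections_to_caption(labels, sensor="SAR"):
    """Build one template caption from a list of detected class labels.

    Hypothetical helper for illustration only; the real D2C algorithm
    may use different templates, attributes, and caption diversity.
    """
    counts = Counter(labels)
    if not counts:
        return f"a {sensor} image with no annotated objects"

    # One "<count> <class>" phrase per class, in alphabetical order.
    parts = []
    for name, n in sorted(counts.items()):
        noun = name if n == 1 else name + "s"  # naive pluralization
        parts.append(f"{n} {noun}")

    # Join phrases as "a, b and c".
    if len(parts) == 1:
        listing = parts[0]
    else:
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"a {sensor} image containing {listing}"


caption = detections_to_caption(["ship", "ship", "tank"])
```

A caption built this way also carries the per-class object counts, which is the kind of information an object counting task relies on.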

Original language: English
Article number: 5223211
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
DOI
Publication status: Published - 2025
