VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding

  • Jiabo Ye*
  • Junfeng Tian
  • Xiaoshan Yang
  • Zhenru Zhang
  • Anwen Hu
  • Ming Yan
  • Ji Zhang
  • Liang He
  • Xin Lin
  • *Corresponding author for this work
  • East China Normal University
  • Nyonic.ai
  • CASIA
  • Alibaba Group Holding Ltd.

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Visual grounding focuses on localizing objects referred to by natural language queries. Existing fully and weakly supervised methods rely on large numbers of language queries for training. However, having annotators write natural language queries for specific objects is expensive. To reduce the reliance on human-written queries, we propose a novel unsupervised visual grounding framework named VG-Annotator. Unlike existing unsupervised methods, which rely on manually designed rules to link objects and language queries, VG-Annotator builds on the idea that vision-language pre-trained (VLP) generation models can serve as language query annotators. Because these models implicitly learn strong multi-modal understanding from large-scale pre-training, we prompt them to explicitly generate appropriate natural language descriptions of specific objects. To this end, we explore a series of multi-modal instructions that indicate which object should be described. We also introduce a supervised fine-tuning process that teaches the vision-language models to follow these instructions. Extensive experiments show that the proposed method produces high-quality language queries. A visual grounding model trained on the generated queries outperforms state-of-the-art unsupervised methods on five widely used datasets.

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350390155
DOI
Publication status: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 to 19 Jul 2024

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 to 19/07/24
