VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding

Jiabo Ye, Junfeng Tian, Xiaoshan Yang, Zhenru Zhang, Anwen Hu, Ming Yan, Ji Zhang, Liang He, Xin Lin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Visual grounding focuses on localizing objects referred to by natural language queries. Existing fully and weakly supervised methods rely on large quantities of language queries for training. However, having human annotators write natural language queries for specific objects is expensive. To reduce the reliance on human-written queries, we propose a novel unsupervised visual grounding framework named VG-Annotator. Unlike existing unsupervised methods that rely on manually designed rules to link objects and language queries, the key idea of VG-Annotator is that vision-language pre-trained (VLP) generation models can serve as language query annotators. Thanks to the powerful multi-modal understanding ability implicitly learned from large-scale pre-training, we prompt these models to explicitly generate appropriate natural-language descriptions for specific objects. To this end, we explore a series of multi-modal instructions that indicate which object should be described. We also introduce a supervised fine-tuning process to teach the vision-language models to follow these instructions. Extensive experiments show that the proposed method obtains high-quality language queries. A visual grounding model trained with the generated queries outperforms state-of-the-art unsupervised methods on five widely used datasets.

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350390155
DOIs
State: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 – 19 Jul 2024

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 – 19/07/24

Keywords

  • Instruction Tuning
  • Unsupervised Learning
  • Visual Grounding
