One-Stage Visual Grounding via Semantic-Aware Feature Filter

Jiabo Ye, Xin Lin, Liang He, Dingbang Li, Qin Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

33 Scopus citations

Abstract

Visual grounding has attracted much attention with the popularity of vision-language research. Existing one-stage methods are far ahead of two-stage methods in speed. However, these methods fuse the textual feature and the visual feature map by simple concatenation, which ignores the textual semantics and limits the models' ability in cross-modal understanding. To overcome this weakness, we propose a semantic-aware framework that uses both the queries' structured knowledge and context-sensitive representations to filter the visual feature maps and localize the referents more accurately. Our framework contains an entity filter, an attribute filter, and a location filter. These three filters process the input visual feature map step by step, each according to the corresponding aspect of the query. A grounding module then regresses the bounding boxes to localize the referred object. Experiments on various commonly used datasets show that our framework achieves real-time inference speed and outperforms all state-of-the-art methods.
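The contrast the abstract draws between concatenation-based fusion and step-by-step filtering can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sigmoid similarity gate, the function names (`concat_fusion`, `apply_filter`, `semantic_filtering`), and the idea that each query aspect is represented by a single vector are all assumptions made for illustration only.

```python
import numpy as np


def concat_fusion(visual, text):
    """Baseline fusion named in the abstract: tile the text vector over
    the spatial grid and concatenate it to the visual feature map."""
    h, w, _ = visual.shape
    tiled = np.broadcast_to(text, (h, w, text.shape[0]))
    return np.concatenate([visual, tiled], axis=-1)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def apply_filter(visual, aspect_vec):
    """Hypothetical filter (an assumption, not the paper's design):
    gate every spatial location by its scaled dot-product similarity
    to one query-aspect vector."""
    scores = visual @ aspect_vec / np.sqrt(visual.shape[-1])  # (h, w)
    gate = sigmoid(scores)[..., None]                         # (h, w, 1)
    return visual * gate


def semantic_filtering(visual, entity_vec, attribute_vec, location_vec):
    """Apply the entity, attribute, and location filters in sequence,
    mirroring the step-by-step order described in the abstract."""
    out = visual
    for aspect_vec in (entity_vec, attribute_vec, location_vec):
        out = apply_filter(out, aspect_vec)
    return out


# Toy inputs: a 4x4 feature map with 8 channels and random aspect vectors.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 4, 8))
entity_vec, attribute_vec, location_vec = rng.normal(size=(3, 8))

fused = concat_fusion(visual, entity_vec)
filtered = semantic_filtering(visual, entity_vec, attribute_vec, location_vec)
```

In this sketch, concatenation only widens the channel dimension and leaves the visual features untouched, whereas each filter rescales the features at every location according to one aspect of the query, so later stages operate on an already filtered map.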

Original language: English
Title of host publication: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 1702-1711
Number of pages: 10
ISBN (Electronic): 9781450386517
DOIs
State: Published - 17 Oct 2021
Event: 29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 20 Oct 2021 to 24 Oct 2021

Publication series

Name: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia, MM 2021
Country/Territory: China
City: Virtual, Online
Period: 20/10/21 to 24/10/21

Keywords

  • referring expressions
  • scene graph
  • visual grounding
