TY - GEN
T1 - One-Stage Visual Grounding via Semantic-Aware Feature Filter
AU - Ye, Jiabo
AU - Lin, Xin
AU - He, Liang
AU - Li, Dingbang
AU - Chen, Qin
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/10/17
Y1 - 2021/10/17
N2 - Visual grounding has attracted much attention with the popularity of vision-language research. Existing one-stage methods are far ahead of two-stage methods in speed. However, these methods fuse the textual feature and the visual feature map by simple concatenation, which ignores the textual semantics and limits the models' cross-modal understanding. To overcome this weakness, we propose a semantic-aware framework that utilizes both queries' structured knowledge and context-sensitive representations to filter the visual feature maps and localize the referents more accurately. Our framework contains an entity filter, an attribute filter, and a location filter. These three filters process the input visual feature map step by step, each according to a different aspect of the query. A grounding module further regresses the bounding boxes to localize the referential object. Experiments on several commonly used datasets show that our framework achieves real-time inference speed and outperforms all state-of-the-art methods.
AB - Visual grounding has attracted much attention with the popularity of vision-language research. Existing one-stage methods are far ahead of two-stage methods in speed. However, these methods fuse the textual feature and the visual feature map by simple concatenation, which ignores the textual semantics and limits the models' cross-modal understanding. To overcome this weakness, we propose a semantic-aware framework that utilizes both queries' structured knowledge and context-sensitive representations to filter the visual feature maps and localize the referents more accurately. Our framework contains an entity filter, an attribute filter, and a location filter. These three filters process the input visual feature map step by step, each according to a different aspect of the query. A grounding module further regresses the bounding boxes to localize the referential object. Experiments on several commonly used datasets show that our framework achieves real-time inference speed and outperforms all state-of-the-art methods.
KW - referring expressions
KW - scene graph
KW - visual grounding
UR - https://www.scopus.com/pages/publications/85119363464
U2 - 10.1145/3474085.3475313
DO - 10.1145/3474085.3475313
M3 - Conference contribution
AN - SCOPUS:85119363464
T3 - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
SP - 1702
EP - 1711
BT - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 29th ACM International Conference on Multimedia, MM 2021
Y2 - 20 October 2021 through 24 October 2021
ER -