TY - JOUR
T1 - UniQRNet
T2 - Unifying Referring Expression Grounding and Segmentation with QRNet
AU - Ye, Jiabo
AU - Tian, Junfeng
AU - Yan, Ming
AU - Xu, Haiyang
AU - Ye, Qinghao
AU - Shi, Yaya
AU - Yang, Xiaoshan
AU - Wang, Xuwu
AU - Zhang, Ji
AU - He, Liang
AU - Lin, Xin
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/6/13
Y1 - 2024/6/13
N2 - Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language. This has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones to extract features independently, without considering the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which leverages language guidance during visual feature extraction. However, QRNet focuses only on the grounding task, which provides coarse-grained annotations in the form of bounding box coordinates. The guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet to learn text-consistent visual features. In addition, UniQRNet includes a loss balancing strategy that allows the two types of supervision signals to cooperate and optimize the model together. We conduct the most comprehensive comparison experiments to date, covering four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8%∼5.09%) and segmentation (0.57%∼5.56%). Ablation and analysis studies reveal that UniQRNet improves the consistency of visual features with the text input and brings significant performance improvements.
AB - Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language. This has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones to extract features independently, without considering the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which leverages language guidance during visual feature extraction. However, QRNet focuses only on the grounding task, which provides coarse-grained annotations in the form of bounding box coordinates. The guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet to learn text-consistent visual features. In addition, UniQRNet includes a loss balancing strategy that allows the two types of supervision signals to cooperate and optimize the model together. We conduct the most comprehensive comparison experiments to date, covering four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8%∼5.09%) and segmentation (0.57%∼5.56%). Ablation and analysis studies reveal that UniQRNet improves the consistency of visual features with the text input and brings significant performance improvements.
KW - Referring expression comprehension
KW - multi-modal understanding
KW - multi-task learning
UR - https://www.scopus.com/pages/publications/85202753935
U2 - 10.1145/3660638
DO - 10.1145/3660638
M3 - Article
AN - SCOPUS:85202753935
SN - 1551-6857
VL - 20
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 8
M1 - 246
ER -