TY - JOUR
T1 - Enhancing open-vocabulary object detection through region-word and region-vision matching
AU - Chen, Yi
AU - Wang, Chong
AU - Li, Zhehao
AU - Lin, Sunqi
AU - Xiang, Jinhui
AU - Li, Yuqi
AU - Qian, Jiangbo
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
PY - 2025/6
Y1 - 2025/6
AB - Open-vocabulary object detection (OVOD) aims to detect novel object categories beyond the training set. Existing OVOD methods have made encouraging progress by leveraging large-scale image-caption pairs and pre-trained vision-language models (VLMs). However, two main limitations remain: (1) The category-specific concepts potentially present in global captions are not fully utilized, resulting in a lack of fine-grained semantic guidance for the detector. (2) The compositional structure of multiple concepts that naturally exists in image-caption pairs, as represented by VLMs, remains insufficiently explored, limiting the model’s ability to generalize to novel category concepts. To address these limitations, we propose a novel framework called Region-Word-Vision Matching (RWVM) that integrates two core modules: a Region-Word Matching (RWM) module and a Region-Vision Matching (RVM) module. Our key insight is to simultaneously guide textual and visual knowledge alignment with region features to strengthen the model’s understanding of complex visual scenes. Specifically, the RWM module guides fine-grained semantic aggregation by fusing local region-word matching with global image-caption matching. The RVM module leverages VLMs to capture the compositional structure of single and multiple object concepts, directly enhancing the detector’s ability to learn novel category concepts. Additionally, we demonstrate that the RVM module, using only simplified region embeddings, outperforms embeddings extracted from full language models. Extensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the average precision (AP) for novel categories on the COCO and LVIS datasets.
KW - Multi-modal training
KW - Open-vocabulary object detection
KW - Region-word matching
KW - Vision-language models
UR - https://www.scopus.com/pages/publications/105004465226
U2 - 10.1007/s00530-025-01806-5
DO - 10.1007/s00530-025-01806-5
M3 - Article
AN - SCOPUS:105004465226
SN - 0942-4962
VL - 31
JO - Multimedia Systems
JF - Multimedia Systems
IS - 3
M1 - 232
ER -