Towards Universal Perception through Language-Guided Open-World Object Detection

  • Zihan Wang
  • Yunhang Shen
  • Yuan Fang
  • Zuwei Long
  • Ke Li
  • Xing Sun
  • Jiao Xie*
  • Shaohui Lin*
  • *Corresponding author for this work
  • East China Normal University
  • Tencent
  • The 27th Research Institute of CETC

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Open-vocabulary object detection seeks to recognize objects from arbitrary language inputs, extending detection beyond fixed training categories. While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories during the inference stage, hindering practical deployment in open-world scenarios. To overcome this crucial limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. It not only excels at open-vocabulary object detection but is also capable of generating labels for target objects in the absence of predefined vocabularies, and can be adapted to a broad range of vision-language tasks simply by modifying the language instructions. UniPerception seamlessly integrates three key innovations: 1) a robust visual detector trained on diverse data sources to capture rich and generalizable visual representations; 2) a language model with interleaved cross-modality fusion layers to interpret instructions and generate fine-grained responses conditioned on visual features; and 3) a tailored multi-stage training strategy that effectively bridges detection-specific learning with general vision-language understanding. We conduct extensive experiments on multiple benchmarks for open-vocabulary object detection (COCO, LVIS, ODinW), referring expression comprehension (RefCOCO/+/g, D3), and vision-language understanding (Flickr30k, VQAv2, GQA). The results show that UniPerception achieves strong open-world generalization and multi-modal understanding, outperforming the existing state-of-the-art methods and establishing itself as a unified, instruction-driven perception system.

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 1190-1199
Number of pages: 10
ISBN (electronic): 9798400720352
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25
