Towards Universal Perception through Language-Guided Open-World Object Detection

  • Zihan Wang
  • Yunhang Shen
  • Yuan Fang
  • Zuwei Long
  • Ke Li
  • Xing Sun
  • Jiao Xie*
  • Shaohui Lin*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Open-vocabulary object detection seeks to recognize objects from arbitrary language inputs, extending detection beyond fixed training categories. While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories during the inference stage, hindering practical deployment in open-world scenarios. To overcome this crucial limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. It not only excels at open-vocabulary object detection but is also capable of generating labels for target objects in the absence of predefined vocabularies, and it can be adapted to a broad range of vision-language tasks simply by modifying the language instructions. UniPerception seamlessly integrates three key innovations: 1) a robust visual detector trained on diverse data sources to capture rich and generalizable visual representations; 2) a language model with interleaved cross-modality fusion layers to interpret instructions and generate fine-grained responses conditioned on visual features; and 3) a tailored multi-stage training strategy that effectively bridges detection-specific learning with general vision-language understanding. We conduct extensive experiments on multiple benchmarks for open-vocabulary object detection (COCO, LVIS, ODinW), referring expression comprehension (RefCOCO/+/g, D3), and vision-language understanding (Flickr30k, VQAv2, GQA). The results show that UniPerception achieves strong open-world generalization and multi-modal understanding, outperforming existing state-of-the-art methods and establishing itself as a unified, instruction-driven perception system.
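The abstract does not give implementation details of the interleaved cross-modality fusion layers (innovation 2). As a rough, non-authoritative illustration of the general idea — language-model tokens conditioned on detector visual features via cross-attention with a residual connection — here is a minimal single-head NumPy sketch. All names, dimensions, and the single-head simplification are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_fusion(text_tokens, visual_feats, Wq, Wk, Wv):
    """Hypothetical fusion step: text tokens (queries) attend to
    visual region features (keys/values), then a residual add.
    A real system would interleave such layers between LM blocks."""
    Q = text_tokens @ Wq                      # (T, d)
    K = visual_feats @ Wk                     # (R, d)
    V = visual_feats @ Wv                     # (R, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (T, R)
    return text_tokens + attn @ V             # residual connection

# Toy dimensions: 4 text tokens, 6 visual regions, model dim 8.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(4, d))
visual = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = cross_modality_fusion(text, visual, Wq, Wk, Wv)
print(fused.shape)  # (4, 8)
```

The residual connection lets the layer fall back to pure language modeling when the visual evidence is uninformative, which is one common design rationale for interleaving fusion layers rather than concatenating modalities once at the input.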

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 1190-1199
Number of pages: 10
ISBN (Electronic): 9798400720352
DOIs
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 – 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 – 31/10/25

Keywords

  • multi-modal understanding
  • open-world object detection
