TY - GEN
T1 - Towards Universal Perception through Language-Guided Open-World Object Detection
AU - Wang, Zihan
AU - Shen, Yunhang
AU - Fang, Yuan
AU - Long, Zuwei
AU - Li, Ke
AU - Sun, Xing
AU - Xie, Jiao
AU - Lin, Shaohui
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Open-vocabulary object detection seeks to recognize objects from arbitrary language inputs, extending detection beyond fixed training categories. While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories during the inference stage, hindering practical deployment in open-world scenarios. To overcome this crucial limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. It not only excels at open-vocabulary object detection but is also capable of generating labels for target objects in the absence of predefined vocabularies, and can be adapted to a broad range of vision-language tasks simply by modifying the language instructions. UniPerception seamlessly integrates three key innovations: 1) a robust visual detector trained on diverse data sources to capture rich and generalizable visual representations; 2) a language model with interleaved cross-modality fusion layers to interpret instructions and generate fine-grained responses conditioned on visual features; and 3) a tailored multi-stage training strategy that effectively bridges detection-specific learning with general vision-language understanding. We conduct extensive experiments on multiple benchmarks for open-vocabulary object detection (COCO, LVIS, ODinW), referring expression comprehension (RefCOCO/+/g, D3), and vision-language understanding (Flickr30k, VQAv2, GQA). The results show that UniPerception achieves strong open-world generalization and multi-modal understanding, outperforming the existing state-of-the-art methods and establishing itself as a unified, instruction-driven perception system.
KW - multi-modal understanding
KW - open-world object detection
UR - https://www.scopus.com/pages/publications/105024069103
U2 - 10.1145/3746027.3755017
DO - 10.1145/3746027.3755017
M3 - Conference contribution
AN - SCOPUS:105024069103
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 1190
EP - 1199
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM International Conference on Multimedia, MM 2025
Y2 - 27 October 2025 through 31 October 2025
ER -