Aligning and Prompting Everything All at Once for Universal Visual Perception

  • Yunhang Shen
  • Chaoyou Fu
  • Peixian Chen
  • Mengdan Zhang
  • Ke Li
  • Xing Sun
  • Yunsheng Wu
  • Shaohui Lin*
  • Rongrong Ji

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

24 Scopus citations

Abstract

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one suite of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 13193-13203
Number of pages: 11
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353006
DOIs
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 - 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 - 22/06/24

