TY - GEN
T1 - Aligning and Prompting Everything All at Once for Universal Visual Perception
AU - Shen, Yunhang
AU - Fu, Chaoyou
AU - Chen, Peixian
AU - Zhang, Mengdan
AU - Li, Ke
AU - Sun, Xing
AU - Wu, Yunsheng
AU - Lin, Shaohui
AU - Ji, Rongrong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one-suit of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.
AB - Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one-suit of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.
UR - https://www.scopus.com/pages/publications/85194204871
U2 - 10.1109/CVPR52733.2024.01253
DO - 10.1109/CVPR52733.2024.01253
M3 - Conference contribution
AN - SCOPUS:85194204871
SN - 9798350353006
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 13193
EP - 13203
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -