TY - GEN
T1 - Leveraging Panoptic Prior for 3D Zero-Shot Semantic Understanding Within Language Embedded Radiance Fields
AU - Ji, Yuzhou
AU - Tan, Xin
AU - Zhu, He
AU - Liu, Wuyi
AU - Xu, Jiachen
AU - Xie, Yuan
AU - Ma, Lizhuang
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - Language Embedded Radiance Fields (LERF) achieves promising results in real-time dense relevancy maps within NeRF 3D scenes. Although LERF shows impressive zero-shot ability on many long-tail open-vocabulary queries, the quality of its relevancy maps can degrade at certain camera angles, especially novel views, and may even fail to localize. In this work, we propose a method that brings in prior knowledge to guide the construction of a multi-scale CLIP (Contrastive Language-Image Pretraining) feature pyramid, achieving better localization ability and 3D consistency without harming the original zero-shot capability. Specifically, we use panoptic segmentation to preprocess training images and reconstruct a multi-scale image pyramid from segmented tiles. Unlike some other works, we use only the continuous semantic meaning of image tiles for accurate CLIP features, rather than labels or IDs, which are inconsistent across views. The tiles are partially overridden based on location and scale, while a large amount of non-prior knowledge is also preserved. To compare our results with LERF effectively, we design a metric based on pixel relevancy, which can further support future research built on the LERF representation. Additionally, we explore the possibility of grounding dense, 3D-consistent segmentation information within LERF during our experiments, providing an inspiring train of thought on distilling 2D knowledge into 3D scenes for 3D manipulation.
AB - Language Embedded Radiance Fields (LERF) achieves promising results in real-time dense relevancy maps within NeRF 3D scenes. Although LERF shows impressive zero-shot ability on many long-tail open-vocabulary queries, the quality of its relevancy maps can degrade at certain camera angles, especially novel views, and may even fail to localize. In this work, we propose a method that brings in prior knowledge to guide the construction of a multi-scale CLIP (Contrastive Language-Image Pretraining) feature pyramid, achieving better localization ability and 3D consistency without harming the original zero-shot capability. Specifically, we use panoptic segmentation to preprocess training images and reconstruct a multi-scale image pyramid from segmented tiles. Unlike some other works, we use only the continuous semantic meaning of image tiles for accurate CLIP features, rather than labels or IDs, which are inconsistent across views. The tiles are partially overridden based on location and scale, while a large amount of non-prior knowledge is also preserved. To compare our results with LERF effectively, we design a metric based on pixel relevancy, which can further support future research built on the LERF representation. Additionally, we explore the possibility of grounding dense, 3D-consistent segmentation information within LERF during our experiments, providing an inspiring train of thought on distilling 2D knowledge into 3D scenes for 3D manipulation.
KW - CLIP feature
KW - Neural Radiance Fields
KW - cross-modal distillation
KW - semantic 3D scene
KW - zero-shot learning
UR - https://www.scopus.com/pages/publications/85190389273
U2 - 10.1007/978-981-97-2095-8_3
DO - 10.1007/978-981-97-2095-8_3
M3 - Conference contribution
AN - SCOPUS:85190389273
SN - 9789819720941
T3 - Lecture Notes in Computer Science
SP - 42
EP - 58
BT - Computational Visual Media - 12th International Conference, CVM 2024, Proceedings
A2 - Zhang, Fang-Lue
A2 - Sharf, Andrei
PB - Springer Science and Business Media Deutschland GmbH
T2 - 12th International Conference on Computational Visual Media, CVM 2024
Y2 - 10 April 2024 through 12 April 2024
ER -