Leveraging Panoptic Prior for 3D Zero-Shot Semantic Understanding Within Language Embedded Radiance Fields

Yuzhou Ji, Xin Tan*, He Zhu, Wuyi Liu, Jiachen Xu, Yuan Xie, Lizhuang Ma

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Language Embedded Radiance Fields (LERF) achieves promising results in producing real-time dense relevancy maps within NeRF 3D scenes. Although LERF shows impressive zero-shot ability on many long-tail open-vocabulary queries, the quality of its relevancy maps can degrade at certain camera angles, especially novel views, and localization may fail altogether. In this work, we propose a method that introduces prior knowledge to guide the construction of a multi-scale CLIP (Contrastive Language-Image Pretraining) feature pyramid, achieving better localization and 3D consistency without harming the original zero-shot capability. Specifically, we use panoptic segmentation to preprocess the training images and reconstruct the multi-scale image pyramid with segmented tiles. Unlike some other works, we use only the continuous semantic meaning of image tiles to obtain accurate CLIP features, rather than labels or IDs, which are inconsistent across views. The tiles are partially overridden based on location and scale, which also preserves a large amount of non-prior knowledge. To compare our results effectively with LERF, we design a metric based on pixel relevancy, which could further support future research built on the LERF representation. Additionally, we explore the possibility of grounding dense, 3D-consistent segmentation information within LERF during experiments, offering an inspiring line of thought on distilling 2D knowledge into 3D scenes for 3D manipulation.
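
The partial tile override described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual override rule, so everything here is an assumption made for illustration: the function name `build_tile_pyramid`, the non-overlapping tiling, the dominant-segment rule, and the `min_coverage` threshold are all hypothetical. The sketch takes a per-pixel panoptic segment map and, at each tile size, marks a tile as overridden when a single segment covers most of it (keeping only that segment's pixels); otherwise the tile is left untouched, which stands in for the preserved non-prior knowledge. Segment IDs are used only inside a single image to build the masks, not as cross-view labels.

```python
from collections import Counter


def build_tile_pyramid(seg_ids, tile_sizes, min_coverage=0.6):
    """Hypothetical sketch of a segmentation-guided multi-scale tile pyramid.

    seg_ids      : 2D list of ints, one panoptic segment id per pixel (assumed input).
    tile_sizes   : iterable of square tile edge lengths, one per pyramid scale.
    min_coverage : assumed threshold; a tile is overridden only if one segment
                   covers at least this fraction of its pixels.
    Returns a dict mapping tile size -> list of tile records.
    """
    height, width = len(seg_ids), len(seg_ids[0])
    pyramid = {}
    for size in tile_sizes:
        tiles = []
        # Non-overlapping tiling is a simplification chosen for this sketch.
        for top in range(0, height - size + 1, size):
            for left in range(0, width - size + 1, size):
                window = [seg_ids[r][left:left + size]
                          for r in range(top, top + size)]
                counts = Counter(v for row in window for v in row)
                dominant, n_pixels = counts.most_common(1)[0]
                coverage = n_pixels / (size * size)
                if coverage >= min_coverage:
                    # Segmented tile: keep only pixels of the dominant segment.
                    keep = [[v == dominant for v in row] for row in window]
                    overridden = True
                else:
                    # No clear segment: keep the original tile unchanged.
                    keep = [[True] * size for _ in range(size)]
                    overridden = False
                tiles.append({"top": top, "left": left,
                              "overridden": overridden, "keep": keep})
        pyramid[size] = tiles
    return pyramid
```

In a full pipeline, each tile (masked by `keep` where overridden) would then be passed to a CLIP image encoder to supply the feature supervision at that scale; that step is omitted here because it depends on model and library choices the abstract does not state.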

Original language: English
Title of host publication: Computational Visual Media - 12th International Conference, CVM 2024, Proceedings
Editors: Fang-Lue Zhang, Andrei Sharf
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 42-58
Number of pages: 17
ISBN (Print): 9789819720941
State: Published - 2024
Event: 12th International Conference on Computational Visual Media, CVM 2024 - Wellington, New Zealand
Duration: 10 Apr 2024 to 12 Apr 2024

Publication series

Name: Lecture Notes in Computer Science
Volume: 14592 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th International Conference on Computational Visual Media, CVM 2024
Country/Territory: New Zealand
City: Wellington
Period: 10/04/24 to 12/04/24

Keywords

  • CLIP feature
  • Neural Radiance Fields
  • cross-modal distillation
  • semantic 3D scene
  • zero-shot learning
