CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation

Yao Wu, Mingwei Xing, Yachao Zhang, Yuan Xie, Yanyun Qu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multi-modal Unsupervised Domain Adaptation (MM-UDA) for large-scale 3D semantic segmentation adapts 2D and 3D models to an unlabeled target domain, which significantly reduces labor-intensive annotation. Existing MM-UDA methods often attempt to mitigate the domain discrepancy by aligning features between the source and target data. However, such alignment falls short for image perception, because images are far more susceptible to environmental changes than point clouds. To mitigate this limitation, in this work we explore the potential of an off-the-shelf Contrastive Language-Image Pre-training (CLIP) model with rich yet heterogeneous knowledge. To make CLIP task-specific, we propose a top-performing method, dubbed CLIP2UDA, which makes a frozen CLIP reward unsupervised domain adaptation in 3D semantic segmentation. Specifically, CLIP2UDA alternates between two steps during adaptation: (a) learning task-specific prompts, where 2D feature responses from the visual encoder initiate the learning of an adaptive text prompt for each domain, and (b) learning multi-modal domain-invariant representations, which interact hierarchically in the shared decoder to obtain unified 2D visual predictions. This enables effective alignment between the modality-specific 3D feature space and the unified feature space via cross-modal mutual learning. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several widely recognized adaptation scenarios. Code is available at: https://github.com/Barcaaaa/CLIP2UDA.
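
The abstract describes an alternating optimization. The following is a minimal, hypothetical PyTorch-style sketch of one adaptation step, purely to illustrate the structure: the names `clip_visual` (frozen CLIP image encoder), `prompts` (learnable per-domain text prompts), `decoder_2d` (shared decoder), and `net_3d` (modality-specific 3D network) are assumptions, and the cross-entropy and symmetric-KL losses are stand-ins for the paper's actual objectives; see the linked repository for the real implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only. All module and tensor names are assumptions,
# not the authors' API; consult https://github.com/Barcaaaa/CLIP2UDA for
# the actual CLIP2UDA implementation.

def adaptation_step(batch_src, batch_tgt, clip_visual, prompts,
                    decoder_2d, net_3d, opt_prompt, opt_model):
    # Step (a): learn task-specific prompts. The CLIP visual encoder
    # stays frozen; only the per-domain text prompts are updated,
    # driven by 2D feature responses on labeled source images.
    with torch.no_grad():
        feat_src = clip_visual(batch_src["img"])  # frozen 2D features
        feat_tgt = clip_visual(batch_tgt["img"])
    logits_src = decoder_2d(feat_src, prompts["source"])  # (P, C) per-point logits
    loss_prompt = F.cross_entropy(logits_src, batch_src["label"])
    opt_prompt.zero_grad()
    loss_prompt.backward()
    opt_prompt.step()

    # Step (b): learn multi-modal domain-invariant representations.
    # The shared decoder yields unified 2D predictions on the target
    # domain; the modality-specific 3D network is aligned with them via
    # cross-modal mutual learning (a symmetric KL term with detached
    # targets is assumed here as the mutual-learning loss).
    logits_2d = decoder_2d(feat_tgt, prompts["target"])  # (P, C)
    logits_3d = net_3d(batch_tgt["points"])              # (P, C)
    log_p2d = F.log_softmax(logits_2d, dim=-1)
    log_p3d = F.log_softmax(logits_3d, dim=-1)
    loss_mutual = (F.kl_div(log_p2d, log_p3d.exp().detach(), reduction="batchmean")
                   + F.kl_div(log_p3d, log_p2d.exp().detach(), reduction="batchmean"))
    opt_model.zero_grad()
    loss_mutual.backward()
    opt_model.step()
```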

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 8662-8671
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
State: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • 3D semantic segmentation
  • multi-modal learning
  • unsupervised domain adaptation
  • vision-language models
