Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

54 Scopus citations

Abstract

Large-scale Vision-Language Pre-training (VLP) models such as CLIP have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions provide richer semantic supervision without requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) that takes full advantage of CLIP's visual-text representation ability. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that support subsequent unsupervised training. These descriptions complement the rigid pseudo-labels and provide an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, in which we first find the correlated cross-modal prototypes with the Hungarian matching method and then swap the prototype pairs in the memory. Typical contrastive learning can then associate the cross-modal information without any change. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we reach 54.0% Rank-1 accuracy, an improvement of over 9% against state-of-the-art approaches. Code is available at https://github.com/CzAngus/CCLNet.
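
The matching-then-swapping step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a toy illustration under stated assumptions: each modality has the same number of cluster prototypes, similarity is cosine similarity, and the swap simply exchanges matched prototypes between the two memories. The function names (`match_prototypes`, `swap_memories`, `info_nce`) and the temperature value are hypothetical. To stay dependency-free, the optimal one-to-one assignment is found by exhaustive search over permutations, which yields the same optimum the Hungarian method would reach in polynomial time but is only practical for tiny inputs.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_prototypes(vis_protos, ir_protos):
    """Return match[i] = index of the infrared prototype paired with visible i.

    Exhaustive search over permutations: optimal but factorial-time, so only
    suitable for toy sizes. It stands in for the Hungarian method here.
    """
    n = len(vis_protos)
    assert n == len(ir_protos), "this sketch assumes equal cluster counts"
    sim = [[cosine(v, r) for r in ir_protos] for v in vis_protos]
    best = max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return list(best)


def swap_memories(vis_memory, ir_memory, match):
    """Exchange matched prototypes between the two modality memories (assumed swap rule)."""
    new_vis = [ir_memory[match[i]] for i in range(len(vis_memory))]
    new_ir = list(ir_memory)
    for i, j in enumerate(match):
        new_ir[j] = vis_memory[i]
    return new_vis, new_ir


def info_nce(feature, memory, label, temperature=0.05):
    """Standard InfoNCE loss of one feature against a prototype memory."""
    logits = [cosine(feature, p) / temperature for p in memory]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


# Toy data: 3 clusters per modality; infrared clusters are stored in a different order.
vis_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ir_memory = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]]

match = match_prototypes(vis_memory, ir_memory)
swapped_vis, swapped_ir = swap_memories(vis_memory, ir_memory, match)

# A visible feature with pseudo-label 0 is now contrasted against infrared
# prototypes, using the unchanged contrastive loss.
loss = info_nce([0.95, 0.05, 0.0], swapped_vis, label=0)
```

In this toy setup, the visible feature keeps its own pseudo-label, but the memory slot for that label now holds the matched infrared prototype, so the unmodified contrastive loss pulls the feature toward the other modality.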

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 3667-3675
Number of pages: 9
ISBN (Electronic): 9798400701085
DOIs
State: Published - 27 Oct 2023
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 → 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 → 3/11/23

Keywords

  • clip
  • multi-modal data
  • unsupervised learning
  • visible-infrared person re-identification
