TY - GEN
T1 - Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification
AU - Chen, Zhong
AU - Zhang, Zhizhong
AU - Tan, Xin
AU - Qu, Yanyun
AU - Xie, Yuan
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/27
Y1 - 2023/10/27
N2 - Large-scale Vision-Language Pre-training (VLP) models, e.g., CLIP, have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions offer richer semantic supervision without requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) that takes full advantage of CLIP's visual-text representation ability. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that enable subsequent unsupervised training. These descriptions complement the rigid pseudo-labels and provide an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, in which we first find correlated cross-modal prototypes via Hungarian matching and then swap the prototype pairs in the memory. Standard contrastive learning, without any modification, can thus easily associate cross-modal information. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we reach 54.0% Rank-1 accuracy, an improvement of over 9% against state-of-the-art approaches. Code is available at https://github.com/CzAngus/CCLNet.
KW - clip
KW - multi-modal data
KW - unsupervised learning
KW - visible-infrared person re-identification
UR - https://www.scopus.com/pages/publications/85179557764
U2 - 10.1145/3581783.3612050
DO - 10.1145/3581783.3612050
M3 - Conference contribution
AN - SCOPUS:85179557764
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 3667
EP - 3675
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -