TY - GEN
T1 - Not all pixels are matched
T2 - 30th ACM International Conference on Multimedia, MM 2022
AU - Sun, Hanzhe
AU - Liu, Jun
AU - Zhang, Zhizhong
AU - Wang, Chengjie
AU - Qu, Yanyun
AU - Xie, Yuan
AU - Ma, Lizhuang
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
N2 - Visible-Infrared Person Re-Identification (VI-ReID) has become an emerging task for night-time surveillance systems. To reduce the cross-modality discrepancy, previous works either align features via metric learning or synthesize cross-modality images with Generative Adversarial Networks. However, feature-level alignment ignores the heterogeneous data itself, while generative frameworks suffer from low generation quality, limiting their applications. In this paper, we propose a dense contrastive learning framework (DCLNet), which performs pixel-to-pixel dense alignment on the intermediate representations rather than the final deep features. It introduces a new loss function that pulls views of positive pixels sharing the same semantic information closer in the shallow representation space, while pushing views of negative pixels apart. This naturally provides additional dense supervision and captures fine-grained pixel correspondence, reducing the modality gap from a new perspective. To implement it, a Part Aware Parsing (PAP) module and a Semantic Rectification Module (SRM) are introduced to learn and refine a semantic-guided mask, allowing us to efficiently find positive pairs while requiring only instance-level supervision. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the superiority of our pipeline over the state of the art. Code is available at https://github.com/sunhz0117/DCLNet.
KW - cross-modality alignment
KW - dense contrastive learning
KW - visible-infrared person re-identification
UR - https://www.scopus.com/pages/publications/85148757579
U2 - 10.1145/3503161.3547970
DO - 10.1145/3503161.3547970
M3 - Conference contribution
AN - SCOPUS:85148757579
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 5333
EP - 5341
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 10 October 2022 through 14 October 2022
ER -