TY - GEN
T1 - XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding
T2 - Findings of the Association for Computational Linguistics, ACL 2023
AU - Tang, Moming
AU - Wang, Chengyu
AU - Wang, Jianing
AU - Tan, Chuanqi
AU - Huang, Songfang
AU - Chen, Cen
AU - Qian, Weining
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Recently, Contrastive Visual-Language Pretraining (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervised signals in small datasets, we adopt contrastive learning to utilize the implicit sorting information of ground-truth labels to provide more supervised cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.
AB - Recently, Contrastive Visual-Language Pretraining (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervised signals in small datasets, we adopt contrastive learning to utilize the implicit sorting information of ground-truth labels to provide more supervised cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.
UR - https://www.scopus.com/pages/publications/85175466994
U2 - 10.18653/v1/2023.findings-acl.397
DO - 10.18653/v1/2023.findings-acl.397
M3 - Conference contribution
AN - SCOPUS:85175466994
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 6368
EP - 6376
BT - Findings of the Association for Computational Linguistics, ACL 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -