TY - JOUR
T1 - Simple contrastive learning in a self-supervised manner for robust visual question answering
AU - Yang, Shuwen
AU - Xiao, Luwei
AU - Wu, Xingjiao
AU - Xu, Junjie
AU - Wang, Linlin
AU - He, Liang
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/4
Y1 - 2024/4
N2 - Recent observations have revealed that Visual Question Answering (VQA) models are susceptible to learning spurious correlations formed by dataset biases, i.e., language priors, instead of the intended solution. For instance, given a question and a related image, some VQA systems tend to provide the answer that occurs most frequently in the dataset while disregarding the image content. This tendency makes them brittle in real-world settings and harms the robustness of VQA models. We experimentally found that conventional VQA methods often confuse negative samples with identical questions but different images, which results in linguistic bias. In this paper, we propose a simple contrastive learning scheme, namely SCLSM, to mitigate these issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach to help the model learn more discriminative multimodal features, thus improving its debiasing ability. SCLSM is compatible with numerous VQA baselines. Experimental results on the widely used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of our proposed model.
AB - Recent observations have revealed that Visual Question Answering (VQA) models are susceptible to learning spurious correlations formed by dataset biases, i.e., language priors, instead of the intended solution. For instance, given a question and a related image, some VQA systems tend to provide the answer that occurs most frequently in the dataset while disregarding the image content. This tendency makes them brittle in real-world settings and harms the robustness of VQA models. We experimentally found that conventional VQA methods often confuse negative samples with identical questions but different images, which results in linguistic bias. In this paper, we propose a simple contrastive learning scheme, namely SCLSM, to mitigate these issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach to help the model learn more discriminative multimodal features, thus improving its debiasing ability. SCLSM is compatible with numerous VQA baselines. Experimental results on the widely used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of our proposed model.
KW - Contrastive learning
KW - Deep learning
KW - Information extraction
KW - Visual question answering
UR - https://www.scopus.com/pages/publications/85186271711
U2 - 10.1016/j.cviu.2024.103976
DO - 10.1016/j.cviu.2024.103976
M3 - Article
AN - SCOPUS:85186271711
SN - 1077-3142
VL - 241
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
M1 - 103976
ER -