TY - GEN
T1 - Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
AU - Liu, Yufang
AU - Ji, Tao
AU - Sun, Changzhi
AU - Wu, Yuanbin
AU - Zhou, Aimin
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
AB - Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
UR - https://www.scopus.com/pages/publications/85217747863
U2 - 10.18653/v1/2024.emnlp-main.1016
DO - 10.18653/v1/2024.emnlp-main.1016
M3 - Conference contribution
AN - SCOPUS:85217747863
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 18288
EP - 18301
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Y2 - 12 November 2024 through 16 November 2024
ER -