Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Aimin Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
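The abstract describes the counterfactual augmentation only at a high level, so the sketch below is a minimal illustration of the general idea rather than the paper's actual pipeline: construct a "hallucinated" negative caption by swapping an object present in the image for one that is absent, then fine-tune CLIP so the true caption outscores the negative. It assumes the HuggingFace transformers CLIP bindings; the make_counterfactual helper, the margin value, and the image path are hypothetical placeholders.

```python
# Minimal sketch of counterfactual negative-caption augmentation for
# CLIP fine-tuning. The real negative-sample construction in the paper
# may differ; this only illustrates the object-swap idea.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def make_counterfactual(caption: str, present: str, absent: str) -> str:
    """Swap an object that appears in the image for one that does not,
    yielding a 'hallucinated' negative caption (hypothetical helper)."""
    return caption.replace(present, absent)

image = Image.open("example.jpg")  # placeholder: any image with a known object
pos = "a dog sitting on the grass"
neg = make_counterfactual(pos, "dog", "cat")  # object-hallucination negative

inputs = processor(text=[pos, neg], images=image,
                   return_tensors="pt", padding=True)
# logits_per_image has shape (1, 2): similarity of the image to each caption.
logits = model(**inputs).logits_per_image.squeeze(0)

# Margin loss: the true caption should score higher than the
# counterfactual one by at least `margin` (value chosen arbitrarily here).
margin = 0.5
loss = torch.clamp(margin - (logits[0] - logits[1]), min=0.0)
loss.backward()  # gradients for fine-tuning CLIP against hallucinations
```

In practice the negatives would be generated systematically, e.g. from per-image object annotations covering several hallucination types as the abstract suggests, and the margin objective would be combined with CLIP's standard contrastive loss over full batches.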

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 18288-18301
Number of pages: 14
ISBN (Electronic): 9798891761643
State: Published - 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 – 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 12/11/24 – 16/11/24
