
Visual Context and Commonsense-Guided Causal Chain-of-Thoughts for Visual Commonsense Reasoning

  • Xinyu Li
  • Jing Zhao*
  • Tongquan Wei
  • Shiliang Sun*
  • *Corresponding author for this work
  • East China Normal University
  • Shanghai Jiao Tong University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Humans can infer dynamic context from a still image and, given additional commonsense knowledge, can accurately complete visual commonsense reasoning tasks. Nevertheless, this remains a highly challenging cognitive-level task for current vision-language models. Previous work has primarily focused on using models fine-tuned for specific downstream tasks and on introducing external world knowledge to tackle these tasks, while neglecting the importance of accurate context and the key role of commonsense knowledge in reasoning. In this paper, we propose a novel framework that enhances visual commonsense reasoning by incorporating context and commonsense knowledge. We decompose the visual commonsense reasoning problem into four distinct but interrelated sub-problems and combine vision-language models with a large language model to enable zero-shot reasoning. The uniqueness of this work lies in the proposed commonsense knowledge filtering module, which filters for relevant commonsense knowledge using the causal strength of the visual context. This process constructs Visual Context and Commonsense-guided Causal Chain-of-Thought (VC3-CoT) reasoning paths, which, combined with a weighted majority voting strategy, provide double robustness to visual commonsense reasoning. Extensive experiments on several downstream tasks demonstrate that the proposed method significantly improves performance over baseline models and the state-of-the-art method, and confirm the effectiveness of the proposed components.
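
The abstract mentions a weighted majority voting strategy over reasoning paths but does not say how the weights are computed or combined. As a rough illustration only, the generic technique can be sketched as follows. The function name, the (answer, weight) representation, and the example weights are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_majority_vote(paths):
    """Return the answer whose reasoning paths carry the largest total weight.

    `paths` is an iterable of (answer, weight) pairs. Both the pair format and
    the meaning of `weight` are assumptions made for this illustration; the
    abstract does not specify how path weights are obtained.
    """
    totals = defaultdict(float)
    for answer, weight in paths:
        totals[answer] += weight  # accumulate the weight behind each candidate
    # max() raises ValueError if no paths were supplied
    return max(totals, key=totals.get)


# Hypothetical example: two low-weight paths agree on "B", one high-weight path says "A".
example_paths = [("A", 0.9), ("B", 0.3), ("B", 0.4)]
print(weighted_majority_vote(example_paths))  # prints A
```

In this toy case an unweighted majority vote would choose "B" (two paths against one), while the weighted vote chooses "A" because its single path outweighs the two "B" paths combined.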

Original language: English
Pages (from-to): 2719-2730
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 28
DOI
Publication status: Published - 2026
