
Visual Context and Commonsense-Guided Causal Chain-of-Thoughts for Visual Commonsense Reasoning

  • Xinyu Li
  • Jing Zhao*
  • Tongquan Wei
  • Shiliang Sun*
  • *Corresponding author for this work
  • East China Normal University
  • Shanghai Jiao Tong University

Research output: Contribution to journal (article, peer-reviewed)

Abstract

Humans can infer dynamic context from a still image and, given additional commonsense knowledge, can accurately complete visual commonsense reasoning tasks. Nevertheless, this remains a highly challenging cognitive-level task for current vision-language models. Previous work has primarily focused on using models fine-tuned for specific downstream tasks and introducing external world knowledge to tackle these tasks, while neglecting the importance of accurate context and the key role of commonsense knowledge in reasoning. In this paper, we propose a novel framework that enhances visual commonsense reasoning by incorporating context and commonsense knowledge. We decompose the visual commonsense reasoning problem into four distinct but interrelated sub-problems and combine vision-language models with a large language model to enable zero-shot reasoning. The uniqueness of this work lies in the proposed commonsense knowledge filtering module, which selects relevant commonsense knowledge according to the causal strength of the visual context. This process constructs Visual Context and Commonsense-guided Causal Chain-of-Thought (VC3-CoT) reasoning paths, which, combined with a weighted majority voting strategy, provide double robustness to visual commonsense reasoning. Extensive experiments on several downstream tasks demonstrate that the proposed method significantly improves performance compared to baseline models and the state-of-the-art method, and confirm the effectiveness of the proposed components.
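
The abstract mentions a weighted majority voting strategy over reasoning paths but does not specify how it works. As a rough illustration only, the sketch below shows generic weighted majority voting: each reasoning path contributes a candidate answer and a weight, and the answer with the largest total weight wins. The function name, the `(answer, weight)` input format, and the example weights are assumptions made for illustration and are not taken from the paper; in particular, how the paper derives its weights is not stated here.

```python
from collections import defaultdict


def weighted_majority_vote(candidates):
    """Return the answer with the highest summed weight.

    `candidates` is an iterable of (answer, weight) pairs, one per
    reasoning path. This input format is a hypothetical choice for
    this sketch, not the paper's interface.
    """
    totals = defaultdict(float)
    for answer, weight in candidates:
        totals[answer] += weight
    if not totals:
        raise ValueError("no candidate answers to vote on")
    # Pick the answer whose accumulated weight is largest.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical weights for four reasoning paths.
    paths = [("A", 0.9), ("B", 0.6), ("B", 0.5), ("C", 0.2)]
    print(weighted_majority_vote(paths))
```

In this example, answer "B" accumulates a total weight of 1.1 across two paths and so outvotes "A", even though "A" has the single highest-weighted path.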

Original language: English
Pages (from-to): 2719-2730
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 28
State: Published - 2026

Keywords

  • causality
  • chain-of-thoughts
  • Visual commonsense reasoning
  • visual context
