摘要
Humans are capable of inferring dynamic context from a still image and, with the provision of additional commonsense knowledge, can accurately complete visual commonsense reasoning tasks. Nevertheless, this remains a highly challenging cognitive-level task for current vision-language models. Previous work has primarily focused on utilizing models fine-tuned for specific downstream tasks and introduces external world knowledge to tackle these challenging tasks, while neglecting the importance of accurate context and the key role of commonsense knowledge in reasoning. In this paper, we propose a novel framework to enhance visual commonsense reasoning by incorporating context and commonsense knowledge. We decompose the visual commonsense reasoning problem into four distinct but interrelated sub-problems and combine visual language models with a large language model to enable zero-shot reasoning. The uniqueness of this work lies in the proposed commonsense knowledge filtering module, which filters out relevant commonsense knowledge through the causal strength of visual context. This process constructs Visual Context and Commonsense-guided Causal Chain-of-Thought (VC3-CoT) reasoning paths, thereby providing double robustness to visual commonsense reasoning by incorporating weighted majority voting strategy. Extensive experiments on several downstream tasks demonstrate that the proposed method significantly improves performance compared to baseline models and the state-of-the-art method, and confirm the effectiveness of the proposed components.
| 源语言 | 英语 |
|---|---|
| 页(从-至) | 2719-2730 |
| 页数 | 12 |
| 期刊 | IEEE Transactions on Multimedia |
| 卷 | 28 |
| DOI | |
| 出版状态 | 已出版 - 2026 |
指纹
探究 'Visual Context and Commonsense-Guided Causal Chain-of-Thoughts for Visual Commonsense Reasoning' 的科研主题。它们共同构成独一无二的指纹。引用此
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver