DVD: A Debiased Visual Dialog Model via Disentangling Knowledge Features

  • Chenyu Lu
  • Jing Zhao
  • Shiliang Sun*
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Visual dialog aims to facilitate the answering of multi-round questions by effectively integrating dialog history and the relevant content of images. Existing methods in visual dialog predominantly concentrate on devising multi-modal data interaction architectures to augment multi-modal fusion performance, but they often disregard inherent dataset selection biases. This oversight can lead to imbalanced feature learning and compromise the robustness of the model. In this paper, we propose a Debiased Visual Dialog model (DVD) to mitigate the influence of biases. Specifically, we concretize these biases as spurious relationships between foreground and background knowledge in both image and dialog history modalities and design a dual-encoding workflow to disentangle them effectively. Additionally, we introduce a knowledge bias indicator for each sample, enabling us to assess and quantify the impact of biases on the learning process. By employing a generalized cross-entropy loss, we enhance the distinction of knowledge biases, which significantly improves the efficiency of feature disentanglement. Extensive comparative experiments against state-of-the-art methods, along with ablation studies, validate the effectiveness of our DVD model. These results also substantiate the promising potential of debiasing efforts in advancing the field of visual dialog and vision-language research.
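
The abstract names a generalized cross-entropy (GCE) loss but does not give its form. As a rough illustration only, the sketch below uses the GCE definition commonly found in the noisy-label and debiasing literature, (1 - p_y^q) / q, where p_y is the softmax probability of the target class. The function names, the default q = 0.7, and the toy logits are assumptions made for this example; the abstract does not state the paper's exact formulation, its choice of q, or how the loss interacts with the knowledge bias indicator.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for a single sample: -log p_target."""
    return -math.log(softmax(logits)[target])


def generalized_cross_entropy(logits, target, q=0.7):
    """GCE for a single sample: (1 - p_target ** q) / q.

    q is a hyperparameter in (0, 1]; the default here is an assumption,
    not a value taken from the paper. As q approaches 0 the loss tends
    to standard cross-entropy; at q = 1 it equals 1 - p_target.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    p_target = softmax(logits)[target]
    return (1.0 - p_target ** q) / q


if __name__ == "__main__":
    # Toy logits for a three-way answer choice (hypothetical values).
    logits = [2.0, 0.5, -1.0]
    for target in range(3):
        ce = cross_entropy(logits, target)
        gce = generalized_cross_entropy(logits, target)
        print(f"target={target}  CE={ce:.4f}  GCE={gce:.4f}")
```

Under this definition GCE never exceeds cross-entropy and is bounded above by 1/q, so samples the model already finds hard contribute comparatively less than they would under cross-entropy. That property is one plausible reason such a loss could help separate bias-aligned samples from bias-conflicting ones, though the abstract itself does not spell out the mechanism.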

Original language: English
Journal: IEEE Transactions on Multimedia
State: Accepted/In press - 2026

Keywords

  • debiasing task
  • multi-modal learning
  • Visual dialog
