TY - JOUR
T1 - DVD
T2 - A Debiased Visual Dialog Model via Disentangling Knowledge Features
AU - Lu, Chenyu
AU - Zhao, Jing
AU - Sun, Shiliang
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Visual dialog aims to facilitate the answering of multi-round questions by effectively integrating dialog history and the relevant content of images. Existing methods in visual dialog predominantly concentrate on devising multi-modal data interaction architectures to augment multi-modal fusion performance, but they often disregard inherent dataset selection biases. This oversight can lead to imbalanced feature learning and compromise the robustness of the model. In this paper, we propose a Debiased Visual Dialog model (DVD) to mitigate the influence of biases. Specifically, we concretize these biases as spurious relationships between foreground and background knowledge in both image and dialog history modalities and design a dual-encoding workflow to disentangle them effectively. Additionally, we introduce a knowledge bias indicator for each sample, enabling us to assess and quantify the impact of biases on the learning process. By employing a generalized cross-entropy loss, we enhance the distinction of knowledge biases, which significantly improves the efficiency of feature disentanglement. Extensive comparative experiments against state-of-the-art methods, along with ablation studies, validate the effectiveness of our DVD model. These results also substantiate the promising potential of debiasing efforts in advancing the field of visual dialog and vision-language research.
AB - Visual dialog aims to facilitate the answering of multi-round questions by effectively integrating dialog history and the relevant content of images. Existing methods in visual dialog predominantly concentrate on devising multi-modal data interaction architectures to augment multi-modal fusion performance, but they often disregard inherent dataset selection biases. This oversight can lead to imbalanced feature learning and compromise the robustness of the model. In this paper, we propose a Debiased Visual Dialog model (DVD) to mitigate the influence of biases. Specifically, we concretize these biases as spurious relationships between foreground and background knowledge in both image and dialog history modalities and design a dual-encoding workflow to disentangle them effectively. Additionally, we introduce a knowledge bias indicator for each sample, enabling us to assess and quantify the impact of biases on the learning process. By employing a generalized cross-entropy loss, we enhance the distinction of knowledge biases, which significantly improves the efficiency of feature disentanglement. Extensive comparative experiments against state-of-the-art methods, along with ablation studies, validate the effectiveness of our DVD model. These results also substantiate the promising potential of debiasing efforts in advancing the field of visual dialog and vision-language research.
KW - debiasing task
KW - multi-modal learning
KW - Visual dialog
UR - https://www.scopus.com/pages/publications/105028283441
U2 - 10.1109/TMM.2026.3654405
DO - 10.1109/TMM.2026.3654405
M3 - Article
AN - SCOPUS:105028283441
SN - 1520-9210
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -