Modeling Intra- and Inter-Modal Alignment with Optimal Transport for Visual Dialog

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Visual dialog aims to answer a sequence of questions by reasoning over both the dialog history and the image content. While existing methods primarily focus on devising attention mechanisms to capture interactions between modalities, explicit signals that encourage semantic alignment in visual dialog are seldom used. In this paper, we present a novel approach that leverages Optimal Transport to provide explicit and interpretable training signals that guide intra- and inter-modal alignment between the text and the image in visual dialog. Specifically, our approach consists of two kinds of alignment modules: Word-Word Alignment (WWA) and Region-Word Alignment (RWA). The WWA module learns latent relationships between a given question and the dialog history to align different concepts or pronouns that refer to the same entity. The RWA module models the internal structures of text and images as graphs and performs graph matching for region-word alignment. We perform experiments on the benchmark dataset VisDial v1.0, and the results show that our proposed approach achieves new state-of-the-art performance on most metrics.
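To make the role of Optimal Transport in word-word alignment more concrete, the following is a minimal sketch of entropy-regularised optimal transport (Sinkhorn iterations) between two small sets of word vectors. It is not the authors' implementation: the function names `cosine_cost` and `sinkhorn`, the cosine-distance cost, the uniform marginals, the regularisation strength `eps`, and the toy 2-d "embeddings" are all assumptions chosen for illustration.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix with C[i][j] = 1 - cosine similarity of xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT between uniform marginals.

    Returns the transport plan as a list of rows. Uniform marginals are an
    assumption of this sketch, not something stated in the abstract.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each source item (e.g. question words)
    b = [1.0 / m] * m  # mass on each target item (e.g. history words)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy vectors standing in for word embeddings.
question = [[1.0, 0.0], [0.0, 1.0]]
history = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

cost = cosine_cost(question, history)
plan = sinkhorn(cost)

# The OT cost sum_ij plan[i][j] * cost[i][j] is a scalar that could serve
# as an alignment signal; larger plan entries mark more strongly aligned pairs.
ot_cost = sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

In this sketch each row of `plan` shows how one question word distributes its mass over the history words, which is one way such an alignment can be read off and inspected. How the paper defines its costs, marginals, and training loss is not specified in the abstract.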

Original language: English
Title of host publication: Proceedings - 2023 IEEE 35th International Conference on Tools with Artificial Intelligence, ICTAI 2023
Publisher: IEEE Computer Society
Pages: 805-812
Number of pages: 8
ISBN (Electronic): 9798350342734
State: Published - 2023
Event: 35th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2023 - Atlanta, United States
Duration: 6 Nov 2023 to 8 Nov 2023

Publication series

Name: Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
ISSN (Print): 1082-3409

Conference

Conference: 35th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2023
Country/Territory: United States
City: Atlanta
Period: 6/11/23 to 8/11/23

Keywords

  • alignment
  • optimal transport
  • visual dialog
