MSGeN: Multimodal Selective Generation Network for Grounded Explanations

  • Dingbang Li
  • Wenzhou Chen
  • Xin Lin*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Modern models have shown impressive capabilities in visual reasoning tasks. However, the interpretability of their decision-making processes remains a challenge, casting doubt on their reliability. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN generates explanations that seamlessly integrate information from multiple modalities, providing a comprehensive and intuitive understanding of its decisions. The model consists of five collaborative components: (1) the Multimodal Encoder, which encodes and fuses the input data; (2) the Reasoner, which generates stepwise inference states; (3) the Selector, which chooses the explanation modality for each step; (4) the Speaker, which generates natural-language descriptions; and (5) the Pointer, which produces visual cues. Together, these components produce explanations enriched with both natural-language context and visual cues. Extensive experiments demonstrate that MSGeN surpasses existing multimodal explanation generation models across various metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also present detailed visual examples and practical case studies highlighting MSGeN's ability to generate comprehensive and coherent explanations.
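The five-component pipeline described in the abstract can be sketched as a simple per-step control flow. The following is a minimal illustrative sketch, not the authors' implementation: all function names, the `"text"`/`"visual"` modality labels, and the placeholder bounding box are assumptions introduced here to show how a Reasoner state could be routed by a Selector to either a Speaker (language) or a Pointer (visual cue).

```python
# Hypothetical sketch of MSGeN-style selective explanation generation.
# Names and logic are illustrative assumptions, not the paper's code:
# the Reasoner yields one state per step, the Selector picks a modality,
# and the Speaker or Pointer renders that step's explanation.

def select_modality(state: str) -> str:
    """Toy Selector: emit a visual cue when the state mentions a region."""
    return "visual" if "region" in state else "text"

def speak(state: str) -> str:
    """Toy Speaker: wrap the inference state in a natural-language phrase."""
    return f"Because {state}."

def point(state: str) -> tuple:
    """Toy Pointer: return a placeholder bounding box as the visual cue."""
    return (0, 0, 32, 32)  # (x, y, width, height); a real model predicts this

def explain(states: list) -> list:
    """Route each stepwise inference state to the selected modality."""
    steps = []
    for state in states:
        modality = select_modality(state)
        content = point(state) if modality == "visual" else speak(state)
        steps.append((modality, content))
    return steps

# Example: two reasoning steps, one linguistic and one grounded visually.
explanation = explain([
    "the question asks about color",
    "region around the car is red",
])
```

Under this sketch, the first step yields a `("text", ...)` explanation and the second a `("visual", ...)` one, mirroring the paper's idea of choosing the most suitable modality per reasoning step.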

Original language: English
Article number: 152
Journal: Electronics (Switzerland)
Volume: 13
Issue number: 1
DOIs
State: Published - Jan 2024

Keywords

  • explanation generation
  • multimodal
  • vision and language
  • visual question answering
