TY - JOUR
T1 - MSGeN: Multimodal Selective Generation Network for Grounded Explanations
AU - Li, Dingbang
AU - Chen, Wenzhou
AU - Lin, Xin
N1 - Publisher Copyright:
© 2023 by the authors.
PY - 2024/1
Y1 - 2024/1
N2 - Modern models have shown impressive capabilities in visual reasoning tasks. However, the interpretability of their decision-making processes remains a challenge, undermining confidence in their reliability. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN generates explanations that seamlessly integrate diverse modal information, providing a comprehensive and intuitive understanding of its decisions. The model consists of five collaborative components: (1) the Multimodal Encoder, which encodes and fuses input data; (2) the Reasoner, which generates stepwise inference states; (3) the Selector, which selects the modality for each step’s explanation; (4) the Speaker, which generates natural language descriptions; and (5) the Pointer, which produces visual cues. These components work in concert to generate explanations enriched with natural language context and visual cues. Our extensive experiments demonstrate that MSGeN surpasses existing multimodal explanation generation models across various metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also present detailed visual examples highlighting MSGeN’s ability to generate comprehensive and coherent explanations, showcasing its effectiveness through practical case studies.
KW - explanation generation
KW - multimodal
KW - vision and language
KW - visual question answering
UR - https://www.scopus.com/pages/publications/85181913701
DO - 10.3390/electronics13010152
M3 - Article
AN - SCOPUS:85181913701
SN - 2079-9292
VL - 13
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 1
M1 - 152
ER -