TY - JOUR
T1 - BiMAC
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Zareapoor, Masoumeh
AU - Shamsolmoali, Pourya
AU - Lu, Yue
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - Achieving robust performance in vision-language tasks requires strong multimodal alignment, where textual and visual data interact seamlessly. Existing frameworks often combine contrastive learning with image captioning to unify visual and textual representations. However, reliance on global representations and unidirectional information flow from images to text limits their ability to reconstruct visual content accurately from textual descriptions. To address this limitation, we propose BiMAC, a novel framework that enables bidirectional interactions between images and text at both global and local levels. BiMAC employs advanced components to simultaneously reconstruct visual content from textual cues and generate textual descriptions guided by visual features. By integrating a text-region alignment mechanism, BiMAC identifies and selects relevant image patches for precise cross-modal interaction, reducing information noise and enhancing mapping accuracy. BiMAC achieves state-of-the-art performance across diverse vision-language tasks, including image-text retrieval, captioning, and classification.
UR - https://www.scopus.com/pages/publications/105003996254
U2 - 10.1609/aaai.v39i21.34384
DO - 10.1609/aaai.v39i21.34384
M3 - Conference article
AN - SCOPUS:105003996254
SN - 2159-5399
VL - 39
SP - 22290
EP - 22298
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 21
Y2 - 25 February 2025 through 4 March 2025
ER -