BiMAC: Bidirectional Multimodal Alignment in Contrastive Learning

Masoumeh Zareapoor, Pourya Shamsolmoali, Yue Lu

Research output: Contribution to journal › Conference article › peer-review

Abstract

Achieving robust performance in vision-language tasks requires strong multimodal alignment, where textual and visual data interact seamlessly. Existing frameworks often combine contrastive learning with image captioning to unify visual and textual representations. However, reliance on global representations and unidirectional information flow from images to text limits their ability to reconstruct visual content accurately from textual descriptions. To address this limitation, we propose BiMAC, a novel framework that enables bidirectional interactions between images and text at both global and local levels. BiMAC employs advanced components to simultaneously reconstruct visual content from textual cues and generate textual descriptions guided by visual features. By integrating a text-region alignment mechanism, BiMAC identifies and selects relevant image patches for precise cross-modal interaction, reducing information noise and enhancing mapping accuracy. BiMAC achieves state-of-the-art performance across diverse vision-language tasks, including image-text retrieval, captioning, and classification.
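To make the bidirectional alignment idea concrete, below is a minimal sketch of a symmetric (image-to-text and text-to-image) contrastive objective in PyTorch. This is a generic CLIP-style InfoNCE loss illustrating the kind of two-way global alignment the abstract describes; it is not BiMAC's actual loss, and the function name, temperature value, and embedding shapes are assumptions for illustration only. The paper's local text-region alignment and cross-modal reconstruction components are not reproduced here.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Bidirectional InfoNCE loss over a batch of paired embeddings.

    Hypothetical illustration of two-way contrastive alignment; not
    BiMAC's published objective. Expects image_emb and text_emb of
    shape (batch, dim), where row i of each tensor is a matched pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the two directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Averaging the two cross-entropy terms is what makes the alignment bidirectional: each image must retrieve its caption among all captions in the batch, and each caption must retrieve its image among all images.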

Original language: English
Pages (from-to): 22290-22298
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 21
DOIs
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 – 4 Mar 2025
