MARS: Multimodal-Assisted Refined Semantic Alignment

  • Junjie Xu
  • Xingjiao Wu*
  • Zihao Zhang
  • Shuwen Yang
  • Tianlong Ma
  • Daoguo Dong
  • Liang He

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Audio-to-image generation (AIG) faces challenges in fine-grained semantic alignment, particularly semantic misalignment and loss of visual detail. To address these issues, we propose MARS (Multimodal-Assisted Refined Semantic Alignment), a novel framework that leverages a Mamba-based audio encoder to manage the complexity of long audio sequences, coupled with a fine-grained multimodal alignment strategy that uses visual descriptions generated by multimodal large language models. We further enhance semantic coherence and aesthetic quality by fine-tuning the image generator with an image aesthetic perception generator. We validate MARS on the VGGSound and VEGAS benchmarks, comprising 37,250 and 9,500 records, respectively. The results show that MARS significantly outperforms existing methods, achieving average improvements of 28.73% in semantic relevance and 127.35% in aesthetic score over the best AIG baseline. In addition, cross-domain evaluations on the AudioCaps and Clotho datasets confirm the robustness and generalization capability of MARS, with an average improvement of 73.9% on the V2A metric.
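The fine-grained alignment described above pairs audio representations with text-side visual descriptions via contrastive learning (a listed keyword of the paper). As an illustration only, not the paper's actual objective or code, a symmetric InfoNCE loss over a batch of paired embeddings can be sketched as follows; the function name, NumPy implementation, and temperature value are assumptions for this sketch:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/text pairs sit on the diagonal
    of the similarity matrix and are pulled together; all other entries
    in the batch act as negatives."""
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = a @ t.T / temperature          # (B, B) cosine similarities
    idx = np.arange(len(a))                 # index of each positive pair

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)      # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()           # diagonal = positives

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

For well-aligned pairs the loss is small; shuffling the pairing (so the diagonal no longer holds matches) increases it, which is the signal a contrastive alignment stage optimizes against.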

Original language: English
Article number: 104292
Journal: Information Processing and Management
Volume: 63
Issue number: 1
DOIs
State: Published - Jan 2026

Keywords

  • Audio-based image generation
  • Audio-visual learning
  • Contrastive learning
  • Multimodal alignment
