TY - JOUR
T1 - MARS
T2 - Multimodal-Assisted Refined Semantic Alignment
AU - Xu, Junjie
AU - Wu, Xingjiao
AU - Zhang, Zihao
AU - Yang, Shuwen
AU - Ma, Tianlong
AU - Dong, Daoguo
AU - He, Liang
N1 - Publisher Copyright: © 2025
PY - 2026/1
Y1 - 2026/1
N2 - Audio-to-image generation (AIG) faces challenges in fine-grained semantic alignment, particularly semantic misalignment and loss of visual detail. To address these issues, we proposed MARS (Multimodal-Assisted Refined Semantic alignment), a novel framework that leverages a Mamba-based audio encoder to manage the complexity of long audio sequences, coupled with a fine-grained multimodal alignment strategy that uses visual descriptions from multimodal large language models. We enhanced semantic coherence and aesthetic quality by fine-tuning an image generator using an image aesthetic perception generator. Furthermore, we validated MARS on the VGGSound and VEGAS benchmarks, comprising 37,250 and 9,500 records, respectively. The results suggest that MARS significantly outperforms existing methods, achieving average improvements of 28.73% in semantic relevance and 127.35% in aesthetic scores compared with the best AIG baseline. In addition, cross-domain evaluations on the AudioCaps and Clotho datasets confirmed the robustness and generalization capability of MARS, with an average improvement of 73.9% on the V2A metric.
AB - Audio-to-image generation (AIG) faces challenges in fine-grained semantic alignment, particularly semantic misalignment and loss of visual detail. To address these issues, we proposed MARS (Multimodal-Assisted Refined Semantic alignment), a novel framework that leverages a Mamba-based audio encoder to manage the complexity of long audio sequences, coupled with a fine-grained multimodal alignment strategy that uses visual descriptions from multimodal large language models. We enhanced semantic coherence and aesthetic quality by fine-tuning an image generator using an image aesthetic perception generator. Furthermore, we validated MARS on the VGGSound and VEGAS benchmarks, comprising 37,250 and 9,500 records, respectively. The results suggest that MARS significantly outperforms existing methods, achieving average improvements of 28.73% in semantic relevance and 127.35% in aesthetic scores compared with the best AIG baseline. In addition, cross-domain evaluations on the AudioCaps and Clotho datasets confirmed the robustness and generalization capability of MARS, with an average improvement of 73.9% on the V2A metric.
KW - Audio-based image generation
KW - Audio-visual learning
KW - Contrastive learning
KW - Multimodal alignment
UR - https://www.scopus.com/pages/publications/105012439863
U2 - 10.1016/j.ipm.2025.104292
DO - 10.1016/j.ipm.2025.104292
M3 - Article
AN - SCOPUS:105012439863
SN - 0306-4573
VL - 63
JO - Information Processing and Management
JF - Information Processing and Management
IS - 1
M1 - 104292
ER -