TY - JOUR
T1 - Multi-MELO: Unified multimodal model editing with dynamic LoRA
AU - Chen, Qin
AU - Yin, Jianghao
AU - Yu, Lang
AU - Zhou, Jie
AU - He, Liang
N1 - Publisher Copyright:
© 2025
PY - 2025/5/10
Y1 - 2025/5/10
N2 - Model editing aims to correct hallucinations or incorporate new knowledge into pre-trained neural networks. Most previous research focuses on editing models in the textual modality alone, while editing for multimodal models remains understudied. Recent work investigates how to adapt language model editors to multimodal scenarios; however, these methods are limited to image-to-text tasks and similar model architectures. The text-to-image editing task remains unexplored and poses significant challenges due to the diversity of complex network architectures. In this paper, we propose a unified multimodal model editing framework based on dynamic LoRA (Multi-MELO), which enables effective editing of various multimodal models by dynamically activating the LoRA blocks that encode the relevant knowledge. We apply the framework to editing diverse multimodal models (i.e., BLIP-2 and the latent diffusion model) on three downstream tasks: image captioning, visual question answering, and text-to-image generation. Experimental results show that Multi-MELO achieves superior editing performance compared to recent state-of-the-art baselines while requiring no extra training of additional modules.
KW - Diffusion model
KW - Knowledge editing
KW - Model editing
KW - Vision-language model
UR - https://www.scopus.com/pages/publications/85217898134
DO - 10.1016/j.eswa.2025.126766
M3 - Article
AN - SCOPUS:85217898134
SN - 0957-4174
VL - 273
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 126766
ER -