Multi-MELO: Unified multimodal model editing with dynamic LoRA

Research output: Contribution to journal › Article › peer-review

Abstract

Model editing aims to correct hallucinations or incorporate new knowledge into pre-trained neural networks. Most previous research focuses on model editing for the textual modality alone, while editing for multimodal models is not well studied. Recent research investigates how to adapt language model editors to multimodal scenarios. However, these methods are limited to image-to-text tasks and similar model architectures. The text-to-image editing task remains unexplored, presenting significant challenges due to the diversity of complex network architectures. In this paper, we propose a unified multimodal model editing framework based on dynamic LoRA (Multi-MELO), which enables effective editing for various multimodal models by dynamically activating the LoRA blocks that encode the relevant knowledge. We apply the framework to editing diverse multimodal models (i.e., BLIP-2 and the latent diffusion model) on three downstream tasks: image captioning, visual question answering, and text-to-image generation. The experimental results show that Multi-MELO achieves superior editing performance compared to recent state-of-the-art baselines, while requiring no extra training of additional modules.
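The core idea described in the abstract, dynamically activating a LoRA block that encodes a particular edit, can be illustrated with a minimal sketch. The sketch below assumes a simple routing rule: each edit stores a key vector and a low-rank (A, B) pair, and at inference the nearest key within a radius selects which block to add to a frozen linear layer. The class name, routing rule, and all parameters are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

class DynamicLoRALayer:
    """Sketch of a frozen linear layer with dynamically routed LoRA blocks.

    Each edit stores a key vector plus low-rank factors (A, B); at inference
    the most similar key, if it lies within `radius`, selects which block to
    activate. This is an illustrative approximation, not the paper's code.
    """

    def __init__(self, W, radius=0.5):
        self.W = W          # frozen base weight, shape (out_dim, in_dim)
        self.radius = radius
        self.keys = []      # one unit key vector per edit
        self.blocks = []    # (A, B) low-rank pairs; delta weight = B @ A

    def add_edit(self, key, rank=4, scale=0.1, rng=None):
        # Register a new edit: a key for routing and a fresh LoRA block.
        rng = rng or np.random.default_rng(0)
        out_dim, in_dim = self.W.shape
        A = rng.normal(scale=scale, size=(rank, in_dim))
        B = rng.normal(scale=scale, size=(out_dim, rank))
        self.keys.append(key / np.linalg.norm(key))
        self.blocks.append((A, B))

    def forward(self, x):
        y = self.W @ x
        if self.keys:
            q = x / np.linalg.norm(x)
            sims = np.array([k @ q for k in self.keys])
            i = int(np.argmax(sims))
            # Activate a block only if the input falls inside its edit scope;
            # otherwise the frozen base weights are used unchanged.
            if 1.0 - sims[i] < self.radius:
                A, B = self.blocks[i]
                y = y + B @ (A @ x)
        return y
```

Because out-of-scope inputs bypass every block, edits stay local: the base model's behavior is preserved except on inputs close to a stored key, which is the property that lets many edits coexist without extra training of auxiliary modules.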

Original language: English
Article number: 126766
Journal: Expert Systems with Applications
Volume: 273
DOIs
State: Published - 10 May 2025

Keywords

  • Diffusion model
  • Knowledge editing
  • Model editing
  • Vision-language model
