Adaptive Momentum Mixture-of-Experts for Continual Visual Question Answering

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal large language models (MLLMs) have attracted considerable attention for their impressive capabilities in understanding and generating visual-language content, particularly in tasks such as visual question answering (VQA). However, the rapid evolution of knowledge in real-world applications poses challenges for these models: offline training becomes increasingly costly, and exposure to non-stationary data streams often leads to catastrophic forgetting. In this paper, we propose CL-MoE+, a dual-momentum Mixture-of-Experts (MoE) framework built on MLLMs for continual VQA. Our method integrates continual learning into MLLMs to leverage the rich commonsense knowledge embedded in large language models. We introduce a Dual-Router MoE (RMoE) module that selects both global and local experts through task-level and instance-level routers, enabling robust and context-aware expert allocation. Furthermore, we design an adaptive Momentum MoE (MMoE) that updates experts’ parameters according to the degree of knowledge drift and their relevance to specific tasks, thereby integrating new knowledge without forgetting. Extensive experiments on a 10-task split of the VQA v2 benchmark demonstrate that CL-MoE+ achieves state-of-the-art performance, validating its effectiveness at both retaining historical knowledge and learning new information in the continual learning setting.
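To make the two components concrete, the PyTorch sketch below illustrates (1) a dual-router gate that combines task-level and instance-level routing scores to pick top-k experts, and (2) a momentum-style blend of old and new expert weights driven by relevance and drift. This is a minimal sketch under stated assumptions: all module names, shapes, and the specific blending rule `alpha = relevance * (1 - drift)` are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualRouterMoE(nn.Module):
    """Sketch of a dual-router MoE layer: a task-level router scores experts
    from a task embedding, an instance-level router scores them from the
    input, and the combined gate selects top-k experts per instance."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        self.task_router = nn.Linear(dim, num_experts)      # task-level gate
        self.instance_router = nn.Linear(dim, num_experts)  # instance-level gate
        self.top_k = top_k

    def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); task_emb: (dim,), shared by all instances of a task.
        logits = self.instance_router(x) + self.task_router(task_emb)
        weights, idx = F.softmax(logits, dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for k in range(self.top_k):
                expert = self.experts[idx[b, k].item()]
                out[b] += weights[b, k] * expert(x[b])
        return out


@torch.no_grad()
def momentum_update(expert_old: nn.Module, expert_new: nn.Module,
                    relevance: float, drift: float) -> None:
    """Blend an expert's pre-task and post-task weights. The mixing
    coefficient grows with the expert's task relevance and shrinks with
    the measured knowledge drift (a heuristic stand-in for the paper's rule)."""
    alpha = relevance * (1.0 - drift)  # adaptive momentum coefficient in [0, 1]
    for p_old, p_new in zip(expert_old.parameters(), expert_new.parameters()):
        p_old.mul_(1.0 - alpha).add_(p_new, alpha=alpha)
```

In this reading, experts with high task relevance and low drift absorb more of the newly learned weights, while experts important to earlier tasks change slowly, which is one plausible way to trade plasticity against forgetting.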

Keywords

  • Visual question answering
  • Continual learning
  • Large vision-language model
