TY - JOUR
T1 - Adaptive Momentum Mixture-of-Experts for Continual Visual Question Answering
AU - Huai, Tianyu
AU - Zhou, Jie
AU - Chen, Qin
AU - Bai, Qingchun
AU - Zhou, Ze
AU - Qiu, Xipeng
AU - He, Liang
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Multimodal large language models (MLLMs) have attracted considerable attention for their impressive capabilities in understanding and generating visual-language content, particularly in tasks such as visual question answering (VQA). However, the rapid evolution of knowledge in real-world applications poses challenges for these models: offline training becomes increasingly costly, and exposure to non-stationary data streams often leads to catastrophic forgetting. In this paper, we propose CL-MoE+, a dual-momentum Mixture-of-Experts (MoE) framework based on MLLMs for continual VQA. Our method integrates continual learning into MLLMs to leverage the rich commonsense knowledge embedded in large language models. We introduce a Dual-Router MoE (RMoE) module that selects both global and local experts through task-level and instance-level routers, enabling robust and context-aware expert allocation. Furthermore, we design an adaptive Momentum MoE (MMoE) to update experts’ parameters based on the knowledge drift degree and their relevance to specific tasks, thereby facilitating knowledge integration without forgetting. Extensive experiments on a 10-task split of the VQA v2 benchmark demonstrate that CL-MoE+ achieves state-of-the-art performance, validating its effectiveness in both retaining historical knowledge and learning new information in the continual learning setting.
KW - Visual question answering
KW - continual learning
KW - large vision-language model
UR - https://www.scopus.com/pages/publications/105023184274
U2 - 10.1109/TCSVT.2025.3637303
DO - 10.1109/TCSVT.2025.3637303
M3 - Article
AN - SCOPUS:105023184274
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -