TY - JOUR
T1 - Efficient multimodal large language models
T2 - a survey
AU - Jin, Yizhang
AU - Li, Jian
AU - Gu, Tianjun
AU - Liu, Yexin
AU - Zhao, Bo
AU - Lai, Jinxiang
AU - Gan, Zhenye
AU - Wang, Yabiao
AU - Wang, Chengjie
AU - Tan, Xin
AU - Ma, Lizhuang
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
N2 - In recent years, multimodal large language models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering and visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, this survey summarizes the timeline of representative efficient MLLMs, the current state of research in structures and strategies, and the applications. Finally, the limitations of current efficient MLLM research and promising future directions are discussed.
AB - In recent years, multimodal large language models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering and visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, this survey summarizes the timeline of representative efficient MLLMs, the current state of research in structures and strategies, and the applications. Finally, the limitations of current efficient MLLM research and promising future directions are discussed.
KW - Efficiency
KW - Instruction tuning
KW - Multi-modal large language model (MLLM)
KW - Vision token compression
UR - https://www.scopus.com/pages/publications/105024897736
U2 - 10.1007/s44267-025-00099-6
DO - 10.1007/s44267-025-00099-6
M3 - Literature review
AN - SCOPUS:105024897736
SN - 2097-3330
VL - 3
JO - Visual Intelligence
JF - Visual Intelligence
IS - 1
M1 - 27
ER -