Efficient multimodal large language models: a survey

  • Yizhang Jin
  • Jian Li
  • Tianjun Gu
  • Yexin Liu
  • Bo Zhao
  • Jinxiang Lai
  • Zhenye Gan
  • Yabiao Wang
  • Chengjie Wang
  • Xin Tan
  • Lizhuang Ma*

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

Abstract

In recent years, multimodal large language models (MLLMs) have demonstrated remarkable performance on tasks such as visual question answering, visual understanding, and reasoning. However, their extensive model size and high training and inference costs have hindered the widespread adoption of MLLMs in academia and industry. Studying efficient and lightweight MLLMs therefore holds enormous potential, especially for edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the current state of research on architectures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions.

Original language: English
Article number: 27
Journal: Visual Intelligence
Volume: 3
Issue number: 1
DOIs
State: Published - Dec 2025
Externally published: Yes

Keywords

  • Efficiency
  • Instruction tuning
  • Multi-modal large language model (MLLM)
  • Vision token compression
