TY - GEN
T1 - CMM-Math
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Liu, Wentao
AU - Pan, Qianjun
AU - Zhang, Yi
AU - Liu, Zhuo
AU - Wu, Ji
AU - Zhou, Jie
AU - Zhou, Aimin
AU - Chen, Qin
AU - Jiang, Bo
AU - He, Liang
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
AB - Large language models (LLMs) have achieved promising results in mathematical reasoning, a foundational human intelligence skill. Most previous studies focus on improving or measuring the performance of LLMs on textual math datasets (e.g., MATH, GSM8K). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, comprising benchmark and training splits, to evaluate and enhance the mathematical reasoning of large multimodal models (LMMs). CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, analysis) with detailed solutions across 12 grade levels, from elementary to high school in China. A problem may contain multiple images, and visual context may appear in either the question or the answer options, which makes the dataset more challenging. Our comprehensive analysis reveals that state-of-the-art LMMs struggle on CMM-Math, underscoring the need for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) that handles problems with mixed inputs of multiple images and text segments. Math-LMM is trained in three stages: foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. Extensive experiments comparing our model with SOTA LMMs on three multimodal mathematical datasets show that it effectively improves math reasoning performance. We release the datasets on GitHub (https://github.com/ECNU-ICALK/EduChat-Math) and Hugging Face (https://huggingface.co/datasets/ecnu-icalk/cmm-math).
KW - benchmark
KW - Chinese
KW - large multimodal models
KW - mathematical reasoning
UR - https://www.scopus.com/pages/publications/105024070872
U2 - 10.1145/3746027.3758193
DO - 10.1145/3746027.3758193
M3 - Conference contribution
AN - SCOPUS:105024070872
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
SP - 12585
EP - 12591
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -