TY - GEN
T1 - Code LLMs Still Fall Short of Top Programmers
T2 - 19th ACM International Conference on Web Search and Data Mining, WSDM 2026
AU - Chen, Shisong
AU - Zhou, Ziyu
AU - Zhao, Yicong
AU - Yang, Chengyi
AU - Li, Zhixu
AU - Xiao, Yanghua
AU - Lin, Xin
AU - Meng, Xiaojun
AU - Wei, Jiansheng
AU - Liu, Kuien
N1 - Publisher Copyright:
© 2026 Owner/Author.
PY - 2026/2/21
Y1 - 2026/2/21
N2 - Evaluating the coding capabilities of models through algorithmic code generation is challenging, as it requires deep problem understanding and complex algorithm design. Current benchmarks focus narrowly on final execution results (such as pass@k), neglecting the crucial reasoning and problem-solving processes inherent in code generation. To address this limitation, we introduce MUPA, a multi-phase algorithmic code generation benchmark structured around human computational thinking. MUPA dissects the evaluation into four distinct phases: example understanding, algorithm selection, solution description, and code generation. This framework enables a comprehensive assessment by providing insight into a model's intermediate problem-solving steps rather than just its final code. We manually curated 197 high-quality competitive programming problems from Codeforces. Using an LLM-as-a-judge paradigm with specialized prompts, our rigorous evaluation of several existing code generation LLMs reveals significant challenges across the board. Notably, we establish a positive correlation across the phases, indicating that proficiency in an earlier phase directly impacts performance in subsequent phases and underscoring the interdependency of these algorithmic skills. The benchmark is publicly available at https://github.com/cheniison/MUPA.
AB - Evaluating the coding capabilities of models through algorithmic code generation is challenging, as it requires deep problem understanding and complex algorithm design. Current benchmarks focus narrowly on final execution results (such as pass@k), neglecting the crucial reasoning and problem-solving processes inherent in code generation. To address this limitation, we introduce MUPA, a multi-phase algorithmic code generation benchmark structured around human computational thinking. MUPA dissects the evaluation into four distinct phases: example understanding, algorithm selection, solution description, and code generation. This framework enables a comprehensive assessment by providing insight into a model's intermediate problem-solving steps rather than just its final code. We manually curated 197 high-quality competitive programming problems from Codeforces. Using an LLM-as-a-judge paradigm with specialized prompts, our rigorous evaluation of several existing code generation LLMs reveals significant challenges across the board. Notably, we establish a positive correlation across the phases, indicating that proficiency in an earlier phase directly impacts performance in subsequent phases and underscoring the interdependency of these algorithmic skills. The benchmark is publicly available at https://github.com/cheniison/MUPA.
KW - benchmark
KW - code generation
KW - large language model
UR - https://www.scopus.com/pages/publications/105033158384
U2 - 10.1145/3773966.3778008
DO - 10.1145/3773966.3778008
M3 - Conference contribution
AN - SCOPUS:105033158384
T3 - WSDM 2026 - Proceedings of the 19th ACM International Conference on Web Search and Data Mining
SP - 79
EP - 88
BT - WSDM 2026 - Proceedings of the 19th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery, Inc
Y2 - 22 February 2026 through 26 February 2026
ER -