
Code LLMs Still Fall Short of Top Programmers: Evaluating Algorithmic Code Generation Through Computational Thinking

  • Shisong Chen
  • Ziyu Zhou
  • Yicong Zhao
  • Chengyi Yang
  • Zhixu Li*
  • Yanghua Xiao
  • Xin Lin
  • Xiaojun Meng
  • Jiansheng Wei
  • Kuien Liu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Evaluating the coding capabilities of models through algorithmic code generation is challenging, as it requires deep problem understanding and complex algorithm design. Current benchmarks suffer from a narrow focus on final execution results (such as pass@k), neglecting the crucial reasoning and problem-solving processes inherent in code generation. To address this limitation, we introduce a multi-phase algorithmic code generation benchmark, MUPA, structured around human computational thinking. MUPA dissects the evaluation into four distinct phases: example understanding, algorithm selection, solution description, and code generation. This framework facilitates a comprehensive assessment by providing insights into the model's intermediate problem-solving steps, rather than just the final code. We manually curated 197 high-quality competitive programming problems from Codeforces. Utilizing an LLM-as-a-judge paradigm with specialized prompts, our rigorous evaluation of several existing code generation LLMs reveals significant challenges across the board. Notably, we establish a positive correlation across phases: proficiency in an earlier phase directly impacts performance in subsequent phases, underscoring the interdependency of these algorithmic skills. The benchmark is publicly available at https://github.com/cheniison/MUPA.
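For context, the pass@k metric the abstract critiques is conventionally computed with the unbiased estimator introduced by Chen et al. (2021) for HumanEval: given n generated samples of which c pass all tests, it estimates the probability that at least one of k randomly drawn samples is correct. This sketch illustrates that standard metric only, not MUPA's own phase-wise scoring:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that pass all test cases
    k: budget of samples considered

    Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct,
    computed as 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5; metrics like this summarize only the final executable output, which is precisely the limitation the multi-phase evaluation above is designed to address.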

Original language: English
Title of host publication: WSDM 2026 - Proceedings of the 19th ACM International Conference on Web Search and Data Mining
Publisher: Association for Computing Machinery, Inc
Pages: 79-88
Number of pages: 10
ISBN (Electronic): 9798400722929
State: Published - 21 Feb 2026
Event: 19th ACM International Conference on Web Search and Data Mining, WSDM 2026 - Boise, United States
Duration: 22 Feb 2026 - 26 Feb 2026

Publication series

Name: WSDM 2026 - Proceedings of the 19th ACM International Conference on Web Search and Data Mining

Conference

Conference: 19th ACM International Conference on Web Search and Data Mining, WSDM 2026
Country/Territory: United States
City: Boise
Period: 22/02/26 - 26/02/26

Keywords

  • benchmark
  • code generation
  • large language model
