TY - JOUR
T1 - Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks
AU - Chen, Qian
AU - Yu, Chenyang
AU - Liu, Ruyan
AU - Zhang, Chi
AU - Wang, Yu
AU - Wang, Ke
AU - Su, Ting
AU - Wang, Linzhang
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/4/29
Y1 - 2024/4/29
N2 - While deep neural networks provide state-of-the-art solutions to a wide range of programming language tasks, their effectiveness in dealing with foundational program analysis tasks remains under explored. In this paper, we present an empirical study that evaluates four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in two such foundational tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias or must not alias; and (2) equivalence prediction, in which models predict whether or not two programs are semantically equivalent. At the core of this study is CodeSem, a dataset built upon the source code of real-world flagship software (e.g., Linux Kernel, GCC, MySQL) and manually validated for the two prediction tasks. Results show that all models are accurate in both prediction tasks, especially CuBERT with an accuracy of 89% and 84% in alias prediction and equivalence prediction, respectively. We also conduct a comprehensive, in-depth analysis of the results of all models in both tasks, concluding that deep learning models are generally capable of performing foundational tasks in program analysis even though in specific cases their weaknesses are also evident.
AB - While deep neural networks provide state-of-the-art solutions to a wide range of programming language tasks, their effectiveness in dealing with foundational program analysis tasks remains under explored. In this paper, we present an empirical study that evaluates four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in two such foundational tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias or must not alias; and (2) equivalence prediction, in which models predict whether or not two programs are semantically equivalent. At the core of this study is CodeSem, a dataset built upon the source code of real-world flagship software (e.g., Linux Kernel, GCC, MySQL) and manually validated for the two prediction tasks. Results show that all models are accurate in both prediction tasks, especially CuBERT with an accuracy of 89% and 84% in alias prediction and equivalence prediction, respectively. We also conduct a comprehensive, in-depth analysis of the results of all models in both tasks, concluding that deep learning models are generally capable of performing foundational tasks in program analysis even though in specific cases their weaknesses are also evident.
KW - Alias Analysis
KW - Deep Learning
KW - Equivalence Checking
UR - https://www.scopus.com/pages/publications/85195789435
U2 - 10.1145/3649829
DO - 10.1145/3649829
M3 - 文章
AN - SCOPUS:85195789435
SN - 2475-1421
VL - 8
JO - Proceedings of the ACM on Programming Languages
JF - Proceedings of the ACM on Programming Languages
IS - OOPSLA1
M1 - 112
ER -