ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
In recent years, as academia and industry have paid increasing attention to applying large language models (LLMs) to code-related tasks, a growing number of large code models (LCMs) have been proposed, and corresponding evaluation benchmarks have continually emerged. Although existing benchmarks are useful for comparing different LCMs, they may not reflect how LCMs perform across diverse development scenarios. Specifically, they often evaluate model performance in only one type of scenario (e.g., code generation or code completion), whereas real development contexts are diverse and may involve multiple tasks such as code generation, code completion, API recommendation, and test case generation. In addition, the benchmark questions may not originate from actual development practice and therefore fail to capture the programming challenges developers face during development.
To address these issues, we propose ComplexCodeEval, a new benchmark for evaluating the performance of LCMs in various development scenarios. ComplexCodeEval includes 3,897 Java samples from 1,055 high-star GitHub repositories and 7,184 Python samples from 2,107 high-star repositories. Each sample in ComplexCodeEval contains multiple annotations (e.g., function signatures, docstrings, and reference APIs) to accommodate various downstream tasks. Furthermore, to better reflect diverse development scenarios, each sample’s repository must depend on at least one of the selected libraries (chosen based on popularity), and each sample must invoke at least one API from the selected library. Additionally, each sample carries multiple timestamps to avoid data leakage. Based on ComplexCodeEval, we evaluate nine LCMs across four tasks (i.e., code generation, code completion, API recommendation, and test case generation) to explore their performance in real development environments. We further conduct an in-depth analysis of the impact of context and data leakage on model performance. Our experimental results reveal several key findings. For instance, LCMs exhibit varying performance across different coding tasks, and rich contextual information can greatly enhance their performance. Moreover, evaluating on leaked data may overestimate model performance, yielding evaluation results that deviate from performance in practice.
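The abstract describes each sample as a richly annotated record (function signature, docstring, reference APIs, ground-truth code, timestamps) drawn from a repository that depends on a selected library. The following Python sketch is a hypothetical illustration of such a sample and of one simple way the timestamps could be used to reduce data-leakage risk; the field names and the filter_unleaked helper are assumptions for illustration, not the benchmark’s actual schema or tooling.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    # Hypothetical sketch of a ComplexCodeEval-style sample, based on the
    # annotations described in the abstract. Field names are illustrative,
    # not the benchmark's actual schema.
    @dataclass
    class BenchmarkSample:
        repo: str                   # high-star GitHub repository the sample comes from
        function_signature: str     # signature of the target function
        docstring: str              # natural-language description of the function
        reference_apis: List[str]   # APIs from the selected library that the function invokes
        reference_code: str         # ground-truth function body
        created_at: datetime        # when the function was first committed
        updated_at: datetime        # when the function was last modified

    def filter_unleaked(samples: List[BenchmarkSample],
                        training_cutoff: datetime) -> List[BenchmarkSample]:
        """Keep only samples created after a model's training-data cutoff,
        one possible use of the per-sample timestamps to avoid leakage."""
        return [s for s in samples if s.created_at > training_cutoff]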
Tue 29 Oct | Displayed time zone: Pacific Time (US & Canada)
15:30 - 16:30 | Code Generation 1 (Journal-first Papers / Research Papers / Industry Showcase) at Camellia | Chair(s): Denys Poshyvanyk (William & Mary)
15:30 (15m, Talk) | AACEGEN: Attention Guided Adversarial Code Example Generation for Deep Code Models | Research Papers | Zhong Li, Chong Zhang (Nanjing University), Minxue Pan (Nanjing University), Tian Zhang (Nanjing University), Xuandong Li (Nanjing University)
15:45 (15m, Talk) | Self-collaboration Code Generation via ChatGPT | Journal-first Papers
16:00 (15m, Talk) | Vehicle Domain-Specific Language: Unifying Modeling and Code Generation for Low-Code Automotive Development | Industry Showcase | Lei Liao (GAC R&D Center), Junjie Wang (Institute of Software at Chinese Academy of Sciences), Zhensheng Xu (GAC R&D Center), Fangwen Mu (Institute of Software, Chinese Academy of Sciences), Yukun Yang (GAC R&D Center)
16:15 (15m, Talk) | ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code | Research Papers | Jia Feng (University of Electronic Science and Technology of China), Jiachen Liu (Harbin Institute of Technology, Shenzhen), Cuiyun Gao (Harbin Institute of Technology), Chun Yong Chong (Huawei), Chaozheng Wang (The Chinese University of Hong Kong), Shan Gao (Huawei), Xin Xia (Huawei)