ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
In recent years, as academia and industry have paid increasing attention to applying large language models (LLMs) to code-related tasks, a growing number of large code models (LCMs) have been proposed, and corresponding evaluation benchmarks have continually emerged. Although existing benchmarks are useful for comparing different LCMs, they may not reflect how LCMs perform across diverse development scenarios. Specifically, they often evaluate model performance in only one type of scenario (e.g., code generation or code completion), whereas real development contexts are diverse and may involve multiple tasks such as code generation, code completion, API recommendation, and test case generation. In addition, the benchmark questions may not originate from actual development practice and therefore fail to capture the programming challenges developers face during development.
To address these issues, we propose ComplexCodeEval, a new benchmark for evaluating the performance of LCMs in various development scenarios. ComplexCodeEval includes 3,897 Java samples from 1,055 high-star GitHub repositories and 7,184 Python samples from 2,107 high-star repositories. Each sample in ComplexCodeEval contains multiple annotations (e.g., function signatures, docstrings, and reference APIs) to accommodate various downstream tasks. Furthermore, to better reflect diverse development scenarios, each sample’s repository must depend on at least one of the selected libraries (chosen based on popularity), and each sample must invoke at least one API from the selected library. Additionally, each sample carries multiple timestamps to avoid data leakage. Based on ComplexCodeEval, we evaluate nine LCMs across four tasks (i.e., code generation, code completion, API recommendation, and test case generation) to explore their performance in real development environments. We further conduct an in-depth analysis of the impact of context and data leakage on model performance. Our experimental results reveal several key findings. For instance, LCMs exhibit varying performance across different coding tasks, and rich contextual information can greatly enhance their performance. Moreover, evaluating on leaked data may overestimate model performance, yielding evaluation results that deviate from performance in practice.
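The abstract describes each sample as a richly annotated record (function signature, docstring, reference APIs, ground-truth code, timestamps) drawn from a repository that depends on a selected library. The following Python sketch is a hypothetical illustration of such a sample and of one simple way the timestamps could be used to reduce data-leakage risk; the field names and the filter_unleaked helper are assumptions for illustration, not the benchmark’s actual schema or tooling.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    # Hypothetical sketch of a ComplexCodeEval-style sample, based on the
    # annotations described in the abstract. Field names are illustrative,
    # not the benchmark's actual schema.
    @dataclass
    class BenchmarkSample:
        repo: str                   # high-star GitHub repository the sample comes from
        function_signature: str     # signature of the target function
        docstring: str              # natural-language description of the function
        reference_apis: List[str]   # APIs from the selected library that the function invokes
        reference_code: str         # ground-truth function body
        created_at: datetime        # when the function was first committed
        updated_at: datetime        # when the function was last modified

    def filter_unleaked(samples: List[BenchmarkSample],
                        training_cutoff: datetime) -> List[BenchmarkSample]:
        """Keep only samples created after a model's training-data cutoff,
        one possible use of the per-sample timestamps to avoid leakage."""
        return [s for s in samples if s.created_at > training_cutoff]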
Tue 29 Oct | Displayed time zone: Pacific Time (US & Canada)
15:30 - 16:30 | Code Generation 1 (Journal-first Papers / Research Papers / Industry Showcase) at Camellia | Chair(s): Denys Poshyvanyk (William & Mary)
15:30 (15m, Talk) | AACEGEN: Attention Guided Adversarial Code Example Generation for Deep Code Models | Research Papers | Zhong Li, Chong Zhang (Nanjing University), Minxue Pan (Nanjing University), Tian Zhang (Nanjing University), Xuandong Li (Nanjing University)
15:45 (15m, Talk) | Self-collaboration Code Generation via ChatGPT | Journal-first Papers
16:00 (15m, Talk) | Vehicle Domain-Specific Language: Unifying Modeling and Code Generation for Low-Code Automotive Development | Industry Showcase | Lei Liao (GAC R&D Center), Junjie Wang (Institute of Software at Chinese Academy of Sciences), Zhensheng Xu (GAC R&D Center), Fangwen Mu (Institute of Software, Chinese Academy of Sciences), Yukun Yang (GAC R&D Center)
16:15 (15m, Talk) | ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code | Research Papers | Jia Feng (University of Electronic Science and Technology of China), Jiachen Liu (Harbin Institute of Technology, Shenzhen), Cuiyun Gao (Harbin Institute of Technology), Chun Yong Chong (Huawei), Chaozheng Wang (The Chinese University of Hong Kong), Shan Gao (Huawei), Xin Xia (Huawei)