ASE 2024
Sun 27 October - Fri 1 November 2024 Sacramento, California, United States
Tue 29 Oct 2024 16:15 - 16:30 at Camellia - Code generation 1 Chair(s): Denys Poshyvanyk

In recent years, as academia and industry have devoted increasing attention to applying large language models (LLMs) to code-related tasks, a growing number of large code models (LCMs) have been proposed and corresponding evaluation benchmarks have continually emerged. Although existing evaluation benchmarks are helpful for comparing different LCMs, they may not reflect the performance of LCMs across diverse development scenarios. Specifically, they might evaluate model performance in only one type of scenario (e.g., code generation or code completion), whereas real development contexts are diverse and may involve multiple tasks such as code generation, code completion, API recommendation, and test function generation. Additionally, their questions may not originate from actual development practice, and thus fail to capture the programming challenges developers face during development.

To address these issues, we propose ComplexCodeEval, a new benchmark for evaluating the performance of LCMs in various development scenarios. ComplexCodeEval includes 3,897 Java samples from 1,055 high-star GitHub repositories and 7,184 Python samples from 2,107 high-star repositories. Each sample contains multiple annotations (e.g., function signatures, docstrings, and reference APIs) to accommodate various downstream tasks. Furthermore, to better reflect diverse development scenarios, each sample’s repository must depend on at least one library selected for its popularity, and each sample must invoke at least one API from that library. Additionally, each sample carries multiple timestamps to avoid data leakage. Based on ComplexCodeEval, we evaluate nine LCMs across four tasks (i.e., code generation, code completion, API recommendation, and test case generation) to explore their performance in realistic development environments. We further conduct an in-depth analysis of the impact of context and data leakage on model performance. Our experimental results reveal several key findings. For instance, LCMs exhibit varying performance across different coding tasks; rich contextual information can greatly enhance their performance; and using leaked data for evaluation may overestimate model performance, yielding evaluation outcomes that deviate from performance in practice.
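To make the sample structure concrete, below is a minimal Python sketch of the kind of per-sample metadata the abstract describes (function signature, docstring, reference APIs, and timestamps). All field names and example values are illustrative assumptions, not the benchmark's actual schema or data.

# Illustrative sketch only: field names are assumptions, not the official
# ComplexCodeEval schema. It mirrors the per-sample metadata the abstract
# describes: signature, docstring, reference APIs, and timestamps.
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkSample:
    """Hypothetical record for one ComplexCodeEval-style sample."""
    repo: str                   # source GitHub repository
    language: str               # "java" or "python"
    function_signature: str     # signature provided to the model as input
    docstring: str              # natural-language description of the function
    reference_apis: List[str] = field(default_factory=list)  # APIs from the selected library
    repo_created_at: str = ""   # timestamp used to reason about data leakage
    sample_updated_at: str = "" # timestamp of the sample's last modification

# Example usage with made-up values (not taken from the benchmark itself).
sample = BenchmarkSample(
    repo="example-org/example-repo",
    language="python",
    function_signature="def fetch_user(session, user_id):",
    docstring="Fetch a user record from the database by id.",
    reference_apis=["sqlalchemy.orm.Session.query"],
    repo_created_at="2021-03-14",
    sample_updated_at="2023-08-02",
)
print(sample.reference_apis)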

Tue 29 Oct

Displayed time zone: Pacific Time (US & Canada)

15:30 - 16:30
Code generation 1 (Journal-first Papers / Research Papers / Industry Showcase) at Camellia
Chair(s): Denys Poshyvanyk William & Mary
15:30
15m
Talk
AACEGEN: Attention Guided Adversarial Code Example Generation for Deep Code Models
Research Papers
Zhong Li, Chong Zhang Nanjing University, Minxue Pan Nanjing University, Tian Zhang Nanjing University, Xuandong Li Nanjing University
15:45
15m
Talk
Self-collaboration Code Generation via ChatGPT
Journal-first Papers
Yihong Dong Peking University, Xue Jiang, Zhi Jin Peking University, Ge Li Peking University
16:00
15m
Talk
Vehicle Domain-Specific Language: Unifying Modeling and Code Generation for Low-Code Automotive Development
Industry Showcase
Lei Liao GAC R&D Center, Junjie Wang Institute of Software at Chinese Academy of Sciences, Zhensheng Xu GAC R&D Center, Fangwen Mu Institute of Software, Chinese Academy of Sciences, Yukun Yang GAC R&D Center
16:15
15m
Talk
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
Research Papers
Jia Feng University of Electronic Science and Technology of China, Jiachen Liu Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Chun Yong Chong Huawei, Chaozheng Wang The Chinese University of Hong Kong, Shan Gao Huawei, Xin Xia Huawei