HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation
To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to account for the dynamic evolution of software projects over time; we refer to these as evolution-ignored settings. This in turn leads to inaccurate evaluation of LLMs’ performance. In this paper, we conduct an empirical study to better understand LLMs’ code generation performance in settings that reflect the evolving nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo by dependency level to analyze more comprehensively how well models generate functions at different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. Our experimental study yields several important findings. For example, we find that previous evolution-ignored evaluation methods inflate LLMs’ measured performance, with overestimations ranging from 10.0% to 61.1% under different context acquisition methods, compared to the evolution-aware evaluation approach. Based on these findings, we give actionable suggestions for more realistic evaluation of LLMs on code generation. We also build a shared evolution-aware code generation toolbox to facilitate future research. The replication package, including source code and datasets, is anonymously available at https://anonymous.4open.science/r/HumanEvo/.
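To make the distinction concrete, the minimal sketch below (not the authors' tool) contrasts how context could be acquired in the two settings: the evolution-ignored setting draws context from the project's latest revision, whereas the evolution-aware setting checks out the project as it was when the target function was originally introduced, before running execution-based evaluation against that snapshot. All paths, commit hashes, and helper names (gather_context, build_prompt) are hypothetical placeholders.

```python
import subprocess
from pathlib import Path


def checkout(repo_path: Path, revision: str) -> None:
    """Place the working tree at the requested revision of the project."""
    subprocess.run(["git", "-C", str(repo_path), "checkout", revision], check=True)


def gather_context(repo_path: Path, max_files: int = 5) -> str:
    """Toy context retrieval: concatenate a few source files from the snapshot.
    (Real benchmarks use more sophisticated context acquisition methods.)"""
    files = sorted(repo_path.rglob("*.py"))[:max_files]
    return "\n\n".join(f.read_text(errors="ignore") for f in files)


def build_prompt(repo_path: Path, revision: str, target_signature: str) -> str:
    """Check out a project snapshot and build an LLM prompt from its context."""
    checkout(repo_path, revision)
    context = gather_context(repo_path)
    return f"{context}\n\n# Complete the following function:\n{target_signature}"


if __name__ == "__main__":
    repo = Path("path/to/project")          # hypothetical local clone
    introducing_commit = "abc1234"          # commit that originally added the function
    signature = "def parse_config(path):"   # function the LLM must generate

    # Evolution-ignored setting: context from the latest project version,
    # which may include code written after the target function existed.
    prompt_latest = build_prompt(repo, "HEAD", signature)

    # Evolution-aware setting: context from the project as it was when the
    # function was introduced, i.e. the parent of the introducing commit.
    prompt_historical = build_prompt(repo, f"{introducing_commit}~1", signature)

    # In both settings, the generated code is then judged by running the
    # project's tests against the corresponding snapshot (execution-based evaluation).
```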
Thu 1 May (displayed time zone: Eastern Time, US & Canada)
11:00 - 12:30 | AI for Analysis 3 | SE In Practice (SEIP) / Research Track | Room 212 | Chair(s): Gias Uddin (York University, Canada)

11:00 (15m Talk) | COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge | Research Track
  Yichen Li (The Chinese University of Hong Kong), Yulun Wu (The Chinese University of Hong Kong), Jinyang Liu (Chinese University of Hong Kong), Zhihan Jiang (The Chinese University of Hong Kong), Zhuangbin Chen (Sun Yat-sen University), Guangba Yu (The Chinese University of Hong Kong), Michael Lyu (The Chinese University of Hong Kong)

11:15 (15m Talk) | Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding | Research Track

11:30 (15m Talk) | HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation | Research Track
  Dewu Zheng (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Ensheng Shi (Xi’an Jiaotong University), Ruikai Zhang (Huawei Cloud Computing Technologies), Yuchi Ma (Huawei Cloud Computing Technologies), Hongyu Zhang (Chongqing University), Zibin Zheng (Sun Yat-sen University)

11:45 (15m Talk) | SEMANTIC CODE FINDER: An Efficient Semantic Search Framework for Large-Scale Codebases | SE In Practice (SEIP)
  Daeha Ryu (Innovation Center, Samsung Electronics), Seokjun Ko (Samsung Electronics Co.), Eunbi Jang (Innovation Center, Samsung Electronics), Jinyoung Park (Innovation Center, Samsung Electronics), Myunggwan Kim (Innovation Center, Samsung Electronics), Changseo Park (Innovation Center, Samsung Electronics)

12:00 (15m Talk) | Time to Retrain? Detecting Concept Drifts in Machine Learning Systems | SE In Practice (SEIP)
  Tri Minh-Triet Pham (Concordia University), Karthikeyan Premkumar (Ericsson), Mohamed Naili (Ericsson), Jinqiu Yang (Concordia University)

12:15 (15m Talk) | UML Sequence Diagram Generation: A Multi-Model, Multi-Domain Evaluation | SE In Practice (SEIP)