An Agent-based Evaluation Framework for Complex Code Generation (ASE 2025 - Research Papers)

Who

Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao

Track

ASE 2025 Research Papers

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

Use conference time zone: (GMT+09:00) SeoulSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Tue 18 Nov 2025 11:30 - 11:40 at Grand Hall 1 - Code Generation 2 Chair(s): Jia Li

Abstract

Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities. However, they generally evaluate the generated code based on static prompts, and tend to fail for complex code scenarios which typically involve multiple requirements and require more contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability.

To mitigate the limitations, we propose \textbf{CodeVisionary}, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: \textbf{(1) \textit{Requirement-guided multi-dimensional context distillation stage}}, which first formulates a detailed evaluation plan by decomposing task requirements, and then stepwise collects multi-dimensional contextual information for each requirement. \textbf{(2) \textit{Fine-grained scoring and summarization stage}}, which defines self-directed and negotiation-based actions, allowing multiple judges to comprehend complex code from fine-grained and diverse viewpoints, and reach a consensus through discussion. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark consisting of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that \framework achieves the best performance among three baselines for evaluating complex code generation, outperforming the best baseline with average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://anonymous.4open.science/r/CodeVisionary.

Xinchen Wang

Harbin Institute of Technology

Pengfei Gao

ByteDance

Chao Peng

ByteDance

China

Ruida Hu

Harbin Institute of Technology, Shenzhen

Cuiyun Gao