ISSTA 2025
Wed 25 - Sat 28 June 2025 Trondheim, Norway
co-located with FSE 2025
Thu 26 Jun 2025 14:50 - 15:15 at Cosmos Hall - Code and Documentation Generation Chair(s): Ying Zou

Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric requires extensive unit tests and configured environments, incurs high labor costs, and is not suitable for evaluating LLM-generated text. Conventional metrics such as BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged: employing LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to mimic human assessment better than conventional metrics do, without relying on high-quality reference answers. Nevertheless, how closely they align with human judgment in SE tasks remains unexplored.
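The abstract's criticism of lexical metrics can be illustrated with a toy sketch (this is a simplified unigram-overlap score, not the full BLEU algorithm, and the code snippets are hypothetical): two semantically equivalent programs can receive a very low lexical-similarity score.

```python
# Toy illustration of a purely lexical metric (simplified; NOT full BLEU):
# it penalizes a rewrite that behaves identically to the reference.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (clipped)."""
    cand, ref = candidate.split(), Counter(reference.split())
    matched = 0
    for tok in cand:
        if ref[tok] > 0:
            ref[tok] -= 1
            matched += 1
    return matched / len(cand) if cand else 0.0

reference = "total = sum(values)"
candidate = "total = 0\nfor v in values: total += v"  # same behavior, different tokens

print(f"{unigram_precision(candidate, reference):.2f}")  # → 0.20
```

Only 2 of the candidate's 10 tokens match the reference, so the score is 0.20 even though both snippets compute the same sum; this is the kind of semantic blindness that motivates LLM-as-a-judge evaluation.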

In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we prompt these methods to evaluate each response and compare their scores with the human evaluation. The results indicate that output-based methods reach the highest Pearson correlations with human scores, 81.32 in code translation and 68.51 in code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, which scores 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluation in certain SE tasks.
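The alignment measurement described above can be sketched as computing a Pearson correlation between automatic judge scores and human scores over the same responses, then scaling by 100 as in the reported numbers. The scores below are hypothetical placeholders, not data from the paper.

```python
# Sketch of the human-alignment measurement: Pearson correlation between
# per-response automatic scores and human scores. All data is hypothetical.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [5, 3, 4, 2, 5, 1]  # hypothetical human ratings per response
judge = [5, 3, 5, 2, 4, 1]  # hypothetical LLM-as-a-judge ratings

print(f"{pearson(human, judge) * 100:.2f}")  # scaled x100, as in the abstract
```

A judge whose ratings rank responses the same way humans do scores near 100 on this scale; the 81.32 reported for code translation indicates strong (though imperfect) agreement.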

Thu 26 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

14:00 - 15:30
Code and Documentation Generation (Research Papers / Tool Demonstrations) at Cosmos Hall
Chair(s): Ying Zou Queen's University, Kingston, Ontario
14:00
25m
Talk
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation
Research Papers
Yingjie Fu Peking University, Bozhou Li Peking University, Linyi Li Simon Fraser University, Wentao Zhang Peking University, Tao Xie Peking University
DOI
14:25
25m
Talk
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Research Papers
Jiawei Guo University at Buffalo, SUNY, Haoran Yang Washington State University, Haipeng Cai University at Buffalo, SUNY
DOI
14:50
25m
Talk
Can LLMs replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering Tasks
Research Papers
Ruiqi Wang Harbin Institute of Technology, Shenzhen, Jiyu Guo Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Guodong Fan Shandong Agriculture and Engineering University, Chun Yong Chong Huawei, Xin Xia Zhejiang University
DOI Pre-print
15:15
15m
Demonstration
Code2API: A Tool for Generating Reusable APIs from Stack Overflow Code Snippets
Tool Demonstrations
Yubo Mai Zhejiang University, Zhipeng Gao Shanghai Institute for Advanced Study - Zhejiang University, Xing Hu Zhejiang University, Lingfeng Bao Zhejiang University, Jingyuan Chen, JianLing Sun Zhejiang University

Information for Participants
Thu 26 Jun 2025 14:00 - 15:30 at Cosmos Hall - Code and Documentation Generation Chair(s): Ying Zou
Info for room Cosmos Hall:

This is the main event hall of the Clarion Hotel, which will be used to host keynote talks and other plenary sessions. The FSE and ISSTA banquets will also take place in this room.

The room is just in front of the registration desk, on the other side of the main conference area. The two large doors with numbers “1” and “2” provide access to the Cosmos Hall.
