ISSTA 2025
Wed 25 - Sat 28 June 2025 Trondheim, Norway
co-located with FSE 2025
Thu 26 Jun 2025 14:50 - 15:15 at Cosmos Hall - Code and Documentation Generation Chair(s): Ying Zou

Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric requires extensive unit tests and configured environments, incurs high labor costs, and is not suitable for evaluating LLM-generated text. Conventional metrics such as BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged: employing LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to mimic human assessment better than conventional metrics do, without relying on high-quality reference answers. Nevertheless, how closely they align with human judgment in SE tasks remains unexplored.
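The abstract's criticism of lexical metrics can be illustrated with a toy sketch (this is a simplified unigram-overlap score, not the full BLEU algorithm, and the code snippets are hypothetical): two semantically equivalent programs can receive a very low lexical-similarity score.

```python
# Toy illustration of a purely lexical metric (simplified; NOT full BLEU):
# it penalizes a rewrite that behaves identically to the reference.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (clipped)."""
    cand, ref = candidate.split(), Counter(reference.split())
    matched = 0
    for tok in cand:
        if ref[tok] > 0:
            ref[tok] -= 1
            matched += 1
    return matched / len(cand) if cand else 0.0

reference = "total = sum(values)"
candidate = "total = 0\nfor v in values: total += v"  # same behavior, different tokens

print(f"{unigram_precision(candidate, reference):.2f}")  # → 0.20
```

Only 2 of the candidate's 10 tokens match the reference, so the score is 0.20 even though both snippets compute the same sum; this is the kind of semantic blindness that motivates LLM-as-a-judge evaluation.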

In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we prompt these methods to evaluate each response and compare their scores with the human evaluation. The results indicate that output-based methods reach the highest Pearson correlations with human scores, 81.32 in code translation and 68.51 in code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, which scores 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluation in certain SE tasks.
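The alignment measurement described above can be sketched as computing a Pearson correlation between automatic judge scores and human scores over the same responses, then scaling by 100 as in the reported numbers. The scores below are hypothetical placeholders, not data from the paper.

```python
# Sketch of the human-alignment measurement: Pearson correlation between
# per-response automatic scores and human scores. All data is hypothetical.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [5, 3, 4, 2, 5, 1]  # hypothetical human ratings per response
judge = [5, 3, 5, 2, 4, 1]  # hypothetical LLM-as-a-judge ratings

print(f"{pearson(human, judge) * 100:.2f}")  # scaled x100, as in the abstract
```

A judge whose ratings rank responses the same way humans do scores near 100 on this scale; the 81.32 reported for code translation indicates strong (though imperfect) agreement.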

Thu 26 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

14:00 - 15:30
Code and Documentation Generation (Research Papers / Tool Demonstrations) at Cosmos Hall
Chair(s): Ying Zou Queen's University, Kingston, Ontario
14:00
25m
Talk
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation
Research Papers
Yingjie Fu Peking University, Bozhou Li Peking University, Linyi Li Simon Fraser University, Wentao Zhang Peking University, Tao Xie Peking University
DOI
14:25
25m
Talk
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Research Papers
Jiawei Guo University at Buffalo, SUNY, Haoran Yang Washington State University, Haipeng Cai University at Buffalo, SUNY
DOI
14:50
25m
Talk
Can LLMs replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering Tasks
Research Papers
Ruiqi Wang Harbin Institute of Technology, Shenzhen, Jiyu Guo Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Guodong Fan Shandong Agriculture and Engineering University, Chun Yong Chong Huawei, Xin Xia Zhejiang University
DOI Pre-print
15:15
15m
Demonstration
Code2API: A Tool for Generating Reusable APIs from Stack Overflow Code Snippets
Tool Demonstrations
Yubo Mai Zhejiang University, Zhipeng Gao Shanghai Institute for Advanced Study - Zhejiang University, Xing Hu Zhejiang University, Lingfeng Bao Zhejiang University, Jingyuan Chen, JianLing Sun Zhejiang University

Information for Participants
Thu 26 Jun 2025 14:00 - 15:30 at Cosmos Hall - Code and Documentation Generation Chair(s): Ying Zou
Info for room Cosmos Hall:

This is the main event hall of the Clarion Hotel, which will be used to host keynote talks and other plenary sessions. The FSE and ISSTA banquets will also take place in this room.

The room is just in front of the registration desk, on the other side of the main conference area. The two large doors with numbers “1” and “2” provide access to the Cosmos Hall.
