ASE 2025
Sun 16 - Thu 20 November 2025 Seoul, South Korea

This program is tentative and subject to change.

Tue 18 Nov 2025 11:20 - 11:30 at Vista - SE4AI & AI4SE 2

Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts.

In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team to produce a final correctness score through ensembling. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess, which span three popular SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 34.4% to 113.0% over existing automatic metrics. Furthermore, SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair tasks. These findings underscore SE-Jury’s potential as a scalable and reliable alternative to human evaluation in these SE tasks.
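The judge-and-team idea in the abstract can be illustrated with a small sketch. The abstract does not say how individual judges score an artifact or how the team is selected, so everything below is an assumption: each judge returns a score in [0, 1], a team's score is the plain average of its members' scores, and the team is chosen by exhaustively searching for the subset whose averaged scores correlate best (Pearson) with human scores on a small annotated sample. The names `select_team`, `ensemble_score`, and the two toy judges are hypothetical stand-ins for the LLM-based judges, not SE-Jury's actual implementation.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge type: (generated artifact, reference) -> score in [0, 1].
Judge = Callable[[str, str], float]


def exact_match(generated: str, reference: str) -> float:
    """Toy judge: 1.0 only if the two strings match after trimming."""
    return 1.0 if generated.strip() == reference.strip() else 0.0


def token_overlap(generated: str, reference: str) -> float:
    """Toy judge: Jaccard overlap of whitespace-separated tokens."""
    g, r = set(generated.split()), set(reference.split())
    return len(g & r) / len(g | r) if g | r else 1.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def ensemble_score(team: Sequence[str], judges: Dict[str, Judge],
                   generated: str, reference: str) -> float:
    """Average the scores of the judges named in `team` (assumed ensembling rule)."""
    return mean(judges[name](generated, reference) for name in team)


def select_team(judges: Dict[str, Judge],
                samples: List[Tuple[str, str]],
                human_scores: Sequence[float]) -> Tuple[Tuple[str, ...], float]:
    """Try every non-empty subset of judges and keep the one whose ensembled
    scores correlate best with the human scores (assumed selection rule)."""
    best_team: Tuple[str, ...] = ()
    best_corr = float("-inf")
    names = sorted(judges)
    for size in range(1, len(names) + 1):
        for team in combinations(names, size):
            preds = [ensemble_score(team, judges, g, r) for g, r in samples]
            corr = pearson(preds, human_scores)
            if corr > best_corr:
                best_team, best_corr = team, corr
    return best_team, best_corr
```

With five judges, as in the abstract, exhaustive search covers only 31 subsets, so a brute-force selection like this stays cheap; the selected team would then be reused to score unlabeled artifacts via `ensemble_score`.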

Tue 18 Nov

Displayed time zone: Seoul

11:00 - 12:30
SE4AI & AI4SE 2 (Research Papers) at Vista
11:00
10m
Talk
Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
Research Papers
Chenyan Liu Shanghai Jiao Tong University; National University of Singapore, Yun Lin Shanghai Jiao Tong University, Yuhuan Huang Shanghai Jiao Tong University, Jiaxin Chang Shanghai Jiao Tong University, Binhang Qi National University of Singapore, Bo Jiang Bytedance Network Technology, Zhiyong Huang National University of Singapore, Jin Song Dong National University of Singapore
11:10
10m
Talk
Coding-Fuse: Efficient Fusion of Code Pre‑Trained Models for Classification Tasks
Research Papers
Yu Zhao, Lina Gong Nanjing University of Aeronautics and Astronautics, Zhiqiu Huang Nanjing University of Aeronautics and Astronautics, Yuchen Jin Nanjing University of Aeronautics and Astronautics, Mingqiang Wei Nanjing University of Aeronautics and Astronautics
11:20
10m
Talk
SE-Jury: An LLM-as-Ensemble-Judge Metric for Narrowing the Gap with Human Evaluation in SE
Research Papers
Xin Zhou Singapore Management University, Singapore, Kisub Kim DGIST, Ting Zhang Monash University, Martin Weyssow Singapore Management University, Luis F. Gomes Carnegie Mellon University, Guang Yang, Kui Liu Huawei, Xin Xia Zhejiang University, David Lo Singapore Management University
11:30
10m
Talk
iKnow: an Intent-Guided Chatbot for Cloud Operations with Retrieval-Augmented Generation
Research Papers
Junjie Huang The Chinese University of Hong Kong, Yuedong Zhong Sun Yat-sen University, Guangba Yu The Chinese University of Hong Kong, Zhihan Jiang The Chinese University of Hong Kong, Minzhi Yan HCC Lab, Huawei Cloud Computing Technology Co., Ltd, Wenfei Luan HCC Lab, Huawei Cloud Computing Technology Co., Ltd, Tianyu Yang HCC Lab, Huawei Cloud Computing Technology Co., Ltd, Rui Ren Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Michael Lyu The Chinese University of Hong Kong
11:40
10m
Talk
Aligning LLMs to Fully Utilize the Cross-file Context in Repository-level Code Completion
Research Papers
Jia Li Tsinghua University, Hao Zhu Peking University, Huanyu Liu, Xianjie Shi Peking University, He Zong aiXcoder, Yihong Dong Peking University, Kechi Zhang Peking University, China, Siyuan Jiang, Zhi Jin Peking University, Ge Li Peking University
11:50
10m
Talk
From Sparse to Structured: A Diffusion-Enhanced and Feature-Aligned Framework for Coincidental Correctness Detection
Research Papers
Huan Xie Chongqing University, Chunyan Liu Chongqing University, Yan Lei Chongqing University, Zhenyu Wu School of Big Data & Software Engineering, Chongqing University, Jinping Wang Chongqing University
12:00
10m
Talk
Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents
Research Papers
Benjamin Rombaut Centre for Software Excellence, Huawei Canada, Sogol Masoumzadeh McGill University, Kirill Vasilevski Huawei Canada, Dayi Lin Centre for Software Excellence, Huawei Canada, Ahmed E. Hassan Queen’s University
12:10
10m
Talk
Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories
Research Papers
Islem BOUZENIA University of Stuttgart, Michael Pradel CISPA Helmholtz Center for Information Security
12:20
10m
Talk
Triangle: Empowering Incident Triage with Multi-Agent
Research Papers
Zhaoyang Yu Tsinghua University, Aoyang Fang Chinese University of Hong Kong, Shenzhen, Minghua Ma Microsoft, Jaskaran Singh Walia Microsoft, Chaoyun Zhang Microsoft, Shu Chi Tsinghua University, Ze Li Microsoft Azure, Murali Chintalapati Microsoft Azure, Xuchao Zhang Microsoft, Rujia Wang Microsoft, Chetan Bansal Microsoft Research, Saravan Rajmohan Microsoft, Qingwei Lin Microsoft, Shenglin Zhang Nankai University, Dan Pei Tsinghua University, Pinjia He Chinese University of Hong Kong, Shenzhen