ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil
Tue 14 Apr 2026 17:30 - 17:35 at Oceania I - Agents (Virtual Session) Chair(s): Kexin Pei

The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves running a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding tasks, fixed benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a benchmark generation process that evolves the benchmarks as requirements change and supports robust evaluation of evolving AI agents. We instantiate this approach in a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents in which developers express high-level intent, and uses state-of-the-art LLMs to generate benchmarks from just a small number of such documents. Overall, this process results in a maintainable evaluation framework, enabling rapid feedback on agent performance and facilitating targeted improvements.

Tue 14 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:40 - 17:40
Agents (Virtual Session) LLM4Code at Oceania I
Chair(s): Kexin Pei The University of Chicago

Zoom Link: https://us06web.zoom.us/j/89846180915

16:40
10m
Talk
MAsFL: Data-Secure, Efficient and Accurate Fault Localization with Multi-Agent Small Language Models
LLM4Code
PHAM DUC DUONG National Defense Academy of Japan, HIROSHI SATO National Defense Academy of Japan, MASAO KUBO National Defense Academy of Japan
16:50
10m
Talk
Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures (Virtual Attendance)
LLM4Code
Amirkia Rafiei Oskooei Yildiz Technical University / Intellica, S. Selcan Yukcu Intellica Business Intelligence, Mehmet Cevheri Bozoglan Intellica Business Intelligence, Mehmet S. Aktas Yildiz Technical University
17:00
10m
Talk
RAG Against the Machine: Zero-Shot Software Vulnerabilities Classification using LLMs
LLM4Code
Edvin Nordqvist KTH Royal Institute of Technology, Changjie Wang KTH Royal Institute of Technology, Simone Ferlin Red Hat, Mariano Scazzariello RISE Research Institutes of Sweden, Marco Chiesa KTH Royal Institute of Technology
17:10
10m
Talk
Learning Functional Equivalence via Supervised Contrastive Code-Problem Alignment
LLM4Code
Siu Wun Cheung Lawrence Livermore National Laboratory, Harshitha Menon Lawrence Livermore National Laboratory
17:20
5m
Talk
Towards LLM-guided Semantic Validation of Autonomous Driving Safety Policies
LLM4Code
Qingzhao Zhang University of Arizona, Morley Mao University of Michigan
17:25
5m
Talk
ContextPilot: Code Context Engineering with Memory-Augmented Exploration Agents
LLM4Code
Shuzheng Gao Chinese University of Hong Kong, Chaozheng Wang The Chinese University of Hong Kong, Shuqing Li The Chinese University of Hong Kong, Yun Peng The Chinese University of Hong Kong, Michael Lyu The Chinese University of Hong Kong
17:30
5m
Talk
Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents
LLM4Code
Divyanshu Saxena UT Austin, Rishikesh Maurya Microsoft, Gagan Somashekar Microsoft, Shachee Mishra Gupta Microsoft, Chetan Bansal Microsoft Research, Aditya Akella University of Texas at Austin