LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation (ICSE 2025 - Journal-first Papers)

Who

Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Shuvendu K. Lahiri

Track

ICSE 2025 Journal-first Papers

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Wed 30 Apr 2025 15:30 - 16:00 at Canada Hall 3 Poster Area - Wed Afternoon Break Posters 15:30-16:00
Thu 1 May 2025 13:30 - 14:00 at Canada Hall 3 Poster Area - Thu Lunch Posters 13:30-14:00
Fri 2 May 2025 12:15 - 12:22 at Canada Hall 1 and 2 - AI for SE 3 Chair(s): Ying Zou

Abstract

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two python datasets, using an idealized proxy for a user feedback. We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.

Link to Publication

https://arxiv.org/pdf/2404.10100

Sarah Fakhoury

Microsoft Research

United States

Aaditya Naik

University of Pennsylvania

United States

Georgios Sakkas

University of California at San Diego

United States

Saikat Chakraborty

Microsoft Research

United States

Shuvendu K. Lahiri

Microsoft Research

United States

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Wed 30 Apr
Displayed time zone: Eastern Time (US & Canada) change

15:30 - 16:00	Wed Afternoon Break Posters 15:30-16:00Journal-first Papers / SE In Practice (SEIP) / Research Track at Canada Hall 3 Poster Area

15:30 30m Poster		Non-Autoregressive Line-Level Code Completion Journal-first Papers Fang Liu Beihang University, Zhiyi Fu Peking University, Ge Li Peking University, Zhi Jin Peking University, Hui Liu Beijing Institute of Technology, Yiyang Hao Silicon Heart Tech Co., Li Zhang Beihang University
15:30 30m Poster		FlatD: Protecting Deep Neural Network Program from Reversing Attacks SE In Practice (SEIP) Jinquan Zhang The Pennsylvania State University, Zihao Wang Penn State University, Pei Wang Independent Researcher, Rui Zhong Palo Alto Networks, Dinghao Wu Pennsylvania State University
15:30 30m Talk		Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-PracticeSE for AI Journal-first Papers Bentley Oakes Polytechnique Montréal, Michalis Famelis Université de Montréal, Houari Sahraoui DIRO, Université de Montréal DOI Pre-print File Attached
15:30 30m Poster		Predicting the First Response Latency of Maintainers and Contributors in Pull Requests Journal-first Papers SayedHassan Khatoonabadi Concordia University, Montreal, Ahmad Abdellatif University of Calgary, Diego Elias Costa Concordia University, Canada, Emad Shihab Concordia University, Montreal
15:30 30m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
15:30 30m Poster		RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code Research Track Pantazis Deligiannis Microsoft Research, Akash Lal Microsoft Research, Nikita Mehrotra Microsoft Research, Rishi Poddar Microsoft Research, Aseem Rastogi Microsoft Research
15:30 30m Talk		QuanTest: Entanglement-Guided Testing of Quantum Neural Network SystemsQuantum Journal-first Papers Jinjing Shi Central South University, Zimeng Xiao Central South University, Heyuan Shi Central South University, Yu Jiang Tsinghua University, Xuelong LI China Telecom Link to publication

Thu 1 May
Displayed time zone: Eastern Time (US & Canada) change

13:30 - 14:00	Thu Lunch Posters 13:30-14:00Journal-first Papers / New Ideas and Emerging Results (NIER) / Research Track at Canada Hall 3 Poster Area

13:30 30m Poster		Non-Autoregressive Line-Level Code Completion Journal-first Papers Fang Liu Beihang University, Zhiyi Fu Peking University, Ge Li Peking University, Zhi Jin Peking University, Hui Liu Beijing Institute of Technology, Yiyang Hao Silicon Heart Tech Co., Li Zhang Beihang University
13:30 30m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
13:30 30m Talk		SusDevOps: Promoting Sustainability to a First Principle in Software Delivery New Ideas and Emerging Results (NIER) Istvan David McMaster University / McMaster Centre for Software Certification (McSCert)
13:30 30m Poster		Predicting the First Response Latency of Maintainers and Contributors in Pull Requests Journal-first Papers SayedHassan Khatoonabadi Concordia University, Montreal, Ahmad Abdellatif University of Calgary, Diego Elias Costa Concordia University, Canada, Emad Shihab Concordia University, Montreal
13:30 30m Poster		RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code Research Track Pantazis Deligiannis Microsoft Research, Akash Lal Microsoft Research, Nikita Mehrotra Microsoft Research, Rishi Poddar Microsoft Research, Aseem Rastogi Microsoft Research
13:30 30m Talk		Relevant information in TDD experiment reporting Journal-first Papers Fernando Uyaguari Instituto Superior Tecnológico Wissen, Silvia Teresita Acuña Castillo Universidad Autónoma de Madrid, John W. Castro Universidad de Atacama, Davide Fucci Blekinge Institute of Technology, Oscar Dieste Universidad Politécnica de Madrid, Sira Vegas Universidad Politecnica de Madrid

Fri 2 May
Displayed time zone: Eastern Time (US & Canada) change

11:00 - 12:30	AI for SE 3New Ideas and Emerging Results (NIER) / Journal-first Papers / Research Track / SE In Practice (SEIP) at Canada Hall 1 and 2 Chair(s): Ying Zou Queen's University, Kingston, Ontario

11:00 15m Talk		A First Look at Conventional Commits Classification Research Track Qunhong Zeng Beijing Institute of Technology, Yuxia Zhang Beijing Institute of Technology, Zhiqing Qiu Beijing Institute of Technology, Hui Liu Beijing Institute of Technology
11:15 15m Talk		ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples Research Track Chunhao Dong Beijing Institute of Technology, Yanjie Jiang Peking University, Yuxia Zhang Beijing Institute of Technology, Yang Zhang Hebei University of Science and Technology, Hui Liu Beijing Institute of Technology
11:30 15m Talk		SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing Research Track Wenchao Gu The Chinese University of Hong Kong, Ensheng Shi Xi’an Jiaotong University, Yanlin Wang Sun Yat-sen University, Lun Du Microsoft Research, Shi Han Microsoft Research, Hongyu Zhang Chongqing University, Dongmei Zhang Microsoft Research, Michael Lyu The Chinese University of Hong Kong
11:45 15m Talk		UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation New Ideas and Emerging Results (NIER) Liangying Shao School of Informatics, Xiamen University, China, Yanfu Yan William & Mary, Denys Poshyvanyk William & Mary, Jinsong Su School of Informatics, Xiamen University, China
12:00 15m Talk		How is Google using AI for internal code migrations? SE In Practice (SEIP) Stoyan Nikolov Google, Inc., Daniele Codecasa Google, Inc., Anna Sjovall Google, Inc., Maxim Tabachnyk Google, Siddharth Taneja Google, Inc., Celal Ziftci Google, Satish Chandra Google, Inc
12:15 7m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
12:22 7m Talk		The impact of Concept drift and Data leakage on Log Level Prediction Models Journal-first Papers Youssef Esseddiq Ouatiti Queen's university, Mohammed Sayagh ETS Montreal, University of Quebec, Noureddine Kerzazi Ensias-Rabat, Bram Adams Queen's University, Ahmed E. Hassan Queen’s University, Youssef Esseddiq Ouatiti Queen's university