An Exploratory Evaluation of Large Language Models Using Empirical Software Engineering Tasks
In empirical software engineering (EMSE), many activities require human participation, such as data collection, processing, analysis, and comprehension. These processes are time-consuming and labor-intensive, and human participation may also introduce bias. With the rise of large language models (LLMs) such as ChatGPT, the potential of these models to enhance productivity has become apparent. However, the auxiliary capabilities and effectiveness of LLMs in EMSE tasks have rarely been explored. To fill this gap, we evaluate the performance of LLMs on EMSEbench, a set of scenarios involving human participation in EMSE tasks. We conduct replication experiments with four LLMs (ChatGPT4.0, ERNIE Bot4.0, Gemini3.0, and ChatGLM4.0), comparing their performance across seven scenarios collected from papers published in top SE venues. In the experiments, we apply three types of prompts, i.e., zero-shot, one-shot, and optimized one-shot. In addition, we leverage the concept of a multi-agent workflow to explore the performance improvements and limitations of LLMs. Our study summarizes six findings, which facilitate the understanding of the auxiliary capabilities of LLMs in EMSE tasks.
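As a minimal illustration of the three prompt types named in the abstract, the Python sketch below builds zero-shot, one-shot, and optimized one-shot prompts for a hypothetical EMSE task (commit-message classification). The task, demonstration example, and wording are illustrative assumptions only; they are not the actual prompts or EMSEbench scenarios used in the paper.

# Hypothetical sketch of zero-shot vs. one-shot vs. optimized one-shot
# prompting. The task and demonstration are assumptions, not the paper's
# prompts or EMSEbench scenarios.

TASK = "Classify the following commit message as 'bug fix' or 'not a bug fix'."

EXAMPLE = (
    "Commit message: Fix null pointer dereference in parser\n"
    "Label: bug fix"
)

def zero_shot(message: str) -> str:
    # Only the task instruction and the input; no demonstration.
    return f"{TASK}\n\nCommit message: {message}\nLabel:"

def one_shot(message: str) -> str:
    # One labeled demonstration is prepended before the input.
    return f"{TASK}\n\n{EXAMPLE}\n\nCommit message: {message}\nLabel:"

def optimized_one_shot(message: str) -> str:
    # Same demonstration, plus the kind of role and output-format
    # instructions typically added when a prompt is manually refined.
    header = ("You are an experienced software engineering researcher. "
              "Answer with exactly one label.")
    return f"{header}\n{TASK}\n\n{EXAMPLE}\n\nCommit message: {message}\nLabel:"

print(zero_shot("Refactor logging configuration"))

In a study of this kind, the same prompt would typically be sent unchanged to each of the four models, so that performance differences reflect the model rather than the prompt.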
Wed 24 Jul (times shown in Beijing time)
11:20 - 12:35 | Session 1: AI for Software Engineering (Research Track / Tool Demonstration Track / New Idea Track) at Main Conference Room. Chair(s): Yongqiang Tian (The Hong Kong University of Science and Technology)
11:20 (15m, Full-paper) | An Empirical Study on Code Search Pre-trained Models: Academic Progresses vs. Industry Requirements. Research Track
11:35 (15m, Full-paper) | CRABS-former: Cross-Architecture Binary Code Similarity Detection based on Transformer. Research Track. Yuhong Feng (Shenzhen University), Haoran Li (Shenzhen University), Yixuan Cao (Shenzhen University), Yufeng Wang (Shenzhen University), Haiyue Feng (College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China)
11:50 (15m, Full-paper) | On the Heterophily of Program Graphs: A Case Study of Graph-based Type Inference. Research Track. Senrong Xu, Jiamei Shen, Yunfang Li, Yuan Yao (Nanjing University), Ping Yu, Feng Xu (Nanjing University), Xiaoxing Ma (Nanjing University)
12:05 (15m, Full-paper) | An Exploratory Evaluation of Large Language Models Using Empirical Software Engineering Tasks. Research Track. Wenjun Liang (Nanjing University of Aeronautics and Astronautics, China), Guanping Xiao (Nanjing University of Aeronautics and Astronautics)
12:20 (15m, Full-paper) | LLM-Enhanced Theorem Proving with Term Explanation and Tactic Parameter Repair. Research Track. Xingpeng Liu, Hengzhu Liu, Xiaodong Yi, Ji Wang (School of Computer, National University of Defense Technology, China)