FSE 2025
Mon 23 - Fri 27 June 2025 Trondheim, Norway
co-located with ISSTA 2025
Mon 23 Jun 2025 16:40 - 17:00 at Andromeda (Repairs). Chair(s): Michael Pradel

Large Language Models (LLMs) have achieved remarkable success in various applications, particularly in code-related tasks such as code generation and program repair, setting new performance benchmarks. However, the extensive use of large training corpora raises concerns about whether these achievements stem from genuine understanding or mere memorization of training data, a question often overlooked in current research. This paper aims to study the memorization issue within LLM-based program repair by investigating whether the correct patches generated by LLMs are the result of memorization. The key challenge lies in the absence of ground truth for confirming memorization, leading to various ad-hoc methods designed for its detection. To address this challenge, we first propose a general framework that formalizes memorization detection as a general hypothesis testing problem, where existing approaches can be unified by defining a low-probability event under the null hypothesis that the data is not memorized. The occurrence of such an event leads to the rejection of the null hypothesis, indicating potential memorization.
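To make this framing concrete, the following minimal Python sketch shows one way such a hypothesis test could be instantiated. The MemorizationTest wrapper, the normalize helper, and the alpha bound are illustrative assumptions for exposition only, not the authors' implementation or reported thresholds.

    # Minimal sketch (assumed names, not the paper's code) of the hypothesis-testing
    # view: a detector defines an event that should be rare under the null hypothesis
    # "this sample was NOT memorized", and flags potential memorization when that
    # event occurs for a generated patch.
    from dataclasses import dataclass
    from typing import Callable


    @dataclass
    class MemorizationTest:
        event: Callable[[str, str], bool]  # True when the low-probability event is observed
        alpha: float                       # assumed upper bound on P(event | not memorized)

        def flags_memorization(self, generated_patch: str, ground_truth_patch: str) -> bool:
            """Reject the null hypothesis (flag potential memorization) iff the event occurs."""
            return self.event(generated_patch, ground_truth_patch)


    def normalize(code: str) -> str:
        """Crude whitespace normalization so pure formatting differences do not count as a match."""
        return " ".join(code.split())


    # One possible event: producing the ground-truth patch verbatim should be a
    # low-probability outcome if the patch was never seen during training.
    exact_match_test = MemorizationTest(
        event=lambda generated, ground_truth: normalize(generated) == normalize(ground_truth),
        alpha=0.01,  # illustrative bound, not a value from the paper
    )

    if __name__ == "__main__":
        generated = "if (x == null) { return 0; }"
        ground_truth = "if (x == null)  { return 0; }"
        print(exact_match_test.flags_memorization(generated, ground_truth))  # True -> potential memorization

Under this view, a different detector is obtained simply by swapping in a different event function, which is how the framework can unify existing approaches.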

Based on this framework, we design two specific methods (i.e., low-probability events) to detect potential memorization: 1) basic ground-truth matching, and 2) reassessment after substantial code mutation. We investigate the memorization issue in LLM-based program repair using two datasets: Defects4J, a widely used benchmark that is likely included in the training data, and GitBug-Java, a new dataset that is unlikely to be part of the training data. Our findings reveal that a significant portion of correct patches exactly match the ground truths in Defects4J (e.g., 78.83% and 87.42% on GPT-3.5 and CodeLlama-7b, respectively). Moreover, even after significant modifications to the buggy code, where the original repairs should not be generated, a considerable percentage of bugs (e.g., 81.82% on GPT-3.5 and 88.24% on CodeLlama-7b) continue to be fixed exactly as in the original bug fixes, indicating a high likelihood of memorization. Furthermore, we evaluate existing memorization detection methods and demonstrate their ineffectiveness in this context (e.g., most AUROCs are below 0.5). The theoretical analysis under our hypothesis testing framework shows that their defined events may not meet the requirements for being low-probability. The study highlights the critical need for more robust and rigorous evaluations in LLM-based software engineering research, ensuring a clear distinction between true problem-solving capabilities and mere memorization.
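The second detection method (reassessment after substantial code mutation) can be sketched as below. Here mutate_identifiers is a deliberately simplistic stand-in for the mutations used in the paper, and query_llm_for_patch is a hypothetical model wrapper, not a real API.

    # Sketch of the mutation-based event (assumed helpers, not the paper's code):
    # heavily mutate the buggy code, re-run the repair model on the mutated version,
    # and flag potential memorization if the model still reproduces the original
    # ground-truth fix, which the mutated input should no longer support.
    import re
    from typing import Callable


    def mutate_identifiers(buggy_code: str) -> str:
        """Rename identifiers to opaque names while keeping a few Java keywords intact.
        A deliberately simplistic stand-in for substantial code mutation."""
        keywords = {"if", "else", "for", "while", "return", "null", "new", "int",
                    "void", "public", "private", "static", "class", "true", "false"}
        renames: dict[str, str] = {}

        def rename(match: re.Match) -> str:
            token = match.group(0)
            if token in keywords:
                return token
            return renames.setdefault(token, f"v{len(renames)}")

        return re.sub(r"\b[A-Za-z_][A-Za-z_0-9]*\b", rename, buggy_code)


    def mutation_event(repair_model: Callable[[str], str],
                       buggy_code: str,
                       ground_truth_patch: str) -> bool:
        """Low-probability event: the patch generated for the *mutated* bug still
        matches the original ground-truth fix (modulo whitespace)."""
        normalize = lambda s: " ".join(s.split())
        patch_for_mutated_bug = repair_model(mutate_identifiers(buggy_code))
        return normalize(patch_for_mutated_bug) == normalize(ground_truth_patch)


    # Usage with a hypothetical model wrapper (not a real API):
    #   suspicious = mutation_event(query_llm_for_patch, buggy_snippet, ground_truth_fix)

For the comparison with existing detectors, AUROC is the usual ranking metric: 1.0 means memorized samples always receive higher suspicion scores than non-memorized ones, 0.5 corresponds to random guessing, and the below-0.5 values reported above mean those detectors tend to rank the two groups in the wrong order.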

Mon 23 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

16:00 - 18:00
Repairs (Research Papers / Journal First) at Andromeda
Chair(s): Michael Pradel (University of Stuttgart)
16:00 (20m) Talk, Research Papers
HornBro: Homotopy-like Method for Automated Quantum Program Repair
Siwei Tan (Zhejiang University), Liqiang Lu (Zhejiang University), Debin Xiang (Zhejiang University), Tianyao Chu (Zhejiang University), Congliang Lang (Zhejiang University), Jintao Chen (Zhejiang University), Xing Hu (Zhejiang University), Jianwei Yin (Zhejiang University)
16:20 (20m) Talk, Research Papers
RePurr: Automated Repair of Block-Based Learners' Programs
Sebastian Schweikl (University of Passau), Gordon Fraser (University of Passau)
16:40 (20m) Talk, Research Papers
Demystifying Memorization in LLM-based Program Repair via a General Hypothesis Testing Framework
Jiaolong Kong (Singapore Management University), Xiaofei Xie (Singapore Management University), Shangqing Liu (Nanyang Technological University)
17:00 (20m) Talk, Research Papers
IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models
Sayem Mohammad Imtiaz (Iowa State University), Astha Singh (Dept. of Computer Science, Iowa State University), Fraol Batole (Tulane University), Hridesh Rajan (Tulane University)
17:20 (20m) Talk, Journal First
Repairs and Breaks Prediction for Deep Neural Networks
Yuta Ishimoto (Kyushu University), Masanari Kondo (Kyushu University), Lei Ma (The University of Tokyo & University of Alberta), Naoyasu Ubayashi (Waseda University), Yasutaka Kamei (Kyushu University)
17:40 (20m) Talk, Research Papers
Element-Based Automated DNN Repair with Fine-Tuned Masked Language Model
Xu Wang (Beihang University; Zhongguancun Laboratory; Ministry of Education), Mingming Zhang (Beihang University), Xiangxin Meng (Beihang University), Jian Zhang (Nanyang Technological University), Yang Liu (Nanyang Technological University), Chunming Hu (Beihang University)

Information for Participants
Mon 23 Jun 2025 16:00 - 18:00 at Andromeda (Repairs). Chair(s): Michael Pradel
Info for room Andromeda:

Andromeda is located close to the restaurant and the bar, at the end of the corridor on the side of the bar.

From the registration desk, head towards the restaurant, turn left towards the bar, and walk to the end of the corridor.
