Demystifying Memorization in LLM-based Program Repair via a General Hypothesis Testing Framework
Large Language Models (LLMs) have achieved remarkable success in various applications, particularly in code-related tasks such as code generation and program repair, setting new performance benchmarks. However, the extensive use of large training corpora raises concerns about whether these achievements stem from genuine understanding or mere memorization of the training data, a question often overlooked in current research. This paper studies the memorization issue in LLM-based program repair by investigating whether the correct patches generated by LLMs are the result of memorization. The key challenge lies in the absence of ground truth for confirming memorization, which has led to a variety of ad hoc detection methods. To address this challenge, we first propose a framework that formalizes memorization detection as a general hypothesis testing problem, in which existing approaches can be unified by defining a low-probability event under the null hypothesis that the data is not memorized. The occurrence of such an event leads to rejection of the null hypothesis, indicating potential memorization.
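As a rough illustration of this hypothesis-testing framing (a minimal sketch, not the authors' implementation), a detector can be modeled as an event predicate paired with an assumed bound on that event's probability under the null hypothesis; the names `MemorizationTest` and `alpha` below are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MemorizationTest:
    """A memorization detector in the hypothesis-testing framing.

    event: predicate that is True when the chosen low-probability event
           occurs for a (model output, ground truth) pair.
    alpha: assumed upper bound on P(event) under the null hypothesis
           H0: "the sample was not memorized".
    """
    event: Callable[[Any, Any], bool]
    alpha: float

    def reject_null(self, model_output: Any, ground_truth: Any) -> bool:
        """Return True when H0 is rejected, i.e. memorization is suspected.

        The test is only meaningful if the event really is low-probability
        under H0 (i.e., its probability does not exceed alpha).
        """
        return self.event(model_output, ground_truth)
```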
Based on this framework, we design two specific methods (i.e., low-probability events) to detect potential memorization: 1) basic ground-truth matching, and 2) reassessment after substantial code mutation. We investigate the memorization issue in LLM-based program repair on two datasets: Defects4J, a widely used benchmark that is likely included in the training data, and GitBug-Java, a newer dataset that is unlikely to be part of the training data. Our findings reveal that a significant portion of correct patches exactly match the ground truths in Defects4J (e.g., 78.83% on GPT-3.5 and 87.42% on CodeLlama-7b). Moreover, even after substantial modifications to the buggy code, under which the original repairs should no longer be generated, a considerable percentage of bugs (e.g., 81.82% on GPT-3.5 and 88.24% on CodeLlama-7b) are still fixed exactly as in the original bug fixes, indicating a high likelihood of memorization. Furthermore, we evaluate existing memorization detection methods and demonstrate their ineffectiveness in this context (e.g., most AUROC values are below 0.5); theoretical analysis under our hypothesis testing framework shows that the events they define may not satisfy the low-probability requirement. The study highlights the critical need for more robust and rigorous evaluations in LLM-based software engineering research, ensuring a clear distinction between true problem-solving capabilities and mere memorization.
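The two detection events described above could be instantiated roughly as follows. This is a hedged sketch rather than the paper's code: `normalize`, `mutate_buggy_code`, and `generate_patch` are hypothetical stand-ins for a code normalizer, a semantics-altering mutator, and an LLM repair call.

```python
import re
from typing import Callable

def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted patches compare equal."""
    return re.sub(r"\s+", " ", code).strip()

def exact_match_event(patch: str, ground_truth: str) -> bool:
    """Event 1: the generated patch exactly matches the developer fix."""
    return normalize(patch) == normalize(ground_truth)

def mutation_reassessment_event(
    buggy_code: str,
    ground_truth: str,
    generate_patch: Callable[[str], str],
    mutate_buggy_code: Callable[[str], str],
) -> bool:
    """Event 2: after substantially mutating the buggy code, so that the
    original fix should no longer be the natural repair, the model still
    reproduces the original developer fix verbatim."""
    mutated = mutate_buggy_code(buggy_code)
    patch_for_mutant = generate_patch(mutated)
    return normalize(patch_for_mutant) == normalize(ground_truth)
```

In the framing sketched earlier, each of these predicates would serve as the `event` of a `MemorizationTest`, with `alpha` justified by how unlikely an exact reproduction of the developer patch is for a model that never saw the fix.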
Mon 23 Jun (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
16:00 - 18:00 | Repairs (Research Papers / Journal First) at Andromeda. Chair(s): Michael Pradel (University of Stuttgart)
16:00 (20m) Talk: HornBro: Homotopy-like Method for Automated Quantum Program Repair (Research Papers). Siwei Tan, Liqiang Lu, Debin Xiang, Tianyao Chu, Congliang Lang, Jintao Chen, Xing Hu, Jianwei Yin (Zhejiang University)
16:20 (20m) Talk: RePurr: Automated Repair of Block-Based Learners' Programs (Research Papers)
16:40 (20m) Talk: Demystifying Memorization in LLM-based Program Repair via a General Hypothesis Testing Framework (Research Papers). Jiaolong Kong (Singapore Management University), Xiaofei Xie (Singapore Management University), Shangqing Liu (Nanyang Technological University)
17:00 (20m) Talk: IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models (Research Papers). Sayem Mohammad Imtiaz (Iowa State University), Astha Singh (Dept. of Computer Science, Iowa State University), Fraol Batole (Tulane University), Hridesh Rajan (Tulane University)
17:20 (20m) Talk: Repairs and Breaks Prediction for Deep Neural Networks (Journal First). Yuta Ishimoto (Kyushu University), Masanari Kondo (Kyushu University), Lei Ma (The University of Tokyo & University of Alberta), Naoyasu Ubayashi (Waseda University), Yasutaka Kamei (Kyushu University)
17:40 (20m) Talk: Element-Based Automated DNN Repair with Fine-Tuned Masked Language Model (Research Papers). Xu Wang (Beihang University; Zhongguancun Laboratory; Ministry of Education), Mingming Zhang (Beihang University), Xiangxin Meng (Beihang University), Jian Zhang (Nanyang Technological University), Yang Liu (Nanyang Technological University), Chunming Hu (Beihang University)
Andromeda is located close to the restaurant and the bar, at the end of the corridor on the side of the bar.
From the registration desk, go towards the restaurant, turn left towards the bar, and walk to the end of the corridor.