CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
Data contamination is a critical barrier to the industrial adoption of advanced software engineering techniques that leverage large language models (LLMs). It occurs when evaluation data inadvertently overlaps with the public code repositories used to train LLMs, which severely undermines the credibility of performance evaluations. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered industrial implementation. To bridge this gap, this paper presents the first systematic study of the efficacy of code refactoring operators at multiple scales (method level, class level, and cross-class level) and in different programming languages. We develop CodeCleaner, which includes 11 operators for Python at multiple scales and 4 for Java. We explain the rationale for why these operators could resolve data contamination and use both data-wise metrics (e.g., N-gram matching overlap ratio) and model-wise metrics (e.g., perplexity) to quantify their efficacy. Applying all operators in CodeCleaner reduces the overlap ratio by 75%, demonstrating their effectiveness in addressing data contamination. In addition, we migrate four operators to Java, showing that they generalize to another language. We also observe an average 19% decrease in LLMs’ performance after applying our operators. CodeCleaner is available online.
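To make the idea of a data-wise metric concrete, the following is a minimal sketch of how a token-level N-gram overlap ratio might respond to a variable-renaming operator. It is not CodeCleaner's implementation: the abstract does not define the metric or the operators precisely, so the names `code_tokens`, `overlap_ratio`, `RenameIdentifiers`, and `rename`, the choice of n = 3, and the toy function are all assumptions made for illustration.

```python
import ast
import io
import tokenize


def code_tokens(source: str) -> list[str]:
    """Lex Python source into token strings, skipping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in skip]


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the candidate's distinct n-grams that also occur in the
    reference (an assumed definition, chosen for illustration)."""
    cand = ngrams(code_tokens(candidate), n)
    if not cand:
        return 0.0
    return len(cand & ngrams(code_tokens(reference), n)) / len(cand)


class RenameIdentifiers(ast.NodeTransformer):
    """Hypothetical renaming operator: rewrites names found in `mapping`."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename(source: str, mapping: dict[str, str]) -> str:
    """Apply the renaming operator and regenerate source (Python 3.9+)."""
    return ast.unparse(RenameIdentifiers(mapping).visit(ast.parse(source)))


# Toy "benchmark" sample standing in for code that may have leaked into training data.
original = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
refactored = rename(original, {"values": "items", "result": "acc", "v": "x"})

print(refactored)
print("overlap before:", overlap_ratio(original, original))
print("overlap after: ", overlap_ratio(refactored, original))
```

In this sketch the refactored function behaves the same as the original, while fewer of its token 3-grams match the original text. A real pipeline would compare against a training-corpus index rather than a single reference string.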
Fri 20 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
14:00 - 15:30 | Session 2: AI for Software Engineering I | Research Track, at Cosmos 3A | Chair(s): Jialun Cao (Hong Kong University of Science and Technology)
14:00 | 15m | Talk | Code Retrieval with Mixture of Experts Prototype Learning Based on Classification | Research Track | Feng Ling (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China); Guoheng Huang (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China); Jingchao Wang (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China); Xiaochen Yuan (Faculty of Applied Sciences, Macau Polytechnic University, Macau, China); Xuhang Chen (School of Computer Science and Engineering, Huizhou University, Huizhou 516001, China); XueYong Zhang (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China); Fanlong Zhang (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China); Chi-Man Pun (Department of Computer and Information Science, University of Macau, Macau, China)
14:15 | 15m | Talk | Issue Retrieval and Verification Enhanced Supplementary Code Comment Generation | Research Track | Yanzhen Zou (Peking University); Xianlin Zhao (Peking University); Xinglu Pan (Peking University); Bing Xie (Peking University)
14:30 | 15m | Talk | CodeCleaner: Mitigating Data Contamination for LLM Benchmarking | Research Track | Jialun Cao (Hong Kong University of Science and Technology); Songqiang Chen (The Hong Kong University of Science and Technology); Wuqi Zhang (MegaETH); Hau Ching Lo (The Hong Kong University of Science and Technology); Yeting Li (Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences); Shing-Chi Cheung (Hong Kong University of Science and Technology)
14:45 | 15m | Talk | LASER: Script Execution by Autonomous Agents for On-demand Traffic Simulation | Research Track | Hao Gao (Nanjing University); Jingyue Wang (Nanjing University); Wenyang Fang (Nanjing University); Jingwei Xu; Yunpeng Huang (Nanjing University); Taolue Chen (Birkbeck, University of London); Xiaoxing Ma (Nanjing University)
15:00 | 15m | Talk | Tech-ASan: Two-stage check for Address Sanitizer | Research Track | Yixuan Cao (Shenzhen University); Yuhong Feng (Shenzhen University); Huafeng Li (Shenzhen University); Chongyi Huang (Shenzhen University); Fangcao Jian (Shenzhen University); Haoran Li (Shenzhen University); Xu Wang (Shenzhen University)
Cosmos 3A is the first room in the Cosmos 3 wing.
When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.