Internetware 2025
Fri 20 - Sun 22 June 2025 Trondheim, Norway
co-located with FSE 2025
Fri 20 Jun 2025 14:30 - 14:45 at Cosmos 3A - Session2: AI for Software Engineering I Chair(s): Jialun Cao

Data contamination presents a critical barrier to widespread industrial adoption of advanced software engineering techniques that leverage large language models (LLMs). It occurs when evaluation data inadvertently overlaps with the public code repositories used to train LLMs, severely undermining the credibility of performance evaluations. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered widespread industrial implementation. To bridge this gap, this paper presents the first systematic study of the efficacy of code refactoring operators at multiple scales (method level, class level, and cross-class level) and in different programming languages. We develop CodeCleaner, which includes 11 operators for Python at multiple scales and 4 for Java. We explain the rationale for why these operators can resolve data contamination, and we use both data-wise metrics (e.g., N-gram matching overlap ratio) and model-wise metrics (e.g., perplexity) to quantify their efficacy after the operators are applied. A 75% drop in overlap ratio is observed when applying all operators in CodeCleaner, demonstrating their effectiveness in addressing data contamination. In addition, we migrate four operators to Java, showing that they generalize to another language. We also observe an average 19% decrease in LLMs’ performance after applying our operators. CodeCleaner is available online.
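To make the abstract's two central ideas concrete, here is a minimal Python sketch of one refactoring operator (variable renaming) and one data-wise metric (token N-gram overlap ratio). It is an illustration under stated assumptions, not CodeCleaner's actual implementation: the function names `rename_locals`, `overlap_ratio`, `code_tokens`, and `ngrams`, the `var_i` naming scheme, and the exact definition of the overlap ratio are all hypothetical. The sketch assumes Python 3.9+ (for `ast.unparse`) and a single self-contained function as input.

```python
import ast
import io
import tokenize


def code_tokens(source: str) -> list[str]:
    """Lexical tokens of Python source, ignoring comments and layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
            tokenize.COMMENT, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Set of distinct contiguous token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's distinct n-grams that also occur in the reference.

    Assumed definition for illustration; the paper may define its metric differently.
    """
    cand = ngrams(code_tokens(candidate), n)
    if not cand:
        return 0.0
    return len(cand & ngrams(code_tokens(reference), n)) / len(cand)


def rename_locals(source: str) -> str:
    """Hypothetical renaming operator: parameters and assigned names become var_0, var_1, ...

    Simplifications: no handling of global/nonlocal declarations, callers that pass
    keyword arguments, or existing identifiers that already look like var_<i>.
    """
    tree = ast.parse(source)
    mapping: dict[str, str] = {}
    # Pass 1: collect parameter names and every name that is assigned to.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"var_{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"var_{len(mapping)}")
    # Pass 2: rewrite definitions and uses consistently.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


ORIGINAL = '''
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)
'''

refactored = rename_locals(ORIGINAL)
print(refactored)
print("overlap before:", overlap_ratio(ORIGINAL, ORIGINAL, n=3))
print("overlap after: ", overlap_ratio(refactored, ORIGINAL, n=3))
```

The function name and built-ins such as `len` are left untouched so that the refactored code keeps its behavior and public interface; only identifiers local to the function change, which is what lowers the N-gram overlap with the original text.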

Fri 20 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

14:00 - 15:30
Session2: AI for Software Engineering I - Research Track at Cosmos 3A
Chair(s): Jialun Cao Hong Kong University of Science and Technology
14:00
15m
Talk
Code Retrieval with Mixture of Experts Prototype Learning Based on Classification
Research Track
Feng Ling School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, Guoheng Huang School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, Jingchao Wang School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, Xiaochen Yuan Faculty of Applied Sciences, Macau Polytechnic University, Macau, China, Xuhang Chen School of Computer Science and Engineering, Huizhou University, Huizhou 516001, China, XueYong Zhang School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, Fanlong Zhang School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, Chi-Man Pun Department of Computer and Information Science, University of Macau, Macau, China
14:15
15m
Talk
Issue Retrieval and Verification Enhanced Supplementary Code Comment Generation
Research Track
Yanzhen Zou Peking University, Xianlin Zhao Peking University, Xinglu Pan Peking University, Bing Xie Peking University
14:30
15m
Talk
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
Research Track
Jialun Cao Hong Kong University of Science and Technology, Songqiang Chen The Hong Kong University of Science and Technology, Wuqi Zhang MegaETH, Hau Ching Lo The Hong Kong University of Science and Technology, Yeting Li Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shing-Chi Cheung Hong Kong University of Science and Technology
14:45
15m
Talk
LASER: Script Execution by Autonomous Agents for On-demand Traffic Simulation
Research Track
Hao Gao Nanjing University, Jingyue Wang Nanjing University, Wenyang Fang Nanjing University, Jingwei Xu, Yunpeng Huang Nanjing University, Taolue Chen Birkbeck, University of London, Xiaoxing Ma Nanjing University
15:00
15m
Talk
Tech-ASan: Two-stage check for Address Sanitizer
Research Track
Yixuan Cao Shenzhen University, Yuhong Feng Shenzhen University, Huafeng Li Shenzhen University, Chongyi Huang Shenzhen University, Fangcao Jian Shenzhen University, Haoran Li Shenzhen University, Xu Wang Shenzhen University

Information for Participants
Fri 20 Jun 2025 14:00 - 15:30 at Cosmos 3A - Session2: AI for Software Engineering I Chair(s): Jialun Cao
Info for room Cosmos 3A:

Cosmos 3A is the first room in the Cosmos 3 wing.

When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.
