ICSE 2025 / APR 2025
Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR
The performance of a machine learning system is determined not only by the model but also, to a substantial degree, by the data it is trained on. With the increasing use of machine learning, data quality has also become a concern in automated program repair (APR) research. In this position paper, we report some of the data-related issues we have come across when working with several large APR datasets and benchmarks, such as duplicates and “bogus bugs”. We briefly discuss the potential impact of these problems on repair performance and propose possible remedies. We believe that more data-focused approaches could improve the performance and robustness of current and future APR systems.
Tue 29 Apr
Displayed time zone: Eastern Time (US & Canada)
14:00 - 15:30
14:00 | 30m | Other | Discussion | APR | Chao Peng (ByteDance)
14:30 | 20m | Talk | Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR | APR
14:50 | 20m | Talk | LLM-Based Repair of C++ Implicit Data Loss Compiler Warnings: An Industrial Case Study | APR
15:10 | 20m | Talk | Scholia - An XAI Framework for APR | APR | Nethum Lamahewage (University of Moratuwa, Sri Lanka), Nimantha Cooray (University of Moratuwa, Sri Lanka), Ridwan Salihin Shariffdeen (National University of Singapore), Sandareka Wickramanayake (University of Moratuwa, Sri Lanka), Nisansa de Silva (University of Moratuwa, Sri Lanka)