ICSE 2025 (series) /  FORGE 2025 (series) /  Data and Benchmarking / 
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
Sun 27 Apr 2025 16:54 - 17:00 at 207 - Session2: FM for Software Quality Assurance & Testing Chair(s): Feifei Niu
The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
Sun 27 AprDisplayed time zone: Eastern Time (US & Canada) change
Sun 27 Apr
Displayed time zone: Eastern Time (US & Canada) change
| 16:00 - 17:30 | Session2: FM for Software Quality Assurance & TestingResearch Papers / Data and Benchmarking at 207 Chair(s): Feifei Niu University of Ottawa | ||
| 16:0012m Long-paper | Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements Research Papers | ||
| 16:1212m Long-paper | Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models Research Papers Marc Bruni University of Applied Sciences and Arts Northwestern Switzerland, Fabio Gabrielli University of Applied Sciences and Arts Northwestern Switzerland, Mohammad Ghafari TU Clausthal, Martin Kropp University of Applied Sciences and Arts Northwestern SwitzerlandPre-print | ||
| 16:2412m Long-paper | Vulnerability-Triggering Test Case Generation from Third-Party Libraries Research Papers Yi Gao Zhejiang University, Xing Hu Zhejiang University, Zirui Chen , Tongtong Xu Nanjing University, Xiaohu Yang Zhejiang University | ||
| 16:366m Short-paper | Microservices Performance Testing with Causality-enhanced Large Language Models Research Papers Cristian Mascia University of Naples Federico II, Roberto Pietrantuono Università di Napoli Federico II, Antonio Guerriero Università di Napoli Federico II, Luca Giamattei Università di Napoli Federico II, Stefano Russo Università di Napoli Federico II | ||
| 16:426m Short-paper | MaRV: A Manually Validated Refactoring Dataset Data and Benchmarking Henrique Gomes Nunes Universidade Federal de Minas Gerais, Tushar Sharma Dalhousie University, Eduardo Figueiredo Federal University of Minas Gerais | ||
| 16:486m Short-paper | PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection Data and Benchmarking Domenico Cotroneo University of Naples Federico II, Giuseppe De Rosa University of Naples Federico II, Pietro Liguori University of Naples Federico II | ||
| 16:546m Short-paper | The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Data and Benchmarking Jonathan Katzy Delft University of Technology, Răzvan Mihai Popescu Delft University of Technology, Arie van Deursen TU Delft, Maliheh Izadi Delft University of Technology | ||
| 17:0012m Long-paper | ELDetector: An Automated Approach Detecting Endless-loop in Mini Programs Research Papers Nan Hu Xi’an Jiaotong University, Ming Fan Xi'an Jiaotong University, Jingyi Lei Xi'an Jiaotong University, Jiaying He Xi'an Jiaotong University, Zhe Hou China Mobile System Integration Co. | ||
| 17:1212m Long-paper | Testing Android Third Party Libraries with LLMs to Detect Incompatible APIs Research Papers Tarek Mahmud Texas State University, Bin Duan University of Queensland, Meiru Che Central Queensland University, Anne Ngu Texas State University, Guowei Yang University of Queensland | ||


