Sat 3 May 2025 10:10 - 10:20 at 214 - Opening / Keynote 1 / Paper Session 1 Chair(s): Zijian Wang

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage.

In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular CodeGen, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets, such as Llama 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess model capabilities.
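The abstract names negative log-likelihood and n-gram accuracy as leakage signals. Below is a minimal sketch of how such probes might be computed with Hugging Face Transformers; the model name, code snippet, prompt length, and n-gram size are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch only (not the paper's artifact): probing a causal code LM
# for memorization signals via average negative log-likelihood (NLL) and
# n-gram accuracy. Model name and snippet are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # assumed stand-in for an evaluated model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

snippet = "public int add(int a, int b) { return a + b; }"  # placeholder benchmark code

def avg_nll(text: str) -> float:
    """Mean per-token NLL the model assigns to `text` (unusually low values can hint at memorization)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean cross-entropy (NLL) per token.
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def ngram_accuracy(text: str, prompt_tokens: int = 16, n: int = 5):
    """Fraction of the next n ground-truth tokens reproduced by greedy decoding from a prefix."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    if ids.numel() < prompt_tokens + n:
        return None  # snippet too short for this prompt/continuation split
    prompt = ids[:prompt_tokens].unsqueeze(0)
    target = ids[prompt_tokens:prompt_tokens + n]
    with torch.no_grad():
        gen = model.generate(prompt, max_new_tokens=n, do_sample=False)
    pred = gen[0, prompt_tokens:]
    m = min(pred.numel(), target.numel())
    return (pred[:m] == target[:m]).float().mean().item() if m else 0.0

print(f"avg NLL: {avg_nll(snippet):.3f}")
print(f"5-gram accuracy: {ngram_accuracy(snippet)}")
```

In this kind of probe, benchmark snippets that receive both low NLL and high n-gram accuracy relative to comparable unseen code are the candidates for memorization.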

Sat 3 May

Displayed time zone: Eastern Time (US & Canada)

09:00 - 10:30
Opening / Keynote 1 / Paper Session 1 (LLM4Code) at 214
Chair(s): Zijian Wang AWS AI Labs
09:00
10m
Day opening
Opening
LLM4Code
Lingming Zhang University of Illinois at Urbana-Champaign, Prem Devanbu University of California at Davis, Zijian Wang AWS AI Labs
09:10
60m
Keynote
Keynote 1: Building the Hybrid Human-AI Developer: From Code Completion to Agents (zoom talk)
LLM4Code
10:10
10m
Talk
Are Large Language Models Memorizing Bug Benchmarks?
LLM4Code
Daniel Ramos Carnegie Mellon University, Claudia Mamede Carnegie Mellon University, Kush Jain Carnegie Mellon University, Paulo Canelas Carnegie Mellon University, Catarina Gamboa Carnegie Mellon University, Claire Le Goues Carnegie Mellon University
10:20
10m
Talk
RepairBench: Leaderboard of Frontier Models for Program Repair
LLM4Code
André Silva KTH Royal Institute of Technology, Martin Monperrus KTH Royal Institute of Technology