The Fault in our Stars: Quality Assessment of Code Generation Benchmarks (SCAM 2024 - Research Track)

Who

Mohammed Latif Siddiq, Simantika Bhattacharjee Dristi, Joy Saha, Joanna C. S. Santos

Track

SCAM 2024 Research Track

Time Zone

The program is currently displayed in (GMT-07:00) Arizona.

Use conference time zone: (GMT-07:00) ArizonaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Tue 8 Oct 2024 14:04 - 14:20 at Fremont - Testing & Debugging Chair(s): Wesley Assunção

Abstract

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks’ prompts affects a model’s performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark’s trustworthiness. We found that code generation evaluation bench- marks mainly focused on Python and coding exercises and had very limited contextual dependencies to challenge the model. These datasets and the developers’ prompts suffer from quality issues like spelling and grammatical errors, unclear sentences to express developers’ intent, and not using proper documentation style. Fixing all these issues in the benchmarks can lead to a better performance for Python code generation, but not a significant improvement was observed for Java code generation. We also found evidence that GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.

Link to Preprint

https://s2e-lab.github.io/preprints/scam24-benchmarks-preprint.pdf

Mohammed Latif Siddiq

University of Notre Dame

United States

Simantika Bhattacharjee Dristi

BRAC University

Bangladesh

Joy Saha