SCAM 2024
Mon 7 - Tue 8 October 2024
co-located with ICSME 2024
Tue 8 Oct 2024 14:04 - 14:20 at Fremont - Testing & Debugging Chair(s): Wesley Assunção

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct a first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We then investigated whether fixing the identified quality issues in the benchmarks’ prompts affects a model’s performance. We also studied memorization issues in the evaluation datasets, which can call a benchmark’s trustworthiness into question. We found that code generation evaluation benchmarks mainly focused on Python and coding exercises and had very limited contextual dependencies to challenge the model. These datasets and the developers’ prompts suffer from quality issues such as spelling and grammatical errors, sentences that do not clearly express the developers’ intent, and improper documentation style. Fixing these issues in the benchmarks can lead to better performance for Python code generation, but no significant improvement was observed for Java code generation. We also found evidence that the GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.
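To make the kinds of prompt quality issues described above concrete, here is a minimal, hypothetical sketch in Python. The prompts, function names, and reference check are illustrative assumptions, not examples drawn from the paper’s benchmarks; they simply contrast a HumanEval-style prompt containing spelling errors, unclear intent, and non-standard documentation style with a cleaned-up version following conventional docstring style.

```python
# Hypothetical illustration (not taken from the paper's datasets): the kind of
# prompt quality issues the study reports, and a possible fixed version.

# Original-style prompt: misspelled identifier, grammatical errors, and an
# unclear one-line docstring that does not follow standard documentation style.
ORIGINAL_PROMPT = '''
def count_vowles(s):
    """recieve a string and give back how much vowels it have"""
'''

# Fixed-style prompt: corrected spelling, type hints, and a docstring that
# states the task clearly in a conventional format.
FIXED_PROMPT = '''
def count_vowels(s: str) -> int:
    """Count the vowels in a string.

    Args:
        s: The input string.

    Returns:
        The number of vowel characters (a, e, i, o, u) in ``s``.
    """
'''


def reference_solution(s: str) -> int:
    """Reference implementation a benchmark could use to test model completions."""
    return sum(ch.lower() in "aeiou" for ch in s)


if __name__ == "__main__":
    # Both prompts describe the same task; only the fixed one states it clearly.
    assert reference_solution("evaluation") == 6
    print("Original prompt:", ORIGINAL_PROMPT)
    print("Fixed prompt:", FIXED_PROMPT)
```

In a benchmark setting, each prompt variant would be sent to the model under evaluation and the generated completion checked against tests like `reference_solution`, which is how the effect of fixing prompt quality issues on model performance could be measured.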

Tue 8 Oct

Displayed time zone: Arizona

13:30 - 15:00
Testing & Debugging (Research Track) at Fremont
Chair(s): Wesley Assunção North Carolina State University
13:30
16m
Research paper
Migrating Unit Tests Across Java Applications
Research Track
Ajay Jha North Dakota State University, Sarah Nadi New York University Abu Dhabi, University of Alberta
Pre-print
13:47
16m
Research paper
PROZE: Generating Parameterized Unit Tests Informed by Runtime Data
Research Track
Deepika Tiwari KTH Royal Institute of Technology, Yogya Gamage Université de Montréal, Martin Monperrus KTH Royal Institute of Technology, Benoit Baudry Université de Montréal
Pre-print
14:04
16m
Research paper
The Fault in our Stars: Quality Assessment of Code Generation Benchmarks
Research Track
Mohammed Latif Siddiq University of Notre Dame, Simantika Bhattacharjee Dristi BRAC University, Joy Saha BRAC University, Joanna C. S. Santos University of Notre Dame
Pre-print
14:21
16m
Research paper
Breaking-Good: Explaining Breaking Dependency Updates with Build Analysis
Research Track
Frank Reyes Garcia KTH Royal Institute of Technology, Benoit Baudry Université de Montréal, Martin Monperrus KTH Royal Institute of Technology
Pre-print
14:40
20m
Live Q&A
Discussion (Testing & Debugging)
Research Track