Fri 2 May 2025 15:15 - 15:30 at 214 - AI for Testing and QA 6 Chair(s): Ladan Tahvildari

Large Language Models (LLMs) have shown significant potential in automating software engineering tasks, particularly code generation. However, current evaluation benchmarks, which primarily focus on accuracy, fall short in assessing the quality of the generated code, specifically these models' tendency to produce code smells. To address this limitation, we introduce CodeSmellEval, a benchmark designed to evaluate the propensity of LLMs to generate code smells. Our benchmark comprises a novel metric, the Propensity Smelly Score (PSC), and a curated dataset of method-level code smells, CodeSmellData. To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral. The results reveal that both models tend to generate code smells, such as simplifiable-condition and consider-merging-isinstance. These findings highlight the effectiveness of our benchmark in evaluating LLMs and provide valuable insights into their reliability and their propensity to introduce code smells in code generation tasks.
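The smells named above, simplifiable-condition and consider-merging-isinstance, are Pylint message symbols, so the idea behind the benchmark can be illustrated with Pylint as the smell detector. The sketch below is a minimal approximation, not the paper's implementation: it assumes a propensity score can be computed as the fraction of generated samples that Pylint flags with a given smell (the paper's exact PSC formulation may differ), and the pylint_smells and propensity_scores helpers are hypothetical names introduced here.

```python
# Minimal sketch of a smell-propensity measurement in the spirit of
# CodeSmellEval. Approximates a propensity score as the fraction of
# generated samples flagged with a given Pylint smell.
import json
import subprocess
import tempfile
from collections import Counter
from pathlib import Path

# Pylint message symbols mentioned in the abstract.
SMELLS = {"simplifiable-condition", "consider-merging-isinstance"}

def pylint_smells(code: str) -> set[str]:
    """Run Pylint on one generated snippet; return the tracked smells it reports."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.run(
            ["pylint", "--output-format=json", path],
            capture_output=True, text=True,
        )
        messages = json.loads(out.stdout or "[]")
        return {m["symbol"] for m in messages} & SMELLS
    finally:
        Path(path).unlink()

def propensity_scores(samples: list[str]) -> dict[str, float]:
    """Fraction of generated samples exhibiting each tracked smell."""
    counts = Counter(s for sample in samples for s in pylint_smells(sample))
    return {smell: counts[smell] / len(samples) for smell in SMELLS}

# Usage: score a batch of LLM-generated method bodies.
generated = [
    "def f(x, y):\n    if x or (not x and y):\n        return 1\n    return 0\n",
]
print(propensity_scores(generated))
```

Because pylint_smells returns a set per sample, each smell is counted at most once per sample, so the score reads as "share of samples containing this smell" rather than a raw message count.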

Fri 2 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
AI for Testing and QA 6
Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) at 214
Chair(s): Ladan Tahvildari University of Waterloo
14:00
15m
Talk
Treefix: Enabling Execution with a Tree of Prefixes (Artifact-Functional, Artifact-Available, Artifact-Reusable)
Research Track
Beatriz Souza University of Stuttgart, Michael Pradel University of Stuttgart
Pre-print
14:15
15m
Talk
Assessing Evaluation Metrics for Neural Test Oracle Generation
Journal-first Papers
Jiho Shin York University, Hadi Hemmati York University, Moshi Wei York University, Song Wang York University
14:30
15m
Talk
Enhancing Energy-Awareness in Deep Learning through Fine-Grained Energy Measurement
Journal-first Papers
Saurabhsingh Rajput Dalhousie University, Tim Widmayer University College London (UCL), Ziyuan Shang Nanyang Technological University, Maria Kechagia National and Kapodistrian University of Athens, Federica Sarro University College London, Tushar Sharma Dalhousie University
14:45
15m
Talk
Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality
Journal-first Papers
Hao Li Queen's University, Gopi Krishnan Rajbahadur Centre for Software Excellence, Huawei, Canada, Cor-Paul Bezemer University of Alberta
Link to publication · DOI · Pre-print
15:00
15m
Talk
Evaluating the Generalizability of LLMs in Automated Program Repair
New Ideas and Emerging Results (NIER)
Fengjie Li Tianjin University, Jiajun Jiang Tianjin University, Jiajun Sun Tianjin University, Hongyu Zhang Chongqing University
Pre-print
15:15
15m
Talk
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study
New Ideas and Emerging Results (NIER)
Alejandro Velasco William & Mary, Daniel Rodriguez-Cardenas William & Mary, David Nader Palacio William & Mary, Lutfar Rahman Alif University of Dhaka, Denys Poshyvanyk William & Mary
Pre-print