ESEIW 2024
Sun 20 - Fri 25 October 2024 Barcelona, Spain

This paper presents first insights into the effectiveness of ChatGPT at detecting bad smells in Java projects. We use a large dataset comprising four code smells (Blob, Data Class, Feature Envy, and Long Method) classified into three severity levels. We run two different prompts to assess ChatGPT’s proficiency: i) a generic prompt to verify whether the model can detect smells and ii) a prompt specifying the smells classified in the dataset. We apply precision, recall, and F1-score to quantify ChatGPT’s ability to identify the aforementioned smells. Our preliminary results show that the odds of ChatGPT providing a correct outcome with a specific prompt are 2.54 times higher than with a generic prompt. Moreover, ChatGPT is more effective at detecting smells of critical severity (F1-score reaching 0.52) than smells of minor severity (F1-score equal to 0.43). To conclude, we discuss the implications of our results and highlight future work on leveraging large language models to detect code smells.
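As an illustrative sketch (not taken from the paper), the evaluation metrics mentioned above can be computed from a detector's true-positive, false-positive, and false-negative counts; the counts below are hypothetical examples, not the study's actual figures.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1-score from binary detection counts.

    tp: smells correctly flagged; fp: non-smells flagged; fn: smells missed.
    Guards against division by zero when a denominator is empty.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one smell type at one severity level.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=6)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```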