An Empirical Investigation on the Use of Large Language Models for Performance Bug Detection
Performance bugs are non-functional defects that significantly degrade software performance. Identifying such bugs is challenging: they are typically harder to detect than other types of software defects, and detecting them often requires specialized expertise that may not be readily available within software organizations.
Previous approaches have attempted to automate performance bug detection using static analysis or traditional machine learning models trained on static code metrics. However, despite the growing potential and widespread adoption of Large Language Models (LLMs) for automating various software engineering tasks, no studies have directly investigated their capabilities in detecting performance bugs.
In this paper, we aim to fill this gap by exploring the potential of LLMs, such as CodeLlama, Qwen, and Artigenz, for detecting performance bugs directly from source code. We focus on zero-shot and few-shot prompting, as well as supervised fine-tuning, and evaluate these models on Java code from open-source projects with labeled performance bugs. Our results highlight the limitations of current LLMs in this domain: the models achieved low F1 scores, few-shot prompting provided only marginal improvements over the zero-shot configuration, and fine-tuning yielded only slight gains.
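To make the zero-shot setting concrete, the following minimal sketch illustrates how a code LLM can be prompted to label a Java snippet as containing a performance bug or not. It is an illustration only, not the paper's exact pipeline: the model checkpoint, prompt wording, and label parsing are assumptions introduced here.

    # Minimal zero-shot prompting sketch (illustrative; the checkpoint,
    # prompt, and label parsing are assumptions, not the paper's setup).
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="codellama/CodeLlama-7b-Instruct-hf",  # assumed checkpoint
        device_map="auto",
    )

    java_snippet = """
    for (String key : map.keySet()) {      // iterates keys, then looks up each value
        process(key, map.get(key));        // map.entrySet() would avoid the extra lookup
    }
    """

    prompt = (
        "You are a code reviewer. Does the following Java code contain a "
        "performance bug? Answer 'yes' or 'no' with a one-line reason.\n\n"
        + java_snippet + "\nAnswer:"
    )

    # Greedy decoding; the generated text echoes the prompt, so strip it off.
    output = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    answer = output[len(prompt):].strip().lower()
    label = "performance bug" if answer.startswith("yes") else "no performance bug"
    print(label)

A few-shot variant would prepend a handful of labeled code/answer pairs to the same prompt, which is the only difference between the two prompting configurations compared in the study.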