How well do LLM-based test generation techniques perform with newer LLM versions?
The rapid evolution of Large Language Models (LLMs) has significantly impacted software engineering, leading to a growing number of studies exploring their use in automated unit test generation. However, the standalone use of LLMs without post-processing has proven insufficient, often resulting in a high number of tests that fail to compile or fail to achieve high coverage. Several techniques and tools have been proposed to address these issues, reporting substantial improvements in test compilation and coverage. While interesting and important, LLM-based test generation techniques have been evaluated against relatively weak baselines (by today's standards), i.e., old LLM versions and relatively weak prompts, which may exaggerate the performance contribution of the approaches. In other words, stronger (newer) LLMs may well obviate any advantage that these techniques bring. We investigate this issue by replicating four state-of-the-art LLM-based test generation tools, HITS, SymPrompt, TestSpark, and CoverUp, which include engineering components aimed at guiding the test generation process through test compilation and execution feedback, and we evaluate their relative effectiveness and efficiency against a plain LLM test generation method. We integrate current LLM versions, which are newer than those used in the original studies, into all the approaches and conduct an experiment on a dataset comprising 393 classes and 3,657 methods. Perhaps surprisingly, our results show that the plain LLM-based approach can outperform previous state-of-the-art approaches on all test effectiveness metrics we used: line coverage (by 17.72%), branch coverage (by 19.80%), and mutation score (by 20.92%), and it does so at a comparable cost (number of LLM queries). We also observe that the granularity at which the plain LLM-based approach is applied has a significant impact on cost.
We therefore propose targeting program classes first, where test generation is more efficient, and then the uncovered methods, as a possible way to reduce the number of LLM requests. We find that this approach achieves test effectiveness comparable to (slightly higher than) the other methods while requiring approximately 20% fewer requests to the LLM.
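The class-first, method-second strategy can be illustrated with a minimal sketch. It is not the implementation used in the study: the function name `class_then_method` and the two callbacks `generate_for_class` and `generate_for_method` are hypothetical stand-ins for LLM requests, and treating a method as simply "covered" or "uncovered" is a simplifying assumption made only for illustration.

```python
from typing import Callable, Dict, List, Set


def class_then_method(
    classes: Dict[str, List[str]],
    generate_for_class: Callable[[str], Set[str]],
    generate_for_method: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Count LLM requests for a class-first, method-second strategy.

    classes maps a class name to its method names.
    generate_for_class (hypothetical) issues one class-level request and
    returns the names of the methods covered by the generated tests.
    generate_for_method (hypothetical) issues one method-level request and
    returns True if the method is covered afterwards.
    """
    requests = 0
    covered: Dict[str, Set[str]] = {}
    for cls, methods in classes.items():
        # Stage 1: a single request targeting the whole class.
        requests += 1
        covered[cls] = set(generate_for_class(cls)) & set(methods)
        # Stage 2: one request per method that stage 1 left uncovered.
        for method in methods:
            if method not in covered[cls]:
                requests += 1
                if generate_for_method(cls, method):
                    covered[cls].add(method)
    return {"requests": requests, "covered": covered}
```

Under these assumptions, a class whose methods are all covered by the class-level request costs one request rather than one per method, which is where the saving over a purely method-level strategy would come from.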