SANER 2025
Tue 4 - Fri 7 March 2025 Montréal, Québec, Canada
Wed 5 Mar 2025 11:45 - 12:00 at L-1710 - Empirical Studies & LLM. Chair(s): Diego Elias Costa

Despite recent advancements in Large Language Models (LLMs) for code generation, their inherent non-determinism remains a significant obstacle to reliable and reproducible software engineering research. Prior work has highlighted the high degree of variability in LLM-generated code, even when prompted with identical inputs. This non-deterministic behavior can undermine the validity of scientific conclusions drawn from LLM-based experiments. In contrast to prior research, this paper showcases the Tree of Thoughts (ToT) prompting strategy as a promising alternative for improving the predictability and quality of code generation results. By guiding the LLM through a structured reasoning process, ToT aims to reduce the randomness inherent in generation and improve the consistency of the output. Our experimental results on the GPT-3.5 Turbo model, using 829 code generation problems from the CodeContests, APPS (Automated Programming Progress Standard), and HumanEval benchmarks, demonstrate a substantial reduction in non-determinism compared to previous findings. Specifically, we observed a significant decrease in the number of coding tasks that produced inconsistent outputs across multiple requests. Nevertheless, we show that the reduction in semantic variability was less pronounced for HumanEval (69%), indicating unique challenges present in this dataset that are not fully mitigated by ToT.
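The consistency notion the abstract alludes to (whether repeated identical requests yield the same code) can be sketched as a simple check over the collected outputs. This is a minimal illustration, not the paper's method; the function names and the stubbed output strings below are hypothetical.

```python
# Hypothetical sketch: quantify output non-determinism for one coding task
# by comparing the code strings returned by repeated identical requests.
# In an actual study, each string would come from an LLM call with the
# same prompt; here they are stubbed in.

def distinct_output_ratio(outputs):
    """Fraction of unique outputs among repeated generations.

    Close to 1/len(outputs) means near-deterministic (mostly identical);
    1.0 means every request produced a different output string."""
    return len(set(outputs)) / len(outputs)

def is_consistent(outputs):
    """A task is syntactically consistent if all requests return
    exactly the same code string."""
    return len(set(outputs)) == 1

# Example: five repeated requests for one task; the last differs
# only in whitespace, so it still counts as a distinct string.
outputs = ["def f(x): return x+1"] * 4 + ["def f(x):\n    return x + 1"]
print(distinct_output_ratio(outputs))  # 0.4
print(is_consistent(outputs))          # False
```

Note that exact string comparison captures only syntactic variability; the semantic variability reported for HumanEval would require comparing test-execution behavior rather than raw text.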

Wed 5 Mar

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
11:00
15m
Talk
Beyond pip install: Evaluating LLM agents for the automated installation of Python projects
Research Papers
Louis Mark Milliken KAIST, Sungmin Kang National University of Singapore, Shin Yoo Korea Advanced Institute of Science and Technology
Pre-print
11:18
12m
Talk
On the Compression of Language Models for Code: An Empirical Study on CodeBERT
Research Papers
Giordano d'Aloisio University of L'Aquila, Luca Traini University of L'Aquila, Federica Sarro University College London, Antinisca Di Marco University of L'Aquila
Pre-print
11:30
15m
Talk
Can Large Language Models Discover Metamorphic Relations? A Large-Scale Empirical Study
Research Papers
Jiaming Zhang University of Science and Technology Beijing, Chang-ai Sun University of Science and Technology Beijing, Huai Liu Swinburne University of Technology, Sijin Dong University of Science and Technology Beijing
11:45
15m
Talk
Revisiting the Non-Determinism of Code Generation by the GPT-3.5 Large Language Model
Reproducibility Studies and Negative Results (RENE) Track
Salimata Sawadogo Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Aminata Sabané Université Joseph KI-ZERBO, Centre d'Excellence CITADELLE, Rodrique Kafando Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Tegawendé F. Bissyandé University of Luxembourg
12:00
15m
Talk
Language Models to Support Multi-Label Classification of Industrial Data
Industrial Track
Waleed Abdeen Blekinge Institute of Technology, Michael Unterkalmsteiner, Krzysztof Wnuk Blekinge Institute of Technology, Alessio Ferrari CNR-ISTI, Panagiota Chatzipetrou