SANER 2025
Tue 4 - Fri 7 March 2025, Montréal, Québec, Canada

Despite recent advances in Large Language Models (LLMs) for code generation, their inherent non-determinism remains a significant obstacle to reliable and reproducible software engineering research. Prior work has highlighted the high degree of variability in LLM-generated code, even when the model is prompted with identical inputs. This non-deterministic behavior can undermine the validity of scientific conclusions drawn from LLM-based experiments. In contrast to prior research, this paper presents the Tree of Thoughts (ToT) prompting strategy as a promising alternative for improving the predictability and quality of code generation results. By guiding the LLM through a structured exploration of intermediate reasoning steps ("thoughts"), ToT aims to reduce the randomness inherent in the generation process and improve the consistency of the output. Our experimental results on the GPT-3.5 Turbo model, using 829 code generation problems from the CodeContests, APPS (Automated Programming Progress Standard), and HumanEval benchmarks, demonstrate a substantial reduction in non-determinism compared to previous findings. Specifically, we observed a significant decrease in the number of coding tasks that produced inconsistent outputs across multiple requests. Nevertheless, we show that the reduction in semantic variability was less pronounced for HumanEval (69%), indicating that this dataset poses unique challenges that ToT does not fully mitigate.
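
For illustration only, the sketch below shows one plausible shape of a ToT control loop for code generation: expand each partial solution into several candidate thoughts, score them, and keep only the best few at each level. The abstract does not specify the paper's prompts, branching factor, or scoring rule, so call_llm, score_thought, and the breadth/beam/depth parameters are hypothetical placeholders, not the authors' implementation.

    # Minimal Tree-of-Thoughts sketch (assumptions labeled; not the paper's code).
    from typing import List, Tuple

    def call_llm(prompt: str) -> str:
        """Hypothetical stand-in for a chat-completion call (e.g., GPT-3.5 Turbo)."""
        return f"thought derived from: {prompt[:40]}..."

    def score_thought(thought: str) -> float:
        """Hypothetical evaluator; ToT typically asks the LLM itself to rate thoughts."""
        return float(len(thought) % 10)

    def tree_of_thoughts(task: str, breadth: int = 3, beam: int = 2, depth: int = 2) -> str:
        """Breadth-first ToT: expand each partial solution into `breadth` thoughts,
        keep the `beam` highest-scoring states, repeat for `depth` levels."""
        frontier: List[str] = [task]
        for _ in range(depth):
            candidates: List[Tuple[float, str]] = []
            for state in frontier:
                for i in range(breadth):
                    thought = call_llm(f"Step toward solving:\n{state}\n(variant {i})")
                    candidates.append((score_thought(thought), f"{state}\n{thought}"))
            candidates.sort(key=lambda c: c[0], reverse=True)
            frontier = [state for _, state in candidates[:beam]]
        return frontier[0]

    if __name__ == "__main__":
        print(tree_of_thoughts("Write a function that reverses a string."))

The pruning step is what plausibly curbs non-determinism: instead of accepting a single sampled completion, the loop discards low-scoring branches, so repeated runs are steered toward similar high-scoring solution paths.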