Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation (DeepTest 2025)

Who

Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy

Track

DeepTest 2025 Deep Learning <-> Testing

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sat 3 May 2025 16:30 - 17:00 at 213 - Paper Presentation 3 Chair(s): Matteo Biagiola

Abstract

Software testing is a crucial but time-consuming aspect of software development, and Large Language Models (LLMs) have recently gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), in which we use Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and showed that LLMs frequently generate undesirable test smells, up to 37% of the time. Then, we implemented a lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on all code quality metrics, despite training a substantially cheaper Codex model. We provide insights into how to reliably use RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing.
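
As a rough illustration of what a lightweight static analysis-based reward could look like, the sketch below scores a generated test by parsing it and running a few syntactic checks. It is not the authors' implementation: the abstract does not name the five quality metrics, so the three checks here (`has_assertion`, `no_duplicate_assertion`, `no_conditional_logic`), the function names `smell_report` and `reward`, the penalty of -1.0 for unparsable code, and the choice of Python tests as the analysis target are all assumptions made for the example.

```python
import ast


def smell_report(test_source: str) -> dict:
    """Run a few illustrative static checks on one generated test.

    The checks are hypothetical stand-ins for quality metrics; they are
    not the metrics used in the paper.
    """
    try:
        tree = ast.parse(test_source)
    except SyntaxError:
        return {"parses": False}

    nodes = list(ast.walk(tree))
    asserts = [n for n in nodes if isinstance(n, ast.Assert)]
    # ast.dump omits line numbers by default, so textually identical
    # assertions produce identical dumps.
    dumps = [ast.dump(a) for a in asserts]

    return {
        "parses": True,
        "has_assertion": len(asserts) > 0,
        "no_duplicate_assertion": len(set(dumps)) == len(dumps),
        "no_conditional_logic": not any(
            isinstance(n, (ast.If, ast.For, ast.While)) for n in nodes
        ),
    }


def reward(test_source: str) -> float:
    """Map the report to a scalar: -1.0 if unparsable, else fraction of checks passed."""
    report = smell_report(test_source)
    if not report["parses"]:
        return -1.0  # assumed penalty for syntactically invalid output
    checks = [passed for name, passed in report.items() if name != "parses"]
    return sum(checks) / len(checks)


clean_test = "def test_add():\n    assert 1 + 1 == 2\n"
smelly_test = (
    "def test_add():\n"
    "    if True:\n"
    "        assert 1 + 1 == 2\n"
    "        assert 1 + 1 == 2\n"
)
broken_test = "def test_add(:"

if __name__ == "__main__":
    for name, src in [("clean", clean_test), ("smelly", smelly_test), ("broken", broken_test)]:
        print(name, reward(src))
```

In an RL setup of the kind the abstract describes, a scalar like this would be computed for each sampled test and fed back to the policy as its reward; how the individual metrics are weighted and combined in RLSQM is not stated in the abstract.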

Benjamin Steenhoek

Microsoft

United States

Michele Tufano

Google

United States

Neel Sundaresan

Microsoft

United States

Alexey Svyatkovskiy

Google DeepMind

Session Program

Sat 3 May
Displayed time zone: Eastern Time (US & Canada)

16:00 - 17:30	Paper Presentation 3 (DeepTest) at 213 Chair(s): Matteo Biagiola (Università della Svizzera italiana)

16:00	30m	Talk	OpenCat: Improving Interoperability of ADS Testing (DeepTest). Qurban Ali (University of Milano-Bicocca), Andrea Stocco (Technical University of Munich, fortiss), Leonardo Mariani (University of Milano-Bicocca), Oliviero Riganelli (University of Milano-Bicocca)
16:30	30m	Talk	Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation (DeepTest). Benjamin Steenhoek (Microsoft), Michele Tufano (Google), Neel Sundaresan (Microsoft), Alexey Svyatkovskiy (Google DeepMind)