Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM’s code generation abilities to be identified, including anomalies where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting robustness issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
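To make the benchmark idea concrete, below is a minimal sketch (not the authors' implementation) of a Turbulence-style evaluation loop: one parameterised question template, a test oracle for it, and a "neighbourhood" of instantiations posed to an LLM. The example template text, the `llm_generate` stub, and the choice of test inputs are all hypothetical placeholders, assumed only for illustration.

```python
# Sketch of a parameterised question template with a test oracle,
# evaluated over a neighbourhood of parameter instantiations.
# llm_generate() is a stub; in practice it would call an instruction-tuned LLM.

from typing import Callable

def question(n: int) -> str:
    # Hypothetical parameterised natural-language programming question.
    return (f"Write a Python function f(xs) that returns the sum of the "
            f"{n} largest elements of the list xs.")

def oracle(n: int, candidate: Callable) -> bool:
    # Test oracle: compare the candidate against a trusted reference
    # implementation on a few inputs for this parameter instantiation.
    reference = lambda xs: sum(sorted(xs, reverse=True)[:n])
    tests = [[5, 1, 9, 3, 7], list(range(10)), [2] * 8]
    return all(candidate(xs) == reference(xs) for xs in tests)

def llm_generate(prompt: str) -> str:
    # Stub standing in for a call to an LLM; it should return Python
    # source code defining a function f(xs).
    raise NotImplementedError

def evaluate_neighbourhood(params=range(1, 6)) -> dict:
    # Ask the same template at several parameter values and record which
    # instantiations the model solves, exposing robustness anomalies
    # (e.g. correct for n = 1..4 but wrong for n = 5).
    results = {}
    for n in params:
        code = llm_generate(question(n))
        namespace = {}
        try:
            exec(code, namespace)  # trust boundary: sandbox this in practice
            results[n] = oracle(n, namespace["f"])
        except Exception:
            results[n] = False
    return results
```

A per-template result dictionary like `{1: True, 2: True, 3: True, 4: True, 5: False}` is the kind of outcome the abstract describes: the model solves most of the neighbourhood but fails to generalise to a particular instantiation.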
Thu 3 Apr (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
11:00 - 12:30 | Testing ML Systems and Fault Localisation (Industry / Research Papers) at Aula Magna (AM). Chair(s): Atif Memon (Apple)
11:00 15m Talk | On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering (Research Papers). Pre-print
11:15 15m Talk | Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems (Research Papers). Stefano Carlo Lambertenghi (Technische Universität München, fortiss GmbH), Hannes Leonhard (Technical University of Munich), Andrea Stocco (Technical University of Munich, fortiss). Pre-print
11:30 15m Talk | Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code (Research Papers). Shahin Honarvar (Imperial College London), Mark van der Wilk (University of Oxford), Alastair F. Donaldson (Imperial College London)
11:45 15m Talk | Taming Uncertainty for Critical Scenario Generation in Automated Driving (Industry). Selma Grosse (DENSO Automotive GmbH), Dejan Nickovic (Austrian Institute of Technology), Cristinel Mateis (AIT Austrian Institute of Technology GmbH), Alessio Gambi (Austrian Institute of Technology (AIT)), Adam Molin (DENSO AUTOMOTIVE)
12:00 15m Talk | Multi-Project Just-in-Time Software Defect Prediction Based on Multi-Task Learning for Mobile Applications (Research Papers). Feng Chen (Chongqing University of Posts and Telecommunications), Ke Yuxin (Chongqing University of Posts and Telecommunications), Liu Xin (Chongqing University of Posts and Telecommunications), Wei Qingjie (Chongqing University of Posts and Telecommunications)
12:15 15m Talk | Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces (Industry). Neetha Jambigi (University of Cologne), Bartosz Bogacz (SAP SE), Moritz Mueller (SAP SE), Thomas Bach (SAP), Michael Felderer (German Aerospace Center (DLR) & University of Cologne)