ICST 2025
Mon 31 March - Fri 4 April 2025 Naples, Italy
Thu 3 Apr 2025 11:30 - 11:45 at Aula Magna (AM) - Testing ML Systems and Fault Localisation Chair(s): Atif Memon

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM’s code generation abilities to be identified, including anomalies where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting robustness issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
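The template-and-oracle idea in the abstract can be illustrated with a small Python sketch. It is not the Turbulence implementation: the `QuestionTemplate` class, the example question, the `fake_model` stand-in for an LLM, and the `evaluate` helper are all hypothetical names invented here. It only shows how one parameterised question with an oracle yields a neighbourhood of questions, and how a failure at a single parameter value shows up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Solution = Callable[[List[int]], int]


@dataclass
class QuestionTemplate:
    """Hypothetical template: a prompt with an {n} placeholder plus an oracle."""
    text: str
    oracle: Callable[[Solution, int], bool]


def sum_first_n_oracle(candidate: Solution, n: int) -> bool:
    """Judge a candidate against a reference answer for parameter n."""
    data = list(range(1, 21))
    try:
        return candidate(data) == sum(data[:n])
    except Exception:
        # A crashing solution counts as incorrect.
        return False


TEMPLATE = QuestionTemplate(
    text="Write a function `solve(xs)` that returns the sum of the "
         "first {n} elements of the list xs.",
    oracle=sum_first_n_oracle,
)


def neighbourhood(template: QuestionTemplate, params: Iterable[int]) -> Dict[int, str]:
    """Instantiate the template once per parameter value."""
    return {n: template.text.format(n=n) for n in params}


def fake_model(n: int) -> Solution:
    """Stand-in for an LLM (assumption): correct everywhere except n == 7."""
    if n == 7:
        return lambda xs: sum(xs[:n - 1])  # deliberate off-by-one error
    return lambda xs: sum(xs[:n])


def evaluate(template: QuestionTemplate, params: Iterable[int],
             model: Callable[[int], Solution]) -> Dict[int, bool]:
    """Run the oracle on the model's answer for every question in the neighbourhood."""
    return {n: template.oracle(model(n), n) for n in params}


results = evaluate(TEMPLATE, range(1, 11), fake_model)
failures = [n for n, ok in results.items() if not ok]
```

In this toy setting `failures` holds the parameter values where the stand-in model breaks, which is the kind of isolated anomaly within an otherwise solved neighbourhood that the abstract describes. A real setup would send each instantiated prompt to an LLM and run the returned code in a sandbox.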

Thu 3 Apr

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:00 - 12:30
Testing ML Systems and Fault Localisation - Industry / Research Papers at Aula Magna (AM)
Chair(s): Atif Memon Apple
11:00
15m
Talk
On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering
Research Papers
Lauren Lyons Auburn University, Ali Ghanbari Auburn University
Pre-print
11:15
15m
Talk
Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems
Research Papers
Stefano Carlo Lambertenghi Technische Universität München, fortiss GmbH, Hannes Leonhard Technical University of Munich, Andrea Stocco Technical University of Munich, fortiss
Pre-print
11:30
15m
Talk
Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
Research Papers
Shahin Honarvar Imperial College London, Mark van der Wilk University of Oxford, Alastair F. Donaldson Imperial College London
11:45
15m
Talk
Taming Uncertainty for Critical Scenario Generation in Automated Driving
Industry
Selma Grosse DENSO Automotive GmbH, Dejan Nickovic Austrian Institute of Technology, Cristinel Mateis AIT Austrian Institute of Technology GmbH, Alessio Gambi Austrian Institute of Technology (AIT), Adam Molin DENSO AUTOMOTIVE
12:00
15m
Talk
Multi-Project Just-in-Time Software Defect Prediction Based on Multi-Task Learning for Mobile Applications
Research Papers
Feng Chen Chongqing University of Posts and Telecommunications, Ke Yuxin Chongqing University of Posts and Telecommunications, Liu Xin Chongqing University of Posts and Telecommunications, Wei Qingjie Chongqing University of Posts and Telecommunications
12:15
15m
Talk
Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces
Industry
Neetha Jambigi University of Cologne, Bartosz Bogacz SAP SE, Moritz Mueller SAP SE, Thomas Bach SAP, Michael Felderer German Aerospace Center (DLR) & University of Cologne