Benchmarking Generative AI Models for Deep Learning Test Input Generation (ICST 2025 - Research Papers)

Mon 31 March - Fri 4 April 2025 Naples, Italy

Who

Maryam Maryam, Matteo Biagiola, Andrea Stocco, Vincenzo Riccio

Track

ICST 2025 Research Papers

When

Thu 3 Apr 2025 12:00 - 12:15 (GMT+02:00) at Room A - LLMs in Testing. Chair(s): Valerio Terragni

Abstract

Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training.

In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and the quality of the generated test images in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.

Link to Preprint

https://arxiv.org/abs/2412.17652

Maryam Maryam

University of Udine

Italy

Matteo Biagiola

Università della Svizzera italiana

Andrea Stocco

Technical University of Munich, fortiss

Germany

Vincenzo Riccio