LLMs for Test Input Generation for Semantic Applications (CAIN 2024 - Research and Experience Papers)

Who

Zafaryab Rasool, Scott Barnett, David Willie, Stefanus Kurniawan, Sherwin Balugo, Srikanth Thudumu, Mohamed Abdelrazek

Track

CAIN 2024 Research and Experience Papers

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

When

Mon 15 Apr 2024, 14:30 - 14:40, at Pequeno Auditório. Session: LLMs and Testing. Chair(s): Roland Weiss

Abstract

Large language models (LLMs) enable state-of-the-art semantic capabilities, such as semantic search of unstructured documents and text generation, to be added to software systems. However, these models are computationally expensive. At scale, the cost of serving thousands of users increases massively, which also affects user experience. To address this problem, semantic caches are used to check for answers to similar queries (that may have been phrased differently) without hitting the LLM service. Because these semantic cache techniques rely on query embeddings, there is a high chance of errors that impact user confidence in the system. Adopting semantic cache techniques usually requires testing the effectiveness of a semantic cache (accurate cache hits and misses), which in turn requires a labelled test set of similar queries and responses that is often unavailable.

In this paper, we present VaryGen, an approach for using LLMs for test input generation that produces similar questions from unstructured text documents. Our novel approach uses the reasoning capabilities of LLMs to 1) adapt queries to the domain, 2) synthesise subtle variations to queries, and 3) evaluate the synthesised test dataset. We evaluated our approach in the domain of a student question and answer system by qualitatively analysing 100 generated queries and result pairs, and by conducting an empirical case study with an open-source semantic cache. Our results show that query pairs satisfy human expectations of similarity and that our generated data demonstrates failure cases of a semantic cache. We also evaluate our approach on the Qasper dataset. This work is an important first step towards test input generation for semantic applications and presents considerations for practitioners when calibrating a semantic cache.
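
As a rough illustration of the testing problem the abstract describes, the sketch below checks a toy semantic cache against a small labelled set of query pairs and counts accurate and inaccurate hits and misses. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, `ToySemanticCache` is a hypothetical class rather than the open-source cache used in the paper, and the query pairs are invented rather than generated by VaryGen.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    """Hypothetical cache: returns a stored answer if similarity >= threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) tuples

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        probe = embed(query)
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(probe, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Invented labelled pairs: (cached query, probe query, should the probe hit?)
LABELLED_PAIRS = [
    ("when is the assignment due", "what is the due date of the assignment", True),
    ("when is the assignment due", "when is the exam held", False),
]


def evaluate(threshold, labelled_pairs):
    """Count accurate and inaccurate cache hits and misses at one threshold."""
    counts = {"true_hit": 0, "false_hit": 0, "true_miss": 0, "false_miss": 0}
    for cached_query, probe_query, should_hit in labelled_pairs:
        # Use a fresh cache per pair so each probe only sees its own cached query.
        cache = ToySemanticCache(threshold)
        cache.put(cached_query, "cached answer")
        hit = cache.get(probe_query) is not None
        if hit:
            counts["true_hit" if should_hit else "false_hit"] += 1
        else:
            counts["false_miss" if should_hit else "true_miss"] += 1
    return counts


if __name__ == "__main__":
    for threshold in (0.5, 0.65, 0.8):
        print(threshold, evaluate(threshold, LABELLED_PAIRS))
```

Sweeping the threshold in this way shows the calibration trade-off in miniature: a low threshold serves cached answers to queries that only look similar, while a high threshold misses genuine rephrasings.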

Zafaryab Rasool

Applied Artificial Intelligence Institute, Deakin University

Scott Barnett

Applied Artificial Intelligence Institute, Deakin University

David Willie

Applied Artificial Intelligence Institute, Deakin University

Stefanus Kurniawan

Deakin University

Sherwin Balugo

Applied Artificial Intelligence Institute, Deakin University

Srikanth Thudumu

Deakin University

Mohamed Abdelrazek

Deakin University, Australia

Session Program

Mon 15 Apr
Displayed time zone: Lisbon

14:00 - 15:30	LLMs and Testing. Research and Experience Papers at Pequeno Auditório. Chair(s): Roland Weiss (ABB)

14:00 15m Talk		A Combinatorial Testing Approach to Hyperparameter Optimization (Distinguished Paper Award Candidate). Research and Experience Papers. Krishna Khadka (The University of Texas at Arlington), Jaganmohan Chandrasekaran (Virginia Tech), Jeff Yu Lei (University of Texas at Arlington), Raghu Kacker (National Institute of Standards and Technology), D. Richard Kuhn (National Institute of Standards and Technology)
14:15 15m Talk		Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs. Research and Experience Papers. Ziyu Li (University of Sheffield), Donghwan Shin (University of Sheffield)
14:30 10m Talk		LLMs for Test Input Generation for Semantic Applications. Research and Experience Papers. Zafaryab Rasool (Applied Artificial Intelligence Institute, Deakin University), Scott Barnett (Applied Artificial Intelligence Institute, Deakin University), David Willie (Applied Artificial Intelligence Institute, Deakin University), Stefanus Kurniawan (Deakin University), Sherwin Balugo (Applied Artificial Intelligence Institute, Deakin University), Srikanth Thudumu (Deakin University), Mohamed Abdelrazek (Deakin University, Australia)
14:40 10m Talk		(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. Research and Experience Papers. MA Wanqin (The Hong Kong University of Science and Technology), Chenyang Yang (Carnegie Mellon University), Christian Kästner (Carnegie Mellon University)
14:50 10m Talk		Welcome Your New AI Teammate: On Safety Analysis by Leashing Large Language Models. Research and Experience Papers. Ali Nouri (Volvo Cars & Chalmers University of Technology), Beatriz Cabrero-Daniel (University of Gothenburg), Fredrik Torner (Volvo Cars), Hakan Sivencrona (Zenseact AB), Christian Berger (Chalmers University of Technology, Sweden)
15:00 10m Talk		ML-On-Rails: Safeguarding Machine Learning Models in Software Systems – A Case Study. Research and Experience Papers. Hala Abdelkader (Applied Artificial Intelligence Institute, Deakin University), Mohamed Abdelrazek (Deakin University, Australia), Scott Barnett (Applied Artificial Intelligence Institute, Deakin University), Jean-Guy Schneider (Monash University), Priya Rani (RMIT University), Rajesh Vasa (Deakin University, Australia)
15:10 20m Live Q&A		Test - Q&A Session. Research and Experience Papers