"No Free Lunch" when using Large Language Models to Verify Self-Generated Programs (AIST 2024)

Mon 27 - Fri 31 May 2024 Canada

Who

Sol Zilberman, Betty H.C. Cheng

Track

AIST 2024

When

Tue 28 May 2024 11:00 - 11:22 at Room 1 - Session 2 (Papers 2)

Abstract

Large Language Models (LLMs) have shown great success in a wide range of text-generation tasks, including the synthesis of code from natural language descriptions. As LLM-based techniques continue to grow in popularity, especially amongst entry-level developers, LLM-generated code has the potential to be deployed in a diverse set of application domains. While LLMs can generate syntactically correct code output, recent work has shown the presence of nonsensical and faulty reasoning in LLM-generated text. As such, overreliance on LLMs for software generation may result in the deployment of faulty software, leading to critical system failures. This study explores the capabilities of a single LLM to generate both software and corresponding test suites from the same initial program descriptions, which can be considered analogous to an individual developer coding and unit testing for a given piece of software. We present an empirical framework and evaluation methodology to assess the usefulness of LLM-generated test cases for verifying programs generated by the same LLM. Our findings indicate that LLMs frequently generate irrelevant tests that suffer from numerous quality concerns.

Sol Zilberman

Michigan State University

Betty H.C. Cheng