Towards Evaluation Guidelines for Empirical Studies involving LLMs
In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process (e.g., for data annotation) or evaluate existing or new LLM-based tools. This paper contributes the first set of guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of what our community standards are for high-quality empirical studies involving LLMs.
Sat 3 May (displayed time zone: Eastern Time, US & Canada)
09:00 - 10:30

09:00 (15m) | Other | Welcome | WSESE
09:15 (50m) | Keynote | The Methodological Implications of Using Generative AI in Software Engineering Research | WSESE | Margaret-Anne Storey (University of Victoria)
10:05 (12m) | Talk | Towards Evaluation Guidelines for Empirical Studies involving LLMs | WSESE | Stefan Wagner (Technical University of Munich), Marvin Muñoz Barón (Technical University of Munich), Davide Falessi (University of Rome Tor Vergata), Sebastian Baltes (University of Bayreuth)
10:17 (13m) | Live Q&A | Keynote & ESE4ML: Discussion | WSESE