Benchmarking large language models for automated labeling: The case of issue report classification (ESEIW 2025 - ESEM - Journal First Track)

Who

Giuseppe Colavito, Filippo Lanubile, Nicole Novielli

Track

ESEIW 2025 ESEM - Journal First Track

Time Zone

The program is currently displayed in (GMT-10:00) Hawaii.

When

Thu 2 Oct 2025 14:20 - 14:35 at Kaiulani I - LLMs for Classification, Detection, and Recommendations Chair(s): Fabio Calefato

Abstract

Context: Issue labeling is a fundamental task for software development as it is critical for the effective management of software projects. This practice involves assigning a label to issues, such as bug or feature request, denoting a task relevant to the project management. To date, large language models (LLMs) have been proposed to automate this task, including both fine-tuned BERT-like models and zero-shot GPT-like models.

Objectives: In this paper, we investigate which LLMs offer the best trade-off between performance, response time, hardware requirements, and quality of the responses for issue report classification.

Methods: We design and execute a comprehensive benchmark study to assess 22 generative decoder-only LLMs and 2 baseline BERT-like encoder-only models, which we evaluate on two different datasets of GitHub issues.

Results: Generative LLMs demonstrate potential for zero-shot classification. However, their performance varies significantly across datasets and they require substantial computational resources for deployment. In contrast, BERT-like models show more consistent performance and lower resource requirements.

Conclusions: Based on the empirical evidence provided in this study, we discuss implications for researchers and practitioners. In particular, our results suggest that fine-tuning BERT-like encoder-only models achieves consistent, state-of-the-art performance across datasets, even when only a small amount of labeled data is available for training.

Link to Publication

https://www.sciencedirect.com/science/article/pii/S0950584925000977

Giuseppe Colavito

University of Bari

Italy

Filippo Lanubile

University of Bari

Italy

Nicole Novielli

University of Bari

Italy

Session Program

Thu 2 Oct
Displayed time zone: Hawaii

13:50 - 14:50	LLMs for Classification, Detection, and Recommendations. ESEM - Industry, Government, and Community Track / ESEM - Technical Track / ESEM - Emerging Results and Vision Track / ESEM - Journal First Track, at Kaiulani I. Chair(s): Fabio Calefato, University of Bari

13:50 15m Talk		Contribution History as a Key Feature in OSS Task Recommendation: an LLM-Based Empirical Study ESEM - Emerging Results and Vision Track Md Abdul Hannan Colorado State University, Mohammad Habibullah Rakib Colorado State University, Khondaker Masfiq Reza Colorado State University, Fabio Marcos De Abreu Santos Colorado State University, USA
14:05 15m Talk		Exploring LLMs for Stakeholder-Specific Insight Generation from Software Contracts ESEM - Industry, Government, and Community Track Jyoti Shukla TCS Research, Aditya Kahol TCS Research, Mohit Chaudhary TCS Research, Preethu Rose Anish TCS Research
14:20 15m Talk		Benchmarking large language models for automated labeling: The case of issue report classification ESEM - Journal First Track Giuseppe Colavito University of Bari, Italy, Filippo Lanubile University of Bari, Nicole Novielli University of Bari
14:35 15m Talk		Secret Breach Detection in Source Code with Large Language Models ESEM - Technical Track Md Nafiu Rahman Bangladesh University of Engineering and Technology, Sadif Ahmed Bangladesh University of Engineering and Technology, Zahin Wahab The University of British Columbia, S. M. Sohan Google Inc, Rifat Shahriyar Bangladesh University of Engineering and Technology Dhaka, Bangladesh