ICST 2025
Mon 31 March - Fri 4 April 2025 Naples, Italy

This program is tentative and subject to change.

Thu 3 Apr 2025, 11:30 - 11:45, at Room A - LLMs in Testing. Chair(s): Valerio Terragni

Large Language Models (LLMs) have become a focal point of research across various domains, including software engineering, where their capabilities are increasingly leveraged. Recent studies have explored the integration of LLMs into software development tools and frameworks, revealing their potential to enhance performance in text- and code-related tasks. The log level is a key part of a logging statement that allows software developers to control the information recorded during system runtime. Given that log messages often mix natural language with code-like variables, LLMs’ language translation abilities could be applied to determine the suitable verbosity level for logging statements. In this paper, we undertake a detailed empirical analysis to investigate the impact of model characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion. We opted for open-source models because they enable us to use in-house code while protecting sensitive information and maintaining data security. We examine several strategies, including zero-shot prompting, few-shot prompting, and fine-tuning, across different LLMs to identify the most effective combinations for accurate log level suggestions. Our study is supported by experiments conducted on 9 large-scale Java systems. The results indicate that although smaller LLMs can perform effectively with appropriate instructions and suitable techniques, there is still considerable room for improvement in their ability to suggest log levels.
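To make the zero-shot setting described in the abstract concrete, the minimal sketch below shows one way a log level suggestion prompt could be posed to an open-source model via the Hugging Face transformers pipeline. The model name, prompt wording, label set, and answer-parsing step are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal zero-shot sketch for log level suggestion (illustrative only;
# the model choice, prompt wording, and label set are assumptions, not
# the setup evaluated in the paper).
from transformers import pipeline

LOG_LEVELS = ["trace", "debug", "info", "warn", "error", "fatal"]

def build_prompt(method_body: str, log_statement: str) -> str:
    # Zero-shot: an instruction plus the code context, no labeled examples.
    return (
        "You are assisting with Java logging.\n"
        f"Choose one log level from {LOG_LEVELS} for the logging statement below.\n\n"
        f"Surrounding method:\n{method_body}\n\n"
        f"Logging statement:\nlogger.<LEVEL>({log_statement});\n\n"
        "Answer with a single word (the log level):"
    )

# Any small open-source instruction-tuned model could be substituted here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompt = build_prompt(
    method_body="void connect() { if (retries > MAX_RETRIES) { /* ... */ } }",
    log_statement='"Connection failed after {} retries", retries',
)

output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
# Take the first known level mentioned in the completion as the suggestion.
completion = output[len(prompt):].lower()
suggested = next((lvl for lvl in LOG_LEVELS if lvl in completion), "info")
print(suggested)
```

A few-shot variant would simply prepend a handful of labeled (code context, log level) pairs to the same prompt before the query statement.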


Thu 3 Apr

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:00 - 12:22
LLMs in Testing (Research Papers / Short Papers, Vision and Emerging Results) at Room A
Chair(s): Valerio Terragni University of Auckland
11:00
15m
Talk
Improving the Readability of Automatically Generated Tests using Large Language Models
Research Papers
Matteo Biagiola Università della Svizzera italiana, Gianluca Ghislotti Università della Svizzera italiana, Paolo Tonella USI Lugano
11:15
15m
Talk
Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation
Research Papers
Azat Abdullin JetBrains Research, TU Delft, Pouria Derakhshanfar JetBrains Research, Annibale Panichella Delft University of Technology
11:30
15m
Talk
Benchmarking Open-source Large Language Models For Log Level Suggestion
Research Papers
Yi Wen HENG Concordia University, Zeyang Ma Concordia University, Zhenhao Li York University, Dong Jae Kim DePaul University, Tse-Hsun (Peter) Chen Concordia University
11:45
15m
Talk
Understanding and Enhancing Attribute Prioritization in Fixing Web UI Tests with LLMs
Research Papers
Zhuolin Xu Concordia University, Qiushi Li Concordia University, Shin Hwei Tan Concordia University
12:00
15m
Talk
Benchmarking Generative AI Models for Deep Learning Test Input Generation
Research Papers
Maryam Maryam University of Udine, Matteo Biagiola Università della Svizzera italiana, Andrea Stocco Technical University of Munich, fortiss, Vincenzo Riccio University of Udine
Pre-print
12:15
7m
Talk
Leveraging Large Language Models for Explicit Wait Management in End-to-End Web Testing
Short Papers, Vision and Emerging Results
Dario Olianas DIBRIS, University of Genova, Italy, Maurizio Leotta DIBRIS, University of Genova, Italy, Filippo Ricca Università di Genova