Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework (SANER 2025 - Research Papers)

Who

Qiaolin Qin, Roozbeh Aghili, Heng Li, Ettore Merlo

Track

SANER 2025 Research Papers

Time Zone

The program is currently displayed in (GMT-05:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 6 Mar 2025 11:15 - 11:30 at L-1720 - Software Analysis & Recommendation Systems Chair(s): Brittany Reid

Abstract

Log parsing has long been studied by researchers, given its high importance in the software engineering community: the process identifies dynamic variables and constructs log templates with static components. Prior work has proposed many statistic-based log parsers (e.g., Drain), which are highly efficient; unfortunately, their parsing performance has hit a bottleneck in comparison to semantic-based log parsers, which require labeling and more computational resources. Meanwhile, we noticed that previous works on log parsing mainly focused on the parsing stage and usually used an ad hoc preprocessing step (e.g., masking numbers or IP addresses). However, we argue that both preprocessing and parsing are essential for log parsers to identify dynamic variables. The lack of understanding of log preprocessing may prevent the optimal use of log parsers and hinder future research in developing parsing algorithms and configuring their preprocessing step. Therefore, our work first studied the existing approaches for log preprocessing, in particular by analyzing the existing preprocessing steps used for different log datasets provided in Loghub, a popular log parsing benchmark. We then developed a preprocessing framework based on our findings and evaluated its impact on log parsing. According to our experiments, our preprocessing framework can significantly boost the overall performance of the four state-of-the-art statistic-based parsers examined in the study. The best statistic-based log parser, Drain, obtained improvements on all four parsing metrics (e.g., the F1 score of template accuracy, FTA, increases by 108.9%). Moreover, in comparison to the optimal semantic-based log parsers, it obtained a 28.3% improvement in grouping accuracy (GA), a 38.1% improvement in the F1 score of grouping accuracy (FGA), and an 18.6% improvement in FTA.
Our work pioneered the study of log preprocessing and provides a generalizable framework to enhance the state of the art in log parsing.
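
To make the "ad hoc preprocessing step (e.g., masking numbers or IP addresses)" mentioned in the abstract concrete, here is a minimal Python sketch of regex-based masking. It is not the framework proposed in the paper. The `<*>` placeholder, the two regular expressions, and the names `MASK`, `RULES`, and `preprocess` are assumptions chosen for illustration only.

```python
import re

# Hypothetical placeholder for a masked dynamic variable (an assumption,
# not taken from the paper).
MASK = "<*>"

# Illustrative masking rules, applied in order. IP addresses are masked
# before plain numbers so that an address is replaced as a single token.
RULES = [
    # IPv4 address with an optional port, e.g. 10.0.0.1 or 10.0.0.1:8080
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"),
    # Standalone integer or decimal number, not embedded in an identifier
    re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])"),
]


def preprocess(line: str) -> str:
    """Replace likely dynamic variables in a raw log line with MASK."""
    for rule in RULES:
        line = rule.sub(MASK, line)
    return line


if __name__ == "__main__":
    print(preprocess("Connection from 10.0.0.1:8080 closed after 35 ms"))
    print(preprocess("user 42 logged in from 192.168.1.5"))
```

With rules like these, two log lines that differ only in their numbers or addresses collapse to the same masked string before they reach the parser. Digits embedded in identifiers (for example `blk_123`) are deliberately left untouched in this sketch, which shows how much the choice of rules shapes what a parser later sees.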

Link to Preprint

https://arxiv.org/abs/2412.05254

Qiaolin Qin

Polytechnique Montréal

Canada

Roozbeh Aghili

Polytechnique Montréal

Canada

Heng Li

Polytechnique Montréal

Canada

Ettore Merlo

Polytechnique Montréal

Canada

Session Program

Thu 6 Mar
Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30	Software Analysis & Recommendation Systems. Research Papers / Industrial Track / Early Research Achievement (ERA) Track at L-1720. Chair(s): Brittany Reid (Nara Institute of Science and Technology)

11:00 15m Talk	A First Look at Package-to-Group Mechanism: An Empirical Study of the Linux Distributions (Research Papers). Dongming Jin (Key Lab of High-Confidence of Software Technologies (PKU), Ministry of Education), NIANYU LI (ZGC Lab, China), Kai Yang (Zhongguancun Laboratory), Minghui Zhou (Peking University), Zhi Jin (Peking University)
11:15 15m Talk	Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework (Research Papers). Qiaolin Qin (Polytechnique Montréal), Roozbeh Aghili (Polytechnique Montréal), Heng Li (Polytechnique Montréal), Ettore Merlo (Polytechnique Montréal). Pre-print: https://arxiv.org/abs/2412.05254
11:30 7m Talk	Boosting Large Language Models for System Software Retargeting: A Preliminary Study (Early Research Achievement (ERA) Track). Ming Zhong (SKLP, Institute of Computing Technology, CAS), Fang Lv (Institute of Computing Technology, Chinese Academy of Sciences), Lulin Wang, Lei Qiu (SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences), Hongna Geng (SKLP, Institute of Computing Technology, CAS), Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences), Xiaobing Feng (ICT CAS)
11:37 15m Talk	Analyzing Logs of Large-Scale Software Systems using Time Curves Visualization (Industrial Track). Dmytro Borysenkov, Adriano Vogel, Sören Henning (Johannes Kepler University Linz), Esteban Pérez Wohlfeil
11:52 15m Talk	Building Your Own Product Copilot: Challenges, Opportunities, and Needs (Industrial Track). Chris Parnin (Georgia Tech), Gustavo Soares (Microsoft), Rahul Pandita (GitHub, Inc.), Sumit Gulwani (Microsoft), Jessica Rich, Austin Henley (University of Tennessee)
12:07 15m Talk	Filter-based Repair of Semantic Segmentation in Safety-Critical Systems (Industrial Track). Sebastian Schneider, Tomas Sujovolsky, Paolo Arcaini (National Institute of Informatics), Fuyuki Ishikawa (National Institute of Informatics), Truong Vinh Truong Duy