Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework
Log parsing has long been studied by researchers, given its high importance in the software engineering community: the process identifies dynamic variables and constructs log templates from static components. Prior work has proposed many statistic-based log parsers (e.g., Drain), which are highly efficient; unfortunately, they have hit a bottleneck in parsing performance compared with semantic-based log parsers, which require labeling and more computational resources. Meanwhile, we noticed that previous work on log parsing mainly focused on the parsing stage and usually used an ad hoc preprocessing step (e.g., masking numbers or IP addresses). However, we argue that both preprocessing and parsing are essential for log parsers to identify dynamic variables. The lack of understanding of log preprocessing may prevent the optimal use of log parsers and hinder future research on developing parsing algorithms and configuring their preprocessing step. Therefore, our work first studied the existing approaches to log preprocessing, in particular by analyzing the preprocessing steps used for the different log datasets provided in Loghub, a popular log parsing benchmark. We then developed a general preprocessing framework based on our findings and evaluated its impact on log parsing. In our experiments, the framework significantly boosts the overall performance of the four state-of-the-art statistic-based parsers examined in the study. The best statistic-based log parser, Drain, improved on all four parsing metrics (e.g., the F1 score of template accuracy, FTA, increases by 108.9%). Moreover, compared with the best semantic-based log parsers, it obtained a 28.3% improvement in grouping accuracy (GA), a 38.1% improvement in the F1 score of grouping accuracy (FGA), and an 18.6% improvement in FTA.
Our work pioneers the study of log preprocessing and provides a generalizable framework to advance the state of the art in log parsing.
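As a rough illustration of the ad hoc preprocessing the abstract mentions (masking numbers or IP addresses before parsing), here is a minimal Python sketch. It is not the framework proposed in the paper: the rule list `MASK_RULES`, the placeholder tokens `<IP>`, `<HEX>`, and `<NUM>`, and the helper names `preprocess` and `templates` are all assumptions made for this example.

```python
import re

# Hypothetical masking rules, applied in order. These are illustrative
# assumptions, not the rules used by the paper's framework or by Loghub.
MASK_RULES = [
    # IPv4 address with an optional port, e.g. 10.0.0.1:8080
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    # Hexadecimal literal, e.g. 0x1F
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    # Standalone integer or decimal number, not glued to letters or dots
    (re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])"), "<NUM>"),
]


def preprocess(line: str) -> str:
    """Replace likely dynamic variables with placeholder tokens."""
    for pattern, token in MASK_RULES:
        line = pattern.sub(token, line)
    return line


def templates(lines):
    """Return the set of distinct masked lines (a crude template set)."""
    return {preprocess(line) for line in lines}
```

For instance, `preprocess("Connection from 10.0.0.1:8080 closed after 35 ms")` yields `Connection from <IP> closed after <NUM> ms`, so lines that differ only in these values collapse into one candidate template before a parser such as Drain sees them. The order of the rules matters: the IP rule runs first so that the number rule does not split an address into separate tokens.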
Thu 6 Mar
Displayed time zone: Eastern Time (US & Canada)
11:00 - 12:30 | Software Analysis & Recommendation Systems | Research Papers / Industrial Track / Early Research Achievement (ERA) Track | at L-1720 | Chair(s): Brittany Reid Nara Institute of Science and Technology
11:00 | 15m Talk | A First Look at Package-to-Group Mechanism: An Empirical Study of the Linux Distributions | Research Papers | Dongming Jin Key Lab of High-Confidence of Software Technologies (PKU), Ministry of Education, NIANYU LI ZGC Lab, China, Kai Yang Zhongguancun Laboratory, Minghui Zhou Peking University, Zhi Jin Peking University
11:15 | 15m Talk | Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework | Research Papers | Qiaolin Qin Polytechnique Montréal, Roozbeh Aghili Polytechnique Montréal, Heng Li Polytechnique Montréal, Ettore Merlo Polytechnique Montréal | Pre-print
11:30 | 7m Talk | Boosting Large Language Models for System Software Retargeting: A Preliminary Study | Early Research Achievement (ERA) Track | Ming Zhong SKLP, Institute of Computing Technology, CAS, Fang Lv Institute of Computing Technology, Chinese Academy of Sciences, Lulin Wang, Lei Qiu SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, Hongna Geng SKLP, Institute of Computing Technology, CAS, Huimin Cui Institute of Computing Technology, Chinese Academy of Sciences, Xiaobing Feng ICT CAS
11:37 | 15m Talk | Analyzing Logs of Large-Scale Software Systems using Time Curves Visualization | Industrial Track | Dmytro Borysenkov, Adriano Vogel, Sören Henning Johannes Kepler University Linz, Esteban Pérez Wohlfeil
11:52 | 15m Talk | Building Your Own Product Copilot: Challenges, Opportunities, and Needs | Industrial Track | Chris Parnin Georgia Tech, Gustavo Soares Microsoft, Rahul Pandita GitHub, Inc., Sumit Gulwani Microsoft, Jessica Rich, Austin Henley University of Tennessee
12:07 | 15m Talk | Filter-based Repair of Semantic Segmentation in Safety-Critical Systems | Industrial Track | Sebastian Schneider, Tomas Sujovolsky, Paolo Arcaini National Institute of Informatics, Fuyuki Ishikawa National Institute of Informatics, Truong Vinh Truong Duy