Developing a Taxonomy for Advanced Log Parsing Techniques
Logs play a crucial role in software engineering, supporting tasks such as debugging, system comprehension, failure prediction, and anomaly detection. However, their inherently unstructured nature makes it difficult to extract actionable insights. Numerous log parsing techniques have been developed, including machine learning, pattern recognition, and heuristic approaches, yet consistent accuracy remains elusive, particularly on complex and diverse log formats. In this study, we address this problem by analyzing in depth the characteristics of log events that lead to parsing errors. Using 16 log datasets and 8 distinct log parsers, we apply open coding to develop a comprehensive taxonomy of Log Event Characteristics (LECs) that frequently cause parsing inaccuracies. We evaluate how these characteristics affect different parsers and examine how the distribution of LECs contributes to the complexity of a dataset. The resulting taxonomy provides a foundation for developing more effective log parsing tools and offers insights for writing logs that are both machine-friendly and human-readable, ultimately improving system diagnostics and reliability.