How Much Logs Does My Source Code File Need? Learning to Predict the Density of Logs
Software logging is the practice of recording events that occur within a software system; it supports activities such as the analysis of system behaviour, failure prediction, and anomaly detection. However, determining where to place logging statements is a critical yet complex task, because striking the right balance between logging and system overhead is challenging: insufficient logging makes maintenance tasks difficult because crucial system execution data is missing, while excessive logging can mask the real issues and cause notable performance overhead. Prior work has proposed various machine learning-based solutions to suggest where to insert logging statements. However, before answering the question "where to log?", practitioners first need to determine whether a file needs logging in the first place. To this end, we conduct an empirical study to characterize the log density (i.e., the ratio of log lines to the total lines of code) in seven open-source software projects. We then propose a deep learning-based approach to predict the log density from syntactic and semantic features of the source code. We find that the percentage of files with at least one log line ranges from 5% to 33% across the studied projects. Additionally, the median log density in files with at least one log line ranges from 0.95% to 1.85% across the seven projects and can reach 18%. Furthermore, files without logs are less maintained and tend to have a lower median number of bugs than files with logs. These findings are consistent with the hypothesis that not all source code files require logging. Our log density models achieve an average accuracy of 84%, and our cross-project log density predictions show promising performance with an average accuracy of 72%, which represents over 86% (ratio of cross/within) of the corresponding within-project predictions using syntactic features. Our results show that we can accurately predict whether a file needs logging and that such predictions may generalize across projects.
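
As a rough illustration of the log density metric defined above (ratio of log lines to total lines of code), the following minimal sketch computes it for a single source file. It is not the extraction procedure used in the study: the `LOG_CALL` pattern, the `log_density` function name, and the choice to count every non-blank line (comments included) as a line of code are all assumptions made for illustration.

```python
import re

# Assumption: a "log line" is any line that calls a method named after a
# common log level on an object called log/logger (hypothetical heuristic).
LOG_CALL = re.compile(
    r"\b(?:log|logger|LOG|LOGGER)\s*\.\s*"
    r"(?:trace|debug|info|warn|warning|error|fatal)\s*\("
)


def log_density(source: str) -> float:
    """Return the ratio of log lines to non-blank lines in `source`."""
    # Assumption: every non-blank line counts as a line of code.
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    log_lines = sum(1 for line in lines if LOG_CALL.search(line))
    return log_lines / len(lines)


if __name__ == "__main__":
    sample = 'int x = 1;\nLOG.info("start");\n\nx += 1;\nreturn x;\n'
    print(f"log density: {log_density(sample):.2%}")
```

Under this sketch, a file with a density of 0 is one that contains no logging statements at all, which is the case the file-level prediction is meant to separate from files that do need logging.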