
Existing methods for detecting anomalies in log data face significant challenges: (1) the current state-of-the-art method is a supervised approach that requires thousands of abnormal examples, which are impractical to obtain, to reach its reported performance; (2) self-supervised methods rely only on normal examples but depend heavily on error-prone log parsers and fixed log templates, even though log data evolves constantly. To overcome these challenges, we reformulate log-based anomaly detection as a semantic similarity task. We generate pairwise similarity scores with a pre-trained language model, augment them with ground-truth binary labels, and use the resulting targets to supervise a student encoder model trained for semantic similarity. We evaluate our method on several log datasets commonly used for benchmarking anomaly detection baselines. Our method improves the F1 score of supervised approaches by 0.24–0.78 when trained with a realistic number of abnormal examples (100), and achieves results comparable to the best-performing model when trained on the entire set of anomalies (>200K abnormal samples). We also outperform self-supervised methods across datasets without relying on template extraction or fixed vocabularies. Our code and trained models will be made publicly available upon acceptance.
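
To make the training recipe concrete, the sketch below illustrates the kind of setup the abstract describes, using the sentence-transformers library: a pre-trained teacher scores the semantic similarity of log-line pairs, the scores are augmented with ground-truth binary labels, and a student encoder is trained to reproduce the combined targets. This is not the authors' released code; the teacher and student model names, the label-blending rule (a simple average), and the toy log pairs are all illustrative assumptions.

```python
# Minimal sketch of similarity-score distillation for log anomaly detection.
# Assumptions (not from the paper): model choices, the 50/50 blending rule,
# and the toy training pairs below.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

teacher = SentenceTransformer("all-mpnet-base-v2")   # assumed teacher model
student = SentenceTransformer("all-MiniLM-L6-v2")    # assumed smaller student

# Toy log-line pairs; label = 1 if both lines share the same anomaly status.
pairs = [
    ("Connection closed by peer", "Connection reset by peer", 1),
    ("Connection closed by peer", "FATAL: checksum mismatch on block", 0),
]

train_examples = []
for a, b, same_class in pairs:
    # Teacher cosine similarity serves as a soft target...
    emb = teacher.encode([a, b], convert_to_tensor=True)
    soft = util.cos_sim(emb[0], emb[1]).item()
    # ...augmented with the ground-truth binary label (simple averaging here;
    # the combination rule in the actual method may differ).
    target = 0.5 * soft + 0.5 * float(same_class)
    train_examples.append(InputExample(texts=[a, b], label=target))

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(student)  # regress pairwise cosine scores
student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```

At inference time, one plausible use of such a student is to embed incoming log lines and flag those whose similarity to known-normal lines falls below a threshold, with no log parsing or template extraction in the loop.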