Intelligent Triage: Interpretable Incident Triage Workflow using LLM Extracted Triage Reasoning
Large-scale cloud services frequently experience incidents that significantly affect their stability. Incorrectly transferring incidents can be extremely costly, which makes triage automation valuable. Previous methods primarily search for similar past triage solutions via machine learning and information retrieval. However, similar end-to-end solutions either do not exist or are extremely hard to retrieve; yet when we break down past solutions, their step-wise reasoning logic can be reused. Another critical aspect to consider is the incident management ecosystem, including upstream tools that perform incident analysis (e.g., data analysis and cleaning) and downstream teams that handle the triaged incidents. Upstream tools rely on triage feedback to pinpoint their errors, while downstream teams rely on the triage justification to learn an incident's context. Providing concise and explainable triage will greatly improve the efficiency of these tools and teams. These considerations motivate us to decompose an end-to-end triage solution into a refined unit of triage reasoning called a triage rule. The Intelligent Triage workflow uses the triage rule as its core data representation and can offer justifications to upstream tools and downstream teams while ensuring high accuracy. We have deployed the Intelligent Triage workflow across various services in Microsoft Azure, where it has been in continuous operation for more than six months. Offline and online evaluations show that it achieves an accuracy 10 percentage points higher and a reduced Time to Mitigation (TTM). We conducted comprehensive case studies to analyze and validate the results.