FaultWeave: Bounded Resilience Testing with Failure Diagnosis Capability for Microservice Applications
Microservice architecture has become the de facto standard for developing cloud-native applications, yet its complex inter-service dependencies make systems highly vulnerable to cascading failures. Resilience testing, which validates system behavior under injected faults, is therefore critical to improving the robustness of target systems. We present FaultWeave, a practical and effective resilience testing framework with failure diagnosis capability for microservice applications. Building on the small-scope hypothesis, FaultWeave uses an efficient fault space exploration technique that incrementally explores fault combinations up to a bounded depth, reusing previous fault injection results to speed up test execution and prune redundant test scenarios. This incremental strategy naturally identifies Minimal Failure Sets (MFS), the smallest fault combinations that trigger resilience failures, which provide structured differential profiles for LLM-assisted failure diagnosis. A three-month industrial deployment on an enterprise-level cloud-native application comprising 512 microservices discovered 237 resilience vulnerabilities, 89% of which required multi-fault scenarios to trigger, a critical blind spot for traditional single-fault testing. The evaluation demonstrates significant improvements in testing efficiency and in the number of resilience failures discovered compared to existing manual practices.
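To make the bounded, incremental exploration concrete, the following is a minimal sketch and not FaultWeave's actual implementation. It assumes that failures are monotonic (any superset of a failing fault combination also fails), so supersets of a known MFS can be skipped. The function `find_minimal_failure_sets`, the oracle `toy_oracle`, and the fault names are hypothetical and exist only for illustration. Reuse of earlier injection results to speed up execution is not modeled here; only the pruning of redundant scenarios is shown.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Fault = str
FaultSet = FrozenSet[Fault]


def find_minimal_failure_sets(
    faults: Iterable[Fault],
    triggers_failure: Callable[[FaultSet], bool],
    max_depth: int,
) -> List[FaultSet]:
    """Explore fault combinations of size 1..max_depth, smallest first.

    Assumption (not from the paper): failures are monotonic, so any
    superset of a known minimal failure set is redundant and skipped.
    """
    universe = sorted(set(faults))
    minimal_sets: List[FaultSet] = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(universe, depth):
            candidate = frozenset(combo)
            # Prune: this combination contains an already-found MFS.
            if any(mfs <= candidate for mfs in minimal_sets):
                continue
            # Every proper subset was tested earlier and passed, so a
            # failure here means the candidate is itself minimal.
            if triggers_failure(candidate):
                minimal_sets.append(candidate)
    return minimal_sets


# Hypothetical oracle standing in for a real fault injection run:
# the system fails if auth crashes, or if the database is slow
# while the cache is down.
def toy_oracle(active: FaultSet) -> bool:
    return "auth_crash" in active or {"db_latency", "cache_down"} <= active


faults = ["auth_crash", "cache_down", "db_latency", "net_partition"]
mfs_found = find_minimal_failure_sets(faults, toy_oracle, max_depth=2)
# mfs_found holds {"auth_crash"} and {"cache_down", "db_latency"}
```

In this toy run, the three two-fault combinations containing `auth_crash` are never executed, so the oracle is invoked 7 times instead of 10. The second set found requires two simultaneous faults, which is the kind of scenario that single-fault testing would miss.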