Network partitions are inevitable in large-scale cloud systems. Despite developers’ efforts to handle network partitions throughout the design, implementation, and testing of cloud systems, bugs caused by network partitions, i.e., partition bugs, still exist and cause severe failures in production clusters. Exposing these partition bugs is challenging because they often require network partitions to start and stop at specific times.
In this paper, we propose Consistency-Guided Fault Injection (CoFI), a novel technique that selectively injects network partitions to effectively expose partition bugs. We observe that network partitions can leave cloud systems in inconsistent states, where partition bugs are more likely to occur. Based on this observation, CoFI first infers invariants (i.e., consistent states) among different nodes in a cloud system. Once it observes a violation of the inferred invariants (i.e., an inconsistent state) while running the cloud system, CoFI injects network partitions to prevent the cloud system from recovering to a consistent state, and thoroughly tests whether the cloud system still proceeds correctly in the inconsistent state. We have applied CoFI to three widely deployed cloud systems, i.e., Cassandra, HDFS, and YARN. CoFI has detected 7 previously unknown bugs, three of which have been confirmed by developers.
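The loop described above (check an invariant, and on a violation cut the network before the system can repair itself) can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not CoFI's implementation: the two-replica key-value store, the hand-written invariant "both replicas hold the same data" (CoFI infers its invariants rather than taking them as given), and every name in the code (`ToyCluster`, `invariant_holds`, `run_with_guided_injection`, `stale_read_probe`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    store: dict = field(default_factory=dict)


class ToyCluster:
    """Hypothetical two-replica key-value store with asynchronous replication."""

    def __init__(self):
        self.primary = Node("primary")
        self.backup = Node("backup")
        self.in_flight = []       # replication messages not yet delivered
        self.partitioned = False  # True while the link between replicas is cut

    def write(self, key, value):
        # The primary applies the write at once and replicates it later.
        self.primary.store[key] = value
        self.in_flight.append((key, value))

    def deliver(self):
        # Deliver pending replication messages unless the link is cut.
        if self.partitioned:
            return
        for key, value in self.in_flight:
            self.backup.store[key] = value
        self.in_flight.clear()


def invariant_holds(cluster):
    # Assumed invariant: both replicas hold identical data.
    return cluster.primary.store == cluster.backup.store


def stale_read_probe(cluster):
    # Toy check: does the backup serve a value that differs from the primary's?
    for key, expected in cluster.primary.store.items():
        if cluster.backup.store.get(key) != expected:
            return f"stale read of {key!r} from backup"
    return None


def run_with_guided_injection(cluster, workload, probe):
    findings = []
    for key, value in workload:
        cluster.write(key, value)
        if not invariant_holds(cluster):
            # Inconsistent state observed: cut the link so the system
            # cannot recover, then exercise it in that state.
            cluster.partitioned = True
            cluster.deliver()  # no effect while partitioned
            result = probe(cluster)
            if result is not None:
                findings.append(result)
            cluster.partitioned = False  # heal the partition
        cluster.deliver()
    return findings


cluster = ToyCluster()
findings = run_with_guided_injection(
    cluster, [("x", 1), ("y", 2)], stale_read_probe
)
print(findings)
```

In this sketch the partition is injected only in the window where the replicas disagree, rather than at random times, which is the intuition behind guiding fault injection by consistency. Once the partition heals and replication completes, the assumed invariant holds again.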
Wed 23 Sep (times are displayed in UTC, Coordinated Universal Time)
|01:10 - 01:30|
|01:30 - 01:50|
Michael C. Gerten (Iowa State University), James I. Lathrop (Iowa State University), Myra Cohen (Iowa State University), Titus H. Klinge (Drake University)
|01:50 - 02:00|