Scalable Data-Flow Modeling and Validation of Distributed-Memory Algorithms
Distributed-memory programs that use the Message Passing Interface (MPI) often introduce various kinds of correctness anomalies. This work focuses on the type of anomalies detectable through data-flow modeling. We present a new tool and Domain-Specific Language to describe the data-flow of computations based on collective operations, such as the broadcast or all-gather in MPI. Our tool, CollectCall, models key aspects of distributed-memory algorithms, namely the processor space, symbolic communicators, data, its partitioning and mapping, and a set of communication primitives. Using these concepts, we build constraint systems that model the initial data placement and communication steps of the algorithm. Systems are built and solved with the Z3 SMT and the Integer Set Library (ISL) to decide the correctness of sequences of collective operations. We formalize the correctness requirements for a class of collective communication operations, and demonstrate the effectiveness of our approach on several micro-benchmarks and on well-known distributed algorithms from the literature while comparing against ITAC, MPI-Checker and PSE, state-of-the-art tools.