Abstract. Dynamic verification methods are the natural choice for debugging real world programs when model extraction and maintenance are expensive. Message passing programs written using the MPI library fall under this category. Partial order reduction can be very effective for MPI programs because for each process, all its local computational steps, as well as many of its MPI calls, commute with the corresponding steps of all other processes. However, when dependencies arise among MPI calls, they are often a function of the runtime state. While this suggests the use of dynamic partial order reduction (DPOR), three aspects of MPI make previous DPOR algorithms inapplicable: (i) many MPI calls are allowed to complete out of program order; (ii) MPI has global synchronization operations (e.g., barrier) that have a special weak semantics; and (iii) the runtime of MPI cannot, without intrusive modifications, be forced to pursue a specific interleaving because of MPI's liberal message matching rules, especially pertaining to 'wildcard receives'. We describe our new dynamic verification algorithm 'POE' that exploits the out of order completion semantics of MPI by delaying the issuance of MPI calls, issuing them only according to the formation of match-sets, which are ample 'big-step' moves. POE guarantees to manifest any feasible interleaving by dynamically rewriting wildcard receives by specific-source receives. This is the first dynamic model-checking algorithm with reductions for (a large subset of) MPI that guarantees to catch all deadlocks and local assertion violations, and is found to work well in practice.
Introduction
MPI [1] programs are an important class of concurrent programs used for the distributed programming of virtually all high performance computing clusters in the world. MPI will also be widely used for programming peta-scale supercomputers under construction [2] . Typical MPI programs are C programs (or C++/Fortran programs) that create a fixed number of processes at inception. These processes then perform computations in their private stores, invoking various flavors of send and receive API functions in the MPI library to exchange data, and also invoke global synchronization operations in the MPI library. Most MPI programs create processes that eventually terminate.
MPI programs can contain many types of errors, including deadlocks, local assertion violations, resource leaks, and numerical inaccuracies. The primary goal of our work is to develop efficient methods to detect deadlocks and local assertion violations in MPI programs. Dynamic verification methods are the natural choice for verifying MPI programs because model extraction and model maintenance of MPI programs can be very expensive. This paper presents the first dynamic verification algorithm called POE (Partial Order reduction avoiding Elusive interleavings) for MPI that guarantees soundness (within the practical limits of runtime verification) and employs an effective partial order reduction algorithm. Of the many features of POE, the manner in which it guarantees coverage and implements reduction during dynamic verification are our main contributions. A good partial order reduction approach is crucial for verifying MPI programs because these programs mostly perform their computations in private stores, invoking MPI operations for message exchanges, where most (but not all) of these operations commute. Also, MPI calls occur with a high static and dynamic frequency, thanks to the many for loops in which MPI calls occur.
In this context, our verification tool ISP that uses the POE algorithm detects deadlocks missed by existing state-of-the-art tools. While the MPI 2.0 library itself supports over 300 MPI functions, ISP can handle 24 of the most commonly used MPI functions. In this paper, we describe the handling of five of these functions, namely MPI_Isend, MPI_Irecv, MPI_Barrier, MPI_Wait, and MPI_Test, and refer to them as 'send, receive, barrier, wait, and test.' Send and receive are non-blocking operations, meaning that the issuing process can start the activity and proceed to execute later instructions while the send/receive proceeds in the background. The primary arguments of send are the destination process (this may not be a compile-time constant), the data being shipped, and a 'handle.' (Note: We do not detail some of the function arguments allowed by MPI calls, such as MPI 'tags' that affect message matching. Our implementation handles all allowed MPI arguments.) The issuing process may wait on the handle or test the handle. A wait blocks till the send operation finishes, while test returns false unless the send has finished, at which time it returns true. A send is deemed to have finished when the background process of copying the message out of the memory space of the sending process has finished. The arguments of receive are the source process ID (not necessarily a compile-time constant), the data receipt buffer, and a handle (with a semantics similar to the send handle). Instead of specifying a specific source process, receive can also mention '*', which is a wildcard receive that is open for receipt from any send that targets the receiving process. In effect, send and receive are split operations.
When an MPI process invokes a sequence of MPI calls, some of the calls may complete out of program order. For instance, if a process P0 invokes two consecutive non-blocking send operations targeting P1 and P2 respectively, the second send is allowed to finish before the first one (especially if the second send is shipping a much smaller amount of data). However, if both sends target the same process, they must be matched in program order.
Dynamic verification with persistent set based reductions was introduced in [5]. The dynamic partial order reduction algorithm (DPOR) [6] allows dependencies among transitions to be accurately computed based on runtime state. This algorithm works by generating one interleaving of the program (maintained as a stack trace) and generating its interleaving variants. It ensures that the set of transitions explored from a state $s$ forms a persistent set as follows. Consider the transitions $t_i$ and $t_j$ of processes $p_i$ and $p_j$ respectively such that $i < j$ in the current interleaving (this means that in the current interleaving $t_i$ is executed before $t_j$). If $t_i$ and $t_j$ are dependent (i.e., $t_i$ can either enable or disable $t_j$ and vice versa), and $t_i$ and $t_j$ are co-enabled, then $p_j$ is added to the pre-state of $t_i$ in the hope of eventually executing $t_j \ldots t_i$. This approach does not work with MPI, as explained with the help of a short example (Figure 1).
In this example, MPI processes P0 and P2 are targeting P1, which entertains a 'wildcard match,' i.e., can receive from any process that has a concurrently enabled Isend targeting P1. As soon as one such send is chosen (say P0's), the other send is not eligible to match with this receive of P1 (it has to match another receive of P1 coming later). This disabling behavior of the sends induces a dependency between them, as can be seen from the fact that the particular send that matches may or may not cause error1 to be triggered. Consider some $i < j < k$, and a trace $t$ where the $i$th action of $t$, namely $t_i$, is P2's send, and similarly $t_j$ is P1's receive, and $t_k$ is P0's send. In this trace, it is not necessary that P1's receive is matched with P2's send just because $t_i$ is executed before $t_k$. MPI implements its own buffering mechanism that can cause one send to race ahead of the other send. Formally, unlike in DPOR, MPI's program order does not imply happens-before [7] in an MPI program's execution. Hence, it is possible that $t_j$ is matched with $t_k$. There is no way in an MPI run-time (short of making intrusive modifications to the MPI library, which is often impossible because of the proprietary nature of the libraries) to force a match either way (both sends matching the receive in turn) by just changing the order of executing sends from P2 and P0.
Roadmap: Section 2 presents an overview of POE and discusses related work. Section 3 presents POE formally. Section 4 provides a summary of experimental results. Section 5 concludes the paper.
Figure 2, and requires special considerations in the design of POE. In this example, one MPI_Isend issued by P0, shown as S0, and another issued by P2, shown as S2, target a wildcard receive issued by P1. The following execution is possible: (i) S0(to P1, h0) is issued, (ii) R(*, h1) is issued, (iii) each process fully executes its own barrier (B0, B1, or B2), and this "collective operation" finishes (all the B's indeed form an atomic set of events), (iv) S2(to P1, h2) is issued, (v) now both sends and the receive are alive, and hence S0 and S2 become dependent, requiring a dynamic algorithm to pursue both matches. Notice that S0 can finish after B0 and R can finish after B1. (Note: Because this barrier is placed after P0's send and P1's receive, but before P2's send, we sometimes refer to such barriers as 'crooked barriers.') To recapitulate, MPI respects program ordering between any MPI operation x ∈ {barrier, wait, test} and the MPI operation immediately following x in program order. A dynamic verification algorithm for MPI must therefore maintain a completes-before relation ≺ (defined in Section 3.2), and use it to determine, at runtime, all senders that can match a wildcard receive.
Overview of POE, and Related Work

POE Algorithm:
We now present an overview of POE, as implemented by our verification scheduler (called the POE scheduler) that can intercept MPI calls and send them into the MPI run-time as and when needed:
• The POE scheduler executes C program statements along each process. All C statements are executed in program order. When the scheduler encounters an MPI operation, it simply records this operation, but does not execute it. This process continues till the scheduler arrives, within each process, at an MPI operation that is program ordered with respect to some previously collected (but not issued) MPI operation (we call these points fences).
• While at a fence point for all processes, since all senders that match a wildcard receive are known, rewrite the receives into specific receives. In our example, R(*) is rewritten into R(from P0) and R(from P2).
• Form match-sets. Each match-set is either a single big-step move (as in operational semantics) or a set of big-step moves. Each big-step move is a set of actions that are issued collectively into the MPI run-time by the POE scheduler (we enclose them in ⟨ . . . ⟩). In our example, the match-sets are ⟨B0, B1, B2⟩ and { ⟨S0(to P1), R(from P0)⟩, ⟨S2(to P1), R(from P2)⟩ }.
• Execute the match-sets in priority order, with all big-step moves executed first. The execution of a big-step move consists of executing all its constituent MPI operations. When no more big-step moves are left, then for each remaining set of big-step moves, recursively explore (according to depth-first search) all the big-step moves contained in it. In our example, this results in the big-step move ⟨B0, B1, B2⟩ being performed first. Subsequently, both the big-step moves in { ⟨S0(to P1), R(from P0)⟩, ⟨S2(to P1), R(from P2)⟩ } are pursued.
Thus, one can notice that POE never actually issues into the MPI run-time any wildcard receive operations it encounters. It always dynamically rewrites these operations into receives with specific sources, and pursues each specific receive paired with the corresponding matching send as a match-set in a depth-first manner.
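The overview above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not ISP's implementation: the tuple encoding of operations and the names `rewrite_wildcard` and `explore` are hypothetical, and the run-time itself is not modeled. The sketch shows the two ideas of the overview: the barrier big-step move is taken first, and the wildcard receive is never issued as such but is rewritten into one specific-source receive per candidate sender, each of which becomes a backtrack point.

```python
from typing import List, Tuple

# Hypothetical encoding (not ISP's data structures):
#   ("S", src, dst)          non-blocking send from src to dst
#   ("R", dst, src_or_None)  receive posted by dst; None stands for the wildcard '*'
#   ("B", rank)              barrier call by rank
Op = Tuple

BARRIER_MOVE: List[Op] = [("B", 0), ("B", 1), ("B", 2)]
PENDING_SENDS: List[Op] = [("S", 0, 1), ("S", 2, 1)]
WILDCARD_RECV: Op = ("R", 1, None)


def rewrite_wildcard(recv: Op, sends: List[Op]) -> List[List[Op]]:
    """Return one big-step move [send, specific receive] per send targeting recv's owner."""
    _, owner, _ = recv
    return [[s, ("R", owner, s[1])] for s in sends if s[2] == owner]


def explore() -> List[List[List[Op]]]:
    """Return the explored schedules: the barrier move first, then one wildcard match."""
    schedules = []
    for match in rewrite_wildcard(WILDCARD_RECV, PENDING_SENDS):  # backtrack point
        schedules.append([BARRIER_MOVE, match])
    return schedules


if __name__ == "__main__":
    for schedule in explore():
        print(schedule)
```

In this model every explored schedule contains only specific-source receives, which mirrors the observation that wildcard receives are never issued into the run-time.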
Additional Points About Barriers: It must be observed that the code snippet in Figure 1 can be verified with DPOR if the technique of dynamic rewriting of the wildcard receives is employed. However, the code snippet in Figure 2 cannot be verified with DPOR even with dynamic rewriting of wildcard receives employed. Due to the presence of the barrier, the send S2 can never be executed before the send S0, whereas in DPOR, we will need dependent actions to be replayable in both orders. In any interleaving of this example, however, S0 will always be issued before S2. The POE algorithm overcomes this problem by executing the big-step move ⟨B0, B1, B2⟩, and then forming the match-set { ⟨S0(to P1), R(from P0)⟩, ⟨S2(to P1), R(from P2)⟩ }.
Related Work
In [8], it was observed that DPOR may offer a way to determine, at runtime, which sends and receives can match in MPI programs. However, since no dynamic verification tool was built, the issues discussed in Section 1 pertaining to the difficulties of forcing specific send/receive matches were not faced. In [9], nothing more than the standard DPOR of [6] was needed, as we handled only some of the shared memory features of MPI for which a DPOR-like approach works. In our 2-page tools paper [10], we actually implemented DPOR for many of MPI's communication commands, and in the process observed the unsoundness resulting from our inability to force specific send/receive matches. The POE algorithm takes advantage of our formal understanding of MPI (as captured in an extensive TLA+ model for MPI we are building [11]), precisely builds the completes-before relation ≺, uses it to discover potential send/receive matches precisely, and employs dynamic rewriting to force desired matches.
Fig. 3. An Example MPI Program
While MPI-SPIN [12, 13, 14] , which is based on SPIN [15] , can detect the kinds of errors that POE can detect, this approach inherently requires major effort on the part of users in building, by hand, verification models of their MPI programs in Promela [15] . Given the extensive number of C constructs and user-level library calls used in writing many MPI programs, this effort is impractical in those cases. MPI-SPIN does provide a reduction algorithm called the Urgent Algorithm that allows all MPI send/receive channels to be treated as rendezvous channels. However, this algorithm applies only to programs that do not use wildcard receives (which are extensively used by many MPI program types). In general, MPI-SPIN relies on SPIN's POR algorithm which, unfortunately, does not "understand" the commuting properties of MPI calls. In its favor, MPI-SPIN supports a symbolic execution facility to compare a sequential algorithm against an MPI implementation of the algorithm to detect numerical inaccuracies -a feature not supported by ISP.
Other works [16, 17, 18, 19] do not seem to run into the problems we run into with MPI, including out-of-order completion, barriers, split operations, or runtime scheduling realities.
The plethora of concurrency libraries catering to 'multicore programming' suggests that dealing with complex APIs will become important. Yet, most tools in this area are based on the conventional 'testing' approach. ISP can now handle 24 MPI function types (detailed on our website). We have successfully handled all 69 examples in the Umpire [4] tool distribution. These are examples for which Umpire itself, and approaches such as Jitterbug [20] do not offer coverage guarantees (conventional verification tools for MPI that we surveyed [21] are unsound). Inserting randomized 'padding' delays to potentially perturb MPI's internal schedules (as done in ConTest [22] , Jitterbug, Marmot [3] , and Umpire) is highly unreliable, and slows down testing by adding delays into computational paths. For instance, for many of our examples containing wildcard receives provided on our website, Marmot missed generating many feasible schedules that actually contain deadlocks.
Formal Presentation of POE
Abstract Syntax
Let $Nat = \{0, 1, 2, \ldots\}$, $Bool = \{0, 1\}$, and $Bool_\bot = \{0, 1, \bot\}$. Given $P \in Nat$ MPI programs, their $PID$ ("MPI rank" of each process) set is $\{1 \ldots P\}$, and $PID_*$ is the set $\{1 \ldots P\} \cup \{*\}$ ($*$ is to model 'wildcard receives'; see below). Let $L \in PID \to Nat$ be the lengths of the given programs, each program being viewed as a sequence of instructions. For any function $f$, its application to any argument $i$, $f(i)$, is often written $f_i$ for brevity; for example $L(1)$ (often written $L_1$) is the length of the first program. Also, a function $f$ of two arguments can be applied to two arguments $i$ and $j$, written $f_{i,j}$, or partially applied to one argument $i$, written $f_i$ (this partial application returns a function which later "expects" a $j$). Let $p \in PID \to Nat \to I$ (where $I$ is the set of MPI instructions defined in the sequel) be the programs. Thus $p_1 \ldots p_P$ are the $P$ programs, and the $j$th instruction of the $i$th program is $p_{i,j}$. Let $l \in PID \to Nat$ be the program counters (PC).
Let $h \in PID \to Nat \to Bool_\bot$ be the handles $h_1 \ldots h_P$. In our formal model, every instruction has a handle; it is only the case that $W$ and $T$ (MPI wait and test instructions defined in the sequel) happen to use this handle in a specific way. Handle $h_{i,j}$ is initially $\bot$. In our description of POE, we use the setting of $h_{i,j}$ to 0 to model POE encountering (collecting) instruction $any_{i,j}(\ldots)$ in program order, and the setting to 1 to model POE issuing (executing) this instruction. POE will (i) set $h_{i,j}$ to 1 out of program order (but still correctly so according to ≺), and (ii) dynamically rewrite the wildcards before forming match-sets and executing them. The total system state is $\langle l, h \rangle$ (we keep track of the PC values and the handle array status).
The set of MPI instructions $I$ is the smallest set that includes the following: Barrier, written $B_{i,j}$; Send, written $S_{i,j}(k, \langle i,j \rangle)$, where $k \in PID$ is the process targeted, and $\langle i,j \rangle$ is the handle used to track the progress of this Send; and Receive, written $R_{i,j}(k, \langle i,j \rangle)$, where $k \in PID_*$ is the process from which the message is sourced ($*$ means 'wildcard receive,' i.e., the message is sourced from any process), and $\langle i,j \rangle$ is the handle (as with send) to track the progress of this Receive. We do not show the data payloads for sends $S$ and receives $R$; when needed in discussions, they will be shown as a third argument. For $S$ (send) and $R$ (receive), their handle $\langle i,j \rangle$ is used by a following $W$ instruction, or tested by a following $T$ instruction (not required to exist by the MPI standard, and we also do not require the $W$/$T$ to exist). $I$ also includes Wait, written $W_{i,j}(\langle m,n \rangle)$, where $\langle m,n \rangle$ refers to a handle. $W_{i,j}(\langle m,n \rangle)$ blocks till $h_{m,n}$ is set to 1. This event occurs when the instruction which set $h_{m,n}$ to 0 finishes. (This earlier instruction is an $S$ or $R$.) $I$ also includes Test, written $T_{i,j}(\langle m,n \rangle, l)$, where $\langle m,n \rangle$ refers to a handle and $l$ is a PC. $T_{i,j}(\langle m,n \rangle, l)$ blocks till $h_{m,n}$ is set to 1, and this occurs when the instruction that set $h_{m,n}$ to 0 (an earlier $S$ or $R$) finishes, in which case the control transfers to the new PC $l$. Finally, $I$ includes a conditional goto to model loops (space prevents further discussion of goto and $T$). Figure 3 illustrates our syntax. Process P1 has seven sequential commands, and P2 and P3 have five each. All proper MPI programs start with MPI_INIT and terminate with MPI_FINALIZE, and both of these essentially have the semantics of a barrier. Thus, the set $B_{1,1}$, $B_{2,1}$, and $B_{3,1}$ models MPI_INIT. Likewise, the set $B_{1,7}$, $B_{2,5}$, and $B_{3,5}$ models MPI_FINALIZE. The set $B_{1,3}$, $B_{2,2}$, and $B_{3,3}$ is a 'crooked barrier'.
Thus, notice that the two sends $S_{2,3}(1, \langle 2,3 \rangle)$ and $S_{3,2}(1, \langle 3,2 \rangle)$ both target P1, and they can both potentially match with $R_{1,2}(*, \langle 1,2 \rangle)$.
Illustration:
In this example, if $R_{1,4}(*, \langle 1,4 \rangle)$ is changed to $R_{1,4}(2, \langle 1,4 \rangle)$, it is possible that $R_{1,2}(*, \langle 1,2 \rangle)$ matches $S_{2,3}(1, \langle 2,3 \rangle)$, and then $S_{3,2}(1, \langle 3,2 \rangle)$ cannot match $R_{1,4}(2, \langle 1,4 \rangle)$ (this receive expects a message from P2, not P3). This results in a deadlock. Such deadlocks cannot be detected through static analysis alone, because in MPI, send targets (i.e., the 1 in $S_{3,2}(1, \langle 3,2 \rangle)$) and receive sources can be computed at runtime.
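The deadlock scenario just described can be reproduced with a small exhaustive search. The following Python sketch is a simplification under stated assumptions: it models only P1's two receives and the one pending send each from P2 and P3, ignores barriers and waits, and treats a receive with no eligible sender as a deadlock. The function name `explore_matches` is hypothetical and is not part of ISP.

```python
from typing import List, Optional, Tuple


def explore_matches(
    receives: List[Optional[int]], sends: List[int]
) -> List[Tuple[List[int], bool]]:
    """Enumerate every way P1's receives (in program order) can match pending sends.

    receives: source rank of each receive, None meaning the wildcard '*'.
    sends: ranks of the processes that each have one pending send to P1.
    Returns (sequence of matched senders, deadlocked?) pairs.
    """
    if not receives:
        return [([], False)]
    first, rest = receives[0], receives[1:]
    candidates = [s for s in sends if first is None or s == first]
    if not candidates:
        return [([], True)]  # this receive can never complete
    results = []
    for src in candidates:
        remaining = list(sends)
        remaining.remove(src)  # the chosen send is consumed by this receive
        for matched, dead in explore_matches(rest, remaining):
            results.append(([src] + matched, dead))
    return results


# Original program: both receives of P1 are wildcards.
original = explore_matches([None, None], [2, 3])
# Variant from the text: the second receive names P2 as its source.
variant = explore_matches([None, 2], [2, 3])

if __name__ == "__main__":
    print(original)
    print(variant)
```

Under these assumptions, the original program has two matchings and neither deadlocks, while the variant deadlocks exactly when the wildcard receive consumes P2's send. A single test run observes only one of these matchings, which is why both must be forced.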
Completes-before Relation of MPI
MPI guarantees process-pair-wise message delivery ordering with respect to the issue orders of sends and receives. To illustrate this idea, consider two sends that are issued by process i both targeting process j, and two matching receives that are issued by process j, hoping to source from i. These sends and receives must be carried out in program order. It is only when send operations target receive operations in different processes, or receive operations source from different processes, that program order can be relaxed.
Specifically, suppose process $i$ has a send $S_{i,m}(j, \langle i,m \rangle, d_1)$, and another send $S_{i,n}(j, \langle i,n \rangle, d_2)$, for $n > m$. Here, $d_1$ and $d_2$ are the data payloads. Suppose process $j$ has a receive $R_{j,u}(i, \langle j,u \rangle, x_1)$, and another receive $R_{j,v}(i, \langle j,v \rangle, x_2)$, for $v > u$. Here, $x_1$ and $x_2$ are $j$'s receive buffers. MPI guarantees FIFO message ordering and ensures that $x_1$ is bound to $d_1$ and $x_2$ to $d_2$ during execution. The POE algorithm must never issue these sends and receives out of order. In fact, the POE algorithm can 'fire and forget' these operations in program order, and be guaranteed that the MPI runtime will match them in this appropriate order. Now consider a slightly different example where there are three processes $i$, $j$, and $k$ in the system. The receives are $R_{j,u}(k, \langle j,u \rangle, x_1)$ and $R_{j,v}(*, \langle j,v \rangle, x_2)$, where $k \neq i$, and furthermore, let process $k$ never issue a send to process $j$. In this case, the first receive (which cannot match any of the offers made by $i$) will be trumped by the second receive, which now goes ahead; the result will be that $x_2$ is bound to $d_1$. The POE algorithm has to be aware of this 'trumping rule.' A third variant of our example is one where the sends are as above, the receives are $R_{j,u}(k, \langle j,u \rangle, x_1)$ and $R_{j,v}(*, \langle j,v \rangle, x_2)$, where $k \neq i$, but now the third process $k$ issues a send to $j$ with payload $d_3$, thus binding $x_1$ to $d_3$, and $x_2$ to $d_1$. POE has to be aware of this lack of trumping, as well. Thus, we note that when the sequence $R_{j,u}(k, \langle j,u \rangle, x_1); \ldots R_{j,v}(*, \langle j,v \rangle, x_2)$ appears in process $j$, the second receive can conditionally complete before the first one, in a manner that depends on the runtime state of the system.
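The three cases above can be checked against a toy matching rule. The Python sketch below assumes that each arriving message binds to the earliest posted receive that is still open and accepts its sender; this is a simplified reading of MPI's non-overtaking matching, not the formal ≺ relation of this paper, and the name `bind` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Receive = Tuple[Optional[str], str]  # (source, or None for '*', buffer name)
Message = Tuple[str, str]            # (sender, payload)


def bind(receives: List[Receive], arrivals: List[Message]) -> Dict[str, str]:
    """Bind each arriving message to the earliest open receive accepting its sender."""
    open_receives = list(receives)
    bindings: Dict[str, str] = {}
    for sender, payload in arrivals:
        for recv in open_receives:
            source, buffer = recv
            if source is None or source == sender:
                bindings[buffer] = payload
                open_receives.remove(recv)  # safe: we leave the loop immediately
                break
    return bindings


# Case 1: both receives name i, so FIFO order binds x1 to d1 and x2 to d2.
fifo = bind([("i", "x1"), ("i", "x2")], [("i", "d1"), ("i", "d2")])
# Case 2: k never sends, so the wildcard receive trumps the receive from k.
trumped = bind([("k", "x1"), (None, "x2")], [("i", "d1"), ("i", "d2")])
# Case 3: k does send d3, so the first receive completes as well.
not_trumped = bind(
    [("k", "x1"), (None, "x2")], [("i", "d1"), ("k", "d3"), ("i", "d2")]
)

if __name__ == "__main__":
    print(fifo, trumped, not_trumped)
```

In case 2 the buffer `x1` stays unbound, which is the situation the trumping rule describes: the later wildcard receive completes while the earlier specific receive is still pending.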
We now define the completes-before relation, ≺. The POE algorithm presented in Section 3.4 will be based on ≺. A variant of ≺ called conditionally completes ($\prec_c$) is used to model the concept of trumping discussed earlier. For the sake of simplicity, we do not discuss $\prec_c$ any further in this paper (it is of course incorporated into our implementation of POE, in forming match-sets according to $\prec_c$).
FIFO Lemma: Any MPI program execution respecting $\prec^*$, the transitive closure of ≺, guarantees the required FIFO message orderings between MPI processes.
Match-Set Formation
Fence Instructions: For an instruction $j \in I$, $fence(j)$ holds exactly when for all succeeding instructions $k \in I$ in program order, $j \prec^* k$. Notice that 'wait' and 'barrier' act as fences, and depending on the MPI program, other instructions may attain a fence status.
Ancestor Relation:
The ancestor of an instruction $i$ is some instruction $j$ where $j \prec^* i$. The set $ancestors(i)$ is the set of indices of $i$'s ancestors. To exploit the FIFO Lemma, POE issues instruction $i$ to the MPI system only after all its ancestors $j$ have been issued. POE can issue any instruction not connected by $\prec^*$ out of order, as the MPI system itself considers such instructions semantically unordered (and hence may reorder them).
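The ancestor check can be sketched as a reachability computation over ≺. In the Python sketch below, instructions are identified by (process, PC) pairs and ≺ is supplied as a map from each instruction to its direct predecessors. The edges used in the example are an assumption chosen for illustration (a chain barrier, barrier, send, wait within process P2), not the relation as formally defined in the paper, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Instr = Tuple[int, int]  # (process id, program counter)


def ancestors(target: Instr, precedes: Dict[Instr, Set[Instr]]) -> Set[Instr]:
    """Return every j with j ≺* target; precedes[k] holds the direct predecessors of k."""
    seen: Set[Instr] = set()
    stack = list(precedes.get(target, ()))
    while stack:
        j = stack.pop()
        if j not in seen:
            seen.add(j)
            stack.extend(precedes.get(j, ()))
    return seen


def can_issue(target: Instr, precedes: Dict[Instr, Set[Instr]], issued: Set[Instr]) -> bool:
    """An instruction may be issued only after all of its ancestors have been issued."""
    return ancestors(target, precedes) <= issued


# Assumed edges for P2: B(2,1) ≺ B(2,2) ≺ S(2,3) ≺ W(2,4).
PRECEDES: Dict[Instr, Set[Instr]] = {
    (2, 2): {(2, 1)},
    (2, 3): {(2, 2)},
    (2, 4): {(2, 3)},
}

if __name__ == "__main__":
    print(sorted(ancestors((2, 4), PRECEDES)))
    print(can_issue((2, 3), PRECEDES, {(2, 1), (2, 2)}))
```

With both barriers issued, the send at (2, 3) may be issued, while the wait at (2, 4) must still be held back until the send itself has been issued.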
Match-set definitions:
We now define all match-set types. The match-set type $MS^{*}_{R}$ will be a set of big-step moves. The match-set type $MS_B$ will be one big-step move containing all the matching barriers. Match-set type $MS_R$ will contain exactly one send $S_{i,u}(j, \ldots)$ and its matching non-wildcard receive $R_{j,v}(i, \ldots)$. Match-set type $MS_W$ will be a big-step move of exactly one wait, and $MS_T$ will be a big-step move of exactly one test. Consider the big-step moves $\langle \ldots \rangle$ themselves to be sets.
The main difficulty in forming match-sets is to determine which sends can match a wildcard receive. To compute $MS^{*}_{R}$, we start with a set containing just the wildcard receive in question. We then seek the maximal number of additional sends that we can add to this set, without hitting a fence. Finally, we break $*$ into specific instances of PIDs. We also must make sure that for the members of any MS, all their ancestors have been issued into the MPI system. Modeling this requires the state of the $h$ array.
Formal Definition of M S(l, h):
We define match-sets as a function of l (the array of PCs) and h (the array of handles). In our definitions, we often refer to a "band" of past PC values where the MS might lie; this is what the function ρ used below denotes:
Priority Scheme: Let $MS(l, h)$ be an abbreviation for invoking $MS_B(l, h)$, $MS_R(l, h)$, and $MS_W(l, h)$ in some order. If this invocation returns $\emptyset$, we will explicitly invoke $MS^{*}_{R}(l, h)$ and pursue the contents of this set, if any. The above is the priority search scheme that POE uses (postpone wildcard receives until all senders are discovered).
The POE Algorithm
We present the transition relation as an inference system which infers new states. Let $\langle l, h \rangle \in Rch$ mean that the state $\langle l, h \rangle$ has been reached. We invariantly maintain that $h_{i,l_i} = 0$. In the following, $h_{i,j}$ is set to 1 only by match-set moves. Non-MS moves are PC advances, and they result only in $h_{i,j}$ being set to 0 (the instruction is encountered but not issued). For a process $i$, a PC advance move is permitted if the instruction at its current PC is not a fence, or if the instruction has been issued (handle is set). The atomic transitions are one of the $MS(l, h)$ moves, a PC move, or all the moves within $MS^{*}_{R}(l, h)$. Also, $move(l, h, R)$ takes a system state $\langle l, h \rangle$ and an atomic transition (set of instructions) $R$, and sets the handle bits at the indices of the instructions in $R$. It does not advance the PC, as that will be done by the 'PC move' transition. Formally, let $\alpha \in I \to PID$ and $\beta \in I \to Nat$ be such that for instruction $r \in I$, $r = p_{\alpha(r),\beta(r)}$. Then,
Init: $\langle l_0, h_0 \rangle \in Rch$, where $l_0 = \lambda i.\, 1$ and $h_0 = (\lambda i j.\ \text{if } j = 1 \text{ then } 0 \text{ else } \bot)$.
Step: for $\langle l, h \rangle \in Rch$
// All the deterministic singleton ample-set moves
if $MS(l, h) \neq \emptyset$ then $move(l, h, MS(l, h)) \in Rch$
// PC move which is also a singleton ample-set move
Illustration of POE: POE will form match-sets (MS) from only those instructions that have a handle value of 0. In system state $\langle l, h \rangle$, if there exists an MS other than $MS^{*}_{R}$ (such an MS is a subset of $I$), POE picks any such set and invokes its operations (setting $h_{i,j}$ to 1 for each instruction in it). $MS^{*}_{R}$ is a set of subsets of $I$, and POE recursively invokes each member set in any order (in the implementation, these are backtrack points). If no MS can be built in the current system state, POE advances the PC $l_i$ of some process $i$ if possible; otherwise, the system is deadlocked. In our example (Figure 3), the first MS will be $\langle B_{1,1}, B_{2,1}, B_{3,1} \rangle$, and these barrier calls are issued, setting $h_{1,1}$, $h_{2,1}$, and $h_{3,1}$ to 1. When $R_{1,2}(*, \langle 1,2 \rangle)$ from P1 is encountered, $h_{1,2}$ is set to 0 (instruction encountered, but recorded for future issue). Likewise, from P3, we encounter $S_{3,2}(1, \langle 3,2 \rangle)$, and set $h_{3,2} = 0$; we do not issue this send, as we have not carved out the maximal MS and we have not hit a fence. The system advances the PCs, finds the next MS $\langle B_{1,3}, B_{2,2}, B_{3,3} \rangle$, and invokes it, setting the handle bits to 1. Following this, it will encounter $S_{2,3}(1, \langle 2,3 \rangle)$, setting $h_{2,3} = 0$. At this point, further PC advancement will place P1's PC facing $R_{1,4}(*, \ldots)$, which is ≺ ordered after $R_{1,2}(*, \ldots)$, and hence serves as a fence within P1. Now P2 encounters fence $W_{2,4}$, and P3 encounters fence $W_{3,4}(\langle 3,2 \rangle)$. At this point, the set $\{S_{2,3}(1, \langle 2,3 \rangle), S_{3,2}(1, \langle 3,2 \rangle), R_{1,2}(*, \langle 1,2 \rangle)\}$ is promoted to MS status. The dynamic rewriting process produces two MSs (actually a set containing two MS sets), $\langle R_{1,2}(2, \langle 1,2 \rangle), S_{2,3}(1, \langle 2,3 \rangle) \rangle$ and $\langle R_{1,2}(3, \langle 1,2 \rangle), S_{3,2}(1, \langle 3,2 \rangle) \rangle$, and recursively invokes POE with these MSs. When the last $MS_B$ is encountered, this corresponds to MPI_FINALIZE. At this time, if any handle is still 0, and no more MS remains, an invalid end-state error is reported. In this example, no deadlock is encountered.
Correctness of POE:
The correctness argument for POE has two parts. First, we must ensure that we abide by the FIFO Lemma in all scheduling decisions. This follows from POE never issuing actions contrary to ≺. However, whenever ≺ does not hold, POE may issue actions out of order. Second, we must ensure that we are executing according to conditions C0-C2 ([23]) of a correct partial order reduction algorithm (we do not require C3 owing to the acyclicity of MPI's state space). C2 is satisfied because local assertions only observe local process steps, which are singleton ample. The priority scheme of Section 3.3 ensures that all singleton ample-sets contributed to by match-sets other than $MS^{*}_{R}$ are exhausted. These preserve C1. Finally, the dependencies among the sends targeting a wildcard receive are correctly handled by doing a full recursive expansion of the constituents of $MS^{*}_{R}$, which also preserves C1.
Summary of Experimental Results
We have implemented the POE algorithm in our ISP runtime model-checker for MPI that is downloadable, along with our examples, from our website. A summary of our results is as follows:
• In all the 69 examples from the Umpire test suite, ISP explores exactly the number of interleavings predicted by our formal algorithm. This number is far smaller than the number of interleavings without reduction.
• Existing MPI program testing approaches (e.g., Umpire, Marmot) cannot detect deadlocks with assurance on many simple examples. In all these cases, the POE algorithm detects the deadlocks (see our webpage for the results).
• For some examples with several hundreds of lines of code that have no wildcard receives (where the code checks for local assertions), POE requires exactly one interleaving. Existing testing tools will wastefully explore multiple interleavings where the MPI operations have no dependencies.
• In the implementation, POE's setting of handle bits amounts to collecting MPI operations without issuing them. These book-keeping steps of ISP have negligible overhead. The main overhead of ISP is that of restarting MPI for each replay. In [10], we provide techniques that can dramatically reduce this overhead. These techniques will be integrated into our current ISP version.
• ISP supports 24 MPI functions, including many collective operations, MPI communicators, and non-deterministic wait functions such as MPI_WAIT_ANY. However, in a significant number of cases, we can allow an MPI program to issue operations even outside of this set. These extra functions (such as MPI_TYPE_CREATE) can still be issued into the MPI run-time without being trapped by the verification scheduler of POE.
• POE's scheduler is designed to be parallelized using MPI in future versions of ISP. Also, a static analysis package to remove computations that do not affect control flow has been prototyped and will be integrated into ISP.
Concluding Remarks
We have described an algorithm for handling out of order execution and barrier semantics in verifying MPI programs for deadlocks and local assertions. We emphasize that POE works on unaltered MPI source programs. [24] ), we are in a position to rigorously prove the MPI semantics described in this paper.
