Abstract: A collection of parallel processors is said to be coordinated if each write from one processing element (PE) to another is answered by a read. We report on an efficient algorithm to test coordination for parallel programs in which the code for each PE is a loop. We also test a weaker predicate for parallel algorithms with oblivious PE codes and we show that the general problem is PSPACE-hard.
II. THE MODEL OF PARALLEL PROGRAMS
We postulate a parallel processor composed of m processing elements (PE's) $M_1, M_2, \ldots, M_m$ which communicate with read and write operations. The PE's are all of the same type and, since we are concerned only with interprocessor input-output behavior, it is sufficient to let them be devices capable of defining a regular set. We assume that the PE's execute synchronously and that, on each step, a PE can simultaneously execute a set of operations.
We model such systems as interprocessor communication (IC) systems.¹ An IC system is completely defined by a set of reduced Moore-type machines $V_1, V_2, \ldots, V_m$, each describing the interprocess input-output behavior of a single PE. The ith machine describes the behavior of the ith PE. The alphabet of the machines consists of symbols denoting sets of operations that are to be executed simultaneously. Each symbol is an element of the power set of² $\{r_{i,\sigma} \mid i \in [m] \wedge \sigma \in \Sigma\} \cup \{w_{i,\sigma} \mid i \in [m] \wedge \sigma \in \Sigma\}$, where $\Sigma$ is a finite set of values, $r_{j,\sigma}$ denotes an operation that reads $\sigma$ from PE $j$, $w_{j,\sigma}$ denotes an operation that writes $\sigma$ to PE $j$, and $\phi$, the empty set, takes the place of any operation not involved in interprocessor communication (including operations that transfer values to and from the external environment).
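As a minimal sketch of this alphabet, the following Python fragment enumerates the read and write operations for a small system and forms their power set. The tuple encoding `(kind, pe, value)` and the function names are our own illustrative assumptions and are not part of the model itself.

```python
from itertools import chain, combinations

def operations(m, values):
    """All read and write operations for m PEs over a finite value set.

    Hypothetical encoding: ("r", j, v) reads value v from PE j and
    ("w", j, v) writes value v to PE j.
    """
    reads = {("r", j, v) for j in range(1, m + 1) for v in values}
    writes = {("w", j, v) for j in range(1, m + 1) for v in values}
    return reads | writes

def alphabet(m, values):
    """Power set of the operations; each symbol is a frozenset of operations."""
    ops = sorted(operations(m, values))
    subsets = chain.from_iterable(
        combinations(ops, k) for k in range(len(ops) + 1)
    )
    return {frozenset(s) for s in subsets}

ops = operations(2, {0, 1})      # 2 PEs, values 0 and 1
symbols = alphabet(2, {0, 1})    # every set of simultaneous operations
empty = frozenset()              # the symbol for "no interprocessor operation"
```

The power set grows exponentially in the number of operations, so a practical tool would presumably store only the symbols that actually label states of the machines.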
As an example, we model a systolic processor for band matrix-vector multiplication [1]. For a matrix with bandwidth b, the processor consists of a linear array of b processing elements connected as in Fig. 1. Data are passed through the matrix in three directions, advancing one position on each time step. The input matrix A is passed from north to south, the input vector X from west to east, and the resulting vector Y from east to west. Each processor uses three registers ($R_a$, $R_x$, and $R_y$), repeatedly executing the simple algorithm given in Fig. 2. Even and odd processors (numbering from the left with 1) alternate their execution. The IC system modeling this systolic processor is shown in Fig. 3. Only interprocessor reads and writes appear in the [...]

¹ IC systems can be defined more generally [6], but for the purposes of this paper we present only a limited version.

² [m] denotes the set {1, 2, ..., m}. Note that we use standard set notation to represent both sets and the symbols of our alphabet; the distinction will be [...]

One of the most complex aspects of programming for parallel processors is the problem of ensuring that the resulting system is correctly coordinated. In this paper we address this problem by providing algorithms to answer questions of the form

Given an IC system, is it strongly (weakly) coordinated?
We consider the problem for a sequence of cases, based on increasingly complex IC system structure. For the first two cases, which are sufficient to cover most of the existing parallel algorithms, we present efficient algorithms to test coordination. For the third, general case, we show that the problem is computationally intractable.

Given an oblivious IC system, is there a potential coordination error?
If our algorithm reports CORRECT, then the system is coordinated; if our algorithm reports POTENTIAL ERROR, it is possible that the detected error will never show up in any legal computation of the system. To test worst case coordination along a single communication link, we form the "cross product" machine for the two PE's involved, as in Fig. 5. Fig. 6 shows the computation tree for the link from PE A to PE B of the cross product machine depicted in Fig. 5.
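The cross product construction can be sketched as follows. This is an illustration under our own assumptions, not the construction of Fig. 5: each PE is encoded as a hypothetical dictionary mapping a state to its symbol and its list of successor states, operations are tuples `(kind, pe, value)`, and a product state is labeled with the pair of symbols restricted to the link of interest.

```python
from itertools import product

def restrict(symbol, partner):
    """Keep only the operations in `symbol` that address PE `partner`."""
    return frozenset(op for op in symbol if op[1] == partner)

def cross_product(machine_a, start_a, id_a, machine_b, start_b, id_b):
    """Synchronous product of two PEs, restricted to the link between them.

    Each machine maps state -> (symbol, successor states). Both PEs advance
    on every step, so the successors of (sa, sb) are all pairs of successors.
    Only product states reachable from the start pair are built.
    """
    prod = {}
    todo = [(start_a, start_b)]
    while todo:
        sa, sb = todo.pop()
        if (sa, sb) in prod:
            continue
        sym_a, next_a = machine_a[sa]
        sym_b, next_b = machine_b[sb]
        label = (restrict(sym_a, id_b), restrict(sym_b, id_a))
        successors = list(product(next_a, next_b))
        prod[(sa, sb)] = (label, successors)
        todo.extend(successors)
    return prod, (start_a, start_b)

# Hypothetical two-state loops: PE 1 writes "x" to PE 2 on even steps,
# and PE 2 reads "x" from PE 1 on odd steps.
A = {"a0": (frozenset({("w", 2, "x")}), ["a1"]), "a1": (frozenset(), ["a0"])}
B = {"b0": (frozenset(), ["b1"]), "b1": (frozenset({("r", 1, "x")}), ["b0"])}
prod, start = cross_product(A, "a0", 1, B, "b0", 2)
```

In this example only two of the four possible state pairs are reachable, because the two loops advance in lockstep.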
The following lemma relates the outcome of paths to potential weak coordination errors.

Lemma 1: The computation tree for a link contains no potential weak coordination errors if and only if all paths from the root have an outcome of either 0 or 1.

Proof: For a given link i,j, consider the IC system that is composed of the two PE's using i,j with all of their other I/O operations replaced with the empty set $\phi$. This system has a weak coordination error if and only if the original IC system had such an error on the given link. Each path in the computation tree of the cross product machine for the new system corresponds directly to one of its execution sequences, and it is easily shown by induction that, for all $l$, [...] that is, if and only if there is no potential weak coordination error on level $l$ of the tree.

As a result of this lemma, we can reduce the question of potential weak coordination errors to the following question.
Is there a path from the root in the computation tree for the link that has outcome -1 or I?
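To make the question concrete, the sketch below computes the outcome of a path under an assumed definition that is consistent with the way outcomes are used in the lemmas: the outcome is the number of writes minus the number of reads on the link while that difference stays within -1, 0, and 1, and it is the error value I as soon as the difference leaves that range. This definition, the `(writes, reads)` encoding of a node, and the function name are assumptions made for illustration only.

```python
ERROR = "I"  # assumed encoding of the error outcome written I in the text

def outcome(path):
    """Outcome of a path given as a list of (writes, reads) counts per node.

    Assumption: a node carries at most one write and one read on the link,
    and a running difference outside {-1, 0, 1} is an error outcome.
    """
    balance = 0
    for writes, reads in path:
        balance += writes - reads
        if balance not in (-1, 0, 1):
            return ERROR
    return balance

answered = outcome([(1, 0), (0, 1)])      # write, then read
pending = outcome([(1, 0)])               # write not yet answered
early_read = outcome([(0, 1)])            # read with no preceding write
double_write = outcome([(1, 0), (1, 0)])  # two writes with no read between
```

Under this reading, a path from the root is free of potential weak coordination errors exactly when the value returned is 0 or 1.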
We now introduce a series of lemmas to show that we can determine the answer to this question after having seen only a finite amount of the computation tree. We show the following lemma.
Lemma 2: No path between two nodes in an error-free computation tree has an outcome of I.
Proof: Let $p = t_1, t_2, \ldots, t_r$ be the shortest path in the tree with outcome I; p must have at least two nodes. Since p is shortest, there are three possibilities for the outcome of the path $p' = t_1, t_2, \ldots, t_{r-1}$:
i) p' has outcome 1: then there is one more write than read along p'. Since reads and writes must alternate on any path in an error-free tree, the last I/O operation in p' must be a write and $t_r$ cannot contain a write. The outcome of p must be 0 or 1, a contradiction.
ii) p' has outcome 0: as defined, the outcome of p cannot be I.
iii) p' has outcome -1: then, as in case i), p must have outcome either 0 or -1, a contradiction.
Thus, p cannot have outcome I, which is a contradiction.

Defining a cycle in a computation tree to be a path from a node to (but not including) the next occurrence of a node labeled with the same state, we show the following lemma.
Lemma 3: The outcome of a cycle in an error-free computation tree must be 0.
Proof: Suppose there is a cycle from node $v_1$ to node $v_2$ with outcome other than 0. By Lemma 2, the outcome of the cycle must be either 1 or -1. If the outcome is 1, then there must be one more write than read on the cycle, and so there must be at least one node v' on the cycle labeled by a state containing a write operation but no read operation.

Let $p_1$ be the path from $v_1$ to the node immediately preceding v' and let $p_2$ be the path from the node immediately following v' to $v_2$. Together, $p_1$ and $p_2$ must contain an equal number of reads and writes. In addition, there must be a path in the tree which starts at v' and has the same labels as v' followed by $p_2$ followed by $p_1$ followed by v' again. This path must have an outcome I since it has two more writes than it has reads. By Lemma [...]

[...] The cycle from the first occurrence of s' to the next along this path must have outcome 0 or, by Lemma 3, we could have detected the existence of a coordination error when the cycle was encountered. If the cycle outcome is 0, the paths from the root to the two nodes labeled s' must have the same outcome, and so the outcome at the node labeled s must be the same as the outcome at a node reached by a shorter path which does not contain the cycle. The outcome at s was I, and so there must be an I outcome before level $l$, which is a contradiction.

For a cross product machine with at most q states, the following lemma bounds the number of nodes that must be retained on any given level.
Lemma 5: The computation tree for determining worst case weak coordination requires at most 2q nodes per level.
Proof: Consider two nodes with the same label s on some level of the tree and the subtrees $T_1$ and $T_2$ below them. The subtrees $T_1$ and $T_2$ must be identical. Furthermore, if the paths from the root to these nodes have the same outcomes, the outcomes at all corresponding nodes in those subtrees must also be the same. Thus, we can "merge" the two nodes and examine only one of the subtrees to detect coordination errors. Since there are two possible legitimate outcomes for each node and q possible states, we need to retain at most 2q nodes per level.
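The merging argument can be illustrated with a level-by-level exploration that keeps one descriptor per (state, outcome) pair. The sketch below is not the algorithm developed in this paper: it assumes that the outcome of a path is the running difference between writes and reads on the link, that each state contributes -1, 0, or 1 to that difference, and that the cross product machine is given as two hypothetical dictionaries, `successors` and `delta`. It also discards pairs already seen on earlier levels, a simplification that goes beyond the per-level merging of Lemma 5.

```python
def explore(successors, delta, start):
    """Explore the computation tree level by level, merging equal nodes.

    successors: state -> list of next states of the cross product machine.
    delta: state -> writes minus reads performed on the link in that state.
    Returns "POTENTIAL ERROR" if some path from the root reaches an outcome
    other than 0 or 1, and "CORRECT" otherwise.
    """
    level = {(start, delta[start])}
    seen = set()
    while level:
        for state, out in level:
            if out not in (0, 1):
                return "POTENTIAL ERROR"
        # Every surviving pair has outcome 0 or 1, so a level never holds
        # more than 2q descriptors for a machine with q states.
        seen |= level
        level = {
            (nxt, out + delta[nxt])
            for state, out in level
            for nxt in successors[state]
        } - seen
    return "CORRECT"

# Hypothetical loop that alternates a write and a read on the link.
good = explore({"p0": ["p1"], "p1": ["p0"]}, {"p0": 1, "p1": -1}, "p0")

# Hypothetical branching loop: the branch through s2 skips the read,
# so the next write at s0 is a second unanswered write.
bad = explore(
    {"s0": ["s1", "s2"], "s1": ["s0"], "s2": ["s0"]},
    {"s0": 1, "s1": -1, "s2": 0},
    "s0",
)
```

Because there are at most 2q distinct legitimate pairs, the loop adds at least one new pair per iteration and therefore stops after at most 2q levels.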
We will use these results in constructing an algorithm for testing worst case weak coordination. The algorithm maintains a descriptor for each node on the current level which contains i) the union of the outcomes of all paths from the root to the node, ii) for each of the q states, an indication of whether or not that state has appeared on any of the paths to this node, and iii) for every state, a set containing the possible outcomes from all nodes labeled with that state on some path to (but not including) the current state.
The complete algorithm for finite state machines with q states is as follows.
[...]
2) Report CORRECT and HALT.
Theorem 3: Algorithm 2 correctly detects all worst case weak coordination errors for the given link.
Proof: Part I. Suppose that the algorithm halts after reporting an error. This could happen in either of two ways: as a result of step a) or as a result of step b). If it occurred as a result of step a) then, by Lemma 2, the tree contains an error. If it occurred as a result of step b) then the tree also contains an error because, by Lemma 3, the outcomes of all cycles in an error-free tree must be 0. Thus, whenever the algorithm reports an error, there is an error in the tree.
Part II. [...]

We show the following theorem.

Theorem 5: For an arbitrary system of interconnected processors, the problem of testing communication interfaces for strong coordination is PSPACE-hard.
Proof: We reduce the language recognition problem for linear bounded automata (lba's), which is known to be PSPACE-complete [5], to the coordination problem. Given an lba and an input string, we construct an IC system which has a coordination error if and only if the lba accepts the given string.
The IC system has the structure shown in Fig. 7, where the memory PE's are storage devices and the control PE's are modified instances of the lba. There is one memory PE-control PE pair for each tape square. The memory PE keeps track of the current symbol written on its corresponding tape square, and the symbol is transferred back and forth between the two PE's, enabling the control PE to read and branch on its value. The fsa for the memory PE's is shown in Fig. 8. We use a two-symbol alphabet (0 and 1), and the appropriate initial state is determined by the initial value of the tape square. We identify the destination of I/O operations directionally (e for east, w for west, n for north, and s for south) rather than explicitly.
The control PE's, in addition to reading the current tape symbol from their corresponding memory PE's, also read two tokens from their adjacent neighbors; one indicates whether the tape square corresponding to this control PE will have the read head on the next state and the other is the index of the next state. As indicated in Fig. 9, all control PE's, except for the PE at the square where the head initially re- [...]

The control PE's read from the memory PE's (the fourth level of nodes in Fig. 9) and then write a value back (the fifth level of nodes). If the PE does not have the head, the input value is echoed back. If the PE does have the head, the new value, determined from the transition function of the lba, is written (oj denotes the value written for state j and symbol i).
On the next step (the sixth level of nodes), the control PE's simultaneously write and read to their east and west neighbors, indicating the head movement; $h$ and $\bar{h}$ represent the messages "head" and "no head," respectively. Finally, the PE that has the head writes the updated state information (the eighth level of nodes) to the PE receiving the head (8j denotes the next state for state j and symbol i). Since the identity of the receiving PE depends on the lba, we represent the direction of this write with a variable x.
At this point, the control PE that has the head is on the null state at the seventh level of the figure, the PE that is about to receive the head is either at the state labeled B or the state labeled C (depending on whether the head is to be passed from the east or the west), and the remainder of the PE's are at the state labeled A. In the next step, the new state information is passed to the PE receiving the head and the cycle repeats.
The IC system continues simulating the behavior of the lba until a halt is reached. The halt is passed to the control PE with the head instead of a next state, causing that PE to repeat the read which in turn causes a coordination error.
VI. DISCUSSION
Although the complexity theory results indicate that coordination testing is a very complex task, it is important to notice that many recently developed parallel algorithms are covered by Theorem 2. The testing algorithms presented here are currently being implemented, and we expect that they will be of significant assistance to programmers working on parallel algorithms. In addition, as libraries of well-understood and well-tested parallel modules become available, we expect that these same testing algorithms can be used to check automatically the interface compatibilities of modules.
