Abstract. The time complexity of several fundamental problems on the sub-bus mesh parallel computer with p processors is investigated. The problems include computing the PARITY and
MINIMUM is computable in $\Theta(\log p)$ time. The lower bound for MINIMUM on the one-dimensional sub-bus mesh uses the same technique as the lower bound for PARITY.
We then show that PARITY and MAJORITY are computable in time $\Theta(\log p / \log\log p)$ on a two-dimensional sub-bus mesh computer. The lower bounds follow from the fact that a CRCW PRAM can simulate a sub-bus mesh computer to within a constant factor of the time and within a polynomial number of processors. Thus, the CRCW PRAM lower bounds on PARITY and MAJORITY [0] apply to the sub-bus mesh computer. The upper bounds follow from an algorithm for SUM, the sum of $p$ numbers of length $O(\log p)$, which runs in time $O(\log p / \log\log p)$. The SUM algorithm is non-trivial, using mixed radix arithmetic, the Chinese remainder theorem, and recursion to achieve the result. The obvious algorithm for SUM takes time $\Theta(\log p)$. The two-dimensional lower bound of $\Omega(\log p / \log\log p)$ for SUM on the CRCW PRAM follows from the lower bound on PARITY.
1.2. Related Results. The mesh or array parallel computer architecture has been investigated for a number of years, with numerous articles on its many variants [0, 0, 0, 0, 0, 0, 0]. The sub-bus mesh architecture was first investigated by Reisis and Prasanna Kumar, who gave constant time algorithms for the OR of $p$ bits and the MINIMUM of $\sqrt{p}$ numbers (all on one row), and an $O(\log p)$ algorithm for combining $p$ data items with an associative operator [0]. Two variants of the mesh computer are closely related to the sub-bus mesh. First, there is the full-bus mesh, where processors can broadcast vertically or horizontally, but on a vertical (horizontal) broadcast at most one processor per column (row) can be active. The MPP of Goodyear and NASA is an example of a full-bus two-dimensional mesh computer [0]. Full-bus meshes are generally less powerful than sub-bus meshes. Both PARITY and MINIMUM require $\Omega(p^{\epsilon})$ time for some $\epsilon > 0$ on full-bus meshes [0, 0]. Second, there is the reconfigurable mesh, which allows the topology of the mesh to be changed by the executing program [0]. Several prototype, but no commercial, reconfigurable mesh computers have been built. PARITY can be computed in constant time on the reconfigurable, two-dimensional mesh [0]. Thus, our results demonstrate that the sub-bus mesh computer architecture is strictly more powerful than the full-bus mesh computer architecture, but strictly less powerful than the reconfigurable mesh computer. In other work on the PARITY function, MacKenzie [0] independently obtained a lower bound of $\Omega(\log p / k)$ for computing PARITY on a restricted $p \times k$ reconfigurable mesh model, which is exactly our one-dimensional sub-bus mesh model for $k = 1$. However, he did not extend this to other symmetric functions.
The SUM function has also been previously studied on the reconfigurable mesh. Nakano [0] and Nakano, Masuzawa and Tokura [0] developed algorithms for SUM on the reconfigurable mesh that also use Chinese remaindering. Their results do not apply directly to the sub-bus mesh architecture. MINIMUM has also been previously studied on the two-dimensional reconfigurable mesh, and the techniques used can be applied directly to show that MINIMUM can be computed in $\Theta(\log\log p)$ time on the two-dimensional sub-bus mesh. From the work of Hao, MacKenzie and Stout [0], a lower bound of $\Omega(\log\log p)$ is obtained for computing MINIMUM on a two-dimensional sub-bus mesh. Their proof is based on a PRAM simulation of the mesh model, and applies a result of Fich et al. [0] in which an equivalent lower bound is proved for the CRCW PRAM. However, this proof requires that the inputs be very large. Another lower bound of $\Omega(\log\log p)$ for computing MINIMUM on the two-dimensional sub-bus mesh follows from the general lower bound for the parallel comparison model of Valiant [0] and applies to comparison-based algorithms. A matching upper bound is due to Miller et al. [0], and is basically an implementation of the parallel MINIMUM algorithm of Valiant [0].
The parallel random access machine (PRAM) is probably the most well studied theoretical model of parallel computation. There are a number of variants of the PRAM depending on whether reads or writes to the same memory location can be done concurrently. The variant most closely related to the sub-bus mesh is the CRCW PRAM (concurrent read/concurrent write PRAM) (see [0] for an introduction to the PRAM model). In this version more than one processor can read or write the same memory location at the same time. A simultaneous write can be resolved in a number of ways. The ability of the sub-bus mesh to broadcast on segments of the bus is very much like a combination of a concurrent read and a concurrent write. Those processors which are actively broadcasting are executing a concurrent write while those which are inactive are executing a concurrent read. Indeed, the sub-bus mesh can compute OR in constant time just as a CRCW PRAM can. We show that a CRCW PRAM can simulate a sub-bus mesh computer to within a constant factor of the time. This simulation immediately implies that lower bounds for the CRCW PRAM are also lower bounds for the sub-bus mesh. Interestingly, some CRCW PRAM algorithms can be translated into sub-bus mesh algorithms. For example, the constant time OR algorithm and Valiant's MINIMUM algorithm can be implemented on the sub-bus mesh. By contrast, some CRCW PRAM algorithms, such as the constant time CRCW PRAM MINIMUM algorithm of Fich, Ragde, and Wigderson [0], appear to be impossible to implement on the sub-bus mesh. More generally, the CRCW PRAM is strictly more powerful than the sub-bus mesh regardless of the number of dimensions. This is because problems like SORT require time $\Omega(p^{\epsilon})$ on mesh computers, but can be done in time $O(\log p)$ on a PRAM.
The sub-bus model is also incomparable with the EREW (exclusive read, exclusive write) PRAM model. To see this, note that computing the OR of $p$ bits requires $\Omega(\log p)$ time on a CREW PRAM [0], whereas the sub-bus mesh can compute OR in constant time. By contrast, $p$ integers can be sorted in $O(\log p)$ time on an EREW PRAM with $p$ processors [0, 0, 0], but, by a simple bisection bandwidth argument, this task requires time $\Omega(p^{\epsilon})$ on mesh computers [0].
1.3. Organization of the Paper. In section 2 we present our model of the sub-bus mesh computer. In section 3 we prove our upper and lower bounds for the one-dimensional sub-bus mesh computer. In section 4 we give two-dimensional algorithms for PARITY and SUM. In section 5 we give our simulation of a two-dimensional sub-bus mesh by a CRCW PRAM, thereby yielding our two-dimensional lower bounds for PARITY, MAJORITY, and SUM. In section 6 we present two-dimensional upper and lower bounds for MINIMUM. Finally, we have our conclusions in section 7.
2. Sub-Bus Mesh Computer Model. For the purposes of this paper we present a simple version of the sub-bus mesh computer architecture. Actual machines have a richer organization.
The sub-bus architecture can be easily explained for the one-dimensional mesh or linear array of processors. There are $p$ processors, numbered consecutively 0 to $p - 1$ on a circle. Processor 0 is the front-end which runs the parallel program. Each processor is a RAM with its own memory which is referenced using plural variables. In addition, there are singular variables for which there is only one copy, which is stored at the front-end processor. There is a special plural variable PID which always holds the processor's number. Processor operations include direct and indirect Boolean operations, arithmetic operations, shifts, and comparisons. In addition, the front-end can perform normal branching operations and issue parallel instructions.
A parallel instruction issued by the front-end has the form "if <condition> then <statement>." Each processor evaluates the condition, which can be any sequence of non-branching operations on plural or singular data which evaluates to a Boolean value. If the condition is true then the processor is said to be active, otherwise it is said to be inactive. Only the active processors execute the statement part of the instruction.
Table 2 describes the result of a segmented broadcast.
Instruction: broadcast right [2].y x
PID      0    1    2    3    4    5    6    7
active   no   no   yes  yes  no   no   no   yes
Table 1. Demonstration of segmented broadcast. The * indicates that the value of y did not change because of the broadcast.
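The effect of a segmented broadcast on the active pattern of the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's definition: the bus is taken to be circular, a value travels right from each active processor and is received by every inactive processor up to, but not including, the next active processor, and the `x` values are hypothetical because the table's data rows are not available. The function name `broadcast_right` is likewise illustrative.

```python
from typing import List, Optional


def broadcast_right(x: List[int], active: List[bool]) -> List[Optional[int]]:
    """Simulate one segmented broadcast to the right on a circular bus.

    Returns the value received by each processor, or None if nothing arrives.
    Assumption: an active processor cuts the bus and does not itself receive.
    """
    p = len(x)
    received: List[Optional[int]] = [None] * p
    for sender in range(p):
        if not active[sender]:
            continue
        k = (sender + 1) % p
        # Walk right until the next active processor blocks the bus.
        # The loop ends at the latest when it wraps around to the sender.
        while not active[k]:
            received[k] = x[sender]
            k = (k + 1) % p
    return received


# Active pattern from the table; the x values are made up for illustration.
x = [10, 11, 12, 13, 14, 15, 16, 17]
active = [False, False, True, True, False, False, False, True]
y = broadcast_right(x, active)
print(y)
```

Under these assumptions processor 3 reaches processors 4 to 6, processor 7 wraps around to processors 0 and 1, and the three active processors receive nothing, which would correspond to the entries marked * in the table.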
In two dimensions there are also $p$ processors, where $p$ is a square number. The mesh processors are arranged in a $\sqrt{p} \times \sqrt{p}$ array. The coordinates of a mesh processor's number are stored in PIDx and PIDy. A mesh processor's number is stored in PID = PIDy * $\sqrt{p}$ + PIDx. The sub-busses go in four directions: up, down, right, and left. Processor $(x, y)$ is immediately up from processor $(x, (y + 1) \bmod \sqrt{p})$ and immediately to the left of processor $((x + 1) \bmod \sqrt{p}, y)$. So vertical busses go up and down, while horizontal busses go right and left, all in a circular fashion. The front-end processor is processor $(0, 0)$.
2.1. Time of Sub-Bus Mesh Algorithms. For the purpose of analyzing our algorithms we consider time to be evaluated using the unit cost RAM criterion, where the values operated upon must have length $O(\log p)$. Each sequential operation by the front-end, each parallel operation used in evaluating the condition in a parallel instruction, and each statement of a parallel instruction costs 1 in our model. We do not charge for the broadcast of parallel instructions by the front-end to the mesh processors. We assume that cost is dominated by the cost of executing the parallel instruction.
As mentioned earlier, the RAM instructions include the usual direct and indirect Boolean operations, arithmetic operations, shifts, and comparison. In addition, we permit any fixed finite set of RAM instructions for our processors, provided each instruction can be implemented in uniform NC [0, 0], that is, each instruction can be built from $\log^{O(1)} p$ hardware and runs in time $\log^{O(1)} \log p$. The set of RAM operations is independent of $p$. Thus, the total hardware in the sub-bus mesh computer of $p$ processors is $p \log^{O(1)} p$. The running time of $\log^{O(1)} \log p$ per RAM instruction is fast enough to be considered to be constant time in the mesh of processors.
For the purpose of proving our lower bounds we allow our model to be even more general. We do not restrict the length of the values operated on and do not restrict the RAM operations in any way. There is one exception. The two-dimensional lower bound for MINIMUM is done in the so-called "comparison model," where the only RAM operations allowed on input values are comparison, copy, and broadcast. Thus, with one exception, the lower bounds reflect the cost of computing functions due to the sub-bus mesh architecture, not any limitations on the individual processors. The two-dimensional lower bound for MINIMUM is still quite general, but is limited to the comparison model of the sub-bus mesh computer.
In CONSTANT-TIME-OR, if any of the values $x_i$ are true, then one or more of the processors will make sure to broadcast true into all processors' $y$'s. However, if none of the $x_i$ bits are true, then no processor will run the broadcast step, and so all the $y_i$'s will remain false. Clearly, using De Morgan's law, AND can also be computed in constant time.
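The OR argument and the De Morgan reduction to AND can be traced with a small sequential model in Python. It is a sketch only: it assumes a circular one-dimensional bus, assumes that a broadcasting processor also ends up with `y` true, and simulates the single parallel broadcast step with a loop. The function names are illustrative and do not come from the paper.

```python
from typing import List


def constant_time_or(x: List[bool]) -> List[bool]:
    """Model of: y = false; if x then broadcast true into y."""
    p = len(x)
    y = [False] * p                  # parallel step 1: clear y everywhere
    for sender in range(p):          # parallel step 2, simulated sequentially
        if not x[sender]:
            continue                 # inactive processors only listen
        y[sender] = True             # assumption: a sender also ends with y true
        k = (sender + 1) % p
        while not x[k]:              # the next active processor blocks the bus
            y[k] = True
            k = (k + 1) % p
    return y


def constant_time_and(x: List[bool]) -> List[bool]:
    """De Morgan's law: AND(x) = not OR(not x)."""
    negated = [not v for v in x]
    return [not v for v in constant_time_or(negated)]


print(constant_time_or([False, True, False, False, True]))
print(constant_time_and([True, False, True, True]))
```

If no `x` bit is true, no processor is active and every `y` keeps its initial false value, matching the case distinction in the text.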
In CONSTANT-TIME-MINIMUM, the first two broadcasts have the effect of setting $x_{i,j} = x_{i,0}$ and $y_{i,j} = x_{j,0}$. The comparison $x_{i,j} > y_{i,j}$ is then equivalent to the comparison $x_{i,0} > x_{j,0}$. If such a comparison holds then $x_{i,0}$ is not the minimum. The statement "if t then..." computes, in one step, the "or" of the outcomes of these comparisons. Thus, after the broadcast up, if $t_{i,0}$ = false then $x_{i,0}$ is the minimum. This minimum is then broadcast to the first row of the mesh.
3. One-Dimensional Bounds. In this section we give precise upper and lower bounds on the parallel time to compute PARITY, MAJORITY, SUM, and MINIMUM on the one-dimensional sub-bus mesh computer.
3.2. Lower Bounds in One Dimension. Our lower bounds for the one-dimensional sub-bus mesh computer are based on the limited communication bandwidth of this architecture. Thus, in proving our lower bounds, we use a simplified model, in which internal computations in a processor are "free" and only the time taken for communication is measured. It will be clear that any lower bound for this model applies also to the upper bound model.
As before, there are $p$ processors, numbered $0, 1, \ldots, p - 1$, connected by a circular sub-bus. The computation proceeds in rounds; we charge 1 time unit for a round. In each round of the computation, the processors first communicate and then perform arbitrary internal computation. The communication is controlled by the front-end, processor 0, just as in the upper bound model described in section 2. Once this is done, processors can do arbitrary internal computation that does not require communication. There is no bound on the length of values broadcast or computed by the processors.
An algorithm consists of both the algorithm that determines the sequence of communication instructions broadcast by processor 0, and the algorithms of processors $0, \ldots, p - 1$ that determine the internal computations at each round. Since processor 0 can read any information on the bus that passes in either direction between processor $p - 1$ and processor 1, the communication instructions may depend on this information.
Let $f$ be a function with domain $D^p$. We say algorithm $A$ computes $f$ if for all $(x_0, x_1, \ldots, x_{p-1}) \in D^p$, if at the start of the algorithm each processor $i$ has in its memory the value $x_i$, then at the end of the algorithm, every processor has in a special memory location the value $f(x_0, x_1, \ldots, x_{p-1})$. The tuple $(x_0, x_1, \ldots, x_{p-1})$ is called the input. In all of the results of this section, we assume that $|D| \ge 2$.
Fix an input $x = (x_0, x_1, \ldots, x_{p-1})$. Processor $k$'s view on input $x$ at time $t$ is a sequence $k, x_k, (1, v_1), (2, v_2), \ldots, (t, v_t)$ where $v_i$ is the value received by processor $k$ at time $i$ during the broadcast instruction. String $v_i$ is a special symbol if no value is received. We denote by $\mathrm{View}_k(x, t)$ the view of processor $k$ at time $t$. For a fixed input $x$, we say $x_i$ is unknown to processor $k$ at time $t$ if $\mathrm{View}_k(x, t) = \mathrm{View}_k(x', t)$ for all $x'$ that differ from $x$ only at component $i$. Otherwise, we say $x_i$ is known to processor $k$ at time $t$.
Our main result is the following:
Theorem 3.2. On a one-dimensional sub-bus mesh with $p$ processors and for any algorithm $A$, there exists an input $x$ such that for some $i$, $1 \le i \le p - 1$, $x_i$ is unknown to processor 0 at time $\log p - 1$.
This result is true, regardless of what function is being computed by $A$, as long as the domain size $|D| \ge 2$. Hence, the result immediately yields a lower bound of $\log p$ for the time to compute functions $f$ with the property that for any $(x_0, x_1, \ldots, x_{p-1}) \in D^p$ and any $i$, $0 \le i \le p - 1$, there is some $x'_i \in D$ such that
$$f(x_0, x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_{p-1}) \ne f(x_0, x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_{p-1}).$$
Clearly PARITY is an example of such a function, where $D = \{0, 1\}$, and so theorem 3.2 implies a lower bound of $\log p$ for PARITY. Also, the MINIMUM function over the integers is an example of such a function, so again theorem 3.2 implies a lower bound of $\log p$ for computing MINIMUM.
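The property required of $f$ (every input can be changed at every position so that the value of $f$ changes) can be checked by brute force on small instances. The following Python sketch does this for a finite domain; the function names are illustrative and are not taken from the paper. It confirms the property for PARITY on four bits and shows that OR does not have it, which is consistent with OR being computable in constant time.

```python
from itertools import product


def is_fully_sensitive(f, domain, p):
    """True if, for every input and every position i, some value of x_i changes f."""
    for x in product(domain, repeat=p):
        for i in range(p):
            # Try every replacement value at position i.
            if all(f(x[:i] + (v,) + x[i + 1:]) == f(x) for v in domain):
                return False
    return True


def parity(x):
    return sum(x) % 2


def boolean_or(x):
    return int(any(x))


print(is_fully_sensitive(parity, (0, 1), 4))
print(is_fully_sensitive(boolean_or, (0, 1), 4))
```

MINIMUM is not tested here because the property relies on the integers being unbounded below: on any finite domain, an input consisting only of the least element cannot have its minimum changed at a single position.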
We now describe informally the ideas in the proof of theorem 3.2. Note that, for all inputs $x$ and processors $k$, if $i \ne k$ then $x_i$ is unknown to $k$ at time 0. Consider a processor $i$ at the first round. We consider two possibilities. The first is that $i$ is inactive at round 1, regardless of its input value $x_i$. This is good since then, for any processor $k \ne i$, $x_i$ is still unknown to processor $k$ at time 1.
The other possibility is that $i$ is "potentially active"; that is, $i$ is active on at least one possible value of its input. Then, unfortunately, at the end of round 1, $x_i$ may be known to some, and possibly all, other processors. We can use this to our advantage, however, by setting $i$'s input $x_i$ to force $i$ to be active. Then, the broadcast of processor $i$ will block any other broadcast which might have otherwise sent information through $i$.
For the purpose of this informal discussion, suppose that at round 1, all processors are potentially active. Our strategy in this case will be to fix the values of alternate processors, in order to force them to be active. These fixed values determine a partial assignment $\pi \in (D \cup \{*\})^p$, and partition the processors into fixed and free processors. On any input $x$ consistent with the partial assignment $\pi$, the broadcasts of the fixed, active processors block the free processors from revealing any information about their values to too many processors.
In general, for any $t$, $0 \le t \le \log p - 1$, we will define a partial assignment $\pi$ which fixes the input at all but $\lfloor (p - 1)/2^t \rfloor$ free processors. On any input $x$ consistent with the partial assignment $\pi$, the input $x_i$ of a free processor $i$ will be known only to a set of contiguous processors containing $i$ at time $t$.
We now state and prove the main lemma leading to the proof of theorem 3.2.
Lemma 3.3. Fix an algorithm $A$. For any $t$, $0 \le t \le \log p - 1$, there is a partial assignment $\pi$ with at least $\lfloor (p - 1)/2^t \rfloor$ free processors with the following property. On any input $x$ consistent with $\pi$, the input $x_i$ of a free processor $i$ will be known only to a set of contiguous processors $S_i$ containing $i$ at time $t$, where $0 \notin S_i$. Moreover, for any two distinct free processors $i$ and $j$, $S_i \cap S_j = \emptyset$.
Proof. The proof is by induction on $t$. The base case is when $t = 0$. In this case, since no communication has taken place, it is immediate that if $\pi$ is the partial assignment which is not fixed anywhere, then all possible inputs $x$ are consistent with $\pi$, all processors in the range $1, \ldots, p - 1$ are free, and the value $x_i$ of every processor $i$ is known only to processors in the set $S_i = \{i\}$. Suppose the lemma is true for $t - 1 \ge 0$, and let $\pi$ be the partial assignment as in the statement of the lemma. Suppose that at round $t$, active processors broadcast to the right (the other case, when active processors broadcast to the left, is handled similarly).
We will define a partial assignment $\pi'$ which extends $\pi$ and satisfies the lemma for time $t$. To do this, we consider the free processors at time $t - 1$ in order from that with the largest index to that with the smallest index. (We consider processors in the opposite order in the case that the broadcast is to the left.) If free processors $j, i$ occur consecutively in this ordering, with $j > i$, we say that $j$ is $i$'s free neighbor to the right at time $t - 1$.
For each of these processors $i$ in turn, we will determine whether $i$ remains free at round $t$, and if not, we will extend $\pi$ to fix input value $x_i$. If $i$ does remain free, we will define a corresponding set $S'_i$ containing $i$, and will show that at time $t$, on any input $x$ consistent with $\pi'$, $x_i$ is known only to processors in $S'_i$, that $0 \notin S'_i$, and that $S'_i$ and $S'_j$ are disjoint for free processors $i \ne j$. Hence consider some processor $i$ that is free at time $t - 1$. We say that $S_i$ is potentially active if there is some input consistent with $\pi$ such that some processor in $S_i$ broadcasts at round $t$ with that input. Otherwise $S_i$ is said to be inactive.
If $S_i$ is potentially active, then we define $i$ to be free at round $t$ if and only if the following conditions hold: (i) it is not the largest numbered free processor at time $t - 1$, and (ii) processor $i$'s free neighbor to the right at time $t - 1$ is not free at time $t$. (Note that since $i$'s free neighbor to the right has index $j > i$, and since we consider the free processors in order from the largest to the smallest, it is already determined whether $j$ is free at time $t$.) The corresponding set $S'_i$ is defined to be the smallest contiguous set containing $S_i$ and $S_j$, where $j$ is $i$'s free neighbor to the right at time $t - 1$. Otherwise, $i$ is not free at time $t$ and the value of $x_i$ is fixed in $\pi'$, to force some processor in $S_i$ to be active at round $t$. It is important to note that since the processors in $S_i$ do not know the values of $x_j$ for the free processors $j \ne i$ at time $t - 1$, some assignment to the input $x_i$ will force some processor in $S_i$ to be active at round $t$ regardless of any assignment to other inputs whose processors are free at time $t - 1$. If $S_i$ is inactive, then we define $i$ to be free at round $t$ and the corresponding set of processors $S'_i$ to be equal to $S_i$. This completes the description of $\pi'$ and the set $S'_i$ for each free processor $i$.
We now argue that $\pi'$ satisfies the lemma at time $t$. It is straightforward to see from the construction that for each free processor $i$ at time $t$, $i \in S'_i$ and $S'_i$ is a set of contiguous processors. Also, for any two distinct free processors $i$ and $j$, $S'_i \cap S'_j = \emptyset$. This is because the $S_i$ are contiguous, non-overlapping sets, and each $S'_i$ is either $S_i$ or is formed by "collapsing" two neighboring sets $S_i$ and $S_j$, where processor $j$ is free at time $t - 1$ but not at time $t$. Finally, using the fact that no set $S_i$ contains processor 0, we show that no set $S'_i$ contains processor 0. This is easy to see if $S'_i$ equals $S_i$, since we know $S_i$ does not contain processor 0.
Otherwise, $S'_i$ is the smallest contiguous set containing $S_i$ and $S_j$, where $j$ is $i$'s free neighbor to the right at time $t - 1$. Since $0 < i < j$, processor 0 cannot lie between the contiguous sets $S_i$ and $S_j$. This, together with the fact that neither $S_i$ nor $S_j$ contains processor 0, implies that $S'_i$ does not contain processor 0.
We next show that for any $x$ consistent with $\pi'$, if $i$ is free at time $t$ then $x_i$ is known only to those processors in $S'_i$. First note that since processor 0 is not in $S_i$ for any processor $i$ that is free at time $t - 1$, the instruction broadcast by 0 at time $t$ does not reveal any information about the values of processors which are free at time $t - 1$. Also, it is clear that if $S_i$ is inactive at time $t$, for all $x$ consistent with $\pi'$, then $x_i$ is still known only to those processors in $S_i$ at time $t$.
Consider the other case, where $S_i$ is potentially active at time $t$. Then $i$'s free neighbor to the right at time $t - 1$, say processor $j$, is free at time $t - 1$ but not at time $t$. Moreover, by our construction of $\pi'$, on any $x$ consistent with $\pi'$ there is a processor $b$ in $S_j$ which broadcasts at time $t$. Hence, on any input $x$ consistent with $\pi'$, any broadcast of a processor in set $S_i$ reaches only processors in the segment between this active processor and processor $b$. The processors in this segment are contained in the smallest contiguous set containing both $S_i$ and $S_j$. Hence, $x_i$ is known only to processors in $S'_i$ at time $t$.
To complete the proof, it remains to show that there are $\lfloor (p - 1)/2^t \rfloor$ free processors at time $t$. By the inductive hypothesis, there are $\lfloor (p - 1)/2^{t-1} \rfloor$ free processors at time $t - 1$. If $i$ and $j$ are two neighboring free processors at time $t - 1$, then at least one of these is still free at time $t$. To see this, suppose that $i < j$ and that $j$ is not free at time $t$. Then either $S_i$ is inactive at time $t$, in which case $i$ is free at time $t$, or $S_i$ is potentially active, in which case both conditions (i) and (ii) are satisfied, so again $i$ is free at time $t$. Hence the number of free processors at time $t$ is at least $\lfloor \lfloor (p - 1)/2^{t-1} \rfloor / 2 \rfloor = \lfloor (p - 1)/2^t \rfloor$, as required.
The proof of theorem 3.2 now follows easily from lemma 3.3. If $p \ge 2$ then $\lfloor (p - 1)/2^{\log p - 1} \rfloor \ge 1$. Hence by the lemma, there is a partial assignment $\pi$ which is not fixed at one free processor, say $i$, with the following property. On any input $x$ consistent with $\pi$, at time $\log p - 1$, the input $x_i$ will be known only to a set of processors $S_i$, where $0 \notin S_i$. Hence, $x_i$ is unknown to processor 0 at time $\log p - 1$.
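The two arithmetic facts used in the counting argument, the halving identity for nested floors and the bound of at least one free processor in the last round, can be checked numerically. The Python sketch below assumes that logarithms are base 2 and that $\lfloor \log_2 p \rfloor$ is used when $p$ is not a power of two; the function names are illustrative.

```python
def free_count_bound(p: int, t: int) -> int:
    """Lower bound floor((p - 1) / 2^t) on the number of free processors at time t."""
    return (p - 1) // (2 ** t)


def check(p: int) -> int:
    """Verify the halving identity for every round and return the final bound."""
    last_round = p.bit_length() - 2  # floor(log2 p) - 1
    for t in range(1, last_round + 1):
        # Halving the bound for round t - 1 gives exactly the bound for round t.
        assert free_count_bound(p, t - 1) // 2 == free_count_bound(p, t)
    return free_count_bound(p, last_round)


print(check(1024))
print(check(1000))
```

For every $p \ge 2$ the returned value is at least 1, so at least one processor other than processor 0 remains free at time $\log p - 1$.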
Lower bounds of log p time for PARITY and MINIMUM follow immediately from theorem 3.2, as discussed after the statement of that theorem. The same lower bound must also hold for SUM since PARITY can be computed from SUM without any communication. Thus, we have:
Theorem 3.4. On a one-dimensional sub-bus mesh with $p$ processors, the time to compute PARITY, SUM, and MINIMUM is at least $\log p$.
In order to obtain lower bounds for MAJORITY and many other symmetric Boolean functions we need to modify lemma 3.3. If $\pi$ is a partial assignment, define $\pi_0$ to be the number of inputs fixed to 0 in $\pi$ and $\pi_1$ to be the number of inputs fixed to 1 in $\pi$. We say a partial assignment $\pi$ is $b$-balanced if $0 \le \pi_b - \pi_{1-b} \le 1$. That is, $\pi$ is 1-balanced if the number of inputs assigned to 1 in $\pi$ is equal to or one greater than the number of inputs assigned to 0 in $\pi$. Similarly, $\pi$ is 0-balanced if the number of inputs assigned to 0 in $\pi$ is equal to or one greater than the number of inputs assigned to 1 in $\pi$.
Lemma 3.5. Fix an algorithm $A$. For any bit $b$ and for any $t$, $0 \le t \le \log_3 p - 1$, there is a $b$-balanced partial assignment $\pi$ with at least $\lfloor (p - 1)/3^t \rfloor$ free processors with the following property. On any input $x$ consistent with $\pi$, the input $x_i$ of a free processor $i$ will be known only to a set of contiguous processors $S_i$ containing $i$ at time $t$, where $0 \notin S_i$. Moreover, for any two distinct free processors $i$ and $j$, $S_i \cap S_j = \emptyset$.
Proof. This proof is similar to that of lemma 3.3. Assume we have a $b$-balanced partial assignment $\pi$ at time $t - 1$ and a number $n$ of free processors with their associated segments satisfying the condition of the lemma. Assume also that at time $t$ there is a broadcast to the right. As in the proof of lemma 3.3, a segment is inactive if no processor in the segment would become active on any input consistent with $\pi$, and is potentially active otherwise. As before, any processor $i$ which is free at time $t - 1$ and whose segment $S_i$ is inactive remains free at time $t$. Assume the free processors at time $t - 1$ are indexed by $i_1, i_2, \ldots, i_n$ where $i_j > i_{j+1}$ for $1 \le j < n$. We consider these processors three at a time, largest index to smallest, to determine which potentially active processors remain free at time $t$ and, for those that do not remain free, what their inputs will be assigned to in the new $b$-balanced partial assignment $\pi'$. If $n$ is divisible by 3 then this process will end simply. If not, there will be a remaining group of 1 or 2 which must be dealt with.
Assume that for $j \le 3m$, it has already been determined whether $i_j$ is free at time $t$ and if not, what the assignment to $x_{i_j}$ is in the assignment $\pi'$. We now consider the processors $i_{3m+1}$, $i_{3m+2}$ and $i_{3m+3}$, where $3m + 3 \le n$. There are four cases to consider depending on how many of the segments $S_{i_{3m+1}}$, $S_{i_{3m+2}}$, $S_{i_{3m+3}}$ are potentially active. If none are potentially active then there is nothing to do. If exactly one, say $S_{i_{3m+k}}$, is potentially active then set $x_{i_{3m+k}}$ to a value to make $\pi'$ $b$-balanced. That is, if $\pi'_0 = \pi'_1$ then assign $x_{i_{3m+k}}$ to $b$, otherwise set it to $1 - b$. If exactly two of the segments are potentially active, then assign the input associated with one to 0 and the other to 1. Finally, if all three are potentially active then $i_{3m+3}$ remains free, $x_{i_{3m+2}}$ is set so as to force a processor in the segment $S_{i_{3m+2}}$ to be active, and then $x_{i_{3m+1}}$ is set to make $\pi'$ $b$-balanced.
Once the groups of three have been processed there may be one or two remaining free processors. If exactly one of the segments in this remaining group is potentially active, then assign the input of that processor to a value to make the partial assignment b-balanced. If exactly two of the segments of the free processors in this remaining group are potentially active, then assign the two inputs of the free processors to opposite values.
At the end of this process, at least $\lfloor n/3 \rfloor$ of the processors are free. If $i$ is free at time $t$ and $S_i$ is inactive, then $S'_i = S_i$. If $i$ is free at time $t$ and $S_i$ is potentially active, then the corresponding set $S'_i$ is defined to be the smallest contiguous set containing $S_i$ and $S_j$, where $j$ is $i$'s free neighbor to the right at time $t - 1$. This happens in the fourth case above when $i = i_{3m+3}$ and $j = i_{3m+2}$.
It should be clear that $\pi'$ is a $b$-balanced partial assignment with at least $\lfloor (p - 1)/3^t \rfloor$ unassigned inputs. Furthermore, for the same reasons as in the proof of lemma 3.3, for any input $x$ consistent with $\pi'$, the input $x_i$ of a free processor $i$ will be known only to a set of contiguous processors $S_i$ containing $i$ at time $t$, where $0 \notin S_i$. Clearly, the segments at time $t$ are disjoint.
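The case analysis of this proof can be exercised with a small Python sketch that processes one round. It is an illustration under assumptions rather than the authors' construction in full: segments are abstracted to a list of flags saying which free processors have potentially active segments (in decreasing index order), the list `forced` of values that would force a segment to be active is a hypothetical input, and the helper names are invented here. The balancing rule is written for any count difference so that it also covers the step that follows a forced assignment.

```python
def is_balanced(count, b):
    """b-balanced: 0 <= (# fixed to b) - (# fixed to 1 - b) <= 1."""
    return 0 <= count[b] - count[1 - b] <= 1


def balance_value(count, b):
    """Value whose assignment restores b-balance after at most one forced value."""
    return b if count[b] - count[1 - b] <= 0 else 1 - b


def process_round(active, forced, b, count):
    """Process the free processors i_1, ..., i_n in groups of three.

    active[j]: segment of the j-th free processor is potentially active.
    forced[j]: value that would force that segment to be active (hypothetical).
    count:     [number fixed to 0, number fixed to 1] before the round.
    Returns (free flags after the round, updated counts).
    """
    count = list(count)
    n = len(active)
    free = [True] * n

    def fix(j, value):
        free[j] = False
        count[value] += 1

    for start in range(0, n, 3):
        group = range(start, min(start + 3, n))
        hot = [j for j in group if active[j]]
        if len(hot) == 1:
            fix(hot[0], balance_value(count, b))
        elif len(hot) == 2:
            fix(hot[0], 0)
            fix(hot[1], 1)
        elif len(hot) == 3:
            first, second, _third = hot  # the third processor stays free
            fix(second, forced[second])  # its broadcast shields the third
            fix(first, balance_value(count, b))
    return free, count


free, count = process_round(
    active=[True, True, True, True, False, True, True],
    forced=[1, 1, 1, 0, 0, 0, 1],
    b=1,
    count=[0, 0],
)
print(free, count)
```

In every case each full group of three leaves at least one processor free and the counts stay $b$-balanced, which is the invariant the lemma needs.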
As a consequence of lemma 3.5 we have the following theorem, which allows us to find a $b$-balanced input with a component unknown to processor 0 at a time slightly less than the maximum time to find just some input with a component unknown to processor 0 as in theorem 3.2.
Theorem 3.6. On a one-dimensional sub-bus mesh with $p$ processors and for any algorithm $A$ and bit $b$, there exists a $b$-balanced input $x$ such that for some $i$, $1 \le i < p - 1$, $x_i$ is set to $b$, but is unknown to processor 0 at time $\log_3 p - 1$.
Proof. If $p \ge 2$ then $\lfloor (p - 1)/3^{\log_3 p - 1} \rfloor \ge 1$. By lemma 3.5, there is a $(1 - b)$-balanced partial assignment which is not fixed at a free processor $i$. Set $x_i = b$, then set the remaining unassigned inputs so as to make the input $b$-balanced. Since 0 is not a member of the segment $S_i$ at time $\log_3 p - 1$, $x_i$ is not known to processor 0.
Theorem 3.7. On a one-dimensional sub-bus mesh with $p$ processors, the time to compute MAJORITY is at least $\log_3 p$.
Proof. Let $A$ be any algorithm for MAJORITY. There are two cases to consider depending on whether $p$ is even or odd. If $p$ is even then by theorem 3.6 select a 0-balanced input $x$ and an $i$ such that $x_i = 0$ and $x_i$ is not known to processor 0 at time $\log_3 p - 1$. Clearly, processor 0 cannot have computed the majority of the inputs by time $\log_3 p - 1$ since its computation would be identical for the input $x$ and the input $x'$ which is identical to $x$ except that $x'_i = 1$. The latter input has a majority of 1's while the former does not. If $p$ is odd then select a 1-balanced input $x$ and an $i$ such that $x_i = 1$ and $x_i$ is not known to processor 0 at time $\log_3 p - 1$. The remainder of the argument is similar to that above.
For any symmetric Boolean function $f$ on $p$ inputs define $m(f)$ to be the $k$ such that $\frac{p}{2} - k \ge 0$ is minimal and for some bit $b$ the value of $f$ on an input with exactly $k$ inputs equal to $b$ differs from the value of $f$ on an input with $k + 1$ inputs equal to $b$. For example, $m(\text{MAJORITY}) = m(\text{PARITY}) = \lfloor p/2 \rfloor$ and $m(\text{OR}) = m(\text{AND}) = 0$.
Corollary 3.8. On a one-dimensional sub-bus mesh with $p$ processors, the time to compute any symmetric function $f$ is at least $\log_3 (2m(f))$.
Proof. Let $f$ be given. Let $b$ be such that if $m(f)$ inputs have the value $b$ then $f$ has one value and if $m(f) + 1$ inputs have the value $b$ then $f$ has another value. For simplicity consider the case in which $m(f)$ inputs equal to 0 implies the value of $f$ is 0 and $m(f) + 1$ inputs equal to 0 implies the value of $f$ is 1. If exactly $p - 2m(f)$ inputs are set to 1 then the restricted function has $2m(f)$ inputs, and its value changes between $m(f)$ and $m(f) + 1$ zeros among them. By a proof identical to the proof of theorem 3.7, any algorithm to compute the restricted function must take time $\log_3 (2m(f))$. The argument for $b = 1$ is similar.
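The quantity $m(f)$ can be computed by brute force for small $p$, which gives a quick check of the examples quoted with its definition. The Python sketch below assumes a symmetric function is represented by a function of the number of ones in the input, and reads the definition as asking for the largest $k \le p/2$ at which the value changes for some bit $b$. The name `m_of` and the example functions are illustrative.

```python
def m_of(f_by_ones, p):
    """Return m(f) for a symmetric f given as a function of the number of ones."""
    best = None
    for k in range(p):               # k + 1 must not exceed p
        if 2 * k > p:                # require p/2 - k >= 0
            break
        ones_flip = f_by_ones(k) != f_by_ones(k + 1)            # case b = 1
        zeros_flip = f_by_ones(p - k) != f_by_ones(p - k - 1)   # case b = 0
        if ones_flip or zeros_flip:
            best = k                 # the largest such k minimizes p/2 - k
    return best


p = 8
majority = lambda ones: int(2 * ones > p)
parity = lambda ones: ones % 2
boolean_or = lambda ones: int(ones > 0)
boolean_and = lambda ones: int(ones == p)

print(m_of(majority, p), m_of(parity, p), m_of(boolean_or, p), m_of(boolean_and, p))
```

With $p = 8$ this reproduces $m(\text{MAJORITY}) = m(\text{PARITY}) = \lfloor p/2 \rfloor$ and $m(\text{OR}) = m(\text{AND}) = 0$.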
Define $\text{THRESHOLD}_k$ to be the Boolean function which is 0 with $k$ or fewer inputs set to 1 and 1 otherwise. Clearly, $m(\text{THRESHOLD}_k) = \min(k, p - k)$. Thus, we have the following: on a one-dimensional sub-bus mesh with $p$ processors, the time to compute $\text{THRESHOLD}_k$ is at least $\log_3 (2 \min(k, p - k))$.
4. Algorithms for PARITY and SUM. In this section we present asymptotically optimal algorithms for PARITY and SUM on the two-dimensional sub-bus mesh computer. We start with the PARITY algorithm. It is the simpler of the two, and introduces some of the key ideas which are useful in the SUM algorithm.
4.1. PARITY Algorithm. We will introduce a series of problems, in increasing order of difficulty. The algorithm for each problem will lead to the next one with some fresh tricks. This will help us concentrate on one idea at a time.
Each of the algorithms below can be executed on a sub-mesh of the $\sqrt{p} \times \sqrt{p}$ mesh. By an array or sub-array we mean a sub-mesh of the full mesh which may be non-square and non-contiguous. In case it is non-contiguous, it is assumed that the processors between any two processors in the sub-array are inactive so as not to interfere with communication between the processors in the sub-array. Furthermore, any of the algorithms below can be executed in parallel on disjoint and properly aligned sub-arrays of the full $\sqrt{p} \times \sqrt{p}$ array. If the algorithm is executed on an $m \times n$ sub-array, then we say processor $(i, j)$ is the processor in the $(i, j)$-th position (the $i$-th column and $j$-th row) of the sub-array, where $0 \le i < m$ and $0 \le j < n$. Although it is not generally the case that processor $(i, j)$ has its PIDx = $i$ and PIDy = $j$, it will always be the case that $i$, $j$, and the dimensions of the sub-array can be computed from the PID of the processor and other local data in constant time.
Lemma 4.1. On an $n \times 2^n$ array with each processor in the top row having an input bit, the parity of the input bits can be computed in constant time.
Proof. There are $2^n$ possible inputs, so we make row $j$ of the array responsible for determining whether the input, thought of as an integer $x$ with $0 \le x \le 2^n - 1$, actually equals $j$. In particular, processor $(i, j)$ determines whether the input in processor $(i, 0)$ equals the $i$-th bit of $j$. A downward broadcast of the input gives processor $(i, j)$ knowledge of the input in processor $(i, 0)$. Then processor $(i, j)$ compares the input of processor $(i, 0)$ with the $i$-th bit of $j$. A constant time AND of the outcomes of these comparisons, in all the rows in parallel, tells processor $(0, j)$ whether the input, thought of as an $n$-bit number, equals $j$. This information can then be broadcast up to processor $(0, 0)$. Since $2^n \le \sqrt{p}$, we know that $j$ has at most $\log p$ bits, so that processor $(0, 0)$ can compute the sum of the bits in $j$ in constant time, using the fact that computing the sum of the bits of an input is in uniform NC. The parity of the input bits is the parity of this sum.
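The row-lookup idea in this proof can be mimicked sequentially. The sketch below is not the mesh program itself: it replaces the parallel rows by a loop, assumes bit $i$ of $j$ means `(j >> i) & 1`, and uses the hypothetical name `parity_by_row_lookup`.

```python
def parity_by_row_lookup(bits):
    """Simulate Lemma 4.1 on an n x 2^n array; bits[i] is the input of processor (i, 0)."""
    n = len(bits)
    matching_rows = []
    for j in range(2 ** n):  # on the mesh, all rows work in parallel
        # Processor (i, j) compares the broadcast input bit with the i-th bit of j.
        agrees = [bits[i] == ((j >> i) & 1) for i in range(n)]
        if all(agrees):  # constant-time AND along row j
            matching_rows.append(j)
    assert len(matching_rows) == 1  # exactly one row encodes the input
    j = matching_rows[0]  # row j reports j up to processor (0, 0)
    return bin(j).count("1") % 2  # parity of the bit sum of j
```

The loop over $2^n$ rows is what the mesh performs in a single parallel step, which is why the lemma needs exponentially many rows.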
We saw that with exponentially many rows we can compute the parity in constant time. In general, if we have more than a constant number of rows, we can beat the straightforward O(log n) time algorithm.
Lemma 4.2. On an $n \times m$ array with each processor in the top row having an input bit, the parity of the input bits can be computed in time $O(\log n / \log\log m)$.
Proof. We can think of the $n \times m$ array as $n/\log m$ sub-arrays of dimension $\log m \times m$ placed side by side. As in the previous proof, we can compute the parity of groups of $\log m$ bits in constant time. This leaves $n/\log m$ partial results in the first row of an array of dimension $(n/\log m) \times m$. Repeating the process $\log n / \log\log m$ times, we have the parity of all the $n$ bits.
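The round count in this proof can be illustrated by a short sequential sketch. It assumes $m \ge 4$ and a group size of $\lfloor \log_2 m \rfloor$ bits, and it replaces each constant-time sub-array step by a direct parity computation; the name `parity_in_rounds` is hypothetical.

```python
def parity_in_rounds(bits, m):
    """Return (parity, rounds) for n input bits on an array with m rows, m >= 4."""
    group = m.bit_length() - 1  # floor(log2 m) bits are handled per sub-array
    if group < 2:
        raise ValueError("need m >= 4 so that each round shrinks the input")
    current = list(bits)
    rounds = 0
    while len(current) > 1:
        # Each slice stands for one (log m) x m sub-array applying Lemma 4.1.
        current = [sum(current[k:k + group]) % 2
                   for k in range(0, len(current), group)]
        rounds += 1
    parity = current[0] if current else 0
    return parity, rounds
```

With $n = 256$ and $m = 16$ the input shrinks as 256, 64, 16, 4, 1, so four rounds are used, matching $\log n / \log\log m = 8/2$.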
So far we have been assuming that only the processors in the top row have inputs. Let us now consider the case where each processor has an input.
Lemma 4.3. On an $n \times m$ array with each processor having an input bit, the parity of the input bits can be computed in time $O(\log m + \log n / \log\log m)$.
Proof. First, in parallel, the processors within each column run the one-dimensional PARITY algorithm described in section 3.1. This part takes time $O(\log m)$. At this point, we have partial results stored in the top row. From the previous lemma, the parity of these partial results can be computed in an additional $O(\log n / \log\log m)$ steps.
We are ready to give our PARITY algorithm.
Q and with each processor in the top row having an input integer, the sum of the inputs modulo $Q$ can be computed in time $O(\log n / \log\log n)$.
Proof. Let $m = \log n / \log Q$. Let us focus on an $m \times n$ sub-array which has $m$ inputs on the top row. There are $Q^m = n$ possibilities for the $m$ inputs mod $Q$. For $0 \le j < n$, think of $j$ as an integer written in base $Q$. As in the computation of parity, processor $(i, j)$ is responsible for determining whether the $i$-th input mod $Q$ is equal to the $i$-th $Q$-ary digit of $j$. Processor $(i, j)$ learns of the input at $(i, 0)$ by a broadcast down from the first row. Then, processor $(0, j)$ learns from an AND on its row whether the $i$-th $Q$-ary digit of $j$ is the $i$-th input mod $Q$ for all $i$ such that $0 \le i < m$. If so, processor $(0, j)$ transmits $j$ to processor $(0, 0)$, where the sum mod $Q$ of the $Q$-ary digits of $j$ is computed.
The original $n \times n$ array can be divided into $n/m$ sub-arrays of dimension $m \times n$ placed side by side, where the algorithm above is performed in parallel. What remains are $n/m$ numbers in the top row. Iterate this process $O(\log n / \log m) = O(\log n / (\log\log n - \log\log Q))$ times to compute the sum of all the $n$ integers modulo $Q$.
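The base-$Q$ lookup and its iteration can be sketched sequentially as follows. This is an illustration under assumptions: the rows of a sub-array are visited in a loop rather than in parallel, the group size $m \ge 2$ is passed in explicitly, and the names `block_sum_mod_q` and `sum_mod_q` are hypothetical.

```python
def block_sum_mod_q(inputs, q):
    """Simulate one m x n sub-array: inputs holds m integers, and q ** m rows exist."""
    m = len(inputs)
    residues = [x % q for x in inputs]
    for j in range(q ** m):  # on the mesh, all rows act in parallel
        # digits[i] is the i-th base-q digit of the row index j.
        digits = [(j // q ** i) % q for i in range(m)]
        if digits == residues:  # row-wide AND of the m comparisons
            # Processor (0, j) sends j to (0, 0), which adds the digits of j mod q.
            return sum(digits) % q
    raise AssertionError("some row always matches")


def sum_mod_q(inputs, q, m):
    """Apply the block step to groups of m values until one value remains."""
    if m < 2:
        raise ValueError("need m >= 2 so that each round shrinks the input")
    current = list(inputs)
    while len(current) > 1:
        current = [block_sum_mod_q(current[k:k + m], q)
                   for k in range(0, len(current), m)]
    return current[0] % q if current else 0
```

Each pass divides the number of remaining values by $m$, which is where the $O(\log n / \log m)$ iteration count comes from.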
To complete the proof we must argue that computing the sum mod $Q$ of the $Q$-ary digits of a number $x$ is in uniform NC. We assume both $Q$ and $x$ are written in binary. First, since $\log Q \le \sqrt{\log n}$ and $n \le \sqrt{p}$, the length of $Q$ is $O(\sqrt{\log p})$. Second, $x \le \sqrt{p}$, so that $x$ is of length $O(\sqrt{\log p})$. Thus, the lengths of $Q$ and $x$ can be assumed to be bounded by the same number $b$, which we can assume is a power of two and of length $O(\sqrt{\log p})$. To find the sum mod $Q$ of the $Q$-ary digits of $x$, write $x$ as $x_0 + Q^{b/2} x_1$, by dividing $x$ by $Q^{b/2}$. Recursively, find the $a_0$ and $a_1$ which are the sums mod $Q$ of the $Q$-ary digits of $x_0$ and $x_1$ respectively. Then, $a = (a_0 + a_1) \bmod Q$ is the sum mod $Q$ of the $Q$-ary digits of $x$. The necessary powers of $Q$, division by these powers, and sum are all in uniform NC. Since the number of levels of recursion is bounded by $\log_2 b$, the result $a$ can also be computed in uniform NC.
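The halving recursion described above can be written out directly. The sketch below is sequential and assumes that $b$ is a power of two bounding the number of $Q$-ary digits of $x$; the function name `digit_sum_mod_q` is hypothetical.

```python
def digit_sum_mod_q(x, q, b):
    """Sum mod q of the b low-order base-q digits of x, where b is a power of two."""
    if b == 1:
        return x % q  # a single base-q digit
    half = q ** (b // 2)
    x0, x1 = x % half, x // half  # x = x0 + q^(b/2) * x1
    a0 = digit_sum_mod_q(x0, q, b // 2)
    a1 = digit_sum_mod_q(x1, q, b // 2)
    return (a0 + a1) % q
```

The recursion depth is $\log_2 b$, and the two recursive calls at each level are independent, which is what allows them to be evaluated in parallel.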
By the Chinese remainder theorem we know that if we can compute the sum modulo sufficiently many small integers, we can compute the exact sum.
Lemma 4.6. If $6 \le \log t \le \sqrt{\log n}$ then on an $n \times tn$ array with each processor in the top row having an input integer such that the sum of these integers is less than $2^t$, the sum of the inputs can be computed in time $O(\log n / \log\log n)$.
Proof. Think of the $n \times tn$ array as $t$ sub-arrays, each of size $n \times n$, one on top of the other. Let $M$ be the product of all primes less than $t$. If $t \ge 41$ then $M \ge 2^t$ [0, Corollary to Theorem 4, page 70]. Hence, if the sum of the input integers is less than $2^t$ then it is enough to compute the sum modulo $M$. We already know how to compute the sum modulo small primes. Our plan is to let each $n \times n$ sub-array compute the sum modulo a different prime and then apply the Chinese remainder algorithm to compute the sum modulo $M$.
To begin with, the processors in the top row broadcast the input values down the columns. The $j$-th sub-array decides whether $j$ is a prime. This can be done in two stages. A number $j \ge 2$ is prime if and only if it is not divisible by any integer $d$ with $1 < d \le \sqrt{j}$. In the first stage, assign $\sqrt{j}$ processors in the first row to check for each possible divisor. In the second stage, these processors compute an AND of their results. Only processors in the $j$-th sub-array for prime $j$ participate in all subsequent steps. The $j$-th sub-array computes $a_j$, the sum of all the inputs modulo $j$. By Lemma 4.5, this can be done in $O(\log n / \log\log n)$ time. Next, in $O(\log t)$ steps, each processor in the $j$-th sub-array computes $M$, the product of all primes less than $t$, and $m_j = M/j$. The processors in the $j$-th sub-array compute $(a_j m_j)(m_j^{-1} \bmod j)$. This can be done in constant time. The nontrivial part is computing $m_j^{-1} \bmod j$. There are at most $j$ possible values for the inverse. We assign $j$ processors in (say) the first row of the $j$-th sub-array, one for each possible value of the inverse. In one step, each of these assigned processors can check whether it has the right value of the inverse. The processor corresponding to the right value of the inverse broadcasts this to all other processors. By the Chinese remainder theorem, the exact value of the sum is the summation, taken modulo $M$, of $(a_j m_j)(m_j^{-1} \bmod j)$ over the primes $j$ less than $t$, which can be computed in $O(\log t)$ steps.
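The three ingredients of this step (trial-division primality, exhaustive search for the inverse, and Chinese remainder reconstruction) fit in a short sequential sketch. It assumes the input sum is less than $M$, computes each residue $a_j$ directly as a stand-in for the Lemma 4.5 step, and reduces the final summation modulo $M$; the names `is_prime`, `inverse_mod` and `sum_by_crt` are hypothetical.

```python
import math


def is_prime(j):
    """Trial division by every d with 1 < d <= sqrt(j)."""
    if j < 2:
        return False
    return all(j % d != 0 for d in range(2, math.isqrt(j) + 1))


def inverse_mod(a, j):
    """Try each of the j candidate values, as the j assigned processors would."""
    for c in range(j):
        if (a * c) % j == 1:
            return c
    raise ValueError("a has no inverse modulo j")


def sum_by_crt(inputs, t):
    """Recover the sum of inputs from its residues modulo the primes below t."""
    primes = [j for j in range(2, t) if is_prime(j)]
    big_m = math.prod(primes)  # M, the product of all primes less than t
    total = 0
    for j in primes:
        a_j = sum(x % j for x in inputs) % j  # stand-in for the Lemma 4.5 step
        m_j = big_m // j
        total += a_j * m_j * inverse_mod(m_j % j, j)
    return total % big_m  # exact whenever the true sum is below M
```

For $t = 41$ the product of the primes below $t$ is 7420738134810, which exceeds $2^{41}$, so any input sum below $2^{41}$ is recovered exactly.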
In total the computation can be done in $O(\log t + \log n / \log\log n) = O(\log n / \log\log n)$ time.
Lemma 4.7. If $tw \le m$ and $6 \le \log t \le \sqrt{\log w}$ then on an $n \times m$ array with each processor in the top row having an input integer such that the sum of these integers is less than $2^t$, the sum of the inputs can be computed in time $O(\log n / \log\log w)$.
Proof. Think of the $n \times m$ array as $n/w$ sub-arrays, each of dimension $w \times m$, side by side. Since $tw \le m$, each $w \times m$ sub-array can be thought of as containing a $w \times tw$ sub-array at the top. Each $w \times tw$ sub-array has its input on the top row. By lemma 4.6, the sum of the inputs of all the $w \times tw$ sub-arrays can be computed in time $O(\log w / \log\log w)$, leaving $n/w$ partial sums in the top row. This process is repeated $\log n / \log w$ times until the input is reduced to a single number. This reduction takes time $O(\log n / \log\log w)$.
We now consider the case where each processor has an input.
Lemma 4.8. If $tw \le m$ and $6 \le \log t \le \sqrt{\log w}$ then on an $n \times m$ array with each processor having an input integer such that the sum of these integers is less than $2^t$, the sum of the inputs can be computed in time $O(\log m + \log n / \log\log w)$.
Proof. The first step is simply to add the columns in parallel in time $O(\log m)$. We are now reduced to the problem in the previous lemma. Hence the total time is $O(\log m + \log n / \log\log w)$.
Proof. Choose $t = \log(p^k)$ where $p^k$ is an upper bound on the sum of the input integers and $\log t \ge 6$. Let $c$ be a constant, which depends only on $k$, such that $\log t \le \sqrt{c \log p / \log\log p}$. Now, choose $w$ such that $\log w = c \log p / \log\log p$. Let $m = tw$, so that $t$, $w$, and $m$ satisfy the hypothesis of lemma 4. In the next section we will show these bounds are optimal.
5. Simulation of a Sub-Bus Mesh by a CRCW PRAM. In this section, we prove an $\Omega(\log p / \log\log p)$ lower bound for computing PARITY, MAJORITY, and SUM on the two-dimensional sub-bus mesh computer. To prove this, we show that any algorithm for the sub-bus model can be simulated by a CRCW (concurrent read, concurrent write) PRAM algorithm with only a constant factor loss in running time. We then apply lower bound results for PARITY on the PRAM model. We begin this section by describing the PRAM results. Then we describe our lower bound model and describe the simulation in detail.
Beame and Håstad [0] considered lower bounds for the following "ideal" CRCW PRAM model. There are $p(n)$ numbered processors which share $c(n)$ numbered memory cells, where $p(n)$ and $c(n)$ are polynomially bounded. There is no bound on the possible contents of a memory cell. Initially, the input bits $x_0, \ldots, x_{n-1}$ are stored in the first $n$ memory cells and the remaining cells have value 0. Before each step $t$, a processor is in some state, say $q$. At the $t$-th step, the processor may read the value $v$ stored in some memory cell $C$. Based on $C$, $v$ and $q$, the processor moves to a new state $q'$, and may write a value $v'$ to some cell $C'$. There is no limit on the number of states of a processor nor on the resources needed to compute $v'$ and $C'$. If several processors attempt to write into the same memory cell at the same step, the lowest numbered processor succeeds. This model is called the priority CRCW PRAM. Beame and Håstad have shown that the time to compute any of PARITY, MAJORITY, and SUM on the ideal CRCW PRAM is $\Omega(\log n / \log\log n)$ [0].
In our lower bound model of the two-dimensional sub-bus there are $p$ processors, numbered $0, 1, \ldots, p - 1$, connected in a $\sqrt{p} \times \sqrt{p}$ mesh as in the two-dimensional upper bound model. Processor 0 is the front-end. The computation proceeds in rounds costing one unit, and in each round, the processors first communicate and then perform arbitrary internal computation. The communication is done just as in the upper bound model. Again, there is no bound on the size of values broadcast or computed. The main result of this section is the following.
In the simulating PRAM, there are $p$ processors, numbered $0, 1, \ldots, p - 1$, corresponding to the $p$ processors of the sub-bus model. There are also auxiliary processors, whose computation will be described later.
There are two special memory cells called Condition and Statement, used by processor 0 to communicate the instruction at each round. Also, corresponding to each processor $i$, $0 \le i \le p - 1$, there are the following special memory cells (we do not specify their exact addresses, but assume they are computable by processor $i$). Active($i$) is used to denote at each round whether $i$ is active. It is initialized to false at the beginning of each round. Send($i$) is used to store the value broadcast by $i$, at each round, if $i$ is active. Receive($i$) is used to store the value, if any, received by $i$ in each round. At the start of each round, processor $i$ sets Active($i$) to false and sets Receive($i$) to some special value which is not in the range of possible values that can be broadcast by A.
We now describe the simulation of a single round of A. First, processor 0 writes the strings <condition> and <statement> in cells Condition and Statement, respectively. Each processor $i$, $0 \le i \le p - 1$, reads these cells and decides if it is active at this round. If so, $i$ writes the value to be broadcast in Send($i$) and sets Active($i$) to true.
We next describe how each processor $i$ determines the value it receives (if any) during the broadcast instruction. If $i$ receives a value, it is from one of $\sqrt{p}$ processors on either the vertical or horizontal bus along which $i$ is connected. Let these processors be numbered $i_1, \ldots, i_{\sqrt{p}}$, where the ordering is such that if $i_1$ is active, then $i_1$ is the processor from which $i$ receives a message; if $i_1$ is not active but $i_2$ is, then $i_2$ is the processor from which $i$ receives a message, and so on, so that if $i_k$ is active and none of $i_1, \ldots, i_{k-1}$ are active, then $i_k$ is the processor from which $i$ receives a message. For example, if the direction of communication is up, then the sequence is $(i + \sqrt{p}) \bmod p, (i + 2\sqrt{p}) \bmod p, \ldots, (i + \sqrt{p}\sqrt{p}) \bmod p$.
Each processor $i$ has $\sqrt{p}$ auxiliary processors to help it compute the value it receives, if any. Let the auxiliary processors of $i$ be numbered $n_i + 1, \ldots, n_i + \sqrt{p}$, where $n_i = p + i\sqrt{p} - 1$. Each processor $n_i + k$ computes $i_k$; this can be done by reading <statement>, to determine the direction of communication. If processor $i_k$ is active, that is, Active($i_k$) is true, then processor $n_i + k$ reads the value Send($i_k$) and writes it in Receive($i$). Because of the ordering of the auxiliary processors, and the priority write conflict resolution assumption, the value written in Receive($i$) is the value received by processor $i$ at that round of A, in the sub-bus computation.
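The role of the priority write rule can be seen in a small sequential sketch of one simulated round. It covers only the direction "up", uses the candidate sequence $(i + k\sqrt{p}) \bmod p$ and the numbering $n_i = p + i\sqrt{p} - 1$ given in the text, and represents the special "nothing received" value by `None`; the function name and the list-based memory cells are assumptions made for illustration.

```python
import math


def simulate_up_broadcast(active, send):
    """One broadcast round with direction 'up' on a sqrt(p) x sqrt(p) mesh."""
    p = len(active)
    s = math.isqrt(p)
    assert s * s == p, "p must be a perfect square"
    attempts = []  # (writer number, target cell, value)
    for i in range(p):
        n_i = p + i * s - 1
        for k in range(1, s + 1):
            i_k = (i + k * s) % p  # k-th candidate sender for processor i
            if active[i_k]:
                # Auxiliary processor n_i + k tries to write Send(i_k) into Receive(i).
                attempts.append((n_i + k, i, send[i_k]))
    receive = [None] * p  # None plays the special 'nothing received' value
    # Priority rule: apply writes from highest to lowest writer number,
    # so the lowest numbered writer to each cell is the one that sticks.
    for _, cell, value in sorted(attempts, key=lambda a: a[0], reverse=True):
        receive[cell] = value
    return receive
```

On a 3 x 3 mesh with only processors 3 and 6 active, processor 0 receives from processor 3 (its first candidate) and processor 3 receives from processor 6, because the auxiliary processor with the smallest $k$ wins each write conflict.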
Once the cells Receive($i$) have been computed, each processor $i$, $0 \le i \le p - 1$, completes the round by simulating the internal computation of the $i$-th sub-bus processor. This completes the description of the simulation. It is clear that all the steps described can be done by the processors of the ideal CRCW PRAM, since they have unbounded resources with which to compute at each step, and an unbounded number of states which can be used to store the internal configurations of the processors of the sub-bus mesh computer.
As a direct consequence of theorem 5.1 we have:
To explain the algorithm in more detail, at the beginning of step 1, there are possible minima in every processor along every $b$-th row. Then step 1 does the constant time MINIMUM on each $b \times b$ block in parallel. Within each block, it distributes the $b$ initial values into the variables $x$ and $y$ for each processor. Thus, the processor in position $(i, j)$ in each block has the initial value of processor $(i, 0)$ in its $x$ register and the initial value of processor $(j, 0)$ in its $y$ register. A comparison of $x$ and $y$ is recorded in the boolean value $t$. After broadcasting $t$ up, the only processor with the value false is the processor in position $(i, 0)$ which has the minimum for this block. So, that processor will broadcast its initial value $x$ to processor $(0, 0)$ within the block.
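The all-pairs comparison inside one block can be sketched sequentially. The text does not say how equal values are handled, so the sketch assumes ties are broken by column index; it also assumes $t$ is true when $x$ loses a comparison and that the upward broadcast acts as an OR along each column. The name `block_minimum` is hypothetical.

```python
def block_minimum(values):
    """All-pairs comparison on a b x b block; values[i] starts at processor (i, 0)."""
    b = len(values)
    beaten = []
    for i in range(b):  # column i of the block
        # Processor (i, j) holds x = values[i], y = values[j] and sets t if x loses.
        # Ties are broken by index (an assumption, not stated in the text).
        t_column = [(values[i], i) > (values[j], j) for j in range(b)]
        beaten.append(any(t_column))  # OR of t along the column
    winners = [i for i in range(b) if not beaten[i]]
    assert len(winners) == 1  # exactly one column is never beaten
    return values[winners[0]]
```

The $b^2$ comparisons are independent, which is why a $b \times b$ block can finish this step in a constant number of rounds.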
In step 2, we have the situation where the processors $(ib, jb)$ have possible minima, and we want to move them all to rows, so that the processors $(i, jb^2)$ all have potential minima. This is accomplished in two sub-steps. First, broadcast the potential minima to the right. Second, selectively broadcast the minima up. That is, each potential minimum at processor $(ib + k, jb^2 + kb)$ for $0 \le k < b$ is broadcast up. In case $p$ is not of the form $2^{2^k}$ for some $k$, the algorithm must be modified slightly. In the modified algorithm we will always maintain an active rectangular submesh which is $\sqrt{p} \times q$ where $q \le \sqrt{p}$ and $b$ divides $q$ evenly. In attempting step 2 it may happen that b blocks. This is sufficient to solve the problem in time $O(\log\log p)$.
7. Conclusions. We have proved tight bounds (to within constant factors) on the time needed to compute several functions on the sub-bus mesh computer. For some of these problems, such as PARITY, MINIMUM and SUM, the running times on a sub-bus mesh computer match (to within constant factors) the running times on a general PRAM. Moreover, machines based on the sub-bus mesh architecture are commercially available [0]. For these reasons, we believe that the sub-bus mesh architecture deserves further study.
Our algorithms for PARITY and SUM are probably not practical for any reasonable size $p$ for two reasons. First, the speed-up by a factor of $O(\log\log p)$ has too large a constant factor to be significant. Second, it is doubtful that hardware designers would want to implement the new NC functions required by the algorithm. It is possible to remove the second factor inhibiting practicality by adding preprocessing phases to the algorithms. A preprocessing phase uses only standard arithmetic/boolean operations to compute values which depend only on the processor PID's and the structure of the algorithm and which in the original algorithm would be computed as results of NC functions. Using preprocessing, many more values would be computed than are actually used in the algorithm, since during the preprocessing it is not known exactly which values will be needed later on. The necessary preprocessing for the two algorithms PARITY and SUM is complicated, but can be accomplished within the $O(\log p / \log\log p)$ time bound.
We have implemented the $O(\log\log p)$ MINIMUM algorithm on a 1,024 processor MasPar MP-1. The constant factor in front of the $\log\log p$ forces the algorithm to run more slowly than the standard $O(\log p)$ algorithm. However, we believe that the ideas in the $O(\log\log p)$ MINIMUM algorithm have the potential to be used in a competitive practical algorithm for finding the minimum on commercially available meshes with more than 1,024 processors.
Several open questions are suggested by our results. Are there simpler $O(\log p / \log\log p)$ algorithms for PARITY or SUM on the two-dimensional sub-bus mesh that may be competitive on real machines? Is it possible to improve the lower bound for MAJORITY on a one-dimensional mesh from $\log_3 p$ to $\log_2 p$? Can our lower bound for PARITY on the two-dimensional sub-bus mesh be simplified, or improved by a constant factor, using the mesh model directly rather than translating results from the PRAM model?
