This paper brings together a number of previously known techniques in order to obtain practical and efficient implementations of the prefix operation for the complete binary tree, hypercube and shuffle exchange families of networks. For each of these networks, we also provide a "pipelined" scheme for performing k prefix operations in O(k + log p) time on p processors. This implies a similar pipelining result for the "data distribution" operation of Ullman [16]. The data distribution primitive leads to a simplified implementation of the optimal merging algorithm of Varman and Doshi, which runs on a pipelined model of the hypercube [17]. Finally, a pipelined version of the multi-way merge sort of Nassimi and Sahni [10], running on the pipelined hypercube model, is described. Given p processors and n < p log p values to be sorted, the running time of the pipelined algorithm is O(log^2 p / log((p log p)/n)). Note that for the interesting case n = p this yields a running time of O(log^2 p / log log p), which is asymptotically faster than Batcher's bitonic sort [3].
The Prefix Operation
We begin by reviewing the basic definitions necessary to understand the prefix and segmented prefix operations. These operations were first introduced by Schwartz, where they are referred to as "summing" and "summing by groups" [14].
Let ⊕ denote a binary associative operation mapping X × X to X, for some domain X. Given n values x_0, ..., x_{n-1} belonging to X, the Prefix operation computes each of the partial sums y_i = x_0 ⊕ ... ⊕ x_i, 0 ≤ i < n. For example, assume that ⊕ is addition, n = 5, x_0 = 5, x_1 = 2, x_2 = 6, x_3 = 4 and x_4 = 9. Then the output of Prefix is y_0 = 5, y_1 = 7, y_2 = 13, y_3 = 17 and y_4 = 26.

Given an additional n boolean values a_0, ..., a_{n-1}, we can partition the n given x_i values into contiguous intervals in the following manner: an interval begins at each i such that a_i = true and extends up to, but not including, the next highest integer j such that a_j = true. The first interval begins at processor 0 regardless of the value of a_0, and the last interval ends at processor n - 1. The segmented Prefix operation executes a prefix operation over each interval. Extending the example of the preceding paragraph, assume that a_2 and a_4 are true while a_0, a_1 and a_3 are false. Then the x_i values are partitioned into the intervals {x_0, x_1}, {x_2, x_3} and {x_4}, and the output of the segmented Prefix operation is y_0 = 5, y_1 = 7, y_2 = 6, y_3 = 10 and y_4 = 9.

When we give implementations of the Prefix operation in Section 2, it will be convenient to assume that there is an identity element for ⊕ in X, which we denote 0_⊕. This assumption can be made without loss of generality because if no such element exists, we can simply augment the set X with an identity element 0_⊕ by defining 0_⊕ ⊕ x = x and x ⊕ 0_⊕ = x for all x ∈ X.

Definition 1.1 For all pairs of boolean values a_0, a_1 and all x_0, x_1 ∈ X, let ⊕' denote the binary operation (a_0, x_0) ⊕' (a_1, x_1) = (a_0 or a_1, if a_1 then x_1 else x_0 ⊕ x_1).
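As a concrete point of reference, the Prefix and segmented Prefix operations can be summarized sequentially; the following sketch (in Python, purely for illustration; the helper names prefix and segmented_prefix are ours) reproduces the worked example above.

# Sequential reference for Prefix and segmented Prefix (illustrative sketch only;
# the paper works with network programs, not sequential code).
def prefix(op, xs):
    """Return [x_0, x_0 op x_1, ..., x_0 op ... op x_{n-1}]."""
    ys, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        ys.append(acc)
    return ys

def segmented_prefix(op, xs, flags):
    """Restart the running sum whenever flags[i] is true (interval boundaries)."""
    ys, acc = [], None
    for x, a in zip(xs, flags):
        acc = x if (a or acc is None) else op(acc, x)
        ys.append(acc)
    return ys

# The worked example from the text:
add = lambda u, v: u + v
print(prefix(add, [5, 2, 6, 4, 9]))                              # [5, 7, 13, 17, 26]
print(segmented_prefix(add, [5, 2, 6, 4, 9],
                       [False, False, True, False, True]))       # [5, 7, 6, 10, 9]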
The operation ⊕' will be referred to as the segmented ⊕ operation. We note the following properties of ⊕'.

Remark 1 The second component of (a_0, x_0) ⊕' (a_1, x_1) ⊕' ... ⊕' (a_k, x_k) is x_j ⊕ x_{j+1} ⊕ ... ⊕ x_k, where j is the highest index less than or equal to k such that a_j = true, or 0 if there is no such index.

Remark 2 The operation ⊕' is not commutative, even if ⊕ is.

Remark 3 The operation ⊕' is associative.
Remark 1 is an immediate consequence of Definition 1.1. For Remark 2, let x_0, x_1 be distinct elements of X and note that (true, x_0) ⊕' (true, x_1) = (true, x_1) while (true, x_1) ⊕' (true, x_0) = (true, x_0). Remark 3 follows from the observation that for all boolean values a_0, a_1, a_2 and all x_0, x_1, x_2 ∈ X, both ((a_0, x_0) ⊕' (a_1, x_1)) ⊕' (a_2, x_2) and (a_0, x_0) ⊕' ((a_1, x_1) ⊕' (a_2, x_2)) evaluate to (a_0 or a_1 or a_2, if a_2 then x_2 else if a_1 then x_1 ⊕ x_2 else x_0 ⊕ x_1 ⊕ x_2).
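In code, Remark 1 amounts to the statement that an ordinary prefix computation under ⊕' yields the segmented prefix sums of the underlying values. A minimal sketch (Python, illustrative only, reusing the prefix helper from the previous example):

# The segmented operation op' built from an arbitrary associative op, as in
# Definition 1.1. Pairs are (flag, value) tuples; prefix() is assumed from the
# previous example.
def make_segmented(op):
    def seg_op(p0, p1):
        a0, x0 = p0
        a1, x1 = p1
        return (a0 or a1, x1 if a1 else op(x0, x1))
    return seg_op

add   = lambda u, v: u + v
xs    = [5, 2, 6, 4, 9]
flags = [False, False, True, False, True]
# Ordinary prefix under op' ...
out = prefix(make_segmented(add), list(zip(flags, xs)))
# ... whose second components are exactly the segmented prefix sums (Remark 1).
print([x for (_, x) in out])     # [5, 7, 6, 10, 9]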
Network Implementations
In this section, we develop efficient implementations of the Prefix operation for the complete binary tree, hypercube and shuffle exchange families of networks. We will be concerned with p-processor network implementations of the Prefix operation where processor i initially contains the value x_i, 0 ≤ i < p, and n = p. The computation is considered to be complete when the partial sum y_i = x_0 ⊕ ... ⊕ x_i has been computed at processor i, 0 ≤ i < p.

The model of computation that we adopt for our networks may be defined as follows. Each processor has an infinite local memory configured in O(log p)-bit words and can perform the usual set of CPU operations in constant time on word-sized operands. Processors communicate with one another by sending packets over the links provided by the network. A packet consists of a single word of data. The complexity of our algorithms will be stated in terms of time steps. Unless otherwise stated, running times should be assumed to be accurate to within an additive constant. In a single time step, each processor is allowed to send and/or receive at most one packet (1-port communication), and execute a constant number of CPU operations on local data. We will assume that the x_i's, as well as all partial sums of the x_i's, are word-sized quantities.

All interprocessor communication in our programs is specified using the pair of routines Send and Receive. Send takes two arguments: the first specifies the word of data to be transmitted, and the second specifies the id of the destination processor. Receive is a function with one argument, which specifies the id of the source processor. Once a packet arrives from the source, the word of data contained in that packet is returned as the value of the function. In order to comprise a valid source/destination pair, two processors must be adjacent in the network.
Binary Tree
The first implementation of Prefix that we consider is the standard two-pass algorithm for the inorder complete binary tree. Assume that we are given a tree of size p = 2^d - 1, with processors numbered inorder from 0 to 2^d - 2. An example of such a network is given in Figure 1.

(6) if not Leaf then Send(y_L, LeftChild);
(7) if not Leaf then Send(y_R, RightChild);
(8) return(y_R);
end Prefix

As mentioned above, the program makes two passes over the tree. The first pass is upward, from the leaves to the root, and the second pass is downward. For every processor p, let T(p) denote the subtree rooted at processor p. Note that the ids of the processors in T(p) form a contiguous block of integers. During the upward pass, each processor receives the sums of its left and right subtrees (x_L and x_R), computes the sum of T(p), and passes the result to its parent. During the downward pass, each processor receives from its parent the sum over all processors with ids less than those in T(p) (y_L), computes the sum over all processors with ids less than those in its right subtree (y_R), and sends the appropriate values to its left and right children (y_L and y_R). The correctness of the program is easily established by induction on the depth of the tree, and it runs in 4 log p (all logarithms in this paper are base 2) time steps. Note that in any given time step, only two of the levels of the tree are active, implying that the algorithm can be pipelined level by level. By initiating a new prefix computation every second time step, it is possible to perform k Prefix operations on the inorder complete binary tree in 2k + 4 log p time steps.
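The two-pass dataflow can be checked with a short sequential simulation; the following Python sketch (an illustration only; Send and Receive are elided, and op and identity stand for ⊕ and 0_⊕) realizes the upward and downward passes as two recursive sweeps over the inorder tree.

# Sequential check of the two-pass tree dataflow (sketch, assuming an associative
# op with identity element identity).
def tree_prefix(op, identity, x):
    p = len(x)                      # p = 2**d - 1 processors, numbered inorder
    subtree = [identity] * p        # subtree[v]: sum of x over the subtree rooted at v
    y = [identity] * p              # y[v]: prefix sum output at processor v

    def up(lo, hi):
        # Upward pass: each processor computes the sum of its subtree and
        # (conceptually) sends it to its parent.
        if lo > hi:
            return identity
        v = (lo + hi) // 2                       # inorder root of the range [lo, hi]
        x_l, x_r = up(lo, v - 1), up(v + 1, hi)  # received from the two children
        subtree[v] = op(op(x_l, x[v]), x_r)
        return subtree[v]

    def down(lo, hi, y_l):
        # Downward pass: y_l is the sum over all processors with ids below lo,
        # received from the parent (identity at the root).
        if lo > hi:
            return
        v = (lo + hi) // 2
        x_l = subtree[(lo + v - 1) // 2] if v > lo else identity   # left subtree sum
        y_r = op(op(y_l, x_l), x[v])    # sum over all ids below the right subtree
        y[v] = y_r                      # this is also processor v's own prefix sum
        down(lo, v - 1, y_l)            # value sent to the left child
        down(v + 1, hi, y_r)            # value sent to the right child

    up(0, p - 1)
    down(0, p - 1, identity)
    return y

print(tree_prefix(lambda u, v: u + v, 0, [5, 2, 6, 4, 9, 1, 3]))   # [5, 7, 13, 17, 26, 27, 30]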
Hypercube
For the hypercube, the following FFT-like computation executes Prefix in log p time steps:
begin Prefix(⊕, x)
(1) y ← x;
(2) for i ← 0 to d - 1 do
(3)   Send(y, i);
(4)   if MyId_i = 0 then
(5)     y ← y ⊕ Receive(i);
(6)   else
(7)     temp ← Receive(i);
(8)     x ← temp ⊕ x;
(9)     y ← temp ⊕ y;
(10)  end if
(11) end for
(12) return(x);
end Prefix

The variable MyId holds the id of the processor, and MyId_i denotes the ith bit of the id (the least significant bit is bit 0). The source and destination arguments of Send and Receive specify the bit position in which the two communicating processors differ. The program runs in log p time steps, and functions in the following manner. In addition to the partial sums demanded by the Prefix operation, the total sum is computed at every processor. The local variables x and y accumulate the partial and total sums, respectively. For a hypercube consisting of a single processor, the computation is trivial. Given p = 2^d, d ≥ 1, processors with associated x_i values, the program first recursively computes partial and total sums for the upper and lower halves of the values independently, and then exchanges the total sums between halves. This enables the revised partial sums for the upper half to be computed, as well as the new total sums.

Unfortunately, the above program does not lead to a pipelined implementation of the Prefix operation because it uses all of the processors at every time step. To achieve pipelined speedup we can make use of the dilation 2 inorder complete binary tree embedding [5]. Figure 2 gives this embedding for the case p = 16, where the "extra" processor (with id p - 1) has been added as an extra level above the root. The edges depicted in Figure 2 are physical hypercube edges. The left child of a non-leaf processor is connected directly to its parent, while the right child is connected to its parent via the left child. It is easy to verify that the pipelined algorithm given for the inorder complete binary tree in Section 2.1 can be modified to run in the same time bound on the dilation 2 inorder complete binary tree embedding. In particular, note that processor p - 1 is in an appropriate location to receive the sum over all of the other processors. To summarize, we have shown that k Prefix operations can be performed in 2k + 4 log p time steps on the hypercube.
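Returning to the single-operation program at the start of this subsection, its dataflow can be checked with a short sequential simulation (a Python sketch for illustration; the loop over processors stands in for the parallel Send/Receive, and pre[i] and tot[i] play the roles of the local variables x and y at processor i).

# Sequential simulation of the FFT-like hypercube Prefix program (sketch only).
def hypercube_prefix(op, values):
    p = len(values)                 # p = 2**d processors
    d = p.bit_length() - 1
    pre, tot = list(values), list(values)
    for i in range(d):
        new_pre, new_tot = list(pre), list(tot)
        for me in range(p):
            received = tot[me ^ (1 << i)]      # y sent by the neighbor across dimension i
            if (me >> i) & 1 == 0:
                new_tot[me] = op(tot[me], received)
            else:
                new_pre[me] = op(received, pre[me])
                new_tot[me] = op(received, tot[me])
        pre, tot = new_pre, new_tot
    return pre                      # tot[i] also holds the total sum at every processor

print(hypercube_prefix(lambda u, v: u + v, [5, 2, 6, 4, 9, 1, 3, 7]))
# [5, 7, 13, 17, 26, 27, 30, 37]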
Shuffle Exchange
The hypercube code given in the preceding section for performing a single Prefix operation can be easily adapted to the shuffle exchange:
begin Prefix(⊕, x)
(1) y ← x;
(2) repeat d times
(3)   Send(y, Exchange);
(4)   if MyId_0 = 0 then
(5)     y ← y ⊕ Receive(Exchange);
(6)   else
(7)     temp ← Receive(Exchange);
(8)     x ← temp ⊕ x;
(9)     y ← temp ⊕ y;
(10)  end if
(11)  Send(x, Unshuffle);
(12)  x ← Receive(Shuffle);
(13)  Send(y, Unshuffle);
(14)  y ← Receive(Shuffle);
(15) end repeat
(16) return(x);
end Prefix
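As an illustration, the program above can be simulated sequentially as follows (a Python sketch; shuffle denotes the left cyclic rotation of a processor id, the loop over processors stands in for the parallel communication steps, and pre[i] and tot[i] play the roles of the local variables x and y).

# Sequential simulation of the shuffle exchange Prefix program for p = 2**d
# processors (sketch only).
def shuffle_exchange_prefix(op, values):
    p = len(values)
    d = p.bit_length() - 1
    shuffle = lambda i: ((i << 1) | (i >> (d - 1))) & (p - 1)   # left cyclic shift of d bits
    pre, tot = list(values), list(values)
    for _ in range(d):
        # Exchange step: every processor exchanges tot with its neighbor i ^ 1.
        new_pre, new_tot = list(pre), list(tot)
        for i in range(p):
            received = tot[i ^ 1]
            if i & 1 == 0:
                new_tot[i] = op(tot[i], received)
            else:
                new_pre[i] = op(received, pre[i])
                new_tot[i] = op(received, tot[i])
        # Shuffle step: processor i receives from its shuffle neighbor, so the
        # value held by processor shuffle(i) moves to processor i.
        pre = [new_pre[shuffle(i)] for i in range(p)]
        tot = [new_tot[shuffle(i)] for i in range(p)]
    return pre

print(shuffle_exchange_prefix(lambda u, v: u + v, [1, 2, 3, 4]))   # [1, 3, 6, 10]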
This shuffle exchange program runs in 3 log p time steps. As we saw for the hypercube, however, a different approach is needed in order to obtain a pipelined implementation of the Prefix operation. Unfortunately, it is not possible to embed the inorder complete binary tree in the shuffle exchange with constant dilation. Instead, we make use of the dilation 2 complete binary tree embeddings depicted, for the case p = 16, in Figures 3 and 4. The leaves of the tree in Figure 3 are the high-numbered processors (those with ids in the range p/2 to p - 1), numbered inorder. In this embedding, the id of the left child of an internal processor is the shuffle of the id of its parent, and siblings communicate via the exchange connection. The embedding of Figure 4 is defined in a similar fashion, and has the low-numbered processors (0 to p/2 - 1) at its leaves.
We can make use of these embeddings to obtain a pipelined implementation of k Prefix operations as follows. First, use the embedding of Figure 3 to compute the k sets of partial sums over the high-numbered processors. This takes 2k + 4 log p time steps. Similarly, the embedding of Figure 4 can be used to perform k prefix sums over the low-numbered processors in 2k + 4 log p time steps. At this point, all that remains to be done is to broadcast, in a pipelined fashion, the k total sums over the low-numbered processors to the p/2 high-numbered processors, and to add these values to the partial sums computed earlier. This last phase can be performed in 2k + 2 log p time steps using the embedding of Figure 4 (note that the desired sums are already available at the root), so k Prefix operations can be executed in 6k + 10 log p time steps on the perfect shuffle.
A Useful Variation
In Section 4 we will make use of a variant of the Prefix operation, Prefix', defined as follows. Rather than computing x_0 ⊕ ... ⊕ x_i at processor i, 0 ≤ i < p, Prefix' outputs 0_⊕ at processor 0 and x_0 ⊕ ... ⊕ x_{i-1} at processor i, 1 ≤ i < p. This is sometimes more convenient, particularly when the operator ⊕ is not invertible. Note that all of our implementations of Prefix may be trivially modified to implement Prefix' with precisely the same time bounds. For example, in the complete binary tree program of Section 2.1, it suffices to change the return value from y_R to y_L ⊕ x_L.
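Sequentially, Prefix' is simply an exclusive scan; a minimal Python sketch (illustrative only, with identity standing in for 0_⊕):

def prefix_exclusive(op, identity, xs):
    # Exclusive scan corresponding to Prefix'.
    ys, acc = [], identity
    for x in xs:
        ys.append(acc)
        acc = op(acc, x)
    return ys

print(prefix_exclusive(lambda u, v: u + v, 0, [5, 2, 6, 4, 9]))   # [0, 5, 7, 13, 17]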
Data Distribution
Consider the binary associative operator ⊕ defined over X by x ⊕ y = x, for all x, y ∈ X. This is sometimes referred to as the Copy operator. Observe that the effect of applying Prefix with the Copy operator is to perform a broadcast of a single value from processor 0 to all other processors. Of course, there are simpler techniques for broadcasting a single value over the processors of any of the networks we have considered. However, combining this observation with the results of the previous section immediately implies that k segmented broadcasts can be executed in 2k + 4 log p time steps on the tree or hypercube, and in 6k + 10 log p time steps on the perfect shuffle.

In order to fully illustrate the techniques discussed in Section 1, we now study the implementation of segmented Prefix with the Copy operation in greater detail. As stated in Section 1, processor i initially holds the boolean value a_i and x_i ∈ X, 0 ≤ i < p. Note that under the Copy operation the only relevant x_i's are those for which the corresponding a_i is true. Clearly, there is no identity element for the Copy operation in X. To remedy this situation, we extend the domain of Copy from X to B × X and define every pair with first component false, say, to be an identity element. In practice, this corresponds to prepending a single bit to each operand.

Note that the above formulation allows bit pipelining in the sense described by Blelloch [6]. In other words, as each bit of the two operands is received, the next output bit can be computed. This holds not only for the Copy operator, but also for any other single-pass operator, as defined in [6]. Finally, we observe that the data distribution operation defined by Ullman [16] is equivalent to a segmented Prefix operation with the Copy operator. Thus, the techniques outlined in this paper immediately lead to efficient pipelined implementations of this primitive for the complete inorder binary tree, hypercube and shuffle exchange network families.
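The segmented broadcast can be illustrated by running the segmented Prefix of Section 1 with the Copy operator on (flag, value) pairs; the following Python sketch (illustrative only, reusing the prefix and make_segmented helpers from the earlier examples) shows the extra boolean acting as the prepended bit.

# Segmented broadcast = segmented Prefix with the Copy operator (sketch, assuming
# make_segmented() and prefix() from the earlier examples).
copy_op = lambda u, v: u                      # Copy: x (+) y = x
seg_copy = make_segmented(copy_op)

flags  = [True, False, False, True, False]    # true marks the start of an interval
values = ['a', '-', '-', 'b', '-']            # only flagged positions carry data
out = prefix(seg_copy, list(zip(flags, values)))
print([x for (_, x) in out])                  # ['a', 'a', 'a', 'b', 'b']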
Sorting on a Pipelined Hypercube
In this section, we describe a simplified implementation of the optimal merging algorithm of Varman and Doshi [17], and show how this can be used to develop a pipelined version of the sorting algorithm of Nassimi and Sahni [10] for a pipelined model of the hypercube.
The Sort operation is defined as follows. Given n O(log p)-bit values, with ⌊n/p⌋ or ⌈n/p⌉ at any processor, rearrange the n values so that every value in processor i is less than or equal to every value in processor j whenever 0 ≤ i < j < p. In addition, we require that there be ⌊n/p⌋ or ⌈n/p⌉ values at any processor, and that the set of values within any particular processor be sorted. There has been a great deal of previous research related to the problem of sorting on the hypercube.

The time bounds for the merging and sorting algorithms described in this section do not apply to the 1-port model of computation that we have been considering up to this point. Instead, we will make use of a restricted form of the less realistic d-port model, in which a processor can send and/or receive a packet from each of its log p neighbors in a single time step. This model, which we refer to as the pipelined hypercube model, was originally defined by Varman and Doshi [17], and we refer the reader to their paper for both the strict definition as well as the hardware implementation details. We only need the pipelined model of the hypercube for performing pipelined inverse concentration routes. It is interesting to note that we do not require pipelined concentration routes, nor do we require the pipelined inverse concentration with copy operation of Varman and Doshi.

Concentration and inverse concentration routes were defined by Nassimi and Sahni [10], and it is easy to show that k such operations can be performed in k + log p time steps on the pipelined hypercube model. Furthermore, there is no hope of achieving this asymptotic time bound on the 1-port model, since there is a lower bound of Ω(k log^{1/2} p) time steps in this case. To prove this lower bound, consider a set of k monotone routes for which the source processors are exactly those with strictly more 0's than 1's in their ids and the destination processors are those with more 1's than 0's. In such a case, Ω(kp) packets must pass through the O(p log^{-1/2} p) processors with an equal number of 0's and 1's (or one more 0 than 1, say, if log p is odd), which implies a lower bound of Ω(k log^{1/2} p) time steps for performing k monotone routes. Since a monotone route is equivalent to a concentration route followed by an inverse concentration, and these operations have equal complexity, this lower bound applies to the pipelined concentration and inverse concentration operations as well.
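The counting behind this lower bound can be checked numerically: the fraction of processors whose ids contain equally many 0's and 1's is the middle binomial coefficient divided by 2^{log p}, which shrinks like log^{-1/2} p. A quick numerical illustration (a Python sketch, included only to make the estimate concrete):

# Fraction of hypercube processors with equally many 0's and 1's in their ids
# (d = log p); the last column is approximately constant (about 0.8), confirming
# the Theta(1/sqrt(d)) behavior.
from math import comb, sqrt

for d in (10, 20, 40, 80):
    middle = comb(d, d // 2)
    print(d, middle / 2**d, middle / 2**d * sqrt(d))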
We now describe a pipelined algorithm for merging two sorted lists X and Y, each of length pk, on p processors. The algorithm is similar to that proposed by Varman and Doshi [17]. Divide X into p sets X_0, ..., X_{p-1} of k consecutive elements each, and let x_i denote the largest element of X_i, 0 ≤ i < p; define the sets Y_i and the elements y_i analogously. Let X' = {x_0, ..., x_{p-1}} and Y' = {y_0, ..., y_{p-1}}, let Z' denote the sorted merge of X' and Y', with elements z_0 ≤ ... ≤ z_{2p-1}, and let Z'_m denote the input set (X_i or Y_i) whose largest element is z_m. The output list Z is partitioned into sets Z_0, ..., Z_{p-1} of 2k consecutive elements each, with Z_j to be produced at processor j. The key observation concerns the set X_i, where x_i is the largest element of X' that is less than z_j: it is easy to check that the set Z_j is contained in the union of Z'_{2j}, Z'_{2j+1}, the set X_i corresponding to the largest x_i that is less than z_{2j}, and the set Y_i corresponding to the smallest y_i that is greater than z_{2j+1}. These observations lead to the following pipelined merging algorithm.

1. Route the x_i's and y_i's so that they form a bitonic sequence over the 2p processors simulated in step 2. This takes log p time steps.
2. Compute Z' by simulating a bitonic merge over 2p processors. Record the data movements to facilitate the "unmerge" of step 3. This takes 2 log p time steps.
3. Route the rank of each value in Z' back to the processor which originally held that value. This can be done in 2 log p time steps by following the paths recorded in step 2 in the reverse direction.
4. Route each set X_i to the processor that held x_i after step 2, 0 ≤ i < p. The id of that processor can be computed from the rank received by processor i in step 3. The routing can be performed in 2k + 2 log p time steps using a pipelined inverse concentration. Route the Y_i's in a similar fashion, for a total cost of 4k + 4 log p time steps.
5. Assuming the set X_i was routed to processor j_i in the previous step, broadcast X_i to all processors with ids in the range j_i + 1 to j_{i+1}, 0 ≤ i < p. This can be done in 2k + 4 log p time steps with a single application of the Prefix' operation, as described in Section 2.
6. Assuming the set Y_i was routed to processor j_i in the previous step, broadcast Y_i to all processors with ids in the range j_{i-1} to j_i - 1, 0 ≤ i < p. This can be done with a single application of a "backwards" version of Prefix', and takes 2k + 4 log p time steps.
7. At this point, processor j contains a copy of Z'_{2j}, Z'_{2j+1}, the largest X_i with x_i < z_{2j} and the smallest Y_i with y_i > z_{2j+1}, 0 ≤ j < p. As observed above, the union of these sets contains the desired set Z_j, and the values to be discarded (i.e., those not belonging to Z_j) can be determined by computing the exact rank of either z_{2j} or z_{2j+1}. These sets can be merged, and the rank computation performed, with O(k) local operations. Our definition of a time step allows these local operations to be interleaved with the computations of steps 5 and 6 at no extra cost.

Note that only step 4 uses the power of the pipelined model. The total running time of Merge is 8k + 17 log p time steps. Now consider the case in which 2p processors are available to perform the merge, where we assume that X_i is initially stored at processor i, Y_i is initially stored at processor 2p - i - 1, and Z_j is to be output at processor j, 0 ≤ i < p, 0 ≤ j < 2p.
In this case, step 1 is unnecessary, and the cost of each of the steps 2, 3 and 4 is halved, while the cost of the remaining steps is unchanged. Thus, the total cost of Merge with 2p processors is 6k + 12 log p time steps. Note that for k = Ω(log p), this running time is within a constant factor of optimal. Furthermore, as observed by Varman and Doshi, this optimal merging routine immediately implies an optimal algorithm for sorting when the number of values to be sorted, n, exceeds the number of processors, p, by a factor k that is Ω(log p). The idea is to sort the set of k values at each processor locally, and then to merge sorted subcubes repeatedly until the entire hypercube has been sorted. At each level, even subcubes are sorted in ascending order and odd subcubes are sorted in descending order. The running time of this algorithm, which we refer to as MergeSort, is Σ_{0 ≤ i < log p} (6k + 12i) = 6k log p + O(log^2 p).
As mentioned above, this running time is optimal for k = Ω(log p). We now describe a pipelined version of the multi-way merging procedure of Nassimi and Sahni [10] that runs on the pipelined hypercube. The input consists of 2^l sorted lists of length k2^m, and the output is a single sorted list of length k2^{l+m}. The merging is performed in O(k + log p) time steps on a hypercube with p = 2^{2l+m} processors. Let the ith input list be denoted X_i, 0 ≤ i < 2^l, and let the set of k elements of X_i with ranks between jk and (j+1)k - 1 (inclusive) be denoted X_i^j, 0 ≤ j < 2^m. The set X_i^j is initially stored at processor i2^m + j. Let the output list be denoted X. At the end of the merging process, the elements of X with ranks between jk and (j+1)k - 1 (inclusive) should be stored at processor j, 0 ≤ j < 2^{l+m}.

It is useful to view the processors of the given hypercube as forming a 2^l by 2^{l+m} array, where the processor in row i and column j has id i2^{l+m} + j (row-major order). Note that all of the X_i^j's are stored in row 0. In fact, each processor in row 0 contains exactly one set X_i^j.

Our algorithm makes use of pipelined broadcast and sum operations over entire subcubes. Formally, a pipelined broadcast operation takes k values stored at a single processor and broadcasts them over the entire subcube. For a pipelined sum operation, processor i initially holds k values a_{ij}, 0 ≤ i < p, 0 ≤ j < k. The output is the k sums Σ_{0 ≤ i < p} a_{ij}, 0 ≤ j < k, all of which are output at a single designated processor. Although such operations can be performed using Prefix, other implementations exist which are more efficient by a constant factor. For example, using the multiple spanning binomial tree (MSBT) embedding of Ho and Johnsson [8] it is possible to perform k broadcasts in k + log p time steps. Similarly, k sums can be performed in k + log p time steps. Note that although these operations are pipelined, they run on the 1-port model and thus do not require the additional power of the pipelined model.

Algorithm MultiWayMerge
1. Broadcast X_i^j to all of the processors in column i2^m + j, 0 ≤ i < 2^l, 0 ≤ j < 2^m. Each of the columns is an independent subcube of dimension l. Thus, the broadcasts can be performed in k + l time steps using an MSBT embedding within each column.
2. Replicate list X_i across the ith row, 0 ≤ i < 2^l. In other words, route a copy of X_i^j to each column of the ith row that is congruent to j mod 2^m. This amounts to performing pipelined broadcasts over subcubes of dimension l, which can be done in k + l time steps using the MSBT embedding.
3. Merge the lists X_i and X_j using the jth block of 2^m processors of row i (i.e., columns j2^m to (j+1)2^m - 1), 0 ≤ i, j < 2^l, i ≠ j. This takes 8k + 17m time steps.
4. In the jth block of 2^m processors of row i, "unmerge" the rank of each element of X_i in X_j (this is the rank of that value in X_i ∪ X_j minus its rank in X_i), 0 ≤ i, j < 2^l, i ≠ j.
In other words, route the rank of each value back to the processor that contained the value before step 3. This is a pipelined inverse concentration, and can be performed in k + m time steps. Where i = j, simply label each value with its rank in X_i.
5. Compute the rank of every value in X. The processors of row i are used to perform this computation for the elements of the set X_i, 0 ≤ i < 2^l. For each set X_i^j, we perform a pipelined sum over a subcube of dimension l, adding the ranks computed in step 4 and routing the results to the first block of 2^m processors in each row. This takes k + l time steps using the MSBT embedding.
6. In row i, route the elements of X_i to the correct output column (given by the floor of the rank computed in step 5 divided by k), 0 ≤ i < 2^l. This is a pipelined inverse concentration in a subcube of dimension l + m, and takes k + l + m time steps.
7. Each column of the array now contains k values. Route these values to the top of the column (row 0). In terms of data paths, this is essentially an inverse pipelined broadcast operation over a subcube of dimension l, and it can be performed in k + l time steps using the MSBT embedding.

Only steps 3, 4 and 6 require the power of the pipelined model. Summing all of the costs stated above, the total running time of MultiWayMerge is readily seen to be 14k + 5l + 19m time steps. By repeatedly applying MultiWayMerge on successively larger subcubes, we can obtain a fast sorting algorithm for the case n < p log p. The running time of this algorithm, which we refer to as MultiWayMergeSort, will be shown to be O(log^2 p / log((p log p)/n)), as opposed to O(log^2 p / log(p/n)) for the sorting algorithm of Nassimi and Sahni. For the interesting case n = p, the running time of MultiWayMergeSort is O(log^2 p / log log p), a slight asymptotic improvement over that of Batcher's bitonic sort. It must be emphasized, however, that MultiWayMergeSort only runs on the pipelined model of the hypercube.

We now give a more formal description of the MultiWayMergeSort algorithm, and analyze its time complexity. The algorithm is designed to sort n = k2^m values on a hypercube with p = 2^{l+m} processors. It is useful to view the processors as being arranged in a 2^l by 2^m array, where the processor in row i and column j has id i2^m + j (row-major order).
Algorithm MultiWayMergeSort

3. Repeatedly apply MultiWayMerge, merging groups of 2^l sorted lists at each iteration, until a single sorted list remains. The cost of the ith iteration is 14k + 5l + 19il time steps, for a total cost of approximately (14k + 4l + 12m)m/l time steps.
4. The values have been sorted, but they are not configured appropriately (i.e., all of the values are in row 0). All of the values can be routed to the correct output locations using k pipelined inverse concentration routes, which takes k + log p time steps.

Steps 3 and 4 make use of the power of the pipelined model. The total running time of MultiWayMergeSort is minimized (to within a constant factor) by setting k = log p, and for this choice of k the running time is dominated by the cost of step 3. Observing that l = log(pk/n) and m = log p - l ≤ log p, we find that for k = log p the algorithm runs in (47/2) log^2 p / log((p log p)/n) + O(log p) time steps. For the case n = p, we can set k = log p / log log p and reduce the dominant term in the running time to (19/2) log^2 p / log log p, at the expense of increasing the error term to O((log p / log log p)^2).
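The ranking computation at the heart of MultiWayMerge can also be illustrated sequentially: an element's rank in the output list is its rank within its own list plus the sum of its ranks in every other list. A Python sketch (illustrative only, assuming distinct values; the parallel algorithm computes the cross ranks with the merges and sums of steps 3-5):

# Sequential illustration of rank-based multi-way merging (assumes all values
# across the input lists are distinct).
from bisect import bisect_left

def multiway_merge_by_ranks(lists):
    n = sum(len(lst) for lst in lists)
    out = [None] * n
    for i, lst in enumerate(lists):
        for r, v in enumerate(lst):                     # r = rank of v within its own list
            cross = sum(bisect_left(other, v)           # rank of v in every other list
                        for j, other in enumerate(lists) if j != i)
            out[r + cross] = v
    return out

print(multiway_merge_by_ranks([[1, 5, 9], [2, 6, 7], [3, 4, 8]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]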
Concluding Remarks
In this paper, we have presented simple and efficient pipelined implementations of the Prefix operation on the complete inorder binary tree, hypercube and shuffle exchange families of networks. This led immediately to an elegant pipelined implementation of Ullman's data distribution primitive. A variant of the Prefix operation was used to obtain a simplified implementation of Varman and Doshi's optimal merging algorithm for the pipelined model of the hypercube. In order to better assess the practical speed of the various algorithms presented in this paper, we have computed the coefficient on the leading term of the running time in each case. It is quite possible that one or more of the moderately large coefficients in Section 4 could be improved with only minor modifications to the code.

It should be mentioned that for permutation routing, an important special case of the sorting problem, there is a much simpler O(log^2 p / log log p) time algorithm for the case n = p than MultiWayMergeSort [11]. The idea is to route packets in a greedy fashion over sets of log log p dimensions at a time. Each set of routings produces a load balancing problem in which there may be as many as log p packets at any one processor, and the objective is to redistribute the packets so that there is exactly one at each processor. It is a worthwhile exercise to show how this redistribution can be performed in O(log p) time on the pipelined hypercube by making use of the pipelined prefix, broadcast and concentration operations discussed in this paper.
