Proper distribution of operations among parallel processors in a large scienti c computation executed on a distributed-memory machine can signi cantly reduce the total computation time. In this paper we propose an operation, called simultaneous parallel reduction (SPR), that is amenable to such optimization. SPR performs reduction operations in parallel, each operation reducing a one-dimensional consecutive section of a distributed array. Each element of the distributed array is used as an operand to many reductions executed concurrently over the overlapping array's sections. SPR is distinct from a more commonly considered parallel reduction which concurrently evaluates a single reduction. In this paper we consider SPR on Single Instruction Multiple Data (SIMD) machines with di erent interconnection networks. We focus on SPR over sections whose size is not a power of 2 with the result shifted relative to the arguments. Several algorithms achieving some of the lower bounds on SPR complexity are presented under various assumptions about the properties of the binary operator of the reduction and of the communication cost of the target architectures.
Introduction
Optimal placement of data and operations in large scienti c computations executed on a distributed-memory machine can signi cantly reduce the total computation time. In this paper we consider the impact of operation placement on the e ciency of computations that arise in theoretical population biology 1]. The computation that motivated the study of simultaneous reduction algorithms models an isolated ecosystem by simulating the dynamics of interspecies interactions. The model represents the ecosystem as a cellular automata whose cells correspond to the habitat sites that are small enough to accommodate at most one organism. Each cell can be in one of q states in a state set S = fs 1 ; s 2 : : : ; s q g. The probability of a site's state transition from state s i to state s j , P s i ! s j ] is a function, f si;sj , of the states of its neighboring sites. For simplicity, the relevant neighborhood is de ned as a rectangle with the a ected site located inside it (but not necessarily at the center). This rectangle's size and the position of the a ected site within the rectangle is de ned by the ability of the species to propagate. Hence, these parameters depend on the physical characteristics of the species involved, the terrain of the habitat, and the existing air or water convection.
Let n 1 ; n 2 stand for the lengths of the sides of the rectangle and p 1 ; p 2 The expression on the right is computed q times for each site in the habitat. Hence, its evaluation dominates the total execution time 2]. The above computation is an example of a reduction computed simultaneously over many overlapping contiguous sections of an array. Other applications of such operations arise in computational biology (e.g., cluster recognition, fractal dimension computation 2]) and in physics (e.g., solvers for partial di erential equations characterizing uid ow).
Parallel Reduction is de ned as the application of an associative binary operator across a list of elements 3]. Given an array, A, of size n and the binary operator, , parallel reduction obtains the scalar result, n i=1 A i], as follows: 
On a ne-grain distributed-memory architecture, the array A will often be distributed over N processors (i.e., A i] will be stored on processor p i], for 1 i N).
In such a case, also the result res i] is stored on processor p i]. Equation (1) de nes N parallel reductions that are evaluated simultaneously. In this paper, we assume that the array sections are of equal size, n, and that the results of reduction operations are placed at an arbitrary (but the same for all reductions) o set r within each section. Reduction operations are frequently used in parallel algorithms 4] and binary operators in such reductions are required to be associative. Some examples are addition, multiplication, minimum, maximum, set union, and set intersection. Algorithms for many variations of parallel reduction have been studied extensively in the literature; however, the authors of this paper are not aware of any previous work on simultaneous parallel reduction that allows overlapping of the regions being reduced. A balanced binary tree implementation of parallel reduction on mesh-connected architectures was presented in 5]. Another standard parallel reduction algorithm was introduced in 6] for tree topologies of arbitrary but bounded fan-in and for arbitrary tree depth. The segmented pre x problem is a variant of parallel reduction that subdivides a single dimension of processors into non-overlapping contiguous regions of varying size. A multiple pre x algorithm that reduces non-contiguous, non-overlapping regions simultaneously was presented in 4] .
In this paper, we consider algorithms for SPR executed on an SIMD parallel computer. In the following, we assume that the binary operator used in reduction, , is associative. The presented algorithms were designed in such a way that the order of the binary operator is preserved, hence the operator may not be commutative. One algorithm presented in section 3.2 can be sped up if the operator has an inverse, denoted by . This condition is satis ed by such common operators as addition and multiplication (but not by minimum or maximum). We present here the simultaneous parallel reduction for the one-dimensional array sections (i.e., onedimensional, continuous subarrays). The more general m-dimensional sections can be reduced by applying the one-dimensional section algorithm along each dimension separately. The presented algorithms are parameterized by the size of the array sections being reduced, n, and the position (o set) of the nal result, r; 0 r < n, measured from the leftmost element of the reduced array section.
Complexity Metrics Used
SIMD parallel architectures vary widely in methods and expenses of interprocessor communication. Performance of two distinct algorithms with the same traditional complexity measures (which reveal their asymptotic behavior) may perform quite di erently on a particular architecture. To help determine the best algorithm for a given architecture, we provide the following complexity metrics, useful in expressing the traditional time complexity for di erent architectures for the presented algorithms. Hop count: c ch (i), the distance (i.e., the number of processors traversed on the route to the destination) that each message travels in the i-th parallel communication step. We will also use the total number of hops, c ch = P ccm i=1 c ch (i).
Message size: c ms , the number of words (i.e., single precision or integer numbers) sent in each message.
Depending on the considered architecture, the time complexity of an algorithm can be expressed as: Since SPR performs many parallel reductions over the array sections of size n, the lower bounds for the cost metrics are the same as for parallel reduction. For the ne-grain solution, i.e., when the number of processors is equal to N, the number of elements in array A, and each processor stores one element of A, these bounds are: dlog 2 ne for the operation and message counts, n ? 1 for the hop count, and 1 for the message size.
Algorithms for Simultaneous Parallel Reduction
In the following, we describe algorithms for which none of the counts exceeds the double of the corresponding lower bound. The presented algorithms are parameterized by the size of the array sections being reduced, n, and the position (o set) of the nal result, r; 0 r < n, measured from the leftmost element of the reduced array section.
Recursive Combination Algorithm
In this section, we assume that the linear array of N processors initially stores a distributed array A which we want to overwrite with the resultant array res. the most (the least, respectively) signi cant non-zero bit of the argument q. z(q) denotes the number of nonzero bits in q. shift(val; dist) shifts the argument val across dist processors to the left (when dist < 0) or to the right (when dist > 0) from its original position (so val from processor p i] moves to processor p i + dist]).
Shift is done for all processors at the same time. For simplicity, we assume that the processor array is wrapped, i.e., processor p N] is followed by processor p 1] b .
The standard parallel reduction algorithm is shown in Figure 1 . It is assumed that the same program is executed on each processor and that array A is distributed over the processors in such a way that each processor stores a single element. An inspection of this algorithm reveals the following observations.
Simultaneous execution of this algorithm for overlapping sections creates subcomputations shared by many sections. Each subcomputation produces a reduction of two subsections.
Inside the for-loop the size of the left subsection being reduced is a power of two. The right subsection can be of any size from one to the size of the left subsection (this size is de ned by the position of the executing processor in the reduced section).
The nal reduction is performed only by the rst processor. The left subsection reduced in this step is of size n ? 2 dlog 2 (n?1)e whereas the right subsection is of size 2 dlog 2 (n?1)e . In each SPR step, every processor must hold values that represent reductions of left subsections for some subcomputation and right subsections for the others. Consequently, the algorithm in Figure 1 must execute two reductions in each step c , thus, doubling the computation cost. If communicated data move only to the left, the result is placed at o set 0. Let the binary representation of o set r be (r log(n) : : : ; ::r 1 ; r 0 ) 2 . By shifting communicated data to the right at each step k for which r k is equal to one, the result of SPR will be placed at o set r.
These observations led us to the design of the algorithm presented in Figure 2 .
To simplify the description, only the case of h(n ? In the MPL language of the MasPar computer, shif t can be readily implemented using xnetW and xnetE statements. The exception is the case when n is a power of 2, which produces the left and right subsections of equal length. 
Su x-Pre x Algorithm
For this algorithm, the entire range N of array A is divided into N=n disjoint sections, each of size n. The algorithm consists of two stages:
1. executing the parallel pre x algorithm on disjoint sections of the input array A to obtain the result for each disjoin section in one location, and partial results for overlapping sections in the others, and 2. applying operator to partial results obtained in the previous step.
The algorithm uses two auxiliary arrays, a pre x array, P 1::N], and a su x array, S 1::N], de ned as follows: for k-th disjoint section of array A extending from kn ? n + 1 to kn, the arrays P; S are: If the inverse operator exists, then array P can be obtained in one step:
P i] = S kn ? n + 1] S i + 1] for kn ? n < i < kn The value of S kn ?n+1] can be broadcast to section k in dlog 2 ne communication shifts; hence, the cost metrics of this version of the algorithm are: c o = dlog 2 ne + 2 and c cm = 2dlog 2 ne+1. If hops are costly, the value of S kn?n+1] can be broadcast in n?1 sequential shifts for a unit distance. The result can be sent to proper position in one shift of at most n=2 distance, so c ch n ? 1 + n ? 1 + n ? 1 + n=2 < 7n=2.
When the inverse operator does not exist (e.g., the operator is maximum or minimum), array P can be obtained by another application of the parallel pre x algorithm over n ?1 elements and in the opposite direction than the one producing S, as shown in Figure 3 . Therefore, the complexity of the algorithm in this case is c o = c cm < 2dlog 2 ne + 1, and c ch < 7n=2. However, during the execution of the parallel pre x algorithm, some processors are idle and some partial results are already pre xes for some elements. Hence, the PRAM implementation can take advantage of this observation to produce both P and S in just dlog 2 ne steps. e Right-end placement of the result can be achieved by switching P and S in their roles, i.e., by using the expression P i] S i ? n + 1]. Hence, the placement of the result at the given o set can always be achieved by a single shift for the distance of at most n=2. Shown in Figure 4 is the parallel pre x algorithm for the creation of array S that also generates all the pre xes needed for array P (pre xes needed for P are enclosed in the bold boxes). It can be shown that all pre xes of the form (1::2 i ) are byproducts of the su x generation on processor 1. In step i > 1, there are 2 i?1 processors idle, numbered from n ? 2 i?1 + 1 to n. They can be used to evaluate all pre xes from (1::2 i?1 + 1) to (1::2 i ? 1) as shown in Figure 4 . The same processing can be used in each disjoint section.
Scalable case. In this case, we assume that the number of processors P is not greater than N=n; hence, each disjoint section has at most one processor allocated to itself. For simplicity, we consider the case when k = N=(nP) is an integer. If a section is assigned to a single processor, then arrays P and S can be computed in 2n ?2 operations without involving any communication. Each processor just needs to communicate the single boundary section, so the complexity of the algorithm is: c o = 3k(n ? 1) 3N=P ; c cm = 1; c ch = 1; c ms = n. The work (without communication) done by all processors is nearly constant and is about 3N. Only when the inverse operation exists, can a single processor execute a smaller number of steps than the work of parallel processors f . When the communication cost is included, the work becomes dependent on the number of processors and is about 3N + P latency. Similar analysis for cases when there are 1 < p < n processors assigned to each section shows that the computational work stays the same and the communication work is proportional to p.
In Pursuit of the Asymptotic Minimum
An interesting open problem, that is equivalent to a more than hundred years f With the inverse operation, a single processor can evaluate all pre xes P 1::N] over the entire range of data N and then nd the result for i > n as res i] = P i] P i ?n], hence executing only 2N ?n operations. However, when the reductions perform arithmetic operations on oating-point numbers, such an algorithm will introduce large rounding errors for larger values of N which makes it impractical.
old problem 9], can be de ned as follows:
Given the interval of an arbitrary size n, nd the minimum operation count for simultaneous parallel reduction. In this section, we brie y discuss two algorithms, decomposition and factorization, that approach this minimum. A more detailed description of these algorithms is given in 8].
The decomposition algorithm is based on factoring out bit patterns in the binary representation of the interval length n. Let 
The decomposition algorithm combines in parallel all quotients and then all patterns using the recursive combination algorithm from Section 3.1. The nal step is to combine obtained products. Putting together all the needed operations yields:
d o (n) < log 2 n + log 2 n 2 q + 2 q + q(1 + log 2 3)
Let's consider, for example, the interval size n = 15, i.e., in binary notation, n = (1111) 2 , so h(n) = 3; z(n) = 4. The recursive combination algorithm presented in Section 3.1 will require h(n) + z(n) ? 1 = 6 parallel operation steps, three steps to build recursively 2 3 subsections and another three to combine partial results into the nal result. However, there is one pattern of length two (repeated two times)
in the binary representation of n, so n = (1) 2 (11) 2 (0101) 2 . Two steps are needed to create the rst pattern and another three for the second pattern (to combine recursively subtrees of size three into the result). Altogether, just ve steps are required by the decomposition algorithm. The di erent evaluation subtrees of the two algorithms discussed above are shown in Figure 5 .
We have shown in 8] that to minimize d o (n) for the given n it is necessary to select q log 2 log 2 n 2 , and the corresponding number of operations is about log 2 n + Table 2 .
A more e cient decomposition scheme is used in the following factorization algorithm. Let c(n) be the message and operation count of this algorithm and s(n) the largest subtree of the size of the power of two constructed during evaluation.
For each interval size n, this algorithm selects the most e cient method of obtaining the result from the following three g : n dlog 2 ne max k n c (k)   k  8  3  4  7  64  6  8  43  512  9  13  479  4096  12  16  2399  32768  15  20  21407   Table 1 : Operation and Message Counts for the Algorithm Based on Factoring.
of n = 15, the factorization algorithm will take advantage of the factorization n = 3 5 and use only ve steps (the evaluation subtrees are the same as for the decomposition algorithm, as shown in Figure 5 ). The factorization algorithm is faster than the decomposition for many n's. Consider, for example, n = 47. The factorization algorithm will recognize that n = 2 + 5 9, yielding an eight-step algorithm. The decomposition algorithm will extract the following patterns n = (1) 2 (11) 2 (101) 2 + (100000) 2 (01) 2 (0001) 2 ; hence, it will require ve steps to obtain one non-trivial quotient and another ve to built non-trivial patterns; altogether ten steps. The factorization algorithm performs better because the decomposition algorithm is restricted to patterns (factors) that have length equal to some power of two. For n = 47, even the recursive combination will do better than decomposition because it will use (100000) 2 + (1000) 2 + (100) 2 + (10) 2 + 1 as terms in the additive combination, yielding nine steps. The ability of the factorization algorithm to use 45 = 5 9 factorization of a subterm results in a more e cient solution.
For a given n, we have designed an algorithm of complexity O(n 2 ) that determines the value of c(n) and generates the program for evaluation of SPR. The maximum values of c(n) yielded by this algorithm for n 2 15 are shown in Table  1 . The authors are not aware of any tighter upper bound for the value of c(n). Table 2 summarizes the total complexity of the presented algorithms for various architectures. All logarithms are of base 2 and UCL stands for Uniform Communication Latency. For simplicity, whenever relevant, communication latency is assumed to be 1 and the hop cost is equal to the number of data sent in the message. For each of the presented algorithms there is an architecture on which this algorithm is optimal. Complexity of such optimal algorithm is printed in bold in Table 2 .
Conclusion
The factorization algorithm is e cient only when the parameters are known at compile time, because determining factorization is computationally expensive. With this exception, the presented algorithms could be e ciently implemented as library routines. Some applications use SPR with parameters known at compile time. In such cases, evaluation of some conditional statements at run-time can be avoided. This was the case for the ecological simulation discussed in Section 1. To avoid any conditional operations based on the interval size and the o set during run-time, we have designed a simple code generator, which inputs the given interval size and the o set, and produces object code that implements the proper algorithm Algorithm Architecture Recursive Pre x-Su x Decomposition Combination with without PRAM 2 log n log n logn log n + 2 p log n SIMD: UCL 3 log n 3 log n 4 log n 2logn + 1:78(log n) 0:72 Hypercube log 2 n log 2 n=2 log 2 n 3 log 2 n Mesh n 7n=2 7n=2 5n=3 Table 2 : Complexity of the Algorithms on Di erent Architectures.
for these parameters. Future research on improvements of the decomposition algorithm is needed to provide tighter bounds on the minimum operation count for SPR.
