This paper presents a general data-parallel formulation for a class of problems based on the divide and conquer strategy. A combination of three techniques-mapping vectors, index-digit permutations and space-filling curves-are used to reorganize the algorithmic dataflow, providing great flexibility to efficiently exploit data locality and to reduce and optimize communications. In addition, these techniques allow the easy translation of the reorganized dataflows into HPF (High Performance Fortran) constructs. Finally, experimental results on the Cray T3E validate our method.
INTRODUCTION
The design of compilers for parallel machines that generate (semi-)automatically parallel programs with acceptable performance is a research area of increasing interest. In order to facilitate the analysis done by the compiler, parallelization environments based on language extensions have been proposed. High Performance Fortran (HPF) [1, 2, 3] is one of the most significant cases, based on the data-parallel paradigm. HPF permits the programmer to specify data distributions, parallel sections, communications/synchronization optimizations and so on. However, writing efficient HPF programs is not necessarily a trivial task. Indeed, the high-level nature of the language often leads to inefficient parallel code. Additionally, the suitability of HPF to obtain efficient parallel codes for a large variety of complex applications has not been sufficiently proven. The programmer needs a good understanding of both the application and the HPF execution model in order to map data efficiently on the processors' memory and to organize communications.
In order to help the programmer in this difficult task, we have developed a data-parallel framework that permits the description in a uniform, methodical and precise way, of an important class of complex problems, based on the divide and conquer (DC) method. Under this framework, the problem may be adapted to the data-parallel programming paradigm, in such a way that it can be easily translated into HPF.
Our data-parallel formulation of a DC problem requires four phases.
Linearization. The programming model implies the use of linear arrays to represent the data. As our aim is to deal with DC problems, typically the data is already structured either as linear arrays or as trees. In the second case, we may use space-filling curves [4] , such as, for instance, Peano-Hilbert (PH) curves.
We have used this formulation previously for mesh connected multicomputers, but using the message-passing programming model [7, 8] . In this paper, however, we adapt our parallelization framework to the data-parallel paradigm, and present it in a more detailed and general way (for instance, a performance model is developed), including a guideline for applying it to a particular DC problem.
Computation and communication operators are designed in such a way that they can be easily translated into HPF constructs, without needing to add new features to the language and compiler. In particular, computations are expressed as extrinsic (local) procedures, while communications are basically expressed as array sectioning operations. As a result, the formulated problem can be easily translated into an equivalent HPF program. The main goal of our data-parallel formulation is to reach a trade-off parallel solution considering a general formulation of the problem and overall performance.
In this paper we will first consider regular DC dataflows, which we classify into two groups, successive doubling and trees. As examples to validate our method, we will take the self-sorting fast Fourier transform and some tridiagonal system solvers. Afterwards, an irregular DC dataflow (the Barnes-Hut method for solving N-body problems) will be taken into consideration. Octtrees resulting from this method are linearized by a space-filling curve, permitting in this way the use of similar techniques to those used for the regular case. HPF formulations obtained from these algorithms have been implemented on a Cray T3E multiprocessor. In these experiments, we show that the cost of data redistributions among different processors for irregular dataflows is comparable to that for regular dataflows.
The organization of the rest of the paper is as follows. In Section 2 we present our data-parallel formulation in terms of the index-digit operators and the mapping vector technique. Section 3 introduces the parallelization guideline and the performance model. DC problems are also introduced and analysed in the context of the data-parallel formulation. Section 4 evaluates experimentally the selected problems and discusses the results obtained. Finally, some related work is discussed in Section 5 and in Section 6 the paper is concluded.
MAPPING VECTORS AND INDEX-DIGIT OPERATORS
Let us define the index-digit representation [6] of a data sequence of size N = r n , where r is some integer greater than 1 (called base or radix). In this representation, each data item of the sequence is denoted by the base r decomposition of its index. That is, data item A(t) with index t = t n r n−1 + . . . + t 2 r + t 1 is written as [t n . . . t 2 t 1 ]. Consider now that our parallel machine is a distributed memory multiprocessor. Without loss of generality, we can assume that the processors are organized as a linear array, Q = r q . After distributing the data sequence A among the processors, each data item A(t) may be characterized by a pair of two coordinates (distribution representation), the first one representing the processor assigned (processor coordinate) and the second one representing the local memory address (memory coordinate) where it is stored. Considering perfect load balancing, each processor will be assigned a total of U = N/Q = r u data items (u = n − q). If a different processor topology is considered, the processor coordinate could be subdivided into other coordinates. As an example, for a mesh topology, the (row, column) coordinates should be adequate.
Mapping vectors
A mapping vector is defined as a one-to-one assignment between both representations, the index-digit and the distribution. That is, it specifies which of the t i (1 ≤ i ≤ n) digits correspond to the processor coordinate (a total number of q digits) and which ones correspond to the memory coordinate.
As an example, if we assign the q most significant digits of the index-digit representation to the processor coordinate and, therefore, the remaining n − q digits to the memory coordinate, then the mapping vector corresponds to a block distribution of the data sequence. On the other hand, if the q least significant digits are assigned to the processor coordinate, a cyclic distribution results. In general, a block-cyclic distribution, with a block size of r k (0 ≤ k ≤ u, where u = n − q), is specified by assigning the set of digits [t q+k . . . t k+1 ] to the processor coordinate, and the remaining set [t n . . . t q+k+1 , t k . . . t 1 ] to the memory coordinate. Note that these mapping vector assignments have a direct translation into HPF data distributions.
Index-digit operators
Index-digit operators are used to specify the problem dataflow. Thus we will have two types of operators, corresponding to computations and data rearrangements (including communications). Operators of the first type transform the data sequence into another of the same size. They represent in-place arithmetic operations on the data items, and hence do not modify the data coordinates. Operators of the second type represent data reordering, which can be local or imply data movement among processors (i.e. communications). These operators modify the data coordinates, and are defined as permutations of the data item index-digit representation.
Computational operators
Now let us define now some basic computational operators. The + and − operations carried out on the index-digit representation are understood as follows. Given the indexdigit representations a = [a n . . .
Examples of butterfly operators are graphically depicted in Figure 2 , where N = 8, r = 2 and i = 1. The three operators compute four butterflies. Each butterfly in operator B 1 , B 1 and B 1 reads two, three and four data items, respectively. For the case r = 3, operators B 1 and B 2 are depicted at the top of Figure 3 .
Let us analyse the behaviour of the butterfly operators on a data-parallel programming environment. A data-parallel model makes explicit the parallelism in terms of parallel operations on data aggregates. This paradigm is based on a single thread of control and a globally shared address space. Parallel executions are determined by both how data is distributed among processors and the owner-computes rule. Each sentence is executed by the processor that owns the data item to be written or modified, that is, the data item that appears in the left-hand side of the sentence. The data needed for such computation (variables at the right-hand side of the sentence) may be local or not. In the latter case, a Dataflow of a two-stage algorithm using butterfly operators B 1 and B 2 , for a data sequence of length N = 9 and radix r = 3. Data items are specified by their index-digit representation. Assuming a block data distribution on a three-processor machine, the leftmost digit in the above representation is assigned to the processor field, while the other digit corresponds to the memory coordinate. Thus, all butterflies in B 1 are computed locally, while the second stage (B 2 ) implies global communications. In the bottom of the figure, the sets of data we have marked are those that recombine in the first butterfly of B 2 and the arrows represent the communications needed to compute such butterfly.
communication operation to gather that data is inserted by the compiler in front of the sentence. Consider again the butterfly operator B i . Subindex i represents the digit with respect to which the index-digit representation of all input data differs. Thus, and once the mapping vector has been chosen, if the ith digit has been assigned to the memory coordinate, any set of r input data to operator B i will have the same processor coordinate. Then, B i will execute computations concurrently in all processors using only local data. Otherwise, if the ith digit is in the processor field, then each input of a single butterfly is in a different processor. Hence global communications among r processors are needed in order to have a copy of the input sequence to this butterfly in the r processors (each sends/receives r − 1 data to/from the other r − 1 processors). Due to the owner-computes rule, each of these processors computes one output of the butterfly and stores the result in-place. Note, however, that not all inputs to the butterfly operators B i and B i are local, even in the case of the ith digit belonging to the memory field.
As an example of a dataflow, in the bottom of Figure 3 we show a block distribution of a data sequence of length N = 9 and r = 3 on a three-processor machine. Thus each data index consists of two digits, the most significant one assigned to the processor coordinate and the least significant one to the memory coordinate. The dataflow is described by the string of operators B 1 B 2 . In that condition, operator B 1 (first stage) is computed locally in each processor (index 1 is in the memory field). However, operator B 2 (second stage) requires communications in order to provide the correct input data in each processor. In this case, each processor sends/receives two pieces of data to/from the other two processors. The owner-computes rule implies that processor zero must compute the first output of all butterflies in B 2 , while processors two and three compute the second and third outputs, respectively, of such butterflies.
As an important conclusion, the parallel performance of the algorithm may improve if its dataflow is reorganized in such a way that all used butterfly operators have their indices belonging to the memory coordinate. This way, these operators will compute butterflies with maximum locality (complete in the case of B i and almost complete for B i and B i ). There are several possibilities in HPF to express this kind of parallelism. One of the methods is the use of an HPF LOCAL procedure, which is executed concurrently in all processors. Variables in an HPF LOCAL procedure are private to the processor which is running that particular instance of the procedure, and thus may store different values on different processors.
Data rearrangement operators
To reorganize the algorithm dataflow and obtain, for instance, maximum locality butterfly operators, some intermediate communication stages may be necessary. These stages are described using data rearrangement operators, that will be described in the rest of the section. Figure 4 shows examples of how to apply this operator to a radix-2 sequence of length N = 8.
Turning again to the data-parallel implementation, data rearrangements may be specified into HPF by means of array-sectioning operations. The exchange operator, for instance, corresponds to an array-section swap and thus may be implemented as in the Fortran 90 subroutine shown in Figure 5a (bit-intrinsic procedures are explained, for FIGURE 4 . Application of the operator E 3,1 and E 3,2 to a data sequence of length N = 8 (and r = 2). The rounded data sections are those that permute. instance, in Appendix A of [9] ). This subroutine performs the transposition operator E i,j on an array A of size N. In Fortran 90, an array section can be extracted from a parent array in a completely general way by using vector subscript notations. A vector subscript, v(N), is a onedimensional integer array expression, in which each of its elements has the value of a subscript in the array section being defined. The array sectioning shown in Figure 5a is determined by using a masked array assignment, by means of a WHERE statement. The result is that the assignment is only accomplished for those elements that 
fulfil the condition ((btest(v,i)) .NEQV. (btest(v,j))).
As a consequence, the transposition is only performed for those elements whose digits t i = t j . Table 1 provides the definitions of other basic data rearrangement operators.
The reduction operator, , transforms a data sequence into another whose length is reduced by a factor of r, by eliminating the least significant digit from its index-digit representation. This operation can also be easily implemented through Fortran 90 arraysectioning operations. For example, in the case of radix r = 2, the reduction operator applied to an array A[N] may be implemented as follows,
Taking Table 1 again, the extension operator, ε, corresponds to the reverse reduction operator. The perfect unshuffle operator, , performs a right cyclic shift between two digits in base r representation. This operator may be implemented in Fortran 90 as shown in Figure 5b . This subroutine applies i,j to an input array A. In this paper the perfect unshuffle will be only used to make local data rearrangements. Hence the vector subscript has a size U and the subroutine can be declared as EXTRINSIC (HPF LOCAL).
Finally, the inverse-digit operator, ρ, reverses the order of the digits between two positions in the data sequence representation. This operator may be implemented as a combination of exchange operations.
Note that if a permutation operator only modifies the memory field then a local data rearrangement is accomplished, with no communications. However, if the processor field is modified then data communication among processors occurs as a consequence of the operator.
DIVIDE AND CONQUER ALGORITHMS
A classical and well-known strategy to solve problems is the DC paradigm. This method consists of dividing a problem into several subproblems of smaller size, solving each subproblem independently and then combining the results obtained to construct a solution to the original problem. In parallel computation, the DC technique is a powerful method, as subproblems generated at each level may be solved independently and, thus, concurrently.
In this section we consider three classes of DC algorithms:
successive doubling based, regular trees (standard and extended) and irregular trees (the Barnes-Hut method for the N-body problem). First, however, a parallelization guideline and a performance model are described.
Parallelization guideline
In this section we propose a guideline to apply the parallelization framework to the first two classes of DC algorithms: successive doubling and regular trees. The irregular case is very difficult to formalize. However, our framework can also be applied, as shown at the end of this paper for a particular case. Note that the dataflow of the above two DC classes exhibits the following characteristics.
• A sequence of n stages (see Figures 6a and 9a) , each one may be described by means of some of the butterfly operators defined in Section 2.2.1.
• At a stage, the distance of the input data indices to each butterfly is r times greater (lower) than the distance in the previous stage, r being the radix.
Keeping in mind these characteristics, our guideline proceeds as follows.
1. For each algorithm we consider the straightforward formulation of its dataflow using only butterfly operators (see Figure 3 ). 2. In a data-parallel environment, and independently of how the data sequence is mapped on the machine, some of the butterfly operators will execute locally while others will require data communications. Once data distribution of the input sequence has been decided (by the programmer or externally by the problem), we identify the memory field using the mapping vector. From this field we determine which butterflies in the operator string have indices in it. These operators are executed with maximum locality (see Section 2.2.1). 3. The rest of the butterfly operators have their indices in the processor field. Locality may be optimized by interchanging that index with another in the memory field.
As a result, the original string operator (composed only by butterflies) is translated into an equivalent one made of locality-optimized butterflies and index-digit data rearrangement operators (see Section 2.2.2).
Each one of the data rearrangement operators in
the resulting string operator in step three may be implemented by a specific HPF routine, as was explained in Section 2.2.2. On the other hand, butterfly operators will become calls to HPF LOCAL subroutines.
The key point in the above guideline is step three, because there are usually many ways of transforming a string of butterfly operators. The final string operator should imply data communications organized in well-defined stages (this also simplifies its translation into an HPF program). To decide the most suitable transformation it is important to take into account what data movement is implied by each index-digit operator when it is applied to an input data sequence. In addition, we also need to know which of these data movements are best performed (in efficiency terms) in our particular target parallel machine, according to the performance model introduced in the next section. Sometimes, the given performance model is not enough for our particular target machine. In such cases, some actual experiments may be necessary in order to determine the data movements with best performance. As a result, the operators for those data movements should be chosen to build the transformed string operator. As an example, in the case of a mesh-connected machine, we should restrict our communications to those operators that interchange indices between the memory field and only one of the dimensions of the mesh (row or column). This results in parallel communications along rows or columns.
Performance model
We need an execution model in order to characterize and compare the different data-parallel formulations for DC problems that we will introduce in the following sections. The model calculates the total execution time of a parallel program as
where T total is the total execution time, T B is the execution time of a butterfly operator B, k is the number of butterfly operators (k = (N × n)/r for most algorithms we have considered) and T comm is the cost of communications, which includes communications due to non-local butterflies. Communication cost is proportional to the number of communication steps or messages, assuming that all of them are of similar length. This way, the communication cost for the butterfly operator, B i , will be
As an example, the communication overhead for an algorithm with a dataflow similar to that of the example in Figure 3 , but with n = u + q stages, is given by
Communication cost for the index-digit operators shown in Table 1 may be computed in a similar way,
Assuming a block distribution of the input sequence and t l as the most significant index in the memory field, the above expression becomes
Successive doubling
The successive doubling (SD) method was introduced by Cooley and Tukey [10] to compute the FFT. We refer to this class of algorithms as standard SD. The Wang and Mou algorithm [11] for solving tridiagonal systems is an important example of standard SD. Other examples are the Batcher's bitonic sort algorithm [12] , the Hartley transform [13] and the cosine transform [14] . There are variants of the SD method that we denote as extended SD, which we will consider next. Examples of such variants are the Hockney's parallel cyclic reduction [15] method, the wavelet transform [16] and filter banks [17] . Finally we will also study the Cooley-Tukey FFT algorithm in full, which is based on the SD method, but includes partial reordering stages. Figure 6a depicts an example of a standard SD dataflow for a data sequence of length N = 8. A straightforward formulation of such dataflow using base r = 2 butterflies operators is B 1 B 2 B 3 (operators apply from left to right). In general, the string formula would be Let us assume a block distribution of the data sequence, of arbitrary length N, on a parallel computer. In such a case, butterflies with input data close together (that is, B i , with i = 1, 2, . . . , u belonging to the memory coordinate) will execute locally, while the rest of them will require data communications. As indicated in Section 3.1, this fact introduces complexities when writing the HPF parallel code. These difficulties may be reduced by inserting in the dataflow, data rearrangement operators that transform nonlocal butterfly operators into local ones. For our standard SD dataflow, after the first u local stages we can insert exchange operators to interchange digits between the processor and memory fields. As a result, indices of the non-local butterfly operators will be shifted to the memory coordinate, and thus they become local.
Standard SD
The new dataflow can then be formulated as
where u + q = n. As u is the length of the memory field, the first u stages of the algorithm are locally computed. After these stages, each one of the q next stages includes, in this order, an exchange operator, that affects both the memory and processor fields, a local butterfly computation and a local data rearrangement operator u,1 (perfect unshuffle). Figure 6b shows the new dataflow assuming a block data distribution on a four-processor machine. Note that the local perfect unshuffle does not occur for this simple case. A more complete example is outlined in Figure 7 , for a data sequence of length N = 128, block distributed on 32 processors. Observe in both examples that the output data sequences appear cyclically reordered with regard to the original order. However, the original order can be obtained by applying the perfect unshuffling operator u times.
The communication cost associated with Equation (2) is
which improves the communication cost of the straightforward implementation in a factor of r(r − 1). The above is the best option we have found for the standard SD dataflow. The dataflow described by Equation (2) can now be easily translated into an HPF program, as shown in Figure 8 . Routines for computing data rearrangement operators, Exchange() and PerfectUnshuffle(), appear in Figure 5 . The Butterfly routine simply implements the specific arithmetic computation associated with the operator. Figure 9a depicts an example of an extended SD dataflow, for a data sequence of length N = 16 with r = 2. It is also shown how the data sequence will be mapped on a four-processor machine if a block data distribution is used. A straightforward formulation of such dataflow using the operators described in the previous section is B 1 B 2 B 3 B 4 . In general, it would be n i=1 B i . In principle, extended SD dataflows appear to show communication complexities similar to standard ones. Therefore, a similar solution could be used, that is, by transforming the dataflow in a similar way to that described by Equation (2), but replacing the butterfly operators by the extended ones. However, this is not so as, assuming again a block data distribution, all extended butterflies imply data communications. In addition, the message size to be communicated increases with the index i of B i . For instance, consider the straightforward formulation of the dataflow shown in Figure 9a . In the first stage, each processor needs the following input data sets for computing the two butterflies assigned, and similarly in the rest of the stages. As a result, the communication cost for the straightforward formulation is given by
Extended SD
where u + q = n (and Q = 2 q ).
Considering that message length increases with the index i of B i , which also represents the stage in the algorithm dataflow, better parallel performance would be expected if such dataflow is reorganized in a reversed order to (2) . That is, having the local computations at the end,
This way, we begin by performing the communications due to operator E in the first q stages. Operator u,1 means local rearrangement of data (perfect unshuffle permutation). In the remaining u stages there are no communications because the data required to compute the butterflies are in the same processor.
In Figure 9b we schematically illustrate the reorganized dataflow for a data sequence of length N = r 7 block mapped on 16 processors. Extended butterfly operators, B 1 , are executed in the first q = 4 stages, followed by a perfect unshuffle permutation and an exchange. At the end of these stages, processor and memory fields appear to be interchanged, i.e. the original block data distribution has been transformed into a cyclic distribution. As a result, the following extended butterfly operators, B i , will execute with no communications. As an example, consider the case N = 16, r = 2 and a four-processor machine (see Figure 9a ). Butterfly operator B 3 , in the third stage of the original dataflow, requires the following input data for each processor:
As q = 2, the data sequence at this third stage is already cyclically distributed. Thus, t 2 t 1 digits in the index-digit representation of the data sequence belong to the processor field, while the first two most significant digits, t 4 t 3 , belong to the memory field. Hence all the inputs to each butterfly in B 3 are in the same processor (although each processor will now execute different butterflies to those of the original dataflow).
In general, the communication cost for the new formulation of the extended SD is
Note that the output data sequences appear cyclically reordered with regard to the original order.
As in the case of the standard SD, the HPF implementation of extended SDs is obtained in a similar way, as shown in Figure 10 .
FFT
The Cooley-Tukey FFT algorithm dataflow consists of an initial-digit reversal permutation (inverse-digit operator) followed by a standard SD. Alternatively, the dataflow of the standard SD may be reversed, locating the permutation at the end of the algorithm. We will take this last option with the aim of presenting a situation different from that studied in the previous sections. Observe that, in this case, it is more convenient to interchange the fields processor and memory. That is, the data sequence will be initially distributed in a cyclic way instead of block.
Communications associated with the digit reversal permutation degrade the performance significantly. So, a large number of algorithms have been designed [18] to reduce the impact of this permutation. We consider here the selfsorting algorithm because, on the one hand, it provides the resulting sequence in the correct order and, on the other hand, the digit reversal permutation can be decomposed into exchange operations and a local inverse-digit operation. This formulation, presented in [7, 19] , is
where n > 2q. Note that the last u (from q + 1 to n) stages of the algorithm are executed locally as index i belongs to the memory field (assuming as indicated before a cyclic distribution of the data sequence). The first q stages, in contrast, require communications due to the exchange operator. Between both bunches of stages, an inverse-digit operator accomplishes local data rearrangements. Figure 11 illustrates the dataflow of the self-sorting FFT for a data sequence of N = r 6 data on four processors. As before, Equation (4) can be easily translated into a HPF program, as shown in Figure 12. 
Standard regular trees
Regular trees are a different and frequent dataflow pattern for DC problems. We consider that in a regular direct tree each node has r inputs and r outputs, but only one of the outputs progresses to the next stage. So, the number of nodes computed in each stage is reduced by a factor of r as we progress through the tree. While computations on nodes are represented by the butterfly operator, the reduction in the number of nodes may be represented by the reduction operator, .
Therefore, following the above description, the dataflow of a regular direct tree is similar to that of the standard SD, but inserting a reduction operation between each pair of butterflies. Having in mind the processor and memory coordinates in the index-digit representation, a block data distribution of the data sequence implies that both fields appear in the order (processor, memory). In the first u = n − q stages of the tree algorithm, the reduction operator eliminates a digit from the memory field. After these u stages, the field memory disappears, which means that a unique data item remains in each processor. During the execution of the rest of the stages a digit is eliminated from the processor coordinate, stage by stage. This means that, stage by stage, the number of working processors is reduced by r, and each active processor computes a unique butterfly. This is inefficient in parallel because it implies a large number of short data communications.
We may reorganize the tree dataflow in a much more efficient way if a number of short communications are grouped into only one long communication, enough to completely fill the available memory in the processors. That is, in each stage of the tree, the number of working processors is reduced by a factor of r, each active processor computing several butterflies locally. This can be expressed as
In this equation we have assumed, as it is the most usual situation, that the memory field is longer than the processor field. Otherwise, the last two stages in Equation (5) must be iterated q/u times. The mutation operator M proc, * gathers all data in each processor and allocates them in processor zero. In other words, it transfers the digits of the processor coordinate to the memory coordinate, proc mem proc mem
This operator can be implemented through Fortran 90 array sections using subscript triplet notation. For example, in the case of a block distribution, the mutation operator In Figure 13 we present an example of the dataflow described by Equation (5) for a data sequence of 64 items on 16 processors.
Following a similar argument, an inverse tree can be formulated by reversing the formula in Equation (5), as
The mutation operator M * ,mem is the reverse of M proc, * , that is, it scatters the data on processor zero to the rest of the processors, proc mem proc mem
As in the case of operator M proc, * , the M * ,mem operator can be implemented through Fortran 90 array sections using subscript triplet notation, as in the following sentence, A(1:N:U) = A(1:Q:1).
Extended regular trees
A variant of the standard tree used in many DC algorithms is what we call an 'extended tree'. In these trees, the output of some of the nodes can be used as input for more than one node in the following stage. A typical example of an extended tree dataflow is the elimination stage of the cyclic reduction (CR) algorithm [15, 20] , shown in Figure 14 . The formulation of this type of tree is similar to that of the standard trees (see Equation (5)), but replacing butterflies by extended butterflies. For example, in the case of the CR algorithm, extended butterflies of type B i are used.
As extended trees constitute a general case, including the standard trees, in Figure 15 we show the general HPF program for such general trees. This program derives directly from the operator formulation explained above. Data rearrangement operators, M proc, * and mem , appear as array sections using subscript triplet notation as pointed out before. The Extended Butterfly() routine implements the specific arithmetic computations associated with the operator.
Irregular trees
For irregular DC applications, special data distributions are required to provide both data locality and load balancing. The operational description of the problem and its implementation into a data-parallel language is significantly more difficult than in the problems analysed in the previous sections. Examples of irregular trees appear in many problems in physics, engineering and computer graphics, such as N-body, molecular dynamics, 3-D rendering and radiosity.
As a working example of an irregular-tree-structured DC problem, the hierarchical N-body problem is considered. Recent work [21, 22] has shown that hierarchical N-body simulations can be implemented into HPF. All these implementations, however, use non-adaptive and adaptive versions of the Anderson method for far-field force computations. This section shows an efficient HPF implementation of a different hierarchical force calculation method, the Barnes-Hut (BH) algorithm [23] . There are significant differences between their implementations and ours. Basically, in our case communications are implemented using array copy sections, obtaining efficient point-to-point data transferences. In addition, HPF LOCAL constructs are used to obtain very efficient node local codes. The BH algorithm is the simplest hierarchical N-body method, widely used in astrophysics. As an astrophysics application simulates a large number of bodies (millions of stars), the BH method employs a DC strategy to reduce the number of particles in the force sum. A simple physical intuition is used: the force contribution of a far enough galaxy can be completely replaced by a single point mass located at its centre of mass. This idea is recursively applied with the help of a tree data structure, an octtree in 3-D or a quadtree in 2-D. The root of a quadtree (octtree) is a square (cube) comprising the whole body domain. This large box is subdivided into four (eight) equal-sized subboxes. This partitioning process is repeated recursively until at most one body is found in each subbox.
Basically, the BH method is organized into three stages. First, the quadtree (octtree) for the body domain is built. Second, for each subbox of the tree structure, the centre of mass and the total mass for all the bodies inside are computed. And third, for each body, the tree data structure is traversed to compute all the force contributions following a DC strategy. That is, if the distance from the body to a subbox is much larger than its size, then all the bodies contained are approximated by a single point mass. One typical distance test, known as the multipole acceptability criterion (MAC), is based on checking if the fraction l/d is less than some user-defined threshold [23] , where l is the length of a side of the subbox and d is the distance from the body to the subbox centre of mass.
Data-parallel implementation
For an irregular problem like the BH method, our parallelization framework is used as follows:
Linearization. In our formulation, we linearize the tree data structures (quadtrees or octtrees) through the use of space-filling curves [4] . We use, in particular, a Peano-Hilbert (PH) linear ordering, which has the property that, to a certain degree, two points adjacent on the curve are adjacent in the tree structure. In addition, the PH curve is easy to generate and results in simple indexing. As an example, Figure 16 shows a 2-D ensemble of bodies and its associated quadtree. In the bottom of the figure, the corresponding PH curve is depicted together with the corresponding linear ordering of the bodies.
Data distribution. After linearizing the problem, the linear sequence of bodies is block distributed among processors to keep the locality exhibited by the PH curve, and the computations of the simulation are organized as array operations. A DATA-PARALLEL FORMULATION FOR DIVIDE AND CONQUER ALGORITHMS 315 sections to organize communications, instead of array indirections or irregular gather/scatter operations, as in [21, 22] . This facilitates the compiler work to generate efficient communication operations.
Workload balancing. Finally, as BH contains dynamic computations, a diffusive type workload balancing scheme is included in the parallel algorithm.
As an example, Figure 17 represents the complete dataflow of one simulation step of the BH method for a 2-D system. The upper part of this figure shows the original ensemble of bodies, its corresponding quadtree, the PH linearization of the tree and the block partitioning of the resulting linear array of bodies.
To simplify the explanation, we have represented the BH method by the following string of operators,
Note that although the BH dataflow can be organized as a string of operators, these operators cannot be defined as permutations of the index-digit representation of the data sequence. We have to introduce new computational and data arrangement operators. In the above operator string (Equation (9)), the initial product runs over the iterations or steps of the simulation. For each iteration, we have three computing stages.
• In the first stage ( U i=1 T z,i ), each processor builds a local quadtree from the assigned block of bodies. Symbol z represents the root of the local tree, which is built incrementally by adding one by one all the local bodies (U = N/Q bodies in each processor). This is a completely local computation that can be easily written into HPF as follows (operations to build a quadtree are not essential in the explanation of the BH method implementation, and thus were encapsulated into a procedure),
• The second stage represents an information exchange, to enable computation of force contributions on the local bodies. Each processor must determine what information it owns is needed by the rest of the processors. This information is obtained by applying the MAC heuristic to benchmark bodies coming from the other processors. In our implementation, each processor broadcasts the extreme subboxes in its local tree. This interchange is represented by operator ξ , while operator ϒ represents the application of the MAC test. The results are stored in three tables, slice, list and list t (see Figure 17) . The ith entry of 
END DO.
• In the third stage ( UU i=U +1 T z,i ), the information exchanged is added to the corresponding local tree, thereby obtaining an extended local tree. UU is the total number of bodies that each processor now has, also considering the owned ones. Finally, each processor is now able to compute forces on the owned bodies, and update their velocities and positions (F , V and X operators). All these operations are also local computations, as in the first stage, and can be written into HPF in a similar way, as follows, overcome this problem we have also implemented a dynamic load balanced version of the BH method, using a diffuse type dynamic scheduling algorithm [24] , which preserves some data locality. At the end of each simulation step, the scheduling piece of code tests if the workload imbalance is greater than some threshold. The imbalance is calculated by taking the difference between the loads of the most-loaded and the least-loaded processors divided by the average load in the full system. In the case of high imbalance, the most-loaded processor sends a fraction of its set of bodies to the least-loaded of its neighbouring processors (a linear processor array is considered). The workload is estimated by counting the total number of force contributions with regard to its set of bodies.
EVALUATION
We have implemented an example set of algorithms that exhibit the dataflows described in previous sections. All programs were written in HPF and compiled using the Portland Group [25] commercial HPF compiler. The experiments were conducted on a Cray T3E multiprocessor.
First, we have considered the original FFT (standard SD with inverse-digit permutation) which is straightforwardly formulated as the string B 1 B 2 . . . B n ρ (N = 2 n is the problem size), as well as the formulation given by Equation (4) . Figure 19e presents the speedups for both implementations, for an input data sequence of length N = 2 20 . The improvement in the parallel performance is significant. Figure 19a depicts the speedup of this algorithm for two problem sizes. Figure 19 also presents the speedups for the parallel cyclic reduction algorithm, PARACR [15] (extended SD dataflow) and the radix-2 cyclic reduction, CR [15, 20] (extended and inverse regular tree dataflows).
These numbers have been measured with respect to the straightforward sequential algorithm before dataflow reorganization, and show the low overhead associated with our method. In all cases we obtain superlinearity, due to the dataflow reorganization, which introduces high data locality, with a significant reduction in cache misses. The speedup plots in Figure 19 show a severe knee around eight processors. This knee appears for all our test cases due to limited range in sizes of the input datasets used. Figure 19d shows, for instance, the execution behaviour of the FFT algorithm for an input data sequence of length N = 2 20 . Note that with more than two processors, the communication time increases very slowly with the number of processors, but the computation time reduces much faster at the same time. Thus at the end the total running time is dominated by the communications, which is almost constant (which justifies the slowdown in the speedup).
In the case of the BH method, a total of 100 iterations (timesteps) were executed and measured. Input bodies were generated randomly according to the Plummer model with monopolar approximation and potential softening, which is widely used to generate spherical galaxies made up of equal mass bodies. The softening parameter avoids singularities when particles get too close to each other. The code for the particle generator was lifted from the SPLASH2 Barnes application [26] , and modified for use with this simulation. Figure 20a shows the performance of the HPF code. Speedups are calculated with respect to the sequential program running on one processor, and 100 timesteps were executed. Superlinearity is obtained as the dataflow reorganization introduces high data locality. Beyond eight processors there is a significant reduction of performance (see Figure 20b performance gains are presented in Figure 20c compared to the extreme case of 'no balance'.
RELATED WORK
Much attention has been paid in the literature to the formal parallelization of DC algorithms. In [27] a transformational approach is presented to obtain generic parallel DC algorithms, that are further implemented on linear arrays, mesh or hypercube multicomputers. Starting from an operational DC program, mapping sequences to sequences, they apply a set of semantics preserving transformational rules, which transform the parallel control structure of DC into a sequential control flow, thereby making the implicit data parallelism in a DC scheme explicit. A similar approach is used in [28] , where a provably correct, architecture-independent family of parallel implementations for a class of data-parallel algorithms is derived, called DH (distributable homomorphisms). The implementations are well-structured SPMD programs with groupwise personalized all-to-all exchanges. These exchanges may be directly translated into MPI using MPI ALLtoall sentences. As a case study, FFT is analysed.
A rather different methodology, based on a space-time mapping, is presented in [29] . This work uses a geometric computational model based on coordinate transformations with which time (the schedule) and space (the processor allocation) can be made explicit. This technique may be applied to a class of DC recursions, resulting in a functional program skeleton and its parallel implementation on message-passing multiprocessors.
A similar methodology to the one presented in this paper has been developed in [30] . This work designs a framework based on induced permutations to compute the different FFT algorithms. Induced permutations are based on operations on the tensor sum decomposition of the data indices, and represent a generalization of our indexdigit permutations (which are included as a particular case). Induced permutations permit the implementation of a large class of data-sorting procedures, and are used in the work of formulating mixed-radix versions of the FFT.
Finally, in [31] a parallel implementation of the BarnesHut method is studied, very similar to the one considered in this paper. The parallel method is structured into a sequence of alternating computation and communication phases. The distribution of the bodies among processors is accomplished by means of an orthogonal recursive bisection. Load balancing is introduced in each phase, taking as weights the number of interactions being computed by each processor. Next, each processor builds a locally essential tree, which carries an irregular communication pattern, unknown in advance. To simplify the procedure a global reduction is used to first compute the number of messages destined for each processor. Finally, after building the tree, the processors can proceed to compute without any communication. Our parallel implementation is similar but with the difference that the data distribution is carried out using a space-filling curve.
CONCLUSIONS
We have presented a data-parallel formulation for problems and demonstrated its use in implementing regular and irregular algorithms. Our experiments provide promising evidence that a data-parallel approach can be used to efficiently solve regular and irregular DC problems without adding new features to the currently available HPF languages and compilers. The methodology we have used to reorganize the algorithmic dataflows provides great flexibility for efficiently exploiting data locality and reducing and optimizing communications.
