Abstract. In most distributed memory computations, node programs are executed on processors according to the owner computes rule. However, the owner computes rule is not well suited to irregular application codes. In such codes, indirection in accessing the left hand side array makes it difficult to partition the loop iterations, and because of indirection in accessing right hand side elements, total communication may be reduced by using heuristics other than the owner computes rule. In this paper, we propose a communication cost reduction computes rule for irregular loop partitioning, called the least communication computes rule. We assign each loop iteration to the processor on which executing that iteration incurs the minimal communication cost. The experimental results show that, in most cases, our approach achieves better performance than other loop partitioning rules.
Introduction
In recent years, there have been major efforts in developing approaches to parallelization of scientific applications. Distributed memory machines provide the computing power that scientists and engineers need for research and development. Parallelizing compilers play an important role by automatically customizing programs for complex processor architectures, improving portability and providing high performance to non-expert programmers.
As scientists attempt to model and compute more complicated problems, they need to develop efficient parallel code for sparse and unstructured problems in which array accesses are made through a level of indirection. This means that the data arrays are indexed through the values in other arrays, which are called indirection arrays or index arrays. The use of indirect indexing causes the data access patterns, i.e. the indices of the data arrays being accessed, to be highly irregular. Such a problem is called an irregular problem: its dependency structure is determined by values known only at runtime. Irregular applications are found in unstructured computational fluid dynamics (CFD) solvers, molecular dynamics codes, diagonal or polynomial preconditioned iterative linear solvers, and n-body solvers.
Exploiting parallelism for irregular problems is very difficult because of their irregular data access patterns. Figure 1 illustrates a typical irregular code segment. Here, elements are moved across the rows of a 2-D array based on the information provided in the indirection array ielem. The elements of array cell are shuffled and stored in array new_cell. From this example we can observe that accesses to the data array new_cell are dictated by the contents of the index array ielem in the irregular loop. Because the access patterns change at runtime, compiler analysis of data distribution, locality and cache optimization, and communication optimization becomes more difficult.
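The access pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names cell, new_cell, and ielem follow the text, while the helper shuffle_rows, the 0-based indexing, and the array contents are hypothetical and are not taken from Figure 1.

```python
def shuffle_rows(cell, ielem):
    """Store row i of `cell` into row ielem[i] of a new array (0-based indices)."""
    new_cell = [None] * len(cell)
    for i, row in enumerate(cell):
        # The destination row is known only after ielem[i] is read at runtime,
        # so a compiler cannot tell in advance which rows each iteration touches.
        new_cell[ielem[i]] = list(row)
    return new_cell


cell = [[10, 11], [20, 21], [30, 31]]   # hypothetical data array
ielem = [2, 0, 1]                       # hypothetical indirection array
new_cell = shuffle_rows(cell, ielem)
```

With a different ielem the same loop writes to entirely different rows, which is why the communication pattern of such a loop can only be determined at runtime.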
Researchers have demonstrated that the performance of irregular parallel code can be improved by applying a combination of computation and data layout transformations. Some research focuses on providing primitives and libraries for runtime support [2, 12, 3, 9], some provides language support, such as adding irregular facilities to HPF or Fortran 90 [16, 11, 18], and some attempts to utilize caches and locality efficiently [4].
Hwang et al. [12] presented a library called CHAOS, which helps users implement irregular programs on distributed memory machines. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors. From the same group, Ponnusamy et al. extended the CHAOS runtime procedures, which are used by a prototype Fortran 90D compiler, to make it possible to emulate irregular distribution in HPF by reordering elements of data arrays and renumbering indirection arrays [16]. Das et al. [3] also discussed primitives to support communication optimization of irregular computations on distributed memory architectures. These primitives coordinate inter-processor data movement, manage the storage of, and access to, copies of off-processor data, minimize inter-processor communication requirements, and support a shared name space.
In this paper, we propose an approach to minimize the communication cost in pre-processing for compiling irregular loops. In our approach, neither the owner computes rule nor the almost owner computes rule proposed in [16] is used in the parallel execution of a loop iteration for irregular computation. Instead, a communication cost reduction computes rule, called the least communication computes rule, is proposed. For a given irregular loop (a loop whose body includes assignments with indirection array references), two sets $\mathit{FanIn}(P_k)$ and $\mathit{FanOut}(P_k)$ are defined for each processor $P_k$ ($0 \le k \le m$), where $\mathit{FanIn}(P_k)$ is the set of processors that have to send data to processor $P_k$ before the iteration is executed, and $\mathit{FanOut}(P_k)$ is the set of processors to which processor $P_k$ has to send data after the iteration is executed. Using this information, we partition the loop iteration to the processor on which minimal communication is ensured when executing that iteration.
Should Owner Computes Rule Always Be Used?
In the following discussion, we assume the irregular loop body has only loop-independent dependences and no loop-carried dependences (it is very difficult to test for irregular loop-carried dependence, since dependence testing methods for linear subscripts do not apply), because most practical irregular scientific applications contain loops of this kind. Consider the irregular loop below:
Generally, in distributed memory compilation, loop iterations are partitioned to processors according to the owner computes rule [1] . This rule specifies that, on a single-statement loop, each iteration will be executed by the processor which owns the left hand side array reference of the assignment for that iteration.
For the loop in Example 1, if the owner computes rule is applied, the first step is to distribute the loop into three individual loops containing the statements S1, S2, and S3, respectively. Without loss of generality, assume that for iteration $i_r$, $X(ix(i_r))$, $Y(iy(i_r))$, and $Z(iz(i_r))$ are distributed onto $P_0$, $P_1$, and $P_2$, respectively. Then iteration $i_r$ of S1, S2, and S3 would be partitioned to processors $P_1$, $P_0$, and $P_2$, respectively. Thus, if any array element referenced on the right-hand side is not owned by the processor executing the statement (an off-processor reference), that element has to be communicated to the owner. The following table shows the owner executing each assignment and the required communications for the example loop.
Statement Owner array elements required communication S1:
However, the owner computes rule is often not well suited to irregular codes, for two reasons. First, indirection in accessing the left hand side array makes it difficult to partition the loop iterations according to the owner computes rule. Second, because of indirection in accessing right hand side elements, total communication may be reduced by using heuristics other than the owner computes rule. Therefore, in the CHAOS library, Ponnusamy et al. [16, 17] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in the iteration.
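A minimal sketch of the almost owner computes rule as described above, assuming the owner of every distributed array reference in the iteration is already known. The function name almost_owner and the lowest-id tie-break are assumptions made for illustration; neither comes from the CHAOS library.

```python
from collections import Counter


def almost_owner(ref_owners):
    """Return the processor that owns the most references in one iteration.

    ref_owners: one processor id per distributed array reference.
    Ties are broken by the lowest processor id (an arbitrary, assumed choice).
    """
    counts = Counter(ref_owners)
    best = max(counts.values())
    return min(p for p, c in counts.items() if c == best)


# Three references owned by three different processors: every choice ties.
tie_choice = almost_owner([0, 1, 2])
# One processor owns two of the three references and is selected.
majority_choice = almost_owner([0, 2, 2])
```

The first call shows the weakness of the heuristic: when every processor owns the same number of references, the rule itself gives no guidance and the result depends entirely on the tie-break.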
For the above irregular loop, if the almost owner computes rule is used, each of $P_0$, $P_1$, and $P_2$ owns one element participating in iteration $i_r$, so we can select any one of them, for example $P_0$, as the owner. This means that all the statements of iteration $i_r$ must be computed on processor $P_0$. The required communications are shown as follows, where tmp_Y and tmp_Z denote the values that are computed on the executing processor but need to be sent back to the owners of $Y(iy(i_r))$ and $Z(iz(i_r))$, respectively:
$P_0$: tmp_Z, $Y(iy(i_r))$ ($P_1 \to P_0$). After the iteration is executed, tmp_Y and tmp_Z are scattered to the owners of $Y(iy(i_r))$ and $Z(iz(i_r))$.
The communication cost is clearly reduced compared to the owner computes rule. Some HPF compilers employ this scheme by using the EXECUTE-ON-HOME clause [18]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that the rule does not always minimize communication cost, in terms of either the number of communication steps or the number of elements to be communicated. Another drawback is that it is not straightforward to choose the optimal owner if several processors own the same number of array references.
For example, in the above loop, if $P_2$ is selected as the executing processor, only three communications are required ($X(ix(i_r))$: $P_0 \to P_2$, and scattering tmp_X and tmp_Y to the owners of $X(ix(i_r))$ and $Y(iy(i_r))$). Example 2 shows a more complicated irregular loop, a simplified version of a loop extracted from the ZEUS-2D code [15]:
Example 2
      DO 10 t = 1, time_step
C     Outer loop takes the execution times of irregular loop
        DO 100 i = 1, N
S1:
We assume this loop is executed on 4 processors in parallel. If the array element Y(i) is aligned with X(i) in the initial distribution, then Y(j1(i)) is clearly distributed onto the same processor as X(j1(i)). So we can assume that $P_0$, $P_1$, $P_2$, and $P_3$ own [...], and [X(j4(i)), Y(j4(i))], respectively, for iteration i. According to the almost owner computes rule, this loop iteration would be partitioned to $P_2$ because it owns the largest number of data elements. The communication would be:
- Import communication before the loop iteration is executed:
  1. X(j1(i)), Y(j1(i)): $P_0 \to P_2$
- Export communication after the loop iteration is executed:
  1. tmp_Yj1: $P_2 \to P_0$
  2. tmp_Xj2: $P_2 \to P_1$
  3. tmp_Xj4, tmp_Yj4:
However, if we consider the communication overhead when the iteration is partitioned to $P_0$, we obtain the following communication pattern:
- Import communication before the loop iteration is executed:
  1. tmp_Xj2: $P_0 \to P_1$
  2. tmp_Xj4, tmp_Yj4:
Although the number of elements to be communicated is 6, the same as before, the number of communication steps is reduced to three. This improvement is important when the outer sequential time-step loop runs for many iterations. It illustrates that the almost owner computes rule is not always an optimal scheme for guiding the partitioning of loop iterations.
Based on the above observation, we propose a more efficient computes rule for irregular loop partitioning. This approach assigns each iteration to a processor such that executing the iteration on that processor ensures that
- the number of communication steps is minimal, and
- the total number of data elements to be communicated is minimal.
Efficient Loop Iteration Partitioning
In this section, we give the strategy of loop iteration partitioning for irregular codes according to the least communication computes rule. Unlike the owner computes rule, the whole loop body of a loop to be parallelized is processed. Suppose that all of the arrays, including data arrays and index arrays, are initially distributed as BLOCK. The communication pattern of a partitioned loop iteration on a processor can be represented as a directed graph G = (V, E), called the communication pattern graph (CPG), where the set of nodes V consists of all pre-execution nodes, an execution node, and all post-execution nodes for the processors. An edge from a pre-execution node $P_i$ to an execution node $P_j$ ($i \ne j$) represents that if an iteration is executed on processor $P_j$, processor $P_i$ has to send data to $P_j$. The weight of the edge represents the number of data elements to be communicated. Similarly, an edge from an execution node to a post-execution node represents that the results of executing the iteration are returned to that processor. Figure 2 shows the communication pattern graph of the loop of Example 2.
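One possible in-memory representation of a CPG is a dictionary of weighted edges, sketched below. The function build_cpg and the tuple-based node labels are assumptions made for illustration. The sample call mirrors the case of executing the Example 2 iteration on $P_2$; the receiver of the last export is truncated in the text, so $P_3$ is assumed here.

```python
def build_cpg(exec_proc, imports, exports):
    """Build the weighted edge set of a CPG for one iteration run on exec_proc.

    imports: {processor: number of elements it sends to exec_proc}
    exports: {processor: number of elements it receives from exec_proc}
    Returns {(source_node, target_node): weight}.
    """
    edges = {}
    for p, n in imports.items():
        if p != exec_proc and n > 0:      # local data needs no edge
            edges[(("pre", p), ("exec", exec_proc))] = n
    for p, n in exports.items():
        if p != exec_proc and n > 0:
            edges[(("exec", exec_proc), ("post", p))] = n
    return edges


# Execute on P2: import 2 elements from P0, then export 1 to P0, 1 to P1,
# and 2 to P3 (P3 is an assumption, the text truncates this receiver).
cpg = build_cpg(2, imports={0: 2}, exports={0: 1, 1: 1, 3: 2})
steps = len(cpg)              # number of communication steps (edges)
volume = sum(cpg.values())    # number of elements communicated
```

In this representation the number of edges gives the number of communication steps and the sum of the weights gives the number of elements moved, the two quantities that the least communication computes rule compares.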
Loop Partitioning Algorithms
Iteration partitioning according to the least communication computes rule assigns iterations to processors such that the total number of communication steps (import and export) and the total message length are minimized. Before the loop partitioning algorithms are presented, we give the following definitions.
Definition 1 A loop-independent dependence exists in a loop body (block) when two statements reference the same memory location within a single iteration of all their common loops. In this paper, we pay attention to the following three kinds of dependence: 
. = A(ia(i))
Definition 2 Let an iteration $i_r$ be partitioned onto processor $P_k$. The set $D_j(i_r, k)$ is defined as all the data array elements that must be sent from $P_j$ to $P_k$, and $|D_j(i_r, k)|$ is the number of elements in the set. Similarly, $\overline{D}_j(i_r, k)$ is defined as all the data array elements that must be sent back to $P_j$ from $P_k$.
Definition 3
Let an iteration $i_r$ be partitioned onto processor $P_k$. The set $\mathit{FanIn}(P_k)$ is defined as the set of processors $P_1, P_2, \ldots, P_l$ that have to send data to processor $P_k$ before the iteration is executed. Each processor
If there is no need to import data when executing the iteration on $P_k$, $\mathit{FanIn}(P_k) = \emptyset$. Similarly, the set $\mathit{FanOut}(P_k)$ is defined as the set of processors $P_1, P_2, \ldots, P_l$ that have to receive data from processor $P_k$ after the iteration is executed. Each processor $P_j$, $1 \le j \le l$, has a degree
If there is no need to export data after executing the iteration on $P_k$, $\mathit{FanOut}(P_k) = \emptyset$.
Example 3
For Example 1, suppose that the sizes of arrays X, Y, Z and ix, iy, iz are all 12, the number of processors is m = 3, all data arrays and index arrays are distributed with BLOCK, and the values of the index arrays are as follows:

i     1   2   3   4   5   6   7   8   9  10  11  12
ix    5   7   9  11   1   3   2   4   6   8  10  12
iy    4   4   3   3   1   6   9  10   2   4   6   8
iz    4   5   6   7   8  10  12   2   1   3   9  11

We want to partition iteration 5. The data elements referenced by this iteration in the loop body are X(1), Y(1), and Z(8).
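As a hedged illustration of the setup in Example 3, the following sketch looks up the elements referenced by iteration 5 and computes their owners under a BLOCK distribution. It assumes processors are numbered from 0 and that the 12 elements are split into contiguous blocks of 4; the helper block_owner is hypothetical.

```python
def block_owner(index, size, nprocs):
    """Owner of the 1-based element `index` under a BLOCK distribution."""
    block = -(-size // nprocs)      # ceiling division gives the block length
    return (index - 1) // block


ix = [5, 7, 9, 11, 1, 3, 2, 4, 6, 8, 10, 12]
iy = [4, 4, 3, 3, 1, 6, 9, 10, 2, 4, 6, 8]
iz = [4, 5, 6, 7, 8, 10, 12, 2, 1, 3, 9, 11]

i = 5                               # the iteration to be partitioned
elements = {"X": ix[i - 1], "Y": iy[i - 1], "Z": iz[i - 1]}
owners = {name: block_owner(idx, 12, 3) for name, idx in elements.items()}
# elements == {"X": 1, "Y": 1, "Z": 8}
# owners   == {"X": 0, "Y": 0, "Z": 1}
```

Under these assumptions X(1) and Y(1) reside on $P_0$ and Z(8) on $P_1$, so any executing processor other than $P_0$ has to exchange data with $P_0$ for this iteration.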
Definition 4 The degrees of the sets $\mathit{FanIn}(P_k)$ and $\mathit{FanOut}(P_k)$, $\deg(\mathit{FanIn}(P_k))$ and $\deg(\mathit{FanOut}(P_k))$, are defined as
From the above definitions, we have the following proposition.
Proposition 1
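The least communication computes rule is to partition an iteration to a processor $P_k$ such that
1. $|\mathit{FanIn}(P_k)| + |\mathit{FanOut}(P_k)| = \min_{P_j \in P} \left( |\mathit{FanIn}(P_j)| + |\mathit{FanOut}(P_j)| \right)$, where $P$ is the processor set.
2. If more than one processor, say $P_{k_1}, P_{k_2}, \ldots, P_{k_l}$, satisfies the above formula, then select a $P_{k_j}$ such that $\deg(\mathit{FanIn}(P_{k_j})) + \deg(\mathit{FanOut}(P_{k_j})) = \min_{P_j \in \{P_{k_1}, \ldots, P_{k_l}\}} \left( \deg(\mathit{FanIn}(P_j)) + \deg(\mathit{FanOut}(P_j)) \right)$.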
In the following algorithms, we assume that a loop body is composed of n statements $S_1, S_2, \ldots, S_n$, and that each $S_i$ has one left-hand array element $l_i$ and h right-hand array elements $r_1, r_2, \ldots, r_h$. $D(S_i)$ and $U(S_i)$ represent the define-variable set and use-variable set of statement $S_i$, respectively. We also write $d \in P$ if a data element d is distributed onto processor P. The algorithms for computing $D_j(i_r, k)$ and $\overline{D}_j(i_r, k)$ are as follows.
Input: Iteration $i_r$, processor $P_k$, and iteration block $\{S_1, S_2, \ldots, S_n\}$;
// There is no true dependence between $r_t$ and all of
// the left hand variables of the statements before $S_i$,
// and there is no input dependence between $r_t$ and all
// other right hand variables of the statement before
// There is no output dependence between $l_i$ and all of
// the left hand variables ($l_{i+1}, \ldots, l_n$) of the statements below
Algorithm 3 computes the set $\mathit{FanIn}$. The computation of $\mathit{FanOut}$ is similar. Finally, Algorithm 4 determines the processor to which the loop iteration is assigned.
Algorithm 4 Partition(FanIn, FanOut, P_k)
Input: FanIn(P_j) and FanOut(P_j) for all processors P_j, 0 ≤ j ≤ m;
Output: iteration executing processor P_k;
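The selection step can be sketched as a lexicographic minimum over (communication steps, communication volume). This is a sketch under stated assumptions rather than the paper's Algorithm 4: it assumes the fan-in and fan-out sets are available as dictionaries mapping each partner processor to an element count, and it breaks any remaining tie by the lowest processor id. The sample values are illustrative, chosen only to reproduce the totals quoted for Example 2 (three steps versus four, six elements each); the individual edges are assumptions.

```python
def partition(fan_in, fan_out):
    """Pick the executing processor for one iteration.

    fan_in[k] / fan_out[k]: {partner processor: number of elements} if the
    iteration were executed on processor k.
    """
    def cost(k):
        steps = len(fan_in[k]) + len(fan_out[k])
        volume = sum(fan_in[k].values()) + sum(fan_out[k].values())
        # Fewest steps first, then smallest volume; the lowest id breaks
        # any remaining tie (assumption, the text does not specify this).
        return (steps, volume, k)

    return min(fan_in, key=cost)


# Candidates P0 and P2 (illustrative edges, see the note above).
fan_in = {0: {2: 3}, 2: {0: 2}}
fan_out = {0: {1: 1, 3: 2}, 2: {0: 1, 1: 1, 3: 2}}
chosen = partition(fan_in, fan_out)   # P0: 3 steps, P2: 4 steps
```

Encoding the two criteria as a tuple keeps the priority order explicit: the volume is consulted only when the step counts are equal.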
Node Program
After the index array redistribution is completed, we can develop a node program that has three parts: pre-execution import communication (gathering phase), irregular loop execution (executing phase), and post-execution export communication (scattering phase). In the node program, $D_k(j)$ ($\overline{D}_k(j)$) is all the data that needs to be sent from $P_k$ (the current executing processor) to $P_j$ before (after) loop execution, i$local denotes the local loop index, and $\alpha_k$ is the number of iterations partitioned onto $P_k$.
P_k's node program:
// pre-communicating required elements with other processors.
send to P_j;
end if
end for
// executing the local iterations
for i$local = 1, α_k
  S_1(i$local); S_2(i$local); ...; S_n(i$local);
end for
// post communicating changed remote elements with other processors.
send changed data to P_j;
end for
for j = 0, m − 1, j ≠ k
  if P_k ∈ FanOut(P_j) then
    receive changed data from P_j;
  end if
end for
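The three phases of the node program can be mimicked in a single process, as sketched below. This is not the MPI node program used in the experiments: messages are modelled as plain dictionaries, and the function run_node, the element names, and the loop body are all hypothetical.

```python
def run_node(local_data, inbox, iterations, body, fan_out):
    """Simulate one processor's node program: gather, execute, scatter.

    local_data: {name: value} elements owned by this processor.
    inbox:      {sender: {name: value}} elements imported before execution.
    body:       function(i, data) that updates `data` in place.
    fan_out:    {receiver: [names]} changed elements to send back afterwards.
    Returns the outgoing messages as {receiver: {name: value}}.
    """
    data = dict(local_data)
    for elems in inbox.values():          # gathering phase
        data.update(elems)
    for i in iterations:                  # executing phase
        body(i, data)
    return {j: {n: data[n] for n in names}    # scattering phase
            for j, names in fan_out.items()}


def body(i, data):
    # Hypothetical statement: update a remote element from a local one.
    data["y"] = data["x"] + data["y"]


# This processor owns x, imports y from processor 1, and returns the new y.
out = run_node({"x": 1.0}, {1: {"y": 2.0}}, [1], body, {1: ["y"]})
```

Because all imports are completed before the local iterations start and all exports happen afterwards, the messages for every iteration assigned to a processor can be aggregated per partner.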
Experiments and Performance Results
We now present experimental results to show the efficacy of the methods presented so far. We measure the difference made by using the owner computes rule, the almost owner computes rule, and our least communication computes rule in an experimental program. All experiments were run on a 24-node SGI Origin 2000 parallel computer.
We select an irregular kernel of a fluid dynamics code, ZEUS-2D, for our study. ZEUS-2D is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for astrophysical radiation magnetohydrodynamics problems [15]. ZEUS-2D solves problems in one or two spatial dimensions with a wide variety of boundary conditions. The C language preprocessor allows the user to define various macros to customize the ZEUS-2D algorithm for the desired physics, geometry, and output. ZEUS-2D solves the equations of ideal (non-resistive), non-relativistic hydrodynamics, including radiation transport, (frozen-in) magnetic fields, rotation, and self-gravity. Boundary conditions may be specified as reflecting, periodic, inflow, or outflow. The kernel irregular subroutine X2INTZC includes some loops similar in form to Example 2. We specify the geometry as Cartesian XY and the grid as 800 by 2 uniformly spaced zones, and extend the irregular loop iterations to 1000. The node programs are written in Fortran, using the MPI communication library and the gettimeofday() system call for measuring execution time.
In Figure 3, we show the performance difference of INC2TZ obtained by using the three computes rules. Performance of the different versions of the code is measured for 2 to 24 processors. The curves marked owner computes, almost computes, and least comm. are the versions of the code in which the loop is partitioned using the owner computes rule, the almost owner computes rule, and our least communication computes rule, respectively. The figure shows that our method achieves good performance in most cases. On a small number of processors, our method is not better than the other computes rules. This is because the total communication time is so small that the communication optimization makes no significant difference. Another reason is that the load balance of our method is worse in some cases. However, the overall performance difference is small. When the same data is distributed over a larger number of processors, the communication time becomes a significant part of the total execution time, and the communication optimization makes a significant difference in the overall performance of the program.
In Figure 4, we further study the impact of the different versions on communication. Only the communication time is shown for the various versions of the code. Communication optimizations in our method include message aggregation before and after loop execution, index array redistribution, and communication scheduling. When the number of processors is large, our method can lead to substantial improvement in the performance of the code, because the communication time significantly influences the total performance of the parallel program.
Conclusions
The efficiency of loop partitioning considerably influences the performance of parallel programs. For automatically parallelizing irregular scientific codes, the owner computes rule is not suitable for partitioning irregular loops. In this paper, we have presented an efficient loop partitioning approach to reduce communication cost. In our approach, runtime preprocessing is used to determine the communication required between the processors. We have developed the algorithms for performing these communication optimizations.
We have completed a preliminary implementation of the schemes presented in this paper, and the experimental results demonstrate their efficacy. However, we have performed only intra-procedural analysis to optimize communication. Because many irregular applications contain irregular loops that span procedures, as future work we will extend our approach to inter-procedural irregular loop partitioning.
