Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, end a computation involving the regular section A(t: h: s), the local memory access sequenw for any processor is characterized by a finite state mechine of at most k states. We present fast algorithms for computing the essential information about these state machines, end extend the framework to handle multid@ensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.
Introduction
Languages such as Fortran D [4] and High Performance Fortran [5] allow the programmer to annotate programs with alignment and drktribution statements that describe how to map arrays to processors in a distributed-memory environment. Both languages introduce an auxiliary Cartesian grid called a templat< arrays are aligned to templates, and templates are distributed across the processors. The alignment of an array to a template and the distribution of the template determine the final mapping of the array. lle compiler must partition the arrays end produce "node code" for execution on each processor of a distributed-memory parallel computer.
Consider an array A of size n aligned to a template such that array element A(z) is aligned to template cell as' + 11. is distributed across p processors using a cyclic(k) distribution, in which template cell i is held by processor (i div k) mod p. (We number array elements, template cells, and processors starting from zero.) Thus the array A is split intop local arrays residing in the local memories of the processors. Consider a computation involving tlte regular section A(1: h: s), i.e., the set of array elements {A(4 + js) :0 S j < (h -t)/s}, wheres >0. (Table 1 summarizes the notation.) In this paper we consider the following problems q What is the sequence of local memory addresses that a given processor m must access while performing its share of tbe computation?
c What is the set of messages that a @en processor m must send while performing its share of the computation?
The cyclic(k) distributions generalize the familiar bbck and cyclic distributions (which are cyclic ( [n/pi) and cyclic( 1) respectively). Dongarra et al. [3] have shown these distributions to be essential for efficient dense matrix algorithms on distributed-memory machines. High Performance Fortren includes directives for multidimensional cyclic(k) distribution of templates. Many spwial cases of these problems have been solved, but the general solution does not appear in the literature or (to the best of our knowledge) m any implementation. Koelbel and Mehrotra [8] treat special cases where the array is mapped using a block or a cyclic mapping, and the regular section has unit stride. Koelbel [7] handles non-unit-stride regular sections with block or cyclic mappings. MacDonald et al. [9] discuss the problem in the context of Cray Research's MPP Fortran, and conclude that a general solution requires unacceptable address computation overhead in inner loops. They therefore restrict block sizes and number of processors to powers of two, altowing tbe use of bit manipulation in address computation.
Experimental implementation of our solution shows Mschinery.
To copy otherwise, or to republish, requires a fee and/or specific permission.
4th ACM PPOPP,51931CA,USA 0 1993 ACM 0-89791 -589 -5/93 /0005 /0149 . ..$1 .50 that address computation and rnterprocessor canrmmication csn be performed efficiently without these restrictions.
The paper is organized as follows. Section 2 solves the problem for a one-level mapping (a = 1, b = O). Section 3 reduces the problem for two-level mappings to combining the solutions from two subproblems involvrng one-level mappings. Section 4 extends the tla.ntework to handle multidimensional arrays, aud Section 5 extends it to handle generation of communication sets. Section 6 discusses performance results, and Section 7 presents conchtsions.
One-level mappings
We first consider the problem where a one-dimensional array A is aligned to a template with a = 1 and b = O. This is called a ondevel mapping.
In this case, the array and the template are in one-to-one correspondence, aud we specify the problem by the pair (p, k) describing the distribution of the template across the processors. As shown in Figure l(a) , we visualize the layout of the array in the processor memories as courses of blocks. 'lltree functions P, C, and 0 describe the location of array element A(i). P(i; p, k) is the processor holding A(i), C(+ p, k) is the course of blocks within this processor holdrng A(i), and 0(;; p; k) is the offset of A(i) witidm the course. The layout of array elements in the local processor memories is as shown m Figure l (b) .
These functions are defined as follows.
The folIowing function gives the local memory location (with respect to the base of the local array) where element A(i) is stored within processor P(i; p, k):
We are interested in the sequence of local memory locations that an individual processor, numbered m, accesses while performing its share of the computation on the regular section A(Z: h:s). We specify the sequence of accesses by its starting location and by the differences AM(iI, 22) = M(iz; p, k) -M(il; p, k), where iI and iz are two successive indices of the part of the regular section mapped to processor m. As the examples in F@re 2 illustrate, the spacing AM is not constant.
'l'he key insight is that (for a particular p, k, and stride s) the offset O(i I; p, k) of one element of the reguku section determines the offset O(iZ; p, k) of the next element of the regular section in the same processor, and also determines the spacing AM(il, iz). The contents of the table, or the transitions of the FSM, depend on p, k, snds, but not on the section bounds 1 and h or the processor number m, The same FSM describes the offsets for every processor, though each processor sees at most one cycle of the FSM. The first memory location used by the section in processor m (and its offset which is the "start state" for that processor's view of the FSM) &pends on 1 and m as well asp, k, ands.
Followrng equation (4), we can write AM in terms of the differences m course and offset
where the definitions of AC and AU are analogous to that of AM.
Here are some examples. We assume that 1 = O and that h is large.
Example 1 k = 4, p = 4,s = 3. The distribution is shown in Figure 2 (a). The transition table T is as follows. For convenience, we also tabulate AC and AO. This shows a simple case where all processm access local memory with constant stride. Substitudng the definition of P from equation (l), we get
This reduces to 
Array index (processor p -1) (p -l)k ::: (c) which is equivalent to Iinding au integer q such that km-l<sj -pkq~km-l?+k-1.
We rewrite inequalky (6) as a set of k linear Diophantine equations in the variables j and q, {sj-pkq =A:km-l <A<km-l+k-1}.
'l'he equations in this set are to be solved independently (rather thsn simultaneously). (7), we iindthe solution having the smallest nonnegative j. The minimum of these solutions gives us tie starting index value I + js, which we then convert to the starting memory location usiug equation (4).
We compute the ending memory location similarly, by finding the largest solutions no greater than h, and tdcing the maximum among these solutions. Now we extend this idea to build the AM sequence for the processor. A sorted version of the set of smallest positive solutions determined above gives us the sequence of indices that will be successively accessed by the processor. A linear scan th~otrgb this sequence is sufficient to generate the AM sequence for the cycle of the FSM visible to processor m. If the FSM has multiple cycles, the local AM sequences maybe different for different processors; otherwise, the sequences are cyclic shifts of one snorher. To avoid following the NEXT pointers at mntime, the AM table we compute contains the memory spacings in the order seen by the processor.
All of these elements are combined in Algorithm 1, which provides the solution for the memory address problem for a one-level mapping. The mtmingtime of Algorithm 1 is O(logmin(s, pk)) + O(k log k), which reduces to O(min(k log k + logs, k log k + logp)).
As the output can be of size k in the worst case, any algorithm for this problem takes Q(k) time. Finding an algorithm with O(k) running time remains an open problem. Our implementation recognizes the following special cases that can be handled more quickly p = 1, s divides k, k = 1, pk divides s, a single course of blocks, ands divides yk. Algorithm 1 can be simplified by noting that the rigbthsnd sides of successive solvable equations differ by d, and that the identity of the first solvable equation is easily determined from the value of (km -1) mod d. This removes the conditional from the loop in lines 5-12. Similar incremented computations can also be used to simplify the floor and ceiling computations m that loop.
'lltenode code for each processor consists of the fo~owing loop. As an example, let A be an array with element A(i) aligned with template cell 3i, let the template be mapped with p = k = 4, and let the regular section be A(O: 42: 3), which we abbreviate to B. The picture is shown in Figure 3( Thus, the local memory storage for A in the example above is as shown in Figure 3(b) . We handle tlds complicatim by tirst doing all our computations with respect to the template (the bcal tensplare space), and then switching back to the local memory space in the 6M1 step. The picture in local template space is shown m Figure 3(c) .
Note that A and B are both aligned to "regular sections" of the template A(i) is aligned to template cell 3i, and l?(i) is aligned to template cell 92. We therefore create state tables for A and B with respect to the local template space using Algorithm 1. The "AM" entries returned by the algorithm are actually spacings in locrd template space we therefore rename this column AT. This givea the tables for A and 1? shown below. Now all we need is to switch back to local memory space and fill in the AM entries.
As we traverse tbe state table TA for any processor m, we access each element of A stored on that processor, in order. But srnce storage is allocated based on A, this translates to accessing consecutive locations in local memory. Thus the AM entries for TA are all unity. Now it is a simple matter to find the AM entries for the desired section B. For instauce, the tirst line of TB says that if we are at an element of the section that has offset O, the next element has offset 2 and is 6 template cells away. 3).
Fwst, we compte the sum of the AT and the AM around the state machiue cycle, so that we cau cast out any full cycles that must be traversed. Second we tabulate the distance of each offset from a fixed origin (the numarcs table m Algorithm  2) , allowrng the computation of the distance between any two offset positions by two table lookups and some simple arithmetic.
Consider the third line of TB, which has a AT of 15. Using tbe precomputed sums, we determine in constant time that this corresponds to one full cycle plus 3 more template cells, and that the full cycle contributes a memory offset of 4. Now all we need is to determine the number of memory cells encountered in gorng from offset position 2 to offset position 1 in TA. We tid this distance in constant time using the rtumarcs Multidimensional arrays So far, we have characterized the access sequence of a single processor by a start address and a memorv stride Dattem. the latter and have a FSM associated with i~and each processor will execute a set of nested loops. Instead of the start address being fixed as it was before, it too will be generated by the transitions of a FSM. For simplicity, we will deal with a two-dimensional array. The extension to higher dimensions is straightforward.
An array dimension with p = 1 is "mapped to memory".
Consider TheFSM T2 is used to generate memory spacings for successive base addresses, while the FSM T1 is used to generate memory soaci.mm between successive elements with the same base address. k.lgori%m 3 computes these FSMS. Generating communication sets
The set of messages that a given processor m must send while performing its sham of the com@ation is called its conrmunicatwn set. We now extend the FSM approach to generate communication sets for regular communication patterns such as shhls and stride changes. In this section, we assume a one-dimensional array with a srngle-level mapping. Extensions to multilevel and multidiruensioml cases areas in the address generation problem.
We adopt Koelbel's execution model for parallel loops [7] , in which a processor goes through four steps:
1.
2.
3.
4.
Send those data items it holds that are required by other processors.
Perform local iterations, i.e., iterations for which it holds alI data.
Receive &ts items from other processors.
Perform nonlocal iterations.
We combme the Enal two steps into a receive-execute loop.
Our approach puts all the intelligence on the sending side, keeping the receive-execute loop simple. We wish to minimiie the number of messages sent and maximize cache benefits. To this end, each processor makes a single pass over the data in its local memory using the FSM technique &scribed above, figuring out the destination of each data element and building messages. The messages contain (address, value) pairs which the receiving processor uses to perform its computations.
All the data that must go from processor i to processor j' are communicated in a single message. The data iterns corresponding to the local iterations are conceptually sent by the processor to itself. Of course, this message is never sent rather, the local iterations are performed out of the message buffer. The idea is similar to the rnspector-executor model [10] for irregular loops.
Shifts
Consider the computation (For simplicity, assume 1 > O.) In this case, a processor communicates with at most two processors. We show how to augment the FSM with a A'P and a A.C column, such that adding these quantities to the processor and memory location of the current element gives the remote processor number and the memoty location in the remote processor of the element with which it interacts. Suppose we know that element A(i) is located on processor m at memory location M(i; p, k) with offset Oi = O(i; p, k). Then tbe problem is to find the processor and local memory location of theelementA(i-E). Letl = q ,pk+~, audAP = ((rOi)/kl. Thm P(i-gp, k)=(m-AP+p)modp. Sh.tce O<Oi <k, AP can assume only two values as the offset Oi varies. We also define the quantity AC such that M(i -&p, k) = M(i; P, k) + AL(Oi).
Stride changes
Now consider the communication required to move the regular section A(O: s2n: S2) to align with the regular section A(O: Sln: Sl). This is harder than a shift there is no simple lookup technique for generating the communication sets. The problem is that the pattern of destination processors can have period longer than k. The only fact that appears to be true in this case is that the sum of tbe periods over all processors divides pk. We therefote resort to a simple technique.
We generate the FSM to reflect the storage properties of the s2-"th element of this section interacts with index section. Now, the j i=j.
slof A. Letql=i divpk, rl=imod pk, qz=rldivk, and T2 = T I mod k. Then the &sired element is stored in memory location ql . k + T2 on processor qz.
Many modem RISC microprocessors perform integer division and remainder operations in software, which are rather slow. The equations above can be rewritten to require only two integer divisions, which substantially speeds up the implementation.
If either pork is a power of two, the computation can be sped up further by using shifts and bit masks instead of the division operations. Our implementation incorporates these optirnizations.
Termination
Each processor must determine when it has completed all its operations. As we do not send messages between every pair of processon, we cannot use the number of messages to determine termination. However, it is easy to figure out the total number of operations that processor m will execute (by a variant of Algorithm 1). A simple termination test initializes a counter to this value, decrements it by the number of amay elements received in each message, and exits when the counter reaches zero.
6
Experimental results
In this swtion we present experimental results of implementing the algorithms above. We measure botb the preprocessing cost of table generation and the runtirue loop overhead. Our computational kernel is the DAXPY operation y(l:n:s) = y(l:n:s) + a*x(l:n:s).
The ratio of memory operations to computation is high in this kernel, making it unfavorable for our algorithms.
The overhead due to irregular memory access patterns would decrease with more computation in the loop. AU experiments reported in this section were done on a Sparcstation 2, for which Dongarra [2] reports a 100 x 100 double-precision LINPACK performance of 4.0 Mflops. (We measured 3.3 Mflops.) (A repetition of the experiments on a single processor of an iPSC/860 showed simiisr characteristics.) We used the GNU C compiler, with optimization at the -02 level. A higher optimization level increased compilation time without improving the quality of the generated code. With s = 1 the loop compiled into 9 instructions, while the general case with table lookup compiled into 17 instructions.
Our measured performance for thes = 1 case was 2.0 IVftlops for long vectors.
Local address generation
Wc have seven parameters for a one-dimensional array: two alignment parameters a and b, two layout parameters p and k, and three section parameters 4, h, ands. This is too large a space over which to present results. We now explain how we reduce the number of variables.
c Alignment stride a: For any constant c, the memory layout for parameters ca and ck is exactly the same as that for a and k. Thus we experiment only with a = 1; the effect of changing a can be simulated by changing k instead.
q Al@rnent offset 11: 'lltis is irrelevant for large sections. For small sections where end effects could be importaut, we observed that changes in b affected results by less than 5%. We therefore take b = O.
q Section lower bound f: Again, this is irrelevant for large sections and has negligible effect even for small sections. We therefore take t = O. Section upper hound h: This affects the number of elements in the DAXPY, which in turn affects the tradeoffbetween loop startup and iteration time in a single processor. We vary h along witi s, keeping h/s fixed, so that all our experiments do tbe same amount of work regardless of stride.
Section strides:
This is the z-axis of our plots, as one of the two independent variables. Number of processors p: We run our experiments with a fixed number of processors. Varying the n~ber of processors has the effect of scaling the plots. For a given set of parameters, we compute the tables for each processor and time the execution of the distributed loop. Let r~be the time taken by processor m to execute the TZm iterations it receives. Then the execution time per element is t. =max~. = nn This factors out the effect of load imbalance,
We plot 2/t. (the megaflops per second per processor) against stide s for various block sizes k. The plots are shown in Figure 5 . The curves are the sum of two effects. Figure 6 preprocessing overhead for memory stride computations. The graph corresponds to an alignment stride of 1. The multiple points for each block size show the processing times for each of 32 processors.
Baseline
The lower envelo~is a smooth, relatively flat function ofs. The value of the envelope is approximately the asymptotic performance for the single-processor DAXPY.
2. Jiggles: The plot jiggles up and down some, presumably due to cache stride effects [1] .
Although the general loop has ahnost twice as many instructions as the unit-stride loop, the dtiference in performance is much less. This is because the execution time is dorninatedbythe floadng-pornt operations and references to operands in memory. The references to the AM array usually hit in cache, and the additional loop control instructions probably execute in overlapped fashion.
The preprocessing time required to compute the AM tables is shown in Figure 6 , where we plot the running time of Algorithm 1 against block size k. The running time is approximately O(k), with certain special cases being handled much faster. Experimental observation confirmed that the running times for Algorithm 2 were no more than twice those of Algorithm 1. If the AM sequence is known at compile time, unrolling tbe loop bodies avoids the indirection into the AM array. Figure 7 shows that the benefits of this are mixed.
While loop unrolling removes the indirection, it also increases the size of the loop body, thus the benefits of a simpler loop body maybe undone by misses in the insbuction cache caused by the larger loop body. This is a factor for larger block sizes, since the amount of unrolliig may be as large as k. We recommend unrolling only if the AM sequences are short, or can be compacted, e.g., by run-length encoding.
Communication set generation
For the communication set generation algorithms, we measured the time required to generate the FSM tables and tie time required to marshal the array elements into messages. We dld not measure the actual communication time, since that varies wi&ly among machines. The results are presented in Table 2 .
Conclusions
Our easy to compute, and the runtime overheads due to indirect addressing of data are small. Our data access patterns exploit temporal locality and make effective use of cache. We also support blocking of messages to reduce message startup overheads.
Our prototype implementation has demonstrated our performance claims.
An optimized implementation of our techniques should deliver close to maximum performance for general cyclic(k) distributions of arrays.
Our algorithms are fully parallel in that each processor of a parallel machine can generate its tables independently. 
