was a rammer student at RIA_.
IDeparlment
of Mathematics, Massachusetts/maktate of Technolol_, Cambridge, MA 02139 (ste_ag@thenry.lcsamit.edu).
This work was performed while the author was a postdoctoral scientist at Xerox PARC. Table 1 : Symbols used in this paper.
To appear in the
which template cell i is held by processor (idiv k) modp.
(We number array elements, template cells, and processors starting from zero.) Thus the array A is split into p local arrays residing in the local memories of the processors.
Consider a computation involving the regular section A(l: h : s), i.e., the set of array elements {A(t + is) : 0 <_ j <_ (h -e)/s}, where s > O. (Table 1 summarizes the notation.)
In this paper we consider the following problems:
• What is the sequence of local memory addresses that a given processor rn must access while performing its share of the computation?
• What is the set of messages that a given processor rn must send while performing its share of the computation?
The cyclic(k) distributions generalize the familiar block and cyclic distn_outions (which are cyclic (In/p] ) and cyclic(l) respectively). Dongarra et al. [3] have shown these distn'butions to be essential for efficient dense matrix algorithm_ on distn'buted-memory machines.
High Performance Fortran includes directives for multidimensional cyclic(k) distribution of templates. Many special cases of these problems have been solved, but the general solution does not appear in the literature or (to the best of oar knowledge) in any implementation. Koelbel and Mehrotra [8] treat special cases where the array is mapped using a block or a cyc/ic mapping, and the regular section has unit stride. Koelbel [7] handles non-unit-stride regular sections with block or cyc/ic mappings. MacDonald et al. [9] discuss the problem in the context of Cray Research's MPP Fortran, and conclude that a general solution requires unacceptable address computation overhead in inner loops. They therefore restrict block sizes and number of processors to powers of two, allowing the use of bit manipulation in address computation.
Experimental implementation of our solution shows that address computation and interproeessor communication can be performed efficiently without these restrictions.
The paper is organized as follows. Section 2 solves the problem for a one-level mapping (a = l,b = 0). Section 3 reduces the problem for two-level mappings to combining the solutions from two subproblems involving one-level mappings. Section 4 extends the framework to handle multidimensional arrays, and Section 5 extends it to handle generation of communication sets. Section 6 discusses performance results, and Section 7 presents conclusions.
2

One-level mappings
We first consider the problem where a one-dimensional array A is aligned to a template with a = 1 and b = O. This is called a one-level mapping. In this case, the array and the template are in one-to-one correspondence, and we specify the problem by the pair (p, k) describing the distn'bution of the template across the processors.
As shown in Figure l(a) , we visualize the layout of the array in the processor memories as courses of blocks. Three functions P, e, and O describe the location of array element A(i). These functions are defined as follows.
The following function gives the local memory location (with respect to the base of the local array) where element A(i) is stored within processor P(i; p, k):
We are interested in the sequence of local memory locations that an individual processor, numbered m, accesses while performing its share of the computation on the regular section A(l: h : a). We specify the sequence of accesses by its starting location and by the differences AA4(ih i2) = .M(i2;p, k) -A4 (il; p, k) , where il and i2 are two successive indices of the part of the regular section mapped to processor m. As the examples in Figure 2 illustrate, the spacing AA4 is not constant.
The key insight is that (for a particular p, k, and stride s) the offset O(il;p, k) of one element of the regular section determines the offset 0(i2; p, k) of the next element of the regular section in the same processor, and also determines the spacing AA4(ih i2). Thus we can represent the offset sequence and the A.M sequence as a simple 
where the definitions of AC and AO arc analogous to that of AA, I. Here are some examples. We assume that t = 0 and that h is large.
Example 1 k = 4, p = 4, s = 3. The dism'bution is shown in Figure 2(a) . The transition table T is as follows. For convenience, we also tabulate AC and AO. This shows a simple case where all processors access local memory with constant stride. Figure 2( [STATE0 [ r,_cr3I A_3 IIaClAOI0 3 __ Figure 2 (c). The transition table T is as follows. The A.M sequence has period k, and local memory accesses vary in stride.
I STATE I m_-T1A_ IIaC I AO I 'f_, /--,,
The algorithm
Preparatory to finding a processor's AA4 sequence, we solve the problem of finding its starting memory location. Given a processor number m, regular section A(t: h : s), and layout parameters p and k, we wish to find the first element (if any) of the regular section that resides on that processor. This is equivalent to finding the smallest nonnegative integer j such that 7_(t + s j; p, k) = m. Substituting the definition of 7_ from equation (1), we get
This reduces to km _< (t + sj) rood vk _< 1,(,,, + 1) -1 which is equivalent to finding an integer q such that We rewrite inequality (6) as a set of k linear Diophantine equations in the variables j and q, {sj-pkqf_:km-l<)_<km-l+k-1}.
The equations in this set are to be solved independently (rather than simultaneously).
Solutions exist For each solvable equation in set (7), we find the solution having the smallest nonnegative j. The minimum of these solutions gives us the starting index value l + js, which we then convert to the starting memory location using equation (4). We compute the ending memory location similarly, by finding the largest solutions no greater than h, and taking the maximum among these solutions. Now we extend this idea to build the A._ sequence for the processor.
A sorted version of the set of smallest positive solutions determined above gives us the sequence of indices that will be successivelyaccessedby the processor. A linear scan through this sequenceis sufficient to generate the A.Ad sequence for the cycle of the FSM vis_le to processor rn. If the FSM has multiple cycles, the local AAd sequences may be different for different processors; otherwise, the sequences are cyclic shifts of one another. To avoid following the NEXT pointers at runtime, the A.M table we compute contains the memory spacings in the order seen by the processor.
All of these elements are combined in Algorithm 1, which provides the solution for the memory address problem for a one-level mapping. The running time of Algorithm
As the output can be of size k in the worst case, any algorithm for this problem takes fl(k) time. Finding an algorithm with O(k) naming time remains an open problem. Our implementation recognizes the following special cases that can be handled more quickly:, p = 1, s divides k, k = 1, pk divides s, a single course ofblocks, and s divides pk. Algorithm I can be simplified by noting that the right hand sides of successive solvable equations differ by d, and that the identity of the first solvable equation is easily determined from the value of (kin -l) rood d. This removes the conditional from the loop in lines 5-12. Similar incremental computations can also be used to simplify the floor and ceiling computations in that loop. The node code for each _essor consists of the following loop.
•base 4-startmem; i *--0 Thus, the local memory storage for A in the example above is as shown in Figure 3Co ). We handle this compfication by first doing all our computations with respect to the template (the local template space), and then switching back to the local memory space in the final step. The picture in local template space is shown in Figure 3(c) .
Note that A and B are both aligned to "regular sections" of the template: A(i) is aligned to template cell 3i, and B(i) is aligned to template cell 9i. We therefore create state tables for A and B with respect to the local template space using Algorithm 1. The "AA4" entries returned by the algorithm axe actually spacings in local template space; we therefore rename this column AT. This gives the tables for A and B shown below.
(;3
Now all we need is to switch back to local memory space and fill in the AA4 entries.
As we traverse the state machine cycle, so that we can cast out any full cycles that must be traversed. Second, we tabulate the distance of each offset from a fixed origin (the numarcs table in Algorithm 2), allowing the computation of the distance between any two offset positions by two table lookups and some simple arithmetic. Consider the third line of Tv, which has a AT of 15. Using the precomputed sums, we determine in constant time that this corresponds to one full cycle plus 3 more template cells, and that the full cycle con_'ibutes a memory offset of 4. Now all we need is to determine the number of memory cells encountered in going from offset position 2 to offset position 1 in Ta. We find this distance in constant lime using the nttmarcs table. Algorithm 2 shows the method with these optlmizations incorporated.
1(2,o)
Algorithm 2 (Memory access sequence for a two-level mapping.) Input: Alignment parameters (a, b), layout parameters (p, k), array length n, regular section A(t: h : s), processor number m.
Ou_ut:
The AA4 So fat, we have characterized the access sequence of a single processor by a sta_ address and a memory stride pattern, the latter being generated by the transitions of a FSM. For a multidimensional array, each dimension willbe charactefize, d by a (p,k) pair and have a FSM associated with it, and each processor will execute a set of nested loops. Instead of the start addressbeing fixed as it was before, it too wiU be generated by the transitions of a FSM. For simplicity, we will deal with a two-dimensional array. The extension to higher dimensions is straightforward.
An array dimension with p = 1 is "mapped to memory".
Consider a one-level mapping of an array A(0: 17, 0: 7) with parameters (Pt, k0 -(3, 3) in the first dimension and (P2, k2) --(2, 2) in the second dimension, and let the section of interest be A(0: 17: 2, 0: 7: 3). The picture is shown in Figure 4 .
Let each block be laid out in column-major order in local memory. Thus, the sequence of index values corresponding to successive memory elements of processor (0,0) (Had the section been larger in the second dimension, the column accessedby processor (0, 0) immediately after column 0 would be column 9. The elements (0, 0) and (0, 9) would be 30 locations apart in local memory. This explains the AA4 entry in the first row of T2.)
The FSM T2 is used to generate memory spacings for successive base addresses, while lhe FSM Tl is used to generate memory spacings between successive elements with the same base address. 
Generating communication sets
The set of messages that a given processor m must send while performingitsshareofthecomputation iscalled its communication set. We now extend theFSM approach togenerate communication setsforregular communication patterns suchas shifts and stride changes. Inthis section, we assumea one-dimensional array with a single-level mapping.Extensions tomultilevel andmultidimensional cases areasintheaddress generation problem.
We adoptKoelbel's execution model forparallel loops [7] , in which a processor goesthrough foursteps:
1. Send those data items it holds that are required by other processors.
2.
Performlocal iterations, i.e., iterations forwhichit holdsall data.
3. Receive dataitemsfromother processors.
Perform nonlocal iterations.
We combinethefinal two steps intoa receive-execute loop.
Our approach puts all the intelligence on the sending side, keeping the receive-execute loop simple. We wish to minimize the number of messages sent and maximize cache benefits. To this end, each processor makes a single pass over the data in its local memory using the FSM technique descn'oed above, figuring out the destination of each data element and building messages. The messages contain (address, value) pairs which the receiving processor uses to perform its computations. All the data that must go from processor i to processor j are communicated in a single message. The data items corresponding to the local iterations are conceptually sent by the processor to itself. Of course, this message is never sent; rather, the local iterations are performed out of the message buffer. The idea is similar to the inspector-executor model [10] for irregular loops.
Shifts
Consider the computation
A(O:h:s) = A(O:h:s)+ A(t:t+h:s),
where the regular section A(l: l + h :s) is to be moved to align with A(0: h:s). (For simplicity, assume ! > 0.) In this case, a processor communicates with at most two processors. We show how to augment the FSM with a A7) and a A£ column, such that adding these quantifies to the processor and memory location of the current element gives the remote processor number and the memory location in the remote processor of the element with which it interacts.
Suppose we know that element A(i) is located on processor m at memory location .A4(i;p, k) with offset Oi = O(i;p, k). Then the problem is to find the processor and local memory location of the elemem A(i -l). Let l = q. pk + r, and AP = [(r -O_)/k]. Then P(i -l; p, k) = (m -AP + p) rood p. Since 0 < O, < k, AP can assume only two values as the offset Oi varies.
We also define the quantity A£ such that A4(i -l;p, k) = _(i;p, k)+ Az(o,).
Stride changes
Now consider the communication required to move the regular seclion A(0: s2n: s2) to align with the regular section A(0: stn: sl) . This is harder than a shift; there is no simple lookup technique for generating the communication sets. The problem is that the pattern of destination processors can have period longer than k. The only fact that appears to be true in this case is that the sum of the periods over all processors divides pk. We therefore resort to a simple technique.
We generate the FSM to reflect the storage properties of the s2-section. Now, the jth element of this section interacts with index i = j • st ofA. Letqt = idivpk, rt = i rood pk, q2 = rt div k, and r2 = rl mod k. Then the desired dement is stored in memory location q_ • k + r2 on processor q2:
Many modem RISC microprocessors perform integer division and remainder operations in software, which are rather slow. The equations above can be rewritten to require only two integer divisions, which substantially speeds up the implementation. If either p or k is a power of two, the computation can be sped up further by using shifts and bit masks instead of the division operations. Our implementation incorporates these optimizations.
Termination
Each processor must determine when it has completed all its operations. As we do not sendmessagesbetween every pair of processors, we cannot use the number of messages to determine termination. However, it is easy to figure out the total number of operations that processor m will execute (by a variant of Algorithm 1). A simple termination lest initializes a counter to this value, decrements it by the number of array elements received in each message, and exits when the counter reaches zero.
Experimental results
In this section we present experimental results ofimplementing the algorithms above. We measure both the preprocessing cost of table generation and the runtime loop overhead.
Our computational kernel is the DAXPY operation y(l:n:a) -y(l:n:s) + a*x(l:n:a).
The ratio of memory operations to computation is high in this kernel making it unfavorable for our algorithms. The overhead due to irreguhr memory access patterns would decrease with more computation in the loop. All experiments reported in this section were done on a Sparcstation 2, for which Dongarra [2] reports a 100 x 100 double-precision LINPACK performance of 4.0 Mflops. (We measured 3.3 Mflops.) (A repetition of the experiments on a single processor of an iPSC/g60 showed similar characteristics.) We used the GNU C compiler, with optimization at the -02 level. Our measured performance for the s = I case was 2.0 M/lops for long vectors.
Local address generation
We have seven parameters for a one-dimensional array:, two alignment parameters a and b, two layout parameters p and k, and three section parameters t, h, and s. This is too large a space over which to present results.
We now explain how we reduce the number of variables.
. Alignment stride a: For any constant c, the memory layout for parameters ca and ck is exactly the same as that for a and k. Thus we experiment only with a = 1; the effect of changing a can be simulated by changing k instead.
. Alignment offset b: This is irrelevant for large sections. For small sections where end effects could be important, we observed that changes in b affected results by less than 5%. We therefore take b = 0.
• Section lower bound t: Again, this is irrelevant for large sections and has negligible effect even for small sections. We therefore lake t = O.
• Section upperbound h: This affects thenumber of elements in the DAXPY, which in turn affects the tradeoffbetween loop starmp and iteration time in a single processor. We vary h along with s, keeping h/s fixed, so that all our experiments do the same amount of work regardless of stride.
• Section strldc s: This is the z-axis of our plots, as one of the two independent variables.
• Number of processors p: We run our experiments with a fixed number of processors.
Varying the number of processors has the effect of scaling the plots.
• Block size k: This varies from plot to plot, as the second of the two independent variables.
We use the execution time per element of the DAXPY loop as our measure of performance. For a given set of parameters, we compute the tables for each processor and time the execution of the distributed loop. Let rm be the time taken by processor m to execute the n,, iterations it receives.
Then the execution time per element is
This factors out the effect ofload imbalanee. :_: : -: _: ::::
We plot 2/G (the megaflops per second per processor) against stride s for various block sizes k. The plots are shown in Figure 5 . The curves are the sum of two effects. 
Baseline:
The lower envelope is a smooth, relatively fiat function of s. The value ofthe envelope is approximately the asymptotic performance for the single-processor DAXPY.
Jiggles:
The plot jiggles up and down some, presumably due to cache stride effects [1].
Although the generalloop has almost twice as manyinstmctions as the unit-stride loop, the difference in performance is much less. This is because the execution time is dominated by the floating-poim operations and references to operands in memory. The references to the AA4 arrayusually hit in cache, and the additional loop control instructions probably execute in overlapped fashion.
The preprocessing time required to compute the AM tables is shown in Figure 6 , where we pIot the running time of Algorithm 1 against block size k. The running time is approximately O(k), with certain special cases being handled much faster. Experimental observation confirmed that the running times for Algorithm 2 were no more than twice those of Algorithm 1. If the A,Ad sequence is known at compile time, unrolling the loop bodies avoids the indirection into the AAd array. Figure 7 shows that the benefits of this are mixed.
While loop unrolling removes the indirection, it also increases the size of the loop bod_ thus the benefits of a simpler loop body may be undone by misses in the instruction cache caused by the larger loop body. This is a factor for larger block sizes, since the amount of unrolling maybe as large as k. We recommend unrolling only ff the AA4 sequences are short, or can be compacted, e.g., by rtm-length encoding.
Commun|cation set generation
For the communication set generation algorithms, we measured the time required to generate the FSM tables and the time required to marshal the array elements into messages.
We did not measure the actual communication time, since that varies widely among machines.
The results are presented in Table 2 .
Conclus|ons
Our easy to compute, and the runtime overheads due to indirect addressing of data arc small. Our data access patterns exploit temporal locality and make effective use of cache. We also support blocking of messages to reduce message starmp overheads.
Our prototype implementation has demonstrated our performance claims.
An optimized implementation of our techniques should deliver close to maximum performance for generalcyclic(k) distributions of arrays. Our algorithms are fully parallel in that each processor of a parallel machine can generate its tables independently Finding a fully parallel algorithm running in O(k) parallel time or with O(p -{-k) total work remains an open problem.
