Generating local addresses and communication sets for data-parallel programs by Schreiber, Robert et al.
NASA-CR-[96605
/,a/'-6_ :d-,__
Research Institute for Advanced Computer Science
NASA Ames Research Center
,F_"_ ,,L
Generatin.g Local Addresses and
Communication Sets for Data-Parallel
Programs
Siddhartha Chatterjee
John R. Gilbert
Fred J. E. Long
Robert Schreiber
Shang-Hua Teng
(NASA-CR-194605) GENERATING LOCAL
ADORESSES AND COMMUNICATION SETS
FOR OATA-PARALLEL PROGRAMS
(Research Inst. for Advanced
Computer Science) 11 p
_/62
N94-15796
Unclas
0191135
RIACS Technical Report 93.03 April 1993
To appear in the Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, San Diego, CA, May 1993.
Submitted to Journal of Parallel and Distributed Computing.
\\
\
\
\
\
https://ntrs.nasa.gov/search.jsp?R=19940011323 2020-06-16T19:29:30+00:00Z

Generating Local Addresses and
Communication Sets for Data-Parallel
Programs
Siddhartha Chatterjee
John R. Gilbert
Fred J. E. Long
Robert Schreiber
Shang-Hua Teng
The Research Institute of Advanced Computer Science is operated by Universities Space Research
Association, The American City Building, Suite 311, Columbia, MD 21044, (301) 730-2656
Work reported herein was supported by NASA via Cooperative Agreement NCC 2-387 between NASA and
the Universities Space Research Association (USRA). Work was performed at the Research Institute for
Advanced Computer Science (RIACS), NASA Ames Research Center, Moffett Field, CA 94035-1000.

Generating Local Addresses and Communication Sets for Data-Parallel Programs
Siddhartha Chatterjee * John IL Gilbert t Fred J. E. Long'_ Robert Schreiber * Shang-Hua Teng §
Abstract
Generating local addresses and communication sets is an impor-
tant issue in distn'buted-memory implementations of data-parallel
languages such as High Performance Fortran. We show that for
an array A affinely aligned to a template that is distn'buted across
p processors with a cyclic(k) distn'bution, and a computation in-
volving the regular section A(t: h : s), the local memory access
sequence for any processor is characterized by a finite state ma-
chine of at most k states. We present fast algorithms for computing
the essential information about these state machines, and extend
the framework to handle multidimensional arrays. We also show
how to generate communication sets using the state machine ap-
proach. Performance results show that this solution requires very
little rtmtime overhead and acceptable preprocessing time.
1 Introduct;on
Languages such as Fortran D [4] and High Performance Fortran [5]
allow the programmer to annotate programs with alignment and dis-
tribution statements that descn"oe how to map arrays to processors
in a distn'buted-memory environment. Both languages introduce
an auxiliary Cartesian grid called a template; arrays ate aligned to
templates, and templates are distributed across the processors. The
alignment of an array to a template and the distn'bution of the tem-
plate determine the final mapping of the array. The compiler must
partition the arrays and produce "node code" for execution on each
processor of a distn'outed-memory parallel computer.
Consider an array A of size n aligned to a template such that
array element A(i) is aligned to template cell ai + b. The template
is distn'buted across p processors using a cyclic(k) distribution, in
*Research Institute for Advanced Computer Science, Matq Stop "ID45-1,
NASA Ames Research Center, Moffett Field, CA 94035-1000 (_@_¢_.edu,
_:.la_riacs.edu). The work of these authors was supported by the NAS Sy_.ems
Division via Cooperative Agreement NOC 2-387 between NASA and the Unive_ities
Space Research Association (USRA).
tXerox palo Alto Research Center, 3333 Coyote I-FallRoad, Palo Alto, CA 94304-
1314 (gilbert@ pga'e.x et_x.enm ). Copyright _) 1992 by Xexox Corporation. All rights
reserved.
*Depm'tm ent o f Computer and Inform ation Science, University of Califomi_ Santa
Cruz, CA 95064 (flong@cae.uc_c.edu). This workwas performedwhile the author
was a rammer student at RIA_.
IDeparlment of Mathematics, Massachusetts/maktate of Technolol_, Cambridge,
MA 02139 (ste_ag@thenry.lcsamit.edu). This work was performed while the author
was a postdoctoral scientist at Xerox PARC.
To appear in the Proceedings of the Fourth ACM SIGPLAN
b'ymp osimn on Principles and Practice of Parallel Prograrnm kt g,
San Diego, CA, May 1993.
A a distn'buted array
n size of array A
a stride of alignment of A to the template
b offset of a]igument of A to the template
p number of processors to which the template is distributed
k block size of dislr_ution of the template
/f lower bound of regular section of A
h upperbound of regular section of A
8 stride of regular section of A
m processor number
Table 1: Symbols used in this paper.
which template cell i is held by processor (idiv k) modp. (We
number array elements, template cells, and processors starting from
zero.) Thus the array A is split into p local arrays residing in the local
memories of the processors. Consider a computation involving the
regular section A(l: h : s), i.e., the set of array elements {A(t +
is) : 0 <_ j <_ (h - e)/s}, where s > O. (Table 1 summarizes the
notation.) In this paper we consider the following problems:
• What is the sequence of local memory addresses that a given
processor rn must access while performing its share of the
computation?
• What is the set of messages that a given processor rn must
send while performing its share of the computation?
The cyclic(k) distributions generalize the familiar block and
cyclic distn_outions (which are cyclic(In/p] ) and cyclic(l) respec-
tively). Dongarra et al. [3] have shown these distn'butions to be es-
sential for efficient dense matrix algorithm_ on distn'buted-memory
machines. High Performance Fortran includes directives for multi-
dimensional cyclic(k) distribution of templates.
Many special cases of these problems have been solved, but the
general solution does not appear in the literature or (to the best of
oar knowledge) in any implementation. Koelbel and Mehrotra [8]
treat special cases where the array is mapped using a block or a
cyc/ic mapping, and the regular section has unit stride. Koelbel [7]
handles non-unit-stride regular sections with block or cyc/ic map-
pings. MacDonald et al. [9] discuss the problem in the context of
Cray Research's MPP Fortran, and conclude that a general solu-
tion requires unacceptable address computation overhead in inner
loops. They therefore restrict block sizes and number of processors
to powers of two, allowing the use of bit manipulation in address
computation. Experimental implementation of our solution shows
that address computation and interproeessor communication can be
performed efficiently without these restrictions.
The paper is organized as follows. Section 2 solves the problem
for a one-level mapping (a = l,b = 0). Section 3 reduces the
problem for two-level mappings to combining the solutions from
two subproblems involving one-level mappings. Section 4 extends
the framework to handle multidimensional arrays, and Section 5
extends it to handle generation of communication sets. Section 6
discusses performance results, and Section 7 presents conclusions.
2 One-level mappings
We first consider the problem where a one-dimensional array A
is aligned to a template with a = 1 and b = O. This is called
a one-level mapping. In this case, the array and the template are
in one-to-one correspondence, and we specify the problem by the
pair (p, k) describing the distn'bution of the template across the
processors. As shown in Figure l(a), we visualize the layout of
the array in the processor memories as courses of blocks. Three
functions P, e, and O describe the location of array element A(i).
_(i;p, k) is the processor holding A(i), e(i;p, k) is the course
of blocks within this processor holding A(i), and O(i; p, k) is the
offset of A(i) within the course. The layout of array elements in
the local processor memories is as shown in Figure 1(b).
These functions are defined as follows.
"P(i;p,k) = (imodpk) divk=(idivk)modp (1)
C(i; p,/c) = idlv pk = (i _v k) div V (2)
O(i;p, k) = irood k (3)
Therefore, we have
i = vk. c(i; p, k) + 1,. p(i; v, k) + o(i; p, k).
The following function gives the local memory location (with
respect to the base of the local array) where element A(i) is stored
within processor P(i; p, k):
.M(i;p,k)=k.C(i;p,k)+O(i;p,k). (4)
We are interested in the sequence of local memory locations that
an individual processor, numbered m, accesses while performing
its share of the computation on the regular section A(l: h : a). We
specify the sequence of accesses by its starting location and by
the differences AA4(ih i2) = .M(i2;p, k) - A4(il; p, k), where il
and i2 are two successive indices of the part of the regular section
mapped to processor m. As the examples in Figure 2 illustrate, the
spacing AA4 is not constant.
The key insight is that (for a particular p, k, and stride s) the
offset O(il;p, k) of one element of the regular section determines
the offset 0(i2; p, k) of the next element of the regular section in
the same processor, and also determines the spacing AA4(ih i2).
Thus we can represent the offset sequence and the A.M sequence
as a simple table indexedby offset. We visualize this table as the
transition diagram of a finite state machine (FSM) whose k states
are the offsets. This FSM is particularly simple because it has no
input, and each state has only one outgoing transition; in fact, it is
just a union of state-disjoint cycles. Thus the sequence of A.M's is
periodic, with period at most k.
The contents of the table, or the transitions of the FSM, depend
on p, k, and s, but not on the section bounds I and h or the processor
number m. The same FSM descn_s the offsets for every processor,
though each processor sees at most one cycle of the FSM. The first
memory location used by the section in processor m (and its offset,
which is the "start state" for that processor's view of the FSM)
depends on l and m as well as p, k, and s.
Following equation (4), we can write AA4 in terms of the dif-
ferences in course and offset:
a_(6, i_) = 1,. ac(i_, i_) + t,o(it, 6), (5)
where the definitions of AC and AO arc analogous to that of AA, I.
Here are some examples. We assume that t = 0 and that h is
large.
Example 1 k = 4, p = 4, s = 3. The dism'bution is shown in
Figure 2(a). The transition table T is as follows. For convenience,
we also tabulate AC and AO. This shows a simple case where all
processors access local memory with constant stride.
I ST^TEI ra_cr I a'_ IIaC l aO I _'_
0 3 3 0 3
1 0 3 1 -1
2 1 3 1 -1
3 2 3 1 -1
Example2 k = 4,p = 3,s = 3. The distn'bution is shown in
Figure 2(b). The transition table T is as follows. This demonstrates
that the FSM for a problem can consist of disjoint cycles. Each
processor sees at most one cycle of the FSM, depending on its start
state.
[STATE0[ r,_cr3I A_3 IIaClAOI03 __
1 1 4 1 0
- 2 2 4 1 0
_3 0 I1 II 1 -3 LL)
Example3 k = 4,p = 3, s = 5. The distn'bution is shown in
Figure 2(c). The transition table T is as follows. The A.M sequence
has period k, and local memory accesses vary in stride.
I STATEI m_-T1A_ IIaC IAO I 'f_, /--,,
0 3 7 1 '3
- 1 2 9 2
2 0 2 1 - -2
3 i 2 1 -2
2.1 The algorithm
Preparatory to finding a processor's AA4 sequence, we solve the
problem of finding its starting memory location. Given a processor
number m, regular section A(t: h : s), and layout parameters p and
k, we wish to find the first element (if any) of the regular section that
resides on that processor. This is equivalent to finding the smallest
nonnegative integer j such that 7_(t + s j; p, k) = m.
Substituting the definition of 7_ from equation (1), we get
((t + _J)rood vJ') dip k = m.
This reduces to
km _< (t + sj) rood vk _< 1,(,,, + 1) - 1
which is equivalent to finding an integer q such that
kra-l <_ s j- pkq < kin- l+ k- 1. (6)
(a)
Course
0
1.
Offset
Processor 0 Processor I . ,.. Processor p - 1
0 ... k-I k ... 2k-I ... _ l?kk ... pk-I
vk ... (P+l)k-I (P+l)k. ... (P+2)k-I ... ,_,_,-- ... 2vk-1
0 ... k-I 0 ... k-I ... 0 ... k-I
CO)
Local memory location 0 ... k - 1 k ... 2k - 1
Array index (processor 0) 0 ... k - 1 pk
Array index (v_essor 1) k ... 2k - 1 (V+ 1)k
Array index (processorp - 1) (p- l)k ... pk - 1 (2p- 1)k
/P+I/k-1
... p 2 1 ...
... 2pk - 1 :..
Figure 1: A one-level mapping. (a) Layout parameters. Processors run across the page, while memory is organized as a two-dimensional
array of courses and offsets. Each box represents a memory cell, and the number in the cell indicates the array element it holds. CO)Layout
of array elements in local processor memories. Each processor's memory is one row of the diagram.
(a)
Co)
(c)
Proc. 0 Proc. 1 Proc. 2 Proc. 3
[01 1 2 _] 4 5 _ 7 8 [_ 10 11 112[ 13 14
16 17 [-_ 19 20 _ 22 23 I-_ 25 26 _ 28 29 I'_ 31
32 _ 34 35 ['_ 37 38 [-_ 40 41 [-_ 43 44 _ 46 47
Proc. 0 Proc. I Proc. 2
[1_ 1 2 _ 4 5 [L_ 7 8 _9i_ 10 1113 14 16 17 19 20 22 23
Proc. 0
lO] 1 2 3
12 13 14 [_
24 [_ 26 27
36 37 38 39
48 49 _ 51
Proc. 1
4 151 6 7
16 17 18 19
28 29 [-_ 31
] 41 42 43
52 53 54
Proc. 2
8 9 IlO[ 11
_] 21 22 23
32 33 34
44 _ 46 47
56 57 58 59
Figure 2: One-level array mappings. The corresponding state tables are shown in Examples 1-3. (a) k = 4,p = 4, s = 3. Co)
k = 4,p= 3,s = 3. (c)k = 4,p = 3, s =5.
We rewrite inequality (6) as a set of k linear Diophantine equa-
tions in the variables j and q,
{sj-pkqf_:km-l<)_<km-l+k-1}. (7)
The equations in this set are to be solved independently (rather than
simultaneously). Solutions exist for an individual equation if and
only ifA is divis_le by GCD(s,pk).
The generalsolutionto the linear Diophantine equation sj -
pkq = A isfound as follows.Letd = GCD(s,pk) = ots+flpk,
with a, fl G Z. The quantities a, t, and d are determined by the
extended Euclid algorithm [6, page 325]. Then the general solution
of the equation is the one-parameter family j = (Aa + pk'r)/d and
q = (-Aft + s'r)/d, for "r E Z.
For each solvable equation in set (7), we find the solution having
the smallest nonnegative j. The minimum of these solutions gives
us the starting index value l + js, which we then convert to the
starting memory location using equation (4). We compute the
ending memory location similarly, by finding the largest solutions
no greaterthan h, and taking the maximum among these solutions.
Now we extend this idea to build the A._ sequence for the
processor. A sorted version of the set of smallest positive solutions
determined above gives us the sequence of indices that will be
successivelyaccessedby the processor. A linear scan through this
sequenceis sufficient to generate the A.Ad sequence for the cycle of
the FSM vis_le to processor rn. If the FSM has multiple cycles,
the local AAd sequences may be different for different processors;
otherwise, the sequences are cyclic shifts of one another. To avoid
following the NEXT pointers at runtime, the A.M table we compute
contains the memory spacings in the order seen by the processor.
All of these elements are combined in Algorithm 1, which pro-
vides the solution for the memory address problem for a one-level
mapping. The running time of Algorithm I is O(log nfin(s, pk)) +
O(k log k), which reduces to O(min(k log k + log s, k log k +
logp)). As the output can be of size k in the worst case, any
algorithm for this problem takes fl(k) time. Finding an algorithm
with O(k) naming time remains an open problem. Our implemen-
tation recognizes the following special cases that can be handled
more quickly:,p = 1, s divides k, k = 1, pk divides s, a single
course ofblocks, and s divides pk.
Algorithm I can be simplified by noting that the right hand sides
of successive solvable equations differ by d, and that the identity
of the first solvable equation is easily determined from the value of
(kin - l) rood d. This removes the conditional from the loop in
lines 5-12. Similar incremental computations can also be used to
simplify the floor and ceiling computations in that loop.
The node code for each _essor consists of the following loop.
•base 4- startmem; i *-- 0
while (base </astmem) do
compute with address base
base _ base + A.A4 [i]
i ,--- (i + I) rood length
enddo
Algorithm 1 (Memory access sequence for a one-level
mapping.)
Input: Layout parameters (p, k), regular section A(l: h : s),
processor number m.
Output: The A.A4 table, the length of the table, ststtin.__ memory
location, ending memory location, and )__ AA4 for
processor rrL
Method:
1 integer AA4[k], length
2 integer X, i, start, end, d, _, fl, /course, /offset, course,
offset
3 integer loc[k+l], sortedlk+l]
4 (d, o_, t) (-- EXTENDED-EUCLID(s, pk); length *-- O;
start +-- h + 1;end _t- 1
5 forA=km-t tokm--t+k--ldo
6 if A rood d --- 0 then /* Solvable equation */
8 start ,-- min(s_ Ioc[lenggaD
s (_O t .j_ _k I (h--t)d--A_s I\'_9 end'--max(end, l+'d_ --_ L pk_ J))
10 length ,-- lengOa + 1
11 endlf
12 enddo
13 if length - O or start > h then
14 return A3A, _1_,_1_,_L, _1_/* No work for processor */
15 endif
16 sorted*-.- SORT(lot, length)
17 sorted[length] *-.- sorted[O] +pks/d; Icourse
C(sorted[O]; p, k); loffset ,-- O(soned[O]; p, k)
18 fori-0 tolength- ldo
19 course ,-- C (sorted[i + 1]; p, k); offset
+---O(sorted[i + I]; p, k)
20 A.Ad[i] -- k"(course - 1course) + (offset - loffseO
21 lcotrrse _ course; loffset *- offset
22 enddo
23 return AA4, length, A4(s_-t; p, k), A4(end; p, k),
sk/GcD(s, pk)
We illustrate the actions of Algorithm 1 on Example 3 for pro-
cessor 0. Here, p = 3, k = 4, and s = 5. The extended Euclid
algorithm returns d = 1, a = 5, and fl = -2. Since each linear
Diophantine equation is solvable, the loop in lines 5-12 is executed
four times. At the end of this loop, loc - [0, 25, 50, 15], length - 4,
start - 0, and end - 50. After llne 17, sorted - [0, 15, 25, 50, 60].
At the end of the loop in lin_es 18-22,AA4 - [7, 2, 9, 2].
3 Two-level mappings
Two-level mappings ate solved by two applications of Algorithm 1.
As an example, let A be an array with element A(i) aligned with
template cell 3i, let the template be mapped with p = k = 4, and let
the regular section be A(0: 42: 3), which we abbreviate to B. The
picaLre is shown in Figure 3(a). The boxes represent template cells,
the number j indicates the location of A(j), and numbers in squares
are part of the regular section A(0: 42 : 3). Two-level mappings
encompass _dl mappings possible with the linguistic constructs of
nested sections, af_e alignment, and cyclic(k) distribution.
Unlike the one-level case, all template cells do not contain array
elements. However, we allocate local storage only for the occupied
cells of the template. Thus, the local memory storage for A in
the example above is as shown in Figure 3Co). We handle this
compfication by first doing all our computations with respect to the
template (the local template space), and then switching back to the
local memory space in the final step. The picture in local template
space is shown in Figure 3(c).
Note that A and B are both aligned to "regular sections" of the
template: A(i) is aligned to template cell 3i, and B(i) is aligned
to template cell 9i. We therefore create state tables for A and B
with respect to the local template space using Algorithm 1. The
"AA4" entries returned by the algorithm axe actually spacings in
local template space; we therefore rename this column AT. This
gives the tables for A and B shown below. (;3
_TATE NEXT AI AA4 '.__.,_
0 3 3 ?
Ta- 1 0 3 ?
2 1 3 ?
---3---- 2 3 ?
(
STATE NEXT AT- AA4
0 2 6 ? _ - _1)
TB- 1 3 6 ? C2 1 15 ? /-
3 0 9 ? (,,
Now all we need is to switch back to local memory space and
fill in the AA4 entries.
As we traverse the state table TA for any processor m, we
access each element of A stored on that processor, in order. But
sincestorageisallocated based on A, thistranslatesto accessing
consecutivelocationsinlocalmemory. Thus the A.M entriesfor
TA are all unity. Now it is a simple matter to find the AA4 entries
for the desired section B. For instance, the first line of TB says
that if we are at an element of the section that has offset 0, the
next element has offset 2 and is 6 template cells away. To find
the corresponding AA4, we start in state 0 of TA, and walk the
state table, accumulating AT and AA4 counts. We stop when we
have accumulated the desired AT count. (It is easy to show that at
this point we will have arrived at the correct offset as well.) The
accumulated AA4 value is the desired entry for that Rue ofTB. This
procedure results in the following table for B)
[SrAr_I Nrrr I m" IIA_ I
01216 21 3 6 2
3 0 9 3
The algorithm as presentedabove requires O(k) time to com-
pute each AA4 entry of TB. TWo optimizations reduce this com-
plexity to O(1), allowing the final fixup phase to nm in O(k) time.
First, we compute the sum of the AT and the A.A4 around the state
lIn [hill _IS_, _ D.lrl_ out to be AT divided by [he alignment stride a - 3, but
this is not _me in generaL Try the following case: a - 3, b - 0, p - 3, k - 8, s - 3.
(a)
Co)
(c)
I01
16
32
ProcessorO
H
11 [121
22
1271
17
28
Processor 1
2
7
[lS]
23
[ 33 [ 34
35 ]391
8
13
29
40
Processor 2 Processor 3
131 4 5
10
14 [15 I
19 20 [21[
25 26
13o] 31
35 136[ 37
41 1421
I L°calmem°ryl°cati°nII°l 1 [ 2[ 3 [ 4[ 5 I 6 I 7 I 8 I 9 I 1°1
ProcessorO 0 1 6 11 16 17 [ 22 27 32 33 38
Processorl 2 7 12 13 18 23 28 29 34 39
Processor2 3 8 9 14 19 24 25 30 35 40141
Processor3 4 5 10 15 20 21 26 31 36 37
IL°caltemPlatecenIIO lI 121 3 1415 I 6 171 8 1 9 1I°lII[...[
Processor0 0 I 6 Ii ...
Processor 1 2 7 12 13 ...
Processor 2 3 8 9 14 ...
Processor 3 4 5 10 15 ..-
Figure 3: A two-level mapping of an array A aligned to the template with stride 3. (a) Layout of template ceUs. Numbers are array elements;
boxed numbers are in the regular section A(0: 42 : 3). Co) Local memory storage for array. (c) Local template space for array.
1(2,o)
Processor(O,O)
(o,1) (0,4)
(1,1) (1,4)
(2,1) (2,4)
(9, 1) (9,4)
(10,1) (10,4)
(II,1) (11,4)
(4,1)
(5,1)
(12,1)(13,1)
l(14,o) (14,1)
Processor(1,O)
(3,1) (3,4)
Processor(0,1)
(0,5)
(1,5)
(2,5)
(9,5)
(10,5)
(11,5)
(0,2)1,2)
(2,2)
(9,2)
(10,2)
(11,2)
(0,7)(1,7)(2,7)
(9,7)(lO,7)
(11,7)
(3,5) (3,2)
Processof(l, 1)
(14,3)I 1(14,6)
(3,7)
PrOCeSSOr
(6,1)
(7,1)
8
(15,1/(16,1)
(17,1)
(4,4) (4,5)
(5,4) (5,5)
(12,4) (12,5)
(13,4) (13,5)
(14,4) (14,5)
(2,0)
(6,4) (6,5)
(4,2)
(5,2)
(12,2)
(13,2)
(14,2)
(6,2)
Processor (2, 1)
1(8, 6)
(4,7)
(5,7)
(12,7)
(13,7)
(14,7)
(6,7)
(7,7)(7,4) (7,5)
(8,4) (s,5)
(15,4) (15,5)
(16,4) (16,5)
(17,4) (17,5)
(7,2)
(8,2)
_i5,2)(16,2)
(17,2)
(8,7)
(15,7)(16,7)
(17,7)
Figure 4: Mapping a two-dimensional array. Layout of array elements on processor arrangement with p = k = 3 in the vertical dimension
and p = k = 2 in the horizontal dimension. The boxed pairs are A(0:17: 2, 0: 7: 3).
machine cycle, so that we can cast out any full cycles that must be
traversed. Second, we tabulate the distance of each offset from a
fixed origin (the numarcs table in Algorithm 2), allowing the com-
putation of the distance between any two offset positions by two
table lookups and some simple arithmetic. Consider the third line
of Tv, which has a AT of 15. Using the precomputed sums, we
determine in constant time that this corresponds to one full cycle
plus 3 more template cells, and that the full cycle con_'ibutes a
memory offset of 4. Now all we need is to determine the num-
ber of memory cells encountered in going from offset position 2
to offset position 1 in Ta. We find this distance in constant lime
using the nttmarcs table. Algorithm 2 shows the method with these
optlmizations incorporated.
Algorithm 2 (Memory access sequence for a two-level
mapping.)
Input: Alignment parameters (a, b), layout parameters (p, k),
array length n, regular section A(t: h : s), processor
number m.
Ou_ut: The AA4 table, length of the table, starting memory
location, ending memory location, and the length of
A in processor m.
Method:
1 integer £ ram, numarcs[k]
2 integer AA4/k], length, startmem, lastmem
3 integer AAdl[k], lengthl, startmeml, lasrmeml, memcyclel
4 integer AA42[k], length2, startmem2, lasrmem2, memcycle2
5 function residual(x)-
length I •L (x-startmem l)/memcycle I j
6 function major(x) - (nvmarcs[x mod k] -
numarcs[startmeml mod k]) rood lengthl
7 function realmem(x) - major(x) + residual(x)
8 (AAdl, lengthl, startmeml,/astmeml, memcyclel)
,--- Algorithm l(b, a(n - I) + b, a, p, k, m)
9 (A/d2, length2, startmem2, lastmem2, memcycle2)
+-- Algorithm l(ag + b, ah + b, as, p, k, m)
10 if length1 -.L or length2-.L then
11 return AA4, _1_,_1_,.1_, 0/* No work for processor */
12 endif
13 mm ,--- starmaeml
14 for i - 0 to length1 -- I do
15 numarcs/mm rood k] *- i
16 mm _ mm + AAdl[i]
17 enddo
18 mm *---startmem2
19 for i - 0 to length2 -- 1 do
20 AAA[i] .- reahnem(mm + AAd2[i]) - realmem(mm)
21 mm _ mm+ A.M2[i]
22 enddo
23 return AA4, length2, realmem(startmem2),
realmem(lastmem2 ), w.almem(lastmem I) --
realmem(startmem l )+ l
4 Multid;mensional arrays
So fat, we have characterized the access sequence of a single pro-
cessor by a sta_ address and a memory stride pattern, the latter
being generated by the transitions of a FSM. For a multidimen-
sional array,each dimension willbe charactefize,d by a (p,k) pair
and have a FSM associated with it, and each processor will execute
a set of nested loops. Instead of the start addressbeing fixed as it
was before, it too wiU be generated by the transitions of a FSM. For
simplicity, we will deal with a two-dimensional array. The exten-
sion to higher dimensions is straightforward. An array dimension
with p = 1 is "mapped to memory".
Consider a one-level mapping of an array A(0: 17, 0: 7) with
parameters (Pt, k0 - (3, 3) in the first dimension and (P2, k2) --
(2, 2) in the second dimension, and let the section of interest be
A(0: 17: 2, 0: 7: 3). The picture is shown in Figure 4.
Let each block be laid out in column-major order in local mem-
ory. Thus, the sequence of index values corresponding to successive
memory elements of processor (0,0) is {(0,0), (1,0), (2,0), (9,0),
(10,0), (11,0), (0,1) .... }. Such a decomposition is used by Don-
garra et al. for dense linear algebra codes on distn'buted-memory
machines [3]. Then the tables corresponding to the two dimensions
of A are
T] _---
I STATE[ N_r I A_ I
0 2 2
1 0 2
2 1 2
and
[STATEI IAM I
Tz = 0 1 30 •
1 0 6
(Had the section been larger in the second dimension, the col-
umn accessedby processor (0, 0) immediately after column 0 would
be column 9. The elements (0, 0) and (0, 9) would be 30 locations
apart in local memory. This explains the AA4 entry in the first row
of T2.)
The FSM T2 is used to generate memory spacings for successive
base addresses, while lhe FSM Tl is used to generate memory
spacings between successive elements with the same base address.
Algorithm 3 computes these FSMs.
Algorithm 3 (Memory access sequence for a multidimen-
sional array with a two-level mapping.)
Input: Number of dimensions d; alignment parameters
(a j, bj), layout parameters (p:, k:), dimension length
nj, regular section A(! i : h i : si), processor number
ms, for each dimension j (1 < j < d).
Ou_ut: AA4j, length_, startmemj,/astmemj (1 < j < d).
Method:
a integer j, mere, m, k
2 mere *-- J
3 forj- 1 to ddo
4 (AAdj, lengthj, stanmemj, lastmemj, m)
,--- Algorithm 2(a_, bs , pj, kj, nj, 1i, h i, s i , mj)
5 startmem_ ,--- starm_emj • mere
6 hstmemj .-- lastmem i • mere
7 for k - 0 to length i - 1 do
8 AA4_ [k] ,---AA4# [k].mem
9 enddo
10 me/n _-- mere • m
11 enddo
12 return A.M_, length_, smrwaem i, lastmemj
(l_</_<d)
For the two-dimensional case, each processor executes the fol-
lowing doubly-nested loop. Higher-dimensional cases follow the
same pattern, with one loop for each dimension.
base2 *-- stallmem2
h,--O
while base_ </astmem2 do
base_ 4-- star_eml
il ,-0
while baser __hsuneml do
computewithaddressbasez+ basel
basel ,- baser +AA{m[il]
il _ (/1 + l)mod le_gthl
enddo
basez .- basez +AAdz[iz]
h ""(h+ 1)mod Ieagth2
enddo
5 Generating communication sets
The set of messages that a given processor m must send while per-
formingitsshareofthecomputationiscalleditscommunication
set.We now extendtheFSM approachtogeneratecommunication
setsforregularcommunicationpatternssuchasshiftsand stride
changes.Inthis ection,we assumea one-dimensionalrraywith
a single-levelmapping.Extensionstomultilevelandmultidimen-
sionalcasesareasintheaddressgenerationproblem.
We adoptKoelbel'sexecutionmodel forparallelloops[7],in
whichaprocessorgoesthroughfoursteps:
1. Send those data items it holds that are required by other
processors.
2.Performlocaliterations,i.e.,iterationsforwhichitholdsall
data.
3.Receivedataitemsfromotherprocessors.
4.Performnonlocaliterations.
We combinethefinaltwo stepsintoareceive-executeloop.
Our approach puts all the intelligence on the sending side, keep-
ing the receive-execute loop simple. We wish to minimize the
number of messages sent and maximize cache benefits. To this end,
each processor makes a single pass over the data in its local memory
using the FSM technique descn'oed above, figuring out the desti-
nation of each data element and building messages. The messages
contain (address, value) pairs which the receiving processor uses to
perform its computations. All the data that must go from processor
i to processor j are communicated in a single message. The data
items corresponding to the local iterations are conceptually sent by
the processor to itself. Of course, this message is never sent; rather,
the local iterations are performed out of the message buffer. The
idea is similar to the inspector-executor model [10] for irregular
loops.
5.1 Shifts
Consider the computation
A(O:h:s) = A(O:h:s)+ A(t:t+h:s),
where the regular section A(l: l+ h :s) is to be moved to align
with A(0: h:s). (For simplicity, assume ! > 0.) In this case, a
processor communicates with at most two processors. We show
how to augment the FSM with a A7) and a A£ column, such that
adding these quantifies to the processor and memory location of
the current element gives the remote processor number and the
memory location in the remote processor of the element with which
it interacts.
Suppose we know that element A(i) is located on processor m
at memory location .A4(i;p, k) with offset Oi = O(i;p, k). Then
the problem is to find the processor and local memory location of
the elemem A(i - l). Let l = q. pk + r, and AP = [(r - O_)/k].
Then P(i - l; p, k) = (m - AP + p) rood p. Since 0 < O, < k,
AP can assume only two values as the offset Oi varies.
We also define the quantity A£ such that A4(i - l;p, k) =
_(i;p, k)+ Az(o,).
A£(O,) -- ((O,-r) modk)-Oi-kq-71,
k if(mk-r+Oi)<O,I1 = 0 otherwise.
5.2 Stride changes
Now consider the communication required to move the regular sec-
lion A(0: s2n: s2) to align with the regular section A(0: stn: sl).
This is harder than a shift; there is no simple lookup technique for
generating the communication sets. The problem is that the pattern
of destination processors can have period longer than k. The only
fact that appears to be true in this case is that the sum of the periods
over all processors divides pk. We therefore resort to a simple
technique.
We generate the FSM to reflect the storage properties of the s2-
section. Now, the jth element of this section interacts with index
i = j • st ofA. Letqt = idivpk, rt = i rood pk, q2 = rt div k,
and r2 = rl mod k. Then the desired dement is stored in memory
location q_ • k + r2 on processor q2:
Many modem RISC microprocessors perform integer division
and remainder operations in software, which are rather slow. The
equations above can be rewritten to require only two integer divi-
sions, which substantially speeds up the implementation. If either
p or k is a power of two, the computation can be sped up further by
using shifts and bit masks instead of the division operations. Our
implementation incorporates these optimizations.
5.3 Termination
Each processor must determine when it has completed all its opera-
tions. As we do not sendmessagesbetween every pair of processors,
we cannot use the number of messages to determine termination.
However, it is easy to figure out the total number of operations that
processor m will execute (by a variant of Algorithm 1). A simple
termination lest initializes a counter to this value, decrements it by
the number of array elements received in each message, and exits
when the counter reaches zero.
6 Experimental results
In this section we present experimental resultsofimplementing the
algorithms above. We measure both the preprocessing cost of table
generation and the runtime loop overhead.
Our computational kernel is the DAXPY operation
y(l:n:a) - y(l:n:s) + a*x(l:n:a).
The ratio of memory operations to computation is high in this
kernel making it unfavorable for our algorithms. The overhead
due to irreguhr memory access patterns would decrease with more
computation in the loop. All experiments reported in this section
were done on a Sparcstation 2, for which Dongarra [2] reports a
100 x 100 double-precision LINPACK performance of 4.0 Mflops.
(We measured 3.3 Mflops.) (A repetition of the experiments on
a single processor of an iPSC/g60 showed similar characteristics.)
We used the GNU C compiler, with optimization at the -02 level.
7
_q
o
r--I
q-4
en
2.5
1.5
1
0.5
a=l pffi32
I I
l i
i
:, ,,
l,
0, : ._
, ,, , .: : ,: :
i FI l_ i, Y ' ""-,'-' ;; " "
, _ il'_ _ ..... _'_1
t I t
t ! t
k=17 --
k=64 .....
k=l ......
k=block .....
:, ;, _ :, ;,
R .... ." .. : "... ., _X , _ , ;, ",-":
• • ". ', _ ,'. i _
," ' _ ,',i i_._
,, . . 'I
16
[ ,
32
stride s
I
48 64
Figure 5: Performance of DAXPY for various block sizes k.
A higher optimization level increased compilation time without
improving the quality of the generated code. With s = I the
loop compiled into 9 instructions, while the general case with table
lookup compiled into 17 instructions. Our measured performance
for the s = I case was 2.0 M/lops for long vectors.
6.1 Local address generation
We have seven parameters for a one-dimensional array:, two align-
ment parameters a and b, two layout parameters p and k, and three
section parameters t, h, and s. This is too large a space over which
to present results. We now explain how we reduce the number of
variables.
. Alignment stride a: For any constant c, the memory layout
for parameters ca and ck is exactly the same as that for a
and k. Thus we experiment only with a = 1; the effect of
changing a can be simulated by changing k instead.
. Alignment offset b: This is irrelevant for large sections.
For small sections where end effects could be important, we
observed that changes in b affected results by less than 5%.
We therefore take b = 0.
• Section lower bound t: Again, this is irrelevant for large
sections and has negligible effect even for small sections.
We therefore lake t = O.
• Section upperbound h: This affects thenumber of elements
in the DAXPY, which in turn affects the tradeoffbetween loop
starmp and iteration time in a single processor. We vary h
along with s, keeping h/s fixed, so that all our experiments
do the same amount of work regardless of stride.
• Section strldc s: This is the z-axis of our plots, as one of the
two independent variables.
• Number of processors p: We run our experiments with a
fixed number of processors. Varying the number of proces-
sors has the effect of scaling the plots.
• Block size k: This varies from plot to plot, as the second of
the two independent variables.
We use the execution time per element of the DAXPY loop as
our measure of performance. For a given set of parameters, we
compute the tables for each processor and time the execution of
the distributed loop. Let rm be the time taken by processor m to
execute the n,, iterations it receives. Then the execution time per
element is Trn
_e = max --.
m n_t
This factors out the effect ofload imbalanee. :_: : -: _: ::::
We plot 2/G (the megaflops per second per processor) against
stride s for various block sizes k. The plots are shown in Figure 5.
The curves are the sum of two effects.
a°l _7 p032
gO00
Iooo
7000
6ooo
H $1_0
3000
3000'
IOO0
0 _
a t @_ _'
i a • & • 6 • •
I_ 32 48 f4
block oiso k
Figure 6: Preprocessingoverhead for memory stride computations.
The graph corresponds to an alignment stride of 1. The multiple
points for each block size show the processing times for each of 32
processors.
1. Baseline: The lower envelope is a smooth, relatively fiat
function of s. The value ofthe envelope is approximately the
asymptotic performance for the single-processor DAXPY.
2. Jiggles: The plot jiggles up and down some, presumably due
to cache stride effects [1].
Although the generalloop has almost twice as manyinstmctions
as the unit-stride loop, the difference in performance is much less.
This is because the execution time is dominated by the floating-poim
operations and references to operands in memory. The references
to the AA4 arrayusually hit in cache, and the additional loop control
instructions probably execute in overlapped fashion.
The preprocessing time required to compute the AM tables is
shown in Figure 6, where we pIot the running time of Algorithm 1
against block size k. The running time is approximately O(k),
with certain special cases being handled much faster. Experimental
observation confirmed that the running times for Algorithm 2 were
no more than twice those of Algorithm 1.
If the A,Ad sequence is known at compile time, unrolling the
loop bodies avoids the indirection into the AAd array. Figure 7
shows that the benefits of this are mixed. While loop unrolling
removes the indirection, it also increases the size of the loop bod_
thus the benefits of a simpler loop body may be undone by misses
in the instruction cache caused by the larger loop body. This is a
factor for larger block sizes, since the amount of unrolling maybe
as large as k. We recommend unrolling only ff the AA4 sequences
are short, or can be compacted, e.g., by rtm-length encoding.
6.2 Commun|cation set generation
For the communication set generation algorithms, we measured the
time required to generate the FSM tables and the time required to
marshal the array elements into messages. We did not measure
the actual communication time, since that varies widely among
machines. The results are presented in Table 2.
7 Conclus|ons
Our table-driven approach to the generation of local memory ad-
dresses and communication sets unifies and subsumes previous so-
lutions to these problems. The tables required in our method are
(a)
to)
{ k I s J £ J FSMgen. time(ps)
1 3 10 360
17 3 10 1831
17 3 40 1870
64 3 10 6632
64 3 100 6508
blk 3 10 408
Run time/elt _s)
3.3
3.1
3.1
3.2
3.2
3.2
Run time/elt (_s)
5.3
14.3
5.4
[ k J s: I s2 [ FSMsen. time(_s)
1 3 5 360
17 3 5 1891
64 3 5 6542
blk 3 5 400 11.3
Table 2: Performance of communication set generation algorithms.
All numbers are for 32 processors. (a) Shift communication: the
shift distance is L to) Change of stride: the stride is changed from
821.081.
easy to compute, and the runtime overheads due to indirect address-
ing of data arc small. Our data access patterns exploit temporal
locality and make effective use of cache. We also support blocking
of messages to reduce message starmp overheads.
Our prototype implementation has demonstrated our perfor-
mance claims. An optimized implementation of our techniques
should deliver close to maximum performance for generalcyclic(k)
distributions of arrays.
Our algorithms are fully parallel in that each processor of a
parallel machine can generate its tables independently Finding
a fully parallel algorithm running in O(k) parallel time or with
O(p -{- k) total work remains an open problem.
Acknowledgments
J. Ramanujam proofread an early draft and suggested the simplifi-
cations to Algorithm 1. We also thank the referees for their helpful
coinments.
References
[1] BAILEY, D. H. Unfavorable strides in cache memory sys-
tems. RNR Technical Report RNR-92-015, Numerical Aero-
dynamic Simulation Systems Division, NASA Ames Re-
search Center, Moffett Field, CA, May 1992.
[2] DONOr, J. J. Performance of various computers using
standard linear equations software. Computer Architecture
News 20, 3 (June 1992), 22--44.
[3] DONOARRA, J. J., vAr_ DE GEUN, R., AND WALKER, D. W. A
look at scalable dense linear algebra h'braries. In Proceed-
ings of the Scalable High Performance Computing Confer-
ence 0Villiamsburg, VA, Apr. 1992), IEEE Computer Soci-
ety Press, pp. 372-379. Also available as technical report
ORNI]TM-12126 from Oak Ridge National Laboratory.
[4] Fox, G. C., ttmANANDAhq, S., KENNEDY, K., KOELBEL, C.,
KREMER, U., TSENG, C.-W., AND WU, M.-Y. Fortran D lan-
guage specification. Tech.Rep. Rice COMP TR90-141, De-
partment of Computer Science, Rice University, Houston, TX,
Dec. 1990.
9
o_
o
,--t
%4
lm
o
affil pffi32
3 I I I
k--17 (unrolled)
k--17 (table) .....
k=64 (unrolled) ......
k=64 (table) ......
2.5
2 l :;
I: 1 I , _ ' ' .... ' | %" _1.5 ,. , ' ' ','"
' t, I I ; I '_% t
', _ I I _ If ",J I f
Y
1
0.5
0 I
16
I I
32 48
stride s
64
Figure 7: Effect of loop unrolling on performance of DAXPY for various block sizes k.
[5] HiGH PERFORMANCE FORTRAN FORUM. High Performance
Fortran language specification version 0.4. Draft, Nov. 1992.
Also available as technical report CRPC-TR 92225, Center
for Research on Parallel Computation, Rice University.
[6] KNUTH, D.E. SeminumericalAlgoritb,2, second ed., vol. 2
of The Art of Computer Programming. Addison-Wesley Pub-
lishing Company, Reading, MA, 1981.
[7] KOELBEL C. Compile-time generation of regular communi-
cation patterns. In Proceedings of Supercomputing'91 (Albu-
querque, NM, Nov. 1991), pp. 101-110.
[8] KOELBEL, C., AND MEHROTRA, P. Compiling global name-
space parallel loops for distributed execution./EEE Transac-
tions on Parallel and Distributed Systems 2, 4 (Oct. 1991),
440-451.
[9] MACDONALD, T., PASE, D., AND M_TZER, A. Addressing in
Cray Researeh's MPP Fortran. In Proceedings of the Third
Workshop on Compilers for P aralle l Computers (Vienna, Aus-
lrla, July 1992), Austrian Center for Parallel Computation,
pp. 161-172.
[10] MmC_d_DANEY, R., SALT'Z, J. H., SMrm, R. M., CROW-
I.EY, K., AND NICOL, D. M. Principles of runtime support
for parallel processors. In Thirdlnternational Conference on
Supercomputing (July 1988), ACM Press, pp. 140--152.
10
