Compile-time optimization of near-neighbor communication for scalable shared-memory multiprocessors by Hudak, David E. & Abraham, Santosh G.
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 15, 368-381 (1992) 
Compile-Time Optimization of Near-Neighbor Communication for 
Scalable Shared-Memory Multiprocessors 
DAVID E. HUDAK AND SANTOSH G. ABRAHAM 
Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan 48109-2122 
Scalable shared-memory multiprocessor systems are typically 
NUMA (nonuniform memory access) machines, where the exploi- 
tation of the memory hierarchy is critical to achieving high perfor- 
mance. Iterative data parallel loops with near-neighbor communi- 
cation account for many important numerical applications. In 
such loops, the communication of partial results stresses the mem- 
ory system performance. In this paper, we develop data place- 
ment schemes that minimize communication time where the near- 
neighbor interaction is determined by a stencil. Under a given 
loop partition, our compile-time algorithm partitions global data 
into four classes for each processor, with each class requiring 
specific consistency maintenance requirements. The ADAPT (Au- 
tomatic Data Allocation and Partitioning Tool) system was imple- 
mented to automatically partition parallel code segments for the 
BBN TC2000, a scalable shared-memory multiprocessor. ADAPT 
caches global arrays and maintains data consistency in software 
through instructions that flush data from private caches. Restruc- 
turing of a fluid flow code segment by ADAPT improved perfor- 
mance by a factor of more than 3 on the BBN TC2000. Features in 
current generation pipelined processors with multiple functional 
units permit the overlap of memory accesses with computation. 
Our experiments on the BBN TC2000 show that the degree of 
overlap is limited by architectural parameters, such as the number 
of CPU registers. © 1992 Academic Press, Inc. 
1. INTRODUCTION 
Shared-memory multiprocessors offer a familiar model 
for programmers. Scalable multiprocessor systems can 
be classified as nonuniform memory access (NUMA) ma- 
chines where the memory latency depends on the loca- 
tions accessed. For instance, an access by a processor in 
the BBN Butterfly TC2000 has a latency of 3, 11, or 38 
CPU clock cycles depending on whether the location ac- 
cessed is in the cache, local memory, or remote memory, 
respectively. Other scalable, shared-memory multipro- 
cessors such as the MIT Alewife and Stanford DASH 
multiprocessor systems [20] have nonuniform access la- 
tencies. The increased latency and reduced bandwidth of 
global memory have a substantial impact on perfor- 
mance. Restructuring of programs can reduce the num- 
ber of global memory accesses and dramatically improve 
performance. 
We believe that future multiprocessor systems will 
have complex memory hierarchies, which cannot be 
managed effectively by the hardware. Since the portabil- 
ity of parallel programs is an important issue, and since 
each multiprocessor will have a unique memory hierar- 
chy, the burden of managing the memory hierarchy will 
fall on the compiler. The development of compile-time 
schemes to manage local memories is therefore an impor- 
tant research topic. 
Our initial work toward a compiler that automatically 
compiles code to utilize the memory hierarchy of a scal- 
able multiprocessor system is based on the following 
premises. A large amount of execution time is spent in 
parallel loops and initial work should focus on such 
loops. Numerical programs are typically continuum 
models where each point in a multidimensional space can 
be updated in parallel but the newly updated values are 
required in the next time-step. The particular loop con- 
struct studied in this paper is used in coding such itera- 
tive data parallel programs. Thinking Machines Corpora- 
tion has recently introduced a specialized compiler to 
optimize such loops for the Connection Machine [4]. 
We have developed a theoretical framework for ana- 
lyzing communication for such loops. In earlier work, we 
also developed optimal loop partitioning schemes to de- 
termine the loop partition that minimizes the number of 
data points exchanged between processors [1, 16]. In this 
paper, given the parallel code segment and the loop parti- 
tion, we divide each global array into four classes for 
each processor. The exclusive read-write set (ERW) may 
be moved by each processor into the highest level of its 
memory hierarchy. Consistency must be maintained on 
the shared read-exclusive write set (SREW) and shared 
read-no write set (SRNW). No accesses are made by the 
processor to the no read-write set (NRW). Such a data 
partition is automatically obtained from a few parame- 
ters, viz., communication parameters derived from the 
code segment and the loop partition. 
Results of this analysis can be used with different data 
placement schemes, e.g., placing the ERW set in the lo- 
cal memories of each processor. These schemes have 
been implemented and experimental results on the But- 
terfly TC2000 are reported. Even a simple data partition 
that caches the ERW set was shown to have a factor of 3 
improvement in execution time over currently used de- 
fault data assignments. We focus on minimizing the im- 
pact of communication time, first, by choosing the data 
assignment to minimize the communication time, and 
second, by overlapping as much of the remaining com- 
munication time with computation as possible. The over- 
lap of computation with communication was shown to 
yield execution time improvements of nearly 10% for 
some data assignments. 
Software systems that partition and manage the consis- 
tency of data using messages among processors are cur- 
rently being developed for distributed memory machines 
[12, 19]. In addition, partitioning systems that manage 
shared data are being developed commercially for SIMD 
machines [4, 18]. The optimization of code for a shared-memory 
NUMA system requires similar treatment. Toward this end, 
we have developed the ADAPT (Automatic Data Allocation and 
Partitioning Tool) system to 
analyze the Fortran versions of the iterative parallel 
loops described in this paper and to generate code for the 
BBN TC2000 which employs a proper data partition and 
exploits the memory hierarchy. 
2. RELATED WORK 
Our goal is to develop compiler techniques for restruc- 
turing parallel loops for shared memory multiprocessors 
to minimize the performance degradation due to the la- 
tency and available bandwidth of the memory system. 
Related work includes work on loop partitioning, auto- 
matic data distribution for nonshared memory machines, 
locality enhancement, prefetching, and software consis- 
tency maintenance. 
Loop partitioning for two-dimensional iteration spaces 
is often achieved by tiling the iteration space with geo- 
metric shapes that tessellate, as described by Reed et al. 
[26] and Carr and Kennedy [7]. In contrast, this paper 
discusses a systematic method for reducing the impact of 
communication time on the execution time of any stencil- 
based, iterative data parallel loop by generating data 
placement strategies and overlapping communication and 
computation. Ramanujam and Sadayappan [25] and 
Tseng [28] present dependence-oriented partitioning ap- 
proaches for iteration spaces to be executed on message- 
passing, nonshared memory multiprocessors. In con- 
trast, our work is oriented toward scalable 
shared-memory multiprocessors. 
Research on the automatic distribution of data has 
been done for message-passing, nonshared memory sys- 
tems, e.g., Zima et al. [30], Pingali and Rogers [23], and 
Fortran-D [12]. Our analysis uses an existing process decomposition 
(ADP) [1, 16] that minimizes data exchange 
between processors, and then determines the data place- 
ment which minimizes communication time. Thus, our 
system automates and optimizes all the steps involved in 
mapping an iterative parallel loop to a particular shared- 
memory multiprocessor system. 
Performance improvement through exploitation of 
memory hierarchies has been previously studied by Gannon 
et al. [14], Carr and Kennedy [8], and Wolf and Lam 
[29]. These optimizations attempt to maximize reuse of 
cache data in the context of a finite cache, i.e., minimiz- 
ing uniprocessor misses. They have a secondary effect of 
improving multiprocessor performance by reducing the 
bandwidth requirements of each processor, thereby re- 
ducing contention in the memory system. In contrast, our 
work considers a multiprocessor environment where 
each processor has its own memory hierarchy. 
In prefetching, the data are brought to a higher level in 
the memory hierarchy before they are required. Since 
computation continues during the prefetch, the global 
memory access latency is hidden; see, e.g., Gannon et al. 
[14], Gornish et al. [15], and Callahan et al. [6]. However, 
prefetching only helps to hide large memory latencies, 
and does not reduce bandwidth requirements. Our 
work is based on multiprocessors that lack explicit sup- 
port for prefetching. However, we exploit the parallelism 
between the access unit and the floating point unit in the 
Motorola 88000 processor [22] to overlap the latency of 
the remote memory with useful computation. Unlike ear- 
lier analytic or simulation work, we measure actual exe- 
cution times to illustrate the achievable overlap of com- 
munication and computation. Automatic compile-time 
maintenance of consistency for shared data has been ex- 
amined by Cheong and Veidenbaum [9] and by Cytron 
et al. [10]. 
3. MODEL ASSUMPTIONS 
3.1. Multiprocessor Model 
The data placement and overlap strategies discussed 
here are applicable to a wide range of shared-memory 
architectures with a scalable memory hierarchy. We as- 
sume a multiprocessor with a globally shared address 
space across physically distributed memories. Since cur- 
rent methods for hardware-maintained cache consistency 
do not scale well to large numbers of processors, our 
model multiprocessor assumes no hardware support for 
cache consistency. We assume a three-level memory hi- 
erarchy consisting of a private cache, private local mem- 
ory, and globally shared remote memory. Note that the 
effects of contention for shared communication media 
(memory banks, switches, etc.) are not considered in this 
work since the focus is on the minimization of overall 
communication requirements, and not on optimization 
for any particular interconnection configuration. 
The BBN TC2000 Butterfly parallel processor [2] is a 
shared-memory MIMD machine composed of Motorola 
88000 processors, each having an individual cache. The 
processor is resident on a function board with a memory 
module. The function boards are interconnected by a 
multistage interconnection network. The BBN TC2000 is 
a nonuniform memory access machine. From the per- 
spective of a particular processor on the TC2000, the 
memory hierarchy is that processor’s cache, local mem- 
ory (i.e., the region of shared memory addresses which 
correspond to the memory module which is resident on 
the function board), and global memory (i.e., other processors' 
local memories). Other scalable shared-memory 
machines have similar memory hierarchies [20] and con- 
form to our model, but substitute a mesh for the multi- 
stage interconnection network. The latencies for various 
memory operations as determined by BBN [2] are 0.15 μs 
or 3 CPU cycles for a cache access, 0.60 μs or 11 CPU 
cycles for a local memory access, and 1.89 μs or 38 CPU 
cycles for a global memory access. 
3.2. Program Model 
The iterative data parallel loops analyzed in this paper 
are a collection of perfectly nested loops with an outer- 
most sequential loop that controls the execution of inner 
data parallel loops. For simplicity, a two-dimensional 
parallel iteration space of size N × N is assumed, although 
many of our methods generalize to higher dimensions [17]. 
We assume that the bounds of the inner parallel 
loop are large enough to warrant parallel execution. A 
cycle is an execution of a single iteration of the outer 
sequential loop. The body of the loop consists of a series 
of assignment statements involving two-dimensional ar- 
ray variables. Our analysis assumes that the code body 
honors the single assignment property, i.e., that each 
array location is written to by only one iteration of the 
parallel loops. Additionally, we assume asynchronous se- 
mantics for the updating of arrays within the parallel 
loops. 
This work develops data storage techniques that opti- 
mize near-neighbor communication. Typically, regular 
near-neighbor communication is expressed in applica- 
tions through the use of subscript offsets, as in Fig. 1, and 
irregular communication is characterized by maintaining 
an array of pointers. Such arrays have been analyzed in 
substructuring methods for finite element domains [11]. 
Since such run-time information is not amenable to com- 
pile-time analysis, we focus on subscript offsets. Such 
offsets appear in many numerical application programs, 
e.g., asynchronous solutions to partial differential equations, 
continuum modeling, and image smoothing [1]. 
Therefore, the subscript expressions for right-hand side 
references of an array are restricted to one of the parallel 
loop indices plus or minus a small constant. The ordering 
of the appearance of loop indices is assumed to be identical 
for all array references. For simplicity, the subscript 
expressions for left-hand side references of an array are 
restricted to parallel loop indices. Different types of subscript 
expressions, such as those found in Gaussian elimination, 
induce different types of communication, thus requiring 
further analysis. This is an avenue of important 
future work. 

do k = 1, 1000 
   do j = 2, N-1 
      ... 
   do j = 2, N-1 
      ... 
FIG. 1. Example of an iterative data parallel loop. 
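The loop form described above can be sketched in a few lines. This is only an illustration: the four-point stencil matches the one given for Fig. 1, but the averaging update rule, the function name `relax`, and the use of nested Python lists are assumptions, since the body of Fig. 1 is not reproduced here. Each array location is written by exactly one iteration of the inner loops (single assignment), and updates are made in place, so a read may see either the old or the new value of a neighbor (asynchronous semantics).

```python
def relax(grid, cycles):
    """Run `cycles` iterations of the outer sequential loop on an N x N grid.

    Interior points are updated in place from their four near neighbors,
    i.e. the stencil {(0, -1), (0, 1), (-1, 0), (1, 0)}. The averaging rule
    is a hypothetical stand-in for the real loop body.
    """
    n = len(grid)
    for _ in range(cycles):              # outer sequential loop: one cycle each
        for j in range(1, n - 1):        # inner data parallel loop over j
            for i in range(1, n - 1):    # inner data parallel loop over i
                # Right-hand side subscripts are a loop index plus or minus a
                # small constant; the left-hand side uses the plain indices.
                grid[i][j] = 0.25 * (grid[i][j - 1] + grid[i][j + 1]
                                     + grid[i - 1][j] + grid[i + 1][j])
    return grid
```

With a boundary held at 1.0 and an interior started at 0.0, repeated cycles drive the interior toward 1.0, which is a convenient sanity check for the sketch.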
In previous work [1, 16], we developed a theoretical 
framework for analyzing near-neighbor communication 
in iterative data parallel loops. Since we use the same 
framework in this paper, we summarize relevant work. 
Given an iterative parallel loop that updates a single, two- 
dimensional, global array as in Fig. 1, the communication 
is determined by the stencil, S, which is the set of offsets of the 
array accesses: 
$$S = \{(n_{h1}, n_{v1}), \ldots, (n_{hk}, n_{vk})\}.$$
For example, for Fig. 1, 
$$S = \{(0, -1), (0, 1), (-1, 0), (1, 0)\}.$$
Each offset pair in S is an access vector. Figure 2 
shows the access vectors for a square partition under the 
stencil {(1, 0), (2, 0)}. A loop partitioning, p, is a mapping 
from the iteration space to a processor identification 
number (PID). The communication is defined to be the 
number of data points read (in each cycle) by a particular 
processor that are computed by another processor. Note 
that the first offset in each pair induces communication 
across the horizontal plane, as in Fig. 2, while the second 
offset induces communication across the vertical plane. 
Therefore, stencil elements provide communication per 
unit length across horizontal and vertical partition borders. 
For rectangular partitions of dimensions $h \times v$, the 
communication is expressed as $C = n_h h + n_v v$, where $n_h$ 
and $n_v$ are measures of communication along the horizontal 
and vertical dimensions. These are obtained from S by 
applying either the additive or the max-min construction 
procedure. The calculation of communication weight per 
orientation using the additive construction assumes that 
each reference made to a point outside the partition requires 
interprocessor communication, and is given by 
$$n_h = \sum_{i=1}^{k} |n_{hi}|. \tag{3.1}$$
The calculation of communication weight using the max-min 
construction procedure assumes that interprocessor 
communication is only required once to establish a local 
copy of a datum and all successive reads (for the duration 
of the cycle) can be performed locally. In this case, the 
communication weight per orientation is given by 
$n_h = n_h^+ + n_h^-$, where $n_h^+ = \max(\{n_{hi} \mid n_{hi} \ge 0\} \cup \{0\})$ and 
$n_h^- = |\min(\{n_{hi} \mid n_{hi} < 0\} \cup \{0\})|$, where $n_{hi}$ is the first element of 
the ith access vector. The constructions for $n_v$ are analogous. 

FIG. 2. The partition exterior is heavily referenced. (Row labels in the figure: F = 1, 2, 3, 3, 2, 1.) 

For multiprocessor systems with caches, the communication 
weights are converted from the number of data 
points shared between processors to the number of cache 
lines shared between processors [1]. Let $l$ be the number 
of data points per cache line. Assuming column-major 
storage, the number of data points required per unit 
length along the horizontal boundary must be rounded up 
to the number of cache lines required per unit length, 
$n'_h = \lceil n_h / l \rceil$. For the vertical boundary we must compensate for 
the aggregation of multiple references into a single cache 
line, $n'_v = n_v / l$. For example, if $l = 4$ and $n_h = n_v = 1$, then 
$n'_h = 1$, since one cache line is required per unit length along 
the horizontal border, and $n'_v = 1/4 = 0.25$, since one of 
every four references along the vertical border requires a 
new cache line. 
Assuming that P, the number of processors, is fixed for 
all cycles, we prove in [1] that 
$$h_{opt} = N \sqrt{\frac{n_v}{n_h P}}, \qquad v_{opt} = N \sqrt{\frac{n_h}{n_v P}}. \tag{3.2}$$
For a machine with private caches, $n_h$ and $n_v$ are replaced 
by $n'_h$ and $n'_v$ in the above. A more detailed treatment of 
this work is available elsewhere [1, 16]. 
In a private-cache bus-based multiprocessor, the 
movement of data between cache and main memory is 
managed by the hardware and the specification of the 
loop partition, p, determines the communication overhead 
[16]. In a scalable multiprocessor such as the BBN 
Butterfly, there is an additional degree of flexibility because 
the data assignment can be specified by the software 
and affects the communication time. A data assignment, 
$\sigma$, is a mapping from $\{1, 2, \ldots, N\} \times \{1, 2, \ldots, N\} \to \{1, 2, \ldots, P\}$, 
i.e., a mapping from a data element to a 
processor, a. The element is stored in a memory that is 
closer to processor a than any other processor. In this 
paper, we determine a data assignment for a single global 
array based on the predetermined loop partition and the 
references made to the array in the code body. $T_c$, the 
communication time incurred by a processor on each cycle, 
is a function of the loop partition and the data assignment. 
There are two ways to reduce the impact of $T_c$ on 
the execution time of the loop. First, $T_c$ can be reduced 
by altering p and $\sigma$. Second, $T_c$ can be overlapped with 
the computation time of the loop. 
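The additive and max-min constructions, the cache-line conversion, and the rectangle dimensions of Section 3.2 can be sketched as follows. This is a minimal sketch, not the ADAPT implementation: the function names are hypothetical, each function takes the list of offsets for one orientation, and `optimal_rectangle` assumes that the optimum is the minimizer of $n_h h + n_v v$ subject to $hv = N^2/P$.

```python
import math

def additive_weight(offsets):
    """Additive construction: every out-of-partition reference communicates."""
    return sum(abs(n) for n in offsets)

def max_min_weight(offsets):
    """Max-min construction: a datum is fetched once per cycle, then reused.

    Returns (n_plus, n_minus, n_plus + n_minus).
    """
    n_plus = max([n for n in offsets if n >= 0] + [0])
    n_minus = abs(min([n for n in offsets if n < 0] + [0]))
    return n_plus, n_minus, n_plus + n_minus

def cache_line_weights(n_h, n_v, line_size):
    """Convert data-point weights to cache-line weights (column-major storage)."""
    return math.ceil(n_h / line_size), n_v / line_size

def optimal_rectangle(n, processors, n_h, n_v):
    """Dimensions h x v minimizing n_h*h + n_v*v with h*v = n*n/processors."""
    area = n * n / processors
    return math.sqrt(area * n_v / n_h), math.sqrt(area * n_h / n_v)

# First offsets of the Fig. 1 stencil {(0,-1), (0,1), (-1,0), (1,0)}.
fig1_h = [0, 0, -1, 1]
print(additive_weight(fig1_h))         # 2
print(max_min_weight(fig1_h))          # (1, 1, 2)
print(cache_line_weights(1, 1, 4))     # (1, 0.25), as in the worked example
```

For equal weights in both orientations the sketch returns a square part, e.g. 25 × 25 for N = 100 and 16 processors.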
4. DATA ASSIGNMENTS 
Three factors should be considered in specifying the 
data assignment. First, the consistency requirements 
may limit the possible locations for data. Second, subject 
to the first constraint, data that a processor references 
heavily should be assigned to the private levels of that 
processor. A third factor in determining data placement 
is the storage size available at each level in the hierarchy. 
Other researchers have developed efficient techniques 
for handling storage size limitations [13, 29] which can be 
applied following the use of our techniques. 
In the general data assignment problem, an element of 
the global array may be assigned to several distinct stor- 
age locations in the multiprocessor memory hierarchy. 
First, we restrict the problem by simplifying the memory 
hierarchy to consist of two levels, i.e., local and remote 
memory. Each processor has fast access to an associated 
local memory and slower access to the remote memory, 
which is similar to a TC2000 when the private caches are 
not considered. Second, we require that each element of 
the global array be present in precisely one of the proces- 
sors’ local memories. These two restrictions simplify the 
data assignment problem to a data partitioning problem. 
We consider the optimization of the one-to-one mapping 
from each element of the global array to a particular local 
memory. 
4.1. Data Partitioning 
We assume that a loop partition p has been specified to 
minimize the data points exchanged between processors 
using ADP. We denote the rectangle that has its lower 
left-hand corner at coordinates $(i_1, j_1)$ and its upper right-hand 
corner at coordinates $(i_2, j_2)$ as $[i_1 : i_2, j_1 : j_2]$. Therefore, 
the partition of the iteration space assigned to processor a 
under loop partition p is the rectangle $[i_1 : i_2, j_1 : j_2]$, 
with lower left-hand corner $(i_1, j_1)$ and upper right-hand 
corner $(i_2, j_2)$. The target set of y under S, where 
$y = (i, j) \in A$, the global matrix, and S is the stencil set, 
is $T_S(y) = \{(i + n_{h1}, j + n_{v1}), \ldots, (i + n_{hk}, j + n_{vk})\}$ and 
gives all data points required for the computation of y 
[16]. The read set, $\mathcal{R}(a, p)$, is the set of data which are 
read by processor a under loop partition p, 
$\mathcal{R}(a, p) = \{x \mid \exists y \text{ s.t. } p(y) = a \text{ and } x \in T_S(y)\}$. The write set, $\mathcal{W}(a, p)$, 
consists of data which are written by processor a under 
loop partition p. Note that $\mathcal{W}(a, p) = \{(i, j) \in N \times N \mid p(i, j) = a\}$ 
since each iteration updates A(i, j). The use 
set, $\mathcal{U}(a, p) = \mathcal{R}(a, p) \cup \mathcal{W}(a, p)$, consists of data that are 
read or written by processor a under the loop partition, p. 
The use reference frequency of a data point A(i, j) by a 
processor a is the number of times processor a references 
A(i, j), and is denoted $F_{(\mathcal{U},a)}(i, j)$. Consider a data partition 
$\sigma$ that maps each element of the global array to a 
single processor. The local references of processor a are 
$L(a) = \{(i, j) \in N \times N \mid \sigma(i, j) = a\}$. 
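The set definitions above translate directly into set comprehensions. The sketch below is illustrative only: the function names and the tuple encoding of a rectangular loop part as `(i1, i2, j1, j2)` are assumptions, and the part is taken to lie far enough from the array boundary that no clipping is needed.

```python
def target_set(point, stencil):
    """All data points required to compute `point` under the stencil."""
    i, j = point
    return {(i + nh, j + nv) for nh, nv in stencil}

def write_set(part):
    """Points written by the processor that owns the rectangular loop part."""
    i1, i2, j1, j2 = part
    return {(i, j) for i in range(i1, i2 + 1) for j in range(j1, j2 + 1)}

def read_set(part, stencil):
    """Points read by the owner: the union of target sets over its iterations."""
    reads = set()
    for y in write_set(part):
        reads |= target_set(y, stencil)
    return reads

def use_set(part, stencil):
    """Points read or written by the owner of the loop part."""
    return read_set(part, stencil) | write_set(part)

fig1_stencil = [(0, -1), (0, 1), (-1, 0), (1, 0)]
part = (2, 4, 2, 4)                       # a 3 x 3 loop part
print(len(write_set(part)))               # 9
print(len(read_set(part, fig1_stencil)))  # 21: the part plus one row or column per side
```

Note that the corner points just outside the part, such as (1, 1), are not in the read set for this stencil.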
For this section we assume a true partition of the data 
set, i.e., a division of the data set into mutually disjoint 
subsets. If $t_l$ is the latency to access local memory and $t_r$ 
is the latency to access remote memory, the latency for 
accesses to locations in L(a) is $t_l$, while the latency for 
accesses to all other locations is $t_r$. Then, $T_c$, the communication 
time, reduces to 
$$T_c = \sum_{(i,j) \in L(a)} F_{(\mathcal{U},a)}(i, j)\, t_l + \sum_{(i,j) \notin L(a)} F_{(\mathcal{U},a)}(i, j)\, t_r. \tag{4.1}$$
The objective of minimizing execution time is equiva- 
lent to the objective of minimizing communication time, 
since the loop partition evenly distributes a fixed amount 
of computational work among the processors. The fact 
that the loop partition is fixed also implies that 
$\sum_{(i,j) \in N \times N} F_{(\mathcal{U},a)}(i, j)$ is constant. Therefore, the performance 
is optimized when $\sum_{(i,j) \in L(a)} F_{(\mathcal{U},a)}(i, j)$ is 
maximized. 
Consider Fig. 2, which shows the target sets for a 
square partition under the stencil {(1, 0), (2, 0)}. The values 
of $F_{(\mathcal{U},a)}$ are given for various rows. Observe that the 
row across the top border of the square has a frequency 
access of 2, while the row along the bottom border has a 
frequency access of 1. Therefore, the data partition that 
maximizes local accesses should hold the row across the 
top border rather than the row along the bottom border. 
The data partition that maximizes local accesses is obtained 
by shifting the loop part $\{(i, j) \mid p(i, j) = a\}$ up by 
one row. 
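The row frequencies in this example can be reproduced with a short count. The sketch below assumes that each iteration on row i contributes one reference for its write to row i and one reference for each read of row i + n, which is consistent with the frequencies 1, 2, 3, 3, 2, 1 quoted for Fig. 2; the function names and the row range 10..15 are hypothetical.

```python
def row_use_frequency(i1, i2, offsets):
    """References per row (per unit length) made by iterations in rows i1..i2."""
    freq = {}
    for i in range(i1, i2 + 1):
        freq[i] = freq.get(i, 0) + 1              # the write to row i
        for n in offsets:
            freq[i + n] = freq.get(i + n, 0) + 1  # a read of row i + n
    return freq

def local_references(freq, i1, i2, shift):
    """References that become local when the data part is rows i1+shift..i2+shift."""
    return sum(freq.get(i, 0) for i in range(i1 + shift, i2 + shift + 1))

freq = row_use_frequency(10, 15, [1, 2])   # first offsets of {(1, 0), (2, 0)}
print([freq[i] for i in sorted(freq)])     # [1, 2, 3, 3, 3, 3, 2, 1]
print([local_references(freq, 10, 15, s) for s in (0, 1, 2)])  # [15, 16, 15]
```

Shifting by one row trades the bottom row (frequency 1) for the row above the top border (frequency 2), so it gains one local reference per unit length; a second shift gives that gain back.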
In general, the information provided by the stencil can 
determine how to shift the data partition. By restricting 
the discussion to rectangles, we can consider shifts along 
two dimensions: vertical shifts and horizontal shifts. 
We will discuss vertical shifts, the horizontal shifts 
being analogous. Consider a stencil $S = \{(n_{h1}, n_{v1}), \ldots, (n_{hk}, n_{vk})\}$ 
that is sorted by $n_{hi}$, i.e., $n_{h1} \le n_{h2} \le \cdots \le n_{hk}$. 
THEOREM 1. Let $r_n = |\{n_{hi} \text{ s.t. } n_{hi} < 0\}|$ (where the bar 
notation refers to set cardinality), $r_z = |\{n_{hi} \text{ s.t. } n_{hi} = 0\}|$, 
and $r_p = |\{n_{hi} \text{ s.t. } n_{hi} > 0\}|$. Upward shifts are made when 
$r_p > r_n + r_z$, and downward shifts are made when $r_n > r_p + r_z$. 
Assuming $|i_1 - i_2| \gg \max\{|n_{h1}|, |n_{hk}|\}$, the shift, 
s, from the loop part that maximizes the number of 
local references made by a processor, 
$\max_s \sum_{i=i_1+s}^{i_2+s} F_{(\mathcal{U},a)}(i)$, is 
$$s_{opt} = \begin{cases} n_{hi} \text{ where } i = \dfrac{k+1}{2}, & k \text{ odd,} \\[1ex] n_{hi} \text{ where } i = \dfrac{k}{2}, & k \text{ even.} \end{cases} \tag{4.2}$$
Proof. $r_n$ and $r_p$ represent the quantity of references 
made across the $i_1$ and $i_2$ borders, respectively. $r_z$ represents 
the volume of references made exclusively to the 
partition interior. Shifting is done in order to include data 
that are more heavily referenced than some data which 
are currently included. Vertical shifts can be done either 
in the upward or downward direction. An upward shift 
includes data higher than $i_2$ at the cost of excluding data 
near $i_1$. Therefore, upward shifts should be done when 
$r_p > r_n + r_z$. Similarly, downward shifts should be 
done when $r_n > r_z + r_p$. 
We focus on an upward shift, the downward shift 
being analogous. The number of local accesses is 
$$\max_s \left( \sum_{i=i_1}^{i_2} F_{(\mathcal{U},a)}(i) + \sum_{i=i_2+1}^{i_2+s} F_{(\mathcal{U},a)}(i) - \sum_{i=i_1+1}^{i_1+s} F_{(\mathcal{U},a)}(i - 1) \right).$$
Since $\sum_{i=i_1}^{i_2} F_{(\mathcal{U},a)}(i)$ is constant with respect to s, 
the objective is restated as 
$$\max_s \sum_{l=1}^{s} \left( F_{(\mathcal{U},a)}(i_2 + l) - F_{(\mathcal{U},a)}(i_1 - 1 + l) \right).$$
Assuming $|i_1 - i_2| \gg \max\{|n_{h1}|, |n_{hk}|\}$, 
we have $F_{(\mathcal{U},a)}(i_2 + l) = |\{n_{hi} \text{ s.t. } n_{hi} \ge l\}|$. Therefore, 
$F_{(\mathcal{U},a)}(i_2 + l)$ is a monotonically decreasing function of $l$, 
decreasing from $r_p$ for $l = 1$ to 0 for $l > n_{hk}$. Similarly, 
$F_{(\mathcal{U},a)}(i_1 - 1 + l) = |\{n_{hi} \text{ s.t. } n_{hi} < l\}|$ is a monotonically 
increasing function of $l$. Recalling that $|i_1 - i_2| \gg \max\{|n_{h1}|, |n_{hk}|\}$, 
$F_{(\mathcal{U},a)}(i_2 + l) - F_{(\mathcal{U},a)}(i_1 - 1 + l)$ is a monotonically 
decreasing function of $l$, decreasing from $r_p - (r_n + r_z)$ for 
$l = 1$ to $-(r_n + r_z + r_p)$ for $l > n_{hk}$. Therefore, the 
summation is maximized when 
$F_{(\mathcal{U},a)}(i_2 + s_{opt}) - F_{(\mathcal{U},a)}(i_1 - 1 + s_{opt}) \ge 0$ and 
$F_{(\mathcal{U},a)}(i_2 + s_{opt} + 1) - F_{(\mathcal{U},a)}(i_1 - 1 + s_{opt} + 1) < 0$, 
i.e., when $|\{n_{hi} \text{ s.t. } n_{hi} \ge s_{opt}\}| \ge |\{n_{hi} \text{ s.t. } n_{hi} < s_{opt}\}|$ 
and $|\{n_{hi} \text{ s.t. } n_{hi} \ge s_{opt} + 1\}| < |\{n_{hi} \text{ s.t. } n_{hi} < s_{opt} + 1\}|$. 
Since $s_{opt}$ partitions the $n_{hi}$ set into two subsets, the 
optimum shift, $s_{opt}$, is given by Eq. (4.2). And so the 
claim is shown. ∎ 
The above result can be extended as follows. Under 
the assumption that the loop part dimensions are much 
larger than the stencil, the optimal data partition within a 
constant factor involving the product of $\max(|n_{hi}|)$ and 
$\max(|n_{vi}|)$ is obtained by applying the optimal shifts in 
each direction as specified by Theorem 1. The proof is 
not included due to space limitations. The main result of 
this section is that for a multiprocessor memory hierar- 
chy consisting of just local and remote memories, a sim- 
ple procedure derived from Theorem 1 can be used to 
find the optimal data partition that minimizes communi- 
cation time for iterative parallel loops. 
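The procedure suggested by Theorem 1 can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the index for odd k is read as (k + 1)/2, and `shift_gain` evaluates the objective used in the proof for an upward shift (counts of offsets at or above l minus counts below l).

```python
def shift_direction(offsets):
    """Direction test from Theorem 1, based on the counts r_n, r_z, r_p."""
    r_n = sum(1 for n in offsets if n < 0)
    r_z = sum(1 for n in offsets if n == 0)
    r_p = sum(1 for n in offsets if n > 0)
    if r_p > r_n + r_z:
        return "up"
    if r_n > r_p + r_z:
        return "down"
    return "none"

def optimal_shift(offsets):
    """Eq. (4.2): the middle element of the sorted offsets (1-based index)."""
    ordered = sorted(offsets)
    k = len(ordered)
    index = (k + 1) // 2 if k % 2 else k // 2
    return ordered[index - 1]

def shift_gain(offsets, s):
    """Gain in local references for an upward shift s >= 0 (objective in the proof)."""
    return sum(
        sum(1 for n in offsets if n >= l) - sum(1 for n in offsets if n < l)
        for l in range(1, s + 1)
    )

offsets = [1, 2]                       # first offsets of {(1, 0), (2, 0)}
print(shift_direction(offsets))        # up
print(optimal_shift(offsets))          # 1
print([shift_gain(offsets, s) for s in range(4)])   # [0, 2, 2, 0]
```

For the stencil of Fig. 2 the formula returns a shift of one row, which attains the maximum gain; a shift of two rows ties under this objective.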
4.2. Hardware Redundancy 
In this section, we expand the scope of the data assign- 
ment problem to include multiprocessor memory hierar- 
chies with private caches. The private level of a memory 
hierarchy is only accessible to a particular processor, 
e.g., the private caches on the TC2000. In such a system, 
frequently accessed data can be copied into a private 
cache, thus introducing redundancy in storage, i.e., mul- 
tiple copies of a datum. This redundancy is referred to as 
hardware redundancy because it is largely managed by 
the hardware. Only one logical address is used for all the 
multiple copies of the datum. In contrast, software redundancy, 
which is discussed in Section 7, involves multiply 
addressed copies of the same datum in different local 
memories managed explicitly in software. 
In contrast to the data partitioning problem, where the 
partition only influenced the performance, hardware re- 
dundancy also involves correctness and consistency con- 
siderations. In the absence of hardware or software co- 
herency schemes, only those data elements that are used 
exclusively by a processor can be cached by that proces- 
sor. In this section, we assume that the data have been 
partitioned as described previously into the different lo- 
cal memories. We exploit the lower latency of the cache 
by selectively declaring regions of the data part assigned 
to a local memory to be cacheable. 
Every array location has read and write characteristics 
(with respect to a given processor) in the set {shared, 
exclusive, no}. For example, an array location which is 
exclusively read from and written to by a single proces- 
sor is an exclusive read-exclusive write location with 
respect to that processor. Potentially, locations could be 
classified into as many as nine different categories of 
read-write characteristics. However, only four of these 
categories are of interest in our current context. Since the 
single assignment property has been assumed, all catego- 
ries with the shared-write characteristic are eliminated. 
Cycle-by-cycle communication is only induced (assum- 
ing sufficient local storage) by array locations which are 
both read from and written to in a single cycle. Read-only 
arrays have fixed values across all cycles, and represent 
startup communication which should not influence parti- 
tioning decisions. 
Global array locations fall into one of four sets. A 
datum A(i, j) belongs to the exclusive read-write set (ERW) 
of processor a if $A(i, j) \in \mathcal{R}(a, p) \cap \mathcal{W}(a, p)$ and 
$A(i, j) \notin \mathcal{R}(q, p) \cup \mathcal{W}(q, p)$ for every processor $q \ne a$. A datum A(i, j) 
belongs to the shared read-exclusive write set (SREW) of processor 
a if $A(i, j) \in \mathcal{W}(a, p)$ and $A(i, j) \notin \mathcal{W}(q, p)\ \forall q \ne a$ 
and $\exists q \ne a$ s.t. $A(i, j) \in \mathcal{R}(q, p)$. A datum A(i, j) belongs 
to the shared read-no write set (SRNW) of processor a if 
$A(i, j) \in \mathcal{R}(a, p)$ and $A(i, j) \notin \mathcal{W}(a, p)$ and $\exists q \ne a$ s.t. 
$A(i, j) \in \mathcal{W}(q, p)$. In addition, the data not used by processor a 
belong to the no read-write set (NRW) of processor a. 
An important feature of our scheme is that these sets are 
identified at compile-time as a function of the array dimensions, 
the number of processors, and the communication 
parameters $(n_h^+, n_h^-, n_v^+, n_v^-)$. 
THEOREM 2. Given a loop part of processor a, $p(a) = [i_1 : i_2, j_1 : j_2]$ 
of dimensions $h \times v$, the ERW set is 
$[i_1 + n_h^+ : i_2 - n_h^-,\ j_1 + n_v^+ : j_2 - n_v^-]$, the SREW set is 
$p(a) \setminus \mathrm{ERW} = [i_1 : i_2, j_1 : j_2] \setminus [i_1 + n_h^+ : i_2 - n_h^-,\ j_1 + n_v^+ : j_2 - n_v^-]$, 
the SRNW set is contained by 
$[i_1 - n_h^- : i_2 + n_h^+,\ j_1 - n_v^- : j_2 + n_v^+] \setminus [i_1 : i_2, j_1 : j_2]$, 
and the NRW set is a superset 
of $N \times N \setminus (\mathrm{ERW} \cup \mathrm{SRNW} \cup \mathrm{SREW})$. 
Proof. From the code construct, observe that every 
iteration writes only one data element. Therefore, the 
write set, $\mathcal{W}(a, p) = p(a) = [i_1 : i_2, j_1 : j_2]$, is the exclusive 
write set (EW). A subset of the exclusive write set is the shared 
read-exclusive write set, which is given by 
$$\{(i, j) \in [i_1 : i_2, j_1 : j_2] \mid \exists (\iota, \kappa) \notin [i_1 : i_2, j_1 : j_2] \text{ s.t. } (i, j) \in T_S(\iota, \kappa)\}.$$
$(\iota, \kappa)$ fits at least one of the following criteria: $\iota < i_1$, or 
$\iota > i_2$, or $\kappa < j_1$, or $\kappa > j_2$. Therefore, the shared 
read-exclusive write set is 
$$\{(i, j) \in [i_1 : i_2, j_1 : j_2] \mid \exists (\iota, \kappa) \text{ with } \iota < i_1 \text{ or } \iota > i_2 \text{ or } \kappa < j_1 \text{ or } \kappa > j_2 \text{ s.t. } (i, j) \in T_S(\iota, \kappa)\}$$
$$= \{(i, j) \in [i_1 : i_2, j_1 : j_2] \mid i < i_1 + n_h^+ \text{ or } i > i_2 - n_h^- \text{ or } j < j_1 + n_v^+ \text{ or } j > j_2 - n_v^-\}$$
$$= [i_1 : i_2, j_1 : j_2] \setminus [i_1 + n_h^+ : i_2 - n_h^-,\ j_1 + n_v^+ : j_2 - n_v^-].$$
The above expressions of the SREW set and the EW 
set yield the expression for the ERW set. The shared 
read-no write set is 
$$\{(i, j) \notin [i_1 : i_2, j_1 : j_2] \mid \exists (\iota, \kappa) \in [i_1 : i_2, j_1 : j_2] \text{ s.t. } (i, j) \in T_S(\iota, \kappa)\},$$
which, from the definition of the communication parameters, 
implies that at least one of the following is true: 
$i \le i_2 + n_h^+$, or $i \ge i_1 - n_h^-$, or $j \le j_2 + n_v^+$, or $j \ge j_1 - n_v^-$. 
And so the claim is shown. ∎ 
For example, consider the stencil S = {(3, 2), (1, 3), 
(-1, -3), (-2, 3)}. Observe that $n_h^+ = 3$, $n_h^- = 2$, $n_v^+ = 3$, 
$n_v^- = 3$. Consider Fig. 3, where $p(a) = [i_1 : i_2, j_1 : j_2]$ and 
the ERW, SREW, and SRNW sets are shown. The empty 
corners of the outermost rectangle correspond to regions 
of the NRW set which are included in our approximation 
of the SRNW set as demonstrated in Theorem 2. 
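The compile-time classification of Theorem 2 can be sketched as a pair of small functions. The names `comm_parameters` and `classify`, and the encoding of the loop part as `(i1, i2, j1, j2)`, are assumptions made for illustration; the SRNW test uses the bounding-rectangle approximation of the theorem, so corner points that are really in the NRW set are reported as SRNW.

```python
def comm_parameters(stencil):
    """Return (n_h_plus, n_h_minus, n_v_plus, n_v_minus) for a stencil."""
    hs = [nh for nh, _ in stencil]
    vs = [nv for _, nv in stencil]
    return (max(hs + [0]), abs(min(hs + [0])),
            max(vs + [0]), abs(min(vs + [0])))

def classify(point, part, params):
    """Classify a data point with respect to the owner of the loop part."""
    i, j = point
    i1, i2, j1, j2 = part
    hp, hm, vp, vm = params
    if i1 <= i <= i2 and j1 <= j <= j2:                  # exclusive write set
        if i1 + hp <= i <= i2 - hm and j1 + vp <= j <= j2 - vm:
            return "ERW"
        return "SREW"
    if i1 - hm <= i <= i2 + hp and j1 - vm <= j <= j2 + vp:
        return "SRNW"       # approximation: includes the empty corners
    return "NRW"

print(comm_parameters([(3, 2), (1, 3), (-1, -3), (-2, 3)]))   # (3, 2, 3, 3)

params = comm_parameters([(0, -1), (0, 1), (-1, 0), (1, 0)])  # Fig. 1 stencil
part = (1, 25, 1, 25)                                         # a 25 x 25 loop part
labels = [classify((i, j), part, params)
          for i in range(-5, 32) for j in range(-5, 32)]
print(labels.count("ERW"), labels.count("SREW"), labels.count("SRNW"))  # 529 96 104
```

Because the test for each class is a handful of comparisons on the part corners and the four communication parameters, the regions can be emitted as array bounds at compile time rather than computed per element.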
Our compile-time analysis of communication and the 
subsequent partitioning of the array into four sets with 
respect to each processor permits the introduction of ca- 
ching schemes. The simpler scheme only caches the 
ERW set and does not require cache invalidates. The 
more sophisticated scheme achieves even smaller com- 
munication time by caching the entire exclusive write set 
and using cache invalidates to maintain consistency, i.e., 
by flushing the SREW from the cache at the end of each 
cycle, thus updating the copy of the SREW set in main 
memory. 
Let us first analyze the simpler scheme without cache 
invalidates. In this scheme, at the beginning of the paral- 
lel section of the program, an array of the appropriate 
dimension is allocated by each processor in its local 
memory and declared cacheable. Each processor copies 
the portion of the array corresponding to its ERW set into this local array. Also, each processor allocates a noncacheable array for storing its SREW in its local memory. Subsequent references during the execution of the parallel section are made to the local copies. The communication time is reduced to
\[ T_c = \sum_{(i,j) \in \mathrm{ERW}} F_{(\sigma,a)}(i, j)\, t_c + \sum_{(i,j) \in \mathrm{SREW}} F_{(\sigma,a)}(i, j)\, t_l + \sum_{(i,j) \in \mathrm{SRNW}} F_{(\sigma,a)}(i, j)\, t_r. \]
If cache invalidates are used, the communication time is expressed as
\[ T_c = \sum_{(i,j) \in \mathrm{ERW} \cup \mathrm{SREW}} F_{(\sigma,a)}(i, j)\, t_c + \sum_{(i,j) \in \mathrm{SRNW}} F_{(\sigma,a)}(i, j)\, t_r, \]
ignoring the time required for cache invalidates.

FIG. 3. The regions of a data set.
Consider the example presented in Fig. 1. Assuming a
square partitioning with parts of size 25 x 25, and using
the times supplied by BBN, i.e., $t_r$ = 1.89 μs, $t_l$ = 0.60
μs, and $t_c$ = 0.15 μs, Eq. (4.1) yields $T_c$ = 571.56 μs per
cycle, the expression for the scheme without cache invalidates
yields $T_c$ = 331.51 μs per cycle, and the expression for the
scheme with cache invalidates yields $T_c$ = 290.31 μs per cycle.
Clearly, the ability to exploit an efficient loop partitioning 
strategy by moving large quantities of data high into the 
memory hierarchy has a significant impact on communi- 
cation time. 
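
The two caching schemes differ only in the access time charged to references to the SREW set. The sketch below makes this explicit. It is a minimal illustration, not the model used to obtain the figures above: the function name `comm_time_us` is ours, and the reference counts passed in the usage line are hypothetical, since the counts for the loop of Fig. 1 depend on code not reproduced in this section.

```python
T_REMOTE_US = 1.89  # remote memory access time quoted in the text
T_LOCAL_US = 0.60   # local memory access time quoted in the text
T_CACHE_US = 0.15   # cache access time quoted in the text


def comm_time_us(refs_erw, refs_srew, refs_srnw, use_invalidates):
    """Communication time per cycle, in microseconds.

    refs_* are the total reference counts to each set (the sums of the
    per-element reference counts). Without cache invalidates the SREW
    set lives in noncacheable local memory; with invalidates it is
    cached. The cost of the invalidates themselves is ignored.
    """
    srew_cost = T_CACHE_US if use_invalidates else T_LOCAL_US
    return (refs_erw * T_CACHE_US
            + refs_srew * srew_cost
            + refs_srnw * T_REMOTE_US)


# Hypothetical counts: 100 ERW, 10 SREW, and 5 SRNW references.
simple = comm_time_us(100, 10, 5, use_invalidates=False)
with_flush = comm_time_us(100, 10, 5, use_invalidates=True)
```

With these hypothetical counts the second scheme is cheaper by 10 x (0.60 - 0.15) = 4.5 μs, the amount saved by serving the SREW references from the cache.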
5. EXPLOITING OVERLAP 
Research on iterative parallel loops has focused on re- 
ducing communication time through exploitation of the 
memory hierarchy. An alternative approach to reducing 
the overhead of communication is to overlap computa- 
tion and communication time. However, aspects of a pro- 
cessor’s architecture can limit the maximum achievable 
overlap. For instance, the number of available registers 
may be too few to hold the partial results of many loop 
iterations, or the processor may lack special hardware 
required for a large maximum overlap (e.g., separate 
ports to local and global memory). In practice, program- 
ming and compilation techniques also influence the over- 
lap achieved. Our objective is to focus on both reducing 
communication time and increasing overlap to reduce 
communication overhead. 
Results by Callahan et al. [5] and Mangione-Smith et
al. [21] are useful in the subsequent discussion. A proces-
sor’s resources are broadly classified into compute and
access resources [5]. The maximum performance of a
particular loop is achieved once one of the resources is 
fully utilized. Accordingly, loops are classified into com- 
pute-bound and memory-bound loops. In our framework, 
loop iterations can be similarly classified as compute- 
bound if all data references can be satisfied by the cache 
or communication-bound if some data references require 
local or remote memory accesses. 
The compute-bound set is the subset of $p(a) = [i_1 : i_2, j_1 : j_2]$ whose use set is contained by the ERW set of processor $a$, and is given by $\Xi = [i_1 + n_h : i_2 - n_h, j_1 + n_v : j_2 - n_v]$. $\Xi$ is much larger than the rest of the iterations to be performed, $\Gamma = p(a) \setminus \Xi$. We propose a compiler which schedules code for a communication-bound iteration from $\Gamma$ together with a sufficient number of compute-bound iterations from $\Xi$. We refer to such a group of iterations as a node. Code scheduling within a node orders the instructions as follows: loads for the iterations from $\Xi$ (which are all satisfied by the cache), loads for the iterations from $\Gamma$ (which may require long latencies), execution of the iterations from $\Xi$ (which are executed simultaneously with the loads of $\Gamma$), and execution of the iterations from $\Gamma$. Finally, the results computed by the iterations in the node are stored.
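
The grouping of iterations into nodes and the instruction order within a node can be sketched as follows. This is an illustration under our own assumptions rather than the scheduler of any actual compiler: the names `build_nodes` and `schedule` are hypothetical, iterations are paired in row-major order, and each node receives a fixed number `k` of compute-bound iterations.

```python
from itertools import islice


def build_nodes(part, n_h, n_v, k):
    """Pair each communication-bound iteration with k compute-bound ones.

    part is (i1, i2, j1, j2), inclusive. The compute-bound set is the
    part shrunk by n_h in i and by n_v in j, following the expression
    in the text. Returns (nodes, leftover compute-bound iterations).
    """
    i1, i2, j1, j2 = part
    compute_bound, comm_bound = [], []
    for i in range(i1, i2 + 1):
        for j in range(j1, j2 + 1):
            inside = (i1 + n_h <= i <= i2 - n_h
                      and j1 + n_v <= j <= j2 - n_v)
            (compute_bound if inside else comm_bound).append((i, j))
    supply = iter(compute_bound)
    nodes = [(g, list(islice(supply, k))) for g in comm_bound]
    return nodes, list(supply)


def schedule(node):
    """Order the operations of one node as described in the text."""
    comm_iter, compute_iters = node
    return ([("load", it) for it in compute_iters]      # cache hits
            + [("load", comm_iter)]                     # long latency
            + [("exec", it) for it in compute_iters]    # overlaps the load
            + [("exec", comm_iter)]
            + [("store", it) for it in compute_iters + [comm_iter]])
```

For a hypothetical 10 x 10 part with `n_h = n_v = 1` and `k = 1`, this yields 36 nodes and leaves 28 compute-bound iterations to be executed outside any node.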
Despite a lack of hardware support, the overlapping of 
computation and communication can be accomplished on 
the BBN TC2000 through special code scheduling. The 
Motorola 88000 [22] issues one instruction on every cy- 
cle, unless there is a stall in the instruction issue unit. The 
instruction issue unit stalls when it must dispatch to a 
pipe which is full. The memory access pipe on the 88000 
has three stages. The pending accesses to local and re- 
mote memory wait at the third stage for their data. 
Should the load pipe be filled with two memory opera- 
tions that are waiting behind a pending memory opera- 
tion, the instruction issue unit stalls on another memory 
access instruction. The feature of the 88000 that influ- 
ences our methods for reducing communication overhead 
is the overlap possible between the access and floating 
point units that enables us to execute floating-point oper- 
ations while waiting for memory. 
Some architectural specifics of the Motorola 88000 
point to fundamental limits to maximum overlap. The 
memory system of the 88000 operates in a pipelined fash- 
ion. The in-order operation of memory requires that all 
loads for a computation to be overlapped with a remote 
memory access must be issued before the remote access 
is issued. In order to avoid waiting on the remote mem- 
ory access, data for compute-bound iterations must be in 
registers before initiating the remote load. The number of 
CPU registers seriously limits the maximum number of 
compute-bound iterations that can be overlapped, as the 
following analysis indicates. 
Assume that a node consists of one communication- 
bound iteration requiring a single remote load and several 
compute-bound iterations. A remote load requires at 
least 38 cycles to complete. This latency is completely 
overlapped only if 38 floating point operations whose op- 
erands are already in registers can be issued. Assuming 
four floating point operations and four operands per com- 
pute-bound iteration for the code in Fig. 2, at least 38 
registers are required. Composing a node using a square 
compute-bound tile of size 3 x 3 with each communica- 
tion-bound iteration will reduce register requirements to 
approximately 25, but this introduces additional compila- 
tion complexity [4]. 
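
The register estimate above is simple arithmetic, reproduced in the sketch below. The function name `registers_to_hide` is ours, and two simplifications are assumed: one floating point operation issues per cycle, and no operand is shared between iterations. Rounding up to whole iterations is also our assumption, and it gives a slightly larger figure than the lower bound of 38 quoted in the text.

```python
import math


def registers_to_hide(latency_cycles, flops_per_iter, operands_per_iter):
    """Estimate the registers needed to cover a remote load.

    Returns (iterations, lower_bound, whole_iteration_count), where
    lower_bound counts operands for exactly latency_cycles operations
    and whole_iteration_count rounds up to complete iterations.
    """
    iterations = math.ceil(latency_cycles / flops_per_iter)
    lower_bound = math.ceil(
        latency_cycles * operands_per_iter / flops_per_iter)
    return iterations, lower_bound, iterations * operands_per_iter


# 38-cycle remote load, four operations and four operands per iteration.
estimate = registers_to_hide(38, 4, 4)
```

Under these assumptions, hiding the 38-cycle latency takes 10 compute-bound iterations, with a lower bound of 38 operand registers and 40 if whole iterations are counted.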
6. SOFTWARE REDUNDANCY 
In software redundancy, the data assignment is ex- 
tended to create additional copies of the data in the local 
memories of individual processors. Consistency is main- 
tained by inserting separate stores in the instruction 
stream for each update. Software redundancy is a natural 
extension to the hardware redundancy already exploited 
to reduce latencies; e.g., one datum may be simulta- 
neously in a memory location and in cache. In this sec- 
tion, we permit the data assignment to be a one-to-many 
mapping from the elements of the global array to proces- 
sors. A particular element may appear in several local 
memories. We concern ourselves with maximum software redundancy, which is the replication of data elements so that each processor has a local copy of all elements in its use set. The corresponding data assignment, $\sigma$, maps $(i, j)$ to those processors that access $(i, j)$, and is the inverse of the use set mapping, $\mathcal{U}(a, p)$: $\sigma(i, j) = \{a \mid (i, j) \in \mathcal{U}(a, p)\}$.
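
The maximum software redundancy assignment can be built by inverting the use sets directly. The sketch below is illustrative: the names `use_set` and `max_redundancy_assignment` are hypothetical, and we assume that the use set of a part contains every element its iterations read through the stencil together with the elements they write.

```python
from collections import defaultdict


def use_set(part, stencil):
    """Elements touched by the iterations of part = (i1, i2, j1, j2)."""
    i1, i2, j1, j2 = part
    elements = set()
    for i in range(i1, i2 + 1):
        for j in range(j1, j2 + 1):
            elements.add((i, j))                      # element written
            for di, dj in stencil:
                elements.add((i + di, j + dj))        # elements read
    return elements


def max_redundancy_assignment(parts, stencil):
    """Map each element to the set of processors holding a copy.

    parts maps a processor id to its part. The result is the inverse
    of the use set mapping: an element is assigned to every processor
    whose use set contains it.
    """
    sigma = defaultdict(set)
    for proc, part in parts.items():
        for element in use_set(part, stencil):
            sigma[element].add(proc)
    return dict(sigma)
```

For two hypothetical 4 x 4 parts placed side by side and a five-point stencil, the elements in the two columns adjoining the shared border are assigned to both processors, while all other elements are assigned to one.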
Two factors to be considered in using software redundancy are the extra memory space and additional consistency updates it incurs. In the following, we quantify each of these factors. The amount of memory space allocated per processor is $|\mathcal{U}(a, p)|$ as compared to $|L(a)|$ previously. Note that $|\mathcal{U}(a, p)| = |[i_1 - n_h^- : i_2 + n_h^+, j_1 - n_v^- : j_2 + n_v^+]| = (v + n_h)(h + n_v)$, while $|L(a)| = hv$. Therefore, the additional storage required is $|\mathcal{U}(a, p)| - |L(a)| = v n_v + h n_h + n_h n_v$ and the fractional increase in memory requirements is
\[ \frac{v n_v + h n_h + n_h n_v}{hv}. \qquad (6.1) \]
Observe that, since $h$ and $v$ are typically at least an order of magnitude larger than $n_h$ and $n_v$, the value computed by Eq. (6.1) is small. Since our analysis accurately identifies elements used by a particular processor, even maximum software redundancy only marginally increases storage requirements.
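
Eq. (6.1) is easy to evaluate for concrete part sizes. The sketch below is a direct transcription; the function name and the sample values are ours and are chosen only for illustration.

```python
def fractional_storage_increase(h, v, n_h, n_v):
    """Fractional increase in per-processor storage, Eq. (6.1)."""
    extra = v * n_v + h * n_h + n_h * n_v
    return extra / (h * v)


# Hypothetical parts with n_h = n_v = 2.
small_part = fractional_storage_increase(25, 25, 2, 2)
large_part = fractional_storage_increase(100, 100, 2, 2)
```

For a 25 x 25 part the increase is about 17%, and for a 100 x 100 part it drops to about 4%, which illustrates how the overhead shrinks as the part grows relative to the communication parameters.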
Let us examine the impact of maximum software re- 
dundancy on consistency traffic. 
LEMMA 1. When using maximum software redundancy, the increase in the number of local loads is $h n_h^a + v n_v^a$, where $n_h^a$, $n_v^a$ are quantified by Eq. (3.1) and the increase in the number of remote stores is $h n_h + v n_v$ and the savings in communication time is $(h n_h^a + v n_v^a)(t_r - t_l) - (h n_h + v n_v) t_r$.
Proof. The increase in the number of local loads is equal to the number of read references made by a processor to its SRNW set. The communication parameters obtained by the additive construction procedure quantify the number of such references per unit length of the part border [1, 16]. Therefore, the increase in the number of local loads is $h n_h^a + v n_v^a$. There is a corresponding decrease in the number of remote loads, resulting in a net decrease in communication time of $(h n_h^a + v n_v^a)(t_r - t_l)$.
A data element in the SRNW set is computed by another processor. When the maximum software redundancy data assignment scheme is used, such points require consistency updates by another processor. Therefore, the number of points in the SRNW set is equal to the number of extra stores required to implement this scheme. The size of the SRNW set is approximately $h n_h + v n_v$. Therefore, the communication time is increased by $(h n_h + v n_v) t_r$. ∎
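
The trade-off stated in Lemma 1 can be evaluated numerically. The sketch below is illustrative: the function name is ours, the parameters `nh_refs` and `nv_refs` stand for the reference-count parameters of Eq. (3.1), which is not reproduced in this section, and the values in the usage line are hypothetical. As in the lemma, a remote store is charged the remote access time.

```python
def redundancy_savings_us(h, v, nh_refs, nv_refs, n_h, n_v,
                          t_r=1.89, t_l=0.60):
    """Net savings in communication time per cycle (microseconds).

    Loads of shared points move from remote to local memory, which
    saves (t_r - t_l) per reference; each shared point then needs one
    extra remote store, which costs t_r.
    """
    loads_moved = h * nh_refs + v * nv_refs
    extra_stores = h * n_h + v * n_v
    return loads_moved * (t_r - t_l) - extra_stores * t_r


# Hypothetical 25 x 25 part: three references and two distinct shared
# points per unit of border length in each direction.
savings = redundancy_savings_us(25, 25, 3, 3, 2, 2)
```

With these hypothetical values the 150 relocated loads save 193.5 μs and the 100 extra stores cost 189 μs, for a net gain of 4.5 μs per cycle. The gain is positive only when shared points are referenced often enough to pay for their consistency updates.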
For data which are written by one processor and read 
by another, the consistency is maintained on the TC2000 
among the multiple copies by the processor performing 
the update. The traditional method of transferring data 
between processors is a demand-based, or pull, protocol. 
Software redundancy replaces this with a push protocol. 
A processor which is writing a datum back to memory 
has a list of storage locations which also must be up- 
dated. The processor updates its copy of the datum, 
along with all copies on the list. The net effect is that the 
processor “pushes” the new data value into the local 
memory of the processor which is waiting on the datum. 
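
A push protocol of this kind can be modeled in a few lines. The sketch below is a toy model, not code for the TC2000: the class `PushMemory` and its methods are hypothetical names, and local memories are represented as dictionaries.

```python
class PushMemory:
    """Toy model of per-processor local memories with pushed updates."""

    def __init__(self, n_procs):
        self.local = [dict() for _ in range(n_procs)]
        self.copies = {}          # element -> processors holding a copy
        self.remote_stores = 0    # consistency updates sent to others

    def replicate(self, element, owner, readers):
        """Record that owner and readers each keep a copy of element."""
        self.copies[element] = [owner] + list(readers)

    def store(self, writer, element, value):
        """Update the writer's copy and push the value to all others."""
        for proc in self.copies[element]:
            self.local[proc][element] = value
            if proc != writer:
                self.remote_stores += 1

    def load(self, proc, element):
        """Read from the local copy; no remote access is needed."""
        return self.local[proc][element]


mem = PushMemory(n_procs=2)
mem.replicate((1, 3), owner=0, readers=[1])
mem.store(0, (1, 3), 7.5)
value_seen_by_reader = mem.load(1, (1, 3))
```

After the store by processor 0, processor 1 reads the new value from its own local memory, at the cost of one remote store issued by the writer.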
7. ADAPT 
The ADAPT (automatic data allocation and partition- 
ing tool) system generates code to automatically manage 
data assignments for iterative parallel loops. ADAPT 
consists of a set of routines which are implemented 
within the PAT (Parallelizing Assistant Tool) system de- 
veloped at Georgia Tech by Appelbe et al. [27]. Existing 
facilities within PAT were used to identify triply nested 
loops in sequential FORTRAN code and to analyze the 
code body of the innermost loop for array references. 
The ADAPT routines analyze these references to deter- 
mine the optimal aspect ratio. In addition, BBN parallel 
FORTRAN is generated to implement a partitioning 
based on the size of the iteration space, the number of 
processors, and the aspect ratio. 
In addition to partitioning, ADAPT also exploits the 
memory hierarchy of the BBN by declaring shared arrays 
to be cacheable. ADAPT uses the access vectors ob- 
tained from an analysis of the input code to determine the 
SREW and SRNW sets for each part. Cache flush in- 
structions are then inserted into the code to flush the 
SREW and SRNW sets after each cycle. The false shar- 
ing of data between processors introduced by cache lines 
further complicates consistency maintenance and will be 
discussed shortly. 
7.1. ADAPT’s Preamble and Run-Time Partitioning 
Assuming the stencil set is fixed, partitioning at com- 
pile time is desirable in order to avoid the expense of 
communication analysis at run time. However, a strictly 
static partitioning approach is not practical for many pro- 
grams where the array and loop bounds, as well as the 
number of processors, are not known at compile time. 
ADAPT’s solution to this dilemma lies in the recognition 
of the partitioning problem as two distinct phases: com- 
munication analysis (i.e., determination of the optimal 
aspect ratio from the access vectors) and partition gener- 
ation (i.e., the determination of the part boundaries for 
each processor executing the parallel loops). At compile 
time, ADAPT performs communication analysis and col- 
lects other information from the program which is re- 
quired for partition generation, i.e., the bounds of the 
parallel loops (which may be expressions that cannot be 
evaluated at compile time). This information is placed 
within a preamble which is inserted just in front of the 
parallel loop in the output code generated by ADAPT. At 
run time, each processor executes the preamble prior to 
the first cycle, thus completing partition generation. 
Eq. (3.2) gives the dimensions of a rectangular partition with parts of a given size (i.e., $N^2/P$) that has the minimum value of $C = h n_h + v n_v$. However, the ability of the partition to tessellate the iteration space is not guaranteed. The partition generation algorithm of the ADAPT preamble takes a different approach: it considers the set of rectangular partitions that tessellate the iteration space, and selects the one with the aspect ratio that is closest to the optimal aspect ratio.
Assume an $N \times M$ iteration space. For a given number of processors, $P$, the set of rectangular partitions which tessellate can be generated from the set of divisors of $P$. Let $qr = P$, and assume (for the moment) that $q$ divides $N$ and $r$ divides $M$. In a treatment similar to the OPTAL algorithm of Polychronopoulos [24], the rows of the iteration space are assigned into $q$ classes and the columns are assigned into $r$ classes. The part to be executed by processor $p$ is located in row class $\lfloor p/r \rfloor$ and in column class $(p \bmod r)$. For such a partitioning, the aspect ratio is $Nq/(Mr)$. The preamble of ADAPT examines this aspect ratio for all possible values of $q$ and $r$. The values of $q$ and $r$ for which the aspect ratio is closest to the optimal aspect ratio, i.e., $h_{opt}/v_{opt}$, are chosen as the partition.
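
The search performed by the preamble can be sketched as follows. This is an illustration under stated assumptions, not ADAPT's generated code: the function names are ours, the aspect ratio of a candidate is computed with the expression given in the text, and "closest" is taken to mean smallest absolute difference, which the text does not specify.

```python
def choose_partition(n_rows, n_cols, n_procs, optimal_ratio):
    """Pick (q, r) with q * r == n_procs and aspect ratio closest
    to optimal_ratio. q is the number of row classes and r the
    number of column classes."""
    best = None
    for q in range(1, n_procs + 1):
        if n_procs % q:
            continue                      # q must divide n_procs
        r = n_procs // q
        ratio = (n_rows * q) / (n_cols * r)
        error = abs(ratio - optimal_ratio)
        if best is None or error < best[0]:
            best = (error, q, r)
    return best[1], best[2]


def part_classes(p, r):
    """Row class and column class of the part run by processor p."""
    return p // r, p % r


# A 200 x 200 iteration space on 16 processors.
q, r = choose_partition(200, 200, 16, optimal_ratio=0.25)
```

For a 200 x 200 iteration space on 16 processors and a target aspect ratio of 0.25, the search selects 2 row classes and 8 column classes. With a target of 1 it selects the 4 x 4 arrangement.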
Now consider the case when $q$ does not divide $N$ evenly, i.e., let $N \bmod q = o_q$, where $o_q \neq 0$. In such a case, load imbalance is introduced. The first $o_q$ row classes contain an extra row while the remaining $q - o_q$ row classes contain $\lfloor N/q \rfloor$ rows. The case when $r$ does not divide $M$ evenly is handled analogously. Under these assumptions the maximum number of iterations that any processor must execute in addition to the original $\lfloor N/q \rfloor \lfloor M/r \rfloor$ iterations is $\lceil N/q \rceil + \lceil M/r \rceil - 1$. By replacing the integer-valued functions with real-valued functions, the maximum relative load imbalance is
\[ \frac{N/q + M/r - 1}{(N/q)(M/r)}, \]
which simplifies to $(Nr + Mq - qr)/(NM)$. And, since values of $N$ and $M$ are typically orders of magnitude greater than $q$ and $r$, the relative load imbalance introduced by ADAPT’s partitioning scheme is usually negligible.
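
The class sizes and the resulting imbalance can be checked with a short calculation. The sketch below uses hypothetical function names and small sample sizes chosen for illustration.

```python
def class_sizes(n, classes):
    """Sizes of the classes when n rows (or columns) are split.

    The first (n mod classes) classes receive one extra row.
    """
    base, extra = divmod(n, classes)
    return [base + 1] * extra + [base] * (classes - extra)


def relative_imbalance(n_rows, n_cols, q, r):
    """Real-valued bound (N r + M q - q r) / (N M) from the text."""
    return (n_rows * r + n_cols * q - q * r) / (n_rows * n_cols)


rows = class_sizes(10, 3)     # 10 rows in 3 row classes
cols = class_sizes(7, 2)      # 7 columns in 2 column classes
largest = max(rows) * max(cols)
smallest = min(rows) * min(cols)
```

Here the largest part holds 16 iterations and the smallest holds 9, a difference of 7, which equals the sum of the two rounded-up class sizes minus one (4 + 4 - 1). For a 200 x 200 space split 4 x 4, the bound evaluates to about 4%.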
7.2. False Sharing and Consistency 
In order to exploit the memory hierarchy of the BBN 
TC2000, ADAPT declares the global arrays used within 
the iterative parallel loops to be cacheable. Since the 
cache is a private level of the TC2000 memory hierarchy, 
automatic consistency of data is not provided. ADAPT 
maintains the consistency of data in software through the 
use of cache flush instructions. ADAPT estimates the 
SREW and SRNW sets for each partition using the ac- 
cess vectors obtained from the input code and Theorem 
2. ADAPT then generates cache flush instructions to 
flush the SREW and SRNW sets of each part. These 
instructions are inserted after the code responsible for 
updating matrices and are synchronized using a barrier, 
so the activities enforced by ADAPT within a single cycle 
are: (1) Update partition elements, reading most recently 
updated copies of shared data elements from memory. (2) 
Flush shared data elements into global memory. (3) Syn- 
chronize at a barrier to prevent processors from begin- 
ning the next cycle before the shared data is resident in 
global memory. 
Assuming column major storage, contiguous data 
points within a column are located in contiguous memory 
locations. For the Motorola 88000 processors used in the 
BBN TC2000, the cache line size is 16 bytes and the 
floating point data type is 4 bytes long. Therefore, assum- 
ing the matrix begins on a cache line boundary, exactly 
four data elements from the global matrix are contained 
on a single cache line. This inclusion of four data points 
on a single cache line complicates the consistency main- 
tenance on the BBN TC2000. For example, consider a 
cache line, the first of whose points is updated by a par- 
ticular processor while the remaining three points are 
updated by another processor. During the update of their 
respective partitions, the two processors read data ele- 
ments from the locations contained on the cache line and 
a copy of the cache line is created in each processor’s 
cache. The first processor updates the first element on 
the cache line, and now possesses a line containing one 
current value and three “stale” values. Similarly, the 
second processor possesses a cache line containing one 
stale value and three current values. After the partition 
updates, the cache lines are flushed back into global 
memory. Assume that the first processor flushes the 
cache line, followed by the second processor. The flush 
performed by the second processor replaces the value in 
memory that was updated by the first processor with the 
stale value possessed by the second processor. Indeed, 
regardless of the order in which the two processors flush 
their cache lines, stale data will reside in memory. 
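
The lost update described above can be reproduced with a toy model of a single four-word cache line. The sketch below is illustrative only: the function `run` is a hypothetical name, and a flush is modeled as a write-back of the whole line.

```python
LINE_WORDS = 4  # four 4-byte elements per 16-byte line, as in the text


def run(flush_order):
    """Return the memory contents of one line after both flushes.

    Processor 0 updates word 0 to 1.0; processor 1 updates words 1-3
    to 2.0. Each processor works on a private copy of the line, and a
    flush writes the entire private copy back to memory.
    """
    memory = [0.0] * LINE_WORDS
    cache = {0: list(memory), 1: list(memory)}   # both read the line
    cache[0][0] = 1.0
    for word in (1, 2, 3):
        cache[1][word] = 2.0
    for proc in flush_order:
        memory[:] = cache[proc]                  # whole-line write-back
    return memory


after_0_then_1 = run([0, 1])
after_1_then_0 = run([1, 0])
```

When processor 0 flushes first, word 0 is overwritten with the stale value 0.0. When processor 1 flushes first, words 1 to 3 are stale instead. Neither order leaves memory fully up to date.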
Though the partitioned loop does not contain depen- 
dencies between array values within a cycle, there are 
output dependencies that must be maintained on cache 
lines for correctness. The key to a general and elegant 
solution to this problem lies in recognizing the need to 
maintain certain output dependencies on cache lines. De- 
pendencies are usually maintained by inserting synchro- 
nization operations. If in a particular parallel segment a 
cache line is updated by at most s processors, correct 
execution is in general obtainable by inserting s synchro- 
nization operations so that in each phase no more than 
one processor updates any cache line. Following each 
phase, all processors flush all cache lines that can possi- 
bly be updated in that phase. All processors except at 
most one have a clean copy of the line and do not write to 
memory. Only the processor having the dirty copy of the 
line writes to memory. 
In the context of rectangular partitioning of iterative 
parallel loops, the false sharing problem has to be ad-
dressed for those cache lines in the horizontal borders of
each part. All other cache lines are exclusively updated 
by one processor. The horizontal border cache lines may 
be updated by two processors. Therefore, we further sub- 
divide each parallel segment into two segments and insert 
an additional synchronization barrier as follows: (1) Up- 
date the top half of each part, reading the most recently 
updated copies of shared data elements from memory. (2) 
Flush shared data associated with the top half of each 
part into global memory. (3) Synchronize at a barrier to 
prevent simultaneous access of shared cache lines by 
processors. (4) Update the bottom half of each part, read- 
ing the most recently updated copies of shared data ele- 
ments from memory. (5) Flush shared data associated 
with the bottom half of the part into global memory. (6) 
Synchronize at a barrier to prevent processors from be- 
ginning the next cycle before the shared data are resident 
in global memory. 
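
The effect of the two-phase scheme can be seen in a small simulation of one column shared by two vertically adjacent parts. This is a sketch under our own assumptions, not ADAPT's generated code: the class `Proc` and the function `cycle` are hypothetical names, a flush writes back only dirty lines and then empties the cache, the barrier is implicit in the sequential loop, and each half of a part is assumed to span the border cache line on its side.

```python
LINE = 4  # elements per cache line


class Proc:
    """Processor with a private write-back cache over shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}   # line index -> [words, dirty flag]

    def write(self, row, value):
        line = row // LINE
        if line not in self.cache:               # fetch the line on a miss
            start = line * LINE
            self.cache[line] = [self.memory[start:start + LINE], False]
        self.cache[line][0][row % LINE] = value
        self.cache[line][1] = True

    def flush(self):
        for line, (words, dirty) in self.cache.items():
            if dirty:                            # clean lines are dropped
                self.memory[line * LINE:(line + 1) * LINE] = words
        self.cache.clear()


def cycle(phases):
    """Run one cycle; each phase lists the rows each processor updates."""
    memory = [0.0] * 12
    procs = [Proc(memory), Proc(memory)]
    for phase in phases:
        for proc, rows in zip(procs, phase):
            for row in rows:
                proc.write(row, 1.0)
        for proc in procs:                       # flush, then barrier
            proc.flush()
    return memory


# Processor 0 owns rows 0-5 and processor 1 owns rows 6-11,
# so the line holding rows 4-7 is shared.
single_phase = cycle([(range(0, 6), range(6, 12))])
two_phase = cycle([(range(0, 3), range(6, 9)),
                   (range(3, 6), range(9, 12))])
```

With a single phase, rows 4 and 5 are left stale in memory. With the top and bottom halves updated in separate phases, every row holds the new value at the end of the cycle.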
8. EXPERIMENTAL RESULTS 
Experiments were run on a 45-processor BBN TC2000
at Argonne National Laboratory. For the first suite of
experiments, the code presented in Fig. 1 was restructured
by hand to implement various data assignments and
varying degrees of overlap. In the second suite of experi- 
ments, the SHOPF code segment from the BBN manual 
was restructured using the ADAPT system. 
8.1. Data Assignment Experiments 
For the first set of experiments in this subsection, we 
used a simple column partition to simplify the implemen- 
tation of various data assignments and overlap. To fur- 
ther simplify the problem, each processor communicated 
across only one boundary. The relative performance of 
the candidates in this experiment is unaffected by these 
choices. 
The data assignments for the sample code were made 
by considering the placement of the ERW and SRNW 
sets of each processor. The following notation is used in 
this section to abbreviate the levels of the hierarchy; “c” 
stands for cache, “lm” for local memory, and “rm” for 
remote memory. The data assignment is specified by an 
ordered pair, (hierarchy level holding the SRNW, hierar- 
chy level holding the ERW). The SREW set is automati- 
cally placed with the SRNW set. The location of both 
sets is referred to as the SRNW location for brevity. 
Execution times from the BBN TC2000 in μs are given
for various data assignments in Table I. 
The largest execution time occurs, not surprisingly, 
when no special attention is paid to the data assignment. 
This is the case labeled (rm, rm) in Table I, and corre- 
sponds to scattering the array among the memory mod- 
ules executing the program, using the BBN “scatter” 
command. In order to make a fair comparison, the time 
obtained for the “scatter” data distribution is compared 
to a data assignment with no redundant storage. Our data 
partitioning scheme described in Section 5.1 places each 
processor’s exclusive write set in its local memory, and is 
the (rm, lm) case in Table I. The observed decrease in 
execution time is a factor of 3.63, for a savings of 72.45 
percent. For this application, utilization of a data assign- 
ment dramatically improves the performance of the 
BBN. 
TABLE I
Relative Performance of a Processor for Various Data Assignments

The experiments are based on the placement of the
ERW set and the SRNW set for each processor. In our
experiments, software redundancy and hardware redundancy
are both exploited. The (rm, c) case results from
the implementation of hardware redundancy using local
arrays which are declared to be cacheable. The local ar- 
rays are used in this section to illustrate the performance 
found at various levels of the memory hierarchy and in 
the implementation of overlap. Software redundancy, as 
described in Section 6, is used when the shared points
are placed in the local memory. The results presented in 
Table II demonstrate the obvious observation that reduc- 
ing the latencies for all data points results in the minimum 
execution time. However, we can separately analyze the 
effects of software redundancy and hardware redun- 
dancy for this application. The experimentally observed 
reduction in execution time using hardware redundancy 
is 27%. The observed reduction in execution time using
software redundancy is 7%. 
In order to test the effects of data assignment on a 
more complex loop partition, the code in Fig. 1 was parti- 
tioned using squares and executed on 16 processors of 
the BBN TC2000. An analysis of the loop in Fig. 1 when 
partitioned for N = 100 and P = 16 indicates that 96% of
all references are made to the ERW set, while 4% of all 
references are made to shared points. When using square 
partitions, the number of accesses made to the ERW set 
is proportional to the area of the partition, which grows 
quadratically in N. Meanwhile, the number of accesses 
made to the shared points is proportional to the perimeter 
of the partition, which grows linearly in N. Therefore, as 
N grows, the performance of the (rm, c) assignment 
should improve relative to the (rm, rm) assignment. Ex- 
periments varying N from 100 to 400 were run on the 
BBN TC2000. The ratio of the execution time of the (rm, 
rm) to the execution time of the (rm, c) assignment ob- 
tained is presented in Fig. 4. Note that, as N increases 
from 100 to 240, the ratio increases. However, as N ex- 
ceeds 240, the ratio drops dramatically. This is because 
the Motorola 88000 processors have only 16 kilobytes of
data cache, and N = 240 is the largest value of N for 
which the ERW set entirely fits in the cache. 
FIG. 4. The ratio of (rm, rm) to (rm, c) as N increases.

In our experiments, the node construct overlapped one load of a shared point (in local or remote memory) with a number of iterations from the compute-bound set. Due to the limited number of registers, any attempted overlap of more than three iterations resulted in partially computed results being spilled back to memory. The execution times (in microseconds) are given for the (lm, c) data assignment and the (rm, c) data assignment in Fig. 5 as the number of overlapped iterations increases. For an overlap of three, the improvement in execution time is 8.65% for the (rm, c) case and 10.21% for the (lm, c) case.
8.2. ADAPT 
The previous experiments have compared one data as- 
signment with another, using the same partitioning 
method for each program. It is important to compare the 
result of our partitioning with more traditional methods 
of scheduling parallel programs. The SHOPF routine [3] 
is a code segment extracted from a fluid flow application. 
It involves the update of a global matrix using the stencil 
{(2, 0), (1, 1), (1, 0), (1, -1), (0, 2), (0, 1), (0, 0), (0, -1),
(0, -2), (-1, 1), (-1, 0), (-1, -1), (-2, 0)}. Two versions
of the code were used in experiments. The first version 
used chunk scheduling. The second version was restruc- 
tured code generated by ADAPT. 
Three experiments were conducted on the code gen- 
TABLE III 
Execution Time of SHOPF with N = 200, 9 = 16 
Processors Processors 
dividing dividing 
Aspect ratio columns rows Time(sec.) 
0.0625 16 1 32.12 
0.25 6 2 26.51 
1 4 4 29.16 
4 2 6 32.53 













1  2 
Overlap 
FIG. 5. Execution time in ~LS vs degree of overlap. (13) Im, c; (a) 
rm, c. 
erated by ADAPT. ADAPT was used to generate code 
with varying aspect ratios to determine the impact of 
altering the aspect ratio on performance. The possible 
aspect ratios for 16 processors, along with their column 
and row assignments, are given in Table III. Using the 
max-min construction, &, = 4 and ny = 4, so the Optimal 
aspect ratio is 1. However, since a cache line in the BBN 
TC2000 system contains more than one data point, the 
effects of cache lines on communication must be consid- 
ered as detailed in [l]. From this treatment, the optimal 
aspect ratio is 0.25, as is demonstrated experimentally in 
Table III. Facilities within ADAPT to compensate for 
cache line effects have been added. In order to observe 
the effects of matrix size on performance, both versions 
of the code were run on 16 processors for varying matrix 
sizes. The exploitation of the BBN memory hierarchy 
improved performance of the ADAPT-generated code 




E F 40 
20 
FIG. 6. Execution time of SHOPF with 8 = 16, aspect ratio of 0.25. 
HUDAK AND ABRAHAM 
which provides for maximum reuse of registers. Also, 
our analysis can be extended to other memory hierar- 
chies. Additionally, our approach must be generalized to 
a wider range of numerical applications. These are major 
areas for future work. 
ACKNOWLEDGMENTS 
The authors thank Bill Appelbe, Kurt Stirewalt, and particularly 
Kevin Smith for their assistance in using the PAT system. Argonne 
National Laboratory provided access to the BBN TC2000 computer. 
o- 
4 8 12 16 20 
Processors 
1. 
FIG. 7. Execution time of SHOPF with N = 200, aspect ratio of 
0.25. (b) Chunk scheduling, (+) ADAPT output. 
2. 
of 3.27 when N = 100 to 3.69 when N = 200, as shown in 3. 
Fig. 6. Finally, the code was compared for various num- 




The nonuniformity of memory access times found on 
large-scale, shared-memory multiprocessors is a direct 
result of scaling the systems to large numbers of proces- 6. 
sors. However, this is not an unfortunate result that must 
be hidden from the compiler. If exposed, the compiler 7 
’ can exploit the nonuniformity to extract even greater per- 
formance. In this paper, we examined automatic methods _ 
REFERENCES 
Abraham, S. G., and Hudak, D. E. Compile-time partitioning of 
iterative parallel loops to reduce cache coherence traffic. IEEE 
Trans. on Par. andDist. Sys. 2, 3 (July 1991), 318-328. 
BBN Advanced Computers, Inc. Znside the TC2000 Computer. 
BBN Advanced Computers, Inc., Cambridge, MA, 1990. 
BBN Advanced Computers, Inc., TC2000 Fortran Reference, BBN 
Advanced Computers, Inc., Cambridge, MA, 1990. 
Bromley, M., Heller, S., McNerney, T., and Steele, G., Jr. Fortran 
at ten gigatlops: The Connection Machine convolution compiler. In 
Proc. ACM SZGPLAN Conference on Programming Language De- 
sign and Implementation, 1988, pp. 58-62. 
Callahan, D., Cocke, J., and Kennedy, K. Estimating interlock and 
improving balance for pipelined architectures. In International 
Conference on Parallel Processing, 1987, pp. 295-304. 
Callahan, D., Kennedy, K., and Porterfield, A. Software pre- 
fetching. In Arch. Support for Programming Languages and Oper- 
ating Systems--IV, 1991, pp. 40-52. 
Carr, S., and Kennedy, K. Blocking linear algebra codes for mem- 
ory hierarchies. In Proc. SIAM Conference on Parallel Processing 
for Scientijic Computing, Chicago, IL, December 1989. 
for reducing the impact of communication time on the 
execution time of a parallel loop. We identify and focus 
on iterative data parallel loops. For the optimal loop par- 
titions generated by ADP, we develop an optimal data 
partition that minimizes communication time. 
The ADAPT (automatic data allocation and partition- 
ing tool) system was developed in order to automatically 
partition programs. The use of ADAPT on a fluid flow 
code segment improved performance by over a factor of 
3 over the partitioning method suggested by BBN. In 
order to reduce communication overhead even further, 
we consider the overlapping of compute-bound iterations 
with memory-bound iterations. Certain machine fea- 
tures, e.g., the size of the register file and the single mem- 
ory access pipe on the Motorola 88000, limit the maxi- 
mum achievable overlap. 
Many opportunities exist for future work in this area. 
We are currently working on extending this analysis to 
multiple loops with potentially different access patterns. 
A classification of loops to determine the optimal amount 
of software redundancy may lead to improved perfor- 
mance. Improvement of maximum overlap can be 
achieved through a two-level loop blocking scheme 
8. Carr, S., and Kennedy, K. Compiling scientific code for complex 
memory hierarchies. In Proc. Hawaii International Conference on 
System Sciences, 1991, pp. 536-544. 
9. Cheong, H., and Veidenbaum, A. Compiler-directed cache man- 
agement in multiprocessors. IEEE Computer 23,6 (June 1990), 39- 
47. 
10. Cytron, R., Karlovsky, S., and McAuliffe, K. Automatic manage- 
ment of programmable caches. In International Conference on Par- 
allel Processing, 1988, pp. 229-238. 
11. Farhat, C. A simple and efficient automatic FEM domain decom- 
poser. Computers and Structures 28(S) 579402, 1988. 
12. Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U.,
Tseng, C., and Wu, M. Fortran D language specification. Tech.
Rep. TR90-141, Department of Computer Science, Rice University,
Dec. 1990.
13. Gallivan, K., Jalby, W., and Gannon, D. On the problem of optimizing
data transfers for complex memory systems. In ACM International
Conference on Supercomputing. St. Malo, France, 1988,
pp. 238-253.
14. Gannon, D., Jalby, W., and Gallivan, K. Strategies for cache and
local memory management by global program transformation. J.
Parallel Distrib. Comput. 5, 5 (Oct. 1988), 587-616.
15. Gornish, E. H., Granston, E. D., and Veidenbaum, A. V. Compiler-directed
data prefetching in multiprocessors with memory hierarchies.
In ACM International Conference on Supercomputing.
1990, pp. 354-368.
16. Hudak, D. E., and Abraham, S. G. Compiler techniques for data
partitioning of sequentially iterated parallel loops. In ACM International
Conference on Supercomputing. 1990, pp. 187-200.
17. Hudak, D. E., and Abraham, S. G. Multidimension extensions to
adaptive data partitioning. Tech. Rep. CSE-TR-85-91, The University
of Michigan, 1991.
18. Knobe, K., Lukas, J., and Steele, G., Jr. Data optimization: Allocation
of arrays to reduce communication on SIMD machines. J.
Parallel Distrib. Comput. 8 (1990), 102-118.
19. Koelbel, C., and Mehrotra, P. Compiling global name-space parallel
loops for distributed execution. IEEE Trans. Parallel Distrib.
Systems 2, 4 (Oct. 1991), 440-451.
20. Lenoski, D., Laudon, J., Gharachorloo, K., Gupta, A., and Hennessy,
J. The directory-based cache coherence protocol for the
DASH multiprocessor. In 17th International Symposium on Computer
Architecture. 1990, 148-159.
21. Mangione-Smith, W., Abraham, S., and Davidson, E. The effects
of memory latency and fine-grain parallelism on Astronautics ZS-1
performance. In Proc. Hawaii International Conference on System
Sciences. 1990, 288-296.
22. Melear, C. The design of the 88000 RISC family. IEEE Micro (Apr.
1989), 26-38.
23. Pingali, K., and Rogers, A. Compiling for locality. In International
Conference on Parallel Processing. 1990, 142-146.
24. Polychronopoulos, C. On Program Restructuring, Scheduling, and
Communication for Parallel Processor Systems. Ph.D. thesis, University
of Illinois at Urbana-Champaign, Aug. 1986. CSRD Report
595.
25. Ramanujam, J., and Sadayappan, P. Compile-time techniques for
data distribution in distributed memory machines. IEEE Trans.
Parallel Distrib. Sys. 2, 4 (1991), 472-482.
26. Reed, D. A., Adams, L. M., and Patrick, M. L. Stencils and problem
partitionings: Their influence on the performance of multiple
processor systems. IEEE Trans. Comput. C-36, 7 (July 1987), 845-858.
27. Smith, K., and Appelbe, W. PAT: An interactive Fortran parallelizing
assistant tool. In International Conference on Parallel Processing.
1988, 58-62.
28. Tseng, P.-S. A Parallelizing Compiler for Distributed Memory Par- 
allel Computers. Ph.D. thesis, Carnegie-Mellon University, Pitts- 
burgh, PA, May 1989. 
29. Wolf, M., and Lam, M. A data locality optimizing algorithm. In
Proc. ACM SIGPLAN 1991 Conference on Programming Language
Design and Implementation, June 1991, pp. 30-44.
30. Zima, H., Bast, H., and Gerndt, M. Superb: A tool for semi-automatic
MIMD/SIMD parallelization. Parallel Comput. 6 (1988), 1-18.
DAVID E. HUDAK is a Ph.D. student in the Electrical Engineering 
and Computer Science Department at the University of Michigan, Ann 
Arbor, and a research assistant in the Advanced Computer Architecture 
Laboratory. His research interests focus on hardware and software 
methods for improving the performance of multiprocessors. David Hu- 
dak has the B.S. in mathematics from Bowling Green State University, 
and the M.S. in computer science from the University of Michigan. 
SANTOSH G. ABRAHAM is currently an Assistant Professor in the 
Department of Electrical Engineering and Computer Science at the 
University of Michigan, Ann Arbor. From 1984 to 1987, he was a
research assistant in the Center for Supercomputing Research and
Development at the University of Illinois. His research interests are in the
areas of parallel processing, compilation for parallel systems, and
computer architecture. Santosh Abraham received the B.Tech. degree from
the Indian Institute of Technology, Bombay, in 1982, the M.S. degree
from the State University of New York, Stony Brook, in 1983, and the
Ph.D. degree from the University of Illinois, Urbana, in 1988, all in
electrical engineering.
Received September 1, 1991; revised February 28, 1992; accepted April
17, 1992
