Automatic data/program partitioning using the single assignment principle by Bic, Lubomir et al.
UC Irvine
ICS Technical Reports
Title
Automatic data/program partitioning using the single assignment principle
Permalink
https://escholarship.org/uc/item/1xg6h283
Authors
Bic, Lubomir
Nagel, Mark D.
Roy, John M.A.
Publication Date
1989
 
Peer reviewed
University of California
Automatic Data/Program Partitioning Using the
Single Assignment Principle
Lubomir Bic
January 1989
Technical Report #89-08
Notice: This Material may be protected by Copyright Law (Title 17 U.S.C.)
Abstract
Loosely-coupled MIMD architectures do not suffer from memory contention; hence large
numbers of processors may be utilized. The main problem, however, is how to partition
data and programs in order to exploit the available parallelism. In this paper we show
that efficient schemes for automatic data/program partitioning and synchronization may be
employed if single assignment is used. Using simulations of program loops common to
scientific computations (the Livermore Loops), we demonstrate that only a small fraction
of data accesses are remote and thus the degradation in network performance due to
multiprocessing is minimal.
Keywords: array, cache, multiprocessor, distributed, single assignment
Department of Information and Computer Science
University of California at Irvine
Irvine, California 92717
This work was supported by the NSF Grant CCR-8709817. Copyright © 1989.

1. Introduction
In scientific computations, most potential parallelism may be found in highly structured
data such as vectors and arrays. Current parallel architectures designed to exploit this
parallelism include pipelined processors (with vectorizing compiler support), SIMD array
processors, and MIMD architectures with either tightly or loosely-coupled processing
elements. Pipelined parallel processors form a specialized class of architectures that are
capable of achieving large speedups on structured numerical computations containing large
amounts of vectorizable code. However, as is noted in [R&F87], the maximum speedup is
limited and depends greatly on the proportion of vectorizable code present. SIMD
multiprocessor architectures also are capable of extracting large amounts of parallelism
from vector problems, and the amount of parallelism generally increases with increasing
numbers of processors; however, the problem domain of most SIMD architectures is
limited (i.e., SIMD is not suited to general computing).
In this paper we consider loosely-coupled MIMD architectures. Because these
architectures do not suffer from memory contention, they have the greatest potential for
large-scale parallelism. The main problems with loosely-coupled MIMD include: (1) the
need to find a good partitioning of the programs/data, (2) the need to introduce
synchronization primitives to avoid race conditions, and (3) the need to introduce
communication primitives for exchanging data among processors.
In this paper we show that a single assignment policy can produce a large degree of
parallelism while keeping the amount of communication overhead low. In particular, for
programs that follow the single assignment policy, we show the following:
• A simple automatic scheme for partitioning of data and programs can be employed.
• Synchronization can be fully automatic using a memory tagging mechanism.
• Communication overhead for remote data accesses can be greatly reduced by using
data caches. Furthermore, due to the single assignment policy, cache coherence problems
are eliminated.
The paper is organized as follows. In Section 2 we discuss the benefits of single
assignment with respect to automated data/program partitioning. Section 3 shows how
synchronization may be simplified by using the single assignment principle. Next, in
Section 4, we present a simple scheme for data/program partitioning and discuss how
caching reduces communication overhead. In Section 5, we discuss some of the
disadvantages of single assignment programming and some approaches for overcoming
these problems. In Sections 6 and 7, we discuss our simulation and the resulting access
distribution classes. Finally, in Sections 8 and 9, we end with some conclusions and a
discussion of our future research.
2. Single Assignment and Data/Program Partitioning
Partitioning a program over multiple processing elements involves both data and
program partitioning. The programmer can do this partitioning explicitly (e.g., with
fork/join or cobegin/coend constructs), but this requires experienced parallel programmers
and too much debugging time to be generally applicable. It is also possible for the
compiler to detect parallelism and partition the program accordingly; most currently known
methods are NP-complete, although some progress has been made in this area ([PEI86] and
[PGW87]). The main problem with all these approaches is that data (variables) may be read
and written from many parts of the program (i.e., by different instructions). It is
difficult to decide where to place a variable with respect to the possible instructions
that access that variable. Furthermore, synchronization primitives must be inserted to
prevent race conditions, and communication primitives must be inserted to allow sharing of
data across processing elements (PEs). These increase the chances for program errors.
Most of these problems can be drastically simplified if single assignment principles are
used. These principles require that no variable ever be assigned more than once throughout
its scope. When extended to arrays, the definition is less clear. Some have defined single
assignment on arrays as the requirement that the array be treated as a single object and thus
can be assigned to only once [DEN75]. This is acceptable in languages that have a
complete set of array operations, yet forming such a complete set can be very difficult. A
better definition of single assignment for arrays is that each element of an array may be
assigned only once. This allows a great deal more flexibility in the use of arrays, and this
relaxation of single assignment rules does not cause any problems given proper hardware
support.
With single assignment, only one instruction will ever write to a variable; it is the
producer of the data. We can use this fact by requiring that each variable be mapped onto
the same PE as its producer (i.e., the instruction that writes it). In other words, we do not
partition programs explicitly, only data. The corresponding instructions can be mapped
implicitly.
This is particularly attractive with arrays, which are typically accessed via loops. Each
array is subdivided into segments and these are distributed across the PEs. The loop body
accessing each cell is the same for each PE, and hence a copy is distributed to each PE.
Each PE uses this copy to produce its array subrange. More precisely, the partitioning of a
loop takes place as follows:
• Data partitioning is accomplished by segmenting each array into pages of some fixed
(perhaps parameterized) size. A page p is allocated to the local memory of PE P if
P = p mod N, where N is the total number of available PEs.
• Control partitioning is done by assigning to each PE the responsibility for
updating the elements in all the array pages it contains in its local memory.
As a simple example of this partitioning method, suppose we have a multiprocessor
with four PEs and a page size of 32 elements. Given three arrays A, B, and C, (each of
size 100) PE 0, PE 1, and PE 2 will each contain a single page of each array. PE 3 will
contain a partial page (4 elements) of each array. For the following simple loop:
DO 10 i = 1,100
10 A(i) = B(101-i) + C(i)
all four processors begin executing simultaneously: PE 0 fills A(1..32), PE 1 fills
A(33..64), PE 2 fills A(65..96), and PE 3 fills A(97..100). Note that for most of the loop,
each processor must access elements of array B that lie on a different processor than the
executing processor. A method for eliminating nearly all of this communication overhead
will be presented in Section 4.
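The four-PE example can be checked with a short sketch. The function name `owner_pe` and the convention that pages are numbered from 0 are assumptions made for illustration; they are not taken from the report. The sketch maps each element of a 100-element array to its producing PE and counts how many reads of B(101-i) fall on a different PE when no cache is present.

```python
def owner_pe(index, page_size, num_pes):
    """Return the PE that owns 1-based array element `index`.

    Assumes pages are numbered from 0 and page p lives on PE (p mod N).
    """
    page = (index - 1) // page_size
    return page % num_pes


PAGE_SIZE, NUM_PES, N = 32, 4, 100

# Which subrange of A does each PE produce?
elements = {}
for i in range(1, N + 1):
    elements.setdefault(owner_pe(i, PAGE_SIZE, NUM_PES), []).append(i)
subranges = {pe: (idx[0], idx[-1]) for pe, idx in elements.items()}

# For A(i) = B(101-i) + C(i): count reads of B owned by another PE.
remote_b = sum(
    1
    for i in range(1, N + 1)
    if owner_pe(N + 1 - i, PAGE_SIZE, NUM_PES) != owner_pe(i, PAGE_SIZE, NUM_PES)
)

print(subranges)  # {0: (1, 32), 1: (33, 64), 2: (65, 96), 3: (97, 100)}
print(remote_b)   # 72
```

Under these assumptions 72 of the 100 reads of B are remote, while every read of C(i) is local because C is indexed exactly like A.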
Thus, data and program partitioning are achieved using simple rules which take
advantage of single assignment. These rules are sufficient for most common forms of
loops (see Section 6).
3. Synchronization Through Single Assignment Programming
Single assignment principles allow the implementation of a simple automatic
synchronization mechanism. Each memory cell has two states: undefined or defined. If a
cell is undefined, it may also have a queue of read requests associated with it. Hardware
enforces the write-before-read requirement. Some examples of architectures that have this
type of write-once/read-many memory access mechanism include HEP [S&F83] [SMI81]
and I-structure memory in dataflow [ANP87] [A&C86].
Prior to execution, an array is either undefined or filled with initialization data (if
specified in the program). Each PE may write only into undefined array cells and only into
those mapped to that PE (i.e., each PE is the producer of only the array subranges mapped
to it). This is achieved by screening the array indices so that the right-hand side of the
assignment is evaluated only for a given PE's subranges. Whether only the correct indices
are generated, or all are generated and then screened, is an implementation detail.
The point is not to perform the calculations in a PE not responsible for writing the
associated element.
Race conditions are avoided by this single assignment policy. There will never be a
race condition for writes to a memory cell, since only one PE may write to any particular
cell and writing more than once results in a runtime error.
Thus the single assignment rule automatically enforces synchronization in a distributed
manner; no explicit synchronization mechanisms are necessary, whereas synchronization is a
major issue in other programming paradigms.
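The tagged-memory behavior described in this section can be sketched in software. The class `TaggedCell` and its callback-based read queue are hypothetical; the report describes a hardware mechanism and does not give an implementation. The sketch shows the two cell states, the queue of early read requests, and the runtime error on a second write.

```python
class TaggedCell:
    """One write-once memory cell: undefined until written, then read-only."""

    def __init__(self):
        self.defined = False
        self.value = None
        self.waiting = []  # callbacks of readers that arrived before the write

    def read(self, on_value):
        """Deliver the value now if defined; otherwise queue the request."""
        if self.defined:
            on_value(self.value)
        else:
            self.waiting.append(on_value)

    def write(self, value):
        """Define the cell once; a second write is a runtime error."""
        if self.defined:
            raise RuntimeError("single assignment violated: cell already defined")
        self.defined, self.value = True, value
        pending, self.waiting = self.waiting, []
        for on_value in pending:  # release every deferred read
            on_value(value)


received = []
cell = TaggedCell()
cell.read(received.append)  # early read: queued, nothing delivered yet
cell.write(42)              # defines the cell and releases the queued read
cell.read(received.append)  # late read: served immediately
print(received)             # [42, 42]
```

A reader never observes an undefined value, so write-before-read ordering needs no explicit lock or barrier in the program text.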
4. Inter-processor Communication
We have seen that single assignment yields simple partitioning and synchronization
schemes. Remote read accesses, however, are not eliminated, since any instruction may
read any data item. If data is mapped onto the reading PE, the access is local, otherwise it
is remote; the PE must request the value from the responsible PE by sending a message.
Remote reads are synchronized just like local reads: if the data item is not available, the
request is queued, and if the data item is available, the page containing that item is sent
back. During this remote read the requesting PE can perform other useful work. The
requesting PE may resume filling its subrange when the page arrives. This is where the
benefits of array caching come in, and array caching is greatly simplified because of the
single assignment principle.
Since the central idea in single assignment programming is to permit only one write to
any element, by requiring single assignment we can guarantee that a page fetched from a
remote PE and cached locally will not need any further updates during the lifetime of the
array, ignoring for now the possibility of partially filled pages. Given this, each PE may
safely cache a remotely fetched page in a local data cache, avoiding future remote accesses
to the same page. The cache used will be of fixed size and thus must use some sort of
page replacement strategy. For our simulation, we chose a least-recently-used page
replacement strategy. This choice leads to some interesting results discussed in a later
section.
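A minimal sketch of such a per-PE page cache follows. The names `PageCache` and `classify_read` are hypothetical, page contents are omitted, and partially filled pages are ignored, as in the discussion above. Because pages are never modified after they are written, the cache needs no invalidation logic; only least-recently-used replacement is modeled.

```python
from collections import OrderedDict


class PageCache:
    """Fixed-capacity LRU cache of remote page identifiers (contents omitted)."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()

    def lookup(self, page_id):
        """Return True on a hit; on a miss, load the page and evict the LRU one."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return True
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # drop the least recently used page
        return False


def classify_read(pe, array, index, page_size, num_pes, cache):
    """Classify a read of array(index) issued by `pe` as local, cached or remote."""
    page = (index - 1) // page_size
    if page % num_pes == pe:
        return "local"
    return "cached" if cache.lookup((array, page)) else "remote"


# PE 0 of a 4-PE machine reads B(100), B(99), ..., B(69) with 32-element pages.
cache = PageCache(capacity_pages=256 // 32)
counts = {"local": 0, "cached": 0, "remote": 0}
for index in range(100, 68, -1):
    counts[classify_read(0, "B", index, 32, 4, cache)] += 1
print(counts)  # {'local': 0, 'cached': 30, 'remote': 2}
```

In this small run only the first touch of each of the two remote pages costs a message; the other 30 reads are served from the cache.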
Without single assignment, partitioning data among PEs is possible, but it would
require excessive communication overhead to allow any instruction to write to any location
of an array. In addition, caching would be nearly useless as each write performed would
require the update of all remote caches containing the modified page. The machine could
broadcast or multicast these updates to avoid the inefficiencies of individual messages, but
the broadcasts would still strain the network facilities. Not only that, but without single
assignment the caches would be inconsistent for the duration of the page modification
broadcast (the cache coherency problem). If no cache is used, no page modification
broadcasts are necessary, and there are no inconsistency problems. However, the use of
caching leads to considerable decreases in total remote accesses performed, as is shown
later in our simulation results.
We consider a set of loops (extracted from the Livermore Loops benchmark program)
with data access patterns that are typically found in scientific programs. Using these we
show that a simple data partitioning approach works well even with many PEs.
The main questions we are interested in answering are:
Given a simple static data partitioning scheme,
• how important is each program's access pattern?
• what is the overall percentage of remote accesses?
• how much can this be reduced by adding a data cache?
• how well balanced are the remote accesses?
5. Problems with Single Assignment
From the above discussion we see that enforcing single assignment policy can offer
several advantages for MIMD architectures. Experience with single assignment languages
has shown, however, that it is difficult to implement programs under such a restriction.
Much of this attitude arises from the ingrained nature of the von Neumann model. The
requirement of single assignment is not as restrictive as it might appear—from a
programming standpoint, there are several alternatives, including:
• Use a single assignment language. Here the rule is enforced automatically through the
language semantics. In some cases, adherence can be determined completely at compile
time (e.g., functional languages [ACK82]).
• Use a conventional language. In this case, the burden is placed on the programmer to
ensure that the rule is not violated. In most cases, the same programming techniques
and algorithms can be used, but arrays cannot be reused—once written, they cannot be
changed. A way to relax the single assignment policy in a controlled manner so that
memory costs do not become too high is presented below. Conventional compilers can
be modified to perform data path analysis to help programmers adhere to single
assignment rules.
• Use an automatic conversion tool. For many conventional loops, this conversion will
be straight-forward and can be done by a translator program. These translators will
tend to increase the amount of memory used for array storage, especially in those
programs that reuse arrays many times in the same loop.
In statically allocated systems, the resulting inefficiency with memory usage can be
solved by providing a special array re-initialization construct. Each PE's re-initialization
must synchronize in some way with the re-initialization requests of all other PEs. We have
formulated a method for performing this synchronization that is based on the concept of a
host processor. In this method, each array in a computation has a specific PE assigned to it
as an administrative center called the host processor. The host processor serves as a
gathering point for re-initialization messages. In order to evenly divide this work among
all PEs, the compiler ensures that the host processor duties for the arrays are evenly
distributed among the PEs. For the re-initialization of some array A, each PE sends a
re-initialization message to A's host processor. These messages are collected until the
last PE has requested re-initialization. Once this happens, the host processor for A
broadcasts a message to the other PEs informing them that A can now be reused. Thus, the
host processor acts as a synchronization point for A so that no PE attempts to write to an
out-of-date version of A. This prevents the creation of too many copies of an array in
tight loops at the expense of an artificial synchronization point. Deallocation of arrays
must be based on the
same kind of host processor synchronization. (A more complete discussion of this
mechanism is beyond the scope of this paper.)
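The host processor protocol can be sketched as a simple request counter. Everything named here is hypothetical: the class `HostProcessor`, the `generation` counter, and the round-robin policy in `assign_hosts` are illustrative assumptions, since the report leaves the details of the mechanism out of scope.

```python
class HostProcessor:
    """Collects re-initialization requests for one array (illustrative only)."""

    def __init__(self, array_name, num_pes):
        self.array_name = array_name
        self.num_pes = num_pes
        self.pending = set()   # PEs that have requested re-initialization
        self.generation = 0    # number of times the array has been reused

    def request_reinit(self, pe):
        """Record a request from `pe`; return True when the broadcast is due."""
        self.pending.add(pe)
        if len(self.pending) < self.num_pes:
            return False       # still waiting for the remaining PEs
        self.pending.clear()
        self.generation += 1   # every PE has asked: the array may be reused
        return True


def assign_hosts(array_names, num_pes):
    """Spread host duties round-robin over the PEs (an assumed policy)."""
    return {name: n % num_pes for n, name in enumerate(array_names)}


host = HostProcessor("A", num_pes=4)
replies = [host.request_reinit(pe) for pe in (2, 0, 0, 3, 1)]
print(replies)                                            # [False, False, False, False, True]
print(assign_hosts(["A", "B", "C", "D", "E"], num_pes=4))  # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 0}
```

Only the last distinct request triggers the broadcast, so a repeated request from the same PE cannot release the array early.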
6. Description of the Simulation
In order to study the effect of using a single assignment MIMD machine with a per-PE
array cache, we implemented a simulation to measure the distribution of local, cached, and
remote reads for an abstract multiprocessor architecture. The parameters that we varied
were:
• number of processors
• page size (in units of atomic data elements)
Since the main goal of the simulation was to show that an array cache would decrease
the percentage of remote accesses required, we chose a small fixed cache size (256
elements). Since the number of cache pages is dependent on the page size, the number of
cache pages varied as well, but was not a simulation variable. Even a cache size this small
proved sufficient to reduce the remote access percentage in many cases.
7. Simulation Results
Using the Livermore Loops, we mapped arrays onto a set of PEs using the partitioning
scheme described above, with multidimensional arrays mapped to a linear address space
through row-major ordering. Accesses to array elements were categorized as follows: write
(always local), local read, cached read, and remote read. The totals of each access type
were accumulated for the execution of each program. For each loop, the percentage of all
reads which were remote (% of Reads Remote) indicates how well our approach handles the
loop access pattern. Another important measure of performance is the distribution of work
among the processors. The following sections present the results of the simulation.
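The row-major mapping of a two-dimensional array onto pages can be sketched as follows. The function names and the 7-column array shape are assumptions chosen only for illustration; the report does not state the array dimensions used in the simulation.

```python
def linear_index(j, k, num_cols):
    """Row-major linear address (1-based) of element (j, k) of a 2-D array."""
    return (j - 1) * num_cols + k


def owner_pe(j, k, num_cols, page_size, num_pes):
    """PE owning element (j, k), using the page p -> PE (p mod N) rule."""
    page = (linear_index(j, k, num_cols) - 1) // page_size
    return page % num_pes


# Hypothetical 7-column array, 32-element pages, 4 PEs.
print(linear_index(2, 1, 7))     # 8: the second row starts right after the first
print(owner_pe(5, 4, 7, 32, 4))  # 0: linear address 32 is the last element of page 0
print(owner_pe(5, 5, 7, 32, 4))  # 1: linear address 33 opens page 1
```

With this ordering a step in the second index moves one element, while a step in the first index moves a whole row, which is why skews in different dimensions touch pages at very different rates.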
7.1. Remote Access Overhead
By examining graphs produced by the simulation data, we were able to classify the
various loops based on their access patterns. The four classes we observed are described
below.
7.1.1. Class 1: Matched Distribution
The first class we observed consisted of those access patterns that have all array indices
equal to one another throughout the execution of the loop, i.e., there is no skewing of
array accesses. A typical loop fragment from a member of this class, 1-D Particle in a
Cell, is:
DO 1 k = 1,n
1 RX(k) = XX(k) - IR(k)
Note how the same array index is used for all array accesses in the calculation. Access
patterns that fall into this class will always achieve a 0% remote access ratio. Caching
has no effect on the access ratio, since each PE can write to its segments while reading
the corresponding segments of the other arrays locally.
7.1.2. Class 2: Skewed Distribution
The second class we observed, skewed distribution (SD), displays a sequential access
pattern as in matched distribution, but the indices used in each array are offset from one
another by a constant. As the index steps through the arrays, remote accesses will need to
be performed for the elements lying past page boundaries. Since a page boundary implies a
remote access (except for the single PE case), the loops in this class perform remote
accesses.
We found that loops in this class occur often in the Livermore Loops. For example,
Hydro Fragment, Tri-Diagonal Elimination, Equation of State Fragment, Explicit
Hydrodynamics Fragment, First Sum, and First Differential were all in this class. The
inner loop fragment from Hydro Fragment is:
DO 1 k = 1,n
1 X(k) = Q + Y(k) * (R*ZX(k+10) + T*ZX(k+11))
SD access patterns tend to achieve a very low (< 10%) remote access ratio (see Figure
1). This is because the access pattern displays a large amount of locality of reference:
the number of remote accesses is usually small, as the skew is generally a few elements.
When the skew is large, the remote access percentage increases, but caching eliminates the
cost of a larger skew. The effect of caching in this case depends on the value of the skew
constant. For a skew of one, the cache has no effect; for a skew of two, the cache saves
one remote access per page; and so on. For larger page sizes, the cache helps
proportionally to the page size. Of course, if the page size is too large, the work will
not spread over a sufficient number of PEs.
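The effect of skew and caching can be reproduced with a small simulation sketch of the Hydro Fragment. This is not the authors' simulator: the function `simulate`, the loop length n = 1024, and the sequential per-PE processing order are assumptions, and partially filled pages and timing are ignored. The page size (32), the PE count (4) and the cache size (256 elements) are values mentioned in the report.

```python
from collections import OrderedDict


def simulate(n, reads, page_size, num_pes, cache_elems):
    """Classify every read of a loop X(k) = f(reads) for k = 1..n.

    `reads` is a list of (array_name, offset) pairs; each read touches
    array_name(k + offset). X(k) is produced by the PE owning its page.
    """
    counts = {"local": 0, "cached": 0, "remote": 0}
    capacity = cache_elems // page_size        # cache pages; 0 means no cache
    caches = [OrderedDict() for _ in range(num_pes)]
    for k in range(1, n + 1):
        pe = ((k - 1) // page_size) % num_pes  # producer of X(k)
        for name, offset in reads:
            page = (k + offset - 1) // page_size
            key = (name, page)
            if page % num_pes == pe:
                counts["local"] += 1
            elif key in caches[pe]:
                caches[pe].move_to_end(key)    # refresh LRU position
                counts["cached"] += 1
            else:
                counts["remote"] += 1
                if capacity > 0:
                    caches[pe][key] = True
                    if len(caches[pe]) > capacity:
                        caches[pe].popitem(last=False)  # evict least recent
    return counts


def remote_percent(counts):
    return 100.0 * counts["remote"] / sum(counts.values())


# Hydro Fragment: X(k) = Q + Y(k) * (R*ZX(k+10) + T*ZX(k+11))
HYDRO = [("Y", 0), ("ZX", 10), ("ZX", 11)]
no_cache = simulate(1024, HYDRO, page_size=32, num_pes=4, cache_elems=0)
with_cache = simulate(1024, HYDRO, page_size=32, num_pes=4, cache_elems=256)
print(no_cache, remote_percent(no_cache))      # 672 remote reads, 21.875 %
print(with_cache, remote_percent(with_cache))  # 32 remote reads, about 1.04 %
```

Under these assumptions each 32-element page needs 21 remote reads without a cache and a single remote read with one, which is of the same order as the reduction from 22% to 1% quoted in the conclusions for an SD loop with large skew.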
[Figure 1 plot: Hydro Fragment. Remote read percentage (axis 0% to 20%) versus number of
PEs. Curves: Cache, ps 32; No Cache, ps 32; Cache, ps 64; No Cache, ps 64.]
FIGURE 1. SKEWED ACCESS PATTERN (SKEW OF 11). CACHING IS
IMPORTANT IN THIS COMMON CLASS.
7.1.3. Class 3: Cyclic Distribution
This third class, cyclic distribution (CD), occurs when a fixed set of pages is accessed
in a cyclic order. The Incomplete Cholesky-Conjugate Gradient is an excellent example of
this. The bulk of the loop is:
II = n
IPNTP = 0
22 IPNT = IPNTP
IPNTP = IPNTP + II
II = II/2
i = IPNTP
DO 2 k = IPNT+2, IPNTP, 2
i = i + 1
2 X(i) = X(k) - V(k)*X(k-1) - V(k+1)*X(k+1)
IF (II.GT.1) GOTO 22
Note that this is single assignment; the characteristics of this loop restrict the value of i
such that i > k+1. The access distribution is cyclic because the write index (i) is changing
twice as slowly as the read index (k). This allows caching to become nearly perfect as the
number of PEs increases. At 32 PEs with cache size of 64, each PE is responsible for the
writing of only one page. Once a remote read is done, the remote page remains cached.
Without a cache, CD displays poor performance, since the accesses jump from page to
page and most are remote. However, with a cache the percentage of remote accesses
decreases as the cache size increases and as the number of PEs increases. The explanation
for this is that as the computation gets spread over more and more PEs, the total size of the
cache increases. Thus, as the number of PEs increases and each PE is responsible for
writing a smaller portion of the array, the cycle length tends to decrease for each PE.
Given this, each PE is more likely to contain all of an access cycle in its cache (see Figure
2).
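The single assignment property of this loop can be checked by replaying its index arithmetic. The sketch below is a direct port of the fragment's control flow to Python; the function names and the problem size n = 101 are assumptions, and only indices are traced, not floating-point values.

```python
def iccg_trace(n):
    """Return the (write index i, read index k) pairs produced by the fragment."""
    trace = []
    ii, ipntp = n, 0
    while True:
        ipnt = ipntp
        ipntp += ii
        ii //= 2
        i = ipntp
        for k in range(ipnt + 2, ipntp + 1, 2):  # DO 2 k = IPNT+2, IPNTP, 2
            i += 1
            trace.append((i, k))                 # X(i) reads X(k-1), X(k), X(k+1)
        if ii <= 1:                              # IF (II.GT.1) GOTO 22
            break
    return trace


def check_single_assignment(trace, n):
    """True if no element is written twice and every read is already defined."""
    defined = set(range(1, n + 1))               # X(1..n) holds the input data
    for i, k in trace:
        if i in defined:
            return False                         # second write to X(i)
        if not {k - 1, k, k + 1} <= defined:
            return False                         # read of an undefined element
        defined.add(i)
    return True


trace = iccg_trace(101)
print(len(trace))                                # 97 writes, X(102) .. X(198)
print(check_single_assignment(trace, 101))       # True
print(all(i > k + 1 for i, k in trace))          # True
```

For this problem size every write lands on a fresh element beyond all elements read in the same iteration, which is the property i > k+1 stated above.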
[Figure 2 plot: Incomplete Cholesky-Conjugate Gradient. % of Reads Remote (axis 0% to
100%) versus number of PEs. Curves: Cache, ps 32; No Cache, ps 32; Cache, ps 64;
No Cache, ps 64.]
FIGURE 2: CYCLIC ACCESS PATTERN. CACHING AND PAGE SIZE CAN
REDUCE THE PERCENTAGE OF REMOTE READS SIGNIFICANTLY.
The 2-D Explicit Hydrodynamics Fragment is an example of CD in which the cycling
arises from the multidimensionality of the arrays. In one dimension, skewed distribution
occurs, but in the other dimension, the pages are accessed in a cycle, so we observe a
decrease in the percentage of remote accesses as the number of PEs increases. This
behavior can be seen in Figure 3. An inner loop fragment from the 2-D Explicit
Hydrodynamics Fragment is:
DO 70 k = 2,6
DO 70 j = 2,n
ZA(j,k) = (ZP(j-1,k+1) + ZQ(j-1,k+1) - ZP(j-1,k) - ZQ(j-1,k))
* (ZR(j,k) + ZR(j-1,k)) / (ZM(j-1,k) + ZM(j-1,k+1))
ZB(j,k) = (ZP(j-1,k) + ZQ(j-1,k) - ZP(j,k) - ZQ(j,k))
* (ZR(j,k) + ZR(j,k-1)) / (ZM(j,k) + ZM(j-1,k))
70 CONTINUE
Notice how both indices are skewed such that a cycle occurs in the access pattern.
[Figure 3 plot: 2-D Explicit Hydrodynamics Fragment. % of Reads Remote versus number of
PEs. Curves: Cache, ps 32; No Cache, ps 32; Cache, ps 64; No Cache, ps 64.]
FIGURE 3: CYCLIC AND SKEWED ACCESS PATTERN COMBINATION.
EXHIBITS EXCELLENT RESULTS AIDED FURTHER BY CACHING.
The examples above are rather counter-intuitive, yet very important results. Currently
we are conducting further research to determine under what configuration or parameters a
given program would approach a 0% remote access ratio.
7.1.4. Class 4: Random Distribution
The final class is the random distribution (RD). RD covers loops that access various
parts of the linear address space in a seemingly random fashion. This behavior can occur
when multi-dimensional arrays are combined with skewed accesses. The General Linear
Recurrence Equations and A.D.I. Integration are both in this class. Inner loop statements
from the A.D.I. Integration are:
DO 8
U1(kx,ky,2) = U1(kx,ky,1) + A11*DU1(ky) + A12*DU2(ky) + A13*DU3(ky)
+ SIG*(U1(kx+1,ky,1) - 2.*U1(kx,ky,1) + U1(kx-1,ky,1))
U2(kx,ky,2) = U2(kx,ky,1) + A21*DU1(ky) + A22*DU2(ky) + A23*DU3(ky)
+ SIG*(U2(kx+1,ky,1) - 2.*U2(kx,ky,1) + U2(kx-1,ky,1))
U3(kx,ky,2) = U3(kx,ky,1) + A31*DU1(ky) + A32*DU2(ky) + A33*DU3(ky)
+ SIG*(U3(kx+1,ky,1) - 2.*U3(kx,ky,1) + U3(kx-1,ky,1))
8 CONTINUE
RD exhibits large remote access ratios regardless of the presence or absence of caching
(see Figure 4). This invariance can be due either to a cycle in the access pattern that is
too large to fit in the cache, or to effectively random page accesses (e.g., permutation
lookups). The effect of the cache is minimal, because no page is being kept until it is
needed again. This is similar in many ways to thrashing in virtual memory systems.
Increasing the number of PEs is likely to help only if the access patterns form a cycle
that is too large to fit in the cache. Increasing the cache size will help here by allowing
a complete cycle to reside in the cache or by increasing the probability of a cache hit
simply by having more of the remote pages stored locally.
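The thrashing behavior of a least-recently-used cache on an oversized cycle can be shown with a few lines. The function `lru_hits` and the synthetic page sequence are illustrative assumptions; they do not model any particular Livermore loop.

```python
from collections import OrderedDict


def lru_hits(page_sequence, capacity):
    """Count cache hits for a sequence of page references under LRU."""
    cache, hits = OrderedDict(), 0
    for page in page_sequence:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
            hits += 1
        else:
            cache[page] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used page
    return hits


cycle = list(range(9)) * 10                # 9 remote pages swept 10 times
print(lru_hits(cycle, capacity=8))         # 0: every page is evicted just before reuse
print(lru_hits(cycle, capacity=9))         # 81: only the first sweep misses
```

One extra cache page turns a pattern with no hits at all into one where only the first sweep goes to the network, which is why a cycle slightly larger than the cache behaves like random access.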
[Figure 4 plot: General Linear Recurrence Equations. % of Reads Remote (axis 0% to 70%)
versus number of PEs. Curves: Cache, ps 32; No Cache, ps 32; Cache, ps 64;
No Cache, ps 64.]
FIGURE 4. RANDOM ACCESS PATTERN. POOR PERFORMANCE OF RD
CAN BE OVERCOME BY LARGER CACHE SIZES.
7.2. Load Balancing
The previous section showed that automatic partitioning can result in very small ratios
of remote accesses when measured over the entire processor network. Another important
aspect of automatic partitioning is load balancing (i.e., how evenly distributed are the
computations?).
To consider load balancing behavior we use the number of remote and local reads per
PE as a measure of how well the program is distributed. Figure 5 shows that each of the
sixty-four PEs performs a comparable number of remote reads and local reads; hence the
area-of-responsibility concept balances most loops well. In each loop, each PE performs
similar amounts of remote access because each PE was responsible for similar amounts of
the array. In cases where the number of remote reads depends upon which element is being
written, the load balance can be skewed. In these cases the lightly loaded PE can continue
on with the program or context switch to another program. We found that nearly all of the
Livermore Loops exhibited a load distribution pattern like that in Figure 5.
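A per-PE tally makes the balance argument concrete. The sketch below counts local and remote reads per PE for a skewed loop without a cache; the function `per_pe_reads`, the loop lengths, and the use of the Hydro Fragment offsets (0, 10, 11) with 4 PEs are assumptions for illustration and do not reproduce the 64-PE experiment of Figure 5.

```python
def per_pe_reads(n, offsets, page_size, num_pes):
    """Count local and remote reads per PE for X(k) = f(A(k+d)), no cache."""
    local = [0] * num_pes
    remote = [0] * num_pes
    for k in range(1, n + 1):
        pe = ((k - 1) // page_size) % num_pes         # producer of X(k)
        for d in offsets:
            owner = ((k + d - 1) // page_size) % num_pes
            if owner == pe:
                local[pe] += 1
            else:
                remote[pe] += 1
    return local, remote


# Loop length a multiple of page_size * num_pes: every PE does the same work.
print(per_pe_reads(1024, [0, 10, 11], page_size=32, num_pes=4))
# ([600, 600, 600, 600], [168, 168, 168, 168])

# Loop length 100: PE 3 owns only a 4-element partial page and is lightly loaded.
print(per_pe_reads(100, [0, 10, 11], page_size=32, num_pes=4))
# ([75, 75, 75, 12], [21, 21, 21, 0])
```

The second run shows the kind of skew mentioned above: a PE with a short subrange finishes early and could move on to other work.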
[Figure 5 plot: Load Balance Data of a Typical SD Loop (2-D Explicit Hydrodynamics
Fragment, page size 32). Remote and local read counts versus processor number
(64 processors). Curves: Remote with Cache; Remote with No Cache; Local with No Cache;
Local with Cache.]
FIGURE 5. TYPICAL REMOTE ACCESS LOAD BALANCE. EVENLY
BALANCED LOADS RESULT FROM THE AREA-OF-RESPONSIBILITY
CONCEPT.
8. Conclusions
The combination of single assignment, areas-of-responsibility, and caching leads to
low communication overhead and well-balanced loads when applied to the majority of the
Livermore Loops. Single assignment permits the exploitation of large numbers of PEs
automatically. Synchronization problems are solved through the adoption of the single
assignment policy. By segmenting array writes using the area-of-responsibility concept, all
PEs perform roughly the same number of remote accesses. These two concepts allow
caching to be implemented without extensive communication, and caching is central to
reducing remote accesses in the most common classes.
To answer our primary questions:
• How important is each program's access distribution?
Four different access classes cover the range of scientific computing. The most
common class (SD) exhibits extremely low percentages of remote accesses (1% to 10%).
Other, poorer performing classes (also less common) can be aided by larger cache sizes.
• What is the overall percentage of remote accesses?
For most access distributions, the percentages of remote accesses are less than 10%
when using a cache of 256 elements (fairly small). For certain access distributions (RD)
the remote access percentage can be rather high. We are continuing research into how to
handle this special access distribution class.
• How much can the remote accesses be reduced by adding a data cache?
Depending upon the access distribution class, caching can have anywhere from a
minimal effect to an extremely large effect (e.g., for an SD loop with large skew, we
observed a reduction from 22% remote reads to 1% remote reads). Since SD is by far the
most common class, this reduction is significant in many areas of scientific parallel
computing.
• How well balanced are the remote accesses?
Because single assignment and equal partitioning force a nearly equal number of writes
on each processor, the number of remote reads is also fairly equal. Thus the remote
accesses are well balanced for the majority of cases. Our simulation results show this to be
true for almost all of the Livermore Loops. The exceptions are those computations that are
inherently difficult to parallelize under any paradigm and exhibit access patterns that
correspond in many ways to thrashing in virtual memory systems.
Process alignment, as currently being considered by some researchers ([PGW87] and
[A&N87]), is no longer necessary. The analysis used in process alignment was used to
transform SD loops to decrease communication overhead. By caching the elements in
pages, the locality of reference in SD loops is exploited, and only one remote read is
necessary for all elements in a page (in real systems, a single page might have to be fetched
more than once if that page is only partially filled at the time of the first request, but the
overall communication overhead will still be much smaller).
9. Future Research
This is the first step in the development of a new approach to distributing arrays. The
concepts presented here play a key role in the design of a parallel model of execution on
which we are currently working. To further understand the advantages and disadvantages
of this approach, we need to examine a variety of issues:
• How will vector to scalar operations be implemented? Current ideas include the
extension of the host processor mechanism to allow collection of subrange results.
• A more sophisticated simulation will better explore the problems of execution time
and network contention.
• A better approach to RD access patterns is needed. Different partitioning schemes
need to be explored as well as larger cache sizes. We will look into how the
techniques developed for handling thrashing in virtual memory systems apply to this
model.
• If it turns out that the different classes of access patterns form a nonintersecting set
with respect to performance under different partitioning methods, then we must
explore ways for providing different programmer- or compiler-selectable partitioning
schemes. These would allow the programmer or compiler to select the partitioning
method based on some analysis of the access behavior. For example, we have seen
that our simple modulo partitioning scheme performs worse for certain loops than a
division scheme. If no third scheme can be found that allows all types to perform
well, it may become necessary to allow the selection of one or the other scheme based
on the access distribution class.
• Other parameters might be programmer- or compiler-selectable. For example,
allowing the programmer or compiler to select the page size might prove useful for
reducing communication overhead in some classes of loops. We need to determine if
such variability can be provided efficiently.
We are currently extending our simulation so it provides more information, and we are
adding the mechanism described in this paper to a low level "emulation" of the execution
model we are developing. Based on these preliminary results, we believe that our approach
will eventually answer a difficult question in distributed processing: how can data be
efficiently distributed?
10. References
[A&C86] Arvind and D.E. Culler. Dataflow Architectures, Annual Reviews in
Computer Science, Vol. 1 1986, pp. 225-253.
[A&N87] A. Aiken and A. Nicolau. Loop Quantization: an Analysis and
Algorithm, Technical Report 87-821, Dept. of Computer Science,
Cornell Univ., March 1987.
[ACK82] W.B. Ackerman. Data Flow Languages, Computer, Feb. 1982, pp.
15-24.
[ANP87] Arvind, R. Nikhil, and K. Pingali. I-structures: Data Structures for
Parallel Computing, Computation Structures Group Memo 269,
Laboratory for Computer Science, MIT, February 1987.
[DEN75] Dennis, J. B. First Version of a Dataflow Procedure Language.
MAC Technical Memo 61, MIT, Cambridge, Mass.
[PEI86] J.-K. Peir. Program Partitioning and Synchronization on
Multiprocessor Systems, Ph.D. Thesis, Univ. of Illinois at Urb.-
Champ., Rept. No. UIUCDCS-R-86-1259, Mar. 1986.
[PGW87] J. Peir, D. Gajski, and M. Wu. Programming Environments for
Multiprocessors, Supercomputing, North-Holland, 1987, pp. 73-93.
[R&F87] D. A. Reed and R. M. Fujimoto. Multicomputer Networks: Message-
Based Parallel Processing, MIT Press, 1987.
[S&F83] Architecture and Applications of the HEP Multiprocessor Computer
System, Denelcor, Denver, Colorado, 1983.
[SMI81] B.J. Smith. Architecture and Applications of the HEP Multiprocessor
Computer System, Society of Photo-Optical Instrumentation
Engineers, Vol. 298, Real-time Signal Processing IV, Aug. 1981, pp.
241-248.
