Compiler and Software Distributed Shared Memory Support for Irregular Applications by Amza, C. et al.
Compiler and Software Distributed Shared Memory Support
for Irregular Applications
Honghui Lu†, Alan L. Cox‡, Sandhya Dwarkadas§,
Ramakrishnan Rajamony†, and Willy Zwaenepoel‡
† Department of Electrical and Computer Engineering, Rice University
‡ Department of Computer Science, Rice University
§ Department of Computer Science, University of Rochester
{hhl, alc, rrk, willy}@cs.rice.edu, sandhya@cs.rochester.edu
Abstract
We investigate the use of a software distributed shared mem-
ory (DSM) layer to support irregular computations on dis-
tributed memory machines. Software DSM supports irreg-
ular computation through demand fetching of data in re-
sponse to memory access faults. With the addition of a
very limited form of compiler support, namely the identi-
fication of the section of the indirection array accessed by
each processor, many of these on-demand page fetches can
be aggregated into a single message, and prefetched prior to
the access fault.
We have measured the performance of this approach for
two irregular applications, moldyn and nbf, using the Tread-
Marks DSM system on an 8-processor IBM SP2. We find
that it has similar performance to the inspector-executor
method supported by the CHAOS run-time library, while
requiring much simpler compile-time support. For moldyn,
it is up to 23% faster than CHAOS, depending on the input
problem’s characteristics; and for nbf, it is no worse than
14% slower. If we include the execution time of the inspec-
tor, the software DSM-based approach is always faster than
CHAOS. The advantage of this approach increases as the
frequency of changes to the indirection array increases. The
disadvantage of this approach is the potential for false sharing
overhead when the data set is small or has poor spatial
locality.
1 Introduction
Inspector-executor methods have been proposed as a way
to efficiently execute irregular computations on distributed
memory machines [18]. A separate loop, the inspector, precedes
the actual computational loop (called the executor).
The inspector loop determines the data read and written
by the individual processors executing the computational
loop. This information is then used to compute a commu-
nication schedule, moving the data from the producers to
the consumers at the beginning and/or end of each loop.
Communication aggregation is used to reduce the number
of messages exchanged. In order to further reduce overhead,
an attempt is made to execute the inspector loop only
once for a large number of iterations of the executor loop.
It has been argued that part or all of the above procedure
can be automated by a compiler [21]. The compiler analysis
involved can, however, be quite complicated [1, 6, 20].
Permission to make digital/hard copy of part or all of this work for
personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage,
the copyright notice, the title of the publication and its date
appear, and notice is given that copying is by permission of ACM,
Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists, requires prior specific permission and/or a fee.
PPoPP '97 Las Vegas, NV
© 1997 ACM 0-89791-906-8/97/0006...$3.50
In this paper we propose an alternative approach. We
use a software distributed shared memory (DSM) layer that
provides a shared memory interface on top of the message
layer [13]. In its simplest form, the DSM layer supports
irregular computations by demand-driven fetching of data.
Our approach involves, in addition, a simple compiler front-
end that generates data access information, enabling the
run-time system to efficiently precompute the set of pages
that will be accessed by each processor during the next it-
eration. These pages are then requested, prior to that itera-
tion, in a single message exchange with each processor from
which data is needed. In other words, our approach extends
the base software DSM layer by enabling it to aggregate the
communication of data for irregular programs.
The compiler support required for our approach is very
simple: it suffices to determine the indirection array, and
the part of the indirection array being accessed by each pro-
cessor. This is usually a regular section [4]. In contrast,
the inspector-executor approach requires complex analysis
to determine whether the inspector loop can be hoisted out
of the main loop [1, 20].
This paper presents our approach in detail. In order to
gather experimental results, we use a modified version of
TreadMarks [2] that supports prefetching and aggregation
in the manner described above. Furthermore, we have aug-
mented the Parascope parallel programming environment [12]
to carry out the required compiler analysis. We present
performance results for two irregular applications, moldyn
and nbf. The results were obtained on an 8-processor IBM
SP2 using base TreadMarks and TreadMarks with aggregation
support. We compare these results to measurements of
hand-coded inspector-executor versions of the same applica-
tions that use the CHAOS run-time library [7].
We find that TreadMarks augmented with compiler sup-
port for communication aggregation in irregular programs
has similar performance to the inspector-executor method
supported by the CHAOS run-time library. For moldyn, it
is up to 23% faster than CHAOS, depending on the input
problem’s characteristics; and for nbf, it is no worse than
14% slower. In addition, it is up to 38% faster than the
base TreadMarks system. If we include the execution time
of the inspector, our approach is always faster than CHAOS.
The advantage of our approach increases as the frequency of
changes to the indirection array increases. Its disadvantage
is the potential for false sharing overhead when the data set
is small or has poor spatial locality.
The outline of the rest of the paper is as follows. Sec-
tion 2 presents some background on the basic run-time pro-
tocol used to implement shared memory, specifically, the
TreadMarks implementation. Section 3 presents the run-time
and compiler support for irregular applications. Section 4
provides a summary of CHAOS, the leading run-time
system for support of irregular applications on message pass-
ing platforms. Section 5 presents the results of our evalua-
tion of the shared memory run-time support, and compares
the results to those from CHAOS. Section 6 provides a summary
of related work. Finally, we conclude in Section 7.
2 Background - TreadMarks
TreadMarks [2] is a software DSM system built at Rice Uni-
versity. It is an efficient user-level DSM system that runs
on commonly available Unix systems. We use TreadMarks
version 1.0.1 as the base shared memory run-time system in
our experiments.
TreadMarks provides programming primitives similar to
those used in hardware shared memory machines, namely,
process creation, shared memory allocation, and lock and
barrier synchronization. Shared memory must be allocated
dynamically using TreadMarks primitives that have the same
syntax as conventional memory allocation calls. A barrier
stalls the calling processor until all processors in the system
have arrived at the same barrier. Locks are used to control
access to critical sections. No processor can acquire a lock
if another processor is holding it.
TreadMarks uses a lazy invalidate [2] version of release
consistency (RC) [9] and a multiple-writer protocol [5] to
reduce the overhead involved in implementing the shared
memory abstraction.
RC is a relaxed memory consistency model. In RC, or-
dinary shared memory accesses are distinguished from syn-
chronization accesses, with the latter category divided into
acquire and release accesses. RC requires ordinary shared
memory updates by a processor p to become visible to an-
other processor q only when a subsequent release by p be-
comes visible to q via some chain of synchronization events.
In other words, to ensure that changes to shared data are
visible, a program must use explicit synchronization. In
practice, this model allows a processor to buffer multiple
writes to shared data in its local memory until a synchronization
point is reached.
The virtual memory hardware is used to detect accesses
to shared memory. Consequently, the consistency unit is a
virtual memory page. The multiple-writer protocol reduces
the effects of false sharing with such a large consistency unit.
With this protocol, two or more processors can simultaneously
modify their own copy of a shared page. Their modifications
are merged at the next synchronization operation in
accordance with the definition of RC, thereby reducing the
effects of false sharing. The merge is accomplished through
the use of diffs. A diff is a run-length encoding of the modifications
made to a page, generated by comparing the page
to a copy saved prior to the modifications (called a twin).
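The twin and diff mechanism can be illustrated with a minimal Python sketch. The function names (make_twin, make_diff, apply_diff) and the (offset, bytes) run encoding are assumptions chosen for illustration; they are not the actual TreadMarks data structures. The sketch shows two processors modifying disjoint parts of their own copies of one page, with the modifications merged afterwards.

```python
def make_twin(page):
    """Save an unmodified copy of the page before the first write."""
    return bytes(page)

def make_diff(twin, page):
    """Encode the modified runs of `page` as (offset, new_bytes) pairs."""
    runs = []
    i, n = 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            runs.append((start, bytes(page[start:i])))
        else:
            i += 1
    return runs

def apply_diff(page, diff):
    """Write each modified run back into a copy of the page."""
    for offset, data in diff:
        page[offset:offset + len(data)] = data

# Two processors write to disjoint parts of their own copies of one page.
original = bytearray(16)
copy_p, copy_q = bytearray(original), bytearray(original)
twin_p, twin_q = make_twin(copy_p), make_twin(copy_q)
copy_p[0:4] = b"\x01\x02\x03\x04"
copy_q[8:10] = b"\xff\xfe"

diff_p = make_diff(twin_p, copy_p)
diff_q = make_diff(twin_q, copy_q)

# Merging both diffs yields a page containing both sets of writes.
merged = bytearray(original)
apply_diff(merged, diff_p)
apply_diff(merged, diff_q)
```

Because each diff records only the bytes that changed, falsely shared pages can be merged without either writer overwriting the other's data.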
TreadMarks implements a lazy invalidate version of RC [2].
A lazy implementation delays the propagation of consistency
information until the time of an acquire. Furthermore, the
releaser notifies the acquirer of which pages have been modified,
causing the acquirer to invalidate its local copies of
these pages. A processor incurs a protection violation on
the first access to an invalidated page, and gets diffs for
that page from the most recent modifiers of the page.
3 Optimization for Irregular Applications
The run-time system described in Section 2 performs com-
munication purely on demand. This can result in extra
messages due to the fact that data is brought in at a page
granularity, and additional run-time overheads in terms of
page faults and interrupts in order to trigger the commu-
nication [8, 14]. This section describes the enhancements
to the TreadMarks run-time system as well as the compiler
analysis necessary in order to optimize performance for pro-
grams with irregular access patterns.
Our approach involves a simple compiler front-end that
identifies the indirection array(s) to the run-time system.
Specifically, the compiler performs a source-to-source transformation
of the program, inserting calls to the (augmented)
run-time DSM library before the indirect accesses. These
calls identify the base addresses of the data arrays, and the
sections of the indirection arrays accessed by a particular
processor. The run-time system then uses this information
to determine the set of shared pages that this processor ac-
cesses. The pages are requested in a single message exchange
with each of the processors from which data is required.
In order to avoid having to recompute the set of pages ac-
cessed on every iteration, the run-time system subsequently
write-protects the shared pages containing the indirection
array. If no memory protection violation occurs for these
pages, then the same set of pages are requested in the next it-
eration. Otherwise, the indirection array has been changed,
and the set needs to be recomputed.
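The following Python sketch illustrates the idea under simplifying assumptions: a 4096-byte page, 8-byte data elements, and a boolean flag standing in for the write-protection fault on the indirection array. The names pages_accessed and PageSetCache are hypothetical and do not correspond to the TreadMarks interface.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)
ELEM_SIZE = 8      # bytes per data element (assumed double precision)

def pages_accessed(base_addr, indirection, lo, hi):
    """Pages of the data array touched through indirection[lo:hi]."""
    pages = set()
    for k in range(lo, hi):
        addr = base_addr + indirection[k] * ELEM_SIZE
        pages.add(addr // PAGE_SIZE)
    return sorted(pages)

class PageSetCache:
    """Recompute the page set only after the indirection array is written."""

    def __init__(self):
        self.dirty = True          # stands in for the protection-fault flag
        self.pages = []
        self.recomputations = 0

    def note_write(self):
        """Called where a real system would take a write-protection fault."""
        self.dirty = True

    def validate(self, base_addr, indirection, lo, hi):
        if self.dirty:
            self.pages = pages_accessed(base_addr, indirection, lo, hi)
            self.dirty = False
            self.recomputations += 1
        return self.pages

indirection = [0, 1, 600, 601, 5000]
cache = PageSetCache()
first = cache.validate(0, indirection, 0, 5)
second = cache.validate(0, indirection, 0, 5)   # reused without a rescan
indirection[4] = 1024
cache.note_write()
third = cache.validate(0, indirection, 0, 5)    # rescanned after the write
```

The second call returns the cached page set, which is the case the write protection is meant to make cheap.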
3.1 Example
We first illustrate our approach with an example using the
moldyn program [3] (see also Section 5). Figure 1 illustrates
the program structure of moldyn, and the force computa-
tion subroutine in which the indirect accesses occur. Each
molecule i has two attributes: its position, x(i), and the
force, forces(i), acting on it.
Figure 2 shows the program transformations applied to
the force computation subroutine. In the transformed version,
each processor first accumulates its contributions to
forces in the local_forces array that is stored in private
memory. After this computation, the processors update the
shared forces in a pipelined fashion. This reduces communication
by eliminating the need to synchronize on every
access to forces and by aggregating the updates to forces.
The compiler-inserted code consists of a Validate at the
start of the ComputeForces subroutine. The Validate initializes
the data structures for the fetch. Then, if necessary,
it computes the pages accessed through the indirection array
interaction_list. Finally, it requests the updates to each
page of x that will be accessed by the executing processor.
To improve performance, Validate aggregates requests for
multiple pages from the same processor.
3.2 Augmented Run-Time System
The run-time system was augmented in order to take advan-
tage of the access information provided by the compiler. We
concentrate here on the support for communication aggre-
gation for irregular accesses. Support for regular accesses
and other optimizations was described in earlier work [8].
      PROGRAM MOLDYN
      DO step = 1, NSTEPS
        IF (mod(step, UPDATE_INTERVAL) .eq. 0) THEN
          call build_interaction_list()
        ENDIF
        ......
        call ComputeForces()
        ......
      ENDDO
      .....
      END

      SUBROUTINE ComputeForces()
      DO i = 1, num_interactions
        n1 = interaction_list(1, i)
        n2 = interaction_list(2, i)
        force = x(n1) - x(n2)
        forces(n1) = forces(n1) + force
        forces(n2) = forces(n2) - force
      ENDDO
      END

Figure 1: Moldyn - main program and the subroutine ComputeForces
      SUBROUTINE ComputeForces()
      Validate(1, INDIRECT, x, interaction_list[1:2, 1:num_interactions], READ, 1)
      DO i = 1, num_interactions
        n1 = interaction_list(1, i)
        n2 = interaction_list(2, i)
        force = x(n1) - x(n2)
        local_forces(n1) = local_forces(n1) + force
        local_forces(n2) = local_forces(n2) - force
      ENDDO
      END

Figure 2: Transformations for the Subroutine ComputeForces in Moldyn
/* fetch_pages is the list of pages to be fetched
   pages[sch] is the list of pages associated with each schedule */
Validate( va_alist )    /* Handles variable number of arguments */
{
    va_list *desc_ptr
    int number          /* number of descriptors */
    int descriptor
    va_start(desc_ptr)
    fetch_pages = NONE
    number = va_arg(desc_ptr, int)
    for (descriptor = 1; descriptor <= number; descriptor++)
    {
        int type = va_arg(desc_ptr, int)         /* descriptor type - DIRECT or INDIRECT */
        char *base = va_arg(desc_ptr, char *)    /* base address of shared data */
        RSD section = va_arg(desc_ptr, RSD)      /* section of shared data or indirection array */
        int access_type = va_arg(desc_ptr, int)  /* READ, WRITE, READ&WRITE, WRITE_ALL, or READ&WRITE_ALL */
        int sch = va_arg(desc_ptr, int)          /* schedule number */
        if (type == INDIRECT)
        {
            if (modified(section))
            {
                pages[sch] = Read_indices(base, section)
                Write_protect(section)
            }
        }
        else
            pages[sch] = pages in section
        fetch_pages += pages[sch] that are invalid
    }
    Fetch_diffs(fetch_pages)
    Apply_diffs(fetch_pages)
    for (descriptor = 1; descriptor <= number; descriptor++)
    {
        if (access_type == WRITE || access_type == READ&WRITE)
            Create_twins(pages[sch])
    }
    va_end(desc_ptr)
}

Figure 3: Augmented Run-Time Interface for Indirect Accesses
Figure 3 provides a summary of the Validate interface for
both regular and irregular accesses.
To support aggregated communication, Validate can
fetch multiple data objects at the same time. Thus, it
takes a variable number of arguments. The first argument
is the number of access descriptors that follow. There is
an access descriptor for each data object. An access descriptor
consists of type, base, section, access_type, and
schedule_number. The type specifies the descriptor type.
It is either DIRECT for regular accesses or INDIRECT for accesses
through an indirection array. The base is the address
of the shared data structure being accessed. The
section is the section of the indirection array used to access
the shared data, or the section of shared data itself in the
case of regular accesses. The access_type is one of READ,
WRITE, or READ&WRITE. Direct accesses have two additional
access types, WRITE_ALL and READ&WRITE_ALL, which indicate
that every element in the section is known at compile time
to be written. The run-time system can use this information
to reduce consistency maintenance overheads by eliminating
twinning on those pages that are completely written. The
schedule_number is an identifier for the set of pages to be
fetched.
If the descriptor type is INDIRECT and the section of
the indirection array has been modified since the last call to
Validate, the modified function returns true, and pages[sch]
is recomputed. Both local and remote modifications cause
the modified function to return true. The Read_indices
procedure recomputes the list of pages, pages[sch], using
base and section. After pages[sch] has been computed,
the pages in section are write protected. A more sophisticated
version of this approach could use diffing (comparing
an old version of the pages containing the indirection
array to the current one) to incrementally recompute the
page sets, but our current implementation does not do so.
Those pages in pages[sch] that are invalid are added to
fetch_pages, the list of pages to be fetched.
Fetch_diffs requests the diffs required to update the
pages in fetch_pages. All of the diff requests to the same
processor are aggregated into a single message. Apply_diffs
waits for the diffs to arrive, and applies them to the appropriate
page.
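The aggregation step can be sketched in a few lines of Python. This is a simplification: it assumes each invalid page has a single most recent modifier, recorded in a hypothetical last_modifier map, whereas a real page may require diffs from several processors. The function name aggregate_requests is also an assumption made for illustration.

```python
from collections import defaultdict

def aggregate_requests(fetch_pages, last_modifier):
    """Group the pages to fetch by the processor that holds their diffs."""
    per_proc = defaultdict(list)
    for page in fetch_pages:
        per_proc[last_modifier[page]].append(page)
    return dict(per_proc)

# page number -> processor holding the most recent diff (hypothetical values)
last_modifier = {10: 1, 11: 1, 12: 2, 40: 3, 41: 1}
fetch_pages = [10, 11, 12, 40, 41]

# One request message per processor instead of one per page.
messages = aggregate_requests(fetch_pages, last_modifier)
```

In this example five page requests collapse into three messages, one per remote processor.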
After updating the pages of shared data, consistency actions
are performed preemptively in order to avoid write detection
overhead during execution. The TreadMarks multiple-writer
protocol requires an unmodified copy (a twin) of the
page to be maintained for every page that is modified, unless
the page is guaranteed to be modified in its entirety.
Validate performs Create_twins on pages[sch] if the
corresponding descriptor has an access_type of WRITE or
READ&WRITE. Create_twins makes a twin of each page in
pages[sch], and enables write access to these pages. This
avoids the memory protection violation that would otherwise
be needed to create the twin.
3.3 Compiler Analysis
The compiler support required for our approach involves de-
termining the indirection array used to access shared data,
and the part of the indirection array being accessed. This is
usually a regular section [4], and hence can be handled by
the existing compiler framework for regular accesses. Our
approach also naturally extends to multiple levels of indirection
in the access pattern without additional mechanisms.
In contrast, the inspector-executor approach requires several
inspector loops to be generated for such access patterns [6].
Furthermore, the inspector-executor approach also requires
sophisticated compiler analysis to pull the inspector as far
forward as possible in the program.
We concentrate here on the additions to the analysis nec-
essary for handling indirect accesses (see [8] for details on
how regular accesses are handled). Let V be the set of shared
variables, let S be the set of all synchronization operations in
the program, and let F be the set of “possible fetch points”,
the locations in the program where a Validate may be in-
serted. If “perfect” analysis were possible, the set F would
be equal to the set S. Indeed, under lazy release consistency,
invalidations only occur at a synchronization point,
and hence synchronization points are the only places in the
program where it makes sense to insert a Validate. In prac-
tice, F includes the set S, but in addition includes condi-
tional statements, loop boundaries, and, in the absence of
interprocedural analysis, procedure calls.
Access analysis generates a summary of shared data ac-
cesses associated with each element of F, and the type of
such accesses. Our main tool is regular section analysis [11].
Regular section descriptors (RSDs) are used to concisely
characterize the array accesses in a loop nest. RSDs represent
the accessed data as linear expressions of the upper
and lower loop bounds along each dimension, and include
stride information.
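A regular section can be pictured as one (lower, upper, stride) triple per array dimension. The Python sketch below is illustrative only: the class names Dim and RSD and the inclusive, Fortran-style bounds are assumptions, and the bounds are concrete integers rather than the symbolic linear expressions a compiler would carry.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Dim:
    lower: int
    upper: int       # inclusive, Fortran-style
    stride: int = 1

@dataclass(frozen=True)
class RSD:
    dims: tuple      # one Dim per array dimension

    def indices(self):
        """Enumerate every index tuple covered by the section."""
        ranges = [range(d.lower, d.upper + 1, d.stride) for d in self.dims]
        return list(product(*ranges))

# interaction_list[1:2, 1:num_interactions] with num_interactions = 3
section = RSD((Dim(1, 2), Dim(1, 3)))
section_indices = section.indices()

# A strided one-dimensional section: elements 1, 3, 5, 7, 9
every_other = RSD((Dim(1, 9, 2),))
strided_indices = every_other.indices()
```

The run-time system only needs such a descriptor for the indirection array; the pages of the data array are derived from the values stored in that section.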
For each statement p in the program, for each definition
or reference in p to an indirection array, a section is constructed.
A {READ}, {WRITE}, or {READ&WRITE} tag is associated
with the section depending on the access type. This
section is associated with each element of F that directly
precedes p.
During the program transformation phase, for each f
in F, if there are access descriptors associated with f, a
Validate is inserted at f. Each access descriptor is then
supplied as a parameter to Validate with either a DIRECT or
INDIRECT type. If the type is INDIRECT, the base address of
the shared data is supplied to the Validate call, along with
the RSD for the indirection array as the section parameter.
See Figure 2 for the results of the analysis and the transformations
on the moldyn program. Since we do not have
interprocedural analysis, the relevant fetch point is entry to
the procedure ComputeForces. The sections of the indirection
array interaction_list are used to fetch the corresponding
page sets of the data array x. After the initial execution
of Validate, interaction_list is write protected.
When the interaction list is modified UPDATE_INTERVAL iterations
later, a memory protection violation occurs. The handler
for this memory protection violation sets a flag. During
the next execution of Validate, if the flag is set, modified
clears the flag and returns true, and Validate recomputes
the set of pages that must be fetched.
4 CHAOS
CHAOS [19] is a run-time library designed to handle irregu-
lar applications on distributed memory machines. There are
three steps in solving irregular problems in CHAOS, namely,
data and iteration partitioning, the inspector, and the ex-
ecutor.
CHAOS supports a number of parallel partitioners that
partition data arrays using heuristics based on spatial position,
computational load, etc. The partitioner returns a
translation table, which contains an irregular assignment of
array elements to processors. A translation table lists the
home processor and offset address of each data array ele-
ment. Depending on storage requirements, the translation
table can be replicated, distributed regularly, or stored in
a paged fashion. This table is used by the inspector to
create the communication schedule. If the translation ta-
ble is not replicated, communication may be necessary in
the inspector. The loop iterations are partitioned by the
almost-owner-computes rule, which assigns an iteration to
the processor that owns a majority of data array elements
accessed in that iteration. The data array can be remapped,
so that data elements owned by a processor are adjacent in
memory. Remapping has the potential advantage that the
memory requirement on a processor is proportional to the
size of the data partitions assigned to it.
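The almost-owner-computes rule can be sketched as follows. The sketch picks the processor that owns the largest number of accessed elements and breaks ties toward the lowest processor number; the tie-breaking rule and the name almost_owner_computes are assumptions, since the text above does not specify how CHAOS resolves ties.

```python
from collections import Counter

def almost_owner_computes(accessed_elements, owner):
    """Assign an iteration to the processor owning most of its elements."""
    counts = Counter(owner[e] for e in accessed_elements)
    best = max(counts.values())
    # Tie-break toward the lowest processor number (an assumption).
    return min(p for p, c in counts.items() if c == best)

# element index -> home processor (hypothetical partition)
owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}

# Each tuple lists the data elements accessed by one loop iteration.
iterations = [(0, 1), (1, 2), (2, 3), (3, 4, 2)]
assignment = [almost_owner_computes(it, owner) for it in iterations]
```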
The Recursive Coordinate Bisection (RCB) partitioner is
one specialized partitioner supported by the CHAOS library
that partitions nodes according to their physical positions.
When simulating physical systems, particles close to each
other in the physical space are more likely to interact with
each other, or to be connected with each other. RCB results
in less communication than a simple BLOCK or CYCLIC
partition on these applications.
Each processor executes the inspector to construct its
communication schedule. A communication schedule speci-
fies which data is communicated and which processors are in-
volved. The inspector constructs the communication sched-
ule by first determining the data read and written on each
processor and then consulting the translation table to de-
termine the global placement of this data according to the
partition. An important optimization in the inspector is to
eliminate duplication. Duplication occurs when a data array
element is pointed to by many elements in the indirection
array. Removing duplication can dramatically reduce the
amount of data communicated. A hash table whose size is
proportional to the size of the data array is employed to
eliminate duplicates. Because of the time to hash the indi-
rection array, and the time to look up the translation table,
the inspector can be expensive. However, this overhead can
be amortized if the indirection array remains unchanged for
a long period of time.
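A minimal Python sketch of the inspector's duplicate removal and schedule construction is given below. It is not the CHAOS API: the function name inspector, the dictionary-based translation table, and the use of a Python set in place of the hash table are all assumptions made for illustration.

```python
def inspector(indirection, translation, me):
    """Build a gather schedule: remote owner -> distinct offsets needed."""
    seen = set()        # plays the role of the duplicate-removal hash table
    schedule = {}
    for g in indirection:
        if g in seen:
            continue    # element already scheduled, skip the duplicate
        seen.add(g)
        home, offset = translation[g]
        if home != me:
            schedule.setdefault(home, []).append(offset)
    return schedule

# global index -> (home processor, local offset); hypothetical values
translation = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1), 4: (2, 0)}
indirection = [0, 2, 2, 3, 4, 2, 4, 1]

schedule = inspector(indirection, translation, me=0)
fetched = sum(len(offsets) for offsets in schedule.values())
```

Here the indirection array makes six remote references, but only three distinct remote elements are scheduled for communication.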
The executor uses the communication schedule gener-
ated by the inspector to gather and scatter data. Gather
fetches off-processor data, and scatter propagates modifi-
cations to off-processor data back to their owners.
5 Experimental Evaluation
We use an 8-processor IBM SP2 running AIX version 3.2.5.
Each processor is a thin node with 64 Kbytes of data cache
and 128 Mbytes of main memory. Interprocessor communication
is accomplished over the IBM SP2 high-performance
switch. Unless indicated otherwise, all results are for 8-
processor runs.
We compare the compiler-optimized TreadMarks programs
with the hand-coded CHAOS programs, as well as the
base TreadMarks programs. The compiler-optimized TreadMarks
programs include optimizations for both regular and
irregular access patterns. Tables 1 and 2 present the execution
times, speedups, number of messages and the amount of
data communicated at 8 processors for the two applications
discussed in this paper.
5.1 Moldyn
Moldyn is a molecular dynamics simulation. Its computa-
tional structure resembles the non-bonded force calculation
in CHARMM [3], which is a well-known molecular dynamics
code used at NIH to model macromolecular systems. Non-bonded
forces are long-range interactions existing between
each pair of molecules. CHARMM approximates the non-
bonded calculation by ignoring all pairs which are beyond a
certain cutoff radius. The cutoff approximation is achieved
by maintaining an interaction list of all the pairs within the
cutoff distance, and iterating over this list at each timestep.
The interaction list is used as an indirection array to identify
interacting partners. Since molecules change their spatial
location every iteration, the interaction list must be period-
ically updated. Figure 1 illustrates the program structure
of moldyn, and the force computation subroutine.
The CHAOS program uses the RCB partitioner to assign
molecules to processors. This partition lasts through the ex-
ecution. When the interaction list is updated, the program
must again call the inspector to identify interacting part-
ners. This call is inserted in the main program, right after
the call to subroutine build_interaction_list. In ComputeForces,
each processor uses the schedule created by the
inspector to gather remote values of x and forces before the
main loop. Both x and forces are modified elsewhere, necessitating
the gather. After the main loop, the processors
again use the schedule to scatter values of forces that will
be read by other processors.
The TreadMarks program also uses the RCB partitioner.
The coordinate array x and the forces array are allocated
in shared memory. A Validate on x that was inserted by the
compiler appears at the beginning of the subroutine Com-
puteForces. That is, the Validate is before the loop over the
interaction list. Changes to the interaction list are detected
by write protecting the pages it occupies (inside Validate).
An explicit inspector call is hence not needed. In ComputeForces,
each processor first accumulates its contributions
to forces in the local_forces array (see Figure 2) that
is stored in private memory. After local_forces is computed,
the processors update the shared forces in a pipelined
fashion in nprocs steps. In each step, a processor updates
1/nprocs of the total data.
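One way to picture the pipelined update is a rotating schedule in which, at step s, processor p adds its private contributions into chunk (p + s) mod nprocs of the shared array. The rotation order, the function names, and the assumption that nprocs divides the array length are illustrative choices; the text above states only that the update takes nprocs steps and that each processor updates 1/nprocs of the data per step.

```python
def pipelined_schedule(nprocs):
    """schedule[step][p] is the chunk of the shared array p updates."""
    return [[(p + step) % nprocs for p in range(nprocs)]
            for step in range(nprocs)]

def pipelined_reduce(local_forces, nprocs):
    """Sum every processor's private array into one shared array."""
    n = len(local_forces[0])
    chunk = n // nprocs                 # assumes nprocs divides n
    shared = [0.0] * n
    for step_chunks in pipelined_schedule(nprocs):
        # Within one step every processor touches a different chunk,
        # so no two processors write the same elements concurrently.
        for p, c in enumerate(step_chunks):
            for i in range(c * chunk, (c + 1) * chunk):
                shared[i] += local_forces[p][i]
        # A barrier would separate consecutive steps in a parallel run.
    return shared

nprocs = 4
local_forces = [[float(p + 1)] * 8 for p in range(nprocs)]
shared = pipelined_reduce(local_forces, nprocs)
steps = pipelined_schedule(nprocs)
```

After nprocs steps every processor has visited every chunk exactly once, so the shared array holds the full sum.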
In TreadMarks, the local_forces array is indexed by
the molecule number without any translation. Thus the
local_forces array is proportional in size to the total number
of molecules. In CHAOS, remapping creates an analog
to the local_forces array that is proportional in size to
the number of molecules assigned to that processor plus the
molecules they interact with. For the default data set, which we used
in our experiments, between 31% and 53% of the molecules
interact. Consequently, remapping has little effect on the
memory utilization of the CHAOS program.
5.1.1 Results
We simulated 16384 particles for 40 iterations, varying the
number of times the interaction list is updated from 1
through 3. The results are presented in Table 1. The data
initialization (and the data partitioning for the parallel programs)
are not timed for either the sequential or parallel
versions.
We first present results for the case where the interaction
list is updated once, at the 20th iteration. The sequential
program without any calls to CHAOS or TreadMarks runs
for 267 seconds. The TreadMarks execution time on a sin-
gle processor is almost identical to that of the sequential
program, spending only 0.4 seconds to check the indirection
lists. On the other hand, the CHAOS program runs longer
on a single processor than the sequential program, because
it spends 6.2 seconds in the inspector.
At eight processors, the CHAOS program runs for 44.9
seconds. We were unable to use a replicated translation table,
owing to the amount of memory that it required. The
translation table is hence distributed, necessitating communication
in the calls to the inspector. In the case where the
interaction list is updated at the 20th iteration, the inspector
is called twice, including once at the beginning of the
program. Each processor spends 4.6 seconds in the inspector.
Exchanging the translation tables causes the transfer
of 85 Mbytes of data in 878 messages.

Update frequency      Version        Time (sec.)  Speedup  Messages  Data (MB)
Every 20 iterations   CHAOS                 44.9      6.0     15704        190
(seq = 267.2 sec)     Tmk base              42.3      6.3     62149        160
                      Tmk optimized         37.7      7.1     14528        137
Every 15 iterations   CHAOS                 61.7      5.9     16255        243
(seq = 365.8 sec)     Tmk base              56.4      6.5     70230        179
                      Tmk optimized         48.9      7.5     14687        141
Every 11 iterations   CHAOS                 78.2      6.0     16806        296
(seq = 467.3 sec)     Tmk base              68.1      6.9     71788        190
                      Tmk optimized         60.4      7.7     14871        145

Table 1: Moldyn - 8 processor results. The interaction list is updated at varying intervals.
The base TreadMarks program (without any compiler
support) runs for 42.3 seconds on eight processors. Tread-
Marks is able to achieve a performance comparable to CHAOS
because of the large problem size, and the good data locality
provided by the RCB partitioner. However, the number of
messages sent in TreadMarks is three times more than that
in CHAOS. The reason is that TreadMarks obtains data one
page at a time, while CHAOS sends all the data needed by
a processor in a single message.
With the compiler optimizations, the TreadMarks running
time comes down to 37.7 seconds, which is an 11%
improvement over the base TreadMarks. Of this improvement,
7 percentage points come from the communication
aggregation for regular accesses. The remaining 4 percentage
points come from the compiler-inserted call to Validate
for the indirect accesses. The optimized TreadMarks program
sends 23 Mbytes less data than the base TreadMarks
program because the reductions in the base program cause
multiple overlapping diffs to be sent for each diff request.
In the optimized program, on encountering a reduction, the
compiler recognizes read-write accesses to an entire regular
section. It then flags via Validate that the entire page, and
not the diff, must be sent on a diff request. This reduces the
amount of data sent as compared to the base TreadMarks
program. The optimized TreadMarks program spends 0.6
seconds in Validate to check the indirection array.
When the interaction list is updated more often, the run-
ning times increase because of the time taken to rebuild the
interaction list. CHAOS suffers from having to rerun the
inspector. When the interaction list is updated every 11
iterations, CHAOS spends 9.2 seconds per processor on av-
erage in the inspector, while TreadMarks spends only 0.8
seconds in scanning the indirection list. As a result, the
optimized TreadMarks program is 23% faster than CHAOS.
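The percentages and speedups quoted in this section can be checked against the execution times in Table 1. The short Python sketch below assumes that "X% faster" and "X% improvement" denote the relative reduction in execution time, and that speedup is sequential time divided by parallel time; this reading is an interpretation, although it reproduces the reported figures.

```python
def percent_faster(slow, fast):
    """Reduction in running time of `fast` relative to `slow`, in percent."""
    return 100.0 * (slow - fast) / slow

def speedup(sequential, parallel):
    """Sequential execution time divided by parallel execution time."""
    return sequential / parallel

# Execution times in seconds, taken from Table 1.
tmk_vs_base_20 = percent_faster(42.3, 37.7)    # update every 20 iterations
tmk_vs_chaos_11 = percent_faster(78.2, 60.4)   # update every 11 iterations
chaos_speedup_20 = speedup(267.2, 44.9)
tmk_opt_speedup_11 = speedup(467.3, 60.4)
```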
5.2 NBF
NBF is the kernel of a molecular dynamics simulation. It
is taken from the GROMOS benchmark [10]. It was pre-
viously used as an example to demonstrate compiler gener-
ated message passing programs [22]. Instead of keeping a
list of pairs of interacting molecules like moldyn, nbf keeps
a list of interacting partners for each molecule. The lists of
partners are concatenated together, with a per molecule list
pointing to the end of each molecule’s partners in the part-
ner list. For each molecule, the program goes through the
list of partners, and updates the forces on both a molecule
and its partner based on the distance between them. In our
experiments, the partner list is static. Each molecule has
approximately the same number of partners, and the partners
of each molecule are spread evenly over about 2/3 of the total
space. Because each molecule has about the same number
of neighbors, a simple BLOCK partition suffices to balance
the load.
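The partner-list layout described above can be sketched as follows. The names `partner_end` and `partners`, the one-dimensional coordinates, and the placeholder force law are assumptions for illustration and are not taken from the nbf source. The sketch also assumes that the per-molecule entry holds the index one past the last partner of that molecule.

```python
def nbf_forces(x, partner_end, partners):
    """Accumulate pairwise forces using concatenated partner lists.

    partner_end[i] is the index one past the last partner of molecule i
    in `partners`; molecule i's partners start where molecule i-1's end.
    """
    forces = [0.0] * len(x)
    start = 0
    for i, end in enumerate(partner_end):
        for k in range(start, end):
            j = partners[k]          # indirect access through the partner list
            f = x[i] - x[j]          # placeholder force based on the distance
            forces[i] += f           # update the molecule ...
            forces[j] -= f           # ... and its partner
        start = end
    return forces

x = [0.0, 1.0, 3.0]
partner_end = [2, 3, 3]      # molecule 0: partners[0:2], 1: partners[2:3], 2: none
partners = [1, 2, 2]
print(nbf_forces(x, partner_end, partners))
```

Because every interaction updates both `forces[i]` and `forces[j]`, a processor that owns molecule `i` also writes to entries owned by other processors, which is why the force updates need a reduction.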
In the CHAOS program, the inspector is called at the
beginning of the program, outside the loop simulating the
time steps. At the start of each time step, a gather is called
to collect the updated values of coordinates from remote
processors. A scatter is invoked at the end of each time
step to propagate the modifications to the force array.
The TreadMarks program allocates both the coordinate
array and the force array in shared memory. A Validate is
performed at the start of each time step to fetch the updated
values of the coordinate array. Like moldyn, updates to
the forces are accumulated in private memory. After this
computation, the processors update the shared forces in a
pipelined fashion. The update is performed in nprocs steps.
In each step, a processor updates 1/nprocs of the total data.
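A sequential simulation can illustrate the pipelined update. The text states only that there are nprocs steps and that each processor updates 1/nprocs of the data per step; the rotation schedule below, in which processor p handles chunk (p + step) mod nprocs, and the function name `pipelined_reduction` are assumptions for illustration.

```python
def pipelined_reduction(private, nprocs):
    """Sum per-processor private arrays into one shared array in nprocs steps."""
    n = len(private[0])
    chunk = n // nprocs                      # assumes n is divisible by nprocs
    shared = [0.0] * n
    for step in range(nprocs):
        # Within a step, each processor works on a distinct chunk,
        # so no two processors write the same shared elements.
        for p in range(nprocs):
            c = (p + step) % nprocs          # assumed rotation schedule
            for i in range(c * chunk, (c + 1) * chunk):
                shared[i] += private[p][i]
        # In a parallel run, a synchronization point would separate the steps.
    return shared

private = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(pipelined_reduction(private, 2))
```

After nprocs steps every processor has added its private contribution to every chunk, without two processors writing the same chunk in the same step.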
For the data set which we used in our experiments, 84%
of the molecules interact. Consequently, remapping yields
little reduction in the memory utilization of the CHAOS
program.
5.2.1 Results
We ran nbf with varying numbers of molecules for the in-
put problem size (see Table 2). Each molecule is represented
by a double precision floating point number. Each molecule
has 100 partners. The distance between two adjacent partners
of a molecule is about 470 molecules. The test runs
for 11 iterations, of which the last 10 iterations are timed.
Thus, the results include neither the time to perform the
inspector in the CHAOS version nor the time for checking
the partner array in the TreadMarks program.
The unmodified (original) sequential program runs for
78.3 seconds with a problem size of 64 x 1024. The single-
processor TreadMarks execution time is almost identical to
that of the sequential program, spending only 0.001 seconds
in scanning the indirection array. On the other hand, the
CHAOS program runs longer on a single processor than the
sequential program, because it spends 7.3 seconds in the
inspector.

Problem Size       Program         Time (sec.)  Speedup  Messages  Data (MB)
64 x 1024          CHAOS              10.9        7.2       2014       60
(seq = 78.3 sec)   Tmk base           19.6        4.0      34421      212
                   Tmk optimized      12.1        6.5       4817       68
64 x 1000          CHAOS              10.6        7.2       2014       59
(seq = 76.5 sec)   Tmk base           19.4        3.9      36278      209
                   Tmk optimized      12.3        6.2       4920       76
32 x 1024          CHAOS               5.5        7.1       2014       30
(seq = 39.1 sec)   Tmk base            9.1        4.3      18095      106
                   Tmk optimized       6.2        6.3       3851       34

Table 2: NBF Kernel, 8-processor results.
At eight processors, the CHAOS program and the optimized
TreadMarks program run for 10.9 seconds and 12.1
seconds, respectively. The inspector is not included in the
timing for CHAOS. The main reason for the 10% differ-
ence is that CHAOS pushes the data to the processors that
will use it in one message, while TreadMarks uses request–
response communication (necessitating two messages). The
13% extra data sent in TreadMarks is due to false sharing.
Although we excluded the time to run the inspector from
the timing, it is important to note that at eight processors,
the CHAOS program spends 5.2 seconds per processor to
create the schedule. In contrast, the TreadMarks program
only spends 0.3 seconds going through the indirection array.
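A rough break-even estimate follows from these figures. It assumes, as a simplification not stated in the source, that the inspector and scanning costs simply add to the timed loop, and that the per-iteration cost is the 10-iteration time divided by 10.

```python
TIMED_ITERATIONS = 10          # the last 10 of 11 iterations are timed

chaos_loop, tmk_loop = 10.9, 12.1      # seconds, 64 x 1024, 8 processors
chaos_setup, tmk_setup = 5.2, 0.3      # inspector vs. indirection-array scan

# Time CHAOS gains per iteration once its schedule exists.
per_iter_gap = (tmk_loop - chaos_loop) / TIMED_ITERATIONS
# Time TreadMarks gains each time the indirection array must be reprocessed.
setup_gap = chaos_setup - tmk_setup
break_even = setup_gap / per_iter_gap

print(f"CHAOS total with one inspector run: {chaos_loop + chaos_setup:.1f} s")
print(f"TreadMarks total with one scan:     {tmk_loop + tmk_setup:.1f} s")
print(f"Break-even: about {break_even:.0f} iterations per indirection-array change")
```

Under these assumptions, CHAOS would recover the cost of its inspector only if the indirection array stayed unchanged for roughly 40 iterations or more.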
The compiler optimizations reduce the execution time of
the base TreadMarks version by 38%. Of this reduction, 34
percentage points come from optimizations in the regular
part of the code, such as the pipelined reduction. These
optimizations reduce both the number of messages and the
amount of data sent in the program. The remaining 4 per-
centage points come from prefetching the data for the irreg-
ular accesses at the beginning of each time step.
Reducing the problem size to 32 x 1024 does not affect the
relative performance of TreadMarks and CHAOS much. The
difference in performance comes from TreadMarks having to
request data, as in the case of the 64 x 1024 problem size.
Changing the data set size to 64 x 1000, we introduce false
sharing at the boundary between pairs of processors. In
this case, the optimized TreadMarks program is 14% slower
than the CHAOS program, because of the extra messages
and data caused by false sharing. However, the cost of the
inspector in CHAOS overshadows the performance loss from
false sharing in TreadMarks.
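Why the 64 x 1000 data set introduces false sharing, while the other two sizes do not, can be checked with a short calculation. The 4096-byte page size is an assumption, since this section does not state it. The 8-byte element size follows from the one double per molecule stated above, and the even BLOCK split over 8 processors follows from the partitioning described earlier.

```python
PAGE_SIZE = 4096   # bytes; assumed, not stated in this section
ELEM_SIZE = 8      # bytes; one double-precision value per molecule

def shared_boundary_pages(n_molecules, nprocs):
    """Count BLOCK-partition boundaries that fall strictly inside a page."""
    block = n_molecules // nprocs            # assumes an even BLOCK split
    count = 0
    for p in range(1, nprocs):
        boundary_byte = p * block * ELEM_SIZE
        if boundary_byte % PAGE_SIZE != 0:   # two processors share this page
            count += 1
    return count

for n in (64 * 1024, 64 * 1000, 32 * 1024):
    print(n, shared_boundary_pages(n, 8))
```

With 64 x 1024 and 32 x 1024 molecules, every block boundary coincides with a page boundary under these assumptions. With 64 x 1000 molecules, each block is 64000 bytes, which is not a multiple of the page size, so each pair of neighboring processors shares a page.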
6 Related Work
A large number of studies have been published on the perfor-
mance of distributed shared memory and inspector-executor
systems, but, to the best of our knowledge, only one paper
has been published comparing the two approaches. Mukher-
jee et al. [16] compare the CHAOS inspector-executor sys-
tem to the TSM (transparent shared memory) and the XSM
(extendible shared memory) systems, both implemented on
the Tempest interface [17]. Three applications are used:
moldyn, unstructured, and DSMC, and the comparison is
done on a 32-processor CM-5. They conclude that TSM
is not competitive with CHAOS, while XSM achieves per-
formance comparable to CHAOS after introducing several
special-purpose protocols.
Our study differs from the cited paper in several aspects.
First, our transparent shared memory system (TreadMarks)
performs significantly better than TSM. We attribute this
difference in performance to TreadMarks’ use of lazy re-
lease consistency and multiple writer protocols, in contrast
to the sequential consistency and single writer protocols
used in TSM. Second, we use a compiler to optimize the
shared memory programs, rather than relying on handcoded
special-purpose protocols. As indicated in our study, the
compiler analysis necessary is relatively straightforward.
Our study is also related to the many papers on prefetch-
ing and aggregation. In particular, Mowry et al. [15] use a
somewhat similar strategy to prefetch and aggregate disk
requests for sequential programs, and Dwarkadas et al. [8]
study prefetching and aggregation for regular applications
in software distributed shared memory systems.
7 Conclusions
We have described an integrated compile-time/run-time ap-
proach for executing irregular computations on distributed
memory machines. This approach is based on a modified
software distributed shared memory layer, and fairly sim-
ple compile-time support. The only required compile-time
support is regular section analysis of the indirection arrays.
Run-time support for dynamic detection of changes to the
indirection array, as well as to the shared data, eliminates
any unnecessary computation and communication. Further-
more, the communication by each processor is aggregated
into fewer message exchanges.
We measured this approach for two irregular applications,
moldyn and nbf, using the TreadMarks DSM system
on an 8-processor IBM SP2. We find that it has similar per-
formance to the inspector-executor method supported by
the CHAOS run-time library, while requiring much simpler
compile-time support. For moldyn, it is up to 23% faster
than CHAOS, depending on the input problem’s characteristics;
and for nbf, it is no worse than 14% slower. The
advantage of the software DSM-based approach increases as
the frequency of changes to the indirection array increases.
The disadvantage of this approach is the potential for false
sharing overhead when the data set is small or has poor spa-
tial locality. In addition, in both moldyn and nbf, the soft-
ware DSM-based approach eliminated substantial inspector
overheads. For both applications, the software DSM-based
approach is always faster than CHAOS if we include the
execution time of the inspector.
Acknowledgements
This work is supported in part by the National Science Foun-
dation under Grants CCR-9410457, BIR-9408503, CCR-
9457770, CCR-9502500, CCR-9521735, CDA-9502791, and
MIP-9521386, by the Texas TATP program under Grant
003604-017, and by grants from IBM Corporation and from
Tech-Sym, Inc. Ram Rajamony is also supported by an IBM
Cooperative Fellowship.
References
[1] G. Agarwal and J. Saltz. Interprocedural compilation of
    irregular applications for distributed memory machines.
    In Proceedings of Supercomputing ’95, December 1995.
[2] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu,
    R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks:
    Shared memory computing on networks of workstations.
    IEEE Computer, 29(2):18-28, February 1996.
[3] B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J.
    States, S. Swaminathan, and M. Karplus. Charmm:
    A program for macromolecular energy, minimization,
    and dynamics calculations. Journal of Computational
    Chemistry, 4:187, 1983.
[4] D. Callahan and K. Kennedy. Analysis of interprocedural
    side effects in a parallel programming environment.
    Journal of Parallel and Distributed Computing,
    5:517-550, 1988.
[5] J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Techniques
    for reducing consistency-related information in
    distributed shared memory systems. ACM Transactions
    on Computer Systems, 13(3):205-243, August 1995.
[6] R. Das, P. Havlak, J. Saltz, and K. Kennedy. Index
    array flattening through program transformation. In
    Proceedings of Supercomputing ’95, December 1995.
[7] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication
    optimizations for irregular scientific computations
    on distributed memory architectures. Journal
    of Parallel and Distributed Computing, 22(3):462-479,
    September 1994.
[8] S. Dwarkadas, A.L. Cox, and W. Zwaenepoel. An
    integrated compile-time/run-time software distributed
    shared memory system. In Proceedings of the 7th Symposium
    on Architectural Support for Programming Languages
    and Operating Systems, October 1996.
[9] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons,
    A. Gupta, and J. Hennessy. Memory consistency and
    event ordering in scalable shared-memory multiprocessors.
    In Proceedings of the 17th Annual International
    Symposium on Computer Architecture, pages 15-26,
    May 1990.
[10] W.F. van Gunsteren and H.J.C. Berendsen. GROMOS:
    GROningen MOlecular Simulation software. Technical
    report, Laboratory of Physical Chemistry, University of
    Groningen, 1988.
[11] P. Havlak and K. Kennedy. An implementation
    of interprocedural bounded regular section analysis.
    IEEE Transactions on Parallel and Distributed Systems,
    2(3):350-360, July 1991.
[12] K. Kennedy, K.S. McKinley, and C. Tseng. Analysis
    and transformation in an interactive parallel programming
    tool. Concurrency: Practice and Experience, 5(7),
    October 1993.
[13] K. Li and P. Hudak. Memory coherence in shared virtual
    memory systems. ACM Transactions on Computer
    Systems, 7(4):321-359, November 1989.
[14] H. Lu, S. Dwarkadas, A.L. Cox, and W. Zwaenepoel.
    Message passing versus distributed shared memory on
    networks of workstations. In Proceedings of Supercomputing
    ’95, December 1995.
[15] T.C. Mowry, A.K. Demke, and O. Krieger. Automatic
    compiler-inserted I/O prefetching for out-of-core applications.
    In Proceedings of the Second USENIX Symposium
    on Operating System Design and Implementation,
    pages 3-17, November 1996.
[16] S.S. Mukherjee, S.D. Sharma, M.D. Hill, J.R. Larus,
    A. Rogers, and J. Saltz. Efficient support for irregular
    applications on distributed memory machines. In Proceedings
    of the 5th Symposium on the Principles and
    Practice of Parallel Programming, July 1995.
[17] Steven K. Reinhardt, James R. Larus, and David A.
    Wood. Tempest and Typhoon: User-level shared memory.
    In Proceedings of the 21st Annual International
    Symposium on Computer Architecture, pages 325-337,
    April 1994.
[18] J. Saltz, H. Berryman, and J. Wu. Multiprocessors
    and run-time compilation. Concurrency: Practice and
    Experience, 3(6):573-592, December 1991.
[19] S. Sharma, R. Ponnusamy, B. Moon, Y. Hwang, R. Das,
    and J. Saltz. Interprocedural compilation of irregular
    applications for distributed memory machines. In Proceedings
    of Supercomputing ’95, December 1995.
[20] R. von Hanxleden and K. Kennedy. Give-N-Take – a
    balanced code placement framework. In Proceedings
    of the ACM SIGPLAN ’94 Conference on Programming
    Language Design and Implementation, June 1994.
[21] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das,
    and J. Saltz. Compiler analysis for irregular problems
    in Fortran D. In Proceedings of the 5th Workshop on
    Languages and Compilers for Parallel Computing,
    August 1992.
[22] Reinhard von Hanxleden. Handling irregular problems
    with Fortran D – a preliminary report. In Proceedings
    of the Fourth Workshop on Compilers for Parallel
    Computers, December 1993.
