Combining Compile-Time and Run-Time Support for Efficient Distributed Shared Memory by Zwaenepoel, Willy et al.
Combining CompileTime and RunTime Support for
Ecient Software Distributed Shared Memory
Sandhya Dwarkadas†, Honghui Lu‡, Alan L. Cox*, Ramakrishnan Rajamony‡, and Willy Zwaenepoel*
† Department of Computer Science, University of Rochester
‡ Department of Electrical and Computer Engineering, Rice University
* Department of Computer Science, Rice University
Abstract
We describe an integrated compile-time and run-time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model, with its well-known benefits in terms of ease of use. The run-time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run-time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses and transforms the code to insert calls to the run-time system that provide it with the access information computed by the compiler. The run-time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run-time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run-time's maintenance of shared memory, and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.
I Introduction
Parallel programming using a shared memory platform
has the advantage of easeofuse In contrast to message
passing the user does not have to worry about data loca
tion or have to explicitly manage communication Unfortu
nately as parallel computers move away from the uniform
memory access model in order to improve scalability this
This work was supported in part by NSF grants CCR	
CCR

	 CCR
	 CCR
	 CCR

	 CCR


	 CDA
	 and MIP
	 by the Texas TATP pro
gram under Grant 	 and by grants from IBM Corporation
and from TechSym	 Inc Ram Rajamony is also supported by an
IBM Cooperative Fellowship
transparency of shared memory comes into question Mes
sage passing programs tuned to nonuniform memory ac
cess latencies often produce better performance Our goal
is to develop a system that continues to provide the user
with a transparent shared memory programming model
but underneath is capable of exploiting the hardwares mes
sage passing capabilities We focus our work on distributed
memory machines in which the shared memory abstraction
is provided entirely in software
A software distributed shared memory SDSM	 system
eg 
	 provides a shared memory abstraction on a dis
tributed memory machine using purely runtime mecha
nisms During execution a SDSM system detects shared
memory accesses handles faults by fetching the missing
data and caches data for future reference Such a system
can handle any kind of data access pattern However when
the access patterns are predictable the ondemand data
fetching causes extra messages and consistency actions in
creasing overheads and resulting in reduced performance
compared to message passing
Research and commercial compilers for parallel computing on distributed memory machines have to date targeted the underlying message passing layer directly. The compiler analyzes memory access patterns to generate message passing code, which is then optimized to aggregate communication and minimize data movement. For programs with regular access patterns that can be precisely analyzed, these compile-time systems provide superior performance, since they avoid the run-time overhead present with SDSM systems. However, when the access patterns cannot be analyzed precisely, the message passing code generated by the compiler becomes inefficient. In the case of irregular accesses, for example, a simplistic compiler approach would result in a broadcast of all data produced by a processor, causing large amounts of communication.
Inspectorexecutor methods have been proposed to deal
with this problem of irregular computations on distributed
memory machines 
 A separate loop the inspector pre
cedes the actual computational loop called the executor
The inspector precomputes the data that will be accessed
by the individual processors when executing the compu
tational loop This information is used to create a com
munication schedule which is then used to aggregate the
movement of data from the producers to the consumers at
the beginning andor end of each loop The high cost of the
inspector is amortized when possible by executing it only
once for a set of executor iterations A compiler algorithm
to automate this procedure is described in von Hanxleden
et al 
 However the required compiler analysis can be
quite complex  
 
 
	
Our goal is to combine the benefits of SDSM systems with those of compiler-based approaches for generating code for distributed memory systems. In the combined system, the run-time library remains the basic vehicle for implementing shared memory, while the compiler performs optimization rather than implementation. Instead of generating a message passing program directly, the compiler generates a shared memory program augmented with run-time calls that describe the data access patterns. By informing the run-time system of future shared access patterns, these calls allow the run-time system to avoid memory access detection and on-demand fetching of missing data. Furthermore, they permit the aggregation of several data fetches into a single message.
An interesting aspect of this combined system is that it efficiently supports programs with regular accesses, programs with both regular and irregular accesses, and programs with completely irregular accesses. If the accesses are completely regular, then the compiler can analyze all of them, and the resulting code is as efficient as that of hand-coded or compiler-generated message passing. If the program contains code in which an array is accessed indirectly through an indirection array, we can still analyze the (usually regular) accesses to the indirection array and derive considerable performance improvement from that analysis. If the compiler analysis fails, the program is unmodified and handled solely by the run-time system. The combination of a shared memory compiler and an SDSM system thus avoids the complexity of the inspector-executor approach for irregular access patterns, without compromising efficiency for regular access patterns.
We extended the Parascope parallel programming environment to analyze and transform explicitly parallel programs. We use regular section analysis to determine the shared data access patterns. The resulting regular section descriptors (RSDs) describe the accesses to the data array (in the case of regular accesses) or to the indirection array (in the case of irregular accesses). We also extended the interface to the TreadMarks run-time SDSM system to take advantage of the compiler analysis.
We have measured the performance of these techniques on an IBM SP for applications with both regular and irregular access patterns. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements in comparison to the base run-time system. Performance is also comparable to that using compile-time alternatives such as Applied Parallel Research's XHPF compiler (for regular access patterns) and the CHAOS inspector-executor based system (for irregular access patterns).
The outline of the rest of this paper is as follows. Section II describes the combined compile-time/run-time shared memory system. Section III presents the performance results. In Section IV, we outline the applicability of our techniques to other platforms and architectures. Finally, we survey related work in Section V and conclude in Section VI.
II The Combined CompileTime RunTime Shared
Memory System
We rst provide some background on TreadMarks 

the runtime system we used in our implementation We
then discuss how the compiler analyzes the shared data
accesses in TreadMarks programs The runtime primitives
by which the compiler informs TreadMarks of the results
of its analysis are discussed next We are then ready to
describe the transformation from TreadMarks source code
into code augmented by calls to these primitives Finally
we illustrate the entire process with two sample programs
A The Base RunTime Shared Memory System
TreadMarks 
 is an SDSM system built at Rice Uni
versity It is an ecient userlevel SDSM system that runs
on commonly available Unix systems We use TreadMarks
version  as the base shared memory runtime system
in our experiments
TreadMarks provides explicitly parallel programming
primitives similar to those used in hardware shared mem
ory machines namely process creation shared memory al
location and lock and barrier synchronization The system
supports a release consistent RC	 memory model 
 re
quiring the programmer to use explicit synchronization to
ensure that changes to shared data become visible
TreadMarks uses a lazy invalidate version of RC and a multiple-writer protocol to reduce the overhead involved in implementing the shared memory abstraction. The virtual memory hardware is used to detect accesses to shared memory. Consequently, the consistency unit is a virtual memory page. The multiple-writer protocol reduces the effects of false sharing with such a large consistency unit. With this protocol, two or more processors can simultaneously modify their own copy of a shared page. Their modifications are merged at the next synchronization operation, in accordance with the definition of RC. The merge is accomplished through the use of diffs. A diff is a run-length encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications (called a twin).
With the lazy invalidate protocol, a process invalidates, at the time of an acquire synchronization operation, those pages for which it has received notice of modifications by other processors. On a subsequent page fault, the process fetches the diffs necessary to update its copy.
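A toy model of this behavior is sketched below. The class and method names are hypothetical, and write notices are reduced to (page, writer) pairs; the point is only that invalidation happens at the acquire while data is fetched later, and only for pages that are actually touched.

```python
class Process:
    def __init__(self, name, num_pages):
        self.name = name
        self.valid = [True] * num_pages   # local validity of each page
        self.pending = {}                 # page -> writers whose diffs are missing
        self.fetches = 0                  # number of diffs fetched so far

    def acquire(self, write_notices):
        """At an acquire, invalidate every page named in a write notice."""
        for page, writer in write_notices:
            self.valid[page] = False
            self.pending.setdefault(page, []).append(writer)

    def access(self, page):
        """On a fault, fetch the diffs for this page only, then revalidate it."""
        if not self.valid[page]:
            self.fetches += len(self.pending.pop(page, []))
            self.valid[page] = True

p1 = Process("P1", num_pages=4)
p1.acquire([(0, "P0"), (2, "P0")])   # pages 0 and 2 were modified by P0
p1.access(0)                         # page 0 faults and is updated
# Page 2 stays invalid and costs nothing unless P1 touches it.
```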
B Compiler Analysis
The purpose of our compiler analysis is to provide ac
cess pattern information to the runtime system This in
volves not only analyzing the program statements to deter
mine what data is accessed but also determining at which
statement in the program to supply this information to the
runtime system
To answer the latter question we take advantage of the
special role that synchronization points play in release
consistent parallel programs First they are the points in
the execution of a program where shared data needs to be
made consistent Second they are also the points at which
it is determined what data modied on other processors
needs to be reected locally for memory to be consistent
We therefore analyze code segments between consecutive
synchronization statements and provide the runtime sys
tem with a description of the accesses in a segment at that
segments initial synchronization statement
In practice limitations of the analysis tool may restrict
the extent to which we can implement this general princi
ple For instance the presence of conditional statements or
 in the absence of interprocedural analysis  procedure
calls may limit the region of code for which we can summa
rize the shared memory access patterns In those cases we
may need to limit analysis accordingly and place the calls
that provide the access information to the runtime system
at procedure entry points or at control ow statements
Our main tool for access analysis is regular section analysis. Regular section descriptors (RSDs) concisely represent the array accesses in a loop nest. The RSDs represent the accessed data as linear expressions of the upper and lower loop bounds along each dimension, and include stride information. When indirection arrays are involved, the RSDs can be used recursively, with each indirection representing the regular section for the indirection array used. The access patterns that can be analyzed are, however, limited to linear expressions of the loop indices. In addition to the memory locations accessed, our RSDs also contain a tag indicating, among other things, whether the accesses are read or write or both. Figure 1 outlines the steps in our algorithm.
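As a rough illustration of what a regular section descriptor carries, the sketch below models a direct (non-indirect) RSD as a list of per-dimension bounds and strides plus an access tag. The type names Dim and RSD and the helper elements are assumptions made for this example, not the data structures used in the compiler.

```python
from itertools import product
from typing import FrozenSet, NamedTuple, Tuple

class Dim(NamedTuple):
    lower: int
    upper: int          # inclusive upper bound
    stride: int = 1

class RSD(NamedTuple):
    array: str
    dims: Tuple[Dim, ...]   # one entry per array dimension
    tag: FrozenSet[str]     # for example {"read"}, {"write"}, or both

def elements(section):
    """Enumerate the index tuples described by a direct regular section."""
    ranges = [range(d.lower, d.upper + 1, d.stride) for d in section.dims]
    return set(product(*ranges))

# Summary of the writes in:  do j = 2, 5 / do i = 1, 8, 2 / b(i,j) = ...
section = RSD("b", (Dim(1, 8, 2), Dim(2, 5)), frozenset({"write"}))
touched = elements(section)
```

The descriptor stays small regardless of how many elements the loop nest touches, which is what makes it cheap to hand to the run-time system.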
C The Augmented RunTime System
In addition to the original TreadMarks primitives the
augmented runtime system provides two primary inter
faces for use by the compiler Validate and Push
Validate and its variant Validate w sync sup
port aggregated communication They can fetch dis
for multiple pages with a single message exchange
Validate w sync in addition piggybacks the request for
dis on the next synchronization operation The calls pro
vide a set of access descriptors corresponding to the RSDs
obtained in the analysis see Section IIB	 The runtime
system uses these descriptors to determine the set of in
valid pages that will be accessed The data for the invalid
pages can then be requested in a single message exchange
per processor An additional access type parameter in the
Validate interface allows further optimizations to avoid
communication and to reduce the overhead of consistency
maintenance
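The aggregation step can be sketched as follows, under simplifying assumptions that are not taken from the paper: sections are one-dimensional, the page size is 4096 bytes, and each invalid page has a single processor (last_writer) holding its modifications. All names here are hypothetical.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes; assumed for illustration

def pages_touched(base_addr, elem_size, lower, upper, stride=1):
    """Pages covered by elements lower..upper (inclusive) of a 1-D array."""
    return sorted({(base_addr + i * elem_size) // PAGE_SIZE
                   for i in range(lower, upper + 1, stride)})

def plan_validate(descriptors, invalid_pages, last_writer):
    """Group the invalid pages named by all descriptors by the processor
    holding their modifications: one request message per processor."""
    requests = defaultdict(set)
    for base_addr, elem_size, lower, upper, stride in descriptors:
        for page in pages_touched(base_addr, elem_size, lower, upper, stride):
            if page in invalid_pages:
                requests[last_writer[page]].add(page)
    return {proc: sorted(pages) for proc, pages in requests.items()}

# One section of 2048 eight-byte elements spans pages 0..3; pages 1..3 are invalid.
descriptors = [(0, 8, 0, 2047, 1)]
invalid_pages = {1, 2, 3}
last_writer = {1: "P1", 2: "P1", 3: "P2"}
plan = plan_validate(descriptors, invalid_pages, last_writer)
```

In this example, three separate page faults would become two request messages, one per processor holding the needed data.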

   1. Create V, the set of shared variables in the program. Create S, the set of all synchronization operations in the program. Initialize F (the set of all transformation points) to S.
   2. For each statement p in the program:
      (a) By traversing the abstract syntax tree (AST) in all possible control flow directions along which p can be reached, create the set F_prec(p) of all the directly preceding synchronization points. If no synchronization statements are found along any one direction, include the control flow statement along that direction in F and F_prec(p).
      (b) By traversing the AST in all possible control flow directions starting from p, create the set S_succ(p) of all possible synchronization points that directly succeed the statement.
      (c) For each statement f in the set F_prec(p):
          i. Determine the location of the outermost loop that encloses p but not f or any member of the set S_succ(p). Intuitively, this corresponds to determining the code segment between consecutive synchronization statements for which accesses must be summarized.
          ii. Construct a regular section for each definition or reference (both regular and irregular) in p to a variable in V. Add a {read} or {write} tag to the section. Determine the reaching definitions for each reference to a variable in V (this can be done during the AST traversal to create F_prec(p)). If these definitions occur after f, add the write-first attribute to the tag.
          iii. Perform a union of the resulting section with the other sections that have already been generated for f. A union of the tags {read} and {write} is {read, write}. A union of the tags {read, write-first} and {write} is {read, write-first}.
Fig. 1. Access Pattern Determination

Details of the interface are provided in Figure 2. An access descriptor consists of the section, access type, and schedule number. The section, or RSD, contains the following information about the accesses: the base address, the dimension or number of indices, followed by the type of access (DIRECT or INDIRECT), and either the DIRECT information (lower bound, upper bound, and stride) or the RSD for the indirection array along each dimension. This basic structure allows us to handle any recursive indirections that might be used in a program. The access type is one of READ, WRITE, or READ&WRITE. Shared arrays accessed directly along every dimension have two additional access types, WRITE_ALL and READ&WRITE_ALL, which are used when the compiler analysis can determine that every element in the section will be written. WRITE_ALL indicates that all data in the section will be written but not read. READ&WRITE_ALL indicates that all data will be both read and written. The run-time system uses this information to reduce consistency maintenance overheads by eliminating the creation of twins for such pages. In addition, since accesses marked WRITE_ALL are not read, the run-time system can also avoid the communication that would make such data consistent before the write.
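The effect of the access type on consistency work can be summarized in a small decision function. This is a sketch of the policy described above, not the run-time system's code; the enum, the function name, and the assumption that an invalid page is otherwise fetched before any access are introduced here for illustration.

```python
from enum import Enum

class Access(Enum):
    READ = "READ"
    WRITE = "WRITE"
    READ_WRITE = "READ&WRITE"
    WRITE_ALL = "WRITE_ALL"
    READ_WRITE_ALL = "READ&WRITE_ALL"

def page_actions(access, page_valid):
    """Return (fetch_data, make_twin) for one page of a validated section."""
    writes = access is not Access.READ
    all_written = access in (Access.WRITE_ALL, Access.READ_WRITE_ALL)
    # WRITE_ALL data is overwritten without being read, so stale contents are harmless.
    fetch_data = (not page_valid) and access is not Access.WRITE_ALL
    # If every element is written, the whole page is the modification: no twin is needed.
    make_twin = writes and not all_written
    return fetch_data, make_twin
```

For example, page_actions(Access.WRITE_ALL, False) requests neither a fetch nor a twin, while page_actions(Access.WRITE, False) requests both.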
The schedule number is an identifier for the schedule, or the set of shared pages accessed in the section. For INDIRECT accesses, this set is recomputed by retraversing the indirection array only if it has changed since the last time it was examined. The run-time system uses the virtual memory protection mechanism to detect any modifications to the indirection array. This eliminates the need for compile-time knowledge of when the indirection array will be modified.
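The caching of schedules can be sketched as below. The class is hypothetical; in particular, note_indirection_write stands in for the write fault that the real system would receive from the write-protected indirection array, and page_of is an assumed mapping from a data-array index to its page.

```python
class ScheduleCache:
    def __init__(self, page_of):
        self.page_of = page_of   # maps a data-array index to its page
        self.schedules = {}      # schedule number -> set of pages
        self.dirty = set()       # schedules whose indirection array changed
        self.traversals = 0      # how often an indirection array was retraversed

    def note_indirection_write(self, sched_num):
        """Stand-in for the protection fault on the indirection array."""
        self.dirty.add(sched_num)

    def schedule(self, sched_num, indirection):
        """Return the cached page set, retraversing only when necessary."""
        if sched_num not in self.schedules or sched_num in self.dirty:
            self.traversals += 1
            self.schedules[sched_num] = {self.page_of(i) for i in indirection}
            self.dirty.discard(sched_num)
        return self.schedules[sched_num]

cache = ScheduleCache(page_of=lambda i: i // 512)
indirection = [0, 5, 600, 1030]
first = set(cache.schedule(1, indirection))
second = set(cache.schedule(1, indirection))   # reused, no traversal
indirection[0] = 2000
cache.note_indirection_write(1)
third = set(cache.schedule(1, indirection))    # recomputed after the change
```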
Push is used to replace a barrier synchronization and to send data to a processor in advance of when it is needed. The arguments to Push are the sections of data that are written by individual processors before the barrier and read after the barrier. Details of the Push interface are also provided in Figure 2. A Push on processor P computes the intersection of the sections written by P with those that will be read by another processor, and sends the data in the intersection to the corresponding processor. P then computes the intersection of the sections written by other processors with the sections that will be read by P, and posts a receive for that data.
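The intersection step lends itself to a short sketch. Sections are modeled here as plain sets of column indices, one per processor, which is a simplification made for illustration; push_plan is a hypothetical name.

```python
def push_plan(me, rsection, wsection):
    """Return (sends, receives) for processor `me`.

    rsection[q] and wsection[q] are the sets of indices read after and
    written before the replaced barrier by processor q."""
    sends = {q: wsection[me] & rsection[q]
             for q in range(len(rsection))
             if q != me and wsection[me] & rsection[q]}
    receives = {q: wsection[q] & rsection[me]
                for q in range(len(wsection))
                if q != me and wsection[q] & rsection[me]}
    return sends, receives

# Three processors, four columns each; each reads one extra column on either side.
N, COLS = 3, 4
wsection = [set(range(p * COLS, (p + 1) * COLS)) for p in range(N)]
rsection = [set(range(max(0, p * COLS - 1), min(N * COLS, (p + 1) * COLS + 1)))
            for p in range(N)]
sends, receives = push_plan(1, rsection, wsection)
```

For the middle processor, only its two boundary columns are sent and only its neighbors' adjacent boundary columns are received, which corresponds to a point-to-point exchange rather than a barrier followed by page faults.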
Unlike Validate which does not change the un
derlying consistency guarantees unless a WRITE ALL or
READWRITE ALL access is specied	 Push guarantees con
sistency only for the sections of data received through the
Push The rest of the shared address space may be incon
sistent until the next barrier Hence Push can be used
only if the compiler has determined with certainty that
the processors do not read the regions of shared data left
inconsistent Given the large consistency unit the Push
directive can be useful in eliminating data communication
due to false sharing Push provides the capabilities of a
message passing interface within a shared memory envi
ronment However unlike pure compiletime approaches
Push can be used selectively by restricting its use to a pro
gram phase where complete analysis is possible The run
time system ensures that the entire address space is made
consistent at the barrier that must terminate such a phase
Validateint numdescs  number of descriptors 
RSD section  section of shared data
through indirection array
if necessary 
int accesstype  READ WRITE READ	WRITE
WRITEALL or READ	WRITEALL 
int schednum  schedule number 



 
 Similar to Validate except that the request for data is
piggybacked on a synchronization 
Validatewsync 


 
 does not preserve consistency
 N is the number of processors 
Pushrsection

N  Sections of data read 
wsection

N  Sections of data written 
Fig  Augmented RunTime Interface
D Compiler Transformations
Following the analysis described in Section IIB the com
piler transforms the program using the augmented runtime
interface discussed in Section IIC The compiler rst at
tempts to nd opportunities for using the Push interface
because this interface results in the largest performance
gains Subsequently it tries to nd opportunities to use
Validate Figure  describes the decision process used to
determine whether Push or Validate can be applied

   For each statement f in F:
   1. If f is a barrier, create the set F_prec(f) of elements of F that immediately precede f (by traversing the AST as before), and the set F_succ(f) of elements of F that immediately succeed f.
   2. If /* can a Push be applied? */
        - F_prec(f) contains one and only one barrier,
        - F_succ(f) is nonempty and contains only barriers,
        - the sections associated with F_prec(f) and f are all precise (the compiler is able to analyze all data accesses made between the two consecutive synchronization points), and
        - the sections associated with F_prec(f) contain write accesses,
      then /* apply the Push transformation */
        - replace f with a Push, passing as arguments the read sections of f and the write sections of F_prec(f) in terms of processor identifiers (in practice, this transformation will involve the creation of functions that take the processor number as a parameter and return the section of data accessed by that processor).
   3. else if /* can a Validate be applied? */
        - there are precise sections associated with f
      then
        - if f is a synchronization statement, then insert a Validate_w_sync, else insert a Validate
        - for each precise section associated with f:
            if the analysis for this variable is precise (no unanalyzable accesses), the section is tagged as {read, write} but not {read, write, write-first}, and it refers to a contiguous range of addresses,
            then supply the section with access type READ&WRITE_ALL
            else if the analysis for this variable is precise, the tag contains the attribute write-first, and the section refers to a contiguous range of addresses,
            then supply the section with access type WRITE_ALL
            else supply the section with access type READ, WRITE, or READ&WRITE, depending on the tag
Fig. 3. Program Transformation

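The Validate branch of this decision process can be sketched as two small functions. The representation of tags as sets of strings and the function names are assumptions made for this example; the Push applicability test is omitted because it depends on the structure of F_prec(f) and F_succ(f), which is not modeled here.

```python
def choose_access_type(tag, precise, contiguous):
    """Pick the access type supplied with a section, following the rules above."""
    tag = frozenset(tag)
    if not tag:
        raise ValueError("a section needs at least one access tag")
    if precise and contiguous and tag == {"read", "write"}:
        return "READ&WRITE_ALL"
    if precise and contiguous and "write-first" in tag:
        return "WRITE_ALL"
    # Otherwise fall back to the plain access types, depending on the tag.
    if tag == {"read"}:
        return "READ"
    if "read" not in tag:
        return "WRITE"
    return "READ&WRITE"

def choose_call(is_sync_statement):
    """Piggyback the request on the synchronization when f is one."""
    return "Validate_w_sync" if is_sync_statement else "Validate"
```

For instance, a precisely analyzed, contiguous section whose tag contains write-first maps to WRITE_ALL, whereas the same section with a non-contiguous address range falls back to a plain access type.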
E Examples
We illustrate our analysis and transformation with two
examples one with regular accesses and one with irregular
accesses through an indirection array
E Jacobi
Jacobi is an iterative method for solving partial differential equations, with nearest-neighbor averaging as the main computation (see Figure 4). The array b is shared, while a is a local scratch array. To simplify the discussion, we assume that there is no false sharing, i.e., boundary columns start on page boundaries and their length is a multiple of the page size. (Our methods work in the presence of false sharing. This simplification is for explanatory purposes only.) Processes arrive at the second Barrier at the end of each iteration, resulting in a number of messages proportional to n - 1 with n processors. At the departure from the barrier (an acquire), pages containing elements of the boundary columns are invalidated, since they have been modified on the neighboring processors. When a processor accesses a page in one of its neighbors' boundary columns in the first half of the next iteration, it takes a page fault, which causes TreadMarks to fetch a diff from its neighbor. With m pages in a boundary column, the result is a number of messages proportional to m(n - 1). In addition, there are further barrier messages, again proportional to n - 1, at the first Barrier, which ends the first half of the iteration. Finally, there is consistency overhead for write detection during the second half of the iteration, including page faults, memory protection operations, and creating twins and diffs.

   do k = ...
     do j = begin, end
       do i = 1, M
         a(i,j) = 0.25 * (b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1))
       enddo
     enddo
     call Barrier
     do j = begin, end
       do i = 1, M
         b(i,j) = a(i,j)
       enddo
     enddo
     call Barrier
   enddo
Fig. 4. Pseudocode for the TreadMarks Jacobi program. The variables begin and end are used to partition the work among the processors, with each processor working on a different partition of the shared array b.

   do k = ...
     do j = begin, end
       do i = 1, M
         a(i,j) = 0.25 * (b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1))
       enddo
     enddo
     call Barrier
     call Validate(b, DIRECT, b[1:M, begin:end], WRITE_ALL)
     do j = begin, end
       do i = 1, M
         b(i,j) = a(i,j)
       enddo
     enddo
     call Push(b[1:M, begin(p)-1:end(p)+1], b[1:M, begin(p):end(p)])
   enddo
Fig. 5. Pseudocode for the transformed Jacobi program. A Validate has been inserted, and the second Barrier has been replaced by Push. In the arguments to Push, the dependence of begin and end on the processor number p has been made explicit.

In a message passing version of Jacobi, whether hand-coded or compiler-generated, at the end of an iteration each processor sends two messages, one to each of its neighbors, containing the boundary column to be used by that neighbor in the next iteration. It waits to receive the boundary columns from its neighbors and proceeds with the next iteration. The result is a number of messages per iteration that is proportional only to n - 1, independent of m, for the message passing program.
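The difference in message counts can be made concrete with a small model. Everything quantitative in it is an assumption introduced for illustration rather than a figure from the paper: processors form a linear chain without wraparound, a page fault costs two messages (request and reply), and each barrier costs two messages per non-manager processor (arrival and departure).

```python
def neighbor_pairs(n):
    """Directed (sender, receiver) pairs on a linear chain of n processors."""
    return [(p, q) for p in range(n) for q in (p - 1, p + 1) if q in range(n)]

def message_passing_msgs(n):
    """One message carrying a whole boundary column per neighbor pair."""
    return len(neighbor_pairs(n))

def sdsm_msgs(n, m, msgs_per_fault=2, msgs_per_barrier_client=2, barriers=2):
    """Base SDSM: one fault per page of each neighboring boundary column,
    plus the cost of the barriers in each iteration."""
    faults = len(neighbor_pairs(n)) * m
    return faults * msgs_per_fault + barriers * msgs_per_barrier_client * (n - 1)

mp = message_passing_msgs(8)    # 8 processors
base = sdsm_msgs(8, m=2)        # 2 pages per boundary column
```

Under these assumptions the message passing count depends only on n, while the base SDSM count also grows with the number of pages m in a boundary column.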
Compiler analysis and transformation can virtually eliminate the extra overhead in the SDSM version of the program. Figure 5 shows the transformed program.
First, by examining the sections of data written by individual processors before the second Barrier and read afterwards, the compiler recognizes that this Barrier can be replaced by a Push. The sections of data accessed are supplied as arguments to the Push run-time call (in reality, functions that will compute these per-processor sections are passed). In this case, the run-time will perform a point-to-point message exchange among neighboring processors after intersecting the sections of data read and written by the individual processors. The Push eliminates barrier overhead and pushes the data rather than requesting, or pulling, it.
Second, by examining the accesses during the second half of each iteration, the compiler can determine that between the first and the second Barrier a processor writes all elements of the pages in its assigned section of the array without reading the data. Hence, it inserts a Validate for that section with a WRITE_ALL argument, which causes the run-time not to make twins and diffs for these pages, eliminating consistency overhead.
The only extra overhead that now exists is the first Barrier. This barrier cannot be eliminated due to the anti-dependence across it, and remains because shared memory semantics are assumed.
In this particular example, analysis is precise: the compiler can determine exactly what data is read or written as a function of the processor identifier. In such a case, it is also possible for the compiler to directly generate a message passing program. As will be seen in Section III, the performance of this strategy and ours are very similar. However, our methods can also be applied to applications for which the analysis cannot be made precise, or for which only some phases can be analyzed.
E Moldyn
Moldyn is a molecular dynamics simulation Its com
putational structure resembles the nonbonded force cal
culation in CHARMM 
 which is a wellknown molec
ular dynamics code used at NIH to model macromolecu
lar systems Nonbonded forces are longrange interactions
existing between each pair of molecules CHARMM ap
proximates the nonbonded calculation by ignoring all pairs
which are beyond a certain cuto radius The cuto ap
proximation is achieved by maintaining an interaction list
of all the pairs within the cuto distance and iterating over
this list at each timestep The interaction list is used as
an indirection array to identify interacting partners Since
molecules change their spatial location every iteration the
interaction list must be periodically updated Figure  il
lustrates the program structure of Moldyn and the force
computation subroutine
Due to implementation limitations (no interprocedural analysis), the compiler inserts a Validate call at the beginning of ComputeForces. The compiler analyzes the access patterns for each statement in the subroutine. In this case, the access pattern consists of reads to x, the only shared array, through the interaction_list indirection array. The accesses to the indirection array are themselves regular and determinable at compile-time. Hence, the compiler can determine the section of the indirection array through which the shared array x is accessed. This information is conveyed through the Validate call.
The runtime system traverses the section of the indi
rection array supplied through the Validate call to deter
mine the pages in x that will be accessed or the schedule
This traversal is performed only if the indirection array
has changed since the last time the schedule has been up
dated Requests for the invalid pages in the schedule are
then sent out and the data is aggregated before being sent
back to the requesting processor This results in a reduced
number of messages compared to the base system

   program moldyn
     do step = 1, nsteps
       if (mod(step, UPDATE_INTERVAL) .eq. 0) then
         call build_interaction_list()
       endif
       ...
       call ComputeForces()
       ...
     enddo

   subroutine ComputeForces
     Validate(x, INDIRECT, interaction_list[1:2, 1:num_inter], READ)
     do i = 1, num_inter
       n1 = interaction_list(1, i)
       n2 = interaction_list(2, i)
       force = x(n1) - x(n2)
       local_forces(n1) = local_forces(n1) + force
       local_forces(n2) = local_forces(n2) - force
     enddo
Fig. 6. Transformed Moldyn program

III Results
Our experimental environment is an processor IBM
SP running AIX version  Each processor is a 
MHz RS thin node with  KBytes of data cache and
 Mbytes of main memory Interprocessor communica
tion is accomplished over the IBM SP highperformance
twolevel crossbar switch using IBMs MPL message pass
ing layer Unless indicated otherwise all results are for
processor runs
We characterize the platform by the minimum round-trip time using send and receive for the smallest possible message (including an interrupt), the time for a remote page fetch, the minimum time to acquire a free lock in TreadMarks, and the minimum time to perform a barrier. Under AIX, the time for both page faults and memory protection operations is a linear function of the page number and the number of pages in use, so the memory protection operation time varies with the number of pages in use.
Although substantially faster round-trip times are possible if interrupts are disabled, interrupts are required to implement lock and page requests in TreadMarks. For XHPF and CHAOS, interrupts were disabled.
TABLE I. Applications, data set sizes, and uniprocessor execution times. Columns: Application, Data set size, Time (secs). The table lists two data set sizes for each of Jacobi, 3D-FFT, Shallow, IS, Gauss, MGS, Tomcatv, Grid, Moldyn, and NBF.
We separate our results in terms of regular and irregular applications. Our aim is to compare performance against state-of-the-art compiler techniques currently available to optimize performance for these types of applications.
A Overall Results for Regular Applications
We used eight Fortran programs: IS and 3D-FFT from
the NAS benchmark suite, the Shallow benchmark from
the National Center for Atmospheric Research, Tomcatv
from the SPEC benchmark suite, Grid from Applied
Parallel Research, Inc., and Jacobi, Gauss, and Modified
Gram-Schmidt (MGS), three locally developed benchmarks.
For each application, we use two data set sizes
to illustrate any effects from changing the computation
to communication ratio, as well as from false sharing.
Table I describes the data set sizes and the corresponding
uniprocessor execution times. Uniprocessor execution
times were obtained by removing all synchronization from
the TreadMarks programs; these times were used as the
basis for the speedup figures.
All measurements for Tomcatv and Grid were made on  MHz
thin nodes.
We present the performance of these applications in three
different versions:
1. The base TreadMarks program executing with the base
TreadMarks run-time system (Tmk).
2. The compiler-optimized TreadMarks program executing
with the augmented TreadMarks run-time system
(Opt-Tmk).
3. A message passing version automatically generated by
the Forge XHPF compiler from Applied Parallel Research,
Inc. (APR) (XHPF).
The results for the XHPF compiler are provided in order
to compare performance against a commercial parallelizing
compiler for data-parallel programs.
Figure  shows the speedups achieved for all applications
using the three different environments. The numbers
for the compiler-optimized TreadMarks version reflect the
gains achieved by the most sophisticated level of analysis
possible for each application. There are no entries for IS
using XHPF in the figure: XHPF cannot parallelize IS
because of an indirect access to the main array in the
computation.
Compiler optimization achieves substantial execution
time improvements in comparison to the base TreadMarks,
ranging from  to . For programs for which base
TreadMarks achieves relatively good speedups (Jacobi,
Shallow, Gauss, Tomcatv, Grid, and MGS), the execution
time improvements are moderate ( to ). For the two
programs (IS and 3D-FFT) for which base TreadMarks
performs poorly compared to XHPF, execution time
improvements are quite large, ranging from  to . These
gains are mainly due to communication aggregation and
elimination of consistency overhead. The execution times
achieved by the compiler-optimized shared memory programs
are within  of XHPF (except for Tomcatv with
the  KxK dataset, where cache effects result in the XHPF
version showing significant performance degradation).
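The two metrics used throughout this section can be written down as a minimal sketch. Speedup is taken relative to the uniprocessor execution time, and the percentage improvement follows the (base - opt) / base formula given in the footnote. The function names and the timings below are hypothetical and are not measurements from this study.

```python
def speedup(uniprocessor_secs: float, parallel_secs: float) -> float:
    """Speedup relative to the uniprocessor execution time."""
    return uniprocessor_secs / parallel_secs


def percent_improvement(base_secs: float, opt_secs: float) -> float:
    """Execution time improvement of the optimized run: (base - opt) / base."""
    return 100.0 * (base_secs - opt_secs) / base_secs


# Hypothetical timings, chosen only to make the arithmetic easy to follow.
uniprocessor = 240.0  # seconds on one processor
base = 40.0           # seconds for the base system (Tmk)
opt = 30.0            # seconds for the optimized system (Opt-Tmk)

print(speedup(uniprocessor, base))      # 6.0
print(speedup(uniprocessor, opt))       # 8.0
print(percent_improvement(base, opt))   # 25.0
```

Note that a 25 percent reduction in execution time corresponds to a one-third increase in speedup, so the two figures should not be compared directly.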
The compiler-optimized version of Jacobi (from our
example in Figure ) shows a  improvement in execution
time over the base TreadMarks and is within  of the
execution times of the XHPF version. For the  x
data set, Jacobi derives most of its improvement from
communication aggregation, because of a significant reduction
in the number of messages ( fold). For the  x  data
set, communication aggregation does not improve execution
time, because the boundary rows are exactly one page.
Eliminating Barrier through the use of a Push provides
most of the benefit. With a smaller data set, the cost of
the barrier becomes proportionally higher, and hence its
elimination results in some improvement in running time
( ). Correspondingly, in comparison to XHPF, while
the performance of the  x  data set is similar, there
is a slight drop in performance for the  x  data set.
This is because of the extra Barrier, which was not
eliminated.
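The dependence of the aggregation benefit on the row size can be illustrated with a small sketch. It assumes a page-aligned, contiguous boundary row, one request and one reply per transfer, 4-byte elements, and 4096-byte pages. These parameters and function names are illustrative assumptions, not values taken from the measurements above.

```python
def pages_in_row(row_elems: int, elem_bytes: int, page_bytes: int) -> int:
    """Pages spanned by one contiguous, page-aligned boundary row."""
    row_bytes = row_elems * elem_bytes
    return (row_bytes + page_bytes - 1) // page_bytes  # integer ceiling


def messages_per_row(row_elems: int, elem_bytes: int, page_bytes: int,
                     aggregated: bool) -> int:
    """Messages needed to obtain one boundary row from a neighbor."""
    if aggregated:
        transfers = 1  # one bulk transfer covers the whole row
    else:
        transfers = pages_in_row(row_elems, elem_bytes, page_bytes)  # one per page fault
    return 2 * transfers  # a request plus a reply for each transfer


# Hypothetical parameters: 4-byte elements, 4096-byte pages.
for row_elems in (4096, 1024):
    faulting = messages_per_row(row_elems, 4, 4096, aggregated=False)
    bulk = messages_per_row(row_elems, 4, 4096, aggregated=True)
    print(row_elems, faulting, bulk)
# prints "4096 8 2" then "1024 2 2"
```

Under these assumptions, a row that spans several pages needs several times fewer messages once its transfer is aggregated, while a row that fits in exactly one page needs the same number of messages either way.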
Performance gains for 3D-FFT for the larger problem
size come mainly from communication aggregation and
twin/diff creation elimination. The gains from the smaller
problem size, however, also come from the elimination of
data communication due to false sharing by the use of the
Push directive (an additional ). The Push directive
only updates those sections of data specified as being read
by the processor, thereby resulting in reduced data
communication in the presence of false sharing.
Percentage improvements are calculated by the formula
(base - opt) / base.
IS has a migratory access pattern. The use of diffs in
TreadMarks results in extra data communicated due to the
diff accumulation problem, that of multiple overlapping
diffs being communicated due to multiple processors
successively modifying the same data. With the
compiler-based directives, this overhead can be eliminated. The
performance gains of  in comparison to Tmk for Opt-Tmk
come from the above optimization (reduced data
communication) in addition to communication aggregation.
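Diff accumulation can be made concrete with a simplified model. The sketch assumes that processors take turns overwriting the same region in full, that every modification produces one diff as large as the region, and that a processor must receive every diff created since its own last access. This is a deliberately coarse, hypothetical model and not a description of the TreadMarks implementation.

```python
def bytes_with_diff_accumulation(num_procs: int, region_bytes: int,
                                 rounds: int) -> int:
    """Data received when each access fetches all diffs not yet seen."""
    diffs_created = 0
    seen = [0] * num_procs  # diffs each processor has already applied
    total = 0
    for _ in range(rounds):
        for p in range(num_procs):
            missing = diffs_created - seen[p]
            total += missing * region_bytes  # overlapping diffs are all sent
            diffs_created += 1               # p overwrites the region
            seen[p] = diffs_created
    return total


def bytes_with_single_copy(num_procs: int, region_bytes: int,
                           rounds: int) -> int:
    """Data received when each access fetches only the current contents."""
    accesses = num_procs * rounds
    return (accesses - 1) * region_bytes  # the first access needs no data


# Hypothetical configuration: 4 processors, 1000-byte region, 2 rounds.
print(bytes_with_diff_accumulation(4, 1000, 2))  # 18000
print(bytes_with_single_copy(4, 1000, 2))        # 7000
```

In this model the accumulated diffs all describe the same bytes, so the extra traffic carries no new information, which is the overhead the compiler-supplied access information is meant to remove.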
Shallow Gauss Tomcatv and MGS benet mainly from
communication aggregation There are also some ad
ditional gains from combining synchronization and data
transfer when the amount of data transferred is small The
performance of all three versions of Grid is similar due to
the high computation to communication ratio resulting in
near perfect speedups in all cases
B Overall Results for Irregular Applications
In the case of the irregular applications, we compare the
compiler-optimized TreadMarks programs (Opt-Tmk) with
the hand-coded CHAOS inspector-executor based
programs (CHAOS), as well as the base TreadMarks
programs (Tmk). Our intent in presenting the CHAOS
performance numbers is to compare performance with
state-of-the-art compiler technology for irregular applications.
The compiler-optimized TreadMarks programs include
optimizations for both regular and irregular access patterns.
Figure  presents the speedups at  processors for two
programs, Moldyn from CHARMM and NBF from the
GROMOS benchmark, both molecular dynamics
simulation kernels. Table I presents the sequential execution
time and data set sizes used. In the case of Moldyn, we
vary the frequency with which the indirection array is
recomputed. In the case of NBF, we vary the data set size
to introduce false sharing.
For Moldyn (from which our example in Figure  is
taken), our optimized system is  faster than base
TreadMarks, a result of an almost  fold reduction in
the number of messages due to communication aggregation.
Our optimized system is also up to  faster than
CHAOS, depending on the frequency with which the
indirection array is updated. The cost of access pattern
computation (the inspector), which in our case consists of
traversing the indirection array, is lower than in the
inspector-executor approach. In the inspector-executor approach,
global communication of data schedules is required, since
the communication is not request-response in nature.
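The kind of access pattern computation described above, a single traversal of the indirection array, can be sketched as follows. The sketch maps each indirectly referenced element to its page and groups the remote pages by the processor assumed to hold them, so that each group can be requested in one message. The block assignment of pages to processors, the function names, and all parameters are hypothetical simplifications.

```python
from collections import defaultdict


def pages_needed(indirection, elem_bytes, page_bytes):
    """Pages touched by the elements named in the indirection array."""
    return sorted({(index * elem_bytes) // page_bytes for index in indirection})


def requests_by_holder(pages, pages_per_proc, me):
    """Group remote pages by holder, assuming a block assignment of pages."""
    requests = defaultdict(list)
    for page in pages:
        holder = page // pages_per_proc
        if holder != me:  # local pages need no communication
            requests[holder].append(page)
    return dict(requests)


# Hypothetical input: 8-byte elements, 4096-byte pages, 2 pages per processor.
indirection = [0, 3, 600, 601, 1500, 2047]
pages = pages_needed(indirection, elem_bytes=8, page_bytes=4096)
print(pages)                                              # [0, 1, 2, 3]
print(requests_by_holder(pages, pages_per_proc=2, me=0))  # {1: [2, 3]}
```

Because each processor derives its own request list from its own traversal, the exchange stays request-response in nature and no global schedule has to be communicated.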
To separate the effects of inspector computation for
NBF, we do not include the time to execute the inspector
in the measured computation. In this case, our optimized
system is no worse than  slower than CHAOS, and is
up to  faster than the base TreadMarks system. If we
include the execution time of the inspector, our approach
is faster than CHAOS by up to  for  iterations of
the program loop. Changing the data set from  x
to  x  introduces false sharing, resulting in the two
TreadMarks versions sending more data than CHAOS.
Fig.  Speedup at  processors for TreadMarks, Compiler-Optimized
Version of TreadMarks, and XHPF. The IS bar is missing for XHPF
because it cannot parallelize IS.
[Bar chart. Vertical axis: Speedup, 0 to 8. Bars: Tmk, Opt-Tmk, XHPF.
Horizontal axis: JACOBI-4Kx4K, JACOBI-1Kx1K, 3D-FFT-6x6x6,
3D-FFT-5x6x5, SHALLOW-1Kx1K, SHALLOW-1Kx.5K, IS-23-19, IS-20-15,
GAUSS-2Kx2K, GAUSS-1Kx1K, MGS-2Kx2K, MGS-1Kx1K,
TOMCATV-1.4Kx1.4K, TOMCATV-1Kx1K, GRID-2Kx2K, GRID-1.5Kx1.5K.]
Fig.  Speedup at  processors for TreadMarks, Compiler-Optimized
Version of TreadMarks, and CHAOS.
[Bar chart. Vertical axis: Speedup, 0 to 8. Bars: Tmk, Opt-Tmk, CHAOS.
Horizontal axis: Moldyn-20 iter, Moldyn-11 iter, NBF-64x1024,
NBF-64x1000.]
Our compile-time optimizations successfully reduce the
number of messages used during program execution, making
performance comparable to a system such as CHAOS.
The advantage of our approach increases as the frequency
of changes to the indirection array increases. Its
disadvantage is the potential for false sharing overhead when the
data set is small or has poor spatial locality.
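Why a small change in data set size can introduce false sharing is easy to see with a sketch that counts the pages straddling a partition boundary. It assumes a one-dimensional block partition of a page-aligned array, 8 processors, 8-byte elements, and 4096-byte pages. All of these are illustrative assumptions, and only the two element counts are taken from the NBF labels in the figure.

```python
def falsely_shared_pages(total_elems: int, num_procs: int,
                         elem_bytes: int, page_bytes: int) -> int:
    """Pages that hold data belonging to two neighboring partitions."""
    per_proc = total_elems // num_procs  # assumes an even block partition
    shared = set()
    for p in range(1, num_procs):
        boundary_byte = p * per_proc * elem_bytes
        if boundary_byte % page_bytes != 0:  # boundary falls inside a page
            shared.add(boundary_byte // page_bytes)
    return len(shared)


# Hypothetical configuration: 8 processors, 8-byte elements, 4096-byte pages.
print(falsely_shared_pages(64 * 1024, 8, 8, 4096))  # 0
print(falsely_shared_pages(64 * 1000, 8, 8, 4096))  # 7
```

With the power-of-two size every partition boundary coincides with a page boundary, whereas with the other size every internal boundary falls inside a page, so a page-based system transfers data that a neighbor did not actually request.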
IV Applicability to Other Platforms
While the experimental results presented here are
specific to the TreadMarks SDSM system, the techniques
described generalize to other SDSM systems such
as Cashmere, home-based lazy release consistency
(HLRC), or Shasta. In these systems, each coherence
unit has a home where modifications are collected
or where directory information is maintained. While careful
placement of the home can result in a prefetching effect,
such placement using purely run-time information
does not capture either phase changes or complex access
patterns and can result in additional overhead. The
compiler-provided access information can be used to
optimize the migration/placement of the home. Write-first
accesses, something the run-time has no knowledge of, can
avoid data communication merely by changing the current
home. The benefits of communication aggregation
and consistency overhead elimination continue to apply
in such systems, although the run-time mechanisms will
differ. For virtual memory-based systems such as Cashmere
and HLRC, memory protection operations are eliminated.
Also, the Push interface can avoid extra data
communication as a result of false sharing. For variable-grain,
instrumentation-based systems such as Shasta, the
instrumentation overhead can be further reduced.
Our experimental results have also been presented in the
context of a fairly high-latency communication subsystem.
If a low-latency network were to be used, the benefits of
aggregation would shift from being purely due to a reduction
in the number of messages to being able to overlap
communication with computation.
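A simple linear cost model shows why the message-count benefit of aggregation depends on latency. The model charges a fixed latency per message plus a bandwidth term for the bytes moved. The model itself, the function names, and every parameter value below are assumptions chosen for illustration, not measurements of any particular network.

```python
import math


def transfer_time(num_messages: int, total_bytes: int,
                  latency_secs: float, bytes_per_sec: float) -> float:
    """Linear cost model: fixed per-message latency plus a bandwidth term."""
    return num_messages * latency_secs + total_bytes / bytes_per_sec


def aggregation_saving(num_pages: int, page_bytes: int,
                       latency_secs: float, bytes_per_sec: float) -> float:
    """Time saved by sending num_pages in one message instead of one each."""
    total_bytes = num_pages * page_bytes
    separate = transfer_time(num_pages, total_bytes, latency_secs, bytes_per_sec)
    combined = transfer_time(1, total_bytes, latency_secs, bytes_per_sec)
    return separate - combined


# Hypothetical networks moving 16 pages of 4096 bytes at 30e6 bytes/sec.
high_latency = aggregation_saving(16, 4096, 500e-6, 30e6)
low_latency = aggregation_saving(16, 4096, 5e-6, 30e6)
print(math.isclose(high_latency, 15 * 500e-6))  # True
print(math.isclose(low_latency, 15 * 5e-6))     # True
```

In this model the saving is exactly the latency of the messages that aggregation removes, so it shrinks in proportion as latency falls, which leaves overlap of communication with computation as the main remaining benefit.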
Our compiler framework was implemented for explicitly
parallel programs. However, the general principle is also
applicable to automatic parallelization with the SDSM
system as the target. The access pattern information can be
folded into the shared memory parallelization directives.
These directives identify all data races, and hence perform
a similar function to the synchronization in the explicitly
parallel programs in terms of identifying the appropriate
points at which to supply the access pattern information.
This information can be utilized by the run-time not only
to optimize communication, but also to balance load.
Several recent proposals for hardware shared memory
machines include a message passing subsystem designed in
part to allow applications to take advantage of bulk data
transfer. Woo et al. evaluate one such design
in the context of the Flash system. While Woo et
al. focus on establishing the magnitude of the performance
benefits of bulk data transfer with hardware-based shared
memory, we have explored, in addition, ways for the
compiler to automate the use of the bulk data transfer facility
in a software shared memory environment. The same
access pattern information can be used in a hardware shared
memory environment to exploit the bulk transfer features.
The information can also be used for optimal page
placement and remapping in machines such as the Origin.
V Related Work
Mowry et al. examine the effect of combining
prefetching and multithreading in a software DSM system.
Their prefetching strategy involves fetching data in advance
of synchronization operations. Our strategy involves
leveraging the program synchronization in order to reduce
redundant messages, as well as eliminating consistency
overhead where possible.
Jeremiassen et al. present a static algorithm for
computing per-process memory references to shared data in
coarse-grained parallel programs. We use a similar
analysis in terms of processor identifiers in order to replace a
barrier with a Push.
Mukherjee et al. compare the CHAOS inspector-executor
system to the TSM (transparent shared memory)
and the XSM (extendible shared memory) systems,
both implemented on the Tempest interface. They
conclude that TSM is not competitive with CHAOS, while
XSM achieves performance comparable to CHAOS after
introducing several special-purpose protocols. In our work,
we use a fairly straightforward compiler to optimize the
shared memory programs, rather than relying on
hand-coded special-purpose protocols.
Keleher and Tseng describe a run-time interface and
compile-time system that couples the compiler and the
run-time in a manner similar to our system. Their interface
and implementation are, however, more run-time intensive.
Chandra and Larus also describe a combined compiler
and run-time system that is similar in spirit to our system,
but in the context of fine-grained software shared memory.
VI Conclusion
We have described an integrated compile-time/run-time
approach for executing regular and irregular computations
on distributed memory machines. This approach is based
on a modified software distributed shared memory layer
and fairly simple compile-time support. Our compiler
computes data access summaries using regular section
analysis and feeds that information to the TreadMarks run-time
SDSM system. Improvements in execution time range from
 to  on an  processor IBM SP in comparison to the
base run-time system for the applications analyzed. The
combination of static prediction of shared memory accesses
by the compiler with dynamic detection of accesses by the
run-time allows the combined system to approach the
performance of compiler-generated message passing (within
 of XHPF for regular programs, and up to  better
than CHAOS for irregular programs). It does so
without incurring the programming difficulties of message
passing or the limitations on automatic parallelization of
data-parallel programs for message passing targets. A combined
compile-time/run-time system of this nature retains the
ease of programming of shared memory while exploiting
the message passing capabilities of the underlying
hardware.
References
 G Agarwal and J Saltz Interprocedural compilation of irregular
applications for distributed memory machines In Proceedings of
Supercomputing 	 December 

 C Amza	 AL Cox	 S Dwarkadas	 P Keleher	 H Lu	 R Raja
mony	 and W Zwaenepoel TreadMarks Shared memory com
puting on networks of workstations IEEE Computer	 
	 February 
 Applied Parallel Research FORGE High Performance Fortran
Users Guide	 version  edition
 D Bailey	 J Barton	 T Lasinski	 and H Simon The NAS par
allel benchmarks Technical Report 	 NASA	 July 

 BR Brooks	 RE Bruccoleri	 BD Olafson	 DJ States	
S Swaminathan	 and M Karplus Charmm A program for
macromolecular energy	 minimization	 and dynamics calcula
tions Journal of Computational Chemistry	 	 
 JB Carter	 JK Bennett	 and W Zwaenepoel Techniques for
reducing consistencyrelated information in distributed shared
memory systems ACM Transactions on Computer Systems	

	 August 

 S Chandra and J R Larus Optimizing communication in hpf
programs for negrain distributed shared memory In Proceed
ings of the th Symposium on the Principles and Practice of
Parallel Programming	 June 
 R Das	 P Havlak	 J Saltz	 and K Kennedy Index array at
tening through program transformation In Proceedings of Su
percomputing 	 December 

 K M Dixit The spec benchmarks Parallel Computing	 pages

	 
 S Dwarkadas	 AL Cox	 and W Zwaenepoel An integrated
compiletimeruntime software distributed shared memory sys
tem In Proceedings of the th Symposium on Architectural Sup
port for Programming Languages and Operating Systems	 Octo
ber 
 K Gharachorloo	 D Lenoski	 J Laudon	 P Gibbons	 A Gupta	
and J Hennessy Memory consistency and event ordering in
scalable sharedmemory multiprocessors In Proceedings of the
th Annual International Symposium on Computer Architec
ture	 pages 
	 May 
 WF van Gunsteren and HJC Berendsen GROMOS
GROningen MOlecular Simulation software Technical report	
Laboratory of Physical Chemistry	 University of Groningen	

 P Havlak and K Kennedy An implementation of interproce
dural bounded regular section analysis IEEE Transactions on
Parallel and Distributed Systems	 
	 July 
 S Hiranandani	 K Kennedy	 and C Tseng Compiling Fortran
D for MIMD distributedmemory machines Communications of
the ACM	 
	 August 

 S Ioannidis and S Dwarkadas Compiler and runtime support
for adaptive load balancing in software distributed shared mem
ory systems In Fourth Workshop on Languages	 Compilers	 and
Runtime Systems for Scalable Computers	 May 
 TE Jeremiassen and S Eggers Computing perprocess sum
mary sideeect information In U Banerjee	 D Gelernter	
A Nicolau	 and D Padua	 editors	 Fifth Workshop on Lan
guages and Compilers for Parallelism	 pages 
	 August

 P Keleher	 A L Cox	 and W Zwaenepoel Lazy release consis
tency for software distributed shared memory In Proceedings of
the th Annual International Symposium on Computer Archi
tecture	 pages 	 May 
 P Keleher and C Tseng Enhancing software DSM for compiler
parallelized applications In Proceedings of the th Interna
tional Parallel Processing Symposium	 April 
 K Kennedy	 K S McKinley	 and C Tseng Analysis and trans
formation in an interactive parallel programming tool Concur
rency
 Practice and Experience	 
	 October 
 D Kranz	 K Johnson	 A Agarwal	 J Kubiatowicz	 and B Lim
Integrating messagepassing and sharedmemory Early experi
ence In Proceedings of the  Conference on the Principles
and Practice of Parallel Programming	 May 
 J Kuskin and D Ofelt et al The Stanford FLASH multiproces
sor In Proceedings of the st Annual International Symposium
on Computer Architecture	 April 
 K Li and P Hudak Memory coherence in shared virtual
memory systems ACM Transactions on Computer Systems	

	 November 
 H Lu	 AL Cox	 S Dwarkadas	 R Rajamony	 and
W Zwaenepoel Compiler and software distributed shared mem
ory support for irregular applications In Proceedings of the th
Symposium on the Principles and Practice of Parallel Program
ming	 pages 
	 June 
 H Lu	 S Dwarkadas	 AL Cox	 and W Zwaenepoel Message
passing versus distributed shared memory on networks of work
stations In Proceedings SuperComputing 	 December 


 TC Mowry	 CQC Chan	 and AKW Lo Comparative eval
uation of latency tolerance techniques for software distributed
shared memory In Proceedings of the Fourth High Performance
Computer Architecture Symposium	 February 
 SS Mukherjee	 SD Sharma	 MD Hill	 JR Larus	 A Rogers	
and J Saltz Ecient support for irregular applications on dis
tributed memory machines In Proceedings of the th ACM Sym
posium on the Principles and Practice of Parallel Programming	
July 

 Steven K Reinhardt	 James R Larus	 and David A Wood Tem
pest and typhoon Userlevel shared memory In Proceedings of
the st Annual International Symposium on Computer Archi
tecture	 April 
 DJ Scales	 K Gharachorloo	 and CA Thekkath Shasta A
low overhead	 softwareonly approach for supporting negrain
shared memory In Proceedings of the th Symposium on Ar
chitectural Support for Programming Languages and Operating
Systems	 pages 
	 October 
 S D Sharma	 R Ponnusamy	 B Moon	 Y Hwang	 R Das	
and J Saltz Runtime and compiletime support for adaptive
irregular problems In SuperComputing	 
 R Stets	 S Dwarkadas	 N Hardavellas	 G Hunt	 L Kon
tothanassis	 S Parthasarathy	 and ML Scott Cashmerel
Software coherent shared memory on a clustered remotewrite
network In Proceedings of the th ACM Symposium on Oper
ating Systems Principles	 pages 	 October 
 R von Hanxleden and K Kennedy GiveNTake  a balanced
code placement framework In Proceedings of the ACM SIG
PLAN  Conference on Programming Language Design and
Implementation	 June 
 R von Hanxleden	 K Kennedy	 C Koelbel	 R Das	 and J Saltz
Compiler analysis for irregular problems in Fortran D In Pro
ceedings of the th Workshop on Languages and Compilers for
Parallel Computing	 August 
 SC Woo	 JP Singh	 and JL Hennessy The performance
advantages of integrating block data transfer in cachecoherent
multiprocessors In Proceedings of the th Symposium on Ar
chitectural Support for Programming Languages and Operating
Systems	 pages 	 October 
 Y Zhou	 L Iftode	 and JP Singh Performance evaluation of
two homebased lazy release consistency protocols for shared vir
tual memory systems In Proceedings of the Second USENIX
Symposium on Operating System Design and Implementation	
October 
