An Evaluation of Software Distributed Shared Memory for Next-Generation Processors and Networks by Cox, A.L. et al.
Evaluation of Release Consistent
Software Distributed Shared Memory on
Emerging Network Technology
Sandhya Dwarkadas, Pete Keleher, Alan L. Cox, and Willy Zwaenepoel




We evaluate the e ect of processor speed network char
acteristics and software overhead on the performance
of releaseconsistent software distributed shared mem
ory We examine ve di erent protocols for implement
ing release consistency eager update eager invalidate
lazy update lazy invalidate and a new protocol called
lazy hybrid This lazy hybrid protocol combines the
benets of both lazy update and lazy invalidate
Our simulations indicate that with the processors and
networks that are becoming available coarsegrained
applications such as Jacobi and TSP perform well more
or less independent of the protocol used Medium
grained applications such as Water can achieve good
performance but the choice of protocol is critical For
sixteen processors the best protocol lazy hybrid per
formed more than three times better than the worst
the eager update Finegrained applications such as
Cholesky achieve little speedup regardless of the pro
tocol used because of the frequency of synchronization
operations and the high latency involved
While the use of relaxed memory models lazy imple
mentations and multiplewriter protocols has reduced
the impact of false sharing synchronization latency re
mains a serious problem for software distributed shared
memory systems These results suggest that future
work on software DSMs should concentrate on reducing
the amount of synchronization or its e ect
1 Introduction
Although several models and algorithms for software distributed shared
memory (DSM) have been published, performance reports have been
relatively rare. The few performance results that have been published
consist of measurements of a particular implementation in a particular
hardware and software environment. Since the cost of communication is
very important to the performance of a DSM, these results are highly
sensitive to the implementation of the communication software.
Furthermore, the hardware environments of many of these
implementations are by now obsolete. Much faster processors are
commonplace, and much faster networks are becoming available.

This work was supported in part by NSF Grants CCR  and CCR , Texas ATP
Grant No.  , and by a NASA Graduate Fellowship.

We are focusing on DSMs that support release consistency, i.e., where
memory is guaranteed to be consistent only following certain
synchronization operations. The goals of this paper are twofold: (1)
to gain an understanding of how the performance of release consistent
software DSM depends on processor speed, network characteristics, and
software overhead, and (2) to compare the performance of several
protocols for supporting release consistency in a software DSM.

The evaluation is done by execution-driven simulation. The application
programs we use have been written for (hardware) shared memory
multiprocessors. Our results may therefore be viewed as an indication
of the possibility of porting shared memory programs to software DSMs,
but it should be recognized that better results may be obtained by
tuning the programs to a DSM environment. The application programs are
Jacobi, Traveling Salesman Problem (TSP), and Water and Cholesky from
the SPLASH benchmark suite. Jacobi and TSP exhibit coarse-grained
parallelism, with little synchronization relative to the amount of
computation, whereas Water may be characterized as medium-grained and
Cholesky as fine-grained.
We nd that with current processors the bandwidth
of the megabit Ethernet becomes a bottleneck lim
iting the speedups even for a coarsegrained application
such as Jacobi to about 
 on  processors With a 
megabit pointtopoint network representative of the
ATM LANs now appearing on the market we get good
speedups even for small sizes of coarsegrained prob
lems such as Jacobi and TSP moderate speedups for
Water and very little speedup for Cholesky Regard
less of the considerable bandwidth available on these
networks Choleskys performance is constrained by the
very high number of synchronization operations
Among the protocols for implementing software re
lease consistency we distinguish between eager and lazy
protocols Eager protocols push modications to all
cachers at synchronization variable releases 
 In con
trast lazy protocols  pull the modications at syn
chronization variable acquires and communicate only
with the acquirer Both eager and lazy release con
sistency can be implemented using either invalidate or
update protocols We present a new lazy hybrid proto
col that combines the benets of update and invalidate
few access misses low data and message counts and low
lock acquisition latency

Our simulations indicate that the lazy algorithm and the hybrid
protocol significantly improve the performance of medium-grained
programs, those on the boundary of what can be supported efficiently
by a software DSM. Communication in coarse-grained programs is
sufficiently rare that the choice of protocols becomes less important.
The eager algorithms perform slightly better for TSP because the
branch-and-bound algorithm benefits from the early updates in the
eager protocols (see Section ). For the fine-grained programs, lazy
release consistency and the hybrid protocol reduce the number of
messages and the amount of data drastically, but the communication
requirements are still beyond what can be supported efficiently on a
software DSM. For these kinds of applications, techniques such as
multithreading and code restructuring may prove useful.

The outline of the rest of this paper is as follows. Section 2 briefly
reviews release consistency and the eager and lazy implementation
algorithms. Section 3 describes the hybrid protocol. Section 4 details
the implementation of the protocols we simulated. Section 5 discusses
our simulation methodology, and Section 6 presents the simulation
results. We briefly survey related work in Section 7 and conclude in
Section 8.
2 Release Consistency
For completeness we reiterate in this section the main
concepts behind release consistency RC  eager re
lease consistency ERC 
 and lazy release consis
tency LRC 
RC  is a form of relaxed memory consistency that
allows the e ects of shared memory accesses to be
delayed until selected synchronization accesses occur
Simplifyingmatters somewhat shared memory accesses
are labeled either as ordinary or as synchronization ac
cesses with the latter category further divided into ac 
quire and release accesses Acquires and releases may
be thought of as conventional synchronization opera
tions on a lock but other synchronization mechanisms
can be mapped on to this model as well Essentially
RC requires ordinary shared memory accesses to be per
formed only when a subsequent release by the same pro
cessor is performed RC implementations can delay the
e ects of shared memory accesses as long as they meet
this constraint
For instance the DASH  implementation of RC
bu ers and pipelines writes without blocking the pro
cessor A subsequent release is not allowed to per
form ie the corresponding lock cannot be granted
to another processor until acknowledgments have been
received for all outstanding invalidations While this
strategy masks latency in a software implementation
it is also important to reduce the number of messages
sent because of the high per message cost

In an eager software implementation of RC such as Munin's
multiple-writer protocol, a processor delays propagating its
modifications of shared data until it executes a release (see Figures
 and ). Lazy implementations of RC further delay the propagation of
modifications until the acquire. At that time, the last releaser
piggybacks a set of write notices on the lock grant message sent to
the acquirer. These write notices describe the shared data
modifications that precede the acquire according to the
happened-before partial order. The happened-before partial order is
essentially the union of the total processor order of the memory
accesses on each individual processor and the partial order of
release-acquire pairs. The happened-before partial order can be
represented efficiently by tagging write notices with vector
timestamps. At acquire time, the acquiring processor determines the
pages for which the incoming write notices contain vector timestamps
larger than the timestamp of its copy of that page in memory. For
those pages, the shared data modifications described in the write
notices must be reflected in the acquirer's copy, either by
invalidating or by updating that copy. The tradeoffs between
invalidate and update, and a new hybrid protocol, are discussed in the
next section.
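
The acquire-time check described above can be illustrated with a small
sketch. It is not the simulated implementation: the names VectorTime,
covers, happened_before, and stale_pages are hypothetical, a write
notice is reduced to a (page, timestamp) pair, and a notice is treated
as "larger" than the local copy whenever the local timestamp does not
cover it in every component.

```python
from typing import Dict, List, Tuple

# Assumption: one counter per processor.
VectorTime = Tuple[int, ...]


def covers(local: VectorTime, stamp: VectorTime) -> bool:
    """True if every component of stamp is <= the matching one in local."""
    return all(s <= l for s, l in zip(stamp, local))


def happened_before(a: VectorTime, b: VectorTime) -> bool:
    """True if a precedes b: a <= b in every component and a != b."""
    return a != b and covers(b, a)


def stale_pages(write_notices: List[Tuple[int, VectorTime]],
                page_times: Dict[int, VectorTime]) -> List[int]:
    """Locally cached pages that an incoming write notice makes stale."""
    stale: List[int] = []
    for page, stamp in write_notices:
        local = page_times.get(page)
        if local is None:
            continue  # page is not cached here, nothing to fix up
        if not covers(local, stamp) and page not in stale:
            stale.append(page)
    return stale


notices = [(7, (2, 1, 0)), (9, (1, 0, 0)), (11, (0, 3, 0))]
cached = {7: (1, 1, 0), 9: (1, 2, 0)}
print(stale_pages(notices, cached))
```

Each page returned by stale_pages would then be either invalidated or
updated, depending on the protocol.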
3 A Hybrid Protocol for LRC
A lazy invalidate protocol invalidates the local copy of a page for
which a write notice with a larger timestamp is received (see Figure
). The lazy update protocol never invalidates pages to maintain
consistency. Instead, the acquirer retrieves the modifications named
by incoming write notices for any page that is cached locally (see
Figure ). As an optimization, the releaser piggybacks the
modifications it has available locally on the lock grant message.

[Figures: protocol timelines for the access sequences "acq w(x) rel",
"acq w(y) rel", and "acq r(y)", annotated with the resulting upd(),
inv(), and m() messages. Surviving caption: Figure  Lazy Hybrid]

In the lazy hybrid protocol, as in the lazy update protocol, the
releaser piggybacks on the lock grant message, in addition to write
notices, the modifications to those pages that it believes the
acquirer has a copy of in its memory. However, unlike in the lazy
update protocol, the acquirer does not make any attempt to obtain any
other modifications. Instead, it invalidates the pages for which it
received write notices but for which no modifications were included in
the lock grant message.
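
A minimal sketch of the lazy hybrid lock grant described above follows.
The names LockGrant, build_grant, and process_grant are hypothetical,
write notices are reduced to page numbers, and the releaser is assumed
to hold every modification locally, which simplifies the protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class LockGrant:
    write_notices: List[int]                               # pages named by write notices
    diffs: Dict[int, bytes] = field(default_factory=dict)  # piggybacked modifications


def build_grant(modified: Dict[int, bytes],
                copysets: Dict[int, Set[int]],
                acquirer: int) -> LockGrant:
    """Releaser side: piggyback a modification only for pages the
    releaser believes the acquirer caches (its approximate copyset)."""
    grant = LockGrant(write_notices=sorted(modified))
    for page, diff in modified.items():
        if acquirer in copysets.get(page, set()):
            grant.diffs[page] = diff
    return grant


def process_grant(grant: LockGrant,
                  cached: Set[int]) -> Tuple[List[int], List[int]]:
    """Acquirer side: update cached pages that came with a modification,
    invalidate cached pages that were named without one."""
    updated = [p for p in grant.write_notices
               if p in cached and p in grant.diffs]
    invalidated = [p for p in grant.write_notices
                   if p in cached and p not in grant.diffs]
    return updated, invalidated


modified = {1: b"a", 2: b"b", 3: b"c"}
copysets = {1: {0, 2}, 2: {0}, 3: {2}}
grant = build_grant(modified, copysets, acquirer=2)
print(process_grant(grant, cached={1, 2}))
```

No extra message is sent in either step: everything travels on the
lock grant, and the acquirer's invalidations are purely local.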

Previous simulations indicate that (1) the lazy protocols send fewer
messages and less data than the eager protocols, and (2) the lazy
update protocol sends fewer messages in most cases than the lazy
invalidate protocol, while the lazy invalidate protocol sends less
data than the lazy update protocol. The reduction in the number of
access misses outweighs the extra messages exchanged at the time of
synchronization. Also, the reduced access misses result in reduced
latency, thus favoring the update protocol.

However, the choice of a lazy or an eager algorithm, and furthermore
the choice between an update or an invalidate protocol, also affects
the lock acquisition latency. We distinguish two cases:

1. The lock request is pending at the time of the release. The lazy
   invalidate protocol has the shortest lock acquisition latency,
   since a single message from the releaser to the acquirer suffices,
   followed by the invalidations at the acquirer, a purely local
   operation. In contrast, the eager algorithms must update or
   invalidate all other cachers of pages that have been modified at
   the releaser, and the lazy update protocol must retrieve all the
   modifications that precede the acquire, again potentially a
   multi-host operation.

2. The lock request is not yet pending at the time of the release. The
   eager algorithms have the lowest lock acquisition latency, followed
   closely by the lazy invalidate protocol. All require a single
   message exchange between the releaser and the acquirer, but the
   lazy invalidate protocol also needs to invalidate any local pages
   that have been modified. The lazy update protocol potentially
   requires a multi-host operation, resulting in higher lock
   acquisition latency.

The lazy hybrid protocol combines the advantages of lazy update and
lazy invalidate protocols. First, like the invalidate protocol, the
hybrid only exchanges a single pair of messages between the acquiring
and the releasing processor. As a result, lock acquisition latency for
the lazy hybrid protocol is close to that of the lazy invalidate
protocol. The only additional overhead comes from the need to send and
process the modifications piggybacked on the lock grant message.
Second, the amount of data exchanged is smaller than for the update
protocol. Finally, the hybrid sends updates for recently modified
pages cached by the acquirer. It is likely that these pages will be
accessed by the acquirer, thus reducing the number of access misses
and, as a result, reducing the latency and the number of miss
messages.
4 Protocol Implementations
In this section we describe the details of the five protocols that we
simulated: lazy hybrid (LH), lazy invalidate (LI), lazy update (LU),
eager invalidate (EI), and eager update (EU).

All five are multiple-writer protocols. Multiple processors can
concurrently write to their own copy of a page, with their separate
modifications being merged at a subsequent release, in accordance with
the RC model. This contrasts with the exclusive-writer protocol used,
for instance, in DASH, where a processor must obtain exclusive access
to a cache line before it can be modified. Experience with Munin
indicates that multiple-writer protocols perform well in software DSMs
because they can handle false sharing without generating large amounts
of message traffic between synchronization points.

All of the protocols support the use of exclusive locks and global
barriers to synchronize access to shared memory. Processors acquire
locks by sending a request to the statically assigned owner, who
forwards the request on to the current holder of the lock. Locks and
unlocks are mapped onto acquires and releases in a straightforward
manner. Barriers are implemented using a barrier master that collects
arrival messages and distributes departure messages. In terms of
consistency information, a barrier arrival is modeled as a release,
while a departure is modeled as an acquire on each of the other
processors.

Processes exchange three types of information at locks and barriers:
synchronization information, consistency information, and data. The
consistency information is a collection of write notices, each of
which contains the processor identification and the vector timestamp
of the modification. Consistency information can be piggybacked on
synchronization messages, but often the data comprising the
modifications to shared memory cannot. Most shared data exchanged in
the protocols is in the form of diffs, which are run-length encodings
of the modified data of a single page. Sending diffs instead of entire
pages greatly reduces data traffic and allows multiple concurrent
modifications to be merged into a single version.
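
To make the notion of a diff concrete, here is a minimal sketch under
stated assumptions: a diff is a list of (offset, run) pairs, and it is
computed against an unmodified copy of the page, called twin here. The
names Diff, make_diff, and apply_diff, the byte-level encoding, and the
twin itself are illustrative assumptions, not the encoding used in the
simulations.

```python
from typing import List, Tuple

# Assumption: a diff is a list of (offset, run of modified bytes).
Diff = List[Tuple[int, bytes]]


def make_diff(twin: bytes, page: bytes) -> Diff:
    """Run-length encode the bytes of page that differ from twin."""
    diff: Diff = []
    i, n = 0, len(page)
    while i < n:
        if twin[i] != page[i]:
            start = i
            while i < n and twin[i] != page[i]:
                i += 1
            diff.append((start, page[start:i]))
        else:
            i += 1
    return diff


def apply_diff(page: bytearray, diff: Diff) -> None:
    """Merge one diff into a copy of the page, in place."""
    for offset, run in diff:
        page[offset:offset + len(run)] = run


twin = bytes(16)                                  # unmodified page contents
copy_a = bytearray(twin); copy_a[2:4] = b"\x01\x02"   # writer A
copy_b = bytearray(twin); copy_b[10] = 9              # writer B, same page
diff_a = make_diff(twin, bytes(copy_a))
diff_b = make_diff(twin, bytes(copy_b))

merged = bytearray(twin)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
print(diff_a, diff_b)
```

Because the two writers touched disjoint bytes, both diffs merge into
one version of the page, which is how falsely shared pages can be
written concurrently.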

Each shared page has a unique, statically assigned owner. Each
processor keeps an approximate copyset for every shared memory page.
The copyset is initialized to the owner's copyset when a page is
initially received, and updated according to subsequent write notices
and diff requests. The copysets are used in the eager protocols to
flush invalidations or updates to all other processors at releases.
Since the copyset is only approximate, multiple rounds are sometimes
needed to ensure that the consistency information reaches every cacher
of the modified pages. The copysets are used by LH to determine which
write notices should be accompanied by diffs.

Table 1 summarizes the message counts for locks, barriers, and access
misses for each of the protocols. In this table, the concurrent last
modifiers for a page are the processors that created modifications
that do not precede, according to happened-before, any other known
modifications to that page.
4.1 The Eager Protocols
4.1.1 Locks
We base our eager RC algorithms on Munin's multiple-writer protocol. A
processor delays propagating its modifications of shared data until it
comes to a release. At that time, write notices, together with diffs
in the EU protocol, are sent to all other processors that cache the
modified pages, possibly taking multiple rounds if the local copysets
are not up to date.

A lock release is delayed until all modifications have been
acknowledged by the remote cachers. An acquire consists solely of
locating the processor that executed the corresponding release and
transferring the synchronization variable. No consistency-related
operations occur at lock acquires.
4.1.2 Barriers
At barrier arrivals the EI protocol sends synchroniza
tion and consistency information to the master in a sin
gle message However the EI barrier protocol has a
slight complication in that multiple processors may in
validate the same page at a barrier In order to prevent
all copies of a page from being invalidated the mas
ter designates one processor as the winner for each
page Only the winner retains a valid copy for a given
concurrently modied page The losers forward their
modications to the winner and invalidate their local
copies
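
The winner designation can be sketched as follows. The text does not
say how the master picks a winner, so choosing the lowest-numbered
invalidator is purely an assumption of this sketch, as are the names
pick_winners and loser_actions.

```python
from typing import Dict, List, Set, Tuple


def pick_winners(invalidators: Dict[int, Set[int]]) -> Dict[int, int]:
    """Map each concurrently modified page to the one processor that
    keeps a valid copy (assumption: the lowest-numbered invalidator)."""
    return {page: min(procs) for page, procs in invalidators.items() if procs}


def loser_actions(invalidators: Dict[int, Set[int]]
                  ) -> Dict[int, List[Tuple[int, int]]]:
    """For each losing processor, list (page, winner) pairs: forward the
    modifications of page to winner, then invalidate the local copy."""
    winners = pick_winners(invalidators)
    actions: Dict[int, List[Tuple[int, int]]] = {}
    for page, procs in invalidators.items():
        for proc in sorted(procs):
            if proc != winners[page]:
                actions.setdefault(proc, []).append((page, winners[page]))
    return actions


# Page 4 was modified by processors 1 and 3, page 8 only by processor 2.
arrivals = {4: {1, 3}, 8: {2}}
print(pick_winners(arrivals))
print(loser_actions(arrivals))
```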

In the EU protocol, each processor flushes modifications to all other
cachers of locally modified pages before sending a synchronization
message to the barrier master.
4.1.3 Access Misses
Access misses are treated identically for both protocols. A message is
sent to the owner of the page. The owner forwards the request to a
processor that has a valid copy. This processor then sends the page to
the processor that incurred the access miss.
4.2 The Lazy Protocols
4.2.1 Locks
At an acquire the protocol locates the processor that
last executed a release on the same variable The re
leaser sends both synchronization and consistency in
formation to the acquirer in a single message The
consistency information consists of write notices for all
modications that have been performed at the releaser
but not the acquirer While LI moves data only in re
sponse to access misses both the LH and LU protocols
send di s along with the synchronization and consis
tency information However LH moves di s only from
the releaser to the acquirer and hence can append them
to an already existing message The releaser sends all
di s that correspond to modications being performed
at the acquire for the rst time such that for each di 
the acquirer is in the releasers copyset for the page
named by the di  Pages named by write notices that
arrive without di s are invalidated
The LU protocol never invalidates pages An acquire
does not succeed until all of the di s described by the
new write notices have been obtained In general the
acquirer must talk to other processors in order to pick
up all of the required di s However the number of pro
cessors with which the acquirer needs to communicate
can be reduced because of the following observation If
processor p modies a page at time t then all di s of
that page that precede the modication according to
happened before  can be obtained from processor p
4.2.2 Barriers
At barrier arrivals the LI protocol sends synchroniza
tion information and write notices to the master in a
single message When all processors have arrived the
barrier master sends a single message to each proces
sors that contains the barrier release as well as all the
write notices that it has collected
LH and LU barrier arrivals are handled similarly In
both cases each processor pushes updates to all proces
sors that cache pages that have been modied locally
before sending a barrier arrival message to the master
The only di erence is that in LU the processes must
wait on the arrival of the data before departing from
the barrier
4.2.3 Access Misses
Access misses are handled identically by LH, LI, and LU. At a miss, a
copy of the page and a number of diffs may have to be retrieved. The
number of sites that need to be queried for diffs can be reduced
through the same logic as in Section 4.2.1. The new diffs are then
merged into the page and the processor is allowed to proceed. The lazy
protocols determine the location of a page or updates to the page
entirely on the basis of local information. No additional messages are
required, unlike in other DSM systems.
5 Methodology
5.1 Application Suite
We simulated four programs from three different classes of
applications. Jacobi and TSP are coarse-grained programs with a large
amount of computation relative to synchronization ( and  cycles per
processor between off-node synchronization operations, respectively,
at  processors). Our Jacobi program is a simple Successive
Over-Relaxation program that works on grids of  by  elements. TSP
solves the traveling salesman problem for  city tours. Water, from the
SPLASH suite, is a medium-grained molecular dynamics simulation (
cycles per processor between off-node synchronization operations). We
ran Water with the default parameters:  molecules for  steps. Cholesky
performs parallel factorization of sparse positive definite matrices
and is an example of a program with fine-grained parallelism from the
SPLASH benchmark suite ( cycles per processor between off-node
synchronization operations). Cholesky was run with the default input
file bcsstk. TSP and Cholesky use only locks for synchronization,
Jacobi uses only barriers, and Water uses both.
5.2 Architectural Model
We used two basic architectural models: an Ethernet model and an ATM
switch model. Both models assume  MHz RISC processors with  Kbyte
direct-mapped caches and a  cycle memory latency,  byte pages, and an
infinite local memory (no capacity misses). The Ethernet is modeled as
a  Mbit/sec broadcast network, while the ATM is modeled as a  Mbit/sec
crossbar switch.
5.3 Protocol Simulation
Table 1: Shared Memory Operation Message Costs
Columns: Access Miss, Lock, Unlock, Barrier. Rows: LH, LI, LU, EI, EU.
Entries are expressed in terms of the following quantities:
  m = concurrent last modifiers for the missing page
  h = other concurrent last modifiers for any local page
  c = other cachers of the page
  n = processors in system
  and the excess invalidators of page i

Each message exchanged by the protocols was modeled by the wire time
consumed by sending the message, any inherent network latency,
contention for the network, and a software overhead that represents
the operating system cost of calling a user-level handler for incoming
messages, creating and reading the messages in the DSM software, and
the cost of the DSM protocol implementation. This cost is set at
 message length    processor cycles at both the destination and
source of each message. These figures were modeled after the Peregrine
implementation overheads. Peregrine is an RPC system that provides
performance close to optimal by avoiding intermediate copying. The
lazy implementation's extra complexity is modeled by doubling the
per-byte message overhead both at the sender and at the receiver.
Diffs are modeled by charging four cycles per word per page for each
modified page at the time of diff creation. Although all messages are
simulated, protocol-specific consistency information is not reflected
in the amount of data sent. Only the actual shared data moved by the
protocols is included in message lengths.
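
The software overhead model can be sketched as below, assuming the
per-message cost has a fixed part plus a part proportional to message
length. The function names, the parameter names fixed_cycles and
per_byte_cycles, the 4-byte word, and every numeric value in the usage
lines are hypothetical placeholders, not the constants used in the
simulations.

```python
def message_overhead(length_bytes: int, fixed_cycles: float,
                     per_byte_cycles: float, lazy: bool = False) -> float:
    """Software overhead in processor cycles, charged once at the source
    and once at the destination of a message. Lazy protocols are modeled
    by doubling the per-byte part."""
    per_byte = 2 * per_byte_cycles if lazy else per_byte_cycles
    return fixed_cycles + length_bytes * per_byte


def diff_creation_cycles(page_bytes: int, word_bytes: int = 4,
                         cycles_per_word: int = 4) -> int:
    """Four cycles per word of the page for each modified page
    (assumption: 4-byte words)."""
    return cycles_per_word * (page_bytes // word_bytes)


# Placeholder constants chosen only to exercise the functions.
eager = message_overhead(100, fixed_cycles=1000, per_byte_cycles=2)
lazy = message_overhead(100, fixed_cycles=1000, per_byte_cycles=2, lazy=True)
print(eager, lazy, diff_creation_cycles(4096))
```

Only the length-dependent part grows for the lazy protocols, so the
penalty matters most for long messages.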
6 Simulation Results
6.1 DSM on an Ethernet
Although prior work showed that Ethernet-based software DSMs can
achieve significant speedups, we find that for modern processors the
Ethernet is no longer a viable option. Figure  shows the speedup of
Jacobi, a coarse-grained program. Jacobi's speedup peaks at  for eight
processors and declines rapidly thereafter. While Jacobi's
communication needs are modest in comparison with other programs, the
individual processors execute identical code and therefore create
significant network contention at each barrier. This contention is
especially significant for the update protocols, in which each
processor sends updates to its neighbors prior to the barrier. In an
 processor run, processors on average wait more than  milliseconds
before gaining control of the Ethernet.
6.2 DSM on an ATM
The emerging ATM networks have several advantages over the Ethernet.
Foremost among these are increased bandwidth and reduced opportunity
for contention. Unlike the Ethernet, in which all processors seeking
to communicate contend with each other, processors in an ATM network
can communicate concurrently and interfere only when they try to send
to a common destination.
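
The difference in contention can be illustrated with a deliberately
crude model. It assumes equal-length messages that each take one unit
of wire time, a shared medium that carries one message at a time, and a
crossbar on which only messages to a common destination serialize.
Sender-side serialization, latency, and collisions are ignored, and the
names ethernet_time and crossbar_time are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Message = Tuple[int, int]  # (source, destination)


def ethernet_time(messages: List[Message], wire_time: float = 1.0) -> float:
    """Shared broadcast medium: every message waits for all the others."""
    return len(messages) * wire_time


def crossbar_time(messages: List[Message], wire_time: float = 1.0) -> float:
    """Crossbar switch: only messages to the same destination serialize."""
    per_destination = Counter(dst for _, dst in messages)
    return max(per_destination.values(), default=0) * wire_time


# Nearest-neighbor exchange on a ring of four processors.
ring = [(p, (p + d) % 4) for p in range(4) for d in (1, -1)]
print(ethernet_time(ring), crossbar_time(ring))
```

In this nearest-neighbor pattern each destination receives two
messages, so the crossbar finishes in two message times while the
shared medium needs eight.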

Figures  summarize the performance of the Jacobi program on an ATM.
While the Ethernet simulation of Jacobi achieved a speedup of about ,
the ATM version reaches . Part of this increase is due to the
increased bandwidth, but much of it is due to the fact that no more
than two competing updates (from each of a processor's two neighbors)
ever arrive at a single destination during one interval. The
performance of all five protocols is roughly the same for this program
because of the regular nearest-neighbor sharing. The invalidate
protocols fare slightly worse than the update protocols because pages
on the edge of a processor's assigned data are invalidated at barriers
and have to be paged across the network. The lazy protocols perform
slightly worse than the eager protocols because of the extra overhead
added in the simulation for message processing. This overhead is
probably unjustified for Jacobi because of the nature of communication
involved. As will be seen in all of the simulations, EI moves
significantly more data than the other protocols because its access
misses cause entire pages to be transmitted, rather than diffs.

[Figure : Speedup for Jacobi on Ethernet (protocols LH, LI, LU, EI, EU)]
Like Jacobi TSP is a coarsegrained program with
modest amounts of communication Much of TSPs
ineciency results from contention for a global tour
queue Fully  of a processor execution is wasted
waiting for the queue lock In order to prevent repeated
acquires because of unpromising tours each acquirer
holds the queues lock while making a preliminary check
on the topmost tour If the tour is promising the
queues lock is released Otherwise the acquirer re
moves another tour from the queue
Figures  present TSPs performance There is
little variation among the lazy protocols and among the
eager protocols because of the large granularity and the
contention for the queue lock However the speedup
for the eager protocols is better than for the lazy pro
tocols TSP uses a branchandbound algorithm using
a global minimum to prune recursive searches Read
access to the current minimum is not synchronized A
processor may therefore read a stale version of the min
imum The lock protecting the minimum is acquired
only when the length of the tour just explored is smaller
than the potentially stale value of the minimum The
length is then rechecked against the value of the min
imum which is now guaranteed to be up to date and
the minimum is updated if necessary The eager pro
tocols push out the new value of the minimum at each
release and therefore local copies of the minimum are
frequently updated It is thus unlikely that a processor
would read a stale value unlike with the lazy protocols
where the local copy is only updated as a result of an
acquire Since the algorithm uses the global minimum
to prune searches such stale values may cause TSP to
explore more unpromising tours with the lazy protocols
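
The access pattern TSP uses for the global minimum can be sketched with
an ordinary lock. This only illustrates the pattern (unsynchronized
read, then acquire and recheck): threads in one Python process share
memory coherently, so the staleness effects of a DSM do not arise here.
The class name TourBound and its methods are hypothetical.

```python
import threading


class TourBound:
    """Global minimum tour length used to prune the search."""

    def __init__(self, initial: float) -> None:
        self.minimum = initial
        self._lock = threading.Lock()

    def worth_exploring(self, partial_length: float) -> bool:
        # Unsynchronized read: the value may be stale (too large), which
        # is safe but may let an unpromising tour be explored.
        return partial_length < self.minimum

    def offer(self, tour_length: float) -> bool:
        """Record tour_length if it improves on the current minimum."""
        if tour_length >= self.minimum:    # cheap unsynchronized check
            return False
        with self._lock:                   # acquire: value is now current
            if tour_length < self.minimum:  # recheck under the lock
                self.minimum = tour_length
                return True
            return False


bound = TourBound(100.0)
first = bound.offer(90.0)    # improves the minimum
second = bound.offer(95.0)   # rejected by the unsynchronized check
print(first, second, bound.minimum)
```

A stale minimum never makes the result wrong, because every update is
rechecked under the lock. It only weakens pruning, which is the cost
the lazy protocols pay in TSP.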
Water is a mediumgrained program that uses both
locks and barriers Waters data consists primarily of
an array of molecules each protected by a lock During
each iteration the force vectors of all molecules with
a spherical cuto  range of a molecule are updated to
reect the molecules inuence In combination with
the relatively small size of the molecule structure in
comparison with the size of a page this creates a large
amount of false sharing The simulation results for Wa
ter can be seen in Figures 	
 LH performs better
than the other protocols because the molecules migra
tory behavior during the force modication phase al
lows the protocol to have far fewer cache misses and
hence messages than the other protocols The lazy
protocols perform better than the eager protocols and
LH LI LU EI EU










Figure  Speedup for Jacobi












Figure  Message Count in Jacobi












Figure 	 Data Kbytes Transmitted in Jacobi
LH LI LU EI EU













 Speedup for TSP














Figure    Message Count in TSP











Figure   Data Kbytes Transmitted in TSP
LH LI LU EI EU











Figure   Speedup for Water












Figure   Message Count in Water











Figure   Data Kbytes Transmitted in Water
LH LI LU EI EU









Figure   Speedup for Cholesky










Figure   Message Count in Cholesky









Figure   Data Kbytes Transmitted in Cholesky
invalidate performs better than update EU sends an
order of magnitudemore messages than any of the other
protocols because releases cause updates to be sent to
many other processors Ninetyone percent of EUs
messages are updates sent during lock releases The
invalidate protocols send fewer messages because fewer
processors cache each page

Cholesky is a program with fine-grained synchronization that uses a
task queue approach to parallelism. Locks are used to dequeue tasks as
well as to protect access to multiple columns of data. Figures
 summarize Cholesky's performance. The large amount of synchronization
limits the speedup to no more than  for any of the protocols. The
eager protocols suffer from excessive updates and invalidations caused
by false sharing. The lazy protocols, and in particular LH, fare
better because communication is largely localized to the synchronizing
processors, leading to much better handling of false sharing.

Our simulations indicate that synchronization is a major obstacle to
achieving good performance on DSM systems. For example,  of the
messages required by Water running on the  processor ATM model under
the hybrid protocol were for synchronization. For Cholesky running on
 processors,  of the messages were used for synchronization. All but a
few of these synchronization messages were for lock acquisition.
Moreover,  of each processor's time was spent acquiring locks in the
 processor LH Cholesky run. While approximately one third of the lock
acquisition messages carried data, the rest were solely for
synchronization purposes. When a lock is reacquired by the same
processor before another processor acquires it, the lazy protocols
have an advantage over the eager protocols. An eager protocol must
distribute diffs at every lock release. Lazy release consistency
permits us to avoid external communication when the same lock is
reacquired.
 The Eect of Network Characteris
tics
The network is a shared resource that can be a performance bottleneck.
We can break down the network's effect on performance into three
categories: bandwidth, serialization, and collisions. Bandwidth
affects the total amount of data that can be moved. Serialization
refers to the processor wait time when other processors have control
of the contended network link. By collisions, we mean actual network
collisions as well as the effect of protocols like exponential backoff
that are used to avoid network collisions in the case of an Ethernet
network. Table 2 summarizes speedup for Jacobi and Water on five
different networks.

Table 2: Speedups With Different Network Characteristics
(LH,  processors)

Network                        Jacobi   Water
 Mbit Ethernet w/ Coll.
 Mbit Ethernet w/o Coll.
 Mbit ATM
 Mbit ATM
 Gbit ATM
Jacobi communicates with neighbors at a barrier. Both the implementation of barriers and the access pattern (regular, to fixed neighbors) benefit from a point-to-point network that eliminates most serialization. Hence most of the benefits of ATM for this program are from the concurrency in the network. Water's access pattern is much less regular because molecules move. The potential for communication to be completed entirely in parallel is significantly reduced. As a result, Water benefits as much from network concurrency as from increased bandwidth. Increasing the network bandwidth to   Gbit/sec does not improve performance significantly with a   MHz processor, since at this point the software overhead is the major performance bottleneck.
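A back-of-the-envelope cost model shows why extra bandwidth stops helping once a fixed per-message software cost dominates. This is a minimal sketch with hypothetical values (a 4096-byte message, 100 and 1000 Mbit/s links, 500 microseconds of software overhead); none of these numbers are taken from the simulations.

```python
def wire_time_us(size_bytes, bandwidth_mbit):
    """Transmission time in microseconds (bits divided by Mbit/s)."""
    return size_bytes * 8 / bandwidth_mbit


def message_time_us(size_bytes, bandwidth_mbit, overhead_us):
    """Fixed software overhead plus transmission time."""
    return overhead_us + wire_time_us(size_bytes, bandwidth_mbit)


def gain(size_bytes, slow_mbit, fast_mbit, overhead_us):
    """Factor by which one message gets faster on the faster link."""
    slow = message_time_us(size_bytes, slow_mbit, overhead_us)
    fast = message_time_us(size_bytes, fast_mbit, overhead_us)
    return slow / fast


# Hypothetical values: with no software overhead the tenfold bandwidth
# increase carries over fully; with overhead most of the gain disappears.
print(gain(4096, 100, 1000, overhead_us=0))
print(gain(4096, 100, 1000, overhead_us=500))
```

The model ignores latency, contention, and serialization, so it only illustrates the bandwidth term of the discussion above.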
  The Eect of Software Overheads
Software overheads have a significant impact on performance. Table  shows the simulated performance of an ATM network in the processor case with no software overhead, with software overhead identical to that used in the previous simulations, and with double that amount.
We first removed the overhead in order to find an upper bound on DSM performance for the given network and processor architecture, regardless of the operating system and DSM implementation. The large speedups indicate the performance potential for the protocols and the potential gains to be had from hardware support.
With software overhead removed, there is no longer a significant per-message penalty on a crossbar network. This lessens the importance of access misses and favors protocols that reduce the amount of data moved for improved performance. For instance, the LI protocol outperforms LH on a processor Cholesky run even though the LH protocol sends   fewer messages and has   fewer access misses than the LI protocol. The reason is that the hybrid protocol attempts to find a compromise between low message counts, low numbers of access misses, and low amounts of data, but the data total is more significant if software overhead is removed.
Table  Speedups With Varying Software Overhead (  processors)
(rows: Jacobi, TSP, Water, Chol; each at Zero, Normal, and Double overhead)
The significance of software overhead can be seen most clearly in comparing the speedups of Water with and without overhead. The lazy protocols improve by an average of   when the overhead is removed. EI still performs badly because of the amount of data it moves (five times more than any of the other protocols). EU, which runs three times slower than the LH protocol when software overhead is included, speeds up by more than   when software overhead is removed.
In order to determine the variation in performance that might occur due to an increase in software overhead, we determined speedups when the overhead per message was doubled. The performance decreases by   to   for Water. The decrease in performance is not as large as when going from zero to normal overhead, since the normal overhead includes the per-diff overhead, which is significant. In general, the lazy protocols, and in particular the lazy hybrid, perform better as communication becomes more expensive.
The Effect of Processor Speeds
Processor speeds affect the ratio of computation time to communication time. However, the software overhead is proportional to the processor speed. We varied the processor speeds from   to   MHz. Table  shows the variation in speedup for the processor case when using the lazy hybrid protocol in the case of Jacobi, TSP, and Water, and the processor case for Cholesky. For Jacobi and TSP, the variations are negligible because the low message counts for these programs result in little variation in the computation-to-communication ratio. Water and Cholesky show a more significant variation in speedup due to the larger amount of communication. In the latter two cases, communication latency is as much of a bottleneck as the software overheads, and hence an increased processor speed reduces speedup. However, some of the improvements are masked by the corresponding changes in software overheads.
The Effect of Page Size
The large page sizes in common use in software DSMs result in a high probability of false sharing. Prior work has developed implementations of relaxed memory consistency models for DSM that reduce, but do not totally eliminate, the effects of false sharing. For example, Munin's eager implementation of release consistency eliminates the "ping-pong" effect of a page bouncing between two writing processors. However, modifications to falsely shared pages still have to be distributed to all processors caching the page at a release. The lazy hybrid protocol further reduces the effect of false sharing because data movement only occurs between synchronizing processors. In other words, false sharing in LH increases the amount of data movement but not the number of messages.
The results we have reported are for a page size of   bytes. To obtain a measure of the effects of false sharing, we ran simulations using a page size of   bytes. While going to a  byte page reduces false sharing, we found that we need to communicate with approximately the same number of processors to maintain consistency. Furthermore, the resulting reduction in communication is often partially counterbalanced by the increased number of access misses (see Table  , which presents data for the lazy hybrid protocol). While reducing the page size has a limited effect on performance, restructuring the program may prove more beneficial.
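How page size interacts with false sharing can be seen in a small sketch that maps writes to pages and reports the pages written by more than one processor. The function name, the write trace, and the page sizes of 4096 and 1024 bytes are hypothetical, and the sketch assumes that no two processors write the same address, so every multiply-written page is falsely shared.

```python
from collections import defaultdict


def falsely_shared_pages(writes, page_size):
    """Return the set of page numbers written by more than one processor.

    writes is an iterable of (processor, byte_address) pairs. Assumption:
    processors write disjoint addresses, so sharing a page is false sharing.
    """
    writers = defaultdict(set)
    for proc, addr in writes:
        writers[addr // page_size].add(proc)
    return {page for page, procs in writers.items() if len(procs) > 1}


# Hypothetical trace: four processors each write their own 1024-byte block.
trace = [(p, p * 1024 + off) for p in range(4) for off in range(0, 1024, 256)]

print(falsely_shared_pages(trace, page_size=4096))
print(falsely_shared_pages(trace, page_size=1024))
```

In this idealized layout the smaller page removes false sharing entirely. The simulations above suggest that real access patterns are less tidy, which is why the benefit of a smaller page is limited in practice.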
Processor speed (MHz)   Jacobi   TSP   Water   Chol
Table  Speedups With Different Processor Speeds (LH,   processors)
Procs   Page size (bytes)   Jac   TSP   Wat   Chol
Table  Effect on Speedup of Reducing the Page Size to   bytes (LH)
Related Work
This work draws on the large body of research in relaxed memory consistency models. We have chosen as our basic model the release consistency model introduced by the DASH project at Stanford because it requires little or no change to existing shared memory programs. An interesting alternative is entry consistency (EC), defined by Bershad and Zekauskas. EC differs from RC because it requires all shared data to be explicitly associated with some synchronization variable. On a lock acquisition, EC only needs to propagate the shared data associated with the lock. EC, however, requires the programmer to insert additional synchronization in shared memory programs to execute correctly on an EC memory. Typically, RC does not require additional synchronization.
Ivy and Munin are two implementations of software DSMs for which performance measurements have been published. Both achieve good speedups on many of the applications studied. The slow processors used in the implementations prevented the network from becoming a bottleneck in achieving these speedups. With faster processors, faster networks are needed and more sophisticated methods are required. In addition, synchronization latency becomes a major issue. Performance measurements are also available for the DASH hardware DSM multiprocessor. Comparison between these numbers and our simulation results indicates the benefits of a dedicated high-speed interconnect for fine-grained parallel applications.
Conclusions
With the advent of faster processors, the performance of DSM that can be achieved on an Ethernet network is limited. Serialization of messages, collisions, and low bandwidth severely constrain speedups, even for coarse-grained problems. Higher-bandwidth point-to-point networks, such as the ATM LANs appearing on the market, allow much better performance, with good speedups even for medium-grained applications. Fine-grained applications still perform poorly even on such networks because of the frequency and cost of synchronization operations.
Lazy hybrid is a new consistency protocol that combines the benefits of invalidate protocols (relatively little data) and update protocols (fewer access misses and fewer messages). In addition, the lazy hybrid shortens the lock acquisition latency considerably compared to a lazy update protocol. The hybrid protocol outperforms the other lazy protocols under a model that takes into account software overhead for communication. For medium-grained applications, the differences are quite significant.
The latency of synchronization remains a major problem for software DSMs. Without resorting to broadcast, it appears impossible to reduce the number of messages required for lock acquisition. Therefore, the only possible approach may be to hide the latency of lock acquisition. Multithreading is a common technique for masking the latency of expensive operations, but the attendant increase in communication could prove prohibitive in software DSMs. Program restructuring to reduce the amount of synchronization may be a more viable approach.
References
 S V Adve and M D Hill A unied formaliza
tion of four sharedmemory models Technical Re
port CS
 University of Wisconsin Madison
September 
 M Ahamad PW Hutto and R John Im
plementing and programming causal distributed
shared memory In Proceedings of the th In 
ternational Conference on Distributed Computing
Systems pages  May 
	 HE Bal and AS Tanenbaum Distributed pro
gramming with shared data In Proceedings of the
 International Conference on Computer Lan 
guages pages  October 
 BN Bershad and MJ Zekauskas Midway
Shared memory parallel programming with entry
consistency for distributed memory multiproces
sors Technical Report CMUCS Carnegie
Mellon University September 

 JB Carter JK Bennett and W Zwaenepoel
Implementation and performance of Munin In
Proceedings of the th ACM Symposium on Oper 
ating Systems Principles pages 
 October

 JS Chase FG Amador ED Lazowska HM
Levy and RJ Littleeld The Amber system
Parallel programming on a network of multiproces
sors In Proceedings of the th ACM Symposium
on Operating Systems Principles pages 

December 
 R G Covington S Dwarkadas J R Jump
S Madala and J B Sinclair The Ecient Simu
lation of Parallel Computer Systems International
Journal in Computer Simulation 	
 January

 M Dubois and C Scheurich Memory access
dependencies in sharedmemory multiprocessors
IEEE Transactions on Computers 	
June 
 K Gharachorloo D Lenoski J Laudon P Gib
bons A Gupta and J Hennessy Memory con
sistency and event ordering in scalable shared
memory multiprocessors In Proceedings of the
th Annual International Symposium on Com 
puter Architecture pages 
 May 
 DB Johnson and W Zwaenepoel The Peregrine
highperformance RPC system Software	 Practice
and Experience 	 February 	
 P Keleher A L Cox and W Zwaenepoel Lazy
release consistency for software distributed shared
memory In Proceedings of the th Annual In 
ternational Symposium on Computer Architecture
pages 	 May 
 D Lenoski J Laudon K Gharachorloo
A Gupta and J Hennessy The directorybased
cache coherence protocol for the DASH multipro
cessor In Proceedings of the th Annual In 
ternational Symposium on Computer Architecture
pages 
 May 
	 K Li and P Hudak Memory coherence in shared
virtual memory systems ACM Transactions on
Computer Systems 		
 November 
 JP Singh WDWeber and A Gupta SPLASH
Stanford parallel applications for sharedmemory
Technical Report CSLTR Stanford Uni
versity April 
