Techniques for Reducing Consistency-Related Communication in Distributed Shared Memory System by Zwaenepoel, W et al.
Techniques for Reducing ConsistencyRelated
Communication in Distributed Shared Memory Systems
John B Carter
 
 John K Bennett and Willy Zwaenepoel
Computer Systems Laboratory
Rice University
Houston TX 	
Abstract
Distributed shared memory (DSM) is an abstraction of shared memory on a distributed
memory machine. Hardware DSM systems support this abstraction at the architecture
level; software DSM systems support the abstraction within the runtime system. One of
the key problems in building an efficient software DSM system is to reduce the amount
of communication needed to keep the distributed memories consistent. In this paper
we present four techniques for doing so: (1) software release consistency, (2) multiple
consistency protocols, (3) write-shared protocols, and (4) an update-with-timeout
mechanism. These techniques have been implemented in the Munin DSM system. We
compare the performance of seven Munin application programs, first to their
performance when implemented using message passing, and then to their performance when
running on a conventional software DSM system that does not embody the above
techniques. On a processor cluster of workstations, Munin's performance is within 
of message passing for four out of the seven applications. For the other three,
performance is within  to . Detailed analysis of two of these three applications
indicates that the addition of a function shipping capability would bring their
performance to within  of the message passing performance. Compared to a conventional
DSM system, Munin achieves performance improvements ranging from a few to several
hundred percent, depending on the application.
 
Present address Department of Computer Science University of Utah  Merrill Engineering Building Salt Lake
City UT 	

This research was supported in part by the National Science Foundation under Grants CDA CCR
CCR	 by the IBM Corporation under Research Agreement No 
	 by the Texas Advanced Technology
Program under Grants 		 and 	
 and by a NASA Graduate Fellowship
1 Introduction
1.1 Background
There are two fundamental models for parallel programming and for building parallel machines:
shared memory and distributed memory (or message passing). The shared memory model is a direct
extension of the conventional uniprocessor model, wherein each processor is provided with the
abstraction that there is but a single memory in the machine. An update to shared data therefore
becomes visible to all the processors in the system. In contrast, in the distributed memory model
there is no single shared memory. Instead, each processor has a private memory to which no other
processor has direct access. The only way for processors to communicate is through explicit message
passing.
Distributed memory machines are easier to build, especially for large configurations, because
unlike shared memory machines they do not require complex and expensive hardware cache
controllers. The shared memory programming model is, however, more attractive, since most
application programmers find it difficult to program machines using a message passing paradigm
that requires them to explicitly partition data and manage communication. Using a programming
model that supports a global address space, an applications programmer can focus on algorithmic
development rather than on managing partitioned data sets and communicating values.
A distributed shared memory (DSM) system provides a shared memory programming model on
a distributed memory machine. Hardware DSM systems, e.g., DASH, support this abstraction
at the architecture level; software DSM systems, such as Ivy and Munin, support this
abstraction within the runtime system. Software DSM systems consist of the same hardware as
that found in a distributed memory machine, with the addition of a software layer that provides the
abstraction of a single shared memory. In practice, each memory remains physically independent,
and all communication takes place through explicit message passing performed by the DSM software
layer. DSM systems combine the best features of shared memory and distributed memory machines:
they support the convenient shared memory programming model on distributed memory hardware,
which is more scalable and less expensive to build. However, although many DSM systems have
been proposed and implemented, achieving good performance on DSM
systems for a sizable class of applications has proven to be a major challenge.
This challenge can be best illustrated by considering how a conventional DSM system is
implemented. The global shared address space is divided into virtual memory pages. The local
memory of each processor is used as a cache on the global shared address space. When a processor
attempts to access a page of global virtual memory for which it does not have a copy, a page fault
occurs. This page fault is handled by the DSM software, which retrieves a copy of the missing page
from another node. If the access is a read, then the page becomes replicated in read-only mode.
If the access is a write, then all other copies of the page are invalidated. Throughout the rest of
this paper, the term "conventional DSM" refers to a DSM system that employs a page-based
write-invalidate consistency protocol such as the one just described.
The primary source of overhead in a conventional DSM system is the large amount of
communication that is required to maintain consistency or, put another way, to maintain the shared memory
abstraction. Ideally, the amount of communication for an application executing on a DSM system
should be comparable to the amount of communication for the same application executing directly
on the underlying message passing system. Conventional DSM systems have found it difficult to
achieve this goal because of restrictive memory consistency models and inflexible consistency
protocols. The false sharing problem is an example of this phenomenon. False sharing occurs when two
threads on different machines concurrently update different shared data items that lie in the same
virtual memory page. In conventional DSM systems, this false sharing can cause a page to
"ping-pong" back and forth between different machines. In contrast, in a message passing system each
thread would independently update its own copy of the data without unnecessary communication.
Some of these problems can be overcome by carefully restructuring the shared memory programs
to reflect the way that the DSM system operates. For example, one could decompose the shared
data into small page-aligned pieces, or one could introduce new variables to reduce the amount of
false sharing. However, this restructuring can be as tedious and difficult as using message passing
directly.
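The write-invalidate behavior and the ping-pong effect described above can be illustrated with a
toy model. The sketch below is not Munin or Ivy code: the class name `ConventionalPage`, the
helper functions, and the cost of two messages per page fetch or invalidation are all assumptions
made for illustration.

```python
class ConventionalPage:
    """Toy model of one page under a page-based write-invalidate protocol.

    Assumption: every page fetch and every invalidation costs two messages
    (request + reply, or invalidate + acknowledgement).
    """

    def __init__(self, nodes, owner):
        # Per-node access rights: "invalid", "read", or "write".
        self.rights = {node: "invalid" for node in nodes}
        self.rights[owner] = "write"
        self.messages = 0

    def read(self, node):
        if self.rights[node] != "invalid":
            return  # local hit, no communication
        self.messages += 2  # fetch a copy from another node
        for other, right in self.rights.items():
            if right == "write":
                self.rights[other] = "read"  # the writer is downgraded
        self.rights[node] = "read"

    def write(self, node):
        if self.rights[node] == "write":
            return  # already the sole writer
        if self.rights[node] == "invalid":
            self.messages += 2  # fetch the page first
        for other, right in self.rights.items():
            if other != node and right != "invalid":
                self.messages += 2  # invalidate that copy
                self.rights[other] = "invalid"
        self.rights[node] = "write"


def false_sharing_messages(rounds):
    """A and B alternately write different variables that share one page."""
    page = ConventionalPage(["A", "B"], owner="A")
    for _ in range(rounds):
        page.write("A")
        page.write("B")
    return page.messages


def separate_pages_messages(rounds):
    """The same writes when each variable lives on its writer's own page."""
    page_a = ConventionalPage(["A", "B"], owner="A")
    page_b = ConventionalPage(["A", "B"], owner="B")
    for _ in range(rounds):
        page_a.write("A")
        page_b.write("B")
    return page_a.messages + page_b.messages
```

Under these assumed costs, every write after the first in `false_sharing_messages` moves the
page to the other node, while `separate_pages_messages` generates no traffic at all. The second
function corresponds to the manual restructuring into page-aligned pieces mentioned above.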
1.2 Summary of Results
In this paper we present the following four techniques for reducing the amount of communication
needed for keeping the distributed memories consistent
 Software release consistency is a software implementation of release consistency  speci
cally aimed at reducing the number of messages required to maintain consistency in a software
DSM system Roughly speaking release consistency requires memory to be consistent only
at specic synchronization points
 Multiple consistency protocols are used to keep memory consistent in accordance with the
observation that no single consistency protocol is the best for all applications or even for all
data items in a single application  
 Writeshared protocols address the problem of false sharing in DSM by allowing multiple
processes to write concurrently into a shared page with the updates being merged at the
appropriate synchronization point in accordance with the denition of release consistency
 An updatewithtimeout mechanism which is in essence an update protocol that causes re
mote copies of shared data to be updated rather than invalidated However copies that are
not referenced during the last timeout interval are deleted eliminating the need for further
updates and thus reducing the total amount of communication
These techniques have been incorporated in the Munin DSM system. Munin has been
implemented on a network of SUN workstations running the V-System. The Munin programming
interface is the same as that of conventional shared memory parallel programming systems, except
that it requires (i) all synchronization to be visible to the runtime system, and (ii) all shared
variables to be declared as such and (optionally) annotated with the consistency protocol to be used.
Other than that, Munin provides thread, synchronization, and data sharing facilities like those
found in many shared memory parallel programming systems.
To evaluate the benets of these optimizations we measured the performance of seven shared
memory parallel programs Matrix Multiplication 
MULT Finite Dierencing 
DIFF both a
coarsegrained and a negrained version of the Traveling Salesman Problem 
TSPC and TSP
F Quicksort 
QSORT Fast Fourier Transform 
FFT and Gaussian Elimination with partial
pivoting 
GAUSS Three versions of each program were written a message passing version a
Munin DSM version and a conventional DSM version The computational aspects of all three
versions of each application were identical The conventional DSM versions use a pagebased write
invalidate protocol as described in Section 
Munins performance is within  of message passing for MULT DIFF TSPC and FFT For
TSPF QSORT and GAUSS performance is within 	 to  Detailed analysis of TSPF and
QSORT indicates that the addition of a function shipping capability would bring their performance

within  of the message passing performance Compared to a conventional DSM system Munin
achieves performance improvements ranging from a few percent for MULT to several hundred
percent for FFT
1.3 Outline of the Paper
The rest of this paper is organized as follows. Section 2 describes the techniques for reducing
consistency-related communication. Section 3 summarizes some aspects of the implementation
that are relevant to the performance evaluation. Section 4 describes the applications used in the
evaluation, as well as the experimental methodology. Section 5 contains an overview of the results,
followed by a program-by-program comparison of the performance of the Munin, message passing,
and conventional DSM versions in Section 6. Section 7 attempts to isolate the benefits of the
different techniques used to reduce consistency-related communication. Section 8 explores the
additional performance benefits that could be achieved by the use of function shipping. Related
work is discussed in Section 9. We conclude in Section 10.
2 Techniques for Reducing Communication
This section describes the four techniques employed by the Munin DSM system to reduce
consistency-related communication.
2.1 Software Release Consistency
Conventional DSM systems employ the sequential consistency model as the basis for their
consistency protocols. Sequential consistency essentially requires that any update to shared data
become visible to all other processors before the updating processor is allowed to issue another read
or write to shared data. This requirement imposes severe restrictions on possible performance
optimizations.
Among the various relaxed memory models that have been developed, we chose the release
consistency model developed as part of the DASH project. Release consistency exploits the fact
that programmers use synchronization to separate accesses to shared variables by different threads.
The system then only needs to guarantee that memory is consistent at select synchronization points.
This ability to allow temporary, but harmless, inconsistencies is what gives release consistency its
power. Consider, for example, a program where all access to shared data is enclosed in critical
sections. Release consistency guarantees that when a thread successfully acquires the critical section
lock, it gains access to a version of shared data that includes all modifications made before the lock
was last released. Similarly, for a program where all processes synchronize at a barrier, when a
thread departs from the barrier it is guaranteed to see all modifications made by all other threads
before they reached the barrier. In general, if a program is free of data races, or in other words, if
there is synchronization between all conflicting shared memory accesses, then the program generates
the same results on a release consistent memory system as it would on a sequentially consistent
memory system. Experience with release consistent memories indicates that, because of the
need to handle arbitrary thread preemption, most shared memory parallel programs are free of
data races even when written assuming a sequentially consistent memory.
More formally the following constraints on the memory subsystem ensure release consistency
 Before an ordinary read or write is allowed to perform with respect to any other processor
all previous acquire accesses must be performed

 Before a release access is allowed to perform with respect to any other processor all previous
read and write accesses must be performed
 Synchronization accesses must be sequentially consistent with one another
Lock acquires and releases map in the natural way onto acquires and releases. A barrier arrival
is treated as a release, while a barrier departure is treated as an acquire. Release consistency
relaxes the constraints of sequential consistency in three ways: (i) ordinary reads and writes can
be buffered or pipelined between synchronization points, (ii) ordinary reads and writes following a
release do not have to be delayed for the release to complete (i.e., a release only signals the state
of past accesses to shared data), and (iii) an acquire access does not have to delay for previous
ordinary reads and writes to complete (i.e., an acquire only controls the state of future accesses
to shared data). The first point is the primary reason for release consistency's efficiency. Because
ordinary reads and writes can be buffered or pipelined, a release consistent memory can mask much
of the communication required to keep shared data consistent.
  Buered Update versus Pipelined Invalidate Release Consistency
The hardware implementation of release consistency in DASH pipelines invalidation messages
caused by writes to shared data. This implementation is primarily geared towards masking the
latency of writes, rather than reducing the number of messages sent. In a software DSM system,
where the overhead of sending messages is very high, it is more important to reduce the frequency of
communication than it is to mask latency by pipelining messages. For this reason, we developed an
implementation of release consistency that buffers writes instead of pipelining them, as illustrated
in Figures 1 and 2. These figures illustrate how writes to three shared variables (x, y, and z) within
a critical section are handled by an implementation of release consistency that uses pipelining and
an implementation that uses buffering, respectively. When a processor writes to several different
replicated data items within a critical section, the pipelining scheme sends one message per write,
while the buffering implementation buffers writes to shared data until the subsequent release, at
which point it transmits the buffered writes. Ideally, the buffering implementation reduces the
number of messages transmitted from one per write to one per critical section when there is a
single replica of the shared data. The dashed line portion of the execution graph represents the
delay that a processor experiences when releasing a lock. Because the buffering implementation
delays all writes until the release point, it must transmit all buffered writes then, increasing the
latency of releases. Nevertheless, the reduction in the number of messages far outweighs the effect
of the higher release latencies.
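The difference in message counts between the two schemes can be made concrete with a small
simulation. This is a minimal sketch, not the DASH or Munin implementation: the class names
`Replica`, `PipelinedWriter`, and `BufferedWriter` are hypothetical, acknowledgements are not
counted, and a single replica is assumed, as in the ideal case described above.

```python
class Replica:
    """A remote copy of the shared data; counts incoming consistency messages."""

    def __init__(self):
        self.memory = {}
        self.messages_received = 0

    def invalidate(self, name):
        self.messages_received += 1
        self.memory.pop(name, None)

    def update(self, changes):
        self.messages_received += 1
        self.memory.update(changes)


class PipelinedWriter:
    """Sends one invalidation per write, as soon as the write happens."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, name, value):
        for replica in self.replicas:
            replica.invalidate(name)

    def release(self):
        pass  # a real system would wait here for outstanding acknowledgements


class BufferedWriter:
    """Buffers writes locally and sends them merged at the release."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.buffer = {}

    def write(self, name, value):
        self.buffer[name] = value  # a later write to the same name overwrites

    def release(self):
        if not self.buffer:
            return
        for replica in self.replicas:
            replica.update(dict(self.buffer))
        self.buffer.clear()


def messages_for(writer_class, writes):
    """Run one critical section and return the messages the replica received."""
    replica = Replica()
    writer = writer_class([replica])
    for name, value in writes:
        writer.write(name, value)
    writer.release()
    return replica.messages_received
```

For the three writes to x, y, and z discussed above, `messages_for(PipelinedWriter, ...)` counts
one message per write and `messages_for(BufferedWriter, ...)` counts one per critical section.
The buffered variant pays for this at the release, where all of the data must be sent at once.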
Buering and pipelining reduce the cost of writes but have no eect on the cost of read misses
In software DSM systems the cost of these read misses is very high both in terms of communication
and in terms of the length of time that a thread stalls before resuming after a read miss The impact
of read misses can be partially mitigated by using an update protocol Update protocols based on
sequential consistency may perform poorly because of the large amount of communication required
to send update messages for every write An update protocol based on release consistency can
however buer writes which reduces substantially the amount of communication required
2.2 Multiple Consistency Protocols
Most DSM systems employ a single protocol to maintain the consistency of all shared data. The
specific protocol varies from system to system. For instance, Ivy supports a page-based
write-invalidate protocol, while Emerald uses object-oriented language support to handle shared
object invocations. Each of these systems, however, treats all shared data the same way. The use
of a single protocol for all shared data leads to a situation where some programs can be handled
effectively by a given DSM system, while others cannot, depending on the way in which shared
data is accessed by the program. To understand how shared memory programs characteristically
access shared data, we studied the access behavior of a suite of shared memory parallel programs.
The results of this study and others support the notion that using the flexibility of a
software implementation to support multiple consistency protocols can improve the performance
of DSM. They also suggest the types of access patterns that should be supported: conventional,
read-only, migratory, write-shared, and synchronization.

[Figure 1: Pipelining Invalidations. P1 performs w(x), w(y), and w(z); each write is sent to P2 as a
separate message and acknowledged separately, and P1's release stalls until the acknowledgements arrive.]
[Figure 2: Buffering and Merging Updates. P1 performs w(x), w(y), and w(z); at the release, a single
update message carrying x, y, and z is sent to P2 and acknowledged once.]

Conventional shared variables are replicated on demand and are kept consistent using an
invalidation-based protocol that requires a writer to be the sole owner before it can modify the
data. When a thread attempts to write to replicated data, a message is transmitted to invalidate
all other copies of the data. The thread that generated the miss blocks until all invalidation
messages are acknowledged. This single-owner consistency protocol is typical of what existing DSM
systems provide, and is what we use exclusively to represent a conventional DSM system
in our performance evaluation.
Once readonly data has been initialized no further updates occur Thus the consistency
protocol simply consists of replication on demand A runtime error is generated if a thread attempts
to write to readonly data
Migratory data is accessed multiple times by a single thread, including one or more writes,
before another thread accesses the data. This access pattern is typical of shared data that is
accessed only inside a critical section or via a work queue. The consistency protocol for migratory
data propagates the data to the next thread that accesses the data, provides the thread with read
and write access (even if the first access is a read), and invalidates the original copy. This protocol
avoids a write miss and a message to invalidate the old copy when the new thread first modifies
the data.

(Footnote to the list of access patterns above: The results of our original study indicated that there
were eight basic access patterns: private, write-once, migratory, write-many, producer-consumer,
result, read-mostly, and synchronization, but experience has made it clear that several of the
protocols were redundant. Specifically, the result and producer-consumer access patterns were
subcases of the write-shared access pattern.)
Writeshared variables are frequently written by multiple threads concurrently without inter
vening synchronization to order the accesses because the programmer knows that each thread
reads from and writes to dierent portions of the data Because of the way that the data is laid
out in memory access to writeshared data suers from the eects of false sharing if the DSM
system attempts to keep these dierent portions of the data consistent at all times This protocol
is discussed in more detail in Section 
We support three types of synchronization variables: locks, barriers, and condition variables.
Because synchronization variables are accessed in a fundamentally different way than normal data
objects, it is important that synchronization not be provided through shared memory, but rather
via a suite of synchronization library routines or a similarly specialized implementation. Doing so
reduces the number of messages required to implement synchronization, especially compared to
conventional spinlock algorithms, and thereby reduces the amount of time that threads spend
blocked at synchronization points.
 WriteShared Protocol
The writeshared protocol is designed specically to mitigate the eect of false sharing as dis
cussed in Sections  and  False sharing is a particularly serious problem for DSM systems for
two reasons 
i the consistency units are large so false sharing is very common and 
ii the la
tencies associated with detecting modications and communicating are large so unnecessary faults
and messages are particularly expensive The writeshared protocol allows concurrent writers and
buers writes until synchronization requires their propagation 
see Figure 
In order to record the modications to writeshared data the DSM system initially write protects
the virtuak memory pages containing the data When a processor rst writes to a page of write
shared data the DSM software makes a copy of the page 
a twin and queues a record for the
page in the delayed update queue 
DUQ as shown in Figure  The DSM them removes write
protection on the shared data so that further writes can occur without any DSM intervention
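[Figure 3: Write-Shared Protocol, Creating Twins. On the first Write(X) to a write-protected page, the
page is copied (copy on write) to create a twin, a record is placed on the delayed update queue, and the
original is made writable.]
[Figure 4: Write-Shared Protocol, Sending Out Diffs. The page X is write-protected (if replicated) and
compared with its twin; the differences are encoded as a "diff" that is used to update the replicas.]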
At release time the DSM system performs a wordbyword comparison of the page and its twin
and runlength encodes the results of this di into the space allocated for the twin 
see Figure 
Each encoded update consists of a count of identical words the number of diering words that
follow and the data associated with those diering words Each node that has a copy of a shared
object that has been modied is sent a list of the updates that are available Nodes receiving
update notications request the updates they require

 decode them and merge the changes into
their versions of the shared data A runtime switch allows this comparison to be performed at the
byte level as opposed to the word level if the data is more nely shared
Another runtime switch can be set to check for conflicting updates to write-shared data. If this
switch is set, then when a diff arrives at a processor that has a dirty copy of the page, the DSM
system checks whether any of the updates in the diff conflict with any of the local updates and,
if so, signals an error. The ability to detect conflicting updates allows Munin to support dynamic
data race detection.
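A minimal sketch of such a conflict check is shown below, assuming the same illustrative
representation as the text's description of an encoded update: a diff is a list of
(identical word count, differing word count, data) runs, and a local copy is dirty wherever it
differs from its local twin. The function names are hypothetical and not taken from Munin.

```python
def diff_positions(runs):
    """Return the word indices that an encoded diff would overwrite."""
    positions = []
    position = 0
    for identical, differing, _data in runs:
        position += identical
        positions.extend(range(position, position + differing))
        position += differing
    return positions


def find_conflicts(local_page, local_twin, incoming_runs):
    """Return the indices written both locally and by the incoming diff.

    A word counts as locally modified if it differs from the local twin.
    An empty result means the incoming diff can be merged safely.
    """
    return [
        index
        for index in diff_positions(incoming_runs)
        if local_page[index] != local_twin[index]
    ]


# The local node has modified words 1 and 4 since creating its twin.
local_twin = [0, 0, 0, 0, 0, 0]
local_page = [0, 1, 0, 0, 4, 0]

overlapping = [(1, 1, [9]), (1, 1, [3])]  # writes words 1 and 3
disjoint = [(2, 2, [5, 6])]               # writes words 2 and 3
```

Here `find_conflicts(local_page, local_twin, overlapping)` reports word 1, which both nodes wrote
without intervening synchronization, while the `disjoint` diff reports no conflict. One caveat of
this twin-based comparison: a local write that stores the value the word already held is not
detected as a modification.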
2.4 Update Timeout Mechanism
The performance of update protocols suffers from the fact that updates to a particular data item
are propagated to all of its replicas, including those that are no longer being used. This problem is
particularly severe in DSM systems, because the main memories of the nodes in which the replicas
are kept are very large, and it takes a long time before a page gets replaced, if at all. Without
special provisions, updates to these stale replicas can lead to a large number of unnecessary
consistency messages, resulting in poor performance. This effect is one reason that existing commercial
multiprocessors use invalidation-based protocols. We address this problem with a timeout
algorithm similar to the competitive snoopy caching algorithm devised by Karlin. The goal of the
update timeout mechanism is to invalidate replicas of a cached variable that have not been accessed
recently upon receipt of an update.
Munins update timeout mechanism is implemented as follows When receiving an update for
a page for which no twin exists locally the page is mapped such that it can only be accessed
in supervisor mode and the time of receipt of this update is recorded A local access causes a

If all of the encoded updates t into a single packet they are sent directly in place of the list of available updates
thus eliminating unnecessary communication in the event that only a small amount of shared data has been modied

fault as a result of which protection is removed and the timestamp is reset If the page is still in
supervisor mode when another update arrives 
meaning it has not been accessed locally since the
rst update and a certain time window  has expired 
 milliseconds in the prototype then the
page is invalidated and a negative acknowledgement is sent to originator of the update causing
it to no longer send updates to this processor In addition to avoiding unnecessary updates the
update timeout mechanism often reduces the number of messages sent in conjunction with updates
to stale data When a node receives an update message from another node that includes stale
updates the recipient node does not request the actual modications associated with the shared
data it is no longer caching Thus unless all of the updates described in the update message are
to stale data no extra work is performed to process the stale updates other than the small amount
of processing necessary to note that the updates are not needed If all of the updates are to stale
data the overhead is only a single packet exchange
The use of update timeouts results in a hybrid update-invalidate protocol that allows Munin
to gain the benefits of an update mechanism (i.e., the reduction in the number of read misses and
subsequent high-latency (idle) reloads), while at the same time retaining the superior scalability of
an invalidation protocol by limiting the extent to which stale copies of particular pages are updated.
3 The Munin DSM Prototype
The techniques described in Section 2 have been implemented in the Munin DSM system. Munin
was evaluated on a network of SUN workstations running the V-System, connected via an
isolated 10 megabit per second Ethernet. This section provides a brief overview of aspects of the
implementation of Munin that are relevant to its evaluation. A more detailed description of the
Munin prototype appears elsewhere.
3.1 Writing a Munin Program
Munin programmers write parallel programs using threads, as they would on many shared memory
multiprocessors. Synchronization is supported by library routines for the manipulation of locks,
barriers, and condition variables. All of the current applications were written in C.
Munin currently supports only statically allocated shared variables, although support for
dynamically allocated shared data could easily be added. The programmer annotates the declaration
of shared variables to specify what protocol to use to keep shared data consistent, e.g., shared
{protocol} <C type> <variable name>. The keyword shared is required to specify that a
variable will be shared among processes, although the protocol can be omitted. If the protocol is
omitted, the conventional protocol is used. Incorrect protocol annotations may result in inefficient
performance, or in runtime errors that are detected by the Munin runtime system, but not in
incorrect results. All of the shared data in the test programs was fully annotated.
3.2 Compiling and Linking a Munin Program
A preprocessor lters the source code in search of shared variable declarations For each such
declaration the preprocessor removes the Muninspecic shared fprotocolg portion and adds
an entry to an auxiliary le After preprocessing the source le is compiled with the regular
compiler The Munin linker reads the auxiliary le and relocates the shared variables to a shared
segment By default the linker places each shared variable on a separate page In addition the
Munin linker appends to the executable a shared segment symbol table that describes the layout

of the shared memory and the protocols to be used for the shared data These additions to Munin
executables had a negligible impact on program size or startup costs
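The preprocessing step might look roughly like the sketch below. It is not the Munin
preprocessor: the exact annotation syntax, the protocol keyword `migratory`, the one-declaration-
per-line restriction, and the shape of the auxiliary entries are all assumptions made for
illustration. The only behavior taken from the text is that the shared {protocol} portion is
removed, an entry is recorded for each shared variable, and the conventional protocol is used when
none is given.

```python
import re

# Assumed concrete syntax: "shared {protocol} <declaration>;" on a single line,
# where the "{protocol}" part may be omitted.
SHARED_DECL = re.compile(r"^(\s*)shared\s+(?:\{(\w+)\}\s*)?(.+?);\s*$")

# The variable name is the last identifier, optionally followed by array bounds.
# This handles simple scalar and array declarations only.
VAR_NAME = re.compile(r"(\w+)\s*(?:\[[^\]]*\]\s*)*$")


def preprocess(source):
    """Strip the annotations and collect (variable, protocol) entries.

    Returns the plain C source and the list of entries that a later
    link step could use to lay out the shared segment.
    """
    plain_lines = []
    entries = []
    for line in source.splitlines():
        match = SHARED_DECL.match(line)
        if match is None:
            plain_lines.append(line)  # ordinary code passes through unchanged
            continue
        indent, protocol, declaration = match.groups()
        name = VAR_NAME.search(declaration).group(1)
        entries.append((name, protocol or "conventional"))
        plain_lines.append(f"{indent}{declaration};")
    return "\n".join(plain_lines), entries


example = (
    "shared {migratory} int counter;\n"
    "shared double grid[64][64];\n"
    "int local;"
)
plain_source, aux_entries = preprocess(example)
```

After this step `plain_source` contains only ordinary C declarations, so it can be handed to an
unmodified compiler, and `aux_entries` plays the role of the auxiliary file read by the linker.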
3.3 Runtime Overview
Figure 5 illustrates the organization of a Munin program during runtime. On each participating
node, the Munin library is linked into the same address space as the user program, and thus can
access user data directly. The two major data structures used by the Munin runtime system are
the delayed update queue (see Section 2.3) and the object directory, which maintains the state of
the shared data being used by local user threads. A Munin system thread installs itself as the
page fault handler for the Munin program. As a result, the underlying V kernel forwards
to this thread all memory exceptions. The Munin thread also interacts with the V kernel to
communicate with the other Munin nodes over the network, and to manipulate the virtual memory
system as part of maintaining the consistency of shared memory. The prototype uses no features
of V for which equivalent features are not commonly available on other platforms (e.g., Unix or
Mach). In addition, we avoided using features that we believed might not be common on future
workstation clusters, such as reference bits in the page table or a multicast capability on the
network. For the update timeout mechanism, references are detected by mapping write-shared
pages to supervisor mode, so that the first reference to a page after it is updated results in a page
fault. We thus maintain a reference bit and timestamp for each page without requiring
hardware-supported reference bits. Although the prototype runs on a collection of workstations connected
via an Ethernet, the multicast capability of Ethernet was not used, so that our results could be
generalized to platforms without hardware multicast.
[Figure 5: Munin Runtime Organization. Each SUN 3/60 node runs user code and data together with the
Munin runtime (object directory and DUQ) on top of the V kernel; the nodes are connected by a
10 Mbps Ethernet network.]
3.4 The Object Directory
On each node the Munin runtime system maintains a pagelevel object directory containing in
formation on the state of each data item in the global shared memory as shown in Figure  All
shared variables on the same physical page are treated as a single object Variables that are larger
than a page eg a large array are treated as a number of independent pagesized objects Munin
uses variables rather than pages as the basic unit of granularity because this better reects the way
data is used and reduces the amount of false sharing between unrelated variables 
Munins strategies for maintaining the object directory are designed to reduce the number of
messages required to maintain the distributed object directory First in keeping with the goal
of avoiding centralized algorithms Munin distributes the state information associated with write
shared data across the nodes that contain cached copies of the data In many cases this elimination
of the notion of a static owner of data allows nodes to respond to requests completely locally
This is done by allowing directory entries to be inconsistent at times This approach also allows
Munin to exploit locality of reference when maintaining directory information since the need
to maintain a single consistent directory entry as has been proposed for most scalable shared
memory multiprocessors is eliminated Second Munin implements a dynamic ownership protocol
to distribute the task of data ownership across the nodes that use the data In general when a
shared data item is not owned by the local node the information in the local directory entry acts
as a hint to reduce the overhead of performing consistency operations
 Synchronization Support
Synchronization objects are accessed in a fundamentally different way than ordinary data. Thus,
Munin provides efficient implementations of locks, barriers, and condition variables that directly use
V's communication primitives rather than synchronizing through shared memory. More elaborate
synchronization mechanisms, such as monitors and atomic integers, can be built using these basic
mechanisms. Each Munin node maintains a synchronization object directory, analogous to the
data object directory, containing state information for the synchronization data. All of Munin's
synchronization primitives cause the local delayed update queue to be purged on a release.
 Locks
Munin employs a queue-based implementation of locks similar to existing implementations on
shared memory multiprocessors. This allows a thread to request ownership of a lock and block
awaiting a reply without repeated queries. The system associates an ownership token and a
distributed queue with each lock. A probable owner mechanism is used to locate the token or the
end of the queue associated with the lock. The token migrates to nodes as they become owners,
so no single node is responsible for maintaining the state of a given lock. This approach has the
same benefits, in terms of exploiting locality of reference, removing central bottlenecks, and reducing
communication, as Munin's distributed data ownership protocol. A frequent situation in which this
scheme works to particular advantage is when a thread attempts to reacquire a lock for which it
was the last owner. In this case, the thread finds the associated token to be available locally and
is thus able to acquire the lock immediately (without any message overhead). Similarly, if a small
subset of threads continuously reuses the same lock, they communicate only with one another.
When the lock ownership token is unavailable locally, a message is sent along the probable
owner chain to the last lock holder. If the lock is free (the token is available), the last lock holder
forwards the token to the requester, which acquires the lock and continues executing. Otherwise,
the thread that was at the end of the queue stores the locking thread's identity into a local data
structure without replying. Each enqueued thread knows the identity of the thread that follows
it on the queue (if any), so when a thread releases a lock and the associated queue is nonempty,
lock ownership is forwarded directly to the next thread in the queue after all delayed updates are
flushed, in accordance with the requirements of release consistency.
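The token-passing behavior described above can be modeled in a few lines of Python. This is a single-process toy, not the Munin implementation: the class `QueueLock` and its fields are hypothetical, a request is counted as one message regardless of how long the probable owner chain is, and the flushing of delayed updates on release is omitted.

```python
class QueueLock:
    """Toy model of a token-passing lock with a distributed waiter queue."""

    def __init__(self, home):
        self.token_at = home   # node where the ownership token currently resides
        self.tail = home       # last node to hold or request the lock
        self.holder = None     # node holding the lock, or None if it is free
        self.successor = {}    # node -> the node queued directly behind it
        self.messages = 0      # messages exchanged so far

    def acquire(self, node):
        """Return True if `node` now holds the lock, False if it was enqueued."""
        if self.holder is None and self.token_at == node:
            self.holder = node           # token is local: no messages needed
            return True
        self.messages += 1               # request travels to the end of the queue
        previous, self.tail = self.tail, node
        if self.holder is None:
            self.messages += 1           # free lock: the token is forwarded
            self.token_at = self.holder = node
            return True
        self.successor[previous] = node  # busy lock: remember who follows
        return False

    def release(self, node):
        """Release the lock; return the next holder, or None if nobody waits."""
        assert self.holder == node, "only the lock owner may release"
        nxt = self.successor.pop(node, None)
        if nxt is None:
            self.holder = None           # token stays put for cheap reacquisition
        else:
            self.messages += 1           # token goes directly to the next waiter
            self.token_at = self.holder = nxt
        return nxt


lock = QueueLock("A")
lock.acquire("A")
lock.release("A")
lock.acquire("A")
print(lock.messages)  # reacquisition by the last owner sends no messages
```

Because each waiter only records its immediate successor, the queue state is spread across the waiting nodes rather than held at a central manager.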
  Barriers
Barriers are used to simultaneously synchronize multiple threads. When a barrier is created, the
user specifies the number of threads that must reach the barrier before it is lowered. When a
thread wishes to wait at a barrier, it flushes any delayed updates, sends a message to the barrier
manager thread (a well-known thread located on the root node from which the Munin program was
invoked), and awaits a response. When all of the threads have arrived at the barrier, the barrier
manager replies to each waiting thread to let it resume. We considered using a distributed barrier
mechanism similar to those designed for scalable multiprocessor systems, but for the small size of
the prototype implementation a simple centralized scheme was more practical and efficient. Unlike
locks, which are point-to-point and which exhibit a high degree of locality that makes it beneficial
to migrate ownership, barriers are most often used to synchronize all of the user threads in the
program. In this case, locality of reference cannot be exploited, because single threads or small
subsets of threads do not tend to access the barrier without intervening accesses by other threads.
Thus, until the single barrier manager becomes a bottleneck, there is no reason to distribute barrier
ownership.
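A centralized barrier of the kind described above can be sketched as a small manager object. The Python below is only a model under stated assumptions: `BarrierManager` and `arrive` are hypothetical names, messages are counted rather than sent, and the flushing of delayed updates before the arrival message is left out.

```python
class BarrierManager:
    """Toy centralized barrier: one arrival message and one reply per thread."""

    def __init__(self, expected):
        self.expected = expected   # number of threads that must arrive
        self.waiting = []          # threads blocked at the barrier
        self.messages = 0          # arrival messages plus replies

    def arrive(self, thread_id):
        """Record an arrival; return the threads to wake (empty until all arrive)."""
        self.messages += 1
        self.waiting.append(thread_id)
        if len(self.waiting) < self.expected:
            return []
        released, self.waiting = self.waiting, []
        self.messages += len(released)  # one reply to each waiting thread
        return released


barrier = BarrierManager(expected=3)
for thread in ("t0", "t1", "t2"):
    woken = barrier.arrive(thread)
print(woken, barrier.messages)
```

In this model each barrier episode costs two messages per participating thread, all of which pass through the manager, which is the bottleneck the text refers to.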
 Condition Variables
Munins condition variables are essentially binary semaphores that also support a broadcast wakeup
capability Unlike locks condition variables give threads the capability to synchronize indirectly
Any thread can perform a signal operation while the lock protocol allows only the lock owner
to release the lock While it is possible to build this kind of mechanism using locks we found it
convenient to include condition variables as a primitive In accordance with the requirements of the
release consistency model delayed modications are ushed before the signal or broadcast message
is forwarded to the condition manager thread
 Evaluation
  Application Programs
Seven application programs were used in the evaluation. Three different versions of each application
were written: a Munin DSM version, a conventional DSM version that used the conventional
protocol for a sequentially consistent memory, and a message passing version. Great care was
taken to ensure that the inner loops of each computation, the problem decomposition, and the
major data structures for each version were identical. Except where noted, all array elements are
double-precision floating point numbers. Both the DSM system and the message passing programs
used V's standard communication mechanisms.
The DSM programs were originally written for a shared memory multiprocessor (a Sequent
Symmetry). Our results may therefore be viewed as an indication of the possibility of porting
shared memory programs to software DSM systems, but it should be recognized that better results
may be obtained by tuning the programs to a particular DSM environment. Table  summarizes
the seven application programs and problem sizes. An effort was made to select a suite of programs
that would represent a relatively wide spectrum of shared memory parallel programs, varying in
their parallelization techniques, granularity, degree and nature of sharing, and locality of shared
data references. Matrix Multiply (MULT), Finite Differencing (DIFF), and Gaussian Elimination
with partial pivoting (GAUSS) are numeric problems that statically distribute the data across
the threads. MULT, DIFF, and GAUSS exhibit increasing degrees of sharing. FFT dynamically
reallocates the data across threads and exhibits an extremely high degree of sharing. The Traveling
Salesman Problem (TSP) and Quicksort (QSORT) programs use the task queue model to
dynamically allocate work to different threads. The granularity for TSP was varied (TSP-C and
TSP-F access data at a coarse and fine grain, respectively). QSORT exhibits a high degree of
false sharing in the array to be sorted. Small to moderate problem sizes were chosen so that the
uniprocessor running times would be in the range of hundreds of seconds and the sixteen-processor
running times would be on the order of tens of seconds. The uniprocessor running times represent
sequential implementations of the programs, with all synchronization and communication removed.
 Experimental Methodology
For all three versions of each program, a sequential initialization routine is executed on the root
node. Then the appropriate number of additional nodes are created, which for the DSM versions
gives each node a copy of the nonshared data. The nonroot nodes initialize themselves and then
synchronize with the root node, by waiting at a barrier for the DSM versions and via an explicit
message in the message passing versions. For the DSM versions, after the user thread on the root
node has created the required worker threads on each node, it reads the clock to get the initial value
and then waits at the barrier, which causes the computation to begin. For the message passing
versions, the root thread waits until it has received the initialization-complete message from all
of the worker threads. It then reads the initial clock value and sends a message to each of the
workers to start computation. At this point, the workers read their inputs, via page faults for the
DSM versions or via request messages for the message passing versions. Once all of the workers
have completed, the root thread again reads its clock and calculates the total elapsed computation
time.
In addition to execution times, the Munin runtime system gathers statistics on the number of
faults, the amount of data transferred, and the amount of time stalled while performing various
consistency operations. The message passing kernel collects similar data. Selected portions of
these statistics are used throughout the analysis to highlight the reasons for observed performance
differences between the different versions of the programs.
 Overview of Results
The main results we report are the speedups of the various versions of the parallel programs over
the sequential version, measured for 1 to 16 processors. Figures  through  show the speedup
for each of the application programs as a function of the number of processors. Table  shows the
speedup achieved on sixteen processors for the three versions of each application. The percentages
in parentheses represent the percentage of message passing's speedup achieved by Munin, and
the percentage of both message passing's and Munin's speedup achieved by the conventional DSM
implementation. Tables  and  show the amount of communication required during execution of
the programs on sixteen processors, both in terms of number of messages and kilobytes of data
transmitted.
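For concreteness, the two derived quantities used in the results, speedup over the sequential version and one version's speedup expressed as a percentage of another's, can be computed as in the Python sketch below. The function names and the timings are hypothetical and are not measurements from the experiments.

```python
def speedup(sequential_time, parallel_time):
    """Speedup of a parallel run over the sequential implementation."""
    return sequential_time / parallel_time


def percent_of(value, reference):
    """`value` expressed as a percentage of `reference`."""
    return 100.0 * value / reference


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
t_sequential = 960.0
t_message_passing = 80.0
t_dsm = 96.0

s_mp = speedup(t_sequential, t_message_passing)
s_dsm = speedup(t_sequential, t_dsm)
print(s_mp, s_dsm, round(percent_of(s_dsm, s_mp), 1))
```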
For MULT DIFF TSPC and FFT the Munin versions achieved over 	 of the speedup of
their handcoded message passing equivalents while for TSPF QSORT and GAUSS the Munin

Program Problem size
MULT  by  square matrices
DIFF  by  square matrices
TSPC  cities	 recurse when   

TSPF  cities	 recurse when   
QSORT K items	 recurse when   
FFT 
K elements
GAUSS  by  square matrices
Table  Programs and Problem Sizes Used in Experiments
Message Munin Conventional
Passing DSM DSM
MULT     	 
DIFF  
   	 
TSPC 
   
 	 
TSPF     
	 
QSORT 
    
	 
FFT      	 
GAUSS     	 
Table   Speedups Achieved 
 processors
programs achieved between  and  For the programs with large grain sharing 
MULT and
TSPC the conventional versions achieved 		 and 	 respectively of the speedup of their
Munin counterparts For DIFF TSPF QSORT and GAUSS the performance of the conventional
versions was reduced to  of Munin For FFT there was so much false sharing that the
conventional version slowed down by a factor of ten when run on more than one processor
 Detailed Analysis
In this section, we analyze in detail, on a per-program basis, the reasons for the performance
differences between the various versions of each program. Unless otherwise noted, the numbers in
this section pertain to the 16-processor execution.
  Matrix Multiply
 Program Description
The problem is to multiply two N by N input arrays and put the result in an N by N output
array. Matrix Multiply is parallelized by giving each worker thread a number of contiguous rows
of the output array to compute. After each worker thread has terminated, the root thread reads
in the result array and terminates.

Table  Number of Messages for 16-Processor Execution. Columns: Message Passing, Munin,
Conventional. Rows: MULT, DIFF, TSP-C, TSP-F, QSORT, FFT, GAUSS.

Table  Amount of Data (in Kilobytes) for 16-Processor Execution. Columns: Message Passing, Munin,
Conventional. Rows: MULT, DIFF, TSP-C, TSP-F, QSORT, FFT, GAUSS.

The DSM versions use a barrier to signal completion; each worker thread in the message passing
version sends its result rows to the master when they have been computed. The Munin version
declares the input arrays as read only and the output array as write shared.
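The contiguous-row decomposition can be illustrated with a short sequential Python sketch. The helper names `row_ranges`, `multiply_rows`, and `multiply` are hypothetical, and the loop over ranges stands in for what would be separate worker threads in the parallel programs.

```python
def row_ranges(n, workers):
    """Split rows 0..n-1 into `workers` contiguous, nearly equal ranges."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def multiply_rows(a, b, out, rows):
    """Compute the output rows in the half-open range `rows` of a * b."""
    lo, hi = rows
    for i in range(lo, hi):
        for j in range(len(b[0])):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(len(b)))


def multiply(a, b, workers):
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for rows in row_ranges(len(a), workers):  # each range would go to one worker
        multiply_rows(a, b, out, rows)
    return out


print(row_ranges(7, 3))
```

Each worker writes only its own rows of the output, so the only sharing between workers in the output array is false sharing on pages that straddle two ranges.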
  Analysis
Matrix multiplication is almost completely compute-bound. As a result, the three versions achieved
almost identical speedups ( for conventional DSM,  for Munin, and  for message passing).
In all cases, the cumulative computation time is roughly  seconds, while the cumulative
communication time is roughly  seconds. Both the Munin and the conventional DSM versions
perform approximately twice as much communication as the message passing version, because the
DSM worker threads fault in the empty result array at the beginning of the computation, while the
message passing worker threads simply initialize their portion of the result array in place. Also,
in Munin, when a thread arrives at the final barrier, it updates any copies of a page in the result
matrix that are cached by neighboring nodes due to false sharing. This results in the Munin
version performing more communication than the conventional version. The Munin version still
outperforms the conventional version because the extra communication is largely overlapped with
computation, while the read misses experienced by the conventional version cause processors to
stall. Nevertheless, compared to the overall execution time, the time spent communicating is minor,
so both the conventional and Munin versions exhibit near-linear speedup.
Figure  Matrix Multiplication (MULT). Speedup versus number of processors (1 to 16) for Ideal,
Message Passing, Munin DSM, and Conventional DSM.
Figure  Finite Differencing (DIFF). Speedup versus number of processors (1 to 16) for Ideal,
Message Passing, Munin DSM, and Conventional DSM.
 Finite Differencing
  Program Description
During each iteration of the finite differencing algorithm, all elements of a matrix are updated to
the average of their nearest neighbors (above, below, left, and right). To avoid overwriting the old
value of a matrix element before it is used, an iteration is split into two half-iterations. In the
first half-iteration, the program uses a scratch array to compute the new values. In the second, it
copies the scratch array back to the main matrix.
Each thread is assigned a number of contiguous rows to compute. The algorithm requires only
those elements that lie directly along the boundary between two threads' subarrays to be
communicated at the end of each iteration. In the Munin version, the matrix is declared as write shared.
In the DSM versions, the programmer is not required to specify the data partitioning to the runtime
system; it is inferred at runtime based on the observed access pattern. After each half-iteration, the
DSM worker threads synchronize by waiting at a barrier. The message passing workers exchange
results directly between neighboring nodes after each iteration.
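The two half-iterations can be written down directly. The Python sketch below is sequential and makes one assumption not stated in the text: elements on the border of the matrix are held fixed and only interior elements are averaged. The function names are hypothetical.

```python
def compute_half(grid, scratch):
    """First half-iteration: write neighbor averages into the scratch array."""
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            scratch[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                             + grid[i][j - 1] + grid[i][j + 1]) / 4.0


def copy_half(grid, scratch):
    """Second half-iteration: copy the new values back into the main matrix."""
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            grid[i][j] = scratch[i][j]


def iterate(grid, steps):
    scratch = [row[:] for row in grid]
    for _ in range(steps):
        compute_half(grid, scratch)  # the DSM versions would wait at a barrier here
        copy_half(grid, scratch)     # and again here
    return grid


grid = [[8.0, 8.0, 8.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
print(iterate(grid, 1)[1][1])
```

With rows assigned contiguously, a thread reads another thread's data only in the rows immediately above and below its own block, which is why only boundary elements need to be communicated.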
   Analysis
DIFF has a much smaller computation-to-communication ratio than MULT (see Tables  and ),
but the Munin version still performs within  of the message passing version (a speedup of 
for Munin versus  for message passing). The reason for Munin's good performance is its use of
software release consistency and the write-shared protocol. Together, these techniques result in the
underlying communication patterns for the Munin version and the message passing version being
nearly identical. When each thread first accesses a page of shared data, it gets a copy of the page.
Thus, at the end of the first half-iteration, each node has a read-write copy of any pages for which
it has the only copy and a read-only copy of any pages that lie along a boundary. During the
second half-iteration, in which each thread copies the new values from the scratch array to the
shared array, each node creates a diff of its shared pages. When a thread arrives at the barrier after
this half-iteration, it sends the diff directly to the appropriate neighbors before sending the barrier
message to the barrier master. These diffs include all of the modified data on each boundary page,
and not just the edge elements. Since the shared pages are still shared even after they are purged,
they are write-protected again so subsequent writes will be detected. For subsequent iterations,
each node experiences a protection violation only on the boundary pages, and then performs only
local operations (creating twins), except when exchanging the results. Thus, the data motion
in the Munin version of DIFF is essentially identical to that of the message passing implementation:
communication only occurs at the end of each iteration, and only neighboring nodes exchange
results. The only overhead comes from fault handling and from copying, encoding, and decoding
the shared portions of the matrix. As an aside, a curious phenomenon can be seen in Table : the
Munin version of DIFF transmits less data than the message passing version. This is a result of the
fact that Munin only transmits the words that have been modified during each iteration, while the
message passing version ships the entire edge row. During the early iterations, many of the edge
values have not yet been modified, and thus Munin does not transmit any new values for them. In
practice, this extra transmitted data had a negligible effect on the running times. Rather, Munin's
good performance derived from the fact that it transmits data only during synchronization and
suffers no read misses (after the first iteration).
The conventional DSM version of DIFF achieved a speedup of only , compared to  for
Munin. The conventional version suffers from (1) frequent read faults and reloads as a result of the
invalidation protocol, and (2) blocking on write faults as a result of sequential consistency. The
Munin version of DIFF creates and transmits diffs at the end of each iteration, which results in
shared data being present before it is accessed during the next iteration. This eliminates read misses
and reloads on the next iteration. In contrast, the conventional DSM implementation invalidates
and reloads every shared page in its entirety on each iteration. In addition, write faults can be
handled completely locally in Munin if the data is already present, which is the case for all but
the first iteration: the local node simply makes a twin of the data. The conventional DSM
implementation sends an invalidation message and waits for a response. The tradeoff is that
synchronization under Munin is slowed down, because memory needs to be made consistent before
the synchronization operation can complete. However, the total time that the Munin worker threads
spend blocked while waiting for memory to be made consistent ( seconds) is far less than the
time spent invalidating and reloading the data in the conventional version (a total of  seconds).
The time spent invalidating and reloading seriously impacts execution time ( seconds of a total
execution time of  seconds).
 Traveling Salesman Problem
 Program Description
The Traveling Salesman Problem (TSP) takes as its input an array representing the distances
between cities on a salesman's route and computes the minimum-length tour passing through
each city exactly once. A tour queue maintains a number of partially evaluated tours. If the number
of nodes remaining to complete the tour is below a threshold ( for TSP-F and  for TSP-C),
the remainder of the tour is evaluated sequentially. If the number of nodes remaining is above this
threshold, the partial tour is expanded by one node and the new partial tours are entered on the
tour queue. When a partial tour is removed from the queue, a lower bound on the remaining part
of the tour is computed, and the tour is rejected if the sum of the current length and the lower
bound is higher than the current best tour. This check is also performed before a potential new
subtour is put on the task queue. The tour queue is a priority queue that orders the remaining
subtours in the inverse order of a lower bound of their total length. Thus, the most promising
subtours are evaluated first, which tends to prune uninteresting subtours more quickly. The major
shared data structures of TSP are the current shortest tour and its length, an array of structures
that represent partially evaluated tours, a priority queue that contains indices into the tour array
of partially evaluated tours, and a stack of indices of unused tour array entries. TSP-C and TSP-F
differ only in the problem granularity: TSP-C sequentially solves subtours of length  or less,
while TSP-F sequentially solves subtours of length  or less. Depending on the particular input
data set, the computation-to-communication ratio of TSP-C can be as much as ten times higher
than that of TSP-F.
In the DSM versions, locks protect the priority queue, the current shortest tour, and its length.
A condition variable is used to signal when there is work to be performed. Worker threads acquire
the lock and continue to remove partial tours from the queue until a promising tour has been
found that can be expanded sequentially, at which time the lock is released. In Munin, the priority
queue and the stack of unused tours are declared migratory, while the other shared data structures
are declared write shared. For the message passing version, the master maintains a central priority
queue that contains the indices of subtours to be solved. The slaves send request messages to the
master, which responds either with a subtour to be solved sequentially or an indication that there
is no more work. Workers tell the master when they find a new global minimum, and the master
is responsible for propagating it.
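The branch-and-bound structure described above can be sketched sequentially in Python. Several details below are assumptions rather than facts from the programs that were measured: the tour is taken to return to its starting city, the lower bound is simply the cheapest edge times the number of edges still to be added, and the names `tsp` and `finish_sequentially` are hypothetical. No locking is shown because the sketch is single-threaded.

```python
import heapq
from itertools import permutations


def finish_sequentially(dist, path, length, best):
    """Exhaustively complete a partial tour and return the best length seen."""
    remaining = [c for c in range(len(dist)) if c not in path]
    for order in permutations(remaining):
        total, last = length, path[-1]
        for city in order:
            total += dist[last][city]
            last = city
        total += dist[last][path[0]]  # assumption: the tour returns to its start
        best = min(best, total)
    return best


def tsp(dist, threshold):
    """Branch and bound over a priority queue of partial tours."""
    n = len(dist)
    min_edge = min(dist[i][j] for i in range(n) for j in range(n) if i != j)
    best = float("inf")
    # Queue entries: (lower bound on total length, length so far, partial tour).
    queue = [(n * min_edge, 0, [0])]
    while queue:
        bound, length, path = heapq.heappop(queue)  # most promising tour first
        if bound >= best:
            continue                                # cannot beat the best tour
        remaining = n - len(path)
        if remaining <= threshold:
            best = finish_sequentially(dist, path, length, best)
            continue
        for city in range(n):
            if city in path:
                continue
            new_length = length + dist[path[-1]][city]
            # Every edge still to be added costs at least min_edge.
            new_bound = new_length + remaining * min_edge
            if new_bound < best:                    # prune before enqueueing
                heapq.heappush(queue, (new_bound, new_length, path + [city]))
    return best


distances = [[0, 2, 9, 10],
             [1, 0, 6, 4],
             [15, 7, 0, 8],
             [6, 3, 12, 0]]
print(tsp(distances, threshold=2))
```

Raising `threshold` makes each queue access pay for more sequential work, which corresponds to the coarser granularity of TSP-C relative to TSP-F.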
  Analysis (Coarse Grain TSP)
The Munin version achieved a speedup of , within  of the  achieved by the message
passing version. TSP-C is rather compute-bound (under  seconds of communication for the
Munin version, compared to a total execution time of  seconds). The performance difference
between the message passing version and the Munin version comes from the cost of accessing the
priority queue. In Munin, each time a thread tries to remove a tour from the queue, the queue data
structure needs to be shipped to that thread. This behavior had two adverse effects on performance.
First, worker threads cumulatively spent  seconds waiting on the task queue lock. Second, the
Munin version shipped  megabytes of data, compared to only  kilobytes in the message passing
version.
Figure  Coarse-Grained Traveling Salesman Problem (TSP-C). Speedup versus number of processors
(1 to 16) for Ideal, Message Passing, Munin DSM, and Conventional DSM.
Figure  Fine-Grained Traveling Salesman Problem (TSP-F). Speedup versus number of processors
(1 to 16) for Ideal, Message Passing, Munin DSM, and Conventional DSM.
The dierence in performance between the Munin and conventional DSM versions of TSPC 
a
speedup of  for Munin versus  for conventional DSM stems from 
 the use of a migratory
protocol for the task queue and 
 the use of an update instead of an invalidate protocol for
the minimum tour length The slightly higher overhead caused by loading and invalidating rather
than simply migrating the task queue had the eect of causing more processors to idle themselves
waiting for work This was because access to the task queue was the primary bottleneck 
a total of
	 seconds for the conventional version versus only  in the Munin version The minimum tour
length is an example of a shared data item for which an update protocol is better than an invalidate
protocol because it is read much more frequently than it is written With the conventional protocol
running on N processors a thread that needs to update the minimum tour length typically sends
N   invalidations and then wait for N   acknowledgements All other threads in turn incur an
access miss and its associated latency to obtain a new copy of the minimum tour length
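A back-of-the-envelope message count makes the comparison concrete. In the Python sketch below, the invalidation and acknowledgement counts follow the description above, while two further points are simplifying assumptions: each subsequent access miss is charged two messages (a request and a reply), and an update is modeled as a single unacknowledged message to each other processor.

```python
def invalidate_messages(n, miss_cost=2):
    """Messages for one write followed by a read on every other processor."""
    invalidations = n - 1
    acknowledgements = n - 1
    reloads = (n - 1) * miss_cost  # assumption: request + reply per access miss
    return invalidations + acknowledgements + reloads


def update_messages(n):
    """Assumption: one unacknowledged update message per other processor."""
    return n - 1


print(invalidate_messages(16), update_messages(16))
```

Beyond the raw counts, the writer blocks for the acknowledgements and every reader stalls on its miss under invalidation, whereas updates deliver the new value before it is needed.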

 Analysis (Fine Grain TSP)
The Munin version of TSPF achieved a speedup of   less than the 	 speedup achieved by
the message passing version The reasons for the reduction in performance are the same as for TSP
C but their relative importance is increased In TSPF worker threads spent a cumulative 
seconds waiting for the priority queue and a total of  seconds performing useful computation
In addition 	 megabytes of data were transmitted in the Munin version compared to only 	
kilobytes for the message passing version Similar arguments apply for the conventional DSM
version resulting in a speedup of only 
 Quicksort
 Program Description
Quicksort (QSORT) is a recursive sorting algorithm that operates by repeatedly partitioning an
unsorted input list into unsorted sublists such that all of the elements in one of the sublists are
strictly greater than the elements of the other. The Quicksort algorithm is then recursively invoked
on the two unsorted sublists. The base case of the recursion occurs when the lists are sufficiently
small ( kilobyte in our case), at which time they are sorted sequentially.
Quicksort is parallelized using a work queue that contains descriptors of unsorted sublists,
from which worker threads continuously remove unsorted lists. In the DSM versions of QSORT,
the major data structures are the array to be sorted, a task queue that contains range indices of
unsorted subarrays, and a count of the number of worker threads blocked waiting for work. As in
TSP, the task queue is declared to be migratory, while the array being sorted is declared to be
write shared. A lock protects the queue, and a condition variable is used to signal the presence
of work to be performed. QSORT differs from TSP in that when QSORT releases control of the
task queue, it may need to further subdivide the work by partitioning the subarray and placing
the new subarrays back into the task queue. In contrast, TSP workers never relinquish control of
the task queue until they have removed a subtour that can be solved sequentially. Therefore, the
task queue in QSORT is accessed more frequently per unit of computation. Offsetting this is the
fact that the threads in TSP hold the lock protecting the priority queue for a longer time, as they
perform the expansion.
For the message passing version of QSORT, the master maintains the work queue. The slaves
send request messages to the master, which responds either with the sublist to be sorted sequentially
or an indication that there is no more work. Along with the requests, the slaves ship the sorted
results from their previous request, if any.
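The work-queue formulation of Quicksort can be sketched as below. This is a single-threaded Python model: the queue holds half-open index ranges, the threshold is expressed in elements rather than bytes, the Lomuto partition scheme is an arbitrary choice, and the names are hypothetical. In the parallel programs, each dequeue and enqueue would be performed under the task queue lock or via a message to the master.

```python
from collections import deque


def partition(a, lo, hi):
    """Lomuto partition of a[lo:hi]; returns the final index of the pivot."""
    pivot = a[hi - 1]
    i = lo
    for j in range(lo, hi - 1):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi - 1] = a[hi - 1], a[i]
    return i


def queue_quicksort(a, threshold):
    """Sort `a` in place by repeatedly taking index ranges from a work queue."""
    work = deque([(0, len(a))])
    while work:
        lo, hi = work.popleft()
        if hi - lo <= threshold:
            a[lo:hi] = sorted(a[lo:hi])  # small enough: sort sequentially
            continue
        p = partition(a, lo, hi)
        work.append((lo, p))             # both halves go back on the queue
        work.append((p + 1, hi))
    return a


print(queue_quicksort([5, 3, 8, 1, 9, 2, 7], threshold=2))
```

Every range larger than the threshold causes one dequeue and two enqueues, which is why the queue is touched so often per unit of computation.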
  Analysis
The Munin version of QSORT achieves only  of the speedup of the message passing version
( versus ). As with TSP-C and TSP-F, most of Munin's overhead comes from shipping
the work queue each time a node tries to perform a queue insertion or deletion. Compounding
this problem is the fact that the threads do not retain sole ownership of the work queue while
subdividing the work into pieces sufficiently small to solve directly, so they repeatedly need to
reacquire the task queue and partition their subarray until it contains at most  elements. As
a result, the threads spent a cumulative  seconds waiting on the task queue lock, out of a total
execution time of  seconds. Furthermore, the Munin version transmitted  megabytes of data,
compared to  kilobytes for the message passing implementation.
Figure  Quicksort (QSORT). Speedup versus number of processors (1 to 16) for Ideal,
Message Passing, Munin DSM, and Conventional DSM.
Figure  Fast Fourier Transform (FFT). Speedup versus number of processors (1 to 16) for Ideal,
Message Passing, Munin DSM, and Conventional DSM.
For the conventional DSM version, speedup drops to . In addition to the cost of invalidating
and reloading the task queue, rather than simply migrating it, the difference in performance between
the conventional DSM version and the Munin version is primarily due to the presence of false sharing
when two threads simultaneously attempt to sort subarrays that reside on the same page. As a
result, communication goes from  megabytes in about  messages for the Munin version to
 megabytes in  messages for the conventional version.
 Fast Fourier Transform
 Program Description
The Fast Fourier Transform (FFT) program used in the evaluation is based on the Cooley-Tukey
Radix-2 Decimation in Time algorithm. It recursively subdivides the problem into its even and
odd components until the input is of length 2. For this base case, the output is an elementary
function known as a Butterfly, a linear combination of its inputs. For an input array of size $N$, the
FFT algorithm requires $\log_2 N$ passes. On pass $K$, the width of each butterfly is $N / 2^{K-1}$. Thus,
for the first pass the width of the butterfly is $N$, and on each subsequent iteration the width of
each butterfly halves. By starting with the wide butterflies, the result array is a permutation of
the desired value, but this is rectified with an $O(N)$ cleanup phase.
If $P$ processors are used to solve an $N$-point FFT, where $P$ is a power of 2, then a reasonable
initial decomposition of the work allows processor $p$ to work with $x_p, x_{p+P}, x_{p+2P}, \ldots,
x_{p+N-P}$. This allows all processors to perform the first $\log_2 N - \log_2 P$ passes without any
interprocessor communication. Before executing the last $\log_2 P$ iterations, the processors exchange
data and reallocate themselves to different (contiguous) subarrays.
Both the DSM and message passing programs are parallelized by dynamically allocating threads
to data as described above. The array on which the FFT is being performed is declared to be
write shared in the Munin version. By carefully allocating processors to data as described above,
it is possible to reallocate the processors and exchange data only at the end of the first
$\log_2 N - \log_2 P$ phases. The DSM programs use a barrier to synchronize at this point. The DSM system
automatically reallocates the data on demand. The message passing version manually encodes and
shuffles the data, using a master process to collect and redistribute all of the changes. This manual
redistribution made the message passing version much harder to write than the DSM versions. The
processor reallocation is built into the algorithm itself.
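The claim that the early passes need no interprocessor communication under the strided decomposition can be checked with a short Python sketch. It assumes that a butterfly of width W pairs elements W/2 apart, so the first pass pairs x[i] with x[i + N/2]; the function names are hypothetical.

```python
def owner(index, processors):
    """Strided decomposition: element x[i] belongs to processor i mod P."""
    return index % processors


def local_passes(n, processors):
    """Count leading passes whose butterflies never cross processors."""
    passes = 0
    distance = n // 2  # partner distance in the first (widest) pass
    while distance >= 1:
        pairs = ((i, i + distance) for i in range(n) if (i // distance) % 2 == 0)
        if not all(owner(a, processors) == owner(b, processors) for a, b in pairs):
            break
        passes += 1
        distance //= 2
    return passes


print(local_passes(1024, 16))
```

Two partners at distance d share an owner exactly when P divides d, which holds for the distances N/2 down to P and fails from P/2 onward. For N = 1024 and P = 16 that gives 10 - 4 = 6 local passes.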
  Analysis
The FFT algorithm used has a very high degree of sharing, which results in it being bus-bandwidth
limited to a speedup of approximately ten on a twenty-processor single-bus multiprocessor like the
Sequent Symmetry. Because of the way that the data is distributed, every page is referenced (and
modified) by every thread during the first $\log_2 N - \log_2 P$ iterations, the worst possible behavior
for any DSM system. The conventional DSM version slows down by a factor of ten for two or more
processors, while the Munin version achieved a speedup of  on sixteen processors. The cause of
this dramatic difference in performance is Munin's ability to efficiently support multiple concurrent
writers to a shared page of data. The message passing version of FFT performed slightly better
(speedup of  on 16 processors) than the Munin version.
The conventional DSM implementation takes over  faults requires  gigabytes of
data to be shipped and  million messages to be transmitted and cumulatively spends over
 seconds waiting for requests to be satised While not devoid of overhead the Munin version
requires orders of magnitude less communication It only takes  faults and reloads a total of
 megabytes of data The primary source of overhead for the Munin program comes from sending
out the updates during the data exchange phase after the rst log

N  log

P phases At the
beginning of the update phase every processor is caching every page of shared data This causes
each processor to attempt to send updates for every page to every other processor which adds
two seconds of synchronization overhead Munins update timeout mechanism keeps the processors
from actually shipping most of the data to every node resulting in the Munin version shipping only
slightly more data than the message passing version
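Supporting multiple concurrent writers to one page depends on merging each writer's changes instead of shipping whole pages. The following is a minimal sketch of that idea, assuming each writer keeps an unmodified copy of the page (called a twin here) and sends only the words that differ. The helper names and the word-granularity encoding are illustrative assumptions, not Munin's actual data structures.

```python
def make_diff(twin, page):
    """Return {word index: new value} for every word changed relative to the twin."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}


def apply_diff(page, diff):
    """Merge a diff received from another writer into a local copy of the page."""
    for index, value in diff.items():
        page[index] = value


# Two writers modify different words of the same (hypothetical) 8-word page.
original = [0] * 8
writer_a = list(original)
writer_a[1] = 11
writer_b = list(original)
writer_b[6] = 66

diff_a = make_diff(original, writer_a)
diff_b = make_diff(original, writer_b)

# A third copy receives both diffs and ends up with both modifications.
merged = list(original)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
```

Because the two writers touch disjoint words, the order in which the diffs are applied does not matter in this example; writes to the same word would constitute a data race in the program.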

[Figure: Gaussian Elimination with Partial Pivoting (GAUSS). Speedup versus number of processors (1 to 16) for Ideal, Mesg Passing, Munin DSM, and Conv DSM.]
 Gaussian Elimination with Partial Pivoting
 Program Description
Gaussian Elimination (GAUSS) decomposes a square matrix into upper and lower triangular
submatrices by repeatedly eliminating the elements of the matrix under the diagonal, one column at
a time. The basic algorithm for an N by N matrix is shown in Figure . For each iteration of the
i-loop, the algorithm subtracts the appropriate multiple of the i-th row of the matrix from the rows
below it, so that the elements below the diagonal in the i-th column are zeroed. Partial pivoting
improves the numerical stability of the basic algorithm by interchanging the i-th row with the row
in the range i  N   containing the largest (in absolute value) element of the i-th column.
Algorithmically, this involves inserting a phase between the i and j loops that searches the i-th
column for the pivot element and swaps that row and the i-th row.
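As a sequential illustration of the description above, the sketch below adds a pivot search and row swap between the i and j loops of the basic elimination. It uses 0-based indices and a list-of-lists matrix; the function name is hypothetical and the code makes no attempt to reproduce the parallel decomposition used in the experiments.

```python
def gauss_partial_pivot(a):
    """Reduce the square matrix a (list of lists of floats) to upper triangular form in place."""
    n = len(a)
    for i in range(n):
        # Pivoting phase: find the row at or below i with the largest |a[r][i]|.
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0:
            raise ValueError("matrix is singular")
        a[i], a[pivot] = a[pivot], a[i]
        # Elimination phase: k runs downward so a[j][i] is overwritten last,
        # after it has served as the multiplier for the rest of row j.
        for j in range(i + 1, n):
            for k in range(n - 1, i - 1, -1):
                a[j][k] -= a[i][k] * a[j][i] / a[i][i]
    return a


matrix = gauss_partial_pivot([[2.0, 1.0, 1.0],
                              [4.0, 3.0, 3.0],
                              [8.0, 7.0, 9.0]])
```

After the call, every element below the diagonal is zero, and the product of the diagonal entries equals the determinant of the input up to the sign changes introduced by row swaps.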
We decomposed the computation by column so that the pivoting phase, which can be a
synchronization bottleneck, can be performed on a single processor. Each thread gets roughly
$\lfloor N/P \rfloor$ columns striped across the matrix, and any extra columns are spread evenly across
the worker threads. The computation itself involves N iterations, one per column, each iteration
consisting of a pivoting phase and a computation phase.
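One simple way to obtain such a striped assignment is to give thread t every P-th column starting at column t. The sketch below assumes this round-robin scheme, with 0-based thread and column numbers; the text does not state the exact mapping used in the experiments.

```python
def columns_for_thread(t, n, p):
    """Columns owned by thread t when n columns are striped across p threads."""
    return list(range(t, n, p))


# Example: 10 columns over 4 threads.
assignment = [columns_for_thread(t, 10, 4) for t in range(4)]
```

Each thread receives either $\lfloor N/P \rfloor$ columns or one more, so the extra columns are spread evenly, and because consecutive columns belong to different threads the work shrinks at a similar rate for all threads as elimination proceeds.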
The DSM versions are parallelized as follows. The shared data structures are the array on which
the elimination is being performed, a vector into which the pivot row is copied, and an integer that
contains the number of the pivot row, all of which are declared to be write-shared in the Munin
version. Each iteration starts with a barrier. After the barrier falls, the thread responsible for the
current column performs the necessary pivoting, sets a shared pivot row variable to indicate the row
that needs to be pivoted with the current one, and copies the current column to a shared variable
to be used by the other threads during the computation phase. A barrier is used to separate the
pivoting and computation phases. After the barrier is passed, each thread performs the actual
computation, which involves performing the local pivoting followed by the elimination step shown
in Figure .
    for i := 1 to N do
        for j := i + 1 to N do
            for k := N downto i do
                a[j,k] := a[j,k] - a[i,k] * a[j,i] / a[i,i]
Figure  Basic (w/o pivoting) Gaussian Elimination Algorithm
The messagepassing version works similarly except that the barrier is replaced by messages
from the slaves to the central master and the pivot column and pivot row number are explicitly
sent to the workers rather than faulted in asynchronously
  Analysis
The DSM versions of Gaussian Elimination require two barriers per iteration for synchronization.
The Munin version achieves a speedup of   of the message passing version's speedup of 
on sixteen processors. The reason for this reduced performance is that the relatively small amount
of work done per iteration, particularly during the latter stages of the algorithm when there are very
few nonzero elements left upon which to operate, accentuates the overhead imposed by both the
general-purpose barrier mechanism and the need to update shared data during synchronization.
On average, each thread spends over  seconds waiting for barriers, which includes the time spent
exchanging data.
The conventional DSM version of GAUSS achieves a speedup of  on sixteen processors, 
of the message passing version. In addition to the synchronization issues noted in the Munin
implementation, the conventional DSM implementation also suffers from frequent read misses caused
by accesses to invalidated data. While the Munin implementation experiences  read misses, the
conventional DSM implementation experiences . This is caused by the use of an invalidation-based
consistency protocol in the conventional DSM system. Since all of the modifications are made
to shared data that is being actively shared (and constantly used) on all sixteen processors, the
update-pruning advantage of an invalidation protocol is not relevant, while the increased number of
read misses is a significant problem. Each thread stalls for an average of  seconds for read misses
to be serviced. In addition, because the last thread to have its read miss satisfied must wait until
fourteen other threads have successfully acquired their data, the computations tend to complete at
noticeably different times. This causes the average time spent waiting at barriers to increase from
 to  seconds. These two phenomena explain the lower performance of the conventional DSM
implementation.
The performance times reported for the Munin version of all applications, including GAUSS,
were with the update timeout mechanism enabled. For GAUSS, disabling the update timeout
mechanism results in a slight performance advantage (a speedup of  instead of  on 
processors). This is because in GAUSS all of the modified data is accessed every iteration; thus
it is best to propagate the updates and not selectively invalidate. In this case the  millisecond
default update timeout time was too short to ensure that no updates were timed out. Enabling the
timeout mechanism thus resulted in unnecessary invalidations and subsequent reloads.
 Effect of Communication Reduction Techniques
In this section, we try to isolate the effects on performance of each of the techniques for reducing
communication that were described in Section . This isolation is made somewhat difficult because
of the synergistic effect on performance of using the techniques in conjunction with one another.
In particular, write-shared protocols cannot be used in the absence of release consistency or some
other mechanism to relax memory consistency. Therefore, we first compare Munin's buffered
write-update implementation of release consistency to a pipelined write-invalidate implementation of
release consistency. Then we compare the use of multiple protocols versus using a single protocol
(write-shared). Finally, we determine the value of the update timeout mechanism in connection with
the update protocol.

  Buffered Update versus Pipelined Invalidate Release Consistency
In Section , we described the motivation for using a buffered update protocol for implementing
release consistency in software, and the advantages of doing so over using a pipelined invalidate
protocol. To evaluate the performance impact of this decision, we implemented a pipelined
write-invalidate consistency protocol and compared it to the buffered update protocol that is in normal
use in Munin. In the pipelined write-invalidate protocol, a write fault causes ownership to be
transferred to the faulting processor. Then invalidations are sent out in separate messages. Multiple
invalidations can be outstanding concurrently, but no synchronization operation is allowed to
complete until all outstanding invalidations have been acknowledged. We compared the performance of
this implementation of release consistency with the Munin implementation using buffered update
and with the conventional DSM system. For MULT, TSP-C, TSP-F, and GAUSS, there is little
difference between the pipelined write-invalidate and buffered write-update implementations of release
consistency. For DIFF and QSORT, the buffered write-update scheme performs  better for 
processors, while for FFT it performs orders of magnitude better. For the latter three applications,
the pipelined write-invalidate protocol performs slightly better than a conventional write-invalidate
protocol. Figures  and  depict these results for DIFF and FFT. The performance of QSORT
is similar to that of DIFF.
These results demonstrate that, while the pipelined write-invalidate protocol offers some
performance gain over a conventional sequentially consistent write-invalidate protocol, in a software
DSM system a buffered write-update protocol outperforms both. Pipelining invalidations allows
useful computation to be overlapped with invalidations, which reduces the cost of writes. However,
it does not reduce the penalty associated with read misses, which are very expensive in a software
DSM system. Furthermore, the pipelined-invalidate protocol suffers from false sharing, much in
the same way that a conventional DSM system does. When read misses dominate, or when there
is substantial false sharing, Munin's buffered update implementation is superior.
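The read-miss argument can be made concrete with a toy model. The sketch below assumes one writer that modifies a single page once per iteration and a set of readers that each read the page once per iteration; it counts only read misses and ignores message costs, timing, and false sharing. The function and protocol names are illustrative.

```python
def count_read_misses(protocol, readers, iterations):
    """Count reader misses on one page that a single writer modifies every iteration."""
    valid = [False] * readers  # every reader starts without a copy
    misses = 0
    for _ in range(iterations):
        if protocol == "invalidate":
            # The write (pipelined or not) invalidates every remote copy.
            valid = [False] * readers
        elif protocol == "update":
            # Remote copies are refreshed in place at the release and stay valid.
            pass
        else:
            raise ValueError("unknown protocol: " + protocol)
        for r in range(readers):
            if not valid[r]:
                misses += 1      # reader stalls while the page is fetched
                valid[r] = True
    return misses


invalidate_misses = count_read_misses("invalidate", readers=15, iterations=10)
update_misses = count_read_misses("update", readers=15, iterations=10)
```

In this model the invalidate protocol incurs one miss per reader per iteration, whereas the update protocol incurs only the initial cold miss per reader; pipelining the invalidations changes when the writer may proceed, not how many misses the readers take.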

 Multiple Consistency Protocols
To evaluate the importance of Munin's support for multiple consistency protocols, we compared
the performance of two versions of Munin: (i) a version in which multiple consistency protocols
were used, and (ii) a version that labeled all shared data as write-shared, thus employing Munin's
most versatile protocol. Figure  presents the results of this experiment for TSP-F; similar results
were obtained for the other multi-protocol test programs (TSP-C and QSORT). For TSP-F, using
multiple protocols leads to a  improvement in speedup for  processors. The reason is that
the multiple-protocol version of the program declares the task queue to be migratory, resulting in
the advantages described in Section . Although a  improvement in performance is modest,
the cost associated with implementing multiple protocols in a software DSM system is essentially
zero.

[Figure: Buffered WURC versus Pipelined WIRC (DIFF). Speedup versus number of processors (1 to 16) for Ideal, Pipelined Invs., Buffered Updates, and Conv DSM.]
[Figure: Buffered WURC versus Pipelined WIRC (FFT). Speedup versus number of processors (1 to 16) for Ideal, Pipelined Invs., Buffered Updates, and Conv. DSM.]

 Update Timeout Mechanism
To test the value of the timeout mechanism in connection with the update protocol, we compared
the performance of versions with and without the timeout enabled. For MULT, DIFF, and TSP-C,
there is no difference. For TSP-F and QSORT, the version with the timeout enabled is  and
 faster for  processors, respectively. The difference is the largest for FFT: speedup with
 processors drops from  to  when the timeout is disabled (see Figure ). Finally, for
GAUSS, the timeout causes a  drop-off in performance for  processors.
In terms of the underlying DSM operation, without the timeout mechanism the  -processor
FFT sends  messages and  megabytes of data, while with the timeout mechanism
enabled the  -processor FFT sends only  messages and  megabytes of data. The reason
that the amount of data shipped does not drop as dramatically as the number of messages is that,
after a page of data has been speculatively invalidated, future accesses require an  kilobyte page
to be transferred rather than just a diff.
[Figure: Multi-protocol versus All Write-Shared (TSP-F). Speedup versus number of processors (1 to 16) for Ideal, Multiple Protocols, and All Write-Shared.]
[Figure: Effect of Update Timeout Mechanism on FFT. Speedup versus number of processors (1 to 16) for Ideal, With Timeouts, No Timeouts, and Conv. DSM.]
The other two programs in which each processor's working set changes dynamically over the
course of the program execution, TSP and QSORT, are also aided by the use of the timeout
mechanism. For TSP, each page of the shared tour array tends to be used by many different
processors over time, but each processor only uses it for a very short period of time and only a
few processors use a particular page at a time. Without the timeout mechanism, eventually almost
every processor receives updates for almost every page. The shared sort array in QSORT exhibits
a similar phenomenon.
With GAUSS, all of the modified data are accessed every iteration. The slight drop-off in
performance for GAUSS is caused by the fact that the default update timeout time of  milliseconds
is too short to ensure that no valid updates are timed out.
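The trade-off discussed in this section can be sketched as follows. The sketch assumes that each cached copy records when it was last accessed and that an incoming update for a copy idle longer than the timeout is turned into an invalidation. The class, the function, and the 100 ms value used in the example are hypothetical; the example value is arbitrary and is not Munin's default.

```python
class CachedPage:
    """A local copy of a shared page with a last-access timestamp (milliseconds)."""

    def __init__(self, words, now_ms):
        self.words = list(words)
        self.valid = True
        self.last_access_ms = now_ms

    def read(self, index, now_ms):
        if not self.valid:
            raise LookupError("page invalidated; a full page reload is needed")
        self.last_access_ms = now_ms
        return self.words[index]


def receive_update(page, diff, now_ms, timeout_ms):
    """Apply an incoming diff, or invalidate the copy if it has gone unused too long."""
    if not page.valid:
        return "ignored"
    if now_ms - page.last_access_ms > timeout_ms:
        page.valid = False   # speculative invalidation: stop receiving updates
        page.words = None
        return "invalidated"
    for index, value in diff.items():
        page.words[index] = value
    return "applied"


page = CachedPage([0, 0, 0, 0], now_ms=0)
first = receive_update(page, {1: 5}, now_ms=10, timeout_ms=100)
value = page.read(1, now_ms=20)
second = receive_update(page, {2: 7}, now_ms=500, timeout_ms=100)
```

A copy that is touched every iteration, as in GAUSS, is hurt if the timeout is shorter than the interval between accesses, because a later read must reload the whole page; a copy that is no longer part of the working set, as in FFT after the data exchange, stops costing update messages.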

 Function Shipping
For TSPF and QSORT the two programs that use the task queue model of parallelism and that
have a signicant amount of sharing the Munin sixteen processor versions achieves speedups of
only  and 	 respectively compared to 	 and  for the message passing versions The
conventional DSM versions performed even worse achieving speedups of  and  respectively
As shown in Table  the major source of overhead for these DSM versions 
with the exception of
the conventional version of QSORT is the amount of time spent waiting on the lock protecting the
work queues For the conventional version of QSORT false sharing within the array being sorted
is the dominant source of overhead
These lock waiting times are large because the DSM versions must ship the work queue a
sizable data structure to the acquiring thread before that thread can perform any operation on the
work queue In comparison the actual time spent performing operations on the work queue is very
small The message passing versions do not suer from this phenomenon since the work queue is
kept at the root node and worker threads perform remote procedure calls 
RPCs containing only
a small amount of data to the root node in order to operate on the queue
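The difference in bytes moved can be illustrated with a small sketch. It assumes a queue of integer task identifiers held at a root node and uses the size of a pickled object as a stand-in for message size; the class and function names are hypothetical, and the model ignores lock traffic and message headers.

```python
import pickle
from collections import deque


class RootNode:
    """Holds the task queue; with function shipping, operations execute here."""

    def __init__(self, tasks):
        self.queue = deque(tasks)

    def handle_dequeue(self):
        return self.queue.popleft() if self.queue else None


def data_shipping_bytes(root):
    """Bytes moved when the whole queue is shipped to the thread acquiring the lock."""
    return len(pickle.dumps(root.queue))


def function_shipping_bytes(root):
    """Bytes moved when only a small request and its reply cross the network."""
    request = pickle.dumps("dequeue")
    reply = pickle.dumps(root.handle_dequeue())
    return len(request) + len(reply)


root = RootNode(range(1000))
shipped_queue = data_shipping_bytes(root)
shipped_call = function_shipping_bytes(root)
```

The cost of data shipping grows with the length of the queue, while the cost of the remote call depends only on the size of one request and one task.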
In order to evaluate the feasibility and potential value of using a mixed data-shipping and
function-shipping mechanism in a DSM system, we modified the DSM versions of TSP-F and
QSORT such that the task queue remains attached to the root node and all access to the task queue
by other nodes is performed using RPC. These modifications were done in an ad hoc manner, but
research is ongoing to extend Munin to support both DSM and function shipping in an integrated
fashion. The results of function-shipping access to the task queue for TSP-F and QSORT are
shown in Figures  and . These figures show the speedups achieved by Munin and conventional
DSM, both with and without function shipping for the task queue.
For TSPF function shipping causes both DSM versions to perform almost as well as the
message passing version 
on  processors a speedup of 	 for conventional DSM 	 for Munin
and  for message passing In contrast without function shipping Munin achieves a speedup
of only  and the conventional DSM a speedup of only  For the Munin version without
function shipping communication is substantially more 
		 messages and 		 kilobytes of data
than the Munin version with function shipping 
 messages and  kilobytes of data Perhaps
more importantly the reduced communication of the function shipping version nearly eliminates
the time that threads are idle waiting for access to the task queue
For QSORT improvements are similar to those in TSPF for the Munin version but no im
provement is achieved for the conventional DSM version The addition of functionshipping for the
task queue raises the processor speedup for Munin from 	 to 	 compared to  for the
message passing version The conventional DSM version both with and without function shipping
Program Average lock waiting Execution time
time 
per processor 
per processor

seconds 
seconds
Munin TSPF 	 
Conventional TSPF  
Munin QSORT  
Conventional QSORT  
Table  Lock waiting times for TSPF and QSORT

Ideal
Mesg Passing
Munin DSM
Conv DSM
Munin w/ RPC
Conv w/ RPC
Number of Processors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sp
ee
du
p
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
Figure  Eect of Function Shipping on Finegrained TSP
Ideal
Mesg Passing
Munin DSM
Conv DSM
Munin w/ RPC
Conv w/ RPC
Number of Processors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sp
ee
du
p
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
Figure  Eect of Function Shipping on Quicksort
for the task queue achieves only a speedup of  As explained in Section  false sharing is
the primary obstacle to good performance for the conventional version While the average time
waiting for locks is reduced from  seconds to below  second the average time a process waits
for fresh copies of data increases from  to  seconds so the addition of function shipping has
no benecial eects
These experiments show that the addition of function shipping for accessing some shared data
can significantly improve the performance of some programs. In addition, the QSORT experiment
further illustrates the value of Munin's write-shared protocol for dealing with false sharing.

 Related Work
This section compares our work with a number of existing software and hardware DSM systems,
focusing on the mechanisms used by these other systems to reduce the amount of communication
necessary to provide shared memory. We limit our discussion to those systems that are most related
to the work presented in this paper.
  Software DSMs
Ivy was the first software DSM system . It uses a single-writer, write-invalidate protocol for
all data, with virtual memory pages as the units of consistency. This protocol is used as the
baseline conventional protocol in our experiments. The large size of the consistency unit and the
single-writer protocol make the system prone to large amounts of communication due to false
sharing. It is up to the programmer or the compiler to lay out the program data structures in the
shared address space such that false sharing is reduced. The directory management scheme in our
implementation is largely borrowed from Ivy's dynamic distributed manager scheme.
Both Clouds  and Mirage  allow part of shared memory to be locked down at a particular
processor. In Clouds, the programmer can request that a segment of shared memory be locked on a
processor. In Mirage, a page remains at a processor for a certain Δ time window after it is modified
by that processor. In both cases, the goal is to avoid extensive communication due to false sharing.
The combination of software release consistency and write-shared protocols addresses the adverse
effects of false sharing without introducing the delays caused by locking parts of shared memory to
a processor.
Mether  supports a number of special shared memory segments in fixed locations in the
virtual address space of each machine in the system. In an attempt to support efficient
memory-based spinlocks, Mether supports several different shared memory segments, each with different
protocol characteristics. Two segments are for small objects (up to  bytes), while two are for
large objects (up to  bytes). One of each pair is demand-driven, which means that the
memory is shipped when it is read, as in a conventional DSM. The other is data-driven, which
means that it is shipped when it is written. A thread that attempts to read the data will block until
the next thread writes it. This latter form of data can support spinlocks and message passing fairly
effectively. Our support for multiple protocols is more general without added cost, and Munin's
separate synchronization package removes the need to support data-driven memory.
Lazy release consistency, as used in TreadMarks , is an algorithm for implementing release
consistency different from the one presented in this paper. Instead of updating every cached copy
of a data item whenever the modifying thread performs a release operation, only the cached copies
on the processor that next acquires the released lock are updated. Lazy release consistency reduces
the number of messages required to maintain consistency, but the implementation is more expensive
in terms of protocol and memory overhead .
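The difference in message counts between the two approaches can be sketched with a simple model. It assumes one update message per destination, a fixed set of processors caching the data, and a sequence of lock hand-offs given as (releaser, next acquirer) pairs; acknowledgements and the additional bookkeeping that lazy release consistency requires are not modeled. The function names are illustrative.

```python
def eager_update_messages(handoffs, cachers):
    """Updates sent when every release refreshes all other cached copies."""
    total = 0
    for releaser, _next_acquirer in handoffs:
        total += len(cachers - {releaser})
    return total


def lazy_update_messages(handoffs):
    """Updates sent when only the next acquirer of the lock is brought up to date."""
    return sum(1 for releaser, next_acquirer in handoffs
               if next_acquirer != releaser)


cachers = set(range(16))
handoffs = [(0, 1), (1, 2), (2, 3), (3, 0)]
eager = eager_update_messages(handoffs, cachers)
lazy = lazy_update_messages(handoffs)
```

In this model the eager count grows with the number of cached copies, while the lazy count grows only with the number of lock hand-offs.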
A variety of systems have sought to present an object-oriented interface to shared memory. We
describe Orca  as an example of this approach. In general, the object-oriented nature allows
the compiler and the runtime system to carry out a number of powerful optimizations, but the
programs have to be written in the particular object model supported.
The Orca language requires that (i) all access to objects is through well-defined per-object
operations, (ii) only one operation on an object can be performed at a time, and (iii) there are no
global variables or pointers. This programming model allows the compiler to detect all accesses
to an object directly without the use of page faults. Programmers must, however, structure their
programs so that objects are accessed in a way that does not limit performance. For example,
an Orca implementation of DIFF requires that the edge elements be specified as shared buffers;
the entire array should not be declared as a single object. However, once a program has been
structured appropriately, Orca can transparently choose whether to replicate an object or force
all accesses to be made via RPCs to a master node. If it chooses to replicate an object, it can
support both invalidate and update consistency protocols. It remains to be seen how well Orca's
optimizations can be integrated into a less restrictive language. On an orthogonal issue, Orca's
consistency management uses an efficient, reliable, ordered broadcast protocol. For reasons of
scalability, Munin does not rely on broadcast, although support for efficient multicast could improve
the performance of some aspects of Munin.
Midway  proposes a DSM system with entry consistency, a memory consistency model weaker
than release consistency. The goal of Midway is to minimize communication costs by aggressively
exploiting the relationship between shared variables and the synchronization objects that protect
them. Entry consistency only guarantees the consistency of a data item when the lock associated
with it is acquired. To exploit the power of entry consistency, the programmer must associate each
individual unit of shared data with a single lock. For some programs, making this association is
easy. However, for programs that use nested data structures or arrays, it is not clear if making a
one-to-one association is feasible without forcing programmers to completely rewrite their programs.
For example, the programmer of an entry-consistent DIFF program would have to hand-decompose
the shared array to exploit the power of entry consistency. The designers of Midway recognized
this problem and proposed to give programmers the ability to increase and decrease the strength of
the consistency model supported. Thus, programs for which the data-synchronization association
required by entry consistency is convenient can exploit its flexibility, while programs for which this
association is inconvenient can use either release consistency (when adequate synchronization is
performed) or sequential consistency. Unlike Munin, Midway exploits the power of a sophisticated
compiler. The Midway compiler inserts code around data accesses so that the Midway runtime
system can determine whether a particular shared variable is present before it is accessed. Thus,
Midway is able to detect access violations without taking page faults, which eliminates the time
spent handling interrupts.
 Hardware DSMs
Several designs for hardware distributed shared memory systems have been published recently, of
which DASH , GalacticaNet , and APRIL  are representative.
We have adopted from the DASH project  the concept of release consistency. The differences
between DASH's implementation of release consistency and Munin's implementation of release
consistency were explained in detail in Section . DASH uses a write-invalidate protocol for
all consistency maintenance. We instead use the flexibility of Munin's software implementation to also
attack the problem of read misses by using update protocols and migration when appropriate. The
GalacticaNet system  also demonstrated that support for an update-based protocol that exploits
the flexibility of a relaxed consistency protocol can improve performance by reducing the number
of read misses and attendant processor stalls. The GalacticaNet design includes a provision to time
out updates to stale data, which is shown to have a significant effect on performance when there is
a large number of processors.
The APRIL machine addresses the problem of high latencies in distributed shared memory
multiprocessors in a different way . APRIL provides sequential consistency, but relies on
extremely fast processor switching to overlap memory latency with computation. For APRIL to be
successful at reducing the impact of read misses, there must be several threads ready to run on
each processor. Because APRIL performs many low-level consistency operations in very fast
trap-handling software, it would be possible to adapt several of our techniques to its hardware cache
consistency mechanism.
 Conclusions and Directions for Further Work
Software distributed shared memory (DSM) systems provide a shared memory abstraction on
hardware with physically distributed memory. This approach is appealing because it combines the
desirable features of distributed and shared memory machines: distributed memory machines are easier
to build, but shared memory provides a more convenient programming model. It has, however,
proven difficult to achieve performance on DSM systems that is comparable to what can be achieved
with hand-coded message passing programs. In particular, conventional DSM implementations have
suffered from excessive amounts of communication engendered by sequential consistency and false
sharing.
In this paper, we have presented and evaluated a number of techniques to reduce the amount of
communication necessary to maintain consistency. In particular, we replaced sequential consistency
by release consistency as our choice of consistency model. We developed a buffered, update-based
implementation of release consistency suitable for software systems. The update protocol has a
timeout feature, preventing large numbers of unnecessary updates to copies of pages that are no
longer in use. Furthermore, we allow the use of multiple protocols to maintain consistency. Of
particular interest among these protocols is the write-shared protocol, which allows several processes
to write to a page concurrently, with the individual modifications merged at a later point according
to the requirements of release consistency.
We have implemented these techniques in the Munin DSM system. The resulting system runs
on a network of workstations and provides an interface that is very close to a conventional shared
memory programming system. For programs that are free of data races, release-consistent memory
produces the same results as sequentially consistent memory. All synchronization operations must
be performed through system-supplied primitives, and shared variables may optionally be
annotated with the desired consistency protocol. For the applications that we have looked at, these
requirements proved to be a very minor burden.
The use of these techniques has substantially broadened the class of applications for which DSM
on a network of workstations is a viable vehicle for parallel programming. For very coarse-grained
applications, conventional DSM performs satisfactorily. However, as the granularity of parallelism
decreases, conventional DSM performance falls behind, while Munin's performance continues to
track that of hand-coded message passing. The addition of a function shipping ability further
improves the performance of DSM.
Hardware technology has improved dramatically since the experiments reported here were
performed, and there are no signs that the current rate of performance improvement will abate soon.
In particular, both processor and network speeds have improved by a factor of fifteen to twenty
in the past four years. Interprocessor communication is still a high-latency operation, but there
are indications that latencies can be improved by an order of magnitude through careful protocol
implementation . At the same time, DRAM latencies are improving very slowly, so some form
of cache will be present on essentially all future high-performance platforms. Finally, hardware
DSM systems are becoming more common. An important issue to address is the applicability of
the techniques introduced in this paper to future DSM systems, both hardware and software.
We believe that there are two basic requirements that DSM systems, hardware or software, must
satisfy to provide acceptably high performance: both the latency and the frequency of
processor-stalling DSM operations (e.g., cache misses or synchronization events) must be kept low. It appears
that, despite improvements in networking and operating system designs, the latency of remote
operations will slowly increase compared to processor cycle times. However, because memory
speeds are not increasing very rapidly, the ratio of remote memory access to local memory access
(not satisfied by the cache) will decrease. This observation would seem to indicate that a simple
implementation of DSM that ships entire pages (or cache lines) on demand and uses invalidation to
maintain consistency would suffice as processor and network technology improves. We believe that
this will not be the case because of our second requirement for efficient DSM: a low frequency of
processor-stalling DSM operations. As processor cycle times continue to decrease dramatically, it is
becoming increasingly important to avoid stalling the processor. As described in Section , using
a conventional invalidation-based consistency protocol can increase the number of high-latency read
misses dramatically. Also, as the sizes of memories and caches increase, page and cache line sizes are
also increasing, which indicates that false sharing will become an increasingly important problem.
These observations indicate that some form of update protocol that supports multiple concurrent
writers, such as Munin's write-shared protocol, will be useful in future DSM systems.
Our current DSM work focuses on techniques required to implement DSM on current
high-performance platforms, with faster processors and networks than the ones used for the experiments
in this paper. In particular, we are studying a more aggressive implementation of release consistency
(lazy release consistency) and compiler techniques to further optimize performance. We are also
studying the value of the techniques described here in the context of hardware-supported distributed
shared memory multiprocessors.
References
 A Agarwal BH Lim D Kranz and J Kubiatowicz APRIL A processor architecture for
multiprocessing In Proceedings of the th Annual International Symposium on Computer
Architecture pages  May 		
 J Archibald and JL Baer Cache coherence protocols Evaluation using a multiprocessor
simulation model ACM Transactions on Computer Systems 
	 November 	
 HE Bal MF Kaashoek and AS Tanenbaum Orca A language for parallel programming
of distributed systems IEEE Transactions on Software Engineering pages 	 March
		
 JK Bennett JB Carter and W Zwaenepoel Adaptive software cache management for
distributed shared memory architectures In Proceedings of the th Annual International
Symposium on Computer Architecture pages  May 		
 BN Bershad MJ Zekauskas and WA Sawdon The Midway distributed shared memory
system In COMPCON  pages  February 		
 JB Carter E	cient Distributed Shared Memory Based On MultiProtocol Release Consis
tency PhD thesis Rice University August 		
 JB Carter JK Bennett and W Zwaenepoel Implementation and performance of Munin
In Proceedings of the th ACM Symposium on Operating Systems Principles pages 
October 		
 JS Chase FG Amador ED Lazowska HM Levy and RJ Littleeld The Amber sys
tem Parallel programming on a network of multiprocessors In Proceedings of the 
th ACM
Symposium on Operating Systems Principles pages  December 		

	 DR Cheriton The V distributed system Communications of the ACM 
 March
	
 P Dasgupta RC Chen S Menon M Pearson R Ananthanarayanan U Ramachandran
M Ahamad R LeBlanc Jr W Applebe JM BernabeuAuban PW Hutto MYA Khalidi
and CJ Wileknloh The design and implementation of the Clouds distributed operating
system Computing Systems Journal  Winter 		
 SJ Eggers and RH Katz A characterization of sharing in parallel programs and its ap
plication to coherency protocol evaluation In Proceedings of the th Annual International
Symposium on Computer Architecture pages  May 	
 B Fleisch and G Popek Mirage A coherent distributed shared memory design In Proceedings
of the 
th ACM Symposium on Operating Systems Principles pages  December 		
 K Gharachorloo A Gupta and J Hennessy Performance evaluations of memory consis
tency models for sharedmemory multiprocessors In Proceedings of the th Symposium on
Architectural Support for Programming Languages and Operating Systems April 		
 K Gharachorloo D Lenoski J Laudon P Gibbons A Gupta and J Hennessy Memory
consistency and event ordering in scalable sharedmemory multiprocessors In Proceedings of
the th Annual International Symposium on Computer Architecture pages  Seattle
Washington May 		
 E Jul H Levy N Hutchinson and A Black Finegrained mobility in the Emerald system
ACM Transactions on Computer Systems 
	 February 	
 AR Karlin MS Manasse L Rudolph and DD Sleator Competitive snoopy caching In
Proceedings of the th Annual IEEE Symposium on the Foundations of Computer Science
pages  	
 P Keleher A L Cox and W Zwaenepoel Lazy consistency for software distributed shared
memory In Proceedings of the th Annual International Symposium on Computer Architec
ture pages  May 		
 P Keleher S Dwarkadas A Cox and W Zwaenepoel Treadmarks Distributed shared
memory on standard workstations and operating systems In Proceedings of the  Winter
Usenix Conference pages  January 		
	 L Lamport How to make a multiprocessor computer that correctly executes multiprocess
programs IEEE Transactions on Computers C
			 September 		
 D Lenoski J Laudon K Gharachorloo A Gupta and J Hennessy The directorybased
cache coherence protocol for the DASH multiprocessor In Proceedings of the th Annual
International Symposium on Computer Architecture pages 	 May 		
 K Li and P Hudak Memory coherence in shared virtual memory systems ACM Transactions
on Computer Systems 
	 November 		
 RG Minnich and DJ Farber The Mether system A distributed shared memory for SunOS
 In Proceedings of the Summer  USENIX Conference pages  June 		

 AC Thekkath and H Levy Limits to lowlatency communications on highspeed networks
acm Transactions on Computer Systems 
	 May 		
 WD Weber and A Gupta Analysis of cache invalidation patterns in multiprocessors In
Proceedings of the rd Symposium on Architectural Support for Programming Languages and
Operating Systems pages  April 		
 A Wilson and R LaRowe Hiding shared memory reference latency on the GalacticaNet
distributed shared memory architecture Journal of Parallel and Distributed Computing

 August 		

